Unless you’ve been completely disconnected from the world recently (or even for the past few years), you’re likely aware of the rapid advancements in AI.
The LangChain framework in particular is creating a lot of excitement. It is a Python framework for developing applications powered by Large Language Models (LLMs) such as GPT-4, with integrations for OpenAI, HuggingFace, SageMaker, and more.
It attempts nothing less than creating a new programming paradigm, one that directly manipulates meaning and knowledge through language, embeddings, and LLMs. Among other things, it makes it incredibly simple to add a chatbot interface to a knowledge base, create smart bots/agents à la AutoGPT, and generally smooth out the experience of working with LLMs.
However, not every application interacts directly with embeddings; many still need users to provide their preferences or other inputs in a specific, structured format.
In this tutorial, we’ll demonstrate a practical use case of processing natural language user input and transforming it into a structured format usable by our Python application.
We’ll do so with the help of Pydantic, which lets you define types, or “models”, in Python, giving you type hints and validation enforced at runtime. I find it particularly useful for serializing and deserializing data.
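As a quick illustration (the Song model below is just a toy example, unrelated to the model we’ll build later), Pydantic validates data at runtime and serializes it back to JSON:

from pydantic import BaseModel, ValidationError

# Toy model, for illustration only
class Song(BaseModel):
    title: str
    year: int

# Valid data is validated (and coerced) at runtime, then serialized back to JSON
song = Song.parse_obj({"title": "Like a Prayer", "year": "1989"})
print(song.json())  # {"title": "Like a Prayer", "year": 1989}

# Invalid data raises a ValidationError we could surface to the user
try:
    Song.parse_obj({"title": "Like a Prayer", "year": "not a year"})
except ValidationError as e:
    print(e)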
We’ll use the context of a music app, where understanding a user’s music preferences is essential for making relevant recommendations. Here’s what we want to achieve:
# User input
natural_language_input = "I love pop music from the 80s, especially Madonna"
# Expected output after parsing:
{
    "genres": ["pop"],
    "bands": ["Madonna"],
    "albums": null,
    "year_range": [1980, 1989]
}
Setting up the Environment
Before we start, make sure to install LangChain, Pydantic, and the other dependencies we’ll need:
pip3 install langchain python-dotenv openai pydantic
You’ll also need your OpenAI API key. You can obtain it from your account on the OpenAI platform. Then, create a .env file in your project directory and add your API key:
OPENAI_API_KEY="<your api key>"
Create a main.py script file and load the environment variables:
from dotenv import load_dotenv
load_dotenv()
Now let’s initialize our LLM.
In this example, we’ll be using GPT-3 as it works well enough for our needs, but you could use newer GPT versions or HuggingFace models.
from langchain.llms import OpenAI
# ...
model_name = 'text-davinci-003'
llm = OpenAI(model_name=model_name, temperature=0)
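If you’d rather use a chat model, LangChain exposes a similar wrapper. Here’s a minimal sketch, assuming you have access to gpt-3.5-turbo; the rest of this tutorial sticks with text-davinci-003:

from langchain.chat_models import ChatOpenAI

# Sketch: a chat model alternative (not used in the rest of this tutorial)
chat_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)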
How it works
In essence, our task involves constructing a prompt that includes the user’s input, instructions about the required format for the output, and some examples to guide the LLM toward generating a valid output. This output will then be parsed with Pydantic.
What’s great about LangChain is that it alleviates the burden of creating this entire prompt manually.
Its integration with Pydantic significantly simplifies generating the part of the prompt that specifies the output requirements, and LangChain also helps manage the example data. It can even be configured to select the most relevant examples based on the current user input (we’ll sketch this once we’ve built the prompt below)!
Moreover, LangChain makes it easy to switch between different LLMs.
Defining the Pydantic model
We’ll be using a Pydantic model to describe the structure of the data we want to parse. In this case, we want to capture information about a user’s music taste, including the genres, bands, and albums they like. Here’s how to define such a model:
from pydantic import BaseModel, Field, conlist
from typing import List, Optional, Tuple
# ...
class MusicTasteDescriptionResult(BaseModel):
    genres: Optional[conlist(str, min_items=1, max_items=5)] = Field(description="Music genres liked by the user. Must contain between 1 and 5 genres")
    bands: Optional[conlist(str, min_items=1, max_items=5)] = Field(description="Specific bands or artists liked by the user. If provided, must contain between 1 and 5 bands or artists")
    albums: Optional[conlist(str, min_items=1, max_items=5)] = Field(description="Specific albums liked by the user. If provided, must contain between 1 and 5 albums")
    year_range: Optional[conlist(int, min_items=2, max_items=2)] = Field(description="Year range of music liked by the user. If provided, must contain exactly 2 years indicating the start and end of the range")
This class defines a Pydantic model with four fields: genres, bands, albums, and year_range. Every field is optional; the first three can contain between 1 and 5 strings, while year_range must contain exactly 2 integers marking the start and end of the range.
Note that we provide a description for each field. This is important, as it will be used by LangChain to generate the prompt.
Now let’s create a parser to integrate Pydantic with LangChain:
from langchain.output_parsers import PydanticOutputParser
# ...
parser = PydanticOutputParser(pydantic_object=MusicTasteDescriptionResult)
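If you’re curious about what those field descriptions turn into, you can print the formatting instructions the parser generates at this point (we’ll see them embedded in the full prompt later):

# Inspect the formatting instructions derived from our Pydantic model
print(parser.get_format_instructions())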
Creating some examples and the prompt
Let’s define some examples to guide our LLM in outputting the expected object.
examples = [
    {
        "music taste description": "I like rock such as Rolling Stones or The Ramones, or the album London Calling from the clash",
        "result": MusicTasteDescriptionResult.parse_obj({
            "genres": ["rock"],
            "bands": ["Rolling Stones", "The Ramones", "The Clash"],
            "albums": ["London Calling"]
        }).json().replace("{", "{{").replace("}", "}}"),
    },
    {
        "music taste description": "I enjoy rock music from the 70s like Led Zeppelin",
        "result": MusicTasteDescriptionResult.parse_obj({
            "genres": ["rock"],
            "bands": ["Led Zeppelin"],
            "year_range": [1970, 1979]
        }).json().replace("{", "{{").replace("}", "}}"),
    },
    # add more examples as needed...
]
Here, I could have simply passed a JSON string for the result key, since that is what we want the LLM to output, but doing so would be error-prone and make managing our examples more difficult.
Instead, I’m using MusicTasteDescriptionResult to validate those examples and generate the JSON strings.
We need to add .replace("{", "{{").replace("}", "}}") after serializing to JSON so that the curly brackets in the JSON aren’t mistaken for prompt variables by LangChain. This simply escapes those curly brackets.
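If that repetition bothers you, one option is a small helper that validates an example and escapes the brackets in one place. This is just a convenience sketch, not something LangChain requires:

def make_example(description: str, result: dict) -> dict:
    """Validate an example against our Pydantic model and escape curly brackets for LangChain."""
    json_result = MusicTasteDescriptionResult.parse_obj(result).json()
    return {
        "music taste description": description,
        "result": json_result.replace("{", "{{").replace("}", "}}"),
    }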
Now let’s define the prompt. We’ll use a few-shot prompt template to guide the GPT-3 model.
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate
# ...
example_prompt = PromptTemplate(input_variables=["music taste description", "result"], template="Query: {music taste description}\nResult:\n{result}")

prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix="""Given a query describing a user's music taste, transform it into a structured object.
{format_instructions}
""",
    suffix="Query: {input}\nResult:\n",
    input_variables=["input"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)
A few things are happening here:
- We create an example prompt, where we describe which keys of the examples correspond to our prompt variables, as well as a basic template describing how the string will look for each of them.
- We instantiate a FewShotPromptTemplate that will populate our examples into the final prompt, build and provide the output formatting instructions based on the Pydantic parser, and finally append the user input in the same format as our example prompt, so that the LLM only has to complete it with the most likely output.
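Remember that LangChain can also pick the most relevant examples for the current input instead of always including all of them. Here’s a rough sketch of that, assuming you also install faiss-cpu for the vector store; we won’t use it in the rest of the tutorial:

from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import FAISS

# Sketch: select the 2 examples most semantically similar to the user input
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,            # the example dicts we defined above
    OpenAIEmbeddings(),  # embeddings used to compare examples with the input
    FAISS,               # vector store class backing the similarity search
    k=2,                 # number of examples to include in the prompt
)
# FewShotPromptTemplate then takes example_selector=example_selector instead of examples=examples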
Let’s look at the strings that both prompts create:
# call .format() to generate the string, and pass the input variables as keyword arguments
print(example_prompt.format(**examples[0]))
print("n#######n")
print(prompt.format(input="My favorite band is The Beatles"))
Output:
Query: I like rock such as Rolling Stones or The Ramones, or the album London Calling from the clash
Result:
{{"genres": ["rock"], "bands": ["Rolling Stones", "The Ramones", "The Clash"], "albums": ["London Calling"], "year_range": null}}
#######
Given a query describing a user's music taste, transform it into a structured object.
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
```
{"properties": {"genres": {"title": "Genres", "description": "Music genres liked by the user. Must contain between 1 and 5 genres", "minItems": 1, "maxItems": 5, "type": "array", "items": {"type": "string"}}, "bands": {"title": "Bands", "description": "Specific bands or artists liked by the user. If provided, must contain between 1 and 5 bands or artists", "minItems": 1, "maxItems": 5, "type": "array", "items": {"type": "string"}}, "albums": {"title": "Albums", "description": "Specific albums liked by the user. If provided, must contain between 1 and 5 albums", "minItems": 1, "maxItems": 5, "type": "array", "items": {"type": "string"}}, "year_range": {"title": "Year Range", "description": "Year range of music liked by the user. If provided, must contain exactly 2 years indicating the start and end of the range", "minItems": 2, "maxItems": 2, "type": "array", "items": {"type": "integer"}}}}
```
Query: I like rock such as Rolling Stones or The Ramones, or the album London Calling from the clash
Result:
{"genres": ["rock"], "bands": ["Rolling Stones", "The Ramones", "The Clash"], "albums": ["London Calling"], "year_range": null}
Query: I enjoy rock music from the 70s like Led Zeppelin
Result:
{"genres": ["rock"], "bands": ["Led Zeppelin"], "albums": null, "year_range": [1970, 1979]}
Query: My favorite band is The Beatles
Result:
As you can see, we end up with a full prompt containing a description of the goal, instructions automatically generated for the format of the output, a few examples, and the user input.
Running the Chain
Next, we’ll create a chain using the LLMChain class from LangChain. Chains are at the core of LangChain; if you’re not familiar with them, I encourage you to take a look at the documentation. I won’t try to explain them in depth here. For our purposes, we can simply treat chains as the basic unit of functionality in a LangChain application.
Let’s create our chain:
from langchain.chains import LLMChain
# ...
chain = LLMChain(llm=llm, prompt=prompt)
Now let’s test it with an input, parse it with our Pydantic parser and print the output:
output = chain.run("I love pop music from the 80s, especially Madonna")
print(output)

try:
    parsed_taste = parser.parse(output)
    print(f"""
Genres: {", ".join(parsed_taste.genres) if parsed_taste.genres else 'Not specified'}
Bands: {", ".join(parsed_taste.bands) if parsed_taste.bands else 'Not specified'}
Albums: {", ".join(parsed_taste.albums) if parsed_taste.albums else 'Not specified'}
Year Range: {f"{parsed_taste.year_range[0]} - {parsed_taste.year_range[1]}" if parsed_taste.year_range else 'Not specified'}
""")
except Exception as e:
    print(e)
Output:
{"genres": ["pop"], "bands": ["Madonna"], "albums": null, "year_range": [1980, 1989]}
Genres: pop
Bands: Madonna
Albums: Not specified
Year Range: 1980 - 1989
That’s it! You can now use the Pydantic model in your application. You can customize this script to suit your own needs by changing the Pydantic model, the examples, and the prompt template.
A potential next step would be to properly handle validation errors, in order to give feedback to your users if what they wrote cannot be understood.
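For instance, a rough sketch of that might look like the following, catching Pydantic’s ValidationError and LangChain’s OutputParserException and turning them into a friendly message (adapt the wording and error handling to your own app):

from langchain.schema import OutputParserException
from pydantic import ValidationError

def parse_music_taste(user_input: str):
    """Run the chain and return a parsed MusicTasteDescriptionResult, or None if the output can't be understood."""
    output = chain.run(user_input)
    try:
        return parser.parse(output)
    except (OutputParserException, ValidationError):
        print("Sorry, I couldn't understand your music taste. Try mentioning genres, bands, albums, or a decade.")
        return None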
Give LangChain a go, and let me know what you think in the comments ;)
Thanks for reading!
I’m Olivier Ramier, CTO at TelescopeAI. If you’re passionate about Machine Learning, LangChain, or LLMs, please reach out — we’re always looking for talented people to help us in our mission to help companies find the right customers at the right time.