SelfQueryRetriever is a retriever that can generate and execute its own queries.
Given the user's natural-language query, it uses a query-constructing LLM chain to build a structured query. It then applies this structured query to the underlying vector store (VectorStore) to perform the search.
In this way, SelfQueryRetriever goes beyond simply comparing the user's query with the content of the stored documents: it extracts filters on the documents' metadata from the query and runs those filters to find relevant documents.
[Note]
For the list of self-query retrievers that LangChain supports, see the LangChain documentation.
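As a rough illustration of the flow described above, the following sketch splits a request into a search string and a set of metadata conditions, then applies the conditions to a few in-memory records. It uses neither LangChain nor an LLM: the `StructuredRequest` class, the `matches` helper, the sample records, and the hand-written conditions are all hypothetical stand-ins for what the query-constructing chain and the vector store would do.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredRequest:
    """Hypothetical stand-in for a structured query: search text plus metadata conditions."""

    query: str
    # Each condition is (field name, operator, value); all conditions must hold.
    conditions: list = field(default_factory=list)


# Comparison operators supported by this toy example.
OPS = {
    "eq": lambda a, b: a == b,
    "gte": lambda a, b: a >= b,
    "lte": lambda a, b: a <= b,
}


def matches(metadata: dict, request: StructuredRequest) -> bool:
    """Return True if the metadata satisfies every condition of the request."""
    return all(
        name in metadata and OPS[op](metadata[name], value)
        for name, op, value in request.conditions
    )


records = [
    {"text": "matte foundation", "metadata": {"year": 2023, "category": "makeup", "user_rating": 4.5}},
    {"text": "cleansing oil", "metadata": {"year": 2023, "category": "cleansing", "user_rating": 4.8}},
    {"text": "tone-up sunscreen", "metadata": {"year": 2024, "category": "sun care", "user_rating": 4.9}},
]

# "Recommend makeup products released in 2023", written by hand as a structured request.
request = StructuredRequest(
    query="makeup products",
    conditions=[("year", "eq", 2023), ("category", "eq", "makeup")],
)

# Only records whose metadata passes the filter would go on to similarity search.
filtered = [r["text"] for r in records if matches(r["metadata"], request)]
print(filtered)
```

In the real retriever the remaining `query` text would still be embedded and compared against the filtered documents; the sketch only shows the filtering half.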
# Configuration file for managing API keys as environment variables.
from dotenv import load_dotenv
# Load API key information
load_dotenv()
True
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("CH11-Retriever")
Sample data generation
Using the descriptions and metadata of cosmetic products, we build a vector store that supports similarity search.
SelfQueryRetriever
You can now instantiate the retriever. To do so, you need to provide in advance the metadata fields that the documents support and a brief description of the document contents.
The AttributeInfo class is used to define information about the cosmetic metadata fields.
Category ( category ): string type; the category of the cosmetic, one of ['skincare', 'makeup', 'cleansing', 'sun care'].
Year ( year ): integer type; the year the cosmetic was released.
User rating ( user_rating ): float type; the user rating, in the range 1-5.
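Because filter values are compared against the stored metadata, it can help to check that every document's metadata actually matches the declared fields before building the retriever. The sketch below is a plain-Python sanity check and not part of LangChain; `FIELD_TYPES`, `ALLOWED_CATEGORIES`, and `check_metadata` are hypothetical names, and the category values are assumed to be stored in lowercase.

```python
# Declared field types, mirroring the three metadata fields described above.
FIELD_TYPES = {"category": str, "year": int, "user_rating": float}
# Assumed set of valid category values (lowercase).
ALLOWED_CATEGORIES = {"skincare", "makeup", "cleansing", "sun care"}


def check_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems found in one metadata dict."""
    problems = []
    for name, expected in FIELD_TYPES.items():
        if name not in metadata:
            problems.append(f"missing field: {name}")
        elif not isinstance(metadata[name], expected):
            problems.append(f"{name} should be {expected.__name__}")

    category = metadata.get("category")
    if isinstance(category, str) and category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category!r}")

    rating = metadata.get("user_rating")
    if isinstance(rating, float) and not 1 <= rating <= 5:
        problems.append(f"user_rating out of range: {rating}")
    return problems


good = {"year": 2023, "category": "makeup", "user_rating": 4.5}
bad = {"year": "2024", "category": "Skincare", "user_rating": 4.7}

print(check_metadata(good))
print(check_metadata(bad))
```

A check like this catches capitalization mismatches such as 'Skincare' versus 'skincare', which matter because an equality filter on a string field is typically case-sensitive.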
Create the retriever object with the SelfQueryRetriever.from_llm() method.
llm : Language model
vectorstore : Vector store
document_contents : Description of the contents of the documents
metadata_field_info : Metadata field information
Query test
Enter queries that imply a metadata filter and run the search.
You can perform a search using complex filters.
k is the number of documents to retrieve.
With SelfQueryRetriever you can also specify k. To do this, pass enable_limit=True to the constructor.
Three products were released in 2023, but setting k to 2 returns only two of them.
Even without explicitly setting search_kwargs in code, you can limit the number of results by stating a number in the query itself, such as "one product" or "2 products".
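One way to picture how the two limits interact is sketched below. It assumes that a limit extracted from the query takes precedence over the `k` given in `search_kwargs`, and that no limit is extracted unless `enable_limit=True`. The `resolve_search_kwargs` helper and the example filter are hypothetical and only mimic that merging step; this is not LangChain's own code.

```python
from typing import Optional


def resolve_search_kwargs(
    search_kwargs: dict, filter_kwargs: dict, limit: Optional[int]
) -> dict:
    """Merge default search kwargs with the translated filter and an optional query limit."""
    # Defaults first, then whatever the query translation produced.
    merged = {**search_kwargs, **filter_kwargs}
    # Assumption: a limit found in the query overrides the default k.
    if limit is not None:
        merged["k"] = limit
    return merged


defaults = {"k": 2}
year_filter = {"filter": {"year": {"$eq": 2023}}}

# "Please recommend products released in 2023": no number in the query.
no_limit = resolve_search_kwargs(defaults, year_filter, None)
# "Recommend one product released in 2023": the query itself asks for one result.
with_limit = resolve_search_kwargs(defaults, year_filter, 1)

print(no_limit["k"], with_limit["k"])
```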
Going deeper
To see what happens inside and to have more custom control, we can reconstruct the retriever from scratch.
This process starts by creating a query-construction chain.
Create the query_constructor chain, which generates structured queries. Use the get_query_constructor_prompt function to get the query constructor prompt.
Call the query_constructor.invoke() method to process a given query.
Let's check the generated query.
A key element of the self-query retriever is the query constructor. To build a good retrieval system, you need to make sure the query constructor works well.
To do this, you need to adjust the prompt, the examples within the prompt, the attribute descriptions, and so on.
Converting structured queries with a structured query translator
The next important element is the structured query translator.
It is responsible for converting a generic StructuredQuery object into a metadata filter that fits the syntax of the vector store in use.
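To make the translator's role concrete, here is a minimal sketch that converts a small, generic filter tree into a Chroma-style `where` dictionary (operators such as `$and`, `$eq`, `$gte`). The `Comparison` and `Operation` dataclasses and the `to_chroma_filter` function are hypothetical, simplified stand-ins written for this example; they are not LangChain's classes or its actual translator.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    """A single condition on one metadata attribute."""

    comparator: str  # e.g. "eq", "gt", "gte", "lt", "lte"
    attribute: str
    value: Any


@dataclass
class Operation:
    """A logical combination of conditions."""

    operator: str  # "and" or "or"
    arguments: List[Union["Comparison", "Operation"]]


def to_chroma_filter(node: Union[Comparison, Operation]) -> dict:
    """Recursively translate the generic tree into a Chroma-style filter dict."""
    if isinstance(node, Comparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    return {f"${node.operator}": [to_chroma_filter(arg) for arg in node.arguments]}


# Hand-written filter for: skincare products released in 2023 rated 4.5 or higher.
generic_filter = Operation(
    "and",
    [
        Comparison("eq", "category", "skincare"),
        Comparison("gte", "user_rating", 4.5),
        Comparison("eq", "year", 2023),
    ],
)

chroma_filter = to_chroma_filter(generic_filter)
print(chroma_filter)
```

The same generic tree could be handed to a different translation function for another vector store, which is why the structured query is kept store-agnostic until this step.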
Use the retriever.invoke() method to retrieve the documents that match a given question.
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
# Create descriptions and metadata for cosmetic products
docs = [
    Document(
        page_content="Hydrate deep into your skin with this moisture-rich hyaluronic acid serum.",
        metadata={"year": 2024, "category": "skincare", "user_rating": 4.7},
    ),
    Document(
        page_content="A foundation with a matte finish that lasts 24 hours, covers pores and gives a natural skin look.",
        metadata={"year": 2023, "category": "makeup", "user_rating": 4.5},
    ),
    Document(
        page_content="A hypoallergenic cleansing oil made from plant-based ingredients that gently removes makeup and impurities.",
        metadata={"year": 2023, "category": "cleansing", "user_rating": 4.8},
    ),
    Document(
        page_content="Brightening cream with vitamin C, brightens dull skin tone.",
        metadata={"year": 2023, "category": "skincare", "user_rating": 4.6},
    ),
    Document(
        page_content="Long lasting lipstick, Vivid color and moist feel for comfortable use all day long.",
        metadata={"year": 2024, "category": "makeup", "user_rating": 4.4},
    ),
    Document(
        page_content="Tone-up sunscreen with UV protection, SPF50+/PA++++ Protects skin with high UV protection factor.",
        metadata={"year": 2024, "category": "sun care", "user_rating": 4.9},
    ),
]
# Create a vector store
vectorstore = Chroma.from_documents(
    docs, OpenAIEmbeddings(model="text-embedding-3-small")
)
from langchain.chains.query_constructor.base import AttributeInfo
# Define the metadata field information
# (the category values listed in the description match the values stored in the documents)
metadata_field_info = [
    AttributeInfo(
        name="category",
        description="The category of the cosmetic product. One of ['skincare', 'makeup', 'cleansing', 'sun care']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the cosmetic product was released",
        type="integer",
    ),
    AttributeInfo(
        name="user_rating",
        description="A user rating for the cosmetic product, ranging from 1 to 5",
        type="float",
    ),
]
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI
# Define the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Create the SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Brief summary of a cosmetic product",
    metadata_field_info=metadata_field_info,
)
# Self-query search
retriever.invoke("Please recommend products with a rating of 4.8 or higher")
[Document(metadata={'category': 'sun care', 'user_rating': 4.9, 'year': 2024}, page_content='Tone-up sunscreen with UV protection, SPF50+/PA++++ Protects skin with high UV ...
# Self-query search
retriever.invoke("Please recommend products released in 2023")
[Document(metadata={'category': 'makeup', 'user_rating': 4.5, 'year': 2023}, page_content='A foundation with a matte finish that lasts 24 hours, covers pores and gives a natural skin look.'), Document(metadata={'category': 'cleansing', 'user_rating': 4.8, 'year': 2023}, page_content='A hypoallergenic cleansing oil made from plant-based ingredients that gently removes makeup and impurities.')]
# Self-query search
retriever.invoke("Please recommend a product in the sun care category")
[Document(metadata={'category': 'sun care', 'user_rating': 4.9, 'year': 2024}, page_content='Tone-up sunscreen with UV protection, SPF50+/PA++++ Protects skin with high UV protection factor.')]
# Self-query search
retriever.invoke(
    "Please recommend products with a rating of 4.5 or higher among products in the makeup category."
)
[Document(metadata={'category': 'makeup', 'user_rating': 4.5, 'year': 2023}, page_content='A foundation with a matte finish that lasts 24 hours, covers pores and gives a natural skin look.')]
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Brief summary of a cosmetic product",
    metadata_field_info=metadata_field_info,
    enable_limit=True,  # Enable limiting the number of search results.
    search_kwargs={"k": 2},  # Limit the search results to 2 by setting k to 2.
)
# Self-query search
retriever.invoke("Please recommend products released in 2023")
[Document(metadata={'category': 'makeup', 'user_rating': 4.5, 'year': 2023}, page_content='A foundation with a matte finish that lasts 24 hours, covers pores and gives a natural skin look.'), Document(...)]
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Brief summary of a cosmetic product",
    metadata_field_info=metadata_field_info,
    enable_limit=True,  # Enable limiting the number of search results.
)
# Self-query search
retriever.invoke("Recommend one product released in 2023")
[Document(metadata={'category': 'makeup', 'user_rating': 4.5, 'year': 2023}, page_content='A foundation with a matte finish that lasts 24 hours, covers pores and gives a natural skin look.')]
# Self-query search
retriever.invoke("Recommend 2 products released in 2023")
[Document(metadata={'category': 'makeup', 'user_rating': 4.5, 'year': 2023}, page_content='A foundation with a matte finish that lasts 24 hours, covers pores and gives a natural skin look.'), Document(...)]
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)
# Build the query constructor prompt from the document description and the metadata field information
prompt = get_query_constructor_prompt(
    "Brief summary of a cosmetic product",  # Description of document contents
    metadata_field_info,  # Metadata field information
)
# Create the StructuredQueryOutputParser
output_parser = StructuredQueryOutputParser.from_components()
# Create the query_constructor chain
query_constructor = prompt | llm | output_parser
# Call the query constructor to generate a structured query for the given question.
query_output = query_constructor.invoke(
    {
        "query": "Please recommend a skincare product among the products released in 2023 with a rating of 4.5 or higher."
    }
)