# 01. VectorStore-backed Retriever

## VectorStore-backed Retriever <a href="#vectorstore-backed-retriever" id="vectorstore-backed-retriever"></a>

A **VectorStore-backed retriever** is a retriever that searches for documents using a vector store.

It queries the text in the vector store with the store's own search methods, such as **similarity search** and **MMR**.

Run the code below to create the VectorStore.

```python
# Configuration file for managing API keys as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()
```

```
True
```

```python
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH11-Retriever")
```

```
 Start tracking LangSmith. 
[Project name] 
CH11-Retriever 
```

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Load the file using TextLoader.
loader = TextLoader("./data/appendix-keywords.txt")

# Load the document.
documents = loader.load()

# Create a CharacterTextSplitter that splits text based on characters.
# The chunk size is 300 and there is no overlap between chunks.
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)

# Split the loaded document.
split_docs = text_splitter.split_documents(documents)

# Generate OpenAI embeddings.
embeddings = OpenAIEmbeddings()

# Create a FAISS vector database from the split text and embeddings.
db = FAISS.from_documents(split_docs, embeddings)
```

#### Initializing a VectorStoreRetriever from a VectorStore (as\_retriever) <a href="#vectorstore-vectorstoreretriever-as_retriever" id="vectorstore-vectorstoreretriever-as_retriever"></a>

The `as_retriever` method initializes and returns a VectorStoreRetriever based on the VectorStore object. It lets you configure various search options so that document retrieval matches your needs.

**Parameters**

* `**kwargs` : Keyword arguments to pass to the search function
* `search_type` : Search type ("similarity", "mmr", "similarity\_score\_threshold")
* `search_kwargs` : Additional search options
  * `k` : Number of documents to return (default: 4)
  * `score_threshold` : Minimum similarity threshold for the similarity\_score\_threshold search
  * `fetch_k` : Number of documents to pass to the MMR algorithm (default: 20)
  * `lambda_mult` : Diversity control for MMR results (between 0 and 1, default: 0.5)
  * `filter` : Filtering based on document metadata

**Return value**

* `VectorStoreRetriever` : Initialized VectorStoreRetriever object

**Reference**

* Various search strategies can be implemented (similarity, MMR, threshold-based)
* The MMR (Maximal Marginal Relevance) algorithm lets you control the diversity of search results
* Metadata filtering restricts retrieval to documents that meet specific conditions
* Tags can be attached to the retriever via the `tags` parameter

**Caution**

* `search_type` and `search_kwargs` must be combined appropriately
* When using MMR, the `fetch_k` and `k` values need to be balanced
* A `score_threshold` that is set too high may return no results at all
* When using `filter`, make sure it matches the metadata structure of the dataset
* The closer `lambda_mult` is to 0, the more diverse the results; the closer to 1, the more similar
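
The metadata `filter` works like a pre-selection step: only documents whose metadata matches every condition take part in the similarity search. Below is a minimal pure-Python sketch of that idea (illustrative only; `matches_filter` is a hypothetical helper, not a LangChain API):

```python
# Conceptual sketch of metadata filtering: a document participates in the
# search only if its metadata matches every key/value condition.
def matches_filter(metadata: dict, conditions: dict) -> bool:
    return all(metadata.get(key) == value for key, value in conditions.items())

docs = [
    {"content": "chunk 1", "metadata": {"source": "a.txt", "page": 1}},
    {"content": "chunk 2", "metadata": {"source": "b.txt", "page": 1}},
]

# Keep only documents from a.txt before any similarity scoring happens.
filtered = [d["content"] for d in docs if matches_filter(d["metadata"], {"source": "a.txt"})]
print(filtered)  # ['chunk 1']
```

With FAISS this would correspond to something like `db.as_retriever(search_kwargs={"filter": {"source": "a.txt"}})`, assuming the metadata key actually exists in your documents.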

```python
# Create a retriever from the database to use it for search.
retriever = db.as_retriever()
```

## Retriever invoke() <a href="#retriever-invoke" id="retriever-invoke"></a>

The `invoke` method is the retriever's main entry point for retrieving relevant documents. It calls the retriever synchronously and returns the documents relevant to the given query.

**Parameters**

* `input` : Search query string
* `config` : Retriever configuration (Optional\[RunnableConfig])
* `**kwargs` : Additional arguments to pass to the retriever

**Return value**

* `List[Document]` : List of related documents

```python
# Search for related documents
docs = retriever.invoke("What is Embedding?")

for doc in docs:
    print(doc.page_content)
    print("=========================================================")
```

```
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
========================================================= 
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc. 
Associates: natural language processing, search algorithms, data mining 

Embedding 
========================================================= 
Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 
========================================================= 

```

#### Max Marginal Relevance (MMR) <a href="#max-marginal-relevance-mmr" id="max-marginal-relevance-mmr"></a>

**MMR (Maximal Marginal Relevance)** is a search method that avoids returning **redundant** documents when retrieving items relevant to a query.

Instead of simply returning the most relevant documents, MMR considers both a document's **relevance to the query** and its **dissimilarity to the documents already selected**.

* Setting the `search_type` parameter to `"mmr"` uses the **MMR (Maximal Marginal Relevance)** search algorithm.
* `k` : Number of documents to return (default: 4)
* `fetch_k` : Number of documents to pass to the MMR algorithm (default: 20)
* `lambda_mult` : Diversity control for MMR results (0\~1, default: 0.5; 0: maximum diversity, 1: similarity only)
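
To make the relevance/diversity trade-off concrete, here is a minimal pure-Python sketch of the MMR selection loop (illustrative only; the real implementation lives inside the vector store library, and `mmr_select` / `cosine` are hypothetical helper names):

```python
def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of k documents balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

# A high lambda_mult keeps the two most similar documents (indices 0 and 1);
# a low lambda_mult trades the near-duplicate for a diverse one.
print(mmr_select([1.0, 0.0], [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]], k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select([1.0, 0.0], [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]], k=2, lambda_mult=0.1))  # [0, 2]
```

In the real retriever, `fetch_k` controls how many candidates enter this loop before `k` of them are kept.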

```python
# Specify MMR (Maximal Marginal Relevance) as the search type.
retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10, "lambda_mult": 0.6}
)

# Search for related documents.
docs = retriever.invoke("What is Embedding?")

# Print the search results.
for doc in docs:
    print(doc.page_content)
    print("=========================================================")
```

```
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
========================================================= 
Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc. 
Associates: natural language processing, search algorithms, data mining 

Embedding 
========================================================= 
```

#### Similarity score threshold search (similarity\_score\_threshold) <a href="#similarity_score_threshold" id="similarity_score_threshold"></a>

You can set a similarity score threshold so that only documents scoring at or above that threshold are returned.

Setting the threshold appropriately lets you **filter out less relevant documents** and **keep only the most similar ones**.

* Set the `search_type` parameter to `"similarity_score_threshold"` to search based on a similarity score threshold.
* Pass the threshold via the `search_kwargs` parameter, e.g. `{"score_threshold": 0.8}`. This means **only documents with a similarity score of 0.8 or higher are returned**.
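
Conceptually, the threshold search scores every candidate and keeps only those at or above the threshold. A minimal sketch of that filtering step (illustrative; `filter_by_score` is a hypothetical helper, and real stores normalize scores to the 0-1 range in store-specific ways):

```python
# Conceptual sketch: keep only documents whose relevance score
# (normalized to the 0-1 range) meets the threshold.
def filter_by_score(scored_docs, score_threshold):
    """scored_docs: list of (document, relevance_score) pairs."""
    return [doc for doc, score in scored_docs if score >= score_threshold]

scored = [("embedding chunk", 0.91), ("word2vec chunk", 0.83), ("crawl chunk", 0.42)]
print(filter_by_score(scored, 0.8))  # ['embedding chunk', 'word2vec chunk']
```

If the threshold exceeds every score (say 0.95 here), the result is empty, which is exactly the failure mode to watch for when tuning `score_threshold`.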

```python
retriever = db.as_retriever(
    # Set the search type to "similarity_score_threshold"
    search_type="similarity_score_threshold",
    # Set the threshold
    search_kwargs={"score_threshold": 0.8},
)

# Search for related documents
for doc in retriever.invoke("What is Word2Vec?"):
    print(doc.page_content)
    print("=========================================================")
```

```
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
```

#### top\_k setting <a href="#top_k" id="top_k"></a>

When searching, you can specify search keyword arguments (kwargs) such as `k`.

The `k` parameter sets the number of top results to return.

* Set `k` to 1 in `search_kwargs` to specify the number of documents returned as search results.

```python
# Set k
retriever = db.as_retriever(search_kwargs={"k": 1})

# Search for related documents
docs = retriever.invoke("What is Embedding?")

# Print the search results
for doc in docs:
    print(doc.page_content)
    print("=========================================================")
```

```
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
========================================================= 
```

### Dynamic settings (Configurable) <a href="#configurable" id="configurable"></a>

* Use `ConfigurableField` to adjust search settings dynamically.
* `ConfigurableField` sets a unique identifier, name, and description for each search parameter.
* To adjust the search settings, pass them via the `config` parameter.
* The search settings are stored under the `configurable` key of the dictionary passed to `config`.
* Because the settings are passed along with the search query, they can be adjusted dynamically per query.

```python
from langchain_core.runnables import ConfigurableField

# Set k
retriever = db.as_retriever(search_kwargs={"k": 1}).configurable_fields(
    search_type=ConfigurableField(
        id="search_type",
        name="Search Type",
        description="The search type to use",
    ),
    search_kwargs=ConfigurableField(
        # Set a unique identifier for the search parameters
        id="search_kwargs",
        # Set the name of the search parameters
        name="Search Kwargs",
        # Write a description for the search parameters
        description="The search kwargs to use",
    ),
)
```

Below is an example with dynamic search settings.

```python
# Specify the search settings. Set k=3 so the FAISS search returns the 3 most similar documents.
config = {"configurable": {"search_kwargs": {"k": 3}}}

# Search for related documents
docs = retriever.invoke("What is Embedding?", config=config)

# Print the search results
for doc in docs:
    print(doc.page_content)
    print("=========================================================")
```

```
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
========================================================= 
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc. 
Associates: natural language processing, search algorithms, data mining 

Embedding 
========================================================= 
```

```python
# Specify the search settings. Only documents with a score of 0.8 or higher are returned.
config = {
    "configurable": {
        "search_type": "similarity_score_threshold",
        "search_kwargs": {
            "score_threshold": 0.8,
        },
    }
}

# Search for related documents
docs = retriever.invoke("What is Word2Vec?", config=config)

# Print the search results
for doc in docs:
    print(doc.page_content)
    print("=========================================================")
```

```
 Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
```

```python
# Specify the search settings. Use MMR search.
config = {
    "configurable": {
        "search_type": "mmr",
        "search_kwargs": {"k": 2, "fetch_k": 10, "lambda_mult": 0.6},
    }
}

# Search for related documents
docs = retriever.invoke("What is Word2Vec?", config=config)

# Print the search results
for doc in docs:
    print(doc.page_content)
    print("=========================================================")
```

```
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 
========================================================= 
```

### Separate query & passage embedding models, such as Upstage embeddings <a href="#upstage-query-passage-embedding-model" id="upstage-query-passage-embedding-model"></a>

By default, a retriever uses the same embedding model for both queries and documents.

However, some providers supply separate embedding models for queries and documents. In that case, the query is embedded with the query model and the documents with the passage (document) model.

This lets you use an embedding model optimized for each role.

```python
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_upstage import UpstageEmbeddings

# Load the file using TextLoader.
loader = TextLoader("./data/appendix-keywords.txt")

# Load the document.
documents = loader.load()

# Create a CharacterTextSplitter that splits text based on characters.
# The chunk size is 300 and there is no overlap between chunks.
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)

# Split the loaded document.
split_docs = text_splitter.split_documents(documents)

# Generate embeddings using the Upstage document (passage) model.
doc_embedder = UpstageEmbeddings(model="solar-embedding-1-large-passage")

# Create a FAISS vector database from the split text and embeddings.
db = FAISS.from_documents(split_docs, doc_embedder)
```

Below is an example that creates an Upstage embedding model for queries, converts the query sentence into a vector, and performs a vector similarity search.

```python
# Generate Upstage embeddings for queries, using the query model.
query_embedder = UpstageEmbeddings(model="solar-embedding-1-large-query")

# Convert the query sentence into a vector.
query_vector = query_embedder.embed_query("What is Embedding?")

# Perform a vector similarity search and return the two most similar documents.
db.similarity_search_by_vector(query_vector, k=2)
```

```
[Document(metadata={'source': './data/appendix-keywords.txt'}, page_content='Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.\nExample: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].\nAssociated Keywords: natural language processing, vectorization, deep learning\n\nToken'),
 Document(metadata={'source': './data/appendix-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.\nExample: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other.\nAssociated Keywords: natural language processing, embedding, semantic similarity\nLLM (Large Language Model)')]
```
