# 01. Chroma

## Chroma <a href="#chroma" id="chroma"></a>

This notebook covers how to get started with the Chroma vector store.

Chroma is an AI-native open-source vector database focused on developer productivity and happiness. It is licensed under Apache 2.0.

**Reference links**

* [Chroma LangChain document](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)
* [Chroma Official Document](https://docs.trychroma.com/getting-started)
* [List of vector stores supported by LangChain](https://python.langchain.com/v0.2/docs/integrations/vectorstores/)

```
# Configuration file that manages API keys as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()
```

```
True
```

```
# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH10-VectorStores")
```

```
LangSmith tracing started.
[Project name]
CH10-VectorStores
```

Load the sample dataset.

```
from langchain_community.document_loaders import TextLoader
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma


# Text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=0)

# Load the text files -> convert to List[Document]
loader1 = TextLoader("data/nlp-keywords.txt")
loader2 = TextLoader("data/finance-keywords.txt")

# Split the documents
split_doc1 = loader1.load_and_split(text_splitter)
split_doc2 = loader2.load_and_split(text_splitter)

# Check the number of documents
len(split_doc1), len(split_doc2)
```

```
 (11, 6) 
```
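The splitter above produces chunks of at most 600 characters with no overlap. As a rough intuition, fixed-size chunking can be sketched in plain Python. Note this is a simplification: the real `RecursiveCharacterTextSplitter` first splits on separators such as `"\n\n"` and only falls back to character counts.

```python
# Naive fixed-size chunking: a simplified stand-in for
# RecursiveCharacterTextSplitter's character-count fallback.
def naive_split(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks


sample = "word " * 300  # 1500 characters
chunks = naive_split(sample, chunk_size=600)
print(len(chunks), [len(c) for c in chunks])  # → 3 [600, 600, 300]
```

With `chunk_overlap=0` every character belongs to exactly one chunk, which is why the two files above yield 11 and 6 non-overlapping documents.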

### VectorStore creation <a href="#vectorstore" id="vectorstore"></a>

#### Vector repository creation (from\_documents) <a href="#from_documents" id="from_documents"></a>

The `from_documents` class method creates a vector store from a list of documents.

**Parameters**

* `documents` (List\[Document]): List of documents to add to the vector store
* `embedding` (Optional\[Embeddings]): Embedding function. Defaults to None
* `ids` (Optional\[List\[str]]): List of document IDs. Defaults to None
* `collection_name` (str): Name of the collection to create
* `persist_directory` (Optional\[str]): Directory in which to persist the collection. Defaults to None
* `client_settings` (Optional\[chromadb.config.Settings]): Chroma client settings
* `client` (Optional\[chromadb.Client]): Chroma client instance
* `collection_metadata` (Optional\[Dict]): Collection configuration information. Defaults to None

**Notes**

* If `persist_directory` is specified, the collection is persisted in that directory. If not, the data is held temporarily in memory.
* Internally, this method calls `from_texts` to create the vector store.
* Each document's `page_content` is used as the text and its `metadata` as the metadata.

**Returns**

* `Chroma` : The created Chroma vector store instance. When creating it, pass the list of `Document` objects as the `documents` parameter, specify the embedding model to use, and optionally set `collection_name`, which plays the role of a namespace.

```
# Create the DB
db = Chroma.from_documents(
    documents=split_doc1, embedding=OpenAIEmbeddings(), collection_name="my_db"
)
```

When `persist_directory` is specified, the data is saved to disk as files.

```
# Specify the path to save to
DB_PATH = "./chroma_db"

# Save the documents to disk; pass the save path as persist_directory.
persist_db = Chroma.from_documents(
    split_doc1, OpenAIEmbeddings(), persist_directory=DB_PATH, collection_name="my_db"
)
```

Running the code below loads the data stored in `DB_PATH`.

```
# Loads a document from disk.
persist_db = Chroma(
    persist_directory=DB_PATH,
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_db",
)
```

Inspect the data stored in the loaded vector store.

```
# Check saved data
persist_db.get()
```

```
{'ids': ['0e99026d-a1a9-410a-9eb8-8486b6f0194a', ...], 'embeddings': None, 'metadatas': [{'source': 'data/nlp-keywords.txt'}, ...], 'documents': ['Definition: Open source means software whose source code is released and can be freely used, modified, and distributed by anyone. This plays an important role in promoting collaboration and innovation.\n Example: The Linux operating system is a representative open source project. ...', 
... 
(omitted) 
... 
'... Definition: Data mining is the process of discovering useful information from large amounts of data. ...'], 'uris': None, 'data': None}
```

If you specify a different `collection_name`, no results are returned because no data is stored under that name.

```
# Loads a document from disk.
persist_db2 = Chroma(
    persist_directory=DB_PATH,
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_db2",
)

# Check saved data
persist_db2.get()
```

```
{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None}
```

#### Vector repository creation (from\_texts) <a href="#from_texts" id="from_texts"></a>

The `from_texts` class method creates a vector store from a list of texts.

**Parameters**

* `texts` (List\[str]): List of texts to add to the collection
* `embedding` (Optional\[Embeddings]): Embedding function. Defaults to None
* `metadatas` (Optional\[List\[dict]]): List of metadata. Defaults to None
* `ids` (Optional\[List\[str]]): List of document IDs. Defaults to None
* `collection_name` (str): Name of the collection to create. Defaults to `_LANGCHAIN_DEFAULT_COLLECTION_NAME`
* `persist_directory` (Optional\[str]): Directory in which to persist the collection. Defaults to None
* `client_settings` (Optional\[chromadb.config.Settings]): Chroma client settings
* `client` (Optional\[chromadb.Client]): Chroma client instance
* `collection_metadata` (Optional\[Dict]): Collection configuration information. Defaults to None

**Notes**

* If `persist_directory` is specified, the collection is persisted in that directory. If not, the data is held temporarily in memory.
* If `ids` are not provided, they are generated automatically using UUIDs.

**Returns**

* The created vector store instance

```
# Create a vector store from a list of strings
db2 = Chroma.from_texts(
    ["Hello, it's really nice to meet you.", "My name is Teddy."],
    embedding=OpenAIEmbeddings(),
)
```

```
# Query the data.
db2.get()
```

```
{'ids': ['40a857ba-16ab-4dbb-b518-f88a34ba383c', '5927395f-6a75-49a5-861f-a946ccb72c0c'], 'embeddings': None, 'metadatas': [None, None], 'documents': ["Hello, it's really nice to meet you.", 'My name is Teddy.'], 'uris': None, 'data': None}
```
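When `ids` are not provided, Chroma generates them automatically using UUIDs, which is where IDs like the ones in the output above come from. A quick illustration with the standard-library `uuid` module (this shows the shape of the IDs, not Chroma's internals):

```python
import uuid

# One UUID4 string per text, similar in shape to the auto-generated ids above
texts = ["Hello, it's really nice to meet you.", "My name is Teddy."]
ids = [str(uuid.uuid4()) for _ in texts]
print(ids)  # e.g. ['40a857ba-16ab-...', '5927395f-6a75-...']
```

Passing explicit `ids` instead makes documents addressable later (for lookup, upsert, or delete), as shown in the sections below.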

#### Similarity search <a href="#id-1" id="id-1"></a>

The `similarity_search` method performs a similarity search against the Chroma database and returns the documents most similar to the given query.

**Parameters**

* `query` (str): Query text to search
* `k` (int, optional): Number of results to return. The default is 4.
* `filter` (Dict\[str, str], optional): Filter by metadata. The default is None.

**Notes**

* Adjust the `k` value to get the desired number of results.
* Use the `filter` parameter to search only documents that meet certain metadata conditions.
* This method returns documents only, without score information. If you also need scores, use the `similarity_search_with_score` method.

**Returns**

* `List[Document]` : List of documents most similar to query text

```
db.similarity_search("Tell me about TF IDF")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. It takes into account the frequency of the word in a document and the rarity of that word across the entire set of documents.\n Example: A word that rarely appears across many documents has a high TF-IDF value. ...'), 
... 
(omitted) 
... 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: CSV (Comma-Separated Values) is a file format for storing data in which each value is separated by a comma. It is used to simply store and exchange tabular data. ...')]
```

You can specify the number of search results with the `k` value.

```
db.similarity_search("Tell me about TF IDF", k=2)
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. ...'), 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Open source means software whose source code is released and can be freely used, modified, and distributed by anyone. ...')]
```

With `filter`, you can use `metadata` information to filter the search results.

```
# use filter
db.similarity_search(
    "Tell me about TF IDF", filter={"source": "data/nlp-keywords.txt"}, k=2
)
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. ...'), 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Open source means software whose source code is released and can be freely used, modified, and distributed by anyone. ...')]
```

Next, use a different `source` in the `filter` and check the search results.

```
# use filter 
db.similarity_search(
    "Tell me about TF IDF", filter={"source": "data/finance-keywords.txt"}, k=2
)
```

```
 [] 
```
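The empty result above is exactly what metadata filtering should produce when nothing matches. The logic can be mimicked in plain Python: keep only documents whose metadata matches every key/value pair in the filter. A minimal sketch with hypothetical in-memory documents (not Chroma's implementation):

```python
# Hypothetical in-memory documents: (page_content, metadata) pairs
docs = [
    ("Definition: TF-IDF is a statistical measure ...", {"source": "data/nlp-keywords.txt"}),
    ("Definition: ESG is an investment approach ...", {"source": "data/finance-keywords.txt"}),
]


def apply_filter(docs, flt):
    # A document passes when every key/value in the filter matches its metadata
    return [d for d in docs if all(d[1].get(k) == v for k, v in flt.items())]


print(len(apply_filter(docs, {"source": "data/nlp-keywords.txt"})))  # → 1
print(len(apply_filter(docs, {"source": "data/no-such-file.txt"})))  # → 0
```

In Chroma the filter is applied before similarity ranking, so a non-matching `source` yields `[]` regardless of `k`.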

#### Add documents to vector storage <a href="#id-2" id="id-2"></a>

The `add_documents` method adds documents to, or updates documents in, the vector store.

**Parameters**

* `documents` (List\[Document]): List of documents to add to the vector store
* `**kwargs` : Additional keyword arguments
  * `ids` : List of document IDs (takes priority over the IDs carried by the documents themselves)

**Notes**

* The `add_texts` method must be implemented, since this method relies on it internally.
* Each document's `page_content` is used as the text and its `metadata` as the metadata.
* If a document carries an ID and no ID is provided via `kwargs`, the document's ID is used.
* A ValueError is raised if the number of IDs in `kwargs` does not match the number of documents.

**Returns**

* `List[str]` : List of IDs of the added texts

**Raises**

* `NotImplementedError` : Raised when the `add_texts` method is not implemented

```python
from langchain_core.documents import Document

# Specify page_content, metadata, and id
db.add_documents(
    [
        Document(
            page_content="Hello! This time I will add a new document",
            metadata={"source": "mydata.txt"},
            id="1",
        )
    ]
)
```

```
['1'] 
```

```
# Search document with id=1
db.get("1")
```

```
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['Hello! This time I will add a new document'], 'uris': None, 'data': None}
```

The `add_texts` method embeds texts and adds them to the vector store.

**Parameters**

* `texts` (Iterable\[str]): List of texts to add to the vector store
* `metadatas` (Optional\[List\[dict]]): List of metadata. Defaults to None
* `ids` (Optional\[List\[str]]): List of document IDs. Defaults to None

**Notes**

* If `ids` are not provided, they are generated automatically using UUIDs.
* If an embedding function is set, the texts are embedded.
* If metadata is provided:
  * Texts with and without metadata are separated and processed differently.
  * Texts without metadata are filled in with an empty dictionary.
* An upsert is performed on the collection to add the texts, embeddings, and metadata.

**Returns**

* `List[str]` : List of IDs of the added texts

**Raises**

* `ValueError` : Raised when complex metadata causes an error, with a message pointing to a filtering helper. Note that adding with an existing ID performs an `upsert`, replacing the existing document.

```
# Add new data. Existing data with id=1 will be overwritten.
db.add_texts(
    ["This will overwrite the previously added document.", "What is the result of overwriting?"],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["1", "2"],
)
```

```
['1', '2'] 
```

```
# Look up id=1
db.get(["1"])
```

```
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['This will overwrite the previously added document.'], 'uris': None, 'data': None}
```
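The overwrite above is the result of upsert semantics: re-adding an existing ID replaces the stored document rather than creating a duplicate. A dict-based sketch of that behavior (illustrative only, not Chroma's actual storage):

```python
# Dict-based upsert: ids map to documents, so re-adding an id replaces it
store = {}


def upsert(ids, documents):
    for i, doc in zip(ids, documents):
        store[i] = doc
    return ids


upsert(["1"], ["Hello! This time I will add a new document"])
upsert(["1", "2"], ["This will overwrite the previously added document.",
                    "What is the result of overwriting?"])
print(store["1"])  # → This will overwrite the previously added document.
print(len(store))  # → 2
```

This is why the store ends up with two documents rather than three: id "1" was replaced and id "2" was inserted.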

#### Delete documents from vector storage <a href="#id-3" id="id-3"></a>

The `delete` method removes the documents with the specified IDs from the vector store.

**Parameters**

* `ids` (Optional\[List\[str]]): List of IDs of the documents to delete. Defaults to None

**Notes**

* Internally, this method calls the collection's `delete` method.
* If `ids` is None, nothing happens.

**Returns**

* None

```
# Delete the document with id=1
db.delete(ids=["1"])
```

```
# Document lookup
db.get(["1", "2"])
```

```
{'ids': ['2'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['What is the result of overwriting?'], 'uris': None, 'data': None}
```

```
# Query metadata with a where condition
db.get(where={"source": "mydata.txt"})
```

```
{'ids': ['2'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['What is the result of overwriting?'], 'uris': None, 'data': None}
```

#### reset\_collection <a href="#reset_collection" id="reset_collection"></a>

The `reset_collection` method resets the vector store's collection.

```
# Reset the collection
db.reset_collection()
```

```
# Look up documents after the reset
db.get()
```

```
{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None}
```

#### Convert vector storage to Retriever <a href="#retriever" id="retriever"></a>

The `as_retriever` method creates a `VectorStoreRetriever` based on the vector store.

**Parameters**

* `**kwargs` : Keyword arguments to pass to the search function
* `search_type` (Optional\[str]): Search type ( `"similarity"` , `"mmr"` , `"similarity_score_threshold"` )
* `search_kwargs` (Optional\[Dict]): Additional arguments to pass to the search function
  * `k` : Number of documents to return (default: 4)
  * `score_threshold` : Minimum similarity threshold
  * `fetch_k` : Number of documents to pass to the MMR algorithm (default: 20)
  * `lambda_mult` : Controls the diversity of MMR results (0\~1, default: 0.5)
  * `filter` : Filter by document metadata

**Returns**

* `VectorStoreRetriever` : A retriever instance based on the vector store

```
# Create DB 
db = Chroma.from_documents(
    documents=split_doc1 + split_doc2,
    embedding=OpenAIEmbeddings(),
    collection_name="nlp",
)
```

Perform a similarity search; by default, four documents are returned.

```
retriever = db.as_retriever()
retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other. ...'), 
... 
(omitted) 
... 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. ...')]
```

Search for more documents with high diversity

* `k` : Number of documents to return (default: 4)
* `fetch_k` : Number of documents to pass to the MMR algorithm (default: 20)
* `lambda_mult` : Controls the diversity of MMR results (0\~1, default: 0.5)

```
retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 6, "lambda_mult": 0.25, "fetch_k": 10}
)
retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. ...'), 
... 
(omitted) 
... 
```
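MMR (maximal marginal relevance) trades query relevance off against similarity to already-selected results: each step picks the candidate maximizing `lambda_mult * sim(query, d) - (1 - lambda_mult) * max_sim(d, selected)`. A pure-Python sketch with made-up similarity scores (not Chroma's implementation):

```python
# Hypothetical similarity scores for 4 candidate documents
query_sim = [0.9, 0.85, 0.8, 0.3]  # sim(query, doc_i)
doc_sim = [                         # sim(doc_i, doc_j)
    [1.0, 0.95, 0.2, 0.1],
    [0.95, 1.0, 0.25, 0.1],
    [0.2, 0.25, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]


def mmr(query_sim, doc_sim, k, lambda_mult=0.5):
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize candidates similar to anything already selected
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Doc 1 is nearly a duplicate of doc 0, so MMR skips it in favor of doc 2
print(mmr(query_sim, doc_sim, k=2, lambda_mult=0.5))  # → [0, 2]
```

Lowering `lambda_mult` (as in the `0.25` example above) weights the redundancy penalty more heavily, producing more diverse results; `fetch_k` controls how many candidates enter this loop.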

Fetch more documents for the MMR algorithm, but return only the top 2.

```
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10})
retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. ...'), 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: GPT is a generative language model pre-trained on a large dataset, used for a variety of text-based tasks. ...\n\nInstructGPT\n\n Definition: InstructGPT is a GPT model optimized to perform specific tasks according to user instructions. ...')]
```

Search only documents with similarities above a certain threshold

```
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8}
)

retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other. ...')]
```
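The `score_threshold` above keeps only results whose normalized relevance to the query is at least 0.8. A small sketch of cosine similarity and thresholding, using made-up 2-dimensional vectors rather than real embeddings:

```python
import math


def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


query = [1.0, 0.0]
docs = {"word2vec": [0.9, 0.1], "csv": [0.2, 0.9]}

# Keep only documents scoring at or above the threshold
hits = {name: cosine(query, v) for name, v in docs.items()}
kept = [name for name, s in hits.items() if s >= 0.8]
print(kept)  # → ['word2vec']
```

This is why only one document comes back above: every other chunk falls below the 0.8 relevance cutoff. (Chroma's raw distances are rescaled to a 0–1 relevance score before the threshold is applied.)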

Search only the single most similar document

```
retriever = db.as_retriever(search_kwargs={"k": 1})

retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other. ...')]
```

Apply specific metadata filters

```
retriever = db.as_retriever(
    search_kwargs={"filter": {"source": "data/finance-keywords.txt"}, "k": 2}
)
retriever.invoke("Tell me about ESG")
```

```
[Document (metadata={'source':'data/finance-keywords.txt'}, page_content=' Definition: ESG is an investment approach that takes into account the environmental, social and governance aspects of the enterprise.\N Example: The S&P 500 ESG index is an index consisting of companies with excellent ESG performance\nP 500 companies have the largest purchase of their own shares.\n Equestrian keyword: shareholder value, capital management, stock price stimulus\n\nCyclical Stocks\n\n Definition: The circulatory state refers to the shares of companies whose performance varies greatly depending on the economic situation. \N Example: Ford, General Motors Auto companies like are representative recalculators included in the S&P 500. Defensive shares are stocks of companies with stable performance regardless of economic fluctuations.\n Example: Life-must-have companies such as Procter & Bl, Johnson & Johnson are referred to as representative defenses within the S&P 500.\N.Keyword: stable return, low volatility, risk management'), Document (metadata={'source':'data/finance-key  It's an activity that analyzes competitiveness, etc. to help you make investment decisions. \n Example: Goldmanx analysts have announced quarterly earnings prospects for S&P 500 companies. 
\n Associate Keyword: Investment Analysis, Corporate Valuation, Market Outlook\n\nCorporate Governance\n\n Definition: Corporate Governance Means systems and processes for corporate management and control.\n Example: S&P 500 companiesnMergers and Acquisitions (M&A)\n\n Definition: The merger refers to the process by which companies buy or merge with other companies.\n Example: As Microsoft acquired the activity blizzard, the fando of the game industry within the S&P 500 has changed.\Non-guide keyword: Corporate strategy, synergy, corporate value\n\nESG (Environmental, Social and  As Microsoft acquired Activation Blizzard, the game industry in the S&P 500 has changed.\NAssociation Keyword: Corporate Strategy, Synergy, Corporate Value\n\nESG (Environmental, Social, and Governance)'] 
```
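Conceptually, the `filter` argument restricts the candidate set to documents whose metadata matches before similarity ranking is applied. The toy sketch below (plain dictionaries, not Chroma's actual implementation) illustrates that pre-filtering behavior:

```python
# Toy documents with metadata, mimicking the two loaded text files
docs = [
    {"page_content": "ESG is an investment approach...", "metadata": {"source": "data/finance-keywords.txt"}},
    {"page_content": "Word2Vec maps words to vectors...", "metadata": {"source": "data/nlp-keywords.txt"}},
    {"page_content": "Cyclical stocks vary with the economy...", "metadata": {"source": "data/finance-keywords.txt"}},
]


def apply_filter(docs, flt):
    """Keep only documents whose metadata matches every key/value pair in the filter."""
    return [d for d in docs if all(d["metadata"].get(k) == v for k, v in flt.items())]


filtered = apply_filter(docs, {"source": "data/finance-keywords.txt"})
print(len(filtered))  # 2
```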

### Multimodal Search <a href="#id-4" id="id-4"></a>

Chroma supports multimodal collections, which can store and query multiple modalities of data.

### Data set <a href="#id-5" id="id-5"></a>

We use a small subset of the [COCO object detection dataset](https://huggingface.co/datasets/detection-datasets/coco) hosted on Hugging Face.

Only a few of the dataset's images are downloaded locally and used to create the multimodal collection.

```
import os
from datasets import load_dataset
from matplotlib import pyplot as plt

# Load COCO dataset
dataset = load_dataset(
    path="detection-datasets/coco", name="default", split="train", streaming=True
)

# Set the image storage folder and number of images
IMAGE_FOLDER = "tmp"
N_IMAGES = 20

# Settings for plotting graphs
plot_cols = 5
plot_rows = N_IMAGES // plot_cols
fig, axes = plt.subplots(plot_rows, plot_cols, figsize=(plot_rows * 2, plot_cols * 2))
axes = axes.flatten()

# Save images to a folder and display them on a graph
dataset_iter = iter(dataset)
os.makedirs(IMAGE_FOLDER, exist_ok=True)
for i in range(N_IMAGES):
    # Extract images and labels from the dataset
    data = next(dataset_iter)
    image = data["image"]
    label = data["objects"]["category"][0]  # Use the category of the first object as the label

    # Displaying images and adding labels to a graph
    axes[i].imshow(image)
    axes[i].set_title(label, fontsize=8)
    axes[i].axis("off")

    # Save as image file
    image.save(f"{IMAGE_FOLDER}/{i}.jpg")

# Adjusting and displaying the graph layout
plt.tight_layout()
plt.show()
```
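Because the dataset is opened with `streaming=True`, only the records actually consumed from the iterator are downloaded. The same "take the first N" pattern can be expressed with `itertools.islice`; the sketch below uses a stand-in generator instead of the real dataset:

```python
from itertools import islice


# A stand-in generator mimicking the streaming dataset's record shape
def fake_stream():
    i = 0
    while True:
        yield {"image": f"image-{i}", "objects": {"category": [i % 5]}}
        i += 1


N_IMAGES = 20
first_n = list(islice(fake_stream(), N_IMAGES))
print(len(first_n))  # 20
```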

#### Multimodal Embeddings <a href="#multimodal-embeddings" id="multimodal-embeddings"></a>

Use multimodal embeddings to create embeddings for both images and text.

In this tutorial, we use `OpenCLIPEmbeddings` to embed the images.

* [OpenCLIP](https://github.com/mlfoundations/open_clip/tree/main)

#### Model benchmark <a href="#model" id="model"></a>

| Model                                                                                   | Training data | Resolution | # of samples seen | ImageNet zero-shot acc. |
| --------------------------------------------------------------------------------------- | ------------- | ---------- | ----------------- | ----------------------- |
| ConvNext-Base                                                                           | LAION-2B      | 256px      | 13B               | 71.5%                   |
| ConvNext-Large                                                                          | LAION-2B      | 320px      | 29B               | 76.9%                   |
| ConvNext-XXLarge                                                                        | LAION-2B      | 256px      | 34B               | 79.5%                   |
| ViT-B/32                                                                                | DataComp-1B   | 256px      | 34B               | 72.8%                   |
| ViT-B/16                                                                                | DataComp-1B   | 224px      | 13B               | 73.5%                   |
| ViT-L/14                                                                                | LAION-2B      | 224px      | 32B               | 75.3%                   |
| ViT-H/14                                                                                | LAION-2B      | 224px      | 32B               | 78.0%                   |
| ViT-L/14                                                                                | DataComp-1B   | 224px      | 13B               | 79.2%                   |
| ViT-G/14                                                                                | LAION-2B      | 224px      | 34B               | 80.1%                   |
| ViT-L/14 ( [Original CLIP ](https://openai.com/research/clip))                          | WIT           | 224px      | 13B               | 75.5%                   |
| ViT-SO400M/14 ( [SigLIP ](https://github.com/mlfoundations/open_clip))                  | WebLI         | 224px      | 45B               | 82.0%                   |
| ViT-SO400M-14-SigLIP-384 ( [SigLIP ](https://github.com/mlfoundations/open_clip))       | WebLI         | 384px      | 45B               | 83.1%                   |
| ViT-H/14-quickgelu ( [DFN ](https://www.deeplearning.ai/glossary/neural-networks/))     | DFN-5B        | 224px      | 39B               | 83.4%                   |
| ViT-H-14-378-quickgelu ( [DFN ](https://www.deeplearning.ai/glossary/neural-networks/)) | DFN-5B        | 378px      | 44B               | 84.4%                   |

In the example below, set `model_name` and `checkpoint` before use.

* `model_name` : the OpenCLIP model name
* `checkpoint` : the pretrained checkpoint of the OpenCLIP model (named after its training dataset)

```
import open_clip
import pandas as pd

# Output available models/checkpoints
pd.DataFrame(open_clip.list_pretrained(), columns=["model_name", "checkpoint"]).head(10)
```

|   | model\_name     | checkpoint |
| - | --------------- | ---------- |
| 0 | RN50            | openai     |
| 1 | RN50            | yfcc15m    |
| 2 | RN50            | cc12m      |
| 3 | RN50-quickgelu  | openai     |
| 4 | RN50-quickgelu  | yfcc15m    |
| 5 | RN50-quickgelu  | cc12m      |
| 6 | RN101           | openai     |
| 7 | RN101           | yfcc15m    |
| 8 | RN101-quickgelu | openai     |
| 9 | RN101-quickgelu | yfcc15m    |

```
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Creating an OpenCLIP embedding function object
image_embedding_function = OpenCLIPEmbeddings(
    model_name="ViT-H-14-378-quickgelu", checkpoint="dfn5b"
)
```

Save the image paths as a list.

```
# Save image paths as a list
image_uris = sorted(
    [
        os.path.join("tmp", image_name)
        for image_name in os.listdir("tmp")
        if image_name.endswith(".jpg")
    ]
)

image_uris
```

```
 ['tmp/0.jpg','tmp/1.jpg','tmp/10.jpg','tmp/11.jpg','tmp/12.jpg','tmp/13.jpg', 'tmp/14.jpg','tmp/ 
```
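Note that `sorted()` orders the paths lexicographically, which is why `tmp/10.jpg` appears before `tmp/2.jpg` in the output above. If numeric order matters, sort by the integer stem instead; a sketch over a hypothetical list of file names:

```python
# Hypothetical file names as produced by the download loop above
image_names = ["0.jpg", "1.jpg", "10.jpg", "11.jpg", "2.jpg"]

# Sort by the integer value of the file stem rather than lexicographically
image_uris = sorted(
    (f"tmp/{name}" for name in image_names),
    key=lambda p: int(p.rsplit("/", 1)[1].split(".")[0]),
)

print(image_uris)
# ['tmp/0.jpg', 'tmp/1.jpg', 'tmp/2.jpg', 'tmp/10.jpg', 'tmp/11.jpg']
```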

```
from langchain_teddynote.models import MultiModal
from langchain_openai import ChatOpenAI

# Initializing the ChatOpenAI model
llm = ChatOpenAI(model="gpt-4o-mini")

# MultiModal Model Setup
model = MultiModal(
    model=llm,
    system_prompt="Your mission is to describe the image in detail",  # System prompt: instruct the model to describe the image in detail
    user_prompt="Description should be written in one sentence(less than 60 characters)",  # User prompt: request a one-sentence description under 60 characters
)
```

Generate a description for the first image.

```
# Generate image descriptions
model.invoke(image_uris[0])
```

```
# Image Description
descriptions = dict()

for image_uri in image_uris:
    descriptions[image_uri] = model.invoke(image_uri, display_image=False)

# Output the generated results
descriptions
```

```
'A colorful lunchbox with various healthy snacks and months.','tmp/1.jpg':'Two giraffes near a tree, one reaching for leaves.','tmp/10.jpg':'Two giraffesA skater performs tricks on a graffiti-covered ramp.','tmp/14.jpg':'An owl-shaped candle beside an ornate clock.','tmp/15.jpg':'An Air France Airbus A380 flying through cloudy skies'  and tiled backsplash.','tmp/18.jpg':'A layered chocolate cake slice on a white plate.','tmp/19.jpg':'Deserted street scene with shops and a hotel sign.','tmp/2.jpg':'A white vase filleA curly-haired dog is sleeping on a pile of shoes.','tmp/6.jpg':'Two horses rearing up on a grassy field with riders.','tmp/7.jpg':'Two elephants carry riders through a dense jungle'  A train curves along tracks near a cityscape backdrop.'} 
```

```
import os
from PIL import Image
import matplotlib.pyplot as plt

# Initialize lists to store the original images, processed images, and text descriptions
original_images = []
images = []
texts = []

# Set figure size (20x10 inches)
plt.figure(figsize=(20, 10))

# Process image files stored in the 'tmp' directory
for i, image_uri in enumerate(image_uris):
    # Open the image file and convert to RGB mode
    image = Image.open(image_uri).convert("RGB")

    # Create subplots in a 4x5 grid
    plt.subplot(4, 5, i + 1)

    # Display the image
    plt.imshow(image)

    # Set the image file name and description as the title
    plt.title(f"{os.path.basename(image_uri)}\n{descriptions[image_uri]}", fontsize=8)

    # Remove tick marks from the x and y axes
    plt.xticks([])
    plt.yticks([])

    # Append the original image, processed image, and text description to each list
    original_images.append(image)
    images.append(image)
    texts.append(descriptions[image_uri])

# Adjust spacing between subplots
plt.tight_layout()
```

Below, we calculate the similarity between the image embeddings and the text descriptions generated above.

```
import numpy as np

# Image and text embedding
# Extract image features using image URI
img_features = image_embedding_function.embed_image(image_uris)
# Adding a “This is” prefix to text descriptions and extracting text features
text_features = image_embedding_function.embed_documents(
    ["This is " + desc for desc in texts]
)

# Convert list to numpy array for matrix operations
img_features_np = np.array(img_features)
text_features_np = np.array(text_features)

# Similarity calculation
# Compute cosine similarity between text and image features
similarity = np.matmul(text_features_np, img_features_np.T)
```
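The matrix product above yields true cosine similarities only when the embedding vectors are unit-normalized (which OpenCLIP embeddings typically are). A minimal NumPy sketch with made-up vectors makes the normalization step explicit:

```python
import numpy as np

# Made-up embedding matrices: 2 text vectors and 3 image vectors of dimension 4
text_features_np = np.array([[1.0, 0.0, 2.0, 0.0],
                             [0.0, 3.0, 0.0, 1.0]])
img_features_np = np.array([[2.0, 0.0, 4.0, 0.0],
                            [0.0, 1.0, 0.0, 0.0],
                            [1.0, 1.0, 1.0, 1.0]])

# Normalize each row to unit length so the dot product equals cosine similarity
text_norm = text_features_np / np.linalg.norm(text_features_np, axis=1, keepdims=True)
img_norm = img_features_np / np.linalg.norm(img_features_np, axis=1, keepdims=True)

similarity = text_norm @ img_norm.T  # shape: (n_texts, n_images)
print(similarity.shape)  # (2, 3)
```

The first text vector is parallel to the first image vector, so their cosine similarity is exactly 1.0.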

Compute and visualize the similarity between each text description and each image.

```
# Create a plot to visualize the similarity matrix
count = len(descriptions)
plt.figure(figsize=(20, 14))

# Displaying the similarity matrix as a heatmap
plt.imshow(similarity, vmin=0.1, vmax=0.3, cmap="coolwarm")
plt.colorbar()  # Add color bar

# Show text description on y-axis
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])  # Remove x-axis tick marks

# Display original image below x-axis
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")

# Display similarity values as text on top of the heatmap
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[y, x]:.2f}", ha="center", va="center", size=12)

# Remove plot border
for side in ["left", "top", "right", "bottom"]:
    plt.gca().spines[side].set_visible(False)

# Set the plot range
plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

# Add a title
plt.title("Cosine similarity between text and image features", size=20)
```

## Vectorstore creation and image addition <a href="#vectorstore_1" id="vectorstore_1"></a>

Create the vector store and add the images.

```
# Create a DB 
image_db = Chroma(
    collection_name="multimodal",
    embedding_function=image_embedding_function,
)

# Add image
image_db.add_images(uris=image_uris)
```

```
['cdd41dc4-e890-4de8-9035-9b5cd6603405','bb49e9af-54ac-4ed2-94a4-a5212b20d898', '204ecff3-c94e-464b-ab896aaf13e1-cccd-463d-bdab-811498959c2a','d882bf32-2983-4bd5-b7d9-be113f76cbe3', 'd5f4c365b-9579618a830-2207-4878-b127-d87c857fcb1d', '4998ff09-c6dc-4651-ab06-0f51ebf56c91','a6688148-ff30-4d25-85a4-0  480d7a24-f217-4455-9e85-9f83e1a32aee', '83016e20-e971-46aa-8e98-60db810434bc', '2785ce6d-70c5-4b18-b5e2- 
```

Below is a helper class that displays retrieved results as images.

```
import base64
import io
from PIL import Image
from IPython.display import HTML, display
from langchain.schema import Document


class ImageRetriever:
    def __init__(self, retriever):
        """
        Initializes the image retriever.

        Args:
        retriever: LangChain retriever object
        """
        self.retriever = retriever

    def invoke(self, query):
        """
        Searches for and displays images matching the query.

        Args:
        query (str): search query
        """
        docs = self.retriever.invoke(query)
        if docs and isinstance(docs[0], Document):
            self.plt_img_base64(docs[0].page_content)
        else:
            print("No images found.")
        return docs

    @staticmethod
    def resize_base64_image(base64_string, size=(224, 224)):
        """
        Resizes an image encoded as a Base64 string.

        Args:
        base64_string (str): Base64 string of the original image
        size (tuple): desired image size as (width, height)

        Returns:
        str: Base64 string of the resized image
        """
        img_data = base64.b64decode(base64_string)
        img = Image.open(io.BytesIO(img_data))
        resized_img = img.resize(size, Image.LANCZOS)
        buffered = io.BytesIO()
        resized_img.save(buffered, format=img.format)
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

    @staticmethod
    def plt_img_base64(img_base64):
        """
        Displays an image encoded in Base64.

        Args:
        img_base64 (str): Base64-encoded image string
        """
        image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
        display(HTML(image_html))
```
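`plt_img_base64` works by embedding the Base64 string in an HTML `data:` URI. The stdlib-only sketch below (with dummy bytes standing in for real JPEG data) shows that round trip:

```python
import base64

# Dummy bytes standing in for real JPEG data
img_bytes = b"\xff\xd8\xff\xe0fake-jpeg-payload"

# Encode to Base64 and build the HTML <img> tag used by plt_img_base64
img_base64 = base64.b64encode(img_bytes).decode("utf-8")
image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'

# Decoding the Base64 string recovers the original bytes exactly
assert base64.b64decode(img_base64) == img_bytes
print(image_html[:33])  # <img src="data:image/jpeg;base64,
```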

```
# Image Retriever generation
retriever = image_db.as_retriever(search_kwargs={"k": 3})
image_retriever = ImageRetriever(retriever)
```

```
# Image lookup
result = image_retriever.invoke("A Dog on the street")
```

```
# Image lookup
result = image_retriever.invoke("Motorcycle with a man")
```
