# 01. Chroma

## Chroma <a href="#chroma" id="chroma"></a>

This notebook covers how to get started with the Chroma vector store.

Chroma is an AI-native open-source vector database focused on developer productivity and happiness. It is licensed under Apache 2.0.

**Reference links**

* [Chroma LangChain document](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)
* [Chroma Official Document](https://docs.trychroma.com/getting-started)
* [List of vector stores supported by LangChain](https://python.langchain.com/v0.2/docs/integrations/vectorstores/)

```
# Configuration file that manages API keys as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()
```

```
True
```

```
# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH10-VectorStores")
```

```
LangSmith tracing started.
[Project name]
CH10-VectorStores
```

Load the sample dataset.

```
from langchain_community.document_loaders import TextLoader
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma


# Text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=0)

# Load the text files -> convert to List[Document]
loader1 = TextLoader("data/nlp-keywords.txt")
loader2 = TextLoader("data/finance-keywords.txt")

# Split the documents
split_doc1 = loader1.load_and_split(text_splitter)
split_doc2 = loader2.load_and_split(text_splitter)

# Check the number of documents
len(split_doc1), len(split_doc2)
```

```
 (11, 6) 
```
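The splitter above produces chunks of at most 600 characters with no overlap. As a rough intuition, fixed-size chunking can be sketched in plain Python. Note this is a simplification: the real `RecursiveCharacterTextSplitter` first splits on separators such as `"\n\n"` and only falls back to character counts.

```python
# Naive fixed-size chunking: a simplified stand-in for
# RecursiveCharacterTextSplitter's character-count fallback.
def naive_split(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks


sample = "word " * 300  # 1500 characters
chunks = naive_split(sample, chunk_size=600)
print(len(chunks), [len(c) for c in chunks])  # → 3 [600, 600, 300]
```

With `chunk_overlap=0` every character belongs to exactly one chunk, which is why the two files above yield 11 and 6 non-overlapping documents.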

### VectorStore creation <a href="#vectorstore" id="vectorstore"></a>

#### Vector repository creation (from\_documents) <a href="#from_documents" id="from_documents"></a>

The `from_documents` class method creates a vector store from a list of documents.

**Parameters**

* `documents` (List\[Document]): List of documents to add to the vector store
* `embedding` (Optional\[Embeddings]): Embedding function. Defaults to None
* `ids` (Optional\[List\[str]]): List of document IDs. Defaults to None
* `collection_name` (str): Name of the collection to create
* `persist_directory` (Optional\[str]): Directory in which to persist the collection. Defaults to None
* `client_settings` (Optional\[chromadb.config.Settings]): Chroma client settings
* `client` (Optional\[chromadb.Client]): Chroma client instance
* `collection_metadata` (Optional\[Dict]): Collection configuration information. Defaults to None

**Notes**

* If `persist_directory` is specified, the collection is persisted in that directory. If not, the data is held temporarily in memory.
* Internally, this method calls `from_texts` to create the vector store.
* Each document's `page_content` is used as the text and its `metadata` as the metadata.

**Returns**

* `Chroma` : The created Chroma vector store instance. When creating it, pass the list of `Document` objects as the `documents` parameter, specify the embedding model to use, and optionally set `collection_name`, which plays the role of a namespace.

```
# Create the DB
db = Chroma.from_documents(
    documents=split_doc1, embedding=OpenAIEmbeddings(), collection_name="my_db"
)
```

When `persist_directory` is specified, the data is saved to disk as files.

```
# Specify the path to save to
DB_PATH = "./chroma_db"

# Save the documents to disk; pass the save path as persist_directory.
persist_db = Chroma.from_documents(
    split_doc1, OpenAIEmbeddings(), persist_directory=DB_PATH, collection_name="my_db"
)
```

Running the code below loads the data stored in `DB_PATH`.

```
# Loads a document from disk.
persist_db = Chroma(
    persist_directory=DB_PATH,
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_db",
)
```

Inspect the data stored in the loaded vector store.

```
# Check saved data
persist_db.get()
```

```
{'ids': ['0e99026d-a1a9-410a-9eb8-8486b6f0194a', ...], 'embeddings': None, 'metadatas': [{'source': 'data/nlp-keywords.txt'}, ...], 'documents': ['Definition: Open source means software whose source code is released and can be freely used, modified, and distributed by anyone. This plays an important role in promoting collaboration and innovation.\n Example: The Linux operating system is a representative open source project. ...', 
... 
(omitted) 
... 
'... Definition: Data mining is the process of discovering useful information from large amounts of data. ...'], 'uris': None, 'data': None}
```

If you specify a different `collection_name`, no results are returned because no data is stored under that name.

```
# Loads a document from disk.
persist_db2 = Chroma(
    persist_directory=DB_PATH,
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_db2",
)

# Check saved data
persist_db2.get()
```

```
{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None}
```

#### Vector repository creation (from\_texts) <a href="#from_texts" id="from_texts"></a>

The `from_texts` class method creates a vector store from a list of texts.

**Parameters**

* `texts` (List\[str]): List of texts to add to the collection
* `embedding` (Optional\[Embeddings]): Embedding function. Defaults to None
* `metadatas` (Optional\[List\[dict]]): List of metadata. Defaults to None
* `ids` (Optional\[List\[str]]): List of document IDs. Defaults to None
* `collection_name` (str): Name of the collection to create. Defaults to `_LANGCHAIN_DEFAULT_COLLECTION_NAME`
* `persist_directory` (Optional\[str]): Directory in which to persist the collection. Defaults to None
* `client_settings` (Optional\[chromadb.config.Settings]): Chroma client settings
* `client` (Optional\[chromadb.Client]): Chroma client instance
* `collection_metadata` (Optional\[Dict]): Collection configuration information. Defaults to None

**Notes**

* If `persist_directory` is specified, the collection is persisted in that directory. If not, the data is held temporarily in memory.
* If `ids` are not provided, they are generated automatically using UUIDs.

**Returns**

* The created vector store instance

```
# Create a vector store from a list of strings
db2 = Chroma.from_texts(
    ["Hello, it's really nice to meet you.", "My name is Teddy."],
    embedding=OpenAIEmbeddings(),
)
```

```
# Query the data.
db2.get()
```

```
{'ids': ['40a857ba-16ab-4dbb-b518-f88a34ba383c', '5927395f-6a75-49a5-861f-a946ccb72c0c'], 'embeddings': None, 'metadatas': [None, None], 'documents': ["Hello, it's really nice to meet you.", 'My name is Teddy.'], 'uris': None, 'data': None}
```
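When `ids` are not provided, Chroma generates them automatically using UUIDs, which is where IDs like the ones in the output above come from. A quick illustration with the standard-library `uuid` module (this shows the shape of the IDs, not Chroma's internals):

```python
import uuid

# One UUID4 string per text, similar in shape to the auto-generated ids above
texts = ["Hello, it's really nice to meet you.", "My name is Teddy."]
ids = [str(uuid.uuid4()) for _ in texts]
print(ids)  # e.g. ['40a857ba-16ab-...', '5927395f-6a75-...']
```

Passing explicit `ids` instead makes documents addressable later (for lookup, upsert, or delete), as shown in the sections below.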

#### Similarity search <a href="#id-1" id="id-1"></a>

The `similarity_search` method performs a similarity search against the Chroma database and returns the documents most similar to the given query.

**Parameters**

* `query` (str): Query text to search
* `k` (int, optional): Number of results to return. The default is 4.
* `filter` (Dict\[str, str], optional): Filter by metadata. The default is None.

**Notes**

* Adjust the `k` value to get the desired number of results.
* Use the `filter` parameter to search only documents that meet certain metadata conditions.
* This method returns documents only, without score information. If you also need scores, use the `similarity_search_with_score` method.

**Returns**

* `List[Document]` : List of documents most similar to query text

```
db.similarity_search("Tell me about TF IDF")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. It takes into account the frequency of the word in a document and the rarity of that word across the entire set of documents.\n Example: A word that rarely appears across many documents has a high TF-IDF value. ...'), 
... 
(omitted) 
... 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: CSV (Comma-Separated Values) is a file format for storing data in which each value is separated by a comma. It is used to simply store and exchange tabular data. ...')]
```

You can specify the number of search results with the `k` value.

```
db.similarity_search("Tell me about TF IDF", k=2)
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. ...'), 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Open source means software whose source code is released and can be freely used, modified, and distributed by anyone. ...')]
```

With `filter`, you can use `metadata` information to filter the search results.

```
# use filter
db.similarity_search(
    "Tell me about TF IDF", filter={"source": "data/nlp-keywords.txt"}, k=2
)
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. ...'), 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Open source means software whose source code is released and can be freely used, modified, and distributed by anyone. ...')]
```

Next, use a different `source` in the `filter` and check the search results.

```
# use filter 
db.similarity_search(
    "Tell me about TF IDF", filter={"source": "data/finance-keywords.txt"}, k=2
)
```

```
 [] 
```
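The empty result above is exactly what metadata filtering should produce when nothing matches. The logic can be mimicked in plain Python: keep only documents whose metadata matches every key/value pair in the filter. A minimal sketch with hypothetical in-memory documents (not Chroma's implementation):

```python
# Hypothetical in-memory documents: (page_content, metadata) pairs
docs = [
    ("Definition: TF-IDF is a statistical measure ...", {"source": "data/nlp-keywords.txt"}),
    ("Definition: ESG is an investment approach ...", {"source": "data/finance-keywords.txt"}),
]


def apply_filter(docs, flt):
    # A document passes when every key/value in the filter matches its metadata
    return [d for d in docs if all(d[1].get(k) == v for k, v in flt.items())]


print(len(apply_filter(docs, {"source": "data/nlp-keywords.txt"})))  # → 1
print(len(apply_filter(docs, {"source": "data/no-such-file.txt"})))  # → 0
```

In Chroma the filter is applied before similarity ranking, so a non-matching `source` yields `[]` regardless of `k`.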

#### Add documents to vector storage <a href="#id-2" id="id-2"></a>

The `add_documents` method adds documents to, or updates documents in, the vector store.

**Parameters**

* `documents` (List\[Document]): List of documents to add to the vector store
* `**kwargs` : Additional keyword arguments
  * `ids` : List of document IDs (takes priority over the IDs carried by the documents themselves)

**Notes**

* The `add_texts` method must be implemented, since this method relies on it internally.
* Each document's `page_content` is used as the text and its `metadata` as the metadata.
* If a document carries an ID and no ID is provided via `kwargs`, the document's ID is used.
* A ValueError is raised if the number of IDs in `kwargs` does not match the number of documents.

**Returns**

* `List[str]` : List of IDs of the added texts

**Raises**

* `NotImplementedError` : Raised when the `add_texts` method is not implemented

```python
from langchain_core.documents import Document

# Specify page_content, metadata, and id
db.add_documents(
    [
        Document(
            page_content="Hello! This time I will add a new document",
            metadata={"source": "mydata.txt"},
            id="1",
        )
    ]
)
```

```
['1'] 
```

```
# Search document with id=1
db.get("1")
```

```
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['Hello! This time I will add a new document'], 'uris': None, 'data': None}
```

The `add_texts` method embeds texts and adds them to the vector store.

**Parameters**

* `texts` (Iterable\[str]): List of texts to add to the vector store
* `metadatas` (Optional\[List\[dict]]): List of metadata. Defaults to None
* `ids` (Optional\[List\[str]]): List of document IDs. Defaults to None

**Notes**

* If `ids` are not provided, they are generated automatically using UUIDs.
* If an embedding function is set, the texts are embedded.
* If metadata is provided:
  * Texts with and without metadata are separated and processed differently.
  * Texts without metadata are filled in with an empty dictionary.
* An upsert is performed on the collection to add the texts, embeddings, and metadata.

**Returns**

* `List[str]` : List of IDs of the added texts

**Raises**

* `ValueError` : Raised when complex metadata causes an error, with a message pointing to a filtering helper. Note that adding with an existing ID performs an `upsert`, replacing the existing document.

```
# Add new data. Existing data with id=1 will be overwritten.
db.add_texts(
    ["This will overwrite the previously added document.", "What is the result of overwriting?"],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["1", "2"],
)
```

```
['1', '2'] 
```

```
# Look up id=1
db.get(["1"])
```

```
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['This will overwrite the previously added document.'], 'uris': None, 'data': None}
```
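The overwrite above is the result of upsert semantics: re-adding an existing ID replaces the stored document rather than creating a duplicate. A dict-based sketch of that behavior (illustrative only, not Chroma's actual storage):

```python
# Dict-based upsert: ids map to documents, so re-adding an id replaces it
store = {}


def upsert(ids, documents):
    for i, doc in zip(ids, documents):
        store[i] = doc
    return ids


upsert(["1"], ["Hello! This time I will add a new document"])
upsert(["1", "2"], ["This will overwrite the previously added document.",
                    "What is the result of overwriting?"])
print(store["1"])  # → This will overwrite the previously added document.
print(len(store))  # → 2
```

This is why the store ends up with two documents rather than three: id "1" was replaced and id "2" was inserted.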

#### Delete documents from vector storage <a href="#id-3" id="id-3"></a>

The `delete` method removes the documents with the specified IDs from the vector store.

**Parameters**

* `ids` (Optional\[List\[str]]): List of IDs of the documents to delete. Defaults to None

**Notes**

* Internally, this method calls the collection's `delete` method.
* If `ids` is None, nothing happens.

**Returns**

* None

```
# Delete the document with id=1
db.delete(ids=["1"])
```

```
# Document lookup
db.get(["1", "2"])
```

```
{'ids': ['2'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['What is the result of overwriting?'], 'uris': None, 'data': None}
```

```
# Query metadata with a where condition
db.get(where={"source": "mydata.txt"})
```

```
{'ids': ['2'], 'embeddings': None, 'metadatas': [{'source': 'mydata.txt'}], 'documents': ['What is the result of overwriting?'], 'uris': None, 'data': None}
```

#### reset\_collection <a href="#reset_collection" id="reset_collection"></a>

The `reset_collection` method resets the vector store's collection.

```
# Reset the collection
db.reset_collection()
```

```
# Look up documents after the reset
db.get()
```

```
{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None}
```

#### Convert vector storage to Retriever <a href="#retriever" id="retriever"></a>

The `as_retriever` method creates a `VectorStoreRetriever` based on the vector store.

**Parameters**

* `**kwargs` : Keyword arguments to pass to the search function
* `search_type` (Optional\[str]): Search type ( `"similarity"` , `"mmr"` , `"similarity_score_threshold"` )
* `search_kwargs` (Optional\[Dict]): Additional arguments to pass to the search function
  * `k` : Number of documents to return (default: 4)
  * `score_threshold` : Minimum similarity threshold
  * `fetch_k` : Number of documents to pass to the MMR algorithm (default: 20)
  * `lambda_mult` : Controls the diversity of MMR results (0\~1, default: 0.5)
  * `filter` : Filter by document metadata

**Returns**

* `VectorStoreRetriever` : A retriever instance based on the vector store

```
# Create DB 
db = Chroma.from_documents(
    documents=split_doc1 + split_doc2,
    embedding=OpenAIEmbeddings(),
    collection_name="nlp",
)
```

Perform a similarity search; by default, four documents are returned.

```
retriever = db.as_retriever()
retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other. ...'), 
... 
(omitted) 
... 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document. ...')]
```

Search for more documents with high diversity

* `k` : Number of documents to return (default: 4)
* `fetch_k` : Number of documents to pass to the MMR algorithm (default: 20)
* `lambda_mult` : Controls the diversity of MMR results (0\~1, default: 0.5)

```
retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 6, "lambda_mult": 0.25, "fetch_k": 10}
)
retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. ...'), 
... 
(omitted) 
... 
```
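MMR (maximal marginal relevance) trades query relevance off against similarity to already-selected results: each step picks the candidate maximizing `lambda_mult * sim(query, d) - (1 - lambda_mult) * max_sim(d, selected)`. A pure-Python sketch with made-up similarity scores (not Chroma's implementation):

```python
# Hypothetical similarity scores for 4 candidate documents
query_sim = [0.9, 0.85, 0.8, 0.3]  # sim(query, doc_i)
doc_sim = [                         # sim(doc_i, doc_j)
    [1.0, 0.95, 0.2, 0.1],
    [0.95, 1.0, 0.25, 0.1],
    [0.2, 0.25, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]


def mmr(query_sim, doc_sim, k, lambda_mult=0.5):
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize candidates similar to anything already selected
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Doc 1 is nearly a duplicate of doc 0, so MMR skips it in favor of doc 2
print(mmr(query_sim, doc_sim, k=2, lambda_mult=0.5))  # → [0, 2]
```

Lowering `lambda_mult` (as in the `0.25` example above) weights the redundancy penalty more heavily, producing more diverse results; `fetch_k` controls how many candidates enter this loop.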

Fetch more documents for the MMR algorithm, but return only the top 2.

```
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10})
retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. ...'), 
 Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: GPT is a generative language model pre-trained on a large dataset, used for a variety of text-based tasks. ...\n\nInstructGPT\n\n Definition: InstructGPT is a GPT model optimized to perform specific tasks according to user instructions. ...')]
```

Search only documents with similarities above a certain threshold

```
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8}
)

retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other. ...')]
```
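The `score_threshold` above keeps only results whose normalized relevance to the query is at least 0.8. A small sketch of cosine similarity and thresholding, using made-up 2-dimensional vectors rather than real embeddings:

```python
import math


def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


query = [1.0, 0.0]
docs = {"word2vec": [0.9, 0.1], "csv": [0.2, 0.9]}

# Keep only documents scoring at or above the threshold
hits = {name: cosine(query, v) for name, v in docs.items()}
kept = [name for name, s in hits.items() if s >= 0.8]
print(kept)  # → ['word2vec']
```

This is why only one document comes back above: every other chunk falls below the 0.8 relevance cutoff. (Chroma's raw distances are rescaled to a 0–1 relevance score before the threshold is applied.)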

Search only the single most similar document

```
retriever = db.as_retriever(search_kwargs={"k": 1})

retriever.invoke("Tell me about Word2Vec")
```

```
[Document(metadata={'source': 'data/nlp-keywords.txt'}, page_content='Definition: Word2Vec is a natural language processing technique that maps words into a vector space to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other. ...')]
```

Apply specific metadata filters

```
retriever = db.as_retriever(
    search_kwargs={"filter": {"source": "data/finance-keywords.txt"}, "k": 2}
)
retriever.invoke("Tell me about ESG")
```

```
[Document (metadata={'source':'data/finance-keywords.txt'}, page_content=' Definition: ESG is an investment approach that takes into account the environmental, social and governance aspects of the enterprise.\N Example: The S&P 500 ESG index is an index consisting of companies with excellent ESG performance\nP 500 companies have the largest purchase of their own shares.\n Equestrian keyword: shareholder value, capital management, stock price stimulus\n\nCyclical Stocks\n\n Definition: The circulatory state refers to the shares of companies whose performance varies greatly depending on the economic situation. \N Example: Ford, General Motors Auto companies like are representative recalculators included in the S&P 500. Defensive shares are stocks of companies with stable performance regardless of economic fluctuations.\n Example: Life-must-have companies such as Procter & Bl, Johnson & Johnson are referred to as representative defenses within the S&P 500.\N.Keyword: stable return, low volatility, risk management'), Document (metadata={'source':'data/finance-key  It's an activity that analyzes competitiveness, etc. to help you make investment decisions. \n Example: Goldmanx analysts have announced quarterly earnings prospects for S&P 500 companies. 
\n Associate Keyword: Investment Analysis, Corporate Valuation, Market Outlook\n\nCorporate Governance\n\n Definition: Corporate Governance Means systems and processes for corporate management and control.\n Example: S&P 500 companiesnMergers and Acquisitions (M&A)\n\n Definition: The merger refers to the process by which companies buy or merge with other companies.\n Example: As Microsoft acquired the activity blizzard, the fando of the game industry within the S&P 500 has changed.\Non-guide keyword: Corporate strategy, synergy, corporate value\n\nESG (Environmental, Social and  As Microsoft acquired Activation Blizzard, the game industry in the S&P 500 has changed.\NAssociation Keyword: Corporate Strategy, Synergy, Corporate Value\n\nESG (Environmental, Social, and Governance)'] 
```
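Conceptually, the `filter` argument restricts the candidate set to documents whose metadata matches before similarity ranking is applied. The toy sketch below (plain dictionaries, not Chroma's actual implementation) illustrates that pre-filtering behavior:

```python
# Toy documents with metadata, mimicking the two loaded text files
docs = [
    {"page_content": "ESG is an investment approach...", "metadata": {"source": "data/finance-keywords.txt"}},
    {"page_content": "Word2Vec maps words to vectors...", "metadata": {"source": "data/nlp-keywords.txt"}},
    {"page_content": "Cyclical stocks vary with the economy...", "metadata": {"source": "data/finance-keywords.txt"}},
]


def apply_filter(docs, flt):
    """Keep only documents whose metadata matches every key/value pair in the filter."""
    return [d for d in docs if all(d["metadata"].get(k) == v for k, v in flt.items())]


filtered = apply_filter(docs, {"source": "data/finance-keywords.txt"})
print(len(filtered))  # 2
```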

### Multimodal Search <a href="#id-4" id="id-4"></a>

Chroma supports multimodal collections, which can store and query multiple modalities of data.

### Data set <a href="#id-5" id="id-5"></a>

We use a small subset of the [COCO object detection dataset](https://huggingface.co/datasets/detection-datasets/coco) hosted on Hugging Face.

Only a few of the dataset's images are downloaded locally and used to create the multimodal collection.

```
import os
from datasets import load_dataset
from matplotlib import pyplot as plt

# Load COCO dataset
dataset = load_dataset(
    path="detection-datasets/coco", name="default", split="train", streaming=True
)

# Set the image storage folder and number of images
IMAGE_FOLDER = "tmp"
N_IMAGES = 20

# Settings for plotting graphs
plot_cols = 5
plot_rows = N_IMAGES // plot_cols
fig, axes = plt.subplots(plot_rows, plot_cols, figsize=(plot_rows * 2, plot_cols * 2))
axes = axes.flatten()

# Save images to a folder and display them on a graph
dataset_iter = iter(dataset)
os.makedirs(IMAGE_FOLDER, exist_ok=True)
for i in range(N_IMAGES):
    # Extract images and labels from the dataset
    data = next(dataset_iter)
    image = data["image"]
    label = data["objects"]["category"][0]  # Use the category of the first object as the label

    # Displaying images and adding labels to a graph
    axes[i].imshow(image)
    axes[i].set_title(label, fontsize=8)
    axes[i].axis("off")

    # Save as image file
    image.save(f"{IMAGE_FOLDER}/{i}.jpg")

# Adjusting and displaying the graph layout
plt.tight_layout()
plt.show()
```
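Because the dataset is opened with `streaming=True`, only the records actually consumed from the iterator are downloaded. The same "take the first N" pattern can be expressed with `itertools.islice`; the sketch below uses a stand-in generator instead of the real dataset:

```python
from itertools import islice


# A stand-in generator mimicking the streaming dataset's record shape
def fake_stream():
    i = 0
    while True:
        yield {"image": f"image-{i}", "objects": {"category": [i % 5]}}
        i += 1


N_IMAGES = 20
first_n = list(islice(fake_stream(), N_IMAGES))
print(len(first_n))  # 20
```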

#### Multimodal Embeddings <a href="#multimodal-embeddings" id="multimodal-embeddings"></a>

Use multimodal embeddings to create embeddings for both images and text.

In this tutorial, we use `OpenCLIPEmbeddings` to embed the images.

* [OpenCLIP](https://github.com/mlfoundations/open_clip/tree/main)

#### Model benchmark <a href="#model" id="model"></a>

| Model                                                                                   | Training data | Resolution | # of samples seen | ImageNet zero-shot acc. |
| --------------------------------------------------------------------------------------- | ------------- | ---------- | ----------------- | ----------------------- |
| ConvNext-Base                                                                           | LAION-2B      | 256px      | 13B               | 71.5%                   |
| ConvNext-Large                                                                          | LAION-2B      | 320px      | 29B               | 76.9%                   |
| ConvNext-XXLarge                                                                        | LAION-2B      | 256px      | 34B               | 79.5%                   |
| ViT-B/32                                                                                | DataComp-1B   | 256px      | 34B               | 72.8%                   |
| ViT-B/16                                                                                | DataComp-1B   | 224px      | 13B               | 73.5%                   |
| ViT-L/14                                                                                | LAION-2B      | 224px      | 32B               | 75.3%                   |
| ViT-H/14                                                                                | LAION-2B      | 224px      | 32B               | 78.0%                   |
| ViT-L/14                                                                                | DataComp-1B   | 224px      | 13B               | 79.2%                   |
| ViT-G/14                                                                                | LAION-2B      | 224px      | 34B               | 80.1%                   |
| ViT-L/14 ( [Original CLIP ](https://openai.com/research/clip))                          | WIT           | 224px      | 13B               | 75.5%                   |
| ViT-SO400M/14 ( [SigLIP ](https://github.com/mlfoundations/open_clip))                  | WebLI         | 224px      | 45B               | 82.0%                   |
| ViT-SO400M-14-SigLIP-384 ( [SigLIP ](https://github.com/mlfoundations/open_clip))       | WebLI         | 384px      | 45B               | 83.1%                   |
| ViT-H/14-quickgelu ( [DFN ](https://www.deeplearning.ai/glossary/neural-networks/))     | DFN-5B        | 224px      | 39B               | 83.4%                   |
| ViT-H-14-378-quickgelu ( [DFN ](https://www.deeplearning.ai/glossary/neural-networks/)) | DFN-5B        | 378px      | 44B               | 84.4%                   |

In the example below, set `model_name` and `checkpoint` before use.

* `model_name` : the OpenCLIP model name
* `checkpoint` : the pretrained checkpoint of the OpenCLIP model (named after its training dataset)

```
import open_clip
import pandas as pd

# Output available models/checkpoints
pd.DataFrame(open_clip.list_pretrained(), columns=["model_name", "checkpoint"]).head(10)
```

|   | model\_name     | checkpoint |
| - | --------------- | ---------- |
| 0 | RN50            | openai     |
| 1 | RN50            | yfcc15m    |
| 2 | RN50            | cc12m      |
| 3 | RN50-quickgelu  | openai     |
| 4 | RN50-quickgelu  | yfcc15m    |
| 5 | RN50-quickgelu  | cc12m      |
| 6 | RN101           | openai     |
| 7 | RN101           | yfcc15m    |
| 8 | RN101-quickgelu | openai     |
| 9 | RN101-quickgelu | yfcc15m    |

```
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Creating an OpenCLIP embedding function object
image_embedding_function = OpenCLIPEmbeddings(
    model_name="ViT-H-14-378-quickgelu", checkpoint="dfn5b"
)
```

Save the image paths as a list.

```
# Save image paths as a list
image_uris = sorted(
    [
        os.path.join("tmp", image_name)
        for image_name in os.listdir("tmp")
        if image_name.endswith(".jpg")
    ]
)

image_uris
```

```
 ['tmp/0.jpg','tmp/1.jpg','tmp/10.jpg','tmp/11.jpg','tmp/12.jpg','tmp/13.jpg', 'tmp/14.jpg','tmp/ 
```
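Note that `sorted()` orders the paths lexicographically, which is why `tmp/10.jpg` appears before `tmp/2.jpg` in the output above. If numeric order matters, sort by the integer stem instead; a sketch over a hypothetical list of file names:

```python
# Hypothetical file names as produced by the download loop above
image_names = ["0.jpg", "1.jpg", "10.jpg", "11.jpg", "2.jpg"]

# Sort by the integer value of the file stem rather than lexicographically
image_uris = sorted(
    (f"tmp/{name}" for name in image_names),
    key=lambda p: int(p.rsplit("/", 1)[1].split(".")[0]),
)

print(image_uris)
# ['tmp/0.jpg', 'tmp/1.jpg', 'tmp/2.jpg', 'tmp/10.jpg', 'tmp/11.jpg']
```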

```
from langchain_teddynote.models import MultiModal
from langchain_openai import ChatOpenAI

# Initializing the ChatOpenAI model
llm = ChatOpenAI(model="gpt-4o-mini")

# MultiModal Model Setup
model = MultiModal(
    model=llm,
    system_prompt="Your mission is to describe the image in detail",  # System prompt: instruct the model to describe the image in detail
    user_prompt="Description should be written in one sentence(less than 60 characters)",  # User prompt: request a one-sentence description under 60 characters
)
```

Generate a description for the first image.

```
# Generate image descriptions
model.invoke(image_uris[0])
```

```
# Image Description
descriptions = dict()

for image_uri in image_uris:
    descriptions[image_uri] = model.invoke(image_uri, display_image=False)

# Output the generated results
descriptions
```

```
'A colorful lunchbox with various healthy snacks and months.','tmp/1.jpg':'Two giraffes near a tree, one reaching for leaves.','tmp/10.jpg':'Two giraffesA skater performs tricks on a graffiti-covered ramp.','tmp/14.jpg':'An owl-shaped candle beside an ornate clock.','tmp/15.jpg':'An Air France Airbus A380 flying through cloudy skies'  and tiled backsplash.','tmp/18.jpg':'A layered chocolate cake slice on a white plate.','tmp/19.jpg':'Deserted street scene with shops and a hotel sign.','tmp/2.jpg':'A white vase filleA curly-haired dog is sleeping on a pile of shoes.','tmp/6.jpg':'Two horses rearing up on a grassy field with riders.','tmp/7.jpg':'Two elephants carry riders through a dense jungle'  A train curves along tracks near a cityscape backdrop.'} 
```

```
import os
from PIL import Image
import matplotlib.pyplot as plt

# Initialize lists to store the original images, processed images, and text descriptions
original_images = []
images = []
texts = []

# Set figure size (20x10 inches)
plt.figure(figsize=(20, 10))

# Process image files stored in the 'tmp' directory
for i, image_uri in enumerate(image_uris):
    # Open the image file and convert to RGB mode
    image = Image.open(image_uri).convert("RGB")

    # Create subplots in a 4x5 grid
    plt.subplot(4, 5, i + 1)

    # Display the image
    plt.imshow(image)

    # Set the image file name and description as the title
    plt.title(f"{os.path.basename(image_uri)}\n{descriptions[image_uri]}", fontsize=8)

    # Remove tick marks from the x and y axes
    plt.xticks([])
    plt.yticks([])

    # Append the original image, processed image, and text description to each list
    original_images.append(image)
    images.append(image)
    texts.append(descriptions[image_uri])

# Adjust spacing between subplots
plt.tight_layout()
```

Below, we calculate the similarity between the image embeddings and the text descriptions generated above.

```
import numpy as np

# Image and text embedding
# Extract image features using image URI
img_features = image_embedding_function.embed_image(image_uris)
# Adding a “This is” prefix to text descriptions and extracting text features
text_features = image_embedding_function.embed_documents(
    ["This is " + desc for desc in texts]
)

# Convert list to numpy array for matrix operations
img_features_np = np.array(img_features)
text_features_np = np.array(text_features)

# Similarity calculation
# Compute cosine similarity between text and image features
similarity = np.matmul(text_features_np, img_features_np.T)
```
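The matrix product above yields true cosine similarities only when the embedding vectors are unit-normalized (which OpenCLIP embeddings typically are). A minimal NumPy sketch with made-up vectors makes the normalization step explicit:

```python
import numpy as np

# Made-up embedding matrices: 2 text vectors and 3 image vectors of dimension 4
text_features_np = np.array([[1.0, 0.0, 2.0, 0.0],
                             [0.0, 3.0, 0.0, 1.0]])
img_features_np = np.array([[2.0, 0.0, 4.0, 0.0],
                            [0.0, 1.0, 0.0, 0.0],
                            [1.0, 1.0, 1.0, 1.0]])

# Normalize each row to unit length so the dot product equals cosine similarity
text_norm = text_features_np / np.linalg.norm(text_features_np, axis=1, keepdims=True)
img_norm = img_features_np / np.linalg.norm(img_features_np, axis=1, keepdims=True)

similarity = text_norm @ img_norm.T  # shape: (n_texts, n_images)
print(similarity.shape)  # (2, 3)
```

The first text vector is parallel to the first image vector, so their cosine similarity is exactly 1.0.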

Compute and visualize the similarity between each text description and each image.

```
# Create a plot to visualize the similarity matrix
count = len(descriptions)
plt.figure(figsize=(20, 14))

# Displaying the similarity matrix as a heatmap
plt.imshow(similarity, vmin=0.1, vmax=0.3, cmap="coolwarm")
plt.colorbar()  # Add color bar

# Show text description on y-axis
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])  # Remove x-axis tick marks

# Display original image below x-axis
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")

# Display similarity values as text on top of the heatmap
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[y, x]:.2f}", ha="center", va="center", size=12)

# Remove plot border
for side in ["left", "top", "right", "bottom"]:
    plt.gca().spines[side].set_visible(False)

# Set the plot range
plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

# Add a title
plt.title("Cosine similarity between text and image features", size=20)
```

## Vectorstore creation and image addition <a href="#vectorstore_1" id="vectorstore_1"></a>

Create the vector store and add the images.

```
# Create a DB 
image_db = Chroma(
    collection_name="multimodal",
    embedding_function=image_embedding_function,
)

# Add image
image_db.add_images(uris=image_uris)
```

```
['cdd41dc4-e890-4de8-9035-9b5cd6603405','bb49e9af-54ac-4ed2-94a4-a5212b20d898', '204ecff3-c94e-464b-ab896aaf13e1-cccd-463d-bdab-811498959c2a','d882bf32-2983-4bd5-b7d9-be113f76cbe3', 'd5f4c365b-9579618a830-2207-4878-b127-d87c857fcb1d', '4998ff09-c6dc-4651-ab06-0f51ebf56c91','a6688148-ff30-4d25-85a4-0  480d7a24-f217-4455-9e85-9f83e1a32aee', '83016e20-e971-46aa-8e98-60db810434bc', '2785ce6d-70c5-4b18-b5e2- 
```

Below is a helper class that displays retrieved results as images.

```
import base64
import io
from PIL import Image
from IPython.display import HTML, display
from langchain.schema import Document


class ImageRetriever:
    def __init__(self, retriever):
        """
        Initializes the image retriever.

        Args:
        retriever: LangChain retriever object
        """
        self.retriever = retriever

    def invoke(self, query):
        """
        Searches for and displays images matching the query.

        Args:
        query (str): search query
        """
        docs = self.retriever.invoke(query)
        if docs and isinstance(docs[0], Document):
            self.plt_img_base64(docs[0].page_content)
        else:
            print("No images found.")
        return docs

    @staticmethod
    def resize_base64_image(base64_string, size=(224, 224)):
        """
        Resizes an image encoded as a Base64 string.

        Args:
        base64_string (str): Base64 string of the original image
        size (tuple): desired image size as (width, height)

        Returns:
        str: Base64 string of the resized image
        """
        img_data = base64.b64decode(base64_string)
        img = Image.open(io.BytesIO(img_data))
        resized_img = img.resize(size, Image.LANCZOS)
        buffered = io.BytesIO()
        resized_img.save(buffered, format=img.format)
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

    @staticmethod
    def plt_img_base64(img_base64):
        """
        Displays an image encoded in Base64.

        Args:
        img_base64 (str): Base64-encoded image string
        """
        image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
        display(HTML(image_html))
```
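`plt_img_base64` works by embedding the Base64 string in an HTML `data:` URI. The stdlib-only sketch below (with dummy bytes standing in for real JPEG data) shows that round trip:

```python
import base64

# Dummy bytes standing in for real JPEG data
img_bytes = b"\xff\xd8\xff\xe0fake-jpeg-payload"

# Encode to Base64 and build the HTML <img> tag used by plt_img_base64
img_base64 = base64.b64encode(img_bytes).decode("utf-8")
image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'

# Decoding the Base64 string recovers the original bytes exactly
assert base64.b64decode(img_base64) == img_bytes
print(image_html[:33])  # <img src="data:image/jpeg;base64,
```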

```
# Image Retriever generation
retriever = image_db.as_retriever(search_kwargs={"k": 3})
image_retriever = ImageRetriever(retriever)
```

```
# Image lookup
result = image_retriever.invoke("A Dog on the street")
```

```
# Image lookup
result = image_retriever.invoke("Motorcycle with a man")
```
