# WikiDocs 07. MultiVectorRetriever

## MultiVectorRetriever <a href="#multivectorretriever" id="multivectorretriever"></a>

In LangChain, `MultiVectorRetriever` is the feature that lets you query documents efficiently in a variety of situations. It stores and manages multiple vectors per document, which can significantly improve both the accuracy and the efficiency of retrieval.

Let's look at the ways `MultiVectorRetriever` supports creating multiple vectors per document.

**Methods for creating multiple vectors per document**

1. **Smaller chunks**: Split the document into smaller units, then generate a separate embedding for each chunk. This lets retrieval focus on specific parts of the document. The approach can also be implemented with `ParentDocumentRetriever`, making it easier to drill into details.
2. **Summary embedding**: Generate a summary of each document and create an embedding from that summary. A summary embedding helps capture the core content of a document quickly: instead of analyzing the entire document, you index only its key points.
3. **Hypothetical questions**: Create suitable hypothetical questions for each document and create embeddings from those questions. This is useful when you want a deeper exploration of a particular topic, since the questions approach the document's content from a variety of perspectives.
4. **Manual addition**: Users can directly add specific questions or queries to be considered during search. This gives users finer control over the search process and enables searches customized to their needs.
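The pattern shared by all four methods can be sketched in plain Python. This is a toy illustration only (no LangChain, word-overlap scoring as a stand-in for embeddings; all names and data here are made up): several small representations of a document are indexed, and each one points back to its parent via an ID.

```python
# A toy sketch of the multi-vector idea: several small "representations"
# (chunks, summaries, questions) all point back to one parent document.
docs = {
    "doc-1": "Samsung unveiled its generative AI model Samsung Gauss.",
    "doc-2": "28 countries signed the Bletchley Declaration on AI safety.",
}

# Multiple derived representations per document, each tagged with its parent's ID.
representations = [
    ("doc-1", "samsung gauss generative ai"),            # summary-style
    ("doc-1", "what ai model did samsung release?"),     # hypothetical question
    ("doc-2", "bletchley declaration ai safety"),        # summary-style
    ("doc-2", "which countries signed the declaration?"),
]

def retrieve(query: str) -> str:
    """Score every representation by word overlap, return the parent document."""
    q = set(query.lower().split())
    best_id = max(representations, key=lambda r: len(q & set(r[1].split())))[0]
    return docs[best_id]

print(retrieve("name of the generative ai by samsung"))
```

Whichever representation matches best, the retriever always returns the original parent document, not the small representation itself.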

### Documents utilized for practice <a href="#id-1" id="id-1"></a>

Software Policy and Research Institute (SPRi) AI Brief, December 2023 issue

* Authors: Jaeheung Lee (Senior Researcher, AI Policy Research Office), Ji-soo Lee (Researcher, AI Policy Research Office)
* Link: <https://spri.kr/posts/view/23669>
* File name: `SPRI_AI_Brief_2023년12월호_F.pdf`

**Note**: Download the file above into the `data` folder.

```python
# Configuration file for managing API keys as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()
```

```
 True
```

```python
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH11-Retriever")
```

```
LangSmith tracking started.
[Project name]
CH11-Retriever
```

Next, preprocess the data: load the document from the PDF file and split it into chunks of a specified size.

The split documents can then be used for vectorization and retrieval.

```python
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf")
docs = loader.load()
```

The loaded original documents are stored in the `docs` variable.

```python
print(docs[5].page_content[:500])
```

```
1. Policy/Legal
2. Enterprise/Industry
3. Technology/Research
4. Workforce/Training
28 countries participating in the UK AI Safety Summit declare a joint response to AI risk
n The 28 countries participating in the AI Safety Summit held at Bletchley Park, UK,
announced the 'Bletchley Declaration' pledging cooperation for AI safety
n Countries and companies developing advanced AI agreed on a safety testing plan for AI systems,
and the UK AI Safety Institute will work with countries around the world to lead the testing
KEY Contents
£ Participating countries of the AI Safety Summit agree to cooperate to ensure AI safety through the Bletchley Declaration
n At the AI Safety Summit held at Bletchley Park, UK, on November 1-2, 2023,
representatives from 28 countries participated in the 'Bletchley Declaration' for AI risk management
∙The Declaration emphasized that collaboration among all stakeholders, including national governments,
international organizations, enterprises, civil society, and academia, is important to ensure AI safety
```

### Chunk + Original Document Search <a href="#chunk" id="chunk"></a>

When searching over large amounts of information, it can be useful to embed the information in smaller units.

`MultiVectorRetriever` stores and manages documents across multiple vectors.

The original documents are stored in the `docstore`, and the embedded documents in the `vectorstore`.

Splitting the documents into smaller units allows for more accurate search, while the contents of the original documents can still be retrieved when needed.

```python
# A vector store to use for indexing the child chunks.
import uuid
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_vector import MultiVectorRetriever

vectorstore = Chroma(
    collection_name="small_bigger_chunks",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
# Storage layer for the parent documents
store = InMemoryStore()

id_key = "doc_id"

# The retriever (empty at the start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate a document ID for each loaded document.
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Check the generated IDs.
doc_ids
```

```
['ff65f8d8-376d-47d2-bbc0-eae1f7e3c381',
 'e18fd86b-f01a-445f-bf31-eb8d16259241',
 ...]
```

Here, define `parent_text_splitter` to split the documents into larger chunks,

and `child_text_splitter` to split them into smaller chunks.

```python
# Create RecursiveCharacterTextSplitter objects.
parent_text_splitter = RecursiveCharacterTextSplitter(chunk_size=600)

# A splitter used to create the smaller chunks.
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
```
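To build intuition for how the two chunk sizes relate, here is a deliberately simplified stand-in: fixed-width slicing instead of the separator-aware splitting that `RecursiveCharacterTextSplitter` actually performs (so real chunk counts will differ).

```python
# A stand-in for the splitter behavior: fixed-width slicing of a long text
# into 600-char "parent" chunks and 200-char "child" chunks.
def split_text(text: str, chunk_size: int) -> list[str]:
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "x" * 1800  # stand-in for one page of the PDF

parent_chunks = split_text(text, 600)
child_chunks = split_text(text, 200)

print(len(parent_chunks))  # 3
print(len(child_chunks))   # 9
```

The smaller the chunk size, the more chunks (and embeddings) per document: each parent chunk here corresponds to three child chunks.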

Generate the parent documents, i.e., the larger chunks.

```python
parent_docs = []

for i, doc in enumerate(docs):
    # Get the ID of the current document.
    _id = doc_ids[i]
    # Split the current document into parent chunks.
    parent_doc = parent_text_splitter.split_documents([doc])

    for _doc in parent_doc:
        # Store the document ID in the metadata.
        _doc.metadata[id_key] = _id
    parent_docs.extend(parent_doc)
```

Check the `doc_id` stored in `parent_docs`.

```python
# Check the metadata of the generated Parent document.
parent_docs[0].metadata
```

```
{'source': 'data/SPRI_AI_Brief_2023년12월호_F.pdf', 'file_path': 'data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 0, 'total_pages': 23, ..., 'doc_id': 'ff65f8d8-376d-47d2-bbc0-eae1f7e3c381'}
```

Generate the child documents, i.e., the relatively smaller chunks.

```python
child_docs = []
for i, doc in enumerate(docs):
    # Get the ID of the current document.
    _id = doc_ids[i]
    # Split the current document into child chunks.
    child_doc = child_text_splitter.split_documents([doc])
    for _doc in child_doc:
        # Store the document ID in the metadata.
        _doc.metadata[id_key] = _id
    child_docs.extend(child_doc)
```

Check the `doc_id` stored in `child_docs`.

```python
# Check the metadata of the generated Child document.
child_docs[0].metadata
```

```
{'source': 'data/SPRI_AI_Brief_2023년12월호_F.pdf', 'file_path': 'data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 0, 'total_pages': 23, ..., 'doc_id': 'ff65f8d8-376d-47d2-bbc0-eae1f7e3c381'}
```

Check the number of chunks in each split.

```python
print(f"Number of split parent_docs: {len(parent_docs)}")
print(f"Number of split child_docs: {len(child_docs)}")
```

Add the newly created splits to the vector store.

Next, map the parent documents to the generated UUIDs and add them to the `docstore`.

* The `mset()` method stores (document ID, document) key-value pairs in the document store.
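The key-value semantics of `mset()`/`mget()` can be illustrated with a minimal dict-backed stand-in for the docstore (`TinyDocStore` is an illustrative name, not a LangChain class):

```python
# A dict-backed stand-in for the docstore's key-value interface.
class TinyDocStore:
    def __init__(self):
        self._store = {}

    def mset(self, pairs):
        # Store (document ID, document) pairs.
        for key, value in pairs:
            self._store[key] = value

    def mget(self, keys):
        # Return the documents for the given IDs (None if a key is missing).
        return [self._store.get(k) for k in keys]

store = TinyDocStore()
store.mset([("id-1", "parent document one"), ("id-2", "parent document two")])
print(store.mget(["id-2", "id-1"]))
```

The vector store indexes the small embedded chunks, while a store like this holds the full parents so they can be looked up by ID after a match.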

```python
# Add the parent + child documents to the vector store.
retriever.vectorstore.add_documents(parent_docs)
retriever.vectorstore.add_documents(child_docs)

# Store the original documents in the docstore.
retriever.docstore.mset(list(zip(doc_ids, docs)))
```

Perform a similarity search; the document chunks with the highest similarity are returned first.

Here, the `retriever.vectorstore.similarity_search` method searches over the child + parent document chunks.

```python
# Perform a similarity search on the vectorstore.
relevant_chunks = retriever.vectorstore.similarity_search(
    "The name of the generative AI created by Samsung Electronics is?"
)
print(f"Number of documents searched: {len(relevant_chunks)}")
```

```
 Number of documents searched: 4
```

```python
for chunk in relevant_chunks:
    print(chunk.page_content, end="\n\n")
    print(">" * 100, end="\n\n")
```

```
☞ Source: Samsung Electronics, 'Samsung AI Forum' Self-developed AI 'Samsung Gauss' released, 2023.11.08.
Samsung Electronics Holds 'Samsung Developer Conference Korea 2023', 2023.11.14.
TechRepublic, Samsung Gauss: Samsung Research Reveals Generative AI, 2023.11.08.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

SPRi AI Brief |
2023-December
10
Samsung Electronics unveils self-developed AI 'Samsung Gauss'
n Samsung released 'Samsung Gauss', a self-developed AI model consisting of three models for language, code, and images that can operate on-device
n Samsung plans to gradually integrate Samsung Gauss into a variety of products, with on-device operation possible

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

▹ Samsung Electronics unveils self-developed AI 'Samsung Gauss' ················································
▹ Google invests up to $2 billion in Anthropic to strengthen AI cooperation ········································ 11

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

SPRi AI Brief |
2023-December
10
Samsung Electronics unveils self-developed AI 'Samsung Gauss'
n Samsung released 'Samsung Gauss', a self-developed AI model consisting of three models for language, code, and images that can operate on-device
n Samsung plans to gradually integrate Samsung Gauss into a variety of products; thanks to on-device operation, Samsung Gauss has the advantage that there is no risk of user information leaking externally
KEY Contents
£ Samsung Gauss supports on-device operation and consists of three models for language, code, and images
n The generative AI model 'Samsung Gauss', developed by Samsung Research, was first unveiled at the 'Samsung AI Forum 2023' event held on November 8, 2023
∙Samsung Gauss, named after the genius mathematician Gauss, who established the theory of the normal distribution, allows selection of a model size optimized for the situation
∙Samsung Gauss was trained on safe data that does not infringe licenses or personal information, is designed to operate on-device, and has the advantage of not leaking user information externally
∙Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and plans to gradually integrate the generative AI model into various products

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
```

This time, run the query using the `retriever.invoke()` method.

The `retriever.invoke()` method returns the full contents of the original documents.
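Conceptually, `invoke()` first finds the most similar chunks, then resolves their `doc_id` metadata back to the originals, which is why several matching chunks can collapse into fewer documents. A minimal sketch of that resolution step (toy data, illustrative names):

```python
# Toy sketch: resolve matched chunks back to unique parent documents via doc_id.
matched_chunks = [
    {"doc_id": "A", "text": "chunk about Samsung Gauss"},
    {"doc_id": "A", "text": "another chunk from the same page"},
    {"doc_id": "B", "text": "table-of-contents chunk"},
    {"doc_id": "A", "text": "yet another chunk from page A"},
]
docstore = {"A": "full page about Samsung Gauss", "B": "table of contents page"}

# Collect unique parent IDs in first-seen order, then look up the originals.
seen, parent_ids = set(), []
for chunk in matched_chunks:
    if chunk["doc_id"] not in seen:
        seen.add(chunk["doc_id"])
        parent_ids.append(chunk["doc_id"])

parents = [docstore[i] for i in parent_ids]
print(len(parents))  # 2 unique parent documents from 4 matched chunks
```

Deduplicating by ID before the docstore lookup is what makes the retriever return each original document only once.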

```python
relevant_docs = retriever.invoke("The name of the generative AI created by Samsung Electronics is?")
print(f"Number of documents searched: {len(relevant_docs)}", end="\n\n")
print("=" * 100, end="\n\n")
print(relevant_docs[0].page_content)
```

```
Number of documents searched: 2

====================================================================================================

SPRi AI Brief |
2023-December
10
Samsung Electronics unveils self-developed AI 'Samsung Gauss'
n Samsung released 'Samsung Gauss', a self-developed AI model consisting of three models for language, code, and images that can operate on-device
n Samsung plans to gradually integrate Samsung Gauss into a variety of products; thanks to on-device operation, Samsung Gauss has the advantage that there is no risk of user information leaking externally
KEY Contents
£ Samsung Gauss supports on-device operation and consists of three models for language, code, and images
n The generative AI model 'Samsung Gauss', developed by Samsung Research, was first unveiled at the 'Samsung AI Forum 2023' event held on November 8, 2023
∙Samsung Gauss, named after the genius mathematician Gauss, who established the theory of the normal distribution, allows selection of a model size optimized for the situation
∙Samsung Gauss was trained on safe data that does not infringe licenses or personal information, is designed to operate on-device, and has the advantage of not leaking user information externally
∙Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and plans to gradually integrate the generative AI model into various products
n Samsung Gauss consists of three models: a language model that generates text, a code model that generates code, and an image model that generates images
∙The language model includes a variety of models for cloud and on-device use, and supports tasks such as composing email, summarizing documents, and translation
∙The code model powers the AI coding assistant 'code.i', which provides services through an interactive interface optimized for in-house software development
∙The image model helps create creative images, transform existing images as desired, and also supports converting low-resolution images to high resolution
n The IT outlet TechRepublic noted that on-device AI has emerged as a major technology trend, and expects Samsung smartphones equipped with Gauss, starting in 2024, to compete with Meta's Llama 2 on Qualcomm devices and Google Pixel with Google Assistant
☞ Source: Samsung Electronics, 'Samsung AI Forum' Self-developed AI 'Samsung Gauss' released, 2023.11.08.
Samsung Electronics Holds 'Samsung Developer Conference Korea 2023', 2023.11.14.
TechRepublic, Samsung Gauss: Samsung Research Reveals Generative AI, 2023.11.08.
```

By default, the retriever performs a similarity search in the vector database.

LangChain vector stores also support search via [Max Marginal Relevance](https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.max_marginal_relevance_search), so if you want to use that instead, just set the `search_type` attribute.

* Set the `search_type` attribute of the `retriever` object to `SearchType.mmr`.
* This specifies that the Maximal Marginal Relevance (MMR) algorithm should be used at search time.
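To see what MMR actually computes, here is a small self-contained sketch of the greedy MMR selection rule over toy 2-D vectors (pure Python, not LangChain's implementation; `lambda_mult` balances relevance against redundancy):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices balancing query relevance and diversity."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Penalize similarity to anything already selected.
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.9, 0.12], [0.5, -0.5]]  # first two are near-duplicates
print(mmr(query, candidates, k=2))  # [0, 2]
```

Plain similarity search would return the two near-duplicate vectors; MMR skips the redundant second candidate and picks the more diverse third one instead.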

```python
from langchain.retrievers.multi_vector import SearchType

# Set the search type to MMR (Maximal Marginal Relevance)
retriever.search_type = SearchType.mmr

# Search all related documents
print(retriever.invoke("The name of the generative AI created by Samsung Electronics is?")[0].page_content)
```

```
SPRi AI Brief |
2023-December
10
Samsung Electronics unveils self-developed AI 'Samsung Gauss'
n Samsung released 'Samsung Gauss', a self-developed AI model consisting of three models for language, code, and images that can operate on-device
n Samsung plans to gradually integrate Samsung Gauss into a variety of products; thanks to on-device operation, Samsung Gauss has the advantage that there is no risk of user information leaking externally
KEY Contents
£ Samsung Gauss supports on-device operation and consists of three models for language, code, and images
n The generative AI model 'Samsung Gauss', developed by Samsung Research, was first unveiled at the 'Samsung AI Forum 2023' event held on November 8, 2023
∙Samsung Gauss, named after the genius mathematician Gauss, who established the theory of the normal distribution, allows selection of a model size optimized for the situation
∙Samsung Gauss was trained on safe data that does not infringe licenses or personal information, is designed to operate on-device, and has the advantage of not leaking user information externally
∙Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and plans to gradually integrate the generative AI model into various products
n Samsung Gauss consists of three models: a language model that generates text, a code model that generates code, and an image model that generates images
∙The language model includes a variety of models for cloud and on-device use, and supports tasks such as composing email, summarizing documents, and translation
∙The code model powers the AI coding assistant 'code.i', which provides services through an interactive interface optimized for in-house software development
∙The image model helps create creative images, transform existing images as desired, and also supports converting low-resolution images to high resolution
n The IT outlet TechRepublic noted that on-device AI has emerged as a major technology trend, and expects Samsung smartphones equipped with Gauss, starting in 2024, to compete with Meta's Llama 2 on Qualcomm devices and Google Pixel with Google Assistant
☞ Source: Samsung Electronics, 'Samsung AI Forum' Self-developed AI 'Samsung Gauss' released, 2023.11.08.
Samsung Electronics Holds 'Samsung Developer Conference Korea 2023', 2023.11.14.
TechRepublic, Samsung Gauss: Samsung Research Reveals Generative AI, 2023.11.08.
```

```python
from langchain.retrievers.multi_vector import SearchType

# Set the search type to similarity_score_threshold
retriever.search_type = SearchType.similarity_score_threshold
retriever.search_kwargs = {"score_threshold": 0.3}

# Search all related documents
print(retriever.invoke("The name of the generative AI created by Samsung Electronics is?")[0].page_content)
```

```
SPRi AI Brief |
2023-December
10
Samsung Electronics unveils self-developed AI 'Samsung Gauss'
n Samsung released 'Samsung Gauss', a self-developed AI model consisting of three models for language, code, and images that can operate on-device
n Samsung plans to gradually integrate Samsung Gauss into a variety of products; thanks to on-device operation, Samsung Gauss has the advantage that there is no risk of user information leaking externally
KEY Contents
£ Samsung Gauss supports on-device operation and consists of three models for language, code, and images
n The generative AI model 'Samsung Gauss', developed by Samsung Research, was first unveiled at the 'Samsung AI Forum 2023' event held on November 8, 2023
∙Samsung Gauss, named after the genius mathematician Gauss, who established the theory of the normal distribution, allows selection of a model size optimized for the situation
∙Samsung Gauss was trained on safe data that does not infringe licenses or personal information, is designed to operate on-device, and has the advantage of not leaking user information externally
∙Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and plans to gradually integrate the generative AI model into various products
n Samsung Gauss consists of three models: a language model that generates text, a code model that generates code, and an image model that generates images
∙The language model includes a variety of models for cloud and on-device use, and supports tasks such as composing email, summarizing documents, and translation
∙The code model powers the AI coding assistant 'code.i', which provides services through an interactive interface optimized for in-house software development
∙The image model helps create creative images, transform existing images as desired, and also supports converting low-resolution images to high resolution
n The IT outlet TechRepublic noted that on-device AI has emerged as a major technology trend, and expects Samsung smartphones equipped with Gauss, starting in 2024, to compete with Meta's Llama 2 on Qualcomm devices and Google Pixel with Google Assistant
☞ Source: Samsung Electronics, 'Samsung AI Forum' Self-developed AI 'Samsung Gauss' released, 2023.11.08.
Samsung Electronics Holds 'Samsung Developer Conference Korea 2023', 2023.11.14.
TechRepublic, Samsung Gauss: Samsung Research Reveals Generative AI, 2023.11.08.
```

```python
from langchain.retrievers.multi_vector import SearchType

# Set search type to similarity, set k value to 1
retriever.search_type = SearchType.similarity
retriever.search_kwargs = {"k": 1}

# Search all related documents
print(len(retriever.invoke("The name of the generative AI created by Samsung Electronics is?")))
```

```
1
```

### Saving summaries to the vector store <a href="#summary" id="summary"></a>

A summary can often capture a chunk's content more precisely, which leads to better search results.

Here we explain how to generate summaries and how to embed them.
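Before the LLM-based version below, the summary-embedding flow can be sketched with a stand-in summarizer (here, simply the first sentence; the data and `summarize` helper are illustrative, not the chain used in this chapter):

```python
# Sketch of the summary-embedding flow with a stand-in summarizer.
import uuid

split_docs = [
    "Samsung unveiled Samsung Gauss. It runs on-device and has three models.",
    "28 countries signed the Bletchley Declaration. It targets AI safety.",
]

def summarize(text: str) -> str:
    # Stand-in for the LLM summary chain: keep only the first sentence.
    return text.split(". ")[0] + "."

doc_ids = [str(uuid.uuid4()) for _ in split_docs]
summaries = [summarize(d) for d in split_docs]

# Each summary is what gets embedded; the doc_id links it back to the original.
summary_records = list(zip(doc_ids, summaries))

print(summary_records[0][1])
```

The structure is identical to the real pipeline: one summary per document, paired with the ID of the original it came from.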

```python
# Import libraries to load the PDF file and split text
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the PDF file loader
loader = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf")

# Text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)

# Load the PDF file and split the text
split_docs = loader.load_and_split(text_splitter)

# Print the number of split documents
print(f"Number of split documents: {len(split_docs)}")
```

```
Number of split documents: 61
```

```python
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


summary_chain = (
    {"doc": lambda x: x.page_content}
    # Create a prompt template for summarizing a document
    | ChatPromptTemplate.from_messages(
        [
            ("system", "You are an expert in summarizing documents in Korean."),
            (
                "user",
                "Summarize the following documents in 3 sentences in bullet points format.\n\n{doc}",
            ),
        ]
    )
    # Generate summaries using an OpenAI chat model
    | ChatOpenAI(temperature=0, model="gpt-4o-mini")
    | StrOutputParser()
)
```

Use the `chain.batch` method to summarize the documents in the `split_docs` list. Here, setting the `max_concurrency` parameter to 10 allows up to 10 documents to be processed simultaneously.
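The effect of bounded concurrency can be sketched with a thread pool standing in for the LLM calls (`summarize` and the data below are illustrative, not the summary chain itself):

```python
# A stand-in for batch processing with bounded concurrency, mirroring the
# max_concurrency=10 setting with a thread pool of 10 workers.
from concurrent.futures import ThreadPoolExecutor

def summarize(doc: str) -> str:
    # Stand-in for one LLM summarization call.
    return doc.upper()

docs = [f"document {i}" for i in range(25)]

with ThreadPoolExecutor(max_workers=10) as pool:
    # map() preserves input order even though calls run concurrently.
    summaries = list(pool.map(summarize, docs))

print(summaries[:2])
```

As with `chain.batch`, at most 10 requests are in flight at once, and the results come back in the same order as the inputs.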

```python
# Document batch processing

summaries = summary_chain.batch(split_docs, {"max_concurrency": 10})
```

```python
len(summaries)
```

```
 61 
```

Output the summarized content to confirm the results.

```python
# Print the contents of the original document.
print(split_docs[33].page_content, end="\n\n")
# Print the summary.
print("[Summary]")
print(summaries[33])
```

```
SPRi AI Brief |
2023-December
10
Samsung Electronics unveils self-developed AI 'Samsung Gauss'
n Samsung released 'Samsung Gauss', a self-developed AI model consisting of three models for language, code, and images that can operate on-device
n Samsung plans to gradually integrate Samsung Gauss into a variety of products; thanks to on-device operation, Samsung Gauss has the advantage that there is no risk of user information leaking externally
KEY Contents
£ Samsung Gauss supports on-device operation and consists of three models for language, code, and images
n The generative AI model 'Samsung Gauss', developed by Samsung Research, was first unveiled at the 'Samsung AI Forum 2023' event held on November 8, 2023
∙Samsung Gauss, named after the genius mathematician Gauss, who established the theory of the normal distribution, allows selection of a model size optimized for the situation
∙Samsung Gauss was trained on safe data that does not infringe licenses or personal information, is designed to operate on-device, and has the advantage of not leaking user information externally
∙Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and plans to gradually integrate the generative AI model into various products

[Summary]
- Samsung Electronics unveiled 'Samsung Gauss', a generative AI model that can operate on-device and consists of three models: language, code, and image.
- 'Samsung Gauss' is named after the mathematician Gauss, who established the theory of the normal distribution, and allows selection of models optimized for various situations.
- Samsung Electronics designed the AI model so that it does not leak user information externally, and plans to gradually integrate it into various products.
```

Initialize a `Chroma` vector store to index the summaries, using `OpenAIEmbeddings` as the embedding function.

* Use `"doc_id"` as the key indicating the document ID.

```python
import uuid

# Create a vector store to store summary information.
summary_vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

# Create a repository to store the parent document.
store = InMemoryStore()

# Specifies the key name under which to store the document ID.
id_key = "doc_id"

# Initialize the retriever. (Empty at the start)
retriever = MultiVectorRetriever(
    vectorstore=summary_vectorstore,  # Vector store holding the summaries
    byte_store=store,  # Byte store
    id_key=id_key,  # Document ID key
)
# Generate a document ID.
doc_ids = [str(uuid.uuid4()) for _ in split_docs]
```

```python
summary_docs = [
    # Create a Document with the summary as its page content and the document ID in its metadata.
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
```

The summarized documents are stored together with their metadata (the document ID of the original they summarize).

```python
# Number of summary documents
len(summary_docs)
```

```
61
```

The number of summary documents matches the number of original (split) documents.

```python
# Add the summary documents to the vector store.
retriever.vectorstore.add_documents(summary_docs)

# Map document IDs to documents and store them in the document store.
retriever.docstore.mset(list(zip(doc_ids, split_docs)))
```

* `retriever.vectorstore.add_documents(summary_docs)` adds `summary_docs` to the vector store.
* `retriever.docstore.mset(list(zip(doc_ids, split_docs)))` maps `doc_ids` to `split_docs` and stores the pairs in the document store.

Perform a similarity search using the `similarity_search` method of the `vectorstore` object.

```python
# Perform a similarity search.
result_docs = summary_vectorstore.similarity_search(
    "The name of the generative AI created by Samsung Electronics is?"
)
```

```python
# Print one result document.
print(result_docs[0].page_content)
```


```
 Samsung Electronics Holds ‘Samsung Developer Conference Korea 2023’, 2023.11.14. 
TechRepublic, Samsung Gauss: Samsung Research Reveals Generative AI, 2023.11.08. 
SPRi AI Brief |  
2023-December 
10 
Samsung Electronics unveils self-developed AI ‘Samsung Gauss ’ 
n Create a self-development consisting of 3 models of languages, codes, and images that the Samsung can operate on the on-Device  
AI model ‘Samsung Gauss ’ released 
n Samsung plans to phase out Samsung Gauss in a variety of products, with on-dice operation possible  
Samsung Gauss has the advantage that there is no risk of user information leaking outward 
KEY Contents 
£ Samsung Gauss, Ondevice operation support, consisting of three models of language, code, and images 
n Generated AI model developed by Samsung at the ‘Samsung AI Forum 2023’ event held on November 8, 2023  
‘Samsung Gauss ’ first released 
∙Samsung Gauss, modeled after the name of genius mathematician Gauss, who established the theory of normal distribution,  
Optimized size model selection available 
∙Samsung Gauss has been learned through secure data that does not infringe licenses or personal information,  
Designed to operate on an on-device and has the advantage of not leaking user information externally 
∙The Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and generated AI models in various products. 
```

```python
# Search and retrieve related documents.
retrieved_docs = retriever.invoke(result_docs[1].page_content)

# Prints the searched documents.
for doc in retrieved_docs:
    print(doc.page_content)
```

`retriever` Object `invoke` Search for documents related to queries using methods.

```
 What are the new products or services developed by Samsung using AI technology? 
{'doc_id': '61bcf671-1f5c-4d75-850e-42d4c88c2c87'} 
What competitiveness is Samsung's Generative AI technology expected to have in the AI market in the future? 
{'doc_id': '61bcf671-1f5c-4d75-850e-42d4c88c2c87'} 
How will the content presented at the Samsung Developer Conference Korea 2023 affect the AI industry? 
{'doc_id': '61bcf671-1f5c-4d75-850e-42d4c88c2c87'} 
How will the function of Samsung Gauss, working on the ondives, affect the future of the AI industry? 
{'doc_id':'b7e88e61-8c28-400e-a4fb-f6bd25e5e40b'} 
```

```python
# Outputs similarity search results.
for doc in result_docs:
    print(doc.page_content)
    print(doc.metadata)
```

Since we have only added the hypothetical queries we have created here, we return the document with the highest similarity among the hypothetical queries we have created.

Below are the results of similar search.

```python
# Search for similar documents in the vector repository.
result_docs = hypothetical_vectorstore.similarity_search(
    "The name of the generative AI created by Samsung Electronics is?"
)
```

`vectorstore` Object `similarity_search` Perform similarity searches using methods.

```python
# hypothetical_questions Add a document to the vector repository.
retriever.vectorstore.add_documents(question_docs)

# Map document IDs to documents and store them in a document repository.
retriever.docstore.mset(list(zip(doc_ids, split_docs)))
```

Add hypothetical queries to documents, original documents `docstore` Add to.

```python
question_docs = []
# hypothetical_questions save
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        # For each question in the question list, create a Document object and include the document ID of that question in its metadata.
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )
```

`question_docs` Add metadata (document ID) to the list.

```python
# A vector store to use for indexing child chunks.
hypothetical_vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# Storage hierarchy of parent document
store = InMemoryStore()

id_key = "doc_id"
# Search engine (empty at start)
retriever = MultiVectorRetriever(
    vectorstore=hypothetical_vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in split_docs]  # Generate Document ID
```

Below is the process of storing hypothetical Queries in vector storage, the same way they did before.

```
 ['What advantages does Samsung Gauss offer in protecting user privacy when compared to other generated AI models?','Why did the Samsung decide to put Samsung Gauss on various products?','How will the function of Samsung Gauss, working on an on-dives, affect the future of the AI industry?'] 
```

```python
hypothetical_questions[33]
```

```python
# Create batches of hypothesis questions for a list of documents
hypothetical_questions = hypothetical_query_chain.batch(
    split_docs, {"max_concurrency": 10}
)
```

`chain.batch` Using methods `split_docs` Process multiple requests simultaneously for data.

```
 ['What competitiveness can Samsung Gauss have compared to other generated AI models?','What is the impact of Samsung Gauss, which is capable of operating on the fly, on the protection of personal information?','Why would the Holy Ghost phase Samsung Gauss into various products?'] 
```

```python
# Runs a chain over a given document.
hypothetical_query_chain.invoke(split_docs[33])
```

* The output contains the three generated hypothetical queries.

Run the chain on a single document and print the generated questions.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
from langchain_openai import ChatOpenAI

hypothetical_query_chain = (
    {"doc": lambda x: x.page_content}
    # Ask for exactly 3 hypothetical questions answerable from the document below; the number can be adjusted.
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer. "
        "Potential users are those interested in the AI industry. Create questions that they would be interested in. "
        "Output should be written in Korean:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o-mini").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    # Extract the value corresponding to the "questions" key from the output.
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
```

* `functions` and `function_call` are set so the model always calls the hypothetical-question generation function.
* `JsonKeyOutputFunctionsParser` parses the generated questions, extracting the value under the `questions` key.
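
`JsonKeyOutputFunctionsParser` essentially decodes the function-call arguments as JSON and returns the value under a single key. A minimal stdlib sketch of that step (the `raw_arguments` payload and `extract_key` helper are illustrative, not the library's internals):

```python
import json

# A made-up example of the arguments string a function call might return.
raw_arguments = '{"questions": ["Q1", "Q2", "Q3"]}'

def extract_key(arguments: str, key_name: str) -> list[str]:
    # Decode the JSON payload and return the value under key_name,
    # mirroring what JsonKeyOutputFunctionsParser(key_name=...) does.
    return json.loads(arguments)[key_name]

questions = extract_key(raw_arguments, "questions")
print(questions)  # ['Q1', 'Q2', 'Q3']
```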

Use `ChatPromptTemplate` to define a prompt template that generates 3 hypothetical questions based on a given document.

```python
functions = [
    {
        "name": "hypothetical_questions",  # Name the function.
        "description": "Generate hypothetical questions",  # Write a description for the function.
        "parameters": {  # Defines the parameters of a function.
            "type": "object",  # Specifies the type of the parameter as an object.
            "properties": {  # Defines the properties of an object.
                "questions": {  # Defines the 'questions' property.
                    "type": "array",  # 'questions' is an array.
                    "items": {
                        "type": "string"
                    },  # Specifies the element type of the array as a string.
                },
            },
            "required": ["questions"],  # Marks 'questions' as a required parameter.
        },
    }
]
```

Below is an example of using `Function Calling` to generate hypothetical questions.

Creating hypothetical questions helps readers grasp the main topics and concepts of a document, and can draw them deeper into its content.

The questions created this way can then be embedded, allowing the content of the document to be explored and understood in more depth.
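
The intuition behind embedding the questions: a user's query is usually phrased as a question, so it tends to land closer to an embedded hypothetical question than to the raw chunk text. A toy illustration with invented 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented "embeddings" for illustration only.
user_query = [0.9, 0.1, 0.2]    # the user's actual question
question_vec = [0.8, 0.2, 0.1]  # a hypothetical question generated from the chunk
chunk_vec = [0.1, 0.9, 0.6]     # the raw chunk text

sim_question = cosine_similarity(user_query, question_vec)
sim_chunk = cosine_similarity(user_query, chunk_vec)
print(sim_question > sim_chunk)  # the question vector is the closer match
```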

An LLM can also be used to generate a list of hypothetical questions for a particular document.

### Explore document content using Hypothetical Queries <a href="#hypothetical-queries" id="hypothetical-queries"></a>

```
 SPRi AI Brief |  
2023-December 
10 
Samsung Electronics unveils self-developed AI ‘Samsung Gauss ’ 
n Create a self-development consisting of 3 models of languages, codes, and images that the Samsung can operate on the on-Device  
AI model ‘Samsung Gauss ’ released 
n Samsung plans to phase out Samsung Gauss in a variety of products, with on-dice operation possible  
Samsung Gauss has the advantage that there is no risk of user information leaking outward 
KEY Contents 
£ Samsung Gauss, Ondevice operation support, consisting of three models of language, code, and images 
n Generated AI model developed by Samsung at the ‘Samsung AI Forum 2023’ event held on November 8, 2023  
‘Samsung Gauss ’ first released 
∙Samsung Gauss, modeled after the name of genius mathematician Gauss, who established the theory of normal distribution,  
Optimized size model selection available 
∙Samsung Gauss has been learned through secure data that does not infringe licenses or personal information,  
Designed to operate on an on-device and has the advantage of not leaking user information externally 
∙The Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and generated AI models in various products. 
```

```python
# Search and retrieve related documents.
retrieved_docs = retriever.invoke("The name of the generative AI created by Samsung Electronics is?")
print(retrieved_docs[0].page_content)
```

Use the retriever object's `invoke()` method to search for documents related to the question.

```
-Samsung Electronics unveiled the generative AI model 'Samsung Gauss', which can operate on-device and consists of three models: language, code, and image. 
-'Samsung Gauss' is named after the mathematician Gauss, who established the theory of the normal distribution, and supports selecting models optimized for various situations. 
-Samsung Electronics designed this AI model so that user information is not leaked externally, and plans to apply it to various products in phases. 
```

```
 Samsung Electronics Holds ‘Samsung Developer Conference Korea 2023’, 2023.11.14. 
TechRepublic, Samsung Gauss: Samsung Research Reveals Generative AI, 2023.11.08. 
SPRi AI Brief |  
2023-December 
10 
Samsung Electronics unveils self-developed AI ‘Samsung Gauss ’ 
n Create a self-development consisting of 3 models of languages, codes, and images that the Samsung can operate on the on-Device  
AI model ‘Samsung Gauss ’ released 
n Samsung plans to phase out Samsung Gauss in a variety of products, with on-dice operation possible  
Samsung Gauss has the advantage that there is no risk of user information leaking outward 
KEY Contents 
£ Samsung Gauss, Ondevice operation support, consisting of three models of language, code, and images 
n Generated AI model developed by Samsung at the ‘Samsung AI Forum 2023’ event held on November 8, 2023  
‘Samsung Gauss ’ first released 
∙Samsung Gauss, modeled after the name of genius mathematician Gauss, who established the theory of normal distribution,  
Optimized size model selection available 
∙Samsung Gauss has been learned through secure data that does not infringe licenses or personal information,  
Designed to operate on an on-device and has the advantage of not leaking user information externally 
∙The Samsung Electronics also introduced on-device AI technology utilizing Samsung Gauss, and generated AI models in various products. 
```

```python
# Search and retrieve related documents.
retrieved_docs = retriever.invoke(result_docs[1].page_content)

# Prints the searched documents.
for doc in retrieved_docs:
    print(doc.page_content)
```

Search for related documents with the retriever's `invoke` method, this time using the page content of one of the similarity-search results as the query.

```
 What are the new products or services developed by Samsung using AI technology? 
{'doc_id': '61bcf671-1f5c-4d75-850e-42d4c88c2c87'} 
What competitiveness is Samsung's Generative AI technology expected to have in the AI market in the future? 
{'doc_id': '61bcf671-1f5c-4d75-850e-42d4c88c2c87'} 
How will the content presented at the Samsung Developer Conference Korea 2023 affect the AI industry? 
{'doc_id': '61bcf671-1f5c-4d75-850e-42d4c88c2c87'} 
How will the function of Samsung Gauss, working on-device, affect the future of the AI industry? 
{'doc_id':'b7e88e61-8c28-400e-a4fb-f6bd25e5e40b'} 
```

```python
# Outputs similarity search results.
for doc in result_docs:
    print(doc.page_content)
    print(doc.metadata)
```

Since only the generated hypothetical queries have been added to the vector store at this point, the similarity search returns the stored question documents most similar to the query.
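
Conceptually, `MultiVectorRetriever` then resolves these question hits back to their parents: it reads the `doc_id` from each hit's metadata, deduplicates the IDs in rank order, and fetches the original documents from the docstore. A simplified sketch with plain dicts (the data is invented and this is not the actual LangChain implementation):

```python
# Simulated similarity-search hits: (question text, metadata), most similar first.
hits = [
    ("What is Samsung Gauss?", {"doc_id": "doc-1"}),
    ("How does on-device AI protect privacy?", {"doc_id": "doc-1"}),
    ("What models make up Samsung Gauss?", {"doc_id": "doc-2"}),
]

# Simulated docstore mapping doc_id -> original parent document.
docstore = {"doc-1": "parent document 1", "doc-2": "parent document 2"}

# Deduplicate doc_ids while preserving rank order, then fetch the parents.
seen_ids = list(dict.fromkeys(meta["doc_id"] for _, meta in hits))
parent_docs = [docstore[i] for i in seen_ids]
print(parent_docs)  # ['parent document 1', 'parent document 2']
```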

Below are the results of the similarity search.

```python
# Search the vector store for similar documents.
result_docs = hypothetical_vectorstore.similarity_search(
    "The name of the generative AI created by Samsung Electronics is?"
)
```

Perform a similarity search using the `similarity_search` method of the `vectorstore` object.

```python
# Add the hypothetical question documents to the vector store.
retriever.vectorstore.add_documents(question_docs)

# Map document IDs to the original documents and store them in the docstore.
retriever.docstore.mset(list(zip(doc_ids, split_docs)))
```
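
`InMemoryStore.mset` is essentially a batched key-value write, and the retriever later reads the parent documents back by ID. A dict-backed sketch of the `mset`/`mget` behaviour (a simplification for illustration, not the real class):

```python
class SimpleKVStore:
    """Dict-backed sketch of InMemoryStore's mset/mget behaviour."""

    def __init__(self):
        self._data = {}

    def mset(self, pairs):
        # Store many (key, value) pairs at once.
        for key, value in pairs:
            self._data[key] = value

    def mget(self, keys):
        # Fetch many keys at once; missing keys come back as None.
        return [self._data.get(k) for k in keys]

store = SimpleKVStore()
store.mset([("id-1", "doc one"), ("id-2", "doc two")])
print(store.mget(["id-1", "missing"]))  # ['doc one', None]
```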
