07. MultiVectorRetriever

LangChain provides a special feature for querying documents efficiently in a variety of situations: the MultiVectorRetriever. It lets you store and manage multiple vectors per document, which can significantly improve the accuracy and efficiency of information retrieval.

Let's look at the ways you can create multiple vectors per document using MultiVectorRetriever.

Methods for creating multiple vectors per document

  1. Small chunk generation: Split the document into smaller units, then generate a separate embedding for each chunk. This lets retrieval pay closer attention to specific parts of the document. The process can be implemented with ParentDocumentRetriever, making it easier to drill into details.

  2. Summary embedding: Generate a summary of each document and create an embedding from that summary. A summary embedding is a great help in quickly grasping the core content of a document: instead of analyzing the entire document, you use only the key summary, maximizing efficiency.

  3. Using hypothetical questions: Create a suitable hypothetical question for each document and create an embedding based on that question. This method is useful when you want a deep exploration of a particular topic or content, since hypothetical questions let the document be approached from a variety of perspectives, enabling a broader understanding.

  4. Manual addition: Users can directly add specific questions or queries to be considered during search. This gives users finer control over the search process and allows searches customized to their needs.

Document used for practice

Software Policy and Research Institute (SPRi) AI Brief, December 2023 issue

  • Authors: Jaeheung Lee (Senior Researcher, AI Policy Research Lab), Ji-soo Lee (Researcher, AI Policy Research Lab)

  • Link: https://spri.kr/posts/view/23669

  • File name: SPRI_AI_Brief_2023년12월호_F.pdf

Note: Download the file above and place it inside the data folder.

# Configuration file that manages API keys as environment variables.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()

Next, we preprocess the data: load it from the text file and split the loaded documents into chunks of a specified size.

The split documents can later be used for vectorization and retrieval.

The original documents loaded from the data are kept in the docs variable.
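A minimal sketch of this preprocessing step; the file path is a placeholder and the chunk sizes are assumptions:

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the raw text; the original documents are kept in `docs`.
loader = TextLoader("./data/sample.txt")
docs = loader.load()

# Split the loaded documents into chunks of a specified size.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)
split_docs = text_splitter.split_documents(docs)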

Chunk + Original Document Search

When searching over large amounts of information, it can be useful to embed the information in smaller units.

MultiVectorRetriever lets you store and manage documents across multiple vectors.

The original documents are saved in the docstore, and the embedded documents in the vectorstore (see the setup sketch below).

Splitting documents into smaller units in this way enables more accurate searches, while the original documents can still be looked up when needed.
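A minimal setup sketch, assuming Chroma as the vector store and an in-memory docstore; the collection name is a placeholder, and import paths may differ across LangChain versions:

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Vector store that indexes the embedded chunks.
vectorstore = Chroma(collection_name="small_chunks", embedding_function=OpenAIEmbeddings())

# Storage layer for the original documents.
store = InMemoryStore()

# Metadata key that links each chunk back to its original document.
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)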

Here we define parent_text_splitter to split into larger chunks and child_text_splitter to split into smaller chunks.
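A sketch of the two splitters; the chunk sizes here are assumptions, chosen only so that the parent chunks are clearly larger than the child chunks:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Splitter for the larger parent chunks.
parent_text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)

# Splitter for the smaller child chunks.
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)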

Generate the parent documents, i.e., the larger chunks.

Check the doc_id values written into parent_docs.
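A sketch of the parent-chunk generation; each original document gets one UUID, which is written into every chunk derived from it:

import uuid

# One ID per original document, shared by its parent and child chunks.
doc_ids = [str(uuid.uuid4()) for _ in docs]

parent_docs = []
for i, doc in enumerate(docs):
    # Split one original document into larger parent chunks.
    chunks = parent_text_splitter.split_documents([doc])
    for chunk in chunks:
        # Tag each chunk with the doc_id of its source document.
        chunk.metadata[id_key] = doc_ids[i]
    parent_docs.extend(chunks)

# Inspect the doc_id written into the parent chunks.
print(parent_docs[0].metadata)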

Generate the child documents, i.e., the relatively smaller chunks.

Check the doc_id values written into child_docs.
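The same pattern with the child splitter; the final print compares how many chunks each splitter produced:

child_docs = []
for i, doc in enumerate(docs):
    # Split the same original document into smaller child chunks.
    chunks = child_text_splitter.split_documents([doc])
    for chunk in chunks:
        chunk.metadata[id_key] = doc_ids[i]
    child_docs.extend(chunks)

print(child_docs[0].metadata)

# Number of chunks produced by each splitter.
print(len(parent_docs), len(child_docs))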

Check the number of chunks each split produced (compared in the final print of the sketch above).

Add the set of newly created small splits to the vector store.

Next, map the original (parent) documents to the generated UUIDs and add them to the docstore.

  • The mset() method saves document IDs and document contents to the document store as key-value pairs.
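A sketch of indexing the chunks and registering the originals:

# Index both the parent and the child chunks in the vector store.
retriever.vectorstore.add_documents(parent_docs)
retriever.vectorstore.add_documents(child_docs)

# Store each original document under its UUID as a key-value pair.
retriever.docstore.mset(list(zip(doc_ids, docs)))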

Perform a similarity search; it outputs the single most similar document chunk first.

Here, the retriever.vectorstore.similarity_search method searches within the child + parent document chunks.
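A sketch of the chunk-level search; the query string is a placeholder:

# Searches over the indexed parent + child chunks and returns the closest ones.
similar_chunks = retriever.vectorstore.similarity_search("What is generative AI?")
print(similar_chunks[0].page_content)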

This time, run the query using the retriever.invoke() method.

The retriever.invoke() method returns the entire contents of the original documents.
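A sketch of the retriever-level search; the matched chunks' doc_id values are used to look up and return the full original documents (placeholder query again):

# Returns the original documents mapped from the matched chunks.
relevant_docs = retriever.invoke("What is generative AI?")
print(relevant_docs[0].page_content)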

The search the retriever performs against the vector database is, by default, a similarity search.

LangChain vector stores also support searching via Max Marginal Relevance, so if you want to use that instead, just set the search_type attribute.

  • Set the retriever object's search_type property to SearchType.mmr.

  • This specifies that the Maximum Marginal Relevance (MMR) algorithm should be used at search time.
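A sketch of switching the search type; SearchType lives in the same module as MultiVectorRetriever:

from langchain.retrievers.multi_vector import SearchType

# Use Maximum Marginal Relevance instead of plain similarity search.
retriever.search_type = SearchType.mmr

retriever.invoke("What is generative AI?")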

Saving summaries to the vector store

A summary can often distill the content of a chunk more accurately, leading to better search results.

Here we look at how to generate the summaries and how to embed them.

Summarize the documents in the docs list using the chain.batch method. Here, the max_concurrency parameter is set to 10 so that up to 10 documents are processed simultaneously; see the sketch below.
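A minimal summarization chain, assuming an OpenAI chat model; the model name and prompt wording are placeholders:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    # Pass only the page content of each document into the prompt.
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo")
    | StrOutputParser()
)

# Summarize every document in `docs`, up to 10 at a time.
summaries = chain.batch(docs, {"max_concurrency": 10})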

Print the summarized content to check the results.

Initialize the Chroma vector store that indexes the child chunks, using OpenAIEmbeddings as the embedding function.

  • Use "doc_id" as the key indicating the document ID.
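A setup sketch for the summary index; the collection name is a placeholder:

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Vector store used to index the chunks (here, the summaries).
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()

# Key indicating the document ID.
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)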

Create the summarized documents together with their metadata (the document ID assigned to each generated summary). The number of summary documents matches the number of original documents.
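A sketch of building the summary documents:

import uuid
from langchain_core.documents import Document

doc_ids = [str(uuid.uuid4()) for _ in docs]

# One summary Document per original, tagged with the original's doc_id.
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]

print(len(summary_docs) == len(docs))  # True: one summary per original document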

  • Add summary_docs to the vector store via retriever.vectorstore.add_documents(summary_docs).

  • Map doc_ids to docs and save them to the document store using retriever.docstore.mset(list(zip(doc_ids, docs))).

Then perform a similarity search using the vectorstore object's similarity_search method.
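A sketch of the indexing and search steps just described (placeholder query):

# Index the summaries; keep the originals retrievable by doc_id.
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# The similarity search runs over the stored summaries.
result_docs = retriever.vectorstore.similarity_search("What is generative AI?")
print(result_docs[0].page_content)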


Explore document content using Hypothetical Queries

An LLM can also be used to generate a list of hypothetical questions for a particular document.

The questions created this way can then be embedded, which lets you explore and understand the content of the document in more depth.

Creating hypothetical questions helps you grasp the main topics and concepts of a document, and can make readers more curious about its content.

Below is an example of using Function Calling to generate hypothetical questions.

Use ChatPromptTemplate to define a prompt template that generates three hypothetical questions based on a given document.

  • Set functions and function_call so that the hypothetical-question generation function is called.

  • Parse the generated questions with JsonKeyOutputFunctionsParser, extracting the value under the questions key.
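A sketch of such a chain, closely following the function-calling pattern from the LangChain documentation; the model name and prompt wording are placeholders:

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Function schema the model is forced to call; it returns a list of questions.
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

chain = (
    {"doc": lambda x: x.page_content}
    # Prompt that asks for exactly 3 hypothetical questions about the document.
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below "
        "document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(model="gpt-3.5-turbo").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    # Extract the list stored under the "questions" key.
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

# Run the chain on a single document to check the generated questions.
chain.invoke(split_docs[0])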

Run the chain on a document and print the output; this is what the invoke call at the end of the sketch above does.

  • The output contains the three generated hypothetical queries.

Use the chain.batch method to process the split_docs data as multiple simultaneous requests; see the sketch below.
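A sketch of the batch call; the concurrency limit here is an assumption:

# Generate hypothetical questions for every split document,
# processing up to 5 requests simultaneously.
hypothetical_questions = chain.batch(split_docs, {"max_concurrency": 5})

# The three questions generated for the first document.
print(hypothetical_questions[0])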

Below, the hypothetical queries are stored in the vector store, in the same way as before.

Add metadata (the document ID) to the question_docs list.

Add the hypothetical query documents to the vector store, and the original documents to the docstore, as sketched below.
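A sketch of this storage step, mirroring the earlier setup; the collection name is a placeholder:

import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name="hypo-questions", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

doc_ids = [str(uuid.uuid4()) for _ in split_docs]

# Wrap each generated question in a Document carrying its source doc_id.
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        Document(page_content=q, metadata={id_key: doc_ids[i]}) for q in question_list
    )

# Questions go into the vector store; the original splits go into the docstore.
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, split_docs)))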

Perform a similarity search using the vectorstore object's similarity_search method. Since only the generated hypothetical queries were added to the vector store, the stored queries most similar to the question are returned; the results of such a search are shown in the sketch below.

Use the retriever object's invoke method to search for documents related to a query; it returns the full original documents rather than the stored questions.
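A sketch of both searches (placeholder query):

# similarity_search returns the stored hypothetical questions themselves.
result_docs = retriever.vectorstore.similarity_search("What is generative AI?")
print(result_docs[0].page_content)

# invoke() maps the matched questions back to the original documents.
retrieved_docs = retriever.invoke("What is generative AI?")
print(retrieved_docs[0].page_content)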
