# 05. ParentDocumentRetriever

## Parent Document Retriever <a href="#parent-document-retriever" id="parent-document-retriever"></a>

**Balancing document search and document splitting**

When splitting a document into appropriately sized pieces (chunks) for retrieval, **two conflicting factors** must be balanced:

1. You may want small chunks, so that each embedding reflects the chunk's meaning as accurately as possible. If a chunk is too long, its embedding can lose meaning.
2. You may want chunks long enough that each one retains its surrounding context.

**The role of `ParentDocumentRetriever`**

To balance these two requirements, a tool called `ParentDocumentRetriever` is used. It splits documents into small chunks and manages them. At search time, it first finds these small chunks, then uses the identifier (ID) of the original document (or larger chunk) each one belongs to in order to recover the overall context.

The term "parent document" here refers to the original document from which the small chunks were split. It can be the full document, or another relatively large chunk. This way, the meaning of each chunk is captured accurately while the overall context is preserved.
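The mechanism can be sketched in a few lines of plain Python (the names below are illustrative, not LangChain API): the small child chunks are what get searched, but each chunk carries its parent's ID, and the parent is what is ultimately returned.

```python
# A minimal sketch of the parent/child lookup idea (not the LangChain API):
# child chunks are searched, but the matching parent document is returned.

parent_store = {
    "doc-1": "Full parent document text about Word2Vec and embeddings ...",
}

# Each child chunk remembers the ID of its parent.
child_index = [
    {"text": "Word2Vec maps words to vectors", "parent_id": "doc-1"},
    {"text": "embeddings capture semantic similarity", "parent_id": "doc-1"},
]

def retrieve(query: str) -> list[str]:
    # Stand-in for vector similarity: a naive substring match on child chunks.
    hits = [c for c in child_index if query.lower() in c["text"].lower()]
    # Deduplicate parent IDs while preserving order, then return the parents.
    parent_ids = list(dict.fromkeys(c["parent_id"] for c in hits))
    return [parent_store[pid] for pid in parent_ids]

results = retrieve("Word2Vec")
```

In the real retriever, the substring match is replaced by vector similarity search and `parent_store` by a docstore such as `InMemoryStore`.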

**Summary**

* **Leverages the hierarchy between documents**: `ParentDocumentRetriever` uses the hierarchy between documents to make retrieval more efficient.
* **Improved search performance**: it quickly finds relevant chunks while still returning the documents that best answer a given question.

Create `TextLoader` objects to load the text files and load the data.

```python
# Configuration file for managing the API key as an environment variable.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()
```

```
True
```

```python
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH11-Retriever")
```

```
Start tracking LangSmith. 
[Project name] 
CH11-Retriever 
```

```python
from langchain.storage import InMemoryStore
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
```

```python
loaders = [
    # load the file.
    TextLoader("./data/appendix-keywords.txt"),
]

docs = []
for loader in loaders:
    # Use the loader to load a document and add it to the docs list.
    docs.extend(loader.load())

```

### Retrieving the full document <a href="#id-1" id="id-1"></a>

In this mode we want to retrieve the full documents, so we specify only the `child_splitter`.

* Later we will also specify a `parent_splitter` and compare the results.

```python
# Create a child splitter.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Create a DB.
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()

# Create the retriever.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)
```

`retriever.add_documents(docs, ids=None)` adds the list of documents to the retriever.

* If `ids` is `None`, the IDs are generated automatically.
* If `add_to_docstore=False`, the documents are not added to the docstore again, which avoids duplicates; in that case, however, `ids` values are required so that duplicates can be checked.
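Since duplicate checking relies on the `ids` you pass, one common approach is to derive a deterministic ID from each document's content (the `content_id` helper below is hypothetical, not part of LangChain):

```python
import hashlib

def content_id(text: str) -> str:
    # Deterministic ID derived from the document text itself,
    # so identical documents always map to the same key.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

texts = ["first document", "second document", "first document"]
ids = [content_id(t) for t in texts]
# The duplicate text yields a duplicate ID, which can be used to skip re-adding it.
```

Such IDs could then be passed as `retriever.add_documents(docs, ids=ids, add_to_docstore=False)`, so re-running a cell does not duplicate documents.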

```python
# Add documents to the retriever. docs is a list of documents, and ids is a list of unique identifiers for those documents.
retriever.add_documents(docs, ids=None, add_to_docstore=True)
```

This code should return one key per document added; since we loaded a single document, one key is returned.

* Call the `yield_keys()` method on the `store` object and convert the returned keys to a list.

```python
# Returns a list of all keys in the repository.
list(store.yield_keys())
```

```
 ['c2a89a0f-a690-4915-af68-2ea432fb6e51'] 
```

Now let's call the vector store's search function directly.

Since small chunks are what is stored there, the search should return small chunks.

Perform a similarity search using the `similarity_search` method on the `vectorstore` object.

```python
# Perform a similarity search.
sub_docs = vectorstore.similarity_search("Word2Vec")
```

Print `sub_docs[0].page_content`.

```python
# Outputs the page_content attribute of the first element in the sub_docs list.
print(sub_docs[0].page_content)
```

```
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
```

Now let's search with the retriever as a whole. Because it **returns the documents** that contain the small chunks, relatively large documents will be returned.

Search for documents related to the query using the `invoke()` method on the `retriever` object.

```python
# Search and retrieve documents.
retrieved_docs = retriever.invoke("Word2Vec")
```

Print part of the content of the retrieved document (`retrieved_docs[0]`).

```python
# Print the length of the page content of the first retrieved document.
print(
    f"Document length: {len(retrieved_docs[0].page_content)}",
    end="\n\n=====================\n\n",
)

# Print part of the document.
print(retrieved_docs[0].page_content[2000:2500])
```

```
Document length: 5733 

===================== 

 Innovating data storage and processing by introducing computing is an example of digital transformation. 
Related Keywords: innovation, technology, business model 

Crawling 

Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 

Definition: LLM is a large language model trained with large-scale text data 
```

### Adjusting to larger chunks <a href="#chunk" id="chunk"></a>

As the previous result shows, **the full document can be too large to be a useful search result**.

In this case, what we really want is to first split the raw document into larger chunks, and then split those into smaller chunks.

We then index the small chunks, but at search time retrieve the larger chunks (though still not the whole document).

* `RecursiveCharacterTextSplitter` is used to create both the parent and child documents.
* The parent documents' `chunk_size` is set to 1000.
* The child documents' `chunk_size` is set to 200, smaller than the parent documents.
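Before looking at the LangChain code, the two-level split can be sketched with plain Python; naive fixed-size character slicing stands in for `RecursiveCharacterTextSplitter`:

```python
def split(text: str, chunk_size: int) -> list[str]:
    # Naive fixed-size character splitter (stands in for a real text splitter).
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

document = "x" * 2500  # a raw document of 2,500 characters

parents = split(document, 1000)   # parent chunks of 1000/1000/500 characters
children = {
    p_idx: split(parent, 200)     # each parent split into 200-character children
    for p_idx, parent in enumerate(parents)
}

# The small child chunks get embedded and indexed; at query time a matching
# child leads back to its parent chunk, which is what the user receives.
```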

```python
# A text splitter used to generate the parent document.
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
# A text splitter used to generate child documents.
# You must create a document that is smaller than its parent.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# A vector store to use for indexing child chunks.
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# This is the storage layer of the parent document.
store = InMemoryStore()
```

Code to initialize the `ParentDocumentRetriever`:

* The `vectorstore` parameter specifies the vector store that holds the document vectors.
* The `docstore` parameter specifies the document store that holds the document data.
* The `child_splitter` parameter specifies the splitter used to create the child documents.
* The `parent_splitter` parameter specifies the splitter used to create the parent documents.

`ParentDocumentRetriever` handles hierarchical document structures, splitting and storing parent and child documents separately. This lets you make effective use of both at search time.

```python
retriever = ParentDocumentRetriever(
    # Specifies a vector storage.
    vectorstore=vectorstore,
    # Specify a document repository.
    docstore=store,
    # Specifies a subdocument divider.
    child_splitter=child_splitter,
    # Specifies the parent document divider.
    parent_splitter=parent_splitter,
)
```

Add `docs` to the `retriever` object. This adds the new documents to the set of searchable documents.

```python
retriever.add_documents(docs)  # Adds a document to the retriever.
```

Now you can see that there are many more documents; these are the larger chunks.

```python
# Generates a key from the storage, converts it to a list, and returns its length.
len(list(store.yield_keys()))
```

```
7
```

Let's check whether the underlying vector store still returns the small chunks.

Perform a similarity search using the `similarity_search` method on the `vectorstore` object.

```python
# Perform a similarity search.
sub_docs = vectorstore.similarity_search("Word2Vec")
# Print the page_content attribute of the first element in the sub_docs list.
print(sub_docs[0].page_content)
```

```
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
```

This time, search for documents using the `invoke()` method on the `retriever` object.

```python
# Search and retrieve documents.
retrieved_docs = retriever.invoke("Word2Vec")

# Print the page content of the first retrieved document.
print(retrieved_docs[0].page_content)
```

```
 Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism. 
Example: Google translators use transformer models to perform translations between different languages. 
Associated Keywords: deep learning, natural language processing, Attention 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. 
Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. 
Associates: Natural language processing, deep learning, library 

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. 
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. 
Related Keywords: innovation, technology, business model 

Crawling 

Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
```
