# CH07 Text Splitting (Text Splitter)

Document splitting is the second stage of a Retrieval-Augmented Generation (RAG) system. It breaks the loaded documents into pieces that can be **processed efficiently**, which prepares the system to make better use of the information they contain.

The goal of this stage is to turn large, complex documents into **small chunks that an LLM can take in efficiently**. Later, when the user enters a question, only the chunks most relevant to that question are selected and passed to the model.

(Example) **How much did Google invest in Anthropic?**

### The need for splitting <a href="#id-1" id="id-1"></a>

1. **Pinpoint information retrieval (accuracy)**: Splitting a document into smaller units makes it possible to retrieve only the **information relevant to the question (query)**. Each unit focuses on a specific topic or piece of content, which helps **provide relevant information**.
2. **Resource optimization (efficiency)**: Feeding an entire document to the LLM is expensive, and the surplus text can keep the model from drawing on the passages that actually matter for the answer. Sometimes this leads to **hallucination**. Splitting therefore also aims to extract only the information needed to answer.
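The two points above can be illustrated with a toy sketch. It assumes a simple word-overlap score as a stand-in for real retrieval (RAG systems typically use embeddings instead), and the chunk texts and the `top_chunks` helper are invented placeholders for illustration only.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks sharing the most words with the query (toy scoring)."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:k]


# Placeholder chunks, made up for this illustration.
chunks = [
    "Chapter 1 describes how the company was founded.",
    "Google agreed to invest in Anthropic, according to the report.",
    "The appendix lists office locations and contact details.",
]
query = "How much did Google invest in Anthropic?"

selected = top_chunks(query, chunks, k=1)
print(selected)

# Only the selected chunk would be sent to the LLM, not the whole document.
total_chars = sum(len(c) for c in chunks)
print(f"{len(selected[0])} of {total_chars} characters selected")
```

Because only the best-matching chunk is passed on, the model sees less irrelevant text and the request costs less.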

### Document splitting process <a href="#id-2" id="id-2"></a>

1. **Identify document structure**: Identify the structure of the source document, which may be a PDF file, a web page, an e-book, or another format. This can include locating headers, footers, page numbers, section titles, and so on.
2. **Unit selection**: Decide which unit to split the document by. This can be by page, by section, or by paragraph, depending on the document's content and purpose.
3. **Unit size selection (chunk size)**: Decide how large each chunk should be, for example how many tokens it may contain.
4. **Chunk overlap**: It is common to let adjacent chunks overlap slightly so that context carries over across the split boundaries.

**Chunk size & chunk overlap**
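As a minimal sketch of how chunk size and chunk overlap interact, the hypothetical helper below cuts a string into fixed-size character windows. It measures size in characters rather than tokens to stay dependency-free, and it is not how any particular library implements splitting.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, which is the overlap that lets context continue across a boundary.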

### Code <a href="#id-3" id="id-3"></a>

```
from langchain_text_splitters import RecursiveCharacterTextSplitter

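# Step 2: Split Documents
# `docs` is the list of Document objects loaded in the previous stage.
# chunk_size and chunk_overlap are measured in characters by default.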
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
splits = text_splitter.split_documents(docs)
```
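To give a rough idea of what "recursive" means here, the sketch below tries coarse separators first and falls back to finer ones only for pieces that are still too long. It is a simplified illustration and not LangChain's actual implementation: it ignores chunk overlap, and `recursive_split` is a hypothetical helper name. The separator order mirrors the splitter's default list (`"\n\n"`, `"\n"`, `" "`, `""`).

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they still fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer than the limit allows.\n\n"
    "Third."
)
parts = recursive_split(sample, chunk_size=30)
for part in parts:
    print(repr(part))
```

Paragraphs that fit stay whole, and only the oversized second paragraph is broken at word boundaries, which is why this strategy tends to keep related text together.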

### Chunk split visualization <a href="#chunk" id="chunk"></a>

A chunk visualization site created by Greg Kamradt:

* <https://chunkviz.up.railway.app/>

### Reference <a href="#id-4" id="id-4"></a>

* [Text Splitter](https://wikidocs.net/233998)
* [LangChain TextSplitters](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/)
