03. Token Text Split (TokenTextSplitter)

TokenTextSplitter

Language models have a token limit, so the text you pass to them must not exceed it.

TokenTextSplitter is useful when you need to split text into chunks based on token count.

tiktoken
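tiktoken is a fast BPE tokenizer created by OpenAI.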

%pip install --upgrade --quiet langchain-text-splitters tiktoken
  • Open the ./data/appendix-keywords.txt file and read its contents.

  • Store the contents in the file variable.

# Open data/appendix-keywords.txt and create a file object named f
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the contents of the file and store them in the file variable

Print part of the content read from the file.

# Print the first 500 characters of the file content
print(file[:500])
Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching to understand the meaning of a user's query and return related results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Related keywords: natural language processing, search algorithms, data mining

Embedding

Definition: Embedding is the process of converting text data, such as words or sentences, into low-dimensional continuous vectors, which allows computers to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Related keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a smaller unit obtained by splitting text, usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Related keywords: tokenization, natural language processing

Splitting text with CharacterTextSplitter

  • Use the from_tiktoken_encoder method to initialize a text splitter based on the tiktoken encoder.

Print the number of chunks produced.
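A minimal sketch of this step (the chunk_size and chunk_overlap values are illustrative):

from langchain_text_splitters import CharacterTextSplitter

# Initialize a splitter that measures chunk size with the tiktoken encoder
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, chunk_overlap=50
)

texts = text_splitter.split_text(file)
print(len(texts))  # number of chunks produced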

Reference

  • When using CharacterTextSplitter.from_tiktoken_encoder, the text is split only by the CharacterTextSplitter, and the tiktoken tokenizer is used to merge the split text. (This means a split can end up larger than the chunk size as measured by the tiktoken tokenizer.)

  • Using RecursiveCharacterTextSplitter.from_tiktoken_encoder keeps the split text no larger than the token chunk size allowed by the language model; each split is split again recursively if it is larger. You can also load a tiktoken splitter directly (TokenTextSplitter), which guarantees that each split is smaller than the chunk size.

TokenTextSplitter

  • Split text into token units using the TokenTextSplitter class.
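A minimal sketch (parameter values are illustrative):

from langchain_text_splitters import TokenTextSplitter

# Split directly on tiktoken tokens so each chunk stays within a token budget
text_splitter = TokenTextSplitter(chunk_size=300, chunk_overlap=50)

texts = text_splitter.split_text(file)
print(texts[0])  # check the first chunk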

spaCy

spaCy is an open-source software library for advanced natural language processing, written in the Python and Cython programming languages.

Another alternative to NLTK is to use the spaCy tokenizer.

  1. How the text is split: by the spaCy tokenizer.

  2. How the chunk size is measured: by the number of characters.

pip command to upgrade the spaCy library to the latest version:
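A standard pip invocation, matching the install style used above:

%pip install --upgrade --quiet spacy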

Download the en_core_web_sm model:
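The standard spaCy CLI download command:

!python -m spacy download en_core_web_sm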

  • Open the appendix-keywords.txt file and read its contents.

Check the content by printing part of it.

  • Split the file text using the split_text method of the text_splitter object, as sketched below.
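A minimal sketch of these steps (the chunk parameters are illustrative):

from langchain_text_splitters import SpacyTextSplitter

# Read the sample file
with open("./data/appendix-keywords.txt") as f:
    file = f.read()

print(file[:350])  # check part of the content

# Create a spaCy-based splitter; chunk size is measured in characters
text_splitter = SpacyTextSplitter(chunk_size=200, chunk_overlap=50)

texts = text_splitter.split_text(file)
print(texts[0])  # check the first split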

SentenceTransformers

SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models.

Its default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.

Check sample text.

Next is code that counts the number of tokens in the file variable, printing the count after excluding the start and stop tokens.
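A minimal sketch, assuming the sample text is the file variable read earlier (the splitter's default sentence-transformer model is used):

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Create a splitter for the default sentence-transformer model
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)

count_start_and_stop_tokens = 2  # special tokens added at the start and end
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)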

Split the text stored in the text_to_split variable into chunks using the splitter.split_text() function.
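A sketch of the split step; here text_to_split is assumed to reuse the file contents:

text_to_split = file  # assumption: use the sample file as the text to split

# Split into chunks that fit the model's token window
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[0])  # check the first chunk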

NLTK

The Natural Language Toolkit (NLTK) is a collection of libraries and programs for English natural language processing (NLP), written in the Python programming language.

Instead of simply splitting on "\n\n", you can use NLTK to split text based on NLTK tokenizers.

  1. Text splitting method: split by the NLTK tokenizer.

  2. Chunk size measurement: measured by the number of characters.

  3. pip command to install the nltk library (shown below).

  4. NLTK (Natural Language Toolkit) is a Python library for natural language processing.

  5. It supports a range of NLP tasks, including text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.
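The install command, plus the sentence-tokenizer data that NLTK needs (a standard setup step, not shown in the original):

%pip install --upgrade --quiet nltk

import nltk

# Download the sentence-tokenizer data used by NLTKTextSplitter
# (newer NLTK releases may need "punkt_tab" instead of "punkt")
nltk.download("punkt")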

Check the sample text.

  • Create a text splitter using the NLTKTextSplitter class.

  • Setting the chunk_size parameter to 1000 splits the text into chunks of up to 1,000 characters.

Split the file text using the split_text method of the text_splitter object.
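A minimal sketch of these steps:

from langchain_text_splitters import NLTKTextSplitter

# Create an NLTK-based splitter; chunk size is measured in characters
text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(file)
print(texts[0])  # check the first split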

KoNLPy

KoNLPy (Korean NLP in Python) is a Python package for Korean natural language processing (NLP).

Token splitting involves splitting text into smaller, more manageable units called tokens.

These tokens are often words, phrases, symbols, or other meaningful elements that are important for further processing and analysis.

In languages such as English, token splitting usually involves separating words by spaces and punctuation marks.

How effective token splitting is depends heavily on the tokenizer's understanding of the language's structure, which is what ensures that meaningful tokens are created.

Tokenizers designed for English lack the ability to understand the distinct semantic structures of other languages such as Korean, so they cannot be used effectively for Korean processing.

Korean token splitting using KoNLPy's Kkma analyzer

For Korean text, KoNLPy includes a morpheme analyzer called Kkma (Korean Knowledge Morpheme Analyzer).

Kkma provides detailed morphological analysis of Korean text.

It decomposes sentences into words, words into their individual morphemes, and identifies the part of speech of each token.

It can also divide blocks of text into individual sentences, which is especially useful for processing long texts.

Considerations when using Kkma

Kkma is known for its detailed analysis, but note that this precision can affect processing speed. Kkma is therefore best suited for applications that prioritize analytical depth over fast text processing.

  • pip command to install the KoNLPy library (shown below).

  • KoNLPy is a Python package for Korean natural language processing, providing features such as morphological analysis, part-of-speech tagging, and parsing.
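A standard pip invocation (note that KoNLPy also requires a Java runtime):

%pip install --upgrade --quiet konlpy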

Check sample text.

This is an example of splitting Korean text using KonlpyTextSplitter.

Split the file into sentence units using the text_splitter.
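A minimal sketch, assuming the Korean sample text has been read into the file variable:

from langchain_text_splitters import KonlpyTextSplitter

# Create a splitter that uses KoNLPy's Kkma analyzer to find sentence boundaries
text_splitter = KonlpyTextSplitter()

texts = text_splitter.split_text(file)  # split into sentence units
print(texts[0])  # check the first split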

Hugging Face tokenizer

Hugging Face provides a variety of tokenizers.

In this code, we calculate the token length of the text using GPT2TokenizerFast, one of Hugging Face's tokenizers.

The text splitting method is as follows:

  • The text is split on the character passed in.

Here's how the chunk size is measured:

  • By the number of tokens calculated by the Hugging Face tokenizer.

  • Create a tokenizer object using the GPT2TokenizerFast class.

  • Load the pretrained "gpt2" tokenizer model by calling the from_pretrained method.
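A minimal sketch of loading the tokenizer (requires the transformers package; hf_tokenizer is a name chosen here):

from transformers import GPT2TokenizerFast

# Load the pretrained "gpt2" tokenizer
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")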

Check the sample text.

Initialize the text splitter with the Hugging Face tokenizer via the from_huggingface_tokenizer method.

Check the split result of the first element.
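A sketch using CharacterTextSplitter with the tokenizer loaded above (chunk values are illustrative; chunk size here is counted in tokens computed by hf_tokenizer):

from langchain_text_splitters import CharacterTextSplitter

# Initialize a splitter that measures chunk size with the Hugging Face tokenizer
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    hf_tokenizer, chunk_size=300, chunk_overlap=50
)

texts = text_splitter.split_text(file)
print(texts[0])  # check the split result of the first element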
