Open the ./data/appendix-keywords.txt file, read its contents, and save them to a variable.
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of a file and stores them in the file variable.
# Print a portion of the content read from the file.
print(file[:500])
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into low-dimensional continuous vectors. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language processing, vectorization, deep learning
Token
Definition: A token is a smaller unit of text obtained by splitting it. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into the tokens "I", "go", "to", and "school".
Associated keywords: tokenization, natural language
Split text using CharacterTextSplitter
Use the from_tiktoken_encoder method to initialize a text splitter based on the tiktoken encoder.
Print the number of chunks produced.
Reference
When using CharacterTextSplitter.from_tiktoken_encoder, the text is split only by the CharacterTextSplitter, and the tiktoken tokenizer is used only to merge the split text. (This means a split chunk can be larger than the chunk size as measured by the tiktoken tokenizer.)
Using RecursiveCharacterTextSplitter.from_tiktoken_encoder ensures that no split is larger than the chunk size in tokens allowed by the language model; each split is re-split recursively if it is larger. You can also load a tiktoken splitter directly, which guarantees that each split is smaller than the chunk size.
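To make the distinction concrete, here is a minimal sketch of the recursive variant (not part of the original example; the model_name and chunk settings are illustrative assumptions):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive splitter whose chunk size is enforced in tiktoken tokens.
recursive_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",  # illustrative choice; any tiktoken-supported model works
    chunk_size=300,  # maximum chunk size, in tokens
    chunk_overlap=0,  # no overlap between chunks
)
# Oversized splits are recursively re-split until they fit the token budget.
recursive_texts = recursive_splitter.split_text(file)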
TokenTextSplitter
Split text into token units using the TokenTextSplitter class.
spaCy
spaCy is an open-source software library for advanced natural language processing, written in the Python and Cython programming languages.
Another alternative to NLTK is to use the spaCy tokenizer.
How the text is split: by the spaCy tokenizer.
How the chunk size is measured: by the number of characters.
Pip command to upgrade the spaCy library to the latest version.
Download the en_core_web_sm model.
Open the appendix-keywords.txt file and read its contents.
Print a portion of the content to check it.
Split the file text using the split_text method of the text_splitter object.
SentenceTransformers
SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models.
Its default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.
Check the sample text.
The following code counts the number of tokens in the file variable and prints the result after excluding the start and end tokens.
Split the text stored in the file variable into chunks using the splitter.split_text() function.
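As a small sketch of the default behavior described above (an assumption-based illustration, not part of the original code; it relies on the splitter's default all-mpnet-base-v2 model):

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# With no explicit tokens_per_chunk, chunks are sized to the token window of
# the underlying sentence-transformer model (all-mpnet-base-v2 by default).
default_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
print(default_splitter.maximum_tokens_per_chunk)  # the model's token window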
NLTK
The Natural Language Toolkit (NLTK) is a collection of libraries and programs for English natural language processing (NLP), written in the Python programming language.
Instead of simply splitting on "\n\n", you can use NLTK to split text based on NLTK tokenizers.
How the text is split: by the NLTK tokenizer.
How the chunk size is measured: by the number of characters.
Pip command to install the nltk library.
NLTK (Natural Language Toolkit) is a Python library for natural language processing.
It supports various NLP tasks, including text data preprocessing, tokenization, morphological analysis, and part-of-speech tagging.
Check the sample text.
Create a text splitter using the NLTKTextSplitter class.
Set the chunk_size parameter to 200 so the text is split into chunks of up to 200 characters.
Split the file text using the split_text method of the text_splitter object.
KoNLPy
KoNLPy (Korean NLP in Python) is a Python package for Korean natural language processing (NLP).
Token splitting involves dividing text into smaller, more manageable units called tokens.
These tokens are often words, phrases, symbols, or other meaningful elements that are important for further processing and analysis.
In languages such as English, token splitting usually involves separating words by spaces and punctuation marks.
How effective token splitting is depends heavily on the tokenizer's understanding of the language structure, which is what ensures the creation of meaningful tokens.
A tokenizer designed for English cannot be used effectively for Korean, because it lacks the ability to understand the distinct semantic structure of languages such as Korean.
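As a quick illustration of that limitation (the sample sentence is an assumption, not data from this tutorial), an English-oriented tokenizer applied to Korean leaves particles fused to the words instead of separating morphemes:

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data used by word_tokenize
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases

# "나는 학교에 간다" ("I go to school"): the particles 는/에 stay attached,
# so no morpheme-level units are produced.
print(word_tokenize("나는 학교에 간다"))  # ['나는', '학교에', '간다']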
Korean token splitting using KoNLPy's Kkma analyzer
For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer).
Kkma provides detailed morphological analysis of Korean text.
It decomposes sentences into words, words into their constituent morphemes, and identifies the part of speech of each token.
It can also segment blocks of text into individual sentences, which is especially useful for processing long texts.
Considerations when using Kkma
Kkma is known for its detailed analysis, but this precision can affect processing speed, so Kkma is best suited to applications where analytical depth is prioritized over fast text processing.
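For illustration, a brief sketch of Kkma's analysis functions (the sample sentence is an assumption; this direct KoNLPy usage is not part of the tutorial's splitter code):

from konlpy.tag import Kkma

kkma = Kkma()
sample = "코엔엘파이는 한국어 자연어 처리를 위한 파이썬 패키지입니다."
print(kkma.morphs(sample))  # decompose the text into morphemes
print(kkma.pos(sample))  # (morpheme, part-of-speech tag) pairs
print(kkma.sentences(sample))  # segment a block of text into sentences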
Pip command to install the KoNLPy library.
KoNLPy is a Python package for Korean natural language processing, providing features such as morphological analysis, part-of-speech tagging, and parsing.
Check the sample text.
This is an example of splitting Korean text using KonlpyTextSplitter.
Split the file text into sentence units using text_splitter.
Hugging Face tokenizer
Hugging Face provides a variety of tokenizers.
This code calculates the token length of the text using GPT2TokenizerFast, one of the Hugging Face tokenizers.
How the text is split: by the characters passed in.
How the chunk size is measured: by the number of tokens calculated by the Hugging Face tokenizer.
Create a tokenizer object using the GPT2TokenizerFast class.
Load the pretrained "gpt2" tokenizer model by calling the from_pretrained method.
Check the sample text.
Initialize the text splitter with the Hugging Face tokenizer via the from_huggingface_tokenizer method.
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
# Set the chunk size to 300.
chunk_size=300,
# Set to ensure that there are no overlapping parts between chunks.
chunk_overlap=0,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)
print(len(texts))  # Print the number of chunks produced.
51
# Print the first element of the texts list.
print(texts[0])
Semantic Search
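As a quick check of the reference note above (not part of the original example), tiktoken itself can measure each chunk; with CharacterTextSplitter, a merged chunk may exceed chunk_size when measured in tokens:

import tiktoken

# Measure every chunk with the default "gpt2" encoding used by from_tiktoken_encoder.
enc = tiktoken.get_encoding("gpt2")
print(max(len(enc.encode(t)) for t in texts))  # can exceed chunk_size=300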
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(
chunk_size=200, # Set the chunk size to 200.
chunk_overlap=0, # Sets the overlap between chunks to 0.
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)
print(texts[0]) # Outputs the first chunk of the split text.
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets" �
%pip install --upgrade --quiet spacy
Note: you may need to restart the kernel to use updated packages.
!python -m spacy download en_core_web_sm --quiet
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Read the contents of a file and store them in the file variable.
# Prints some of the content read from the file.
print(file[:350])
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language
Create a text splitter using the SpacyTextSplitter class.
import warnings
from langchain_text_splitters import SpacyTextSplitter
# Ignore the warning message.
warnings.filterwarnings("ignore")
# Create a SpacyTextSplitter object.
text_splitter = SpacyTextSplitter(
chunk_size=200, # Set the chunk size to 200.
chunk_overlap=50, # Set the overlap between chunks to 50.
)
# Split the file text using text_splitter.
texts = text_splitter.split_text(file)
print(texts[0]) # Outputs the first element of the split text.
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
# Create a sentence splitter and set the overlap between chunks to 0.
splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=0)
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of a file and stores them in the file variable.
# Prints some of the content read from the file.
print(file[:350])
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language
count_start_and_stop_tokens = 2 # Set the number of start and end tokens to 2
# Subtract the number of start and end tokens from the number of tokens in the text.
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count) # Prints the number of computed text tokens.
7686
text_chunks = splitter.split_text(text=file)  # Split the text into chunks.
print(text_chunks[1])  # Print the second chunk to inspect the result.
. This allows the computer to understand and process the text [UNK]. [UNK]: The word "apology" [0. 65, - 0. 23, 0. 17] and [UNK] vector. Associative keyword: natural language processing, vectorization, dipping token definition: token means [UNK] splitting the text into smaller [UNK]. These are usually words, sentences, and [UNK] verses [UNK]. [UNK]: Sentence "I go to school" "I split into ", "to school", "go". Related keywords: tokenization, natural language processing, parsing tokenizer definition: torque
%pip install -qU nltk
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of a file and stores them in the file variable.
# Prints some of the content read from the file.
print(file[:350])
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language
from langchain_text_splitters import NLTKTextSplitter
text_splitter = NLTKTextSplitter(
chunk_size=200, # Set the chunk size to 200.
chunk_overlap=0, # Sets the overlap between chunks to 0.
)
# Split the file text using text_splitter.
texts = text_splitter.split_text(file)
print(texts[0]) # Outputs the first element of the split text.
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
%pip install -qU konlpy
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of a file and stores them in the file variable.
# Prints some of the content read from the file.
print(file[:350])
from langchain_text_splitters import KonlpyTextSplitter
# Create a text splitter object using KonlpyTextSplitter.
text_splitter = KonlpyTextSplitter()
texts = text_splitter.split_text(file)  # Split the Korean document into sentences.
print(texts[0])  # Print the first of the split sentences.
from transformers import GPT2TokenizerFast
# Load the tokenizer for the GPT-2 model.
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# Open the data/appendix-keywords.txt file and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
file = f.read() # Reads the contents of a file and stores them in the file variable.
# Prints some of the content read from the file.
print(file[:350])
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
# Create a CharacterTextSplitter object using the huggingface tokenizer.
hf_tokenizer,
chunk_size=300,
chunk_overlap=50,
)
# Split the file text and store the result in the texts variable.
texts = text_splitter.split_text(file)
print(texts[1])  # Print the second element of the texts list.
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining
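As a closing check (not part of the original example), the same Hugging Face tokenizer can confirm how the chunk size was measured:

# Token count of the second chunk, as measured by the sizing tokenizer.
# It is normally at most chunk_size=300, though character-based merging can
# exceed it (see the tiktoken reference note earlier).
print(len(hf_tokenizer.encode(texts[1])))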