01. Character text split (CharacterTextSplitter)
CharacterTextSplitter
This is the simplest splitting method. It splits the text on a single separator character, "\n\n" by default, and measures chunk size by the number of characters.
How the text is split: by a single separator character
How the chunk size is measured: by the number of characters
%pip install -qU langchain-text-splitters

Open the ./data/appendix-keywords.txt file, read its contents, and save them to the file variable.
# Open ./data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the contents of the file and store them in a variable.

# Print part of the content read from the file.
print(file[:500])

Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching to understand the meaning of the user's query and return related results.
Example: When a user searches for "planets in the solar system", it returns information about related planets such as "Vegetic" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
Definition: A token is a smaller unit into which text is split. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into the tokens "I", "to school", and "go".
Associated Keywords: Tokenization, Natural Language

Now split the text into chunks using CharacterTextSplitter, configured with the following parameters (a code sketch follows the list).
separator: the string on which to split; the default is "\n\n".
chunk_size: set to 250 to limit each chunk to a maximum of 250 characters.
chunk_overlap: set to 50 so that adjacent chunks overlap by 50 characters.
length_function: set to len, the function used to measure the length of the text.
is_separator_regex: set to False so the separator is treated as a plain string rather than a regular expression.
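A minimal sketch of the splitter configured with the parameters above; the import path assumes the langchain-text-splitters package installed earlier.

from langchain_text_splitters import CharacterTextSplitter

# Create a splitter that cuts on "\n\n" and measures length in characters.
text_splitter = CharacterTextSplitter(
    separator="\n\n",          # split on this string
    chunk_size=250,            # maximum characters per chunk
    chunk_overlap=50,          # overlapping characters between adjacent chunks
    length_function=len,       # function used to measure chunk length
    is_separator_regex=False,  # treat the separator as a plain string
)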
Use text_splitter to split file into documents, then check the first document in the resulting list (texts[0]), as in the sketch below.
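A short sketch of this step, assuming the text_splitter defined above:

texts = text_splitter.create_documents([file])  # split the text into Document chunks
print(texts[0])  # the first Document produced by the split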
Here is an example of passing metadata along with the documents.
Notice that the metadata is split together with the documents.
The create_documents method takes a list of texts and a list of metadata dicts as arguments, as in the sketch below.
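A sketch of passing metadata; the metadata values here are hypothetical and only illustrate that each dict is attached to the text at the same index.

# Hypothetical metadata, one dict per input text.
metadatas = [
    {"document": 1},
    {"document": 2},
]

documents = text_splitter.create_documents(
    [file, file],         # list of texts to split
    metadatas=metadatas,  # metadata attached to every chunk of the matching text
)
print(documents[0])  # each chunk carries the metadata of its source text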
You can also split the raw text with the split_text() method.
text_splitter.split_text(file)[0] splits file using text_splitter and returns the first element of the resulting list of text fragments.
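A one-line sketch of the same call, assuming the text_splitter and file defined above:

# Split the raw string and show the first chunk.
print(text_splitter.split_text(file)[0])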