01. Character text split (CharacterTextSplitter)
CharacterTextSplitter
This is the simplest splitting method. It splits the text on a single separator character, "\n\n" by default, and measures chunk size by the number of characters.
How the text is split: by a single separator character
How the chunk size is measured: by the number of characters
%pip install -qU langchain-text-splitters

Open the ./data/appendix-keywords.txt file, read its contents, and save them to the file variable.
# Open ./data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the contents of the file and store them in a variable.

# Print part of the content read from the file.
print(file[:500])

Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching to understand the meaning of the user's query and return related results.
Example: When a user searches for "planets in the solar system", it returns information about related planets such as "Vegetic" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
Definition: A token is a smaller unit into which text is split. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into the tokens "I", "to school", and "go".
Associated Keywords: Tokenization, Natural Language

Now split the text into chunks using CharacterTextSplitter, configured with the following parameters (a code sketch follows the list).
separator: the string on which to split; the default is "\n\n".
chunk_size: set to 250 to limit each chunk to a maximum of 250 characters.
chunk_overlap: set to 50 so that adjacent chunks overlap by 50 characters.
length_function: set to len, the function used to measure the length of the text.
is_separator_regex: set to False so the separator is treated as a plain string rather than a regular expression.
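A minimal sketch of the splitter configured with the parameters above; the import path assumes the langchain-text-splitters package installed earlier.

from langchain_text_splitters import CharacterTextSplitter

# Create a splitter that cuts on "\n\n" and measures length in characters.
text_splitter = CharacterTextSplitter(
    separator="\n\n",          # split on this string
    chunk_size=250,            # maximum characters per chunk
    chunk_overlap=50,          # overlapping characters between adjacent chunks
    length_function=len,       # function used to measure chunk length
    is_separator_regex=False,  # treat the separator as a plain string
)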
Use text_splitter to split file into documents, then check the first document in the resulting list (texts[0]), as in the sketch below.
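A short sketch of this step, assuming the text_splitter defined above:

texts = text_splitter.create_documents([file])  # split the text into Document chunks
print(texts[0])  # the first Document produced by the split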
Here is an example of passing metadata along with the documents.
Notice that the metadata is split together with the documents.
The create_documents method takes a list of texts and a list of metadata dicts as arguments, as in the sketch below.
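A sketch of passing metadata; the metadata values here are hypothetical and only illustrate that each dict is attached to the text at the same index.

# Hypothetical metadata, one dict per input text.
metadatas = [
    {"document": 1},
    {"document": 2},
]

documents = text_splitter.create_documents(
    [file, file],         # list of texts to split
    metadatas=metadatas,  # metadata attached to every chunk of the matching text
)
print(documents[0])  # each chunk carries the metadata of its source text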
You can also split the raw text with the split_text() method.
text_splitter.split_text(file)[0] splits file using text_splitter and returns the first element of the resulting list of text fragments.
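A one-line sketch of the same call, assuming the text_splitter and file defined above:

# Split the raw string and show the first chunk.
print(text_splitter.split_text(file)[0])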