04. Semantic chunker
SemanticChunker
Split text based on semantic similarity.
Reference
This method splits the text into sentences, groups the sentences in threes, and then merges groups that are similar in the embedding space.
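The first two steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: the helper name `sentence_windows` is hypothetical, the sentence-splitting rule is a simple regular expression, and "grouping three sentences" is read here as joining each sentence with one neighbor on either side.

```python
import re


def sentence_windows(text, buffer_size=1):
    """Split text into sentences and join each one with its neighbors."""
    # Assumed rule: a sentence ends at '.', '?' or '!' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        # With buffer_size=1 a window holds up to three sentences.
        windows.append(" ".join(sentences[start:end]))
    return sentences, windows


sentences, windows = sentence_windows("Cats purr. Dogs bark. Birds sing. Fish swim.")
for window in windows:
    print(window)
```

Each window, rather than each bare sentence, would then be embedded, which smooths out the comparison between neighboring positions.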
Install dependency package
%pip install -qU langchain_experimental langchain_openai
Load the sample text and print its contents.
# Open data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file contents into the variable file.
# Print the first 350 characters read from the file.
print(file[:350])
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching to grasp the meaning of a user's query and return related results.
Example: When a user searches for "solar system planets", it returns information about related planets, such as "Jupiter" and "Mars".
Related keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Related keywords: Natural Language
Creating a SemanticChunker
SemanticChunker is one of LangChain's experimental features. It divides text into semantically similar chunks.
This allows you to process and analyze text data more effectively.
Use SemanticChunker to split text into semantically related chunks.
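A hedged sketch of this step follows. It assumes the class is imported from `langchain_experimental.text_splitter` and that OpenAI embeddings from `langchain_openai` serve as the embedding model, which requires `OPENAI_API_KEY` to be set. The helper name `build_semantic_chunker` is hypothetical; the imports are deferred into the function so the snippet loads even before the packages are installed.

```python
def build_semantic_chunker(embeddings=None, **kwargs):
    """Return a SemanticChunker built on the given embedding model."""
    # Deferred import: only needed when the helper is actually called.
    from langchain_experimental.text_splitter import SemanticChunker

    if embeddings is None:
        # Assumption: default to OpenAI embeddings (needs OPENAI_API_KEY).
        from langchain_openai import OpenAIEmbeddings

        embeddings = OpenAIEmbeddings()
    # Extra keyword arguments, such as breakpoint_threshold_type,
    # are passed straight through to SemanticChunker.
    return SemanticChunker(embeddings, **kwargs)


# Typical use, once the packages and API key are available:
# text_splitter = build_semantic_chunker()
```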
Text split
Use text_splitter to split the text in file into chunks.
Check the resulting chunks.
You can use the create_documents() function to convert the chunks into documents.
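The two calls described here can be sketched as follows. The helper name `split_and_wrap` is hypothetical; it assumes `text_splitter` is a SemanticChunker and that `file` holds the sample text loaded earlier.

```python
def split_and_wrap(text_splitter, text):
    """Return (chunks, docs) produced by an existing splitter for text."""
    # split_text() returns the chunks as plain strings.
    chunks = text_splitter.split_text(text)
    # create_documents() takes a list of texts and returns document objects.
    docs = text_splitter.create_documents([text])
    return chunks, docs


# Typical use, assuming text_splitter and file already exist:
# chunks, docs = split_and_wrap(text_splitter, file)
# print(chunks[0])
# print(docs[0].page_content)
```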
Breakpoints
This chunker works by deciding where to break between sentences. It does so by looking at the difference between the embeddings of two adjacent sentences.
If the difference exceeds a certain threshold, the text is split at that point.
Reference video: https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580
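The breakpoint idea can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the two-dimensional "embeddings" are made up, cosine distance is used as the measure of difference, and the helper names and the threshold of 0.5 are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_at_breakpoints(sentences, embeddings, threshold):
    """Start a new chunk wherever adjacent embeddings differ by more than threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
# Made-up vectors: the first two point one way, the last two another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = split_at_breakpoints(sentences, toy_embeddings, threshold=0.5)
print(chunks)
```

The only large distance is between the second and third sentences, so that is where the text is cut. The sections below differ only in how the threshold is chosen.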
Percentile
The default splitting method is based on percentiles.
In this method, the differences between all adjacent sentences are calculated, and the text is split wherever a difference exceeds the specified percentile.
Check the split results.
Print the length of docs.
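To make the percentile rule concrete, here is a minimal sketch in plain Python. The distance values are made up, the helper `percentile` is a hypothetical stand-in that uses linear interpolation, and the value 75 is an arbitrary choice; the library's own computation may differ in detail.

```python
def percentile(values, pct):
    """Return the pct-th percentile of values using linear interpolation."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Made-up distances between consecutive sentences
# (index i is the gap between sentence i and sentence i + 1).
distances = [0.05, 0.10, 0.08, 0.60, 0.07, 0.09, 0.55, 0.06, 0.11]

threshold = percentile(distances, 75)
# A breakpoint is any gap whose distance exceeds the threshold.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

With the library, the corresponding choice would be made by passing `breakpoint_threshold_type="percentile"` together with a `breakpoint_threshold_amount` when constructing SemanticChunker. A lower percentile yields more breakpoints and therefore more, shorter chunks.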
Standard Deviation
In this method, the text is split wherever a difference exceeds the number of standard deviations specified by breakpoint_threshold_amount.
Set the breakpoint_threshold_type parameter to "standard_deviation" to use the standard deviation as the chunk-splitting criterion.
Check the split results.
Print the length of docs.
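The standard deviation rule can be sketched the same way. This is an illustration under assumptions rather than the library's code: the distances are made up, the helper `std_threshold` is hypothetical, and the threshold is assumed to be measured upward from the mean of the distances.

```python
import statistics


def std_threshold(distances, amount):
    """Return mean + amount * population standard deviation (assumed rule)."""
    return statistics.mean(distances) + amount * statistics.pstdev(distances)


# Made-up distances between consecutive sentences.
distances = [0.05, 0.10, 0.08, 0.60, 0.07, 0.09, 0.55, 0.06, 0.11]

threshold = std_threshold(distances, amount=1.25)
# A breakpoint is any gap whose distance exceeds the threshold.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(threshold, 4), breakpoints)
```

In this sketch `amount` plays the role of breakpoint_threshold_amount: raising it moves the threshold further above the typical distance, so fewer gaps qualify as breakpoints.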
Interquartile
In this method, chunks are split using the interquartile range (IQR).
Set the breakpoint_threshold_type parameter to "interquartile" to use the interquartile range as the chunk-splitting criterion.
Print the length of docs.
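A matching sketch for the interquartile rule is shown below. As before, the distances are made up and the helper `iqr_threshold` is hypothetical. The threshold is assumed to be the mean plus a multiple of the IQR, and the multiplier 1.5 is an illustrative choice.

```python
import statistics


def iqr_threshold(distances, amount):
    """Return mean + amount * interquartile range (assumed rule)."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return statistics.mean(distances) + amount * (q3 - q1)


# Made-up distances between consecutive sentences.
distances = [0.05, 0.10, 0.08, 0.60, 0.07, 0.09, 0.55, 0.06, 0.11]

threshold = iqr_threshold(distances, amount=1.5)
# A breakpoint is any gap whose distance exceeds the threshold.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(threshold, 4), breakpoints)
```

Because the IQR ignores the largest and smallest values, this threshold is less affected by a few extreme distances than the standard deviation rule.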