04. Semantic chunker

SemanticChunker

Split text based on semantic similarity.

Reference

Greg Kamradt's notebook

This method splits the text into sentences, groups them into windows of three sentences, and then merges the groups that are similar in the embedding space.
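The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and `split_sentences`, `sentence_windows`, `cosine_distance`, and the `buffer_size` argument are names made up for this example.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def sentence_windows(sentences: list[str], buffer_size: int = 1) -> list[str]:
    """Join each sentence with its neighbours (3 sentences when buffer_size=1)."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0


text = "Cats purr. Cats sleep a lot. Cats chase mice. SQL stores rows. SQL joins tables."
sentences = split_sentences(text)
windows = sentence_windows(sentences)
vectors = [toy_embed(w) for w in windows]

# Distance between each window and the next one.
distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
for i, d in enumerate(distances):
    print(f"window {i} -> window {i + 1}: distance {d:.2f}")
```

A larger distance suggests a change of topic between neighbouring windows. Neighbours whose distance stays small end up merged into the same chunk.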

Install the dependency packages

%pip install -qU langchain_experimental langchain_openai

Load the sample text and print its contents.

# Open data/appendix-keywords.txt and create a file object called f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file contents and store them in the file variable.

# Print the first 350 characters read from the file.
print(file[:350])

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc. 
Associates: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language

Creating a SemanticChunker

SemanticChunker is one of LangChain's experimental features. It divides text into semantically similar chunks.

This allows you to process and analyze text data more effectively.

# Load API keys that are managed as environment variables in a .env file.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()

True

Use SemanticChunker to split text into semantically related chunks.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize a semantic chunk splitter using OpenAI embeddings.
text_splitter = SemanticChunker(OpenAIEmbeddings())

Text split

Use text_splitter to split the text stored in file into chunks.

chunks = text_splitter.split_text(file)

Check the split chunks.

# Outputs the first chunk of the split chunks.
print(chunks[0])

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping 

Token 

Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization 

SQL 

Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc. Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management 

CSV 

Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data. Example: CSV files with headers named name, age, job may contain data such as Hong Gil-dong, 30, developer. Associates: data format, file processing, data exchange 

JSON 

Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses readable text for both people and machines to represent data objects. Example: {" Name": "Flood Road", "Age": 30, "Occupation": "Developer" } is data in JSON format. Associates: Data Exchange, Web Development, API 

Transformer 

Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism. Example: Google translators use transformer models to perform translations between different languages. Associated Keywords: deep learning, natural language processing, Attention 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. Associates: Natural language processing, deep learning, library 

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. Related Keywords: innovation, technology, business model 

Crawling 

Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. Example: Crawling is when the Google search engine visits websites on the Internet to collect and index their content. Associated Keywords: data collection, web scraping, search engine 

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.

You can use the create_documents() function to convert the chunks into documents.

# Split using text_splitter.
docs = text_splitter.create_documents([file])
print(docs[0].page_content)  # Prints the contents of the first document among the split documents.

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping 

Token 

Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization 

SQL 

Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc. Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management 

CSV 

Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data. Example: CSV files with headers named name, age, job may contain data such as Hong Gil-dong, 30, developer. Associates: data format, file processing, data exchange 

JSON 

Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses readable text for both people and machines to represent data objects. Example: {" Name": "Flood Road", "Age": 30, "Occupation": "Developer" } is data in JSON format. Associates: Data Exchange, Web Development, API 

Transformer 

Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism. Example: Google translators use transformer models to perform translations between different languages. Associated Keywords: deep learning, natural language processing, Attention 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. Associates: Natural language processing, deep learning, library 

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. Related Keywords: innovation, technology, business model 

Crawling 

Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. Example: Crawling is when the Google search engine visits websites on the Internet to collect and index their content. Associated Keywords: data collection, web scraping, search engine 

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.

Breakpoints

This chunker works by deciding where to break the text between sentences. It does so by looking at the difference between the embeddings of two adjacent sentences.

If the difference exceeds a certain threshold, the text is split at that point.

Reference video: https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580
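As a rough sketch of the breakpoint idea, the snippet below starts a new chunk whenever the difference between two adjacent sentences exceeds a threshold. The function name `split_at_breakpoints`, the sentences, and the distance values are all made up for illustration. In the real chunker the distances come from an embedding model.

```python
def split_at_breakpoints(sentences, distances, threshold):
    """Group sentences into chunks, breaking where distances[i] > threshold.

    distances[i] is the difference between sentences[i] and sentences[i + 1].
    """
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # The gap to the previous sentence is large: close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["A1.", "A2.", "B1.", "B2.", "C1."]
distances = [0.10, 0.80, 0.15, 0.90]  # made-up differences between neighbours
print(split_at_breakpoints(sentences, distances, threshold=0.5))
# → ['A1. A2.', 'B1. B2.', 'C1.']
```

The remaining question is how to pick the threshold, which is what the breakpoint_threshold_type options control.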

Percentile

The default splitting method is based on percentiles (Percentile).

In this method, the differences between all adjacent sentences are calculated, and the text is split wherever a difference exceeds the specified percentile.

text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Set the split criteria type to percentile.
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)

Check the split result.

docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the contents of each document.
    print("===" * 20)

[Chunk 0] 

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
============================================================ 
[Chunk 1] 

Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping 

Token 

Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. 
============================================================ 
[Chunk 2] 

Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. 
============================================================ 
[Chunk 3] 

Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. 
============================================================ 
[Chunk 4] 

Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization 

SQL 

Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc. 
============================================================

Print the length of docs.

print(len(docs))  # Print the number of documents in docs.
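To make the percentile criterion concrete, here is a small standard-library sketch. The `percentile` helper (linear interpolation between closest ranks) and the distance values are assumptions made for illustration, not the library's own code.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


distances = [0.10, 0.80, 0.15, 0.90, 0.20]  # made-up differences between neighbours
threshold = percentile(distances, 70)

# Split after every sentence whose difference to the next one exceeds the threshold.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(f"threshold = {threshold:.2f}")
print("breakpoints:", breakpoints)  # → breakpoints: [1, 3]
```

Raising the percentile raises the threshold, so fewer differences qualify as breakpoints and the chunks get larger.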

Standard Deviation

In this method, the text is split wherever a difference is greater than the number of standard deviations specified by breakpoint_threshold_amount.

Set the breakpoint_threshold_type parameter to "standard_deviation" to use the standard deviation as the chunk splitting criterion.

text_splitter = SemanticChunker(
    # Initialize the semantic chunker using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # We use standard deviation as the splitting criterion.
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,
)

Check the split results.

# Split using text_splitter.
docs = text_splitter.create_documents([file])
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the contents of each document.
    print("===" * 20)

[Chunk 0] 

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping 

Token 

Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. 
============================================================ 
[Chunk 1] 

Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization 

SQL 

Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc. 
============================================================ 
[Chunk 2] 

Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management 

CSV 

Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data. Example: CSV files with headers named name, age, job may contain data such as Hong Gil-dong, 30, developer. Associates: data format, file processing, data exchange 

JSON 

Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses readable text for both people and machines to represent data objects. Example: {" Name": "Flood Road", "Age": 30, "Occupation": "Developer" } is data in JSON format. Associates: Data Exchange, Web Development, API 

Transformer 

Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism. 
============================================================ 
[Chunk 3] 

Example: Google translators use transformer models to perform translations between different languages. Associated Keywords: deep learning, natural language processing, Attention 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. 
============================================================ 
[Chunk 4] 

Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. Associates: Natural language processing, deep learning, library 

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. 
============================================================

Print the length of docs.

print(len(docs))  # Print the number of documents in docs.
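A standard-library sketch of the standard deviation criterion follows. It assumes the threshold is the mean difference plus breakpoint_threshold_amount times the population standard deviation. That formula is an assumption for illustration and should be checked against the SemanticChunker source. The distance values are made up.

```python
import statistics

distances = [0.10, 0.12, 0.11, 0.90, 0.13, 0.10]  # made-up differences between neighbours
amount = 1.25  # plays the role of breakpoint_threshold_amount

# Assumed rule: threshold = mean + amount * (population) standard deviation.
threshold = statistics.fmean(distances) + amount * statistics.pstdev(distances)

breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(f"threshold = {threshold:.3f}")
print("breakpoints:", breakpoints)  # → breakpoints: [3]
```

Because the threshold is relative to the spread of the differences, only unusually large jumps in meaning trigger a split.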

Interquartile

In this method, chunks are split using the interquartile range (IQR).

Set the breakpoint_threshold_type parameter to "interquartile" to use the interquartile range as the chunk splitting criterion.

text_splitter = SemanticChunker(
    # Initialize a semantic chunk splitter using OpenAI's embedding model.
    OpenAIEmbeddings(),
    # Set the split criteria threshold type to interquartile range.
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5,
)

# Split using text_splitter.
docs = text_splitter.create_documents([file])

# Prints the results.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the contents of each document.
    print("===" * 20)

[Chunk 0] 

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
============================================================ 
[Chunk 1] 

Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping 

Token 

Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. 
============================================================ 
[Chunk 2] 

Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. 
============================================================ 
[Chunk 3] 

Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization 

SQL 

Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc. 
============================================================ 
[Chunk 4] 

Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management 

CSV 

Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data. 
============================================================

Print the length of docs.

print(len(docs))  # Print the number of documents in docs.
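The interquartile criterion can be sketched in the same way. The snippet assumes the threshold is the mean difference plus breakpoint_threshold_amount times the IQR (the 75th percentile minus the 25th percentile). That formula is an assumption for illustration, and the distance values are made up.

```python
import statistics

distances = [0.10, 0.12, 0.11, 0.90, 0.13, 0.10]  # made-up differences between neighbours
amount = 0.5  # plays the role of breakpoint_threshold_amount

# Quartiles with linear interpolation; the middle value (the median) is not needed.
q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: threshold = mean + amount * IQR.
threshold = statistics.fmean(distances) + amount * iqr

breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(f"IQR = {iqr:.3f}, threshold = {threshold:.3f}")
print("breakpoints:", breakpoints)  # → breakpoints: [3]
```

The IQR ignores the extreme values at both ends, so a single very large difference inflates this threshold less than it inflates one based on the standard deviation.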
