02. Recursive character text splitting (RecursiveCharacterTextSplitter)
RecursiveCharacterTextSplitter
This text splitter is the recommended choice for general text.
It is parameterized by a list of separator characters.
The splitter tries the separators in the given order until the chunks are small enough.
The default separator list is ["\n\n", "\n", " ", ""].
In other words, the text is split recursively in the order paragraph -> sentence -> word.
This keeps paragraphs (then sentences, then words) together for as long as possible, since those units tend to be the most strongly related pieces of text.
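The recursive behavior described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name recursive_split and the sample text are hypothetical, and chunk overlap is ignored.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text on the coarsest separator first, falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator): cut into fixed-size runs of characters.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep growing the current chunk
            continue
        if current:
            chunks.append(current)  # flush the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph, short.\n\n"
    "Second paragraph is a bit longer\nand spans two lines.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The second paragraph is longer than 40 characters, so only that paragraph is split further at the line break; the other paragraphs stay whole.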
How the text is split: by the list of characters (["\n\n", "\n", " ", ""]).
How the chunk size is measured: by the number of characters.
%pip install -qU langchain-text-splitters
Open the appendix-keywords.txt file, read its contents, and save them to the file variable.
# Open appendix-keywords.txt and create a file object called f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file's contents and store them in the file variable.
Print part of the contents read from the file.
# Print the first 500 characters of the file's contents.
print(file[:500])
An example of splitting text into small chunks using RecursiveCharacterTextSplitter.
Set chunk_size to 250 to limit the size of each chunk.
Set chunk_overlap to 50 to allow up to 50 characters of overlap between adjacent chunks.
Set length_function to len to measure text length in characters.
Set is_separator_regex to False so that separators are treated as literal strings rather than regular expressions.
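To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal fixed-width sketch in plain Python. It is an assumption-laden simplification: the helper window_chunks is hypothetical, and the real splitter cuts at separators, so its actual overlap can be shorter than 50 characters or absent.

```python
def window_chunks(text, chunk_size=250, chunk_overlap=50):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop early so the last window is never fully contained in the one before it.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Hypothetical 600-character input made of repeating digits.
text = "".join(str(i % 10) for i in range(600))
windows = window_chunks(text)

print([len(w) for w in windows])
# The tail of one window is the head of the next.
print(windows[0][-50:] == windows[1][:50])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one of the two neighboring chunks, at the cost of some duplicated text.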
Use text_splitter to split the file text into documents.
The split documents are stored in the texts list.
print(texts[0]) and print(texts[1]) output the first and second of the split documents.
Use the text_splitter.split_text() function to split the file text.