02. Recursive character text splitting (RecursiveCharacterTextSplitter)

RecursiveCharacterTextSplitter

This text splitter is the recommended choice for general text.

It is parameterized by a list of separator characters.

The splitter tries the separators in the order given, moving on to the next one until the chunks are small enough.

The default separator list is ["\n\n", "\n", " ", ""].

  • The text is split recursively in the order paragraph -> sentence -> word.

This keeps paragraphs (then sentences, then words) together for as long as possible, since these units are generally the most strongly related pieces of text.

  1. How the text is split: by the list of characters (["\n\n", "\n", " ", ""]).

  2. How the chunk size is measured: by the number of characters.
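The recursive idea can be sketched in plain Python. This is a simplified illustration only, not the library's actual implementation: the names recursive_split and _merge are hypothetical, and chunk overlap is deliberately left out. It splits on the first separator, greedily rejoins neighbouring pieces that still fit, and falls back to the next separator only for pieces that are too long.

```python
def _merge(pieces, sep, chunk_size):
    """Greedily rejoin neighbouring pieces with sep while they fit in chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        if not piece:
            continue  # skip empty pieces produced by consecutive separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch (hypothetical helper): no chunk overlap is applied."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; keep the oversized piece
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # The empty separator means: cut into fixed-size runs of characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, buffer = [], []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            buffer.append(piece)  # small enough: candidate for merging
        else:
            chunks.extend(_merge(buffer, sep, chunk_size))
            buffer = []
            # Too long: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    chunks.extend(_merge(buffer, sep, chunk_size))
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch, only the second paragraph exceeds 40 characters, so only that paragraph is broken down further at word boundaries; the other two paragraphs are kept whole.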

%pip install -qU langchain-text-splitters

  • Open the appendix-keywords.txt file and read its contents.

  • Save what was read to the file variable.

# Open appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the whole file and store its contents in the file variable.

Print part of the contents read from the file.

# Print the first 500 characters of the file contents.
print(file[:500])

An example of splitting text into small chunks using RecursiveCharacterTextSplitter.

  • Set chunk_size to 250 to limit the size of each chunk.

  • Set chunk_overlap to 50 to allow an overlap of up to 50 characters between adjacent chunks.

  • Set length_function to len so that text length is measured with the len function.

  • Set is_separator_regex to False so that the separators are treated as literal strings rather than regular expressions.

  • Use text_splitter to split the file text into documents.

  • Store the split documents in the texts list.

  • print(texts[0]) and print(texts[1]) output the first and second of the split documents.

Split the file text using the text_splitter.split_text() function.
