05. Code splitting (Python, Markdown, Java, C++, C#, Go, JS, LaTeX, etc.)

Split code

CodeTextSplitter allows you to split code written in various programming languages.

To do this, import the Language enum and specify the corresponding programming language.

%pip install -qU langchain-text-splitters

This is an example of splitting text using RecursiveCharacterTextSplitter.

Import the Language enum and the RecursiveCharacterTextSplitter class from the langchain_text_splitters module.
RecursiveCharacterTextSplitter is a text splitter that recursively splits text on a list of separator characters.

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

Get a complete list of supported languages.

# Get a full list of supported languages
[e.value for e in Language]

['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol', 'csharp']

You can use the get_separators_for_language method of the RecursiveCharacterTextSplitter class to check the separators used for a particular language.

In this example, the Language.PYTHON enum value is passed as the argument to check the separators used for Python.

# You can check the delimiters used for a given language.
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
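The separator list is ordered from coarse (class and function boundaries) to fine (single characters). The following is a minimal pure-Python sketch of how such an ordered list can drive recursive splitting. It is a simplified illustration, not the library's implementation: the recursive_split function is a hypothetical name, and unlike RecursiveCharacterTextSplitter it does not merge small neighbouring pieces back together or support overlap.

```python
import re


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text on the first separator it contains, recursing on oversized pieces."""
    # Base case: the text already fits, so keep it (unless it is only whitespace).
    if len(text) <= chunk_size:
        stripped = text.strip()
        return [stripped] if stripped else []

    for index, separator in enumerate(separators):
        if separator and separator in text:
            # Split *before* each separator so it stays attached to the next piece.
            pieces = re.split(f"(?={re.escape(separator)})", text)
            finer_separators = separators[index + 1:]
            chunks: list[str] = []
            for piece in pieces:
                chunks.extend(recursive_split(piece, finer_separators, chunk_size))
            return chunks

    # No separator applies: fall back to fixed-width cuts.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]

SAMPLE = '''
def hello_world():
    print("Hello, World!")

hello_world()
'''

chunks = recursive_split(SAMPLE, PYTHON_SEPARATORS, chunk_size=50)
for chunk in chunks:
    print(repr(chunk))
```

Coarse separators are tried first so that a chunk boundary falls on a function or class definition whenever the size limit allows it.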

Python

The following example uses RecursiveCharacterTextSplitter:

Use RecursiveCharacterTextSplitter to split Python code into Document units.
Set the language parameter to Language.PYTHON to specify the Python language.
Set chunk_size to 50 to limit the maximum size of each Document.
Set chunk_overlap to 0 so that no text is duplicated between Documents.

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)

Create the Document objects. The created Documents are returned as a list.

python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

for doc in python_docs:
    print(doc.page_content, end="\n==================\n")

def hello_world():
    print("Hello, World!")
==================
hello_world()
==================

JS

Here is an example using the JS text splitter.

JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)

js_docs = js_splitter.create_documents([JS_CODE])
js_docs

[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}'), Document(page_content='helloWorld();')]

TS

Here is an example using the TS text splitter.

TS_CODE = """
function helloWorld(): void {
  console.log("Hello, World!");
}

helloWorld();
"""

ts_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS, chunk_size=60, chunk_overlap=0
)
ts_docs = ts_splitter.create_documents([TS_CODE])
ts_docs

[Document(page_content='function helloWorld(): void {'), Document(page_content='console.log("Hello, World!");\n}'), Document(page_content='helloWorld();')]

Markdown

Here is an example using the Markdown text splitter.

markdown_text = """
# 🦜️🔗 LangChain

⚡ Build super-fast applications with LLM ⚡

## quick installation

```bash
pip install langchain
```

It is an open source project in a rapidly developing field. Thank you for your support 🙏
"""

Split and print the results.

md_splitter = RecursiveCharacterTextSplitter.from_language(
    # Create a text splitter using the Markdown language
    language=Language.MARKDOWN,
    # Set chunk size to 60
    chunk_size=60,
    # Make sure there are no overlapping parts between chunks
    chunk_overlap=0,
)
# Create a document by splitting markdown text
md_docs = md_splitter.create_documents([markdown_text])
# Output the generated document
md_docs

[Document(page_content='# 🦜️🔗 LangChain\n\n⚡ Build super-fast applications with LLM ⚡'), Document(page_content='## quick installation\n\n```bash\npip install ... Thank you for your support 🙏')]

LaTeX

LaTeX is a markup language for writing documents, widely used to express mathematical symbols and formulas.

Here is an example of LaTeX text.

# A raw string keeps the backslashes literal; without it, "\b" in \begin becomes a backspace character.
latex_text = r"""
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
% LLM is a type of machine learning model that can learn from large amounts of text data and generate human-like language.
% In recent years, LLM has made significant progress in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
% Early LLMs were developed in the 1980s and 1990s, but were limited by the amount of data they could process and the computing power available at the time.
% However, over the past decade, advances in hardware and software have made it possible to train LLMs on large datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
% The LLM has many applications across industries, including chatbots, content creation, and virtual assistants.
% It can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

Split and print the results.

latex_splitter = RecursiveCharacterTextSplitter.from_language(
    # Create a text splitter using the LaTeX language.
    language=Language.LATEX,
    # Set the size of each chunk to 60 characters.
    chunk_size=60,
    # Set the number of overlapping characters between chunks to 0.
    chunk_overlap=0,
)
# Split latex_text to generate a list of documents.
latex_docs = latex_splitter.create_documents([latex_text])
# Print the list of generated documents.
latex_docs

[Document(page_content='\\documentclass{article}\n\n\\begin{document}\n\n\\maketitle'), Document(page_content='\\section{... Data can be used for various natural language processing operations, such as emotional analysis.'), Document(page_content='\\subsection{History of LLMs}\n% Initial LLM was developed in 1980s and 1990s'), Document(page_content='..., which led to a great improvement in performance.'), Document(page_content='\\subsection{Applications of LLMs}\n% LLM has chatbots, content creation, virtual'), Document(page_content='...\n% can also be used in academia for linguistics, psychology, computer linguistics'), Document(page_content='research.\n\n\\end{document}')]

HTML

Here is an example using the HTML text splitter:

html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;  
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>  
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

Split and print the results.

html_splitter = RecursiveCharacterTextSplitter.from_language(
    # Create a text splitter using the HTML language
    language=Language.HTML,
    # Set chunk size to 60
    chunk_size=60,
    # Make sure there are no overlapping parts between chunks
    chunk_overlap=0,
)
# Split given HTML text to create a document
html_docs = html_splitter.create_documents([html_text])
# Output the generated document
html_docs

The result is a list of Document objects, each holding a fragment of the HTML text no longer than 60 characters.

Solidity

Here is an example using the Solidity text splitter:

Store the Solidity code as a string in the SOL_CODE variable.
Create sol_splitter by using RecursiveCharacterTextSplitter to split the Solidity code into chunks.
Set the language parameter to Language.SOL to specify the Solidity language.
Set chunk_size to 128 to specify the maximum size of each chunk.
Set chunk_overlap to 0 to avoid overlap between chunks.
Use the sol_splitter.create_documents() method to split SOL_CODE into chunks, and store the resulting chunks in the sol_docs variable.
Print sol_docs to check the split Solidity code chunks.


SOL_CODE = """
pragma solidity ^0.8.20; 
contract HelloWorld {  
   function add(uint a, uint b) pure public returns(uint) {
       return a + b;
   }
}
"""

# Split and print the results
sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)

sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs


[Document(page_content='pragma solidity ^0.8.20;'), Document(page_content='contract HelloWorld {  \n   function add(uint a, uint b) pure public returns(uint) {\n       return a + b;\n   }\n}')]

C#

Here is an example using the C# text splitter:

C_CODE = """
using System;
class Program
{
    static void Main()
    {
        Console.WriteLine("Enter a number (1-5):");
        int input = Convert.ToInt32(Console.ReadLine());
        for (int i = 1; i <= input; i++)
        {
            if (i % 2 == 0)
            {
                Console.WriteLine($"{i} is even.");
            }
            else
            {
                Console.WriteLine($"{i} is odd.");
            }
        }
        Console.WriteLine("Goodbye!");
    }
}
"""

# Split and print the results.
c_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CSHARP, chunk_size=128, chunk_overlap=0
)
c_docs = c_splitter.create_documents([C_CODE])
c_docs



[Document(page_content='using System;'), Document(page_content='class Program\n{\n    static void Main()\n    {\n        Console.WriteLine("Enter a number (1-5):");'), ..., Document(page_content='if (i % 2 == 0)\n            {\n                Console.WriteLine($"{i} is even.");...'), ..., Document(page_content='}\n}')]
