# 09. Text (TextLoader)

## TXT Loader <a href="#txt-loader" id="txt-loader"></a>

Let's look at how to load files with the `.txt` extension using `TextLoader`.

```
from langchain_community.document_loaders import TextLoader

# Create a text loader
loader = TextLoader("data/appendix-keywords.txt")

# Load the document
docs = loader.load()
print(f"Number of documents: {len(docs)}\n")
print("[Metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview] =========\n")
print(docs[0].page_content[:500])
```

```
Number of documents: 1

[Metadata]

{'source': 'data/appendix-keywords.txt'}

========= [Preview] =========

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching for a user's query, grasping its meaning and returning related results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining

Embedding

Definition: Embedding is the process of converting text data, such as words or sentences, into low-dimensional, continuous vectors. This allows a computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Associated Keywords: tokenization, natural language processing
```
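Conceptually, `TextLoader` reads the entire file into a single `Document` whose metadata records the source path. That behavior can be sketched with the standard library alone — the `Document` dataclass and `load_text` helper below are hypothetical stand-ins, not LangChain's actual classes:

```python
import os
import tempfile
from dataclasses import dataclass, field

# Hypothetical stand-in for LangChain's Document class
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Hypothetical helper mirroring what TextLoader.load() returns:
# one Document holding the whole file, with the path as metadata.
def load_text(path: str, encoding: str = "utf-8") -> list:
    with open(path, encoding=encoding) as f:
        return [Document(page_content=f.read(), metadata={"source": path})]

# Demo with a temporary file standing in for data/appendix-keywords.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("Semantic Search\n")
    path = f.name

docs = load_text(path)
os.remove(path)
print(len(docs), docs[0].metadata)
```

This is why `len(docs)` above is 1: the loader does not split the text; splitting is a separate step handled by text splitters.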

### Automatic detection of file encoding via TextLoader <a href="#textloader" id="textloader"></a>

In this example, we'll look at strategies that are useful when using the `TextLoader` class to bulk-load an arbitrary list of files from a directory.

First, let's load multiple text files with varying encodings to illustrate the problem.

* `silent_errors` : You can pass the `silent_errors` parameter to `DirectoryLoader` to skip files that cannot be loaded and continue the loading process.
* `autodetect_encoding` : You can also pass `autodetect_encoding` to the loader class to request that the file encoding be detected automatically before it fails.

```
from langchain_community.document_loaders import DirectoryLoader

path = "data/"

text_loader_kwargs = {"autodetect_encoding": True}

loader = DirectoryLoader(
    path,
    glob="**/*.txt",
    loader_cls=TextLoader,
    silent_errors=True,
    loader_kwargs=text_loader_kwargs,
)
docs = loader.load()
```
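`DirectoryLoader` selects files with the `glob` pattern; `**/*.txt` matches `.txt` files recursively in all subdirectories. What a given pattern will match can be previewed with `pathlib` alone (the throwaway directory tree below is purely illustrative):

```python
import pathlib
import tempfile

# Build a throwaway directory tree to show what "**/*.txt" matches
root = pathlib.Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.txt").write_text("top-level file")
(root / "sub" / "b.txt").write_text("nested file")
(root / "notes.md").write_text("not matched")

matches = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))
print(matches)  # ['a.txt', 'sub/b.txt']
```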

The files whose names are similar to `data/appendix-keywords.txt` are variants of the same file, each saved with a different encoding.

```
doc_sources = [doc.metadata["source"] for doc in docs]
doc_sources
```

```
['data/appendix-keywords-CP949.txt',
 'data/reference.txt',
 'data/appendix-keywords-EUCKR.txt',
 'data/chain-of-density.txt',
 'data/appendix-keywords.txt',
 'data/appendix
```
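The CP949 and EUC-KR variants load successfully because `autodetect_encoding=True` lets the loader detect the file's encoding instead of failing on the default one. The idea can be sketched as a simple fallback loop — `read_text_any` is a hypothetical helper, not LangChain's actual detector, which relies on a real charset-detection library:

```python
import os
import tempfile

# Hypothetical fallback reader: try candidate encodings until one decodes
# cleanly. This only illustrates the "detect before failing" idea.
def read_text_any(path, encodings=("utf-8", "cp949", "euc-kr")):
    data = open(path, "rb").read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo: a CP949-encoded file is unreadable as UTF-8 but decodes on fallback.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("시맨틱 검색".encode("cp949"))
    tmp_path = f.name

text, detected = read_text_any(tmp_path)
os.remove(tmp_path)
print(detected)  # cp949
```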

```
print("[Metadata]\n")
print(docs[2].metadata)
print("\n========= [Preview] =========\n")
print(docs[2].page_content[:500])
```

```
[Metadata] 

{'source': 'data/appendix-keywords-EUCKR.txt'}

========= [Preview] =========

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching for a user's query, grasping its meaning and returning related results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining

Embedding

Definition: Embedding is the process of converting text data, such as words or sentences, into low-dimensional, continuous vectors. This allows a computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Associated Keywords: tokenization, natural language processing
```

```
print("[Metadata]\n")
print(docs[3].metadata)
print("\n========= [Preview] =========\n")
print(docs[3].page_content[:500])
```

```
[Metadata]

{'source': 'data/chain-of-density.txt'}

========= [Preview] =========

Selecting the “right” amount of information to include in a summary is a difficult task.
A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries genera
```

```
print("[Metadata]\n")
print(docs[4].metadata)
print("\n========= [Preview] =========\n")
print(docs[4].page_content[:500])
```

```
[Metadata] 

{'source': 'data/appendix-keywords.txt'}

========= [Preview] =========

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching for a user's query, grasping its meaning and returning related results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining

Embedding

Definition: Embedding is the process of converting text data, such as words or sentences, into low-dimensional, continuous vectors. This allows a computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Associated Keywords: tokenization, natural language processing
```
