# 01. Structure of a Document

## Document & Document Loaders <a href="#document-document-loaders" id="document-document-loaders"></a>

**Reference**

* [Main loaders used in LangChain](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/)
* [List of loaders available in LangChain](https://python.langchain.com/v0.1/docs/integrations/document_loaders/)

### Document used for practice <a href="#id-1" id="id-1"></a>

Software Policy & Research Institute (SPRi), December 2023 issue

* Authors: Jaeheung Lee (Principal Researcher, AI Policy Lab), Lee Ji-soo (Researcher, AI Policy Lab)
* Link: <https://spri.kr/posts/view/23669>
* File name: `SPRI_AI_Brief_2023년12월호_F.pdf`

### Document <a href="#document" id="document"></a>

This is the basic document object of LangChain.

**Properties**

* `page_content`: A string containing the content of the document.
* `metadata`: A dictionary containing the document's metadata.

```
from langchain_core.documents import Document

document = Document("Hello, this is Langchain's document.")
```

```
# Check the properties of a document
document.__dict__
```

```
{'id': None, 'metadata': {}, 'page_content': "Hello, this is Langchain's document.", 'type': 'Document'}
```

Entries can be added to `metadata` like any other dictionary:

```
# Add metadata
document.metadata["source"] = "TeddyNote"
document.metadata["page"] = 1
document.metadata["author"] = "Teddy"
```

```
# Check the properties of a document
document.metadata
```

```
{'source': 'TeddyNote', 'page': 1, 'author': 'Teddy'}
```
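Metadata is what downstream code usually filters on. The sketch below illustrates this with a small stand-in dataclass, `SimpleDocument` (a hypothetical name, not LangChain's class), so that it runs without LangChain installed. It mirrors only the two properties described above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two properties of a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    SimpleDocument("First page", {"source": "TeddyNote", "page": 1}),
    SimpleDocument("Second page", {"source": "TeddyNote", "page": 2}),
    SimpleDocument("Other file", {"source": "elsewhere", "page": 1}),
]

# Keep only the documents whose metadata matches a given source.
teddy_docs = [d for d in docs if d.metadata.get("source") == "TeddyNote"]
print([d.metadata["page"] for d in teddy_docs])
```

With LangChain itself, the same pattern applies to real `Document` objects, since `metadata` is a plain dictionary.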

### Document Loader <a href="#document-loader" id="document-loader"></a>

A document loader converts content from various file formats into `Document` objects.
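To make the idea concrete, here is a minimal sketch of what a loader does conceptually: read a file, then wrap its text and its origin in a document-like object. The names `SimpleDocument` and `load_text_file` are hypothetical stand-ins, not LangChain APIs, and a real loader handles far more (encodings, pages, errors).

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document (content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read one text file and wrap it in a single document."""
    text = Path(path).read_text(encoding="utf-8")
    # Record where the content came from, as loaders do in `metadata`.
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo with a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = str(Path(tmp_dir) / "sample.txt")
    Path(sample_path).write_text("Hello, loader.", encoding="utf-8")
    loaded = load_text_file(sample_path)

print(loaded[0].page_content)
print(list(loaded[0].metadata))
```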

#### Main Loaders <a href="#loader" id="loader"></a>

* PyPDFLoader: A loader that loads PDF files.
* CSVLoader: A loader that loads CSV files.
* UnstructuredHTMLLoader: A loader that loads HTML files.
* JSONLoader: A loader that loads JSON files.
* TextLoader: A loader that loads text files.
* DirectoryLoader: A loader that loads the files in a directory.

```
# Example file path
FILE_PATH = "./data/SPRI_AI_Brief_2023년12월호_F.pdf"
```

```
from langchain_community.document_loaders import PyPDFLoader

# Set up the loader
loader = PyPDFLoader(FILE_PATH)
```

#### load() <a href="#load" id="load"></a>

* Loads the documents and returns them.
* The result is returned as a `List[Document]`.

```
# Load the PDF
docs = loader.load()

# Check the number of loaded documents
len(docs)
```

```
23
```

```
# Check the first document
docs[0]
```

#### load\_and\_split() <a href="#load_and_split" id="load_and_split"></a>

* Loads the documents and splits them with the given text splitter.
* The result is returned as a `List[Document]`.

```
from langchain_text_splitters import CharacterTextSplitter

# Set up the text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
# Load and split the document
docs = loader.load_and_split(text_splitter=text_splitter)

# Check the number of split documents
len(docs)
# Check the first document
docs[0]
```

```
Document(metadata={'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 0}, page_content='December 2023')
```
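The idea behind `chunk_size` can be sketched without LangChain: split on a separator, then greedily merge pieces until the next one would exceed the limit. This is a simplified illustration, not the library's implementation. The function name `split_by_separator` is hypothetical, the `"\n\n"` separator is an assumption chosen to resemble a paragraph break, and overlap is ignored (the example above uses `chunk_overlap=0`).

```python
def split_by_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merged text still fits, or this is the first piece
            # of a new chunk (kept whole even if it is longer than the limit).
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccccccc\n\ndd"
chunks = split_by_separator(sample, chunk_size=10)
print(chunks)
```

Note that in this sketch a single piece longer than `chunk_size` is kept whole rather than cut, so chunks can exceed the limit when the separator does not occur often enough.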

#### lazy\_load() <a href="#lazy_load" id="lazy_load"></a>

* Loads documents lazily, yielding them one at a time through a generator.

```
# Load the documents lazily with a generator
for doc in loader.lazy_load():
    print(doc.metadata)
```

```
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 0}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 1}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 2}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 3}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 4}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 5}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 6}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 7}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 8}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 9}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 10}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 11}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 12}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 13}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 14}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 15}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 16}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 17}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 18}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 19}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 20}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 21}
{'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 22}
```
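The practical benefit of a generator is that work happens only for the items actually consumed. The sketch below shows this with a hypothetical stand-in, `lazy_pages`, which is not a LangChain function; it records each page it produces so the laziness is visible.

```python
from itertools import islice
from typing import Iterator

pages_read = []  # records which pages were actually produced


def lazy_pages(total_pages: int) -> Iterator[dict]:
    """Hypothetical lazy loader: yields one page record at a time."""
    for page in range(total_pages):
        pages_read.append(page)
        yield {"page": page, "page_content": f"text of page {page}"}


# Take only the first three pages; the remaining pages are never produced.
first_three = list(islice(lazy_pages(23), 3))
print([p["page"] for p in first_three])
print(pages_read)
```

This is why lazy loading suits large files or early-exit searches: memory use and parsing time scale with what is consumed, not with the size of the source.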

#### aload() <a href="#aload" id="aload"></a>

* Loads documents asynchronously.

```
# Load the documents asynchronously (this returns a coroutine)
adocs = loader.aload()
```

```
# Await the coroutine to get the loaded documents
await adocs
```

```
[Document(metadata={'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 0}, page_content=' 12/2023 issue'), Document(metadata=... AI Industry Trend Brief\n 1. United States ▹ United States, safe and reliable AI development and use executive order · · · · · · · · · · · · · · · · · · · · · G7, Hiroshima AI process to AI company target international action force · · Submission of AI comments in terms of consumer protection and competition to the Copyright Office ················ 5\n ▹ EU AI law 3rd party negotiation, based model regulation related views, ovulation,································ Corporate/Industry \n ▹ American Frontier Model Forum, 1,  $0 million AI Safety Fundraising ································ 7\n ▹ Cohir, Data Sources to Ensure Data Transparency Explorer Disclosure ············
...
(omitted)
...
Conference \non Artificial \nIntelligence\n-AI Development Association Conference (AAAI) promotes AI research, provides opportunities for exchanges between AI fields \n researchers, practitioners, scientists, students and engineers \n-Conference announces AI-related skills, special tracks, Invited speakers, \nworkshop, tutorial, poster session, topicn, exhibition Document(metadata={'source': './data/SPRI_AI_Brief_2023년12월호_F.pdf', 'page': 22}, page_content='Homepage: https://spri.kr/\nInquiries related to the report: AI Policy Lab (jayoo@spri.
```
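A top-level `await` like the one above works in notebook environments. In a plain Python script, a coroutine has to be driven by an event loop, typically through `asyncio.run()`. The sketch below shows that pattern with hypothetical stand-ins (`load_blocking`, `aload`, and the file names are illustrative, not LangChain APIs): a blocking load is moved to a worker thread, and two loads run concurrently.

```python
import asyncio
import time


def load_blocking(name: str) -> list[str]:
    """Hypothetical blocking loader standing in for a slow file read."""
    time.sleep(0.1)
    return [f"{name}: page 0", f"{name}: page 1"]


async def aload(name: str) -> list[str]:
    """Run the blocking loader in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(load_blocking, name)


async def main() -> list[list[str]]:
    # Load two sources concurrently instead of one after the other.
    return await asyncio.gather(aload("a.pdf"), aload("b.pdf"))


# In a script, asyncio.run() starts the event loop and waits for the result.
results = asyncio.run(main())
print(results)
```

Inside a notebook an event loop is already running, so there you would write `await main()` instead of calling `asyncio.run()`.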
