# 13. LlamaParser

## LlamaParser <a href="#llamaparser" id="llamaparser"></a>

LlamaParse is a document parsing service developed by LlamaIndex and designed specifically for large language models (LLMs). Its main features are:

* Support for a variety of document formats, including PDF, Word, PowerPoint, and Excel
* Custom output formats specified through natural-language instructions
* Extraction of complex tables and images
* JSON mode support
* Multilingual support

LlamaParse is available as a standalone API and as part of the LlamaCloud platform. The service aims to improve the performance of LLM-based applications such as Retrieval-Augmented Generation (RAG) by parsing and refining documents.

Users can process 1,000 pages per day for free, and additional capacity can be obtained through a paid plan. LlamaParse is currently available in public beta, and its functionality is constantly expanding.

* link: [https://cloud.llamaindex.ai](https://cloud.llamaindex.ai/)

**API key setup**: After issuing an API key, set it as `LLAMA_CLOUD_API_KEY` in your `.env` file.

```
# INSTALLATION
# !pip install llama-index-core llama-parse llama-index-readers-file python-dotenv
```

```
import os
import nest_asyncio
from dotenv import load_dotenv

load_dotenv()
nest_asyncio.apply()
```
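Before calling the parser, it can help to confirm that the key was actually loaded. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper name introduced here for illustration and is not part of LlamaParse or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file and call load_dotenv() first."
        )
    return value
```

Calling `require_env("LLAMA_CLOUD_API_KEY")` right after `load_dotenv()` would fail early with a clear message if the key is missing.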

Basic parser usage

```
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# Parser settings
parser = LlamaParse(
    result_type="markdown",  # "markdown" or "text"
    num_workers=8,  # number of workers (default: 4)
    verbose=True,
    language="ko",
)

# Map the .pdf extension to the parser for SimpleDirectoryReader
file_extractor = {".pdf": parser}

# Parse the file with LlamaParse
documents = SimpleDirectoryReader(
    input_files=["data/SPRI_AI_Brief_2023년12월호_F.pdf"],
    file_extractor=file_extractor,
).load_data()
```

```
 Started parsing the file under job_id 6a2aa79c-0d8b-4d59-866d-afa368baa31d
```

```
# Check page count
len(documents)
```

```
23
```

Convert LlamaIndex documents to LangChain documents

```
# Convert to Langchain document
docs = [doc.to_langchain_format() for doc in documents]
```

```
# Print the metadata
docs[0].metadata
```

```
 {'file_path':'data/SPRI_AI_Brief_2023 December issue_F.pdf','file_name':'SPRI_AI_Brief_2023 December issue_F.pdf','file_type':'application/pdf', 'file_size' 
```

### Parsing with a MultiModal Model <a href="#multimodal-model" id="multimodal-model"></a>

**Main parameters**

* `use_vendor_multimodal_model`: Specifies whether to use a multimodal model. When set to `True`, a multimodal model from an external vendor is used.
* `vendor_multimodal_model_name`: Specifies the name of the multimodal model to use. This example uses "openai-gpt4o".
* `vendor_multimodal_api_key`: Specifies the API key for the multimodal model. Here, the OpenAI API key is read from an environment variable.
* `result_type`: Specifies the format of the parsing result. When set to "markdown", the result is returned in markdown format.
* `language`: Specifies the language of the document to be parsed. Here it is set to "ko" so that the document is processed as Korean.
* `skip_diagonal_text`: Specifies whether to skip diagonal text.
* `page_separator`: Specifies the separator placed between pages.

```
parser = LlamaParse(
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    vendor_multimodal_api_key=os.environ["OPENAI_API_KEY"],
    result_type="markdown",
    language="ko",
    # skip_diagonal_text=True,
    # page_separator="\n=================\n"
)
```

```
# Parse the file
parsed_docs = parser.load_data(file_path="data/SPRI_AI_Brief_2023년12월호_F.pdf")
```

```
 Started parsing the file under job_id cf2876e9-02c2-4277-ae92-03ae21d4a3bd
```

```
# Convert to LangChain documents
docs = [doc.to_langchain_format() for doc in parsed_docs]
```
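The `page_separator` option listed among the parameters is useful when the parsed pages end up in one combined string and need to be split again later. The following is a minimal sketch under that assumption; `split_pages` is a hypothetical helper, and the sample string is made up for illustration rather than taken from real LlamaParse output.

```python
# Same separator string as the commented-out page_separator option
PAGE_SEPARATOR = "\n=================\n"


def split_pages(text: str, separator: str = PAGE_SEPARATOR) -> list[str]:
    """Split combined parser output into per-page strings, dropping empty pages."""
    return [page.strip() for page in text.split(separator) if page.strip()]


# Hypothetical combined output, for illustration only
sample = "# Page one" + PAGE_SEPARATOR + "# Page two" + PAGE_SEPARATOR + "# Page three"
pages = split_pages(sample)
print(len(pages))
```

A distinctive separator like this one is less likely to collide with markdown that already appears inside a page.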

You can also specify custom parsing instructions, as shown below.

```
# Specify the parsing instruction
parsing_instruction = (
    "You are parsing a brief of AI Report. Please extract tables in markdown format."
)

# Configure LlamaParse
parser = LlamaParse(
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    vendor_multimodal_api_key=os.environ["OPENAI_API_KEY"],
    result_type="markdown",
    language="ko",
    parsing_instruction=parsing_instruction,
)

# Parse the file
parsed_docs = parser.load_data(file_path="data/SPRI_AI_Brief_2023년12월호_F.pdf")

# Convert to LangChain documents
docs = [doc.to_langchain_format() for doc in parsed_docs]
```

```
Started parsing the file under job_id afdbf3ba-61f6-4c14-8d41-9e986950b612
```

```
# Check the tables extracted in markdown format
print(docs[-2].page_content)
```

```
# Ⅱ. Main event schedule 

| Event name | Main overview of the event | 
| --- | --- | 
| CES 2024 | - The world's largest consumer electronics, IT, and consumer goods exhibition, hosted by the Consumer Technology Association (CTA), with companies exhibiting their latest technology products in major categories including 5G, AR & VR, digital health, and transportation and mobility. 
 - CTA Chairman Shapiro named AI as the most notable sector; under the theme 'All on', meaning that it encompasses all industries, this exhibition will host more than 500 Korean companies. 

 ![CES 2024](https://www.ces.tech/) | 
| Period | 2024.1.9~12 | 
| Place | USA, Las Vegas | 
| Homepage | [https://www.ces.tech/](https://www.ces.tech/) | 

| Event name | Main overview of the event | 
| --- | --- | 
| AIMLA 2024 | - The International Conference on Machine Learning and Applications (AIMLA 2024) shares knowledge and the latest research results on the theory, methodology, and practical approaches of artificial intelligence and machine learning. 
 - Covering the main areas of artificial intelligence and machine learning in both theory and practice, it brings together researchers and industry practitioners to share cutting-edge developments in the field. 

 ![AIMLA 2024](https://ccnet2024.org/aimla/index) | 
| Period | 2024.1.27~28 | 
| Place | Denmark, Copenhagen | 
| Homepage | [https://ccnet2024.org/aimla/index](https://ccnet2024.org/aimla/index) | 

| Event name | Main overview of the event | 
| --- | --- | 
| AAAI Conference on Artificial Intelligence | - The Association for the Advancement of Artificial Intelligence (AAAI) conference promotes AI research and provides opportunities for exchange among AI researchers, practitioners, scientists, academics, and engineers. 
 - The conference features AI-related technical presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, topic presentations, competitions, exhibition programs, and more. 

 ![AAAI Conference on Artificial Intelligence](https://aaai.org/aaai-conference/) | 
| Period | 2024.2.20~27 | 
| Place | Canada, Vancouver | 
| Homepage | [https://aaai.org/aaai-conference/](https://aaai.org/aaai-conference/) | 

```
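Once tables come back as markdown, their rows can be pulled out of `page_content` for further processing. The following is a minimal sketch, not part of LlamaParse or LangChain: `extract_table_rows` is a hypothetical helper, and it only handles rows that sit on a single line, so cells that wrap across several lines (such as the long overview cells above) are skipped.

```python
def extract_table_rows(markdown: str) -> list[list[str]]:
    """Collect the cells of each single-line markdown table row, skipping separator rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a complete table row on one line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # header separator such as | --- | --- |
        rows.append(cells)
    return rows


# Short sample in the same shape as the parsed output
sample = """
| Event name | Main overview of the event |
| --- | --- |
| Period | 2024.1.9~12 |
| Place | USA, Las Vegas |
"""
for row in extract_table_rows(sample):
    print(row)
```

With real parsed documents, the same function could be applied to `docs[-2].page_content`; a full markdown parser would be a better fit if multi-line cells need to be preserved.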