02. PDF

PDF

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a way that is independent of application software, hardware, and operating systems.

This guide covers how to load PDF documents into the LangChain Document format, which is used by downstream steps.

LangChain integrates with various PDF parsers. Some are simple and relatively low-level; others support OCR and image processing or perform advanced document layout analysis.

The right choice depends on the user's application.

Reference

PDF experiments by the AutoRAG team

Leaderboard based on experiments conducted by AutoRAG

The numbers below are ranks (the lower, the better).

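| Domain | PDFMiner | PDFPlumber | PyPDFium2 | PyMuPDF | PyPDF2 |
|---|---|---|---|---|---|
| Medical | 1 | 2 | 3 | 4 | 5 |
| Law | 3 | 1 | 1 | 3 | 5 |
| Finance | 1 | 2 | 2 | 4 | 5 |
| Public | 1 | 1 | 1 | 4 | 5 |
| Sum | 5 | 5 | 7 | 15 | 20 |

Source: AutoRAG Medium Blog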

PyPDF

Here we use pypdf to load a PDF into an array of documents, where each document contains the page content and metadata, including the page number.
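As a minimal sketch of this step, assuming the `langchain-community` and `pypdf` packages are installed: the helper names `load_pages` and `describe_page` and the path `example.pdf` are illustrative and not part of LangChain.

```python
def load_pages(path):
    """Load a PDF with PyPDFLoader, returning one Document per page."""
    # Imported lazily so describe_page() can be used without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader

    return PyPDFLoader(path).load()


def describe_page(doc):
    """Summarize a Document by its page number and text length."""
    page = doc.metadata.get("page")  # PyPDFLoader stores a 0-based page index here
    return f"page {page}: {len(doc.page_content)} characters"


# Usage (example.pdf is a hypothetical local file):
# for doc in load_pages("example.pdf"):
#     print(describe_page(doc))
```

Each returned document exposes `page_content` (the extracted text) and `metadata` (a plain dictionary), so the same helper works for any loader that fills in a `page` key.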

PyPDF (OCR)

Some PDFs contain images of text, such as scanned documents or figures. You can also extract text from these images using the rapidocr-onnxruntime package.
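A hedged sketch of the OCR variant, assuming `langchain-community`, `pypdf`, and `rapidocr-onnxruntime` are installed. The function name `load_pages_with_ocr` and the file name in the usage comment are hypothetical.

```python
def load_pages_with_ocr(path):
    """Load a PDF page by page, also running OCR on images embedded in it."""
    from langchain_community.document_loaders import PyPDFLoader

    # extract_images=True asks the loader to pass embedded images through OCR
    # and include the recognized text in each page's content.
    return PyPDFLoader(path, extract_images=True).load()


# Usage (scanned.pdf is a hypothetical local file):
# docs = load_pages_with_ocr("scanned.pdf")
# print(docs[0].page_content[:300])
```

OCR is noticeably slower than plain text extraction, so it is probably worth enabling only for documents known to contain scanned pages or text inside figures.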

PyMuPDF

PyMuPDF is optimized for speed and includes detailed metadata about the PDF and its pages. It returns one document per page.
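The following sketch shows one way to load a file with this parser and inspect which metadata fields it provides. It assumes `langchain-community` and `pymupdf` are installed. The helper names and the example path are illustrative.

```python
def load_with_pymupdf(path):
    """Load a PDF with PyMuPDFLoader, returning one Document per page."""
    from langchain_community.document_loaders import PyMuPDFLoader

    return PyMuPDFLoader(path).load()


def metadata_keys(docs):
    """Return the sorted union of metadata keys found across all pages."""
    return sorted({key for doc in docs for key in doc.metadata})


# Usage (example.pdf is a hypothetical local file):
# docs = load_with_pymupdf("example.pdf")
# print(len(docs), "pages;", "metadata fields:", metadata_keys(docs))
```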

Unstructured

Unstructured provides a common interface for working with unstructured or semi-structured file formats such as Markdown or PDF.

LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.

Under the hood, Unstructured creates a separate "element" for each chunk of text. By default these are combined, but you can easily keep them separate by specifying mode="elements".

You can then inspect the full set of element types present in a particular document.
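A minimal sketch of loading in elements mode and tallying the element types, assuming `langchain-community` and `unstructured` (with its PDF extras) are installed. The helper names are hypothetical. Reading the element type from the `category` metadata key follows how the loader labels elements in this mode, and the `"unknown"` fallback is an addition for robustness.

```python
from collections import Counter


def load_elements(path):
    """Load a PDF so that every Unstructured element becomes its own Document."""
    from langchain_community.document_loaders import UnstructuredPDFLoader

    return UnstructuredPDFLoader(path, mode="elements").load()


def element_types(docs):
    """Count how many elements of each category the document contains."""
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


# Usage (example.pdf is a hypothetical local file):
# print(element_types(load_elements("example.pdf")))
```

Counting categories rather than just listing them gives a quick feel for the document's makeup, for example how many titles there are relative to narrative text blocks.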

PyPDFium2
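The source gives no description for this loader. As a sketch only, assuming `langchain-community` and `pypdfium2` are installed, it can be used through the same load pattern as the other per-page loaders. The helper names and the example path are hypothetical.

```python
def load_with_pypdfium2(path):
    """Load a PDF with PyPDFium2Loader, returning one Document per page."""
    from langchain_community.document_loaders import PyPDFium2Loader

    return PyPDFium2Loader(path).load()


def full_text(docs):
    """Join the text of all pages into a single string, one page per block."""
    return "\n\n".join(doc.page_content for doc in docs)


# Usage (example.pdf is a hypothetical local file):
# print(full_text(load_with_pypdfium2("example.pdf"))[:500])
```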

PDFMiner

Using PDFMiner to generate HTML text

With this method, the output HTML content can be parsed with BeautifulSoup to get more structured and rich information about font size, page numbers, PDF headers/footers, and so on, which can help you split text into semantically meaningful sections.
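One possible way to act on this, sketched under several assumptions: `langchain-community`, `pdfminer.six`, and `beautifulsoup4` are installed, and the generated HTML marks font size with an inline `font-size:<n>px` style on `span` tags inside `div` blocks. The helper names are hypothetical. Grouping consecutive text by font size is just one simple heuristic for finding section boundaries.

```python
import re


def load_as_html(path):
    """Convert a PDF to a single Document whose page_content is HTML."""
    from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

    return PDFMinerPDFasHTMLLoader(path).load()[0]


def extract_spans(html):
    """Return (text, font_size) pairs, one per <div> that declares a font size."""
    from bs4 import BeautifulSoup

    spans = []
    for div in BeautifulSoup(html, "html.parser").find_all("div"):
        span = div.find("span")
        style = span.get("style") if span else None
        match = re.search(r"font-size:(\d+)px", style or "")
        if match:
            spans.append((div.text, int(match.group(1))))
    return spans


def group_by_font_size(spans):
    """Merge consecutive (text, font_size) pairs that share the same font size."""
    snippets = []
    for text, size in spans:
        if snippets and snippets[-1][1] == size:
            snippets[-1] = (snippets[-1][0] + text, size)
        else:
            snippets.append((text, size))
    return snippets


# Usage (example.pdf is a hypothetical local file):
# html_doc = load_as_html("example.pdf")
# for text, size in group_by_font_size(extract_spans(html_doc.page_content)):
#     print(size, repr(text[:60]))
```

A change in font size between neighbouring snippets often signals a heading, so the grouped output can serve as a starting point for splitting the text into sections.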

PyPDF directory

Load all PDFs from a directory.
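A short sketch of directory loading, assuming `langchain-community` and `pypdf` are installed. The helper names and the `data/` directory are hypothetical. Grouping by the `source` metadata key relies on the loader recording each page's originating file path there.

```python
from collections import Counter


def load_directory(path):
    """Load every PDF found in a directory, one Document per page."""
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    return PyPDFDirectoryLoader(path).load()


def pages_per_file(docs):
    """Count how many page Documents came from each source file."""
    return Counter(doc.metadata.get("source") for doc in docs)


# Usage (data/ is a hypothetical directory containing PDF files):
# for source, n_pages in pages_per_file(load_directory("data/")).items():
#     print(source, n_pages)
```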

PDFPlumber

Like PyMuPDF, the output documents contain detailed metadata about the PDF and its pages, and one document is returned per page.
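A minimal sketch of this loader, assuming `langchain-community` and `pdfplumber` are installed. The function name and the example path are hypothetical.

```python
def load_with_pdfplumber(path):
    """Load a PDF with PDFPlumberLoader, returning one Document per page."""
    from langchain_community.document_loaders import PDFPlumberLoader

    return PDFPlumberLoader(path).load()


# Usage (example.pdf is a hypothetical local file):
# docs = load_with_pdfplumber("example.pdf")
# print(len(docs), "pages")
# print(docs[0].metadata)  # inspect the per-page metadata dictionary
```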
