02. PDF
PDF
Portable Document Format (PDF) The file format, standardized to ISO 32000, was developed by Adobe to present documents in 1992, which includes text formatting and images in a way that is independent of application software, hardware and operating systems.
This guide PDF LangChain document Document Covers how to load in format. This format is used in the downstream.
LangChain is integrated with various PDF parsers. Some are simple and relatively low-level, others support OCR and image processing or perform advanced document layout analysis.
The right choice depends on the user's application.
Reference
PDF experiment on AutoRAG team
leaderboards based on experiments conducted at AutoRAG
The numbers shown below represent the number of equal numbers. (The lower, the better)
PDFMiner
PDFPlumber
PyPDFium2
PyMuPDF
PyPDF2
Medical
One
2
3
4
5
Law
3
One
One
3
5
Finance
One
2
2
4
5
Public
One
One
One
4
5
Sum
5
5
7
15
20
source: AutoRAG Medium Blog
PyPDF
Here pypdf Load PDFs into document arrays using, each document page Includes page content and metadata along with the number.
PyPDF (OCR)
Some PDFs contain text images within scanned documents or pictures. rapidocr-onnxruntime You can also extract text from images using packages.
PyMuPDF
PyMuPDF Is speed optimization and contains detailed metadata for PDF and its pages. Returns one document per page:
Unstructured
Unstructured Supports a common interface to deal with unstructured or hemisputed file formats such as Markdown or PDF.
LangChain UnstructuredPDFLoader LangChain PDF documents integrated with Unstructured Document Parse with objects.
Internally atypical, each text chunk is different. Element Create ". Basically these are combined mode="elements" You can easily separate it by specifying.
See the full set of element types for this particular document
PyPDFium2
PDFMiner
PDFMiner Generate HTML text using
This method is output HTML content BeautifulSoup By parsing through, you can get more structured and rich information about font size, page numbers, PDF headers/puters, etc., which can help you divide text into semantically sections.
PyPDF directory
Load PDF from directory
PDFPlumber
Like PyMuPDF, the output document contains a PDF and a detailed metadata for that page, and returns one document per page.
Last updated