06. Word

Microsoft Word

Microsoft Word is a word processor developed by Microsoft.

This section covers how to load a Word document into a Document format that can be used downstream.

Docx2txtLoader

Docx2txtLoader uses the docx2txt package to load .docx files into Document objects.

# Install the dependency first
# !pip install -qU docx2txt
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("./data/sample-word-document.docx")  # Initialize the document loader

docs = loader.load()  # Load the document

print(len(docs))
1
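
To give a feel for what loading a .docx file into plain text involves, here is a rough standard-library sketch. It relies only on the fact that a .docx file is a ZIP archive whose main body lives in word/document.xml. It is an illustration under that assumption, not the actual implementation of docx2txt or Docx2txtLoader, and the helper names extract_docx_text and make_minimal_docx are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


def make_minimal_docx(paragraphs) -> io.BytesIO:
    """Build a tiny in-memory archive holding only word/document.xml (demo only).

    The paragraph strings are not XML-escaped, so keep them free of markup.
    """
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample = make_minimal_docx(["Hello", "World"])
print(extract_docx_text(sample))
```

A real document also carries headers, footers, tables, and images, which this sketch ignores. The loaders on this page take care of those details for you.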

UnstructuredWordDocumentLoader

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# Initialize the document loader
loader = UnstructuredWordDocumentLoader("./data/sample-word-document.docx")

# Load the document
docs = loader.load()

print(len(docs))

The result is loaded as a single Document.

Internally, unstructured creates different “elements” for each chunk of text.

By default these are combined into a single Document, but you can keep them separate by specifying mode="elements".
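
The following is a minimal sketch of loading with mode="elements". The helper names load_word_elements and summarize are hypothetical, and the "category" metadata key is an assumption about what the loader attaches to each element, so it is read with .get() in case it is absent.

```python
def load_word_elements(path: str):
    """Load a .docx file as one Document per element instead of a single Document."""
    # Imported lazily so that defining this helper does not require the package.
    from langchain_community.document_loaders import UnstructuredWordDocumentLoader

    loader = UnstructuredWordDocumentLoader(path, mode="elements")
    return loader.load()


def summarize(docs, limit: int = 5):
    """Return (category, text preview) pairs for the first few documents."""
    # "category" is assumed to be set per element; .get() returns None otherwise.
    return [
        (doc.metadata.get("category"), doc.page_content[:40])
        for doc in docs[:limit]
    ]
```

Calling docs = load_word_elements("./data/sample-word-document.docx") and then print(len(docs)) should report one Document per element rather than a single Document, and summarize(docs) gives a quick look at the first few of them.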
