Document embedding is the process of converting a document's content into numerical vectors.
This lets you quantify the meaning of documents and use them in various natural language processing tasks. Representative pre-trained language models include BERT and GPT; these models capture contextual information to encode the meaning of a document.
To embed a document, the tokenized document is fed into the model and the resulting token vectors are averaged to produce a single vector for the entire document. This vector can be used for document classification, sentiment analysis, and computing similarity between documents.
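As a sketch of the averaging step described above, the toy example below mean-pools stand-in token vectors into a single document vector. The token embeddings here are randomly generated and the 768-dimension size is an illustrative assumption, not the output of a real model:

```python
import numpy as np

# Stand-in token embeddings: 12 tokens, each a 768-dimensional vector.
# In practice these would come from a model such as BERT or GPT.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 768))

# Mean pooling: average over the token axis to get one document vector.
doc_vector = token_embeddings.mean(axis=0)
print(doc_vector.shape)  # (768,)
```

The document vector has the same dimensionality as the token vectors, regardless of how many tokens the document contains.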
from langchain_openai import OpenAIEmbeddings
# Generate embeddings using OpenAI's "text-embedding-3-small" model.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
text = "Here are some sample sentences for embedding testing."
# Embed the text as a query vector.
query_result = embeddings.embed_query(text)
# Inspect the first five values of the embedding.
query_result[:5]
# Embed the text as a document; embed_documents takes a list of texts.
doc_result = embeddings.embed_documents([text])
# The length of the first embedding is the model's dimensionality.
len(doc_result[0])
1536
# Initialize an embedder that generates 1024-dimensional embeddings with OpenAI's "text-embedding-3-small" model.
embeddings_1024 = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1024)
# Check the dimensionality of the reduced embedding.
len(embeddings_1024.embed_query(text))
1024
sentence1 = "hello nice to meet you."
sentence2 = "hello nice to meet you!"
sentence3 = "Hello, nice to meet you.."
sentence4 = "Hi, nice to meet you."
sentence5 = "I like to eat apples."
sentences = [sentence1, sentence2, sentence3, sentence4, sentence5]
# Embed all sentences at once with the 1024-dimensional embedder.
embedded_sentences = embeddings_1024.embed_documents(sentences)
import numpy as np

# Cosine similarity between two embedding vectors.
def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for i, sentence in enumerate(embedded_sentences):
    for j, other_sentence in enumerate(embedded_sentences):
        if i < j:
            print(
                f"[similarity {similarity(sentence, other_sentence):.4f}] {sentences[i]} \t <=====> \t {sentences[j]}"
            )
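To see what the similarity score in the loop above measures, here is a minimal cosine-similarity check on toy vectors (the vectors are made-up stand-ins, not real embeddings): nearly parallel vectors score close to 1, orthogonal ones score 0.

```python
import numpy as np

# Cosine similarity: dot product of the vectors divided by the
# product of their lengths, giving a value in [-1, 1].
def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

close = cosine([1.0, 0.0, 1.0], [0.9, 0.1, 1.0])    # nearly parallel -> near 1
distant = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # orthogonal -> 0
print(round(close, 4), round(distant, 4))
```

This is why the near-duplicate greeting sentences score higher with each other than any of them does with the unrelated sentence about apples.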