Document embedding is the process of converting a document's content into numerical vectors.
This lets you quantify the meaning of documents and use them in various natural language processing tasks. Representative pre-trained language models include BERT and GPT; these models capture contextual information to encode the meaning of a document.
To embed a document, the tokenized document is fed into the model and the resulting token vectors are averaged to produce a single vector for the entire document. This vector can be used for document classification, sentiment analysis, and computing similarity between documents.
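As a sketch of the averaging step described above, the toy example below mean-pools stand-in token vectors into a single document vector. The token embeddings here are randomly generated and the 768-dimension size is an illustrative assumption, not the output of a real model:

```python
import numpy as np

# Stand-in token embeddings: 12 tokens, each a 768-dimensional vector.
# In practice these would come from a model such as BERT or GPT.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 768))

# Mean pooling: average over the token axis to get one document vector.
doc_vector = token_embeddings.mean(axis=0)
print(doc_vector.shape)  # (768,)
```

The document vector has the same dimensionality as the token vectors, regardless of how many tokens the document contains.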
from langchain_openai import OpenAIEmbeddings
# Generate embeddings using OpenAI's "text-embedding-3-small" model.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
text = "Here are some sample sentences for embedding testing."
# Embed the text as a query vector.
query_result = embeddings.embed_query(text)
# Inspect the first five values of the embedding.
query_result[:5]
# Embed the text as a document; embed_documents takes a list of texts.
doc_result = embeddings.embed_documents([text])
# The length of the first embedding is the model's dimensionality.
len(doc_result[0])
1536
# Initialize an embedder that generates 1024-dimensional embeddings with OpenAI's "text-embedding-3-small" model.
embeddings_1024 = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1024)
# Check the dimensionality of the reduced embedding.
len(embeddings_1024.embed_query(text))
1024
sentence1 = "hello nice to meet you."
sentence2 = "hello nice to meet you!"
sentence3 = "Hello, nice to meet you.."
sentence4 = "Hi, nice to meet you."
sentence5 = "I like to eat apples."
sentences = [sentence1, sentence2, sentence3, sentence4, sentence5]
# Embed all sentences at once with the 1024-dimensional embedder.
embedded_sentences = embeddings_1024.embed_documents(sentences)
import numpy as np

# Cosine similarity between two embedding vectors.
def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for i, sentence in enumerate(embedded_sentences):
    for j, other_sentence in enumerate(embedded_sentences):
        if i < j:
            print(
                f"[similarity {similarity(sentence, other_sentence):.4f}] {sentences[i]} \t <=====> \t {sentences[j]}"
            )
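To see what the similarity score in the loop above measures, here is a minimal cosine-similarity check on toy vectors (the vectors are made-up stand-ins, not real embeddings): nearly parallel vectors score close to 1, orthogonal ones score 0.

```python
import numpy as np

# Cosine similarity: dot product of the vectors divided by the
# product of their lengths, giving a value in [-1, 1].
def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

close = cosine([1.0, 0.0, 1.0], [0.9, 0.1, 1.0])    # nearly parallel -> near 1
distant = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # orthogonal -> 0
print(round(close, 4), round(distant, 4))
```

This is why the near-duplicate greeting sentences score higher with each other than any of them does with the unrelated sentence about apples.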