# 01. OpenAIEmbeddings

## OpenAIEmbeddings <a href="#openaiembeddings" id="openaiembeddings"></a>

Document embedding is the process of converting the content of a document into numerical vectors.

This process lets you quantify the meaning of documents and use them in various natural language processing tasks. Representative pre-trained language models include BERT and GPT; these models capture contextual information to encode the meaning of documents.

To create a document embedding, the tokenized document is fed into the model, and the resulting token vectors are averaged (pooled) to produce a single vector for the entire document. This vector can then be used for document classification, sentiment analysis, and computing similarity between documents.
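The averaging (mean-pooling) step described above can be sketched in plain NumPy; the token vectors below are illustrative stand-ins for the contextual vectors a real model would produce:

```python
import numpy as np

# Toy "document" of 4 token vectors, each 3-dimensional.
# (A real model such as BERT would produce contextual token vectors;
# the shapes and values here are illustrative only.)
token_vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
])

# Mean pooling: average the token vectors into one document vector.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # a single 3-dimensional vector: [0.55 0.65 0.75]
```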

[Learn more](https://platform.openai.com/docs/guides/embeddings/embedding-models)

### Settings <a href="#id-1" id="id-1"></a>

First, install `langchain-openai` and set the required environment variables.

```
# Manage API keys as environment variables via a .env configuration file.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()
```

```
True
```

List of supported models

| MODEL                  | PAGES PER DOLLAR | PERFORMANCE ON MTEB EVAL | MAX INPUT |
| ---------------------- | ---------------- | ------------------------ | --------- |
| text-embedding-3-small | 62,500           | 62.3%                    | 8191      |
| text-embedding-3-large | 9,615            | 64.6%                    | 8191      |
| text-embedding-ada-002 | 12,500           | 61.0%                    | 8191      |

```
from langchain_openai import OpenAIEmbeddings

# Generate embeddings using OpenAI's "text-embedding-3-small" model.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```

```
text = "Here are some sample sentences for embedding testing."
```

### Query embedding <a href="#id-2" id="id-2"></a>

`embeddings.embed_query(text)` is a function that converts the given text into an embedding vector.

This function can be used to map text to a vector space to find semantically similar text or to calculate similarities between texts.

```
# Generate query results by embedding text.
query_result = embeddings.embed_query(text)
```

`query_result[:5]` slices `query_result` to select the first 5 elements of the list.

```
# Selects the first 5 items in the query results.
query_result[:5]
```

```
 [-0.007747666910290718, 0.0367600381731987, 0.019548965618014336, -0.0197015218436718, 0.0172061396 
```

### Document embedding <a href="#document" id="document"></a>

`embeddings.embed_documents()` embeds a list of text documents.

* Pass the single document to the function as a one-element list, `[text]`.
* The embedding vectors returned by the call are assigned to the `doc_result` variable.

```
doc_result = embeddings.embed_documents(
    [text]
)  # Create document vectors by embedding text.
```

`doc_result[0][:5]` slices the first element of `doc_result` to select its first 5 values.

```
# Selects the first five items from the first element of the document results.
doc_result[0][:5]
```

```
 [-0.007747666910290718, 0.0367600381731987, 0.019548965618014336, -0.0197015218436718, 0.0172061396 
```

### Dimension assignment <a href="#id-3" id="id-3"></a>

The `text-embedding-3` model classes let you specify the size of the returned embedding.

For example, by default `text-embedding-3-small` returns a 1536-dimensional embedding.

```
# Returns the length of the first element in the document result.
len(doc_result[0])
```

```
 1536 
```

#### Dimensions adjustment <a href="#dimensions" id="dimensions"></a>

But by passing `dimensions=1024`, you can reduce the size of the embedding to 1024.

```
# Initialize an embedding object that generates 1024-dimensional embeddings
# using OpenAI's "text-embedding-3-small" model.
embeddings_1024 = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1024)

# Check the dimension of the resulting embedding.
len(embeddings_1024.embed_documents([text])[0])
```

```
1024
```
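OpenAI's embedding guide also notes that a `text-embedding-3` embedding can be shortened after generation by truncating the vector and re-normalizing it to unit length. A minimal NumPy sketch of that idea (the sample vector is illustrative, not a real embedding):

```python
import numpy as np

def shorten(vec, dim):
    """Truncate an embedding to `dim` dimensions and re-normalize to unit length."""
    v = np.asarray(vec, dtype=float)[:dim]
    return v / np.linalg.norm(v)

full = np.array([0.6, 0.8, 0.0, 0.0, 0.3])  # illustrative 5-dimensional vector
short = shorten(full, 2)
print(len(short), float(np.linalg.norm(short)))  # 2 1.0
```

Using the `dimensions` parameter at creation time, as shown above, is generally the simpler approach.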

### Similarity calculation <a href="#id-4" id="id-4"></a>

```
sentence1 = "hello nice to meet you."
sentence2 = "hello nice to meet you!"
sentence3 = "Hello, nice to meet you.."
sentence4 = "Hi, nice to meet you."
sentence5 = "I like to eat apples."
```

```
from sklearn.metrics.pairwise import cosine_similarity

sentences = [sentence1, sentence2, sentence3, sentence4, sentence5]
embedded_sentences = embeddings_1024.embed_documents(sentences)
```

```
def similarity(a, b):
    return cosine_similarity([a], [b])[0][0]
```
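For reference, `cosine_similarity` computes the dot product of the two vectors divided by the product of their norms; a minimal NumPy equivalent of the helper above:

```python
import numpy as np

def cosine_sim(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([1.0, 0.0], [2.0, 0.0]))  # parallel vectors   -> 1.0
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors -> 0.0
```

Values closer to 1 mean the two embeddings point in nearly the same direction, i.e. the texts are semantically similar.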

```
# sentence1 = "hello nice to meet you."
# sentence2 = "hello nice to meet you!"
# sentence3 = "Hello, nice to meet you.."
# sentence4 = "Hi, nice to meet you."
# sentence5 = "I like to eat apples."

for i, sentence in enumerate(embedded_sentences):
    for j, other_sentence in enumerate(embedded_sentences):
        if i < j:
            print(
                f"[Similarity {similarity(sentence, other_sentence):.4f}] {sentences[i]} \t <=====> \t {sentences[j]}"
            )
```
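The same pairwise machinery extends naturally to a simple semantic search: embed a query, score it against the document embeddings, and take the best match. A sketch with toy vectors standing in for real embeddings (in practice they would come from `embed_query` and `embed_documents` above):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for document embeddings (illustrative values only).
corpus_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
])
# Toy stand-in for a query embedding, closest to the first document.
query_vector = np.array([[0.8, 0.2, 0.1]])

# Score the query against every document and pick the best match.
scores = cosine_similarity(query_vector, corpus_vectors)[0]
best = int(np.argmax(scores))
print(best)  # index of the most similar document: 0
```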
