05. Data frame output parser (PandasDataFrameOutputParser)

PandasDataFrameOutputParser

The Pandas DataFrame is a widely used data structure in Python for data manipulation and analysis. It provides a comprehensive set of tools for working with structured data and supports tasks such as data cleaning, transformation, and analysis.

This output parser lets users specify any Pandas DataFrame and ask an LLM to extract data from it, returning the result as a formatted dictionary.

from dotenv import load_dotenv

load_dotenv()

True

# Set up LangSmith tracking. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH03-OutputParser")

 Start tracking LangSmith. 
[Project name] 
CH03-OutputParser

import pprint
from typing import Any, Dict

import pandas as pd
from langchain.output_parsers import PandasDataFrameOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Initialize the ChatOpenAI model (a gpt-3.5-turbo model is recommended)
model = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

The format_parser_output function converts the parser output to dictionary form and pretty-prints it.

# For printing purposes only.
def format_parser_output(parser_output: Dict[str, Any]) -> None:
    # Iterate over the keys in the parser output.
    for key in parser_output.keys():
        # Converts the value of each key to a dictionary.
        parser_output[key] = parser_output[key].to_dict()
    # Pretty-print the converted dictionary.
    pprint.PrettyPrinter(width=4, compact=True).pprint(parser_output)
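
The conversion step can be tried without calling an LLM. The following is a minimal sketch, assuming the parser returns a dictionary whose values are pandas Series (as in the column query later in this section). The helper name to_plain_dict and the sample data are hypothetical, and unlike format_parser_output it builds a new dictionary instead of modifying its argument.

```python
import pprint
from typing import Any, Dict

import pandas as pd


def to_plain_dict(parser_output: Dict[str, Any]) -> Dict[str, Any]:
    # Convert each pandas Series value into a plain {index: value} dictionary.
    return {key: value.to_dict() for key, value in parser_output.items()}


# Hypothetical stand-in for the parser's result on a column query.
sample_output = {"Age": pd.Series([22.0, 38.0, 26.0])}
plain = to_plain_dict(sample_output)
pprint.pprint(plain)
```

Returning a new dictionary keeps the original Series objects available for further pandas operations after printing.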

Read the titanic.csv data and assign the resulting DataFrame to the variable df.
Then use PandasDataFrameOutputParser to parse queries against that DataFrame.

# Define the desired Pandas DataFrame.
df = pd.read_csv("./data/titanic.csv")
df.head()

# Set up the parser and inject instructions into the prompt template.
parser = PandasDataFrameOutputParser(dataframe=df)

# Prints instructions for the parser.
print(parser.get_format_instructions())

The output should be formatted as a string as the operation, followed by a colon, followed by the column or row to be queried on, followed by optional array parameters.
1. The column names are limited to the possible columns below.
2. Arrays must either be a comma-separated list of numbers formatted as [1,3,5], or it must be in range of numbers formatted as [0..4].
3. Remember that arrays are optional and not necessarily required.
4. If the column is not in the possible columns or the operation is not a valid Pandas DataFrame operation, return why it is invalid as a sentence starting with either "Invalid column" or "Invalid operation".

As an example, for the formats:
1. String "column:num_legs" is a well-formatted instance which gets the column num_legs, where num_legs is a possible column.
2. String "row:1" is a well-formatted instance which gets row 1.
3. String "column:num_legs[1,2]" is a well-formatted instance which gets the column num_legs for rows 1 and 2, where num_legs is a possible column.
4. String "row:1[num_legs]" is a well-formatted instance which gets row 1, but for just column num_legs, where num_legs is a possible column.
5. String "mean:num_legs[1..3]" is a well-formatted instance which takes the mean of num_legs from rows 1 to 3, where num_legs is a possible column and mean is a valid Pandas DataFrame operation.
6. String "do_something:num_legs" is a badly-formatted instance, where do_something is not a valid Pandas DataFrame operation.
7. String "mean:invalid_col" is a badly-formatted instance, where invalid_col is not a possible column.

Here are the possible columns: 
PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
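
The string format described in these instructions can be illustrated with a small standalone parser. This is only a sketch of the grammar, not the implementation used by PandasDataFrameOutputParser: the pattern QUERY_PATTERN and the function parse_query are hypothetical names, and treating a range such as [1..3] as inclusive of both ends is an assumption based on the wording "from rows 1 to 3".

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: "operation:target" with an optional "[...]" suffix.
QUERY_PATTERN = re.compile(
    r"^(?P<op>\w+):(?P<target>\w+)(?:\[(?P<array>[^\]]*)\])?$"
)


def parse_query(text: str) -> Tuple[str, str, Optional[List[str]]]:
    """Split a query string into (operation, target, array items)."""
    match = QUERY_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Not a well-formatted query: {text!r}")
    array = match.group("array")
    if array is None:
        # No array part was given, which the instructions allow.
        return match.group("op"), match.group("target"), None
    # Assumption: a range "1..3" covers every number from 1 to 3 inclusive.
    range_match = re.fullmatch(r"(\d+)\.\.(\d+)", array)
    if range_match:
        start, end = map(int, range_match.groups())
        items = [str(i) for i in range(start, end + 1)]
    else:
        items = [item.strip() for item in array.split(",")]
    return match.group("op"), match.group("target"), items


print(parse_query("column:num_legs"))      # ('column', 'num_legs', None)
print(parse_query("row:1[num_legs]"))      # ('row', '1', ['num_legs'])
print(parse_query("mean:num_legs[1..3]"))  # ('mean', 'num_legs', ['1', '2', '3'])
```

The sketch only checks the shape of the string. Validating the operation and the column name against the DataFrame, as rule 4 requires, is left out.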

This is an example of querying the values of a column.

# Here is an example of a column operation.
df_query = "Please check the Age column."


# Set up a prompt template.
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],  # Set input variables
    partial_variables={
        "format_instructions": parser.get_format_instructions()
    },  # Set partial variables
)

# Create a chain
chain = prompt | model | parser

# Chain execution
parser_output = chain.invoke({"query": df_query})

# Print the output
format_parser_output(parser_output)

{'Age': {0: 22.0, 
         1: 38.0, 
         2: 26.0, 
         3: 35.0, 
         4: 35.0, 
         5: nan, 
         6: 54.0, 
         7: 2.0, 
         8: 27.0, 
         9: 14.0, 
         10: 4.0, 
         11: 58.0, 
         12: 20.0, 
         13: 39.0, 
         14: 14.0, 
         15: 55.0, 
         16: 2.0, 
         17: nan, 
         18: 31.0, 
         19: nan}}

This is an example of retrieving the first row.

# Here is an example of a row query.
df_query = "Retrieve the first row."

# chain execution
parser_output = chain.invoke({"query": df_query})

# output the results
format_parser_output(parser_output)

 {'0': {'Age': 22.0, 
       'Cabin': nan, 
       'Embarked':'S', 
       'Fare': 7.25, 
       'Name':'Braund,' 
               'Mr. ' 
               'Owen' 
               'Harris', 
       'Parch': 0, 
       'PassengerId': 1, 
       'Pclass': 3, 
       'Sex':'male', 
       'SibSp': 1, 
       'Survived': 0, 
       'Ticket':'A/5' 
                 '21171' }}

This is an example of computing the average of a specific column over a subset of rows.

# Find the average Age of rows 0 to 4.
df["Age"].head().mean()

 31.2

# Example of an arbitrary Pandas DataFrame operation, limited to a range of rows.
df_query = "Retrieve the average of the Ages from row 0 to 4."

# chain execution
parser_output = chain.invoke({"query": df_query})

# output the results
print(parser_output)

{'mean': 31.2}
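
To make the link between the query and the pandas result explicit, the sketch below reproduces the same number by hand. It assumes the model answered with a string along the lines of "mean:Age[0..4]" and that the range includes both end rows. The small DataFrame named sample is a stand-in built from the first five Age values in the column query output above, not the full titanic.csv file.

```python
import pandas as pd

# Ages of rows 0 to 4, copied from the column query output shown earlier.
sample = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 35.0]})

# "mean:Age[0..4]" read as: select rows 0 through 4 (inclusive), then average.
rows = list(range(0, 4 + 1))
result = {"mean": float(sample.loc[rows, "Age"].mean())}
print(result)
```

Selecting with loc and an explicit list of row labels mirrors the inclusive range in the query, whereas a positional slice such as iloc[0:4] would stop one row short.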

Here is an example that calculates the average fare (the Fare column).

# Here's an example of a loosely worded query.
df_query = "Calculate average `Fare` rate."

# chain execution
parser_output = chain.invoke({"query": df_query})

# Output the results
print(parser_output)

 {'mean': 22.19937}

# Verification of results
df["Fare"].mean()

 22.19937


