09. Experiment evaluation comparison
The Compare feature provided by LangSmith makes it easy to compare experimental results.
# installation
# !pip install -qU langsmith langchain-teddynote

# Configuration file for managing the API KEY as an environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv()

True

# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("CH16-Evaluations")

Start tracking LangSmith.
[Project name]
CH16-Evaluations

Define functions for RAG performance testing
We will create a RAG system to use for testing.
Define answer-generation functions, one backed by the GPT-4o-mini model and one backed by an Ollama model.
Evaluate the answers produced by each model.
Run the evaluation once for each of the two chains, as sketched below.
Use the comparison view to examine the results side by side.
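The following is a minimal sketch of what the two target functions and the two evaluation runs could look like. The stand-in chains, the Ollama model name, the dataset name `RAG_EVAL_DATASET`, and the `ask_question_*` helpers are illustrative assumptions rather than the exact code of this tutorial; in a real test you would plug in the full RAG chains built in the earlier chapters, and the `langchain-openai` / `langchain-ollama` packages are assumed to be installed.

```python
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

# Stand-in chains: in the real test these would be full RAG chains
# (retriever + prompt + LLM) built as in the earlier RAG chapters.
prompt = ChatPromptTemplate.from_template(
    "Answer the question concisely.\n\nQuestion: {question}"
)
gpt_chain = prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()
ollama_chain = prompt | ChatOllama(model="EEVE-Korean-10.8B:latest") | StrOutputParser()

# Target functions: each receives one dataset example and returns the answer.
def ask_question_gpt(inputs: dict) -> dict:
    return {"answer": gpt_chain.invoke({"question": inputs["question"]})}

def ask_question_ollama(inputs: dict) -> dict:
    return {"answer": ollama_chain.invoke({"question": inputs["question"]})}

# Grade answers against the reference outputs with a GPT-4o-mini judge.
qa_evaluator = LangChainStringEvaluator(
    "qa", config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)}
)

dataset_name = "RAG_EVAL_DATASET"  # hypothetical dataset name

# Run the same evaluation once per chain; each call creates a separate
# experiment that can later be selected and compared in the LangSmith UI.
evaluate(
    ask_question_gpt,
    data=dataset_name,
    evaluators=[qa_evaluator],
    experiment_prefix="MODEL_COMPARE_EVAL",
    metadata={"model": "gpt-4o-mini"},
)
evaluate(
    ask_question_ollama,
    data=dataset_name,
    evaluators=[qa_evaluator],
    experiment_prefix="MODEL_COMPARE_EVAL",
    metadata={"model": "ollama"},
)
```

Both runs target the same dataset, so the resulting experiments appear together on the dataset's Experiments tab and can be selected for comparison.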
How to make a comparison view


On the dataset's Experiments tab, select the experiments you want to compare.
Click the "Compare" button at the bottom.
A comparison view