07. Custom LLM evaluation
Evaluate with a custom Evaluator
You can configure custom LLM evaluators or Heuristic evaluators.
```python
# installation
# !pip install -U langsmith langchain-teddynote
```

```python
# Configuration file for managing API KEY as an environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv()
```

```
True
```

```python
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH16-Evaluations")
```

```
Start tracking LangSmith.
[Project name]
CH16-Evaluations
```

Define functions for RAG performance testing
We will create a RAG system to use for testing.
Create a function named ask_question. It receives a dictionary called inputs as input and returns a dictionary called answer.
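A minimal sketch of what ask_question might look like, assuming a rag_chain built beforehand (any LangChain chain whose invoke() returns a string answer); the internals here are illustrative, not the tutorial's exact implementation.

```python
# Minimal sketch: `rag_chain` is assumed to be a LangChain chain built elsewhere
# whose .invoke(question) returns a string answer.
def ask_question(inputs: dict) -> dict:
    # `inputs` is the dictionary passed in for each dataset example (e.g. {"question": "..."}),
    # and the return value is a dictionary containing the generated answer.
    return {"answer": rag_chain.invoke(inputs["question"])}
```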
Custom Evaluator Configuration
Your custom function only needs to follow the input parameter and return value format shown below.
Custom function
- Input: receives a Run and an Example, and returns a dict as output.
- Return value: organized in the format {"key": "score_name", "score": score}.

Below, we define a simple example function that returns a random score between 1 and 10 regardless of the answer.

Custom LLM-as-Judge
This time, we will create an LLM Chain and use it as an evaluator.
First, define the context_answer_rag_answer function, which returns context, answer, and question.
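A rough sketch, assuming a retriever and a chain from your own RAG setup; only the function name and the returned keys come from the text, the body is illustrative.

```python
def context_answer_rag_answer(inputs: dict) -> dict:
    question = inputs["question"]
    # Retrieve the documents used as context for this question
    docs = retriever.invoke(question)
    return {
        "context": "\n".join(doc.page_content for doc in docs),
        "answer": chain.invoke(question),
        "question": question,
    }
```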
Next, create a custom LLM evaluator.
The evaluation prompt can be adjusted freely at this point.
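One way to build the evaluator, sketched with an explicit prompt; the prompt wording and the gpt-4o-mini model are assumptions, so swap in whatever grading criteria and model you prefer.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Example grading prompt; adjust the criteria and scale freely.
eval_prompt = PromptTemplate.from_template(
    """You are a grader. Given a question, a retrieved context, and an answer,
rate how accurate and well-grounded the answer is on a scale from 0 to 10.
Respond with the number only.

Question: {question}
Context: {context}
Answer: {answer}

Score:"""
)

custom_llm_evaluator = (
    eval_prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is an assumption
    | StrOutputParser()
)
```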
Enter the answer and context generated with the previously created context_answer_rag_answer function into custom_llm_evaluator to proceed with the evaluation.
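For example (the sample question is made up):

```python
# Generate an answer for one sample question, then score it with the LLM evaluator.
# `output` already contains the question, context, and answer keys the prompt expects.
output = context_answer_rag_answer({"question": "What is LangSmith?"})
custom_llm_evaluator.invoke(output)
```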
Define the custom_evaluator function.
- run.outputs: gets the answer, context, and question produced by the RAG chain.
- example.outputs: gets the ground-truth answer from the dataset.
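A sketch under those assumptions; the metric key "custom_score" and the 0-1 normalization are choices of this example.

```python
from langsmith.schemas import Example, Run


def custom_evaluator(run: Run, example: Example) -> dict:
    # run.outputs: answer, context, and question produced by the RAG chain
    answer = run.outputs.get("answer", "")
    context = run.outputs.get("context", "")
    question = run.outputs.get("question", "")
    # example.outputs: ground-truth answer from the dataset (unused by this simple prompt,
    # but available if your evaluation prompt compares against it)
    ground_truth = example.outputs.get("answer", "")

    # Ask the LLM judge for a 0-10 score and normalize it to the 0-1 range
    score = custom_llm_evaluator.invoke(
        {"question": question, "context": context, "answer": answer}
    )
    return {"key": "custom_score", "score": int(score.strip()) / 10}
```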
Proceed with the evaluation.
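A sketch of the evaluation run; the dataset name and experiment prefix are placeholders.

```python
from langsmith.evaluation import evaluate

experiment_results = evaluate(
    context_answer_rag_answer,        # target function that produces the outputs
    data="RAG_EVAL_DATASET",          # placeholder dataset name
    evaluators=[custom_evaluator],    # the custom LLM-as-judge evaluator defined above
    experiment_prefix="CUSTOM-LLM-EVAL",
)
```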
