05. LLM-as-Judge
Let's make use of the off-the-shelf evaluators provided by LangSmith.
Off-the-shelf evaluators are predefined, prompt-based LLM evaluators.
They have the advantage of being easy to use, but if you need more advanced features you will have to define your own evaluator.
By default, the following three pieces of information are passed to the LLM evaluator:

- input: the question. Usually the question from the dataset is used.
- prediction: the answer generated by the LLM. Usually the model's answer is used.
- reference: can be used flexibly, for example as the ground-truth answer, the Context, and so on.
Reference: https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations
```python
# Install
# !pip install -U langsmith langchain-teddynote
```

```python
# Configuration file for managing the API KEY as an environment variable
from dotenv import load_dotenv

# Load the API KEY information
load_dotenv()
```

True

```python
# Set up LangSmith tracing. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter the project name.
logging.langsmith("CH16-Evaluations")
```

Define functions for RAG performance testing
We will create a RAG system to use for testing.
Create a function named ask_question. It receives a dictionary inputs as input and returns a dictionary answer.
We also define a helper function that prints the evaluator's prompt so it can be inspected. A sketch of ask_question is shown below.
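The sketch below shows one way such a pipeline could look, assuming an OpenAI chat model and a small in-memory FAISS index; the placeholder texts, model name, and prompt are assumptions for illustration, not the corpus the tutorial actually uses.

```python
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Placeholder corpus; replace with your own loader/splitter output.
texts = [
    "LangSmith provides datasets, experiments, and off-the-shelf evaluators.",
    "LLM-as-Judge uses an LLM with a grading prompt to score another LLM's answers.",
]
retriever = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def ask_question(inputs: dict) -> dict:
    """Receive a dictionary `inputs` and return a dictionary `answer`."""
    question = inputs["question"]
    docs = retriever.invoke(question)
    context = "\n".join(doc.page_content for doc in docs)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    return {"answer": answer}
```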
Question-Answer Evaluator
This is the evaluator with the most basic functionality. It evaluates the question (Query) and the answer (Answer).
The user input is defined as input, the answer generated by the LLM as prediction, and the ground-truth answer as reference.
(In the evaluator's prompt, however, the variables are named query, result, and answer.)

- query: the question
- result: the LLM's answer
- answer: the ground-truth answer
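A minimal sketch of creating the "qa" evaluator and running the evaluation, assuming a LangSmith dataset whose inputs contain a question field; the dataset name RAG_EVAL_DATASET and the experiment prefix are placeholders.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Create the basic Question-Answer evaluator.
qa_evaluator = LangChainStringEvaluator("qa")

dataset_name = "RAG_EVAL_DATASET"  # placeholder dataset name

experiment_results = evaluate(
    ask_question,                   # target function to evaluate
    data=dataset_name,              # LangSmith dataset to evaluate on
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "QA evaluator"},
)
```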
Run the evaluation, then open the URL in the output to check the results.

Context-based Answer Evaluator
LangChainStringEvaluator("context_qa"): Instruct the LLM chain to use the reference "context" to determine its accuracy.LangChainStringEvaluator("cot_qa"):"cot_qa"has"context_qa"Similar to the evaluator, but differs in that it instructs you to use the'inference' of LLM before deciding on the final judgment. Reference
First, you need to define a function that also returns the Context: context_answer_rag_answer. A sketch is shown below.
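This sketch reuses the retriever, prompt, and llm from the earlier sketch; the returned keys (context, answer, question) follow the description in this section.

```python
def context_answer_rag_answer(inputs: dict) -> dict:
    """Return the retrieved context together with the generated answer."""
    question = inputs["question"]
    docs = retriever.invoke(question)
    context = "\n".join(doc.page_content for doc in docs)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    return {"context": context, "answer": answer, "question": question}
```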
Then create the LangChainStringEvaluator. When creating it, map the return values of the function defined above appropriately via prepare_data.
Details

- run: the results generated by the LLM (context, answer, input)
- example: the data defined in the dataset (question and answer)
For LangChainStringEvaluator to perform the evaluation, the following three pieces of information are needed:

- prediction: the answer generated by the LLM
- reference: the answer defined in the dataset
- input: the question defined in the dataset

However, LangChainStringEvaluator("context_qa") uses reference as the Context, so it is defined differently. (Note: to use the context_qa evaluator, the function above was defined to return context, answer, and question.) A sketch of the prepare_data mapping follows.
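A sketch of the mapping and the two evaluators, assuming the run outputs and dataset fields named above (answer, context, question); adjust the keys to match your own dataset.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate


def prepare_context_data(run, example):
    # Map the run outputs / dataset example onto what the evaluator expects.
    return {
        "prediction": run.outputs["answer"],   # answer generated by the LLM
        "reference": run.outputs["context"],   # retrieved context used as the reference
        "input": example.inputs["question"],   # question defined in the dataset
    }


context_qa_evaluator = LangChainStringEvaluator(
    "context_qa", prepare_data=prepare_context_data
)
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa", prepare_data=prepare_context_data
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,  # placeholder dataset name from earlier
    evaluators=[context_qa_evaluator, cot_qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "context_qa and cot_qa evaluators"},
)
```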
Run the evaluation, then open the URL in the output to check the results.

Note the evaluation results: even when the generated answer does not match the given Ground Truth, it is still graded CORRECT as long as it is correct with respect to the Context.
Criteria
When there is no reference label (ground-truth answer), or it is difficult to obtain one, you can use the "criteria" or "score" evaluators to evaluate a run against a set of custom criteria.
This is useful when you want to monitor high-level semantic aspects of the model's answers: LangChainStringEvaluator("criteria", config={"criteria": "one of the criteria below"}). A sketch follows the table.
| Criterion | Description |
| --- | --- |
| conciseness | Evaluates whether the answer is concise and simple |
| relevance | Evaluates whether the answer is relevant to the question |
| correctness | Evaluates whether the answer is correct |
| coherence | Evaluates whether the answer is coherent |
| harmfulness | Evaluates whether the answer is harmful or dangerous |
| maliciousness | Evaluates whether the answer is malicious or intended to cause harm |
| helpfulness | Evaluates whether the answer is helpful |
| controversiality | Evaluates whether the answer is controversial |
| misogyny | Evaluates whether the answer is misogynistic |
| criminality | Evaluates whether the answer promotes criminal activity |
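A minimal sketch that runs two of the criteria above against the ask_question function; conciseness and relevance are arbitrary choices here, and no reference answer is required.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Unlabeled criteria evaluators: no ground-truth answer is needed.
criteria_evaluators = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
]

evaluate(
    ask_question,
    data=dataset_name,  # placeholder dataset name from earlier
    evaluators=criteria_evaluators,
    experiment_prefix="CRITERIA_EVAL",
    metadata={"variant": "criteria evaluators"},
)
```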

Using an evaluator when a ground-truth answer exists (labeled_criteria)
When a ground-truth answer exists, the LLM can evaluate by comparing the generated answer against it.
As in the example below, the ground-truth answer is passed as reference and the answer generated by the LLM as prediction.
This mapping is defined separately through prepare_data.
In addition, the LLM used to evaluate the answers is defined through llm in config.
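A sketch of a labeled_criteria evaluator, assuming the dataset's outputs contain an answer field; the helpfulness criterion text and the model name are illustrative assumptions.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator, evaluate

labeled_criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user, "
                "taking into account the reference answer?"
            )
        },
        # LLM used to judge the answers.
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],     # answer generated by the LLM
        "reference": example.outputs["answer"],  # ground-truth answer from the dataset
        "input": example.inputs["question"],
    },
)
```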
Below is an example that evaluates relevance.
This time, the context is passed as reference through prepare_data.
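A sketch of the relevance variant; the only change from the previous sketch is that reference is mapped to the retrieved context instead of the ground-truth answer.

```python
relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],  # context instead of the ground-truth answer
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_criteria_evaluator, relevance_evaluator],
    experiment_prefix="LABELED_CRITERIA_EVAL",
    metadata={"variant": "labeled_criteria evaluators"},
)
```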
Run the evaluation, then open the URL in the output to check the results.

Custom score Evaluator (labeled_score_string)
Below is an example of creating an evaluator that returns a score. The score can be normalized with normalize_by; the normalized score is a value between 0 and 1.
The accuracy criterion below is an arbitrary, user-defined criterion. You can use it by defining a suitable prompt.
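A sketch of a labeled_score_string evaluator with a user-defined accuracy criterion; the 1-10 score is divided by normalize_by=10 so the reported value falls between 0 and 1. The criterion wording, model name, and field names are illustrative assumptions that follow the dataset layout used above.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator, evaluate

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": (
                "How accurate is this prediction compared to the reference answer? "
                "Give a score between 1 and 10."
            )
        },
        "normalize_by": 10,  # rescale the score to the 0-1 range
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED_SCORE_EVAL",
    metadata={"variant": "labeled_score_string evaluator"},
)
```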
Run the evaluation, then open the URL in the output to check the results.
