05. LLM-as-Judge
Let's make use of the off-the-shelf evaluators provided by LangSmith.
Off-the-shelf evaluators are predefined, prompt-based LLM evaluators.
They have the advantage of being easy to use, but if you need more advanced features you will have to define your own evaluator.
By default, the following three pieces of information are passed to the LLM evaluator:

- input: the question. Usually the question from the dataset is used.
- prediction: the answer generated by the LLM. Usually the model's answer is used.
- reference: can be used flexibly, for example as the ground-truth answer, the Context, and so on.
Reference: https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations
```python
# Install
# !pip install -U langsmith langchain-teddynote
```

```python
# Configuration file for managing the API KEY as an environment variable
from dotenv import load_dotenv

# Load the API KEY information
load_dotenv()
```

True

```python
# Set up LangSmith tracing. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter the project name.
logging.langsmith("CH16-Evaluations")
```

Define functions for RAG performance testing
We will create a RAG system to use for testing.
Create a function named ask_question. It receives a dictionary inputs as input and returns a dictionary answer.
We also define a helper function that prints the evaluator's prompt so it can be inspected. A sketch of ask_question is shown below.
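The sketch below shows one way such a pipeline could look, assuming an OpenAI chat model and a small in-memory FAISS index; the placeholder texts, model name, and prompt are assumptions for illustration, not the corpus the tutorial actually uses.

```python
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Placeholder corpus; replace with your own loader/splitter output.
texts = [
    "LangSmith provides datasets, experiments, and off-the-shelf evaluators.",
    "LLM-as-Judge uses an LLM with a grading prompt to score another LLM's answers.",
]
retriever = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def ask_question(inputs: dict) -> dict:
    """Receive a dictionary `inputs` and return a dictionary `answer`."""
    question = inputs["question"]
    docs = retriever.invoke(question)
    context = "\n".join(doc.page_content for doc in docs)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    return {"answer": answer}
```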
Question-Answer Evaluator
This is the evaluator with the most basic functionality. It evaluates the question (Query) and the answer (Answer).
The user input is defined as input, the answer generated by the LLM as prediction, and the ground-truth answer as reference.
(In the evaluator's prompt, however, the variables are named query, result, and answer.)

- query: the question
- result: the LLM's answer
- answer: the ground-truth answer
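A minimal sketch of creating the "qa" evaluator and running the evaluation, assuming a LangSmith dataset whose inputs contain a question field; the dataset name RAG_EVAL_DATASET and the experiment prefix are placeholders.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Create the basic Question-Answer evaluator.
qa_evaluator = LangChainStringEvaluator("qa")

dataset_name = "RAG_EVAL_DATASET"  # placeholder dataset name

experiment_results = evaluate(
    ask_question,                   # target function to evaluate
    data=dataset_name,              # LangSmith dataset to evaluate on
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "QA evaluator"},
)
```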
Run the evaluation, then open the URL in the output to check the results.

Context-based Answer Evaluator
LangChainStringEvaluator("context_qa"): Instruct the LLM chain to use the reference "context" to determine its accuracy.LangChainStringEvaluator("cot_qa"):"cot_qa"has"context_qa"Similar to the evaluator, but differs in that it instructs you to use the'inference' of LLM before deciding on the final judgment. Reference
First, you need to define a function that also returns the Context: context_answer_rag_answer. A sketch is shown below.
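This sketch reuses the retriever, prompt, and llm from the earlier sketch; the returned keys (context, answer, question) follow the description in this section.

```python
def context_answer_rag_answer(inputs: dict) -> dict:
    """Return the retrieved context together with the generated answer."""
    question = inputs["question"]
    docs = retriever.invoke(question)
    context = "\n".join(doc.page_content for doc in docs)
    answer = (prompt | llm | StrOutputParser()).invoke(
        {"context": context, "question": question}
    )
    return {"context": context, "answer": answer, "question": question}
```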
Then create the LangChainStringEvaluator. When creating it, map the return values of the function defined above appropriately via prepare_data.
Details

- run: the results generated by the LLM (context, answer, input)
- example: the data defined in the dataset (question and answer)
For LangChainStringEvaluator to perform the evaluation, the following three pieces of information are needed:

- prediction: the answer generated by the LLM
- reference: the answer defined in the dataset
- input: the question defined in the dataset

However, LangChainStringEvaluator("context_qa") uses reference as the Context, so it is defined differently. (Note: to use the context_qa evaluator, the function above was defined to return context, answer, and question.) A sketch of the prepare_data mapping follows.
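A sketch of the mapping and the two evaluators, assuming the run outputs and dataset fields named above (answer, context, question); adjust the keys to match your own dataset.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate


def prepare_context_data(run, example):
    # Map the run outputs / dataset example onto what the evaluator expects.
    return {
        "prediction": run.outputs["answer"],   # answer generated by the LLM
        "reference": run.outputs["context"],   # retrieved context used as the reference
        "input": example.inputs["question"],   # question defined in the dataset
    }


context_qa_evaluator = LangChainStringEvaluator(
    "context_qa", prepare_data=prepare_context_data
)
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa", prepare_data=prepare_context_data
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,  # placeholder dataset name from earlier
    evaluators=[context_qa_evaluator, cot_qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "context_qa and cot_qa evaluators"},
)
```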
Run the evaluation, then open the URL in the output to check the results.

Note the evaluation results: even when the generated answer does not match the given Ground Truth, it is still graded CORRECT as long as it is correct with respect to the Context.
Criteria
When there is no reference label (ground-truth answer), or it is difficult to obtain one, you can use the "criteria" or "score" evaluators to evaluate a run against a set of custom criteria.
This is useful when you want to monitor high-level semantic aspects of the model's answers: LangChainStringEvaluator("criteria", config={"criteria": "one of the criteria below"}). A sketch follows the table.
| Criterion | Description |
| --- | --- |
| conciseness | Evaluates whether the answer is concise and simple |
| relevance | Evaluates whether the answer is relevant to the question |
| correctness | Evaluates whether the answer is correct |
| coherence | Evaluates whether the answer is coherent |
| harmfulness | Evaluates whether the answer is harmful or dangerous |
| maliciousness | Evaluates whether the answer is malicious or intended to cause harm |
| helpfulness | Evaluates whether the answer is helpful |
| controversiality | Evaluates whether the answer is controversial |
| misogyny | Evaluates whether the answer is misogynistic |
| criminality | Evaluates whether the answer promotes criminal activity |
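A minimal sketch that runs two of the criteria above against the ask_question function; conciseness and relevance are arbitrary choices here, and no reference answer is required.

```python
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Unlabeled criteria evaluators: no ground-truth answer is needed.
criteria_evaluators = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
]

evaluate(
    ask_question,
    data=dataset_name,  # placeholder dataset name from earlier
    evaluators=criteria_evaluators,
    experiment_prefix="CRITERIA_EVAL",
    metadata={"variant": "criteria evaluators"},
)
```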

Using an evaluator when a ground-truth answer exists (labeled_criteria)
When a ground-truth answer exists, the LLM can evaluate by comparing the generated answer against it.
As in the example below, the ground-truth answer is passed as reference and the answer generated by the LLM as prediction.
This mapping is defined separately through prepare_data.
In addition, the LLM used to evaluate the answers is defined through llm in config.
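A sketch of a labeled_criteria evaluator, assuming the dataset's outputs contain an answer field; the helpfulness criterion text and the model name are illustrative assumptions.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator, evaluate

labeled_criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user, "
                "taking into account the reference answer?"
            )
        },
        # LLM used to judge the answers.
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],     # answer generated by the LLM
        "reference": example.outputs["answer"],  # ground-truth answer from the dataset
        "input": example.inputs["question"],
    },
)
```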
Below is an example that evaluates relevance.
This time, the context is passed as reference through prepare_data.
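A sketch of the relevance variant; the only change from the previous sketch is that reference is mapped to the retrieved context instead of the ground-truth answer.

```python
relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],  # context instead of the ground-truth answer
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_criteria_evaluator, relevance_evaluator],
    experiment_prefix="LABELED_CRITERIA_EVAL",
    metadata={"variant": "labeled_criteria evaluators"},
)
```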
Run the evaluation, then open the URL in the output to check the results.

Custom score Evaluator (labeled_score_string)
Below is an example of creating an evaluator that returns a score. The score can be normalized with normalize_by; the normalized score is a value between 0 and 1.
The accuracy criterion below is an arbitrary, user-defined criterion. You can use it by defining a suitable prompt.
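A sketch of a labeled_score_string evaluator with a user-defined accuracy criterion; the 1-10 score is divided by normalize_by=10 so the reported value falls between 0 and 1. The criterion wording, model name, and field names are illustrative assumptions that follow the dataset layout used above.

```python
from langchain_openai import ChatOpenAI
from langsmith.evaluation import LangChainStringEvaluator, evaluate

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": (
                "How accurate is this prediction compared to the reference answer? "
                "Give a score between 1 and 10."
            )
        },
        "normalize_by": 10,  # rescale the score to the 0-1 range
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED_SCORE_EVAL",
    metadata={"variant": "labeled_score_string evaluator"},
)
```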
Run the evaluation, then open the URL in the output to check the results.
