Evaluate LLMs in Python with DeepEval

By NeuralNine


Key Concepts

  • DeepEval: An open-source LLM evaluation framework that uses LLMs as judges.
  • LLM as a Judge: Using large language models to evaluate the outputs of other large language models based on specific criteria.
  • Metrics: Criteria used to evaluate LLM outputs (e.g., correctness, answer relevancy, faithfulness, professionalism, conciseness).
  • Test Cases: Specific inputs and expected outputs used to test LLM performance.
  • Evaluation Datasets: Collections of test cases used for comprehensive LLM evaluation.
  • Golden/Ground Truth: Accurate, verified data used as a standard for comparison in evaluation.
  • Threshold: A minimum score required for a test case to pass.
  • Answer Relevancy: A metric measuring the proportion of relevant statements in an LLM's output.
  • Faithfulness: A metric measuring how well an LLM's output aligns with the provided context, regardless of the output's actual truthfulness.
  • ConversationalGEval: A metric used to evaluate the quality of a multi-turn conversation between a user and an LLM assistant.

Evaluating Large Language Models with DeepEval

Introduction to DeepEval

DeepEval is an open-source framework for evaluating the output of large language models (LLMs) in Python. It uses the "LLM as a judge" approach, where a second LLM scores the output of the model under test against predefined metrics. This is particularly useful for open-ended answers, where strict formulaic comparisons such as exact string matching are inadequate.

Setting Up DeepEval

  1. Project Initialization: Create a new project directory (e.g., using uv init).
  2. Installation: Install the deepeval package using uv pip install deepeval or pip install deepeval.
  3. API Key Configuration: Obtain an OpenAI API key and store it in a .env or .env.local file in the format OPENAI_API_KEY=YOUR_API_KEY. DeepEval automatically detects and uses this key.
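    For reference, a minimal .env file might contain nothing more than the key itself (the value shown is a placeholder):
    OPENAI_API_KEY=sk-your-key-here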

Basic Example: Correctness Metric

  1. Import Necessary Modules:
    from deepeval import assert_test
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams
    from deepeval.metrics import GEval
    
  2. Define the Correctness Metric:
    correctness_metric = GEval(
        name="correctness",
        criteria="Check if the actual output is exactly the same as the expected output. If not, return 0; otherwise return 1.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    
    • GEval is a general evaluation metric that can be customized with specific criteria.
    • criteria defines the evaluation logic.
    • evaluation_params specifies the parameters to be compared.
    • threshold sets the minimum acceptable score for the test to pass.
  3. Create a Test Case:
    llm_test_case = LLMTestCase(
        input="What is 5 / 2?",
        expected_output="2.5",
        actual_output="The result is 2.5"
    )
    
    • LLMTestCase defines the input, expected output, and actual output for a single test.
  4. Assert the Test:
    assert_test(test_case=llm_test_case, metrics=[correctness_metric])
    
    • assert_test runs the evaluation and asserts that the score meets the defined threshold.
  5. Run the Test: Execute the test file with deepeval test run test_1_simple.py.
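
Putting the pieces together, a complete test file might look like the sketch below (the file and function names are illustrative; deepeval's test runner is built on pytest, so the assertion is wrapped in a test_ function):

    # test_1_simple.py -- illustrative sketch combining the snippets above
    from deepeval import assert_test
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams
    from deepeval.metrics import GEval

    correctness_metric = GEval(
        name="correctness",
        criteria="Check if the actual output is exactly the same as the expected output. If not, return 0; otherwise return 1.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    def test_correctness():
        # One test case: the judge LLM compares the actual output against the expected output.
        llm_test_case = LLMTestCase(
            input="What is 5 / 2?",
            expected_output="2.5",
            actual_output="The result is 2.5"
        )
        assert_test(test_case=llm_test_case, metrics=[correctness_metric])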

Conversational Evaluation: Professionalism Metric

  1. Import Necessary Modules:
    from deepeval.test_case import ConversationalTestCase, Turn, TurnParams
    from deepeval.metrics import ConversationalGEval
    from deepeval import evaluate
    
  2. Define the Professionalism Metric:
    professionalism_metric = ConversationalGEval(
        name="professionalism",
        criteria="Determine whether the assistant answered the questions of the user in a professional and polite manner."
    )
    
  3. Create Conversational Test Cases:
    conversation_example_1 = ConversationalTestCase(
        turns=[
            Turn(role="user", content="Is Python an interpreted language?"),
            Turn(role="assistant", content="Of course, how could you not know that?"),
            Turn(role="user", content="What about C++?"),
            Turn(role="assistant", content="Damn, you really know nothing. C++ is compiled, man.")
        ]
    )
    
    conversation_example_2 = ConversationalTestCase(
        turns=[
            Turn(role="user", content="Is Python an interpreted language?"),
            Turn(role="assistant", content="Yes, Python is an interpreted language."),
            Turn(role="user", content="What about C++?"),
            Turn(role="assistant", content="C++ is a compiled language.")
        ]
    )
    
    • ConversationalTestCase defines a multi-turn conversation between a user and an assistant.
  4. Evaluate the Test Cases:
    test_cases = [conversation_example_1, conversation_example_2]
    metrics = [professionalism_metric]
    evaluate(test_cases=test_cases, metrics=metrics)
    
    • evaluate runs the evaluation on a list of test cases using the specified metrics.
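
A condensed, self-contained version of this example (the script name is illustrative; evaluate can be called from a plain Python script and prints the results to the console):

    # professionalism_eval.py -- illustrative, condensed to one turn pair per conversation
    from deepeval import evaluate
    from deepeval.test_case import ConversationalTestCase, Turn
    from deepeval.metrics import ConversationalGEval

    professionalism_metric = ConversationalGEval(
        name="professionalism",
        criteria="Determine whether the assistant answered the questions of the user in a professional and polite manner."
    )

    rude = ConversationalTestCase(turns=[
        Turn(role="user", content="Is Python an interpreted language?"),
        Turn(role="assistant", content="Of course, how could you not know that?"),
    ])
    polite = ConversationalTestCase(turns=[
        Turn(role="user", content="Is Python an interpreted language?"),
        Turn(role="assistant", content="Yes, Python is an interpreted language."),
    ])

    # The rude conversation should score low, the polite one high.
    evaluate(test_cases=[rude, polite], metrics=[professionalism_metric])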

RAG Evaluation: Answer Relevancy and Faithfulness Metrics

  1. Import Necessary Metrics:
    from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
    
  2. Answer Relevancy Metric:
    • Measures the proportion of relevant statements in the LLM's output.
    • Defined as: Number of relevant statements / Total number of statements.
    metric = AnswerRelevancyMetric(threshold=0.5, include_reason=True)
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="This is a good question. The capital of France is Paris. It is one of the most beautiful cities in Europe. You should check it out.",
    )
    
    In this example, the expected score is 0.25 because only one out of four statements is relevant.
  3. Faithfulness Metric:
    • Measures how well the LLM's output aligns with the provided context.
    • Truth is defined by the context, regardless of real-world accuracy.
    metric = FaithfulnessMetric(threshold=0.5, include_reason=True)
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Madrid",
        retrieval_context=["The capital of France is Madrid."]
    )
    
    In this example, the score should be 1.0 because the output ("Madrid") is consistent with the provided context.
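
Both metrics can also be run outside the test runner by calling measure directly on the metric object; a minimal sketch reusing the answer relevancy example above (an OpenAI API key is assumed to be configured):

    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    metric = AnswerRelevancyMetric(threshold=0.5, include_reason=True)
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="This is a good question. The capital of France is Paris. It is one of the most beautiful cities in Europe. You should check it out."
    )

    # measure() calls the judge LLM and stores the result on the metric object.
    metric.measure(test_case)
    print(metric.score)   # expected around 0.25 (1 of 4 statements is relevant)
    print(metric.reason)  # natural-language justification, since include_reason=True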

Working with Datasets

  1. Import Necessary Modules:
    from deepeval.dataset import EvaluationDataset, Golden
    
  2. Define Golden Objects:
    goldens = [
        Golden(input="What is the capital of France?", expected_output="Paris"),
        Golden(input="What is 12 * 3?", expected_output="36")
    ]
    
    • Golden objects represent ground truth data with input and expected output.
  3. Create an Evaluation Dataset:
    dataset = EvaluationDataset(goldens=goldens)
    
    • EvaluationDataset is a collection of Golden objects.
  4. Simulate LLM Answers (Example):
    def simulate_llm_answer(prompt):
        if prompt == "What is the capital of France?":
            return "The capital is Paris"
        elif prompt == "What is 12 * 3?":
            return "36"
    
  5. Create and Evaluate Test Cases:
    test_cases = []
    for golden in dataset.goldens:
        test_cases.append(LLMTestCase(
            input=golden.input,
            expected_output=golden.expected_output,
            actual_output=simulate_llm_answer(golden.input)
        ))
    
    evaluate(test_cases=test_cases, metrics=[correctness_metric])
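
Assembled into a single runnable file (the file name is illustrative, and the correctness metric mirrors the GEval metric from the basic example with a slightly relaxed criteria as an assumption):

    # test_dataset.py -- illustrative sketch of the dataset-driven evaluation
    from deepeval import evaluate
    from deepeval.dataset import EvaluationDataset, Golden
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams
    from deepeval.metrics import GEval

    correctness_metric = GEval(
        name="correctness",
        criteria="Check whether the actual output contains the expected answer.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )

    goldens = [
        Golden(input="What is the capital of France?", expected_output="Paris"),
        Golden(input="What is 12 * 3?", expected_output="36")
    ]
    dataset = EvaluationDataset(goldens=goldens)

    def simulate_llm_answer(prompt):
        # Stand-in for a real model call.
        answers = {
            "What is the capital of France?": "The capital is Paris",
            "What is 12 * 3?": "36"
        }
        return answers.get(prompt, "")

    test_cases = [
        LLMTestCase(
            input=golden.input,
            expected_output=golden.expected_output,
            actual_output=simulate_llm_answer(golden.input)
        )
        for golden in dataset.goldens
    ]

    evaluate(test_cases=test_cases, metrics=[correctness_metric])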
    

Real-World Example: Invoice Data Extraction

  1. Scenario: Extract data from an invoice PDF and compare it to a ground truth CSV file.
  2. Challenge: The invoice contains a note indicating that all prices and taxes are in thousands, which the LLM must correctly interpret.
  3. Implementation:
    • Use pdfplumber to extract text from the invoice.
    • Use an OpenAI client to fill a Pydantic model with the extracted data.
    • Evaluate the accuracy of the filled model using a custom GEval metric.
  4. Findings: The OpenAI models tested (including o-series and mini variants) struggled to correctly interpret the "thousands" note, resulting in inaccurate data extraction. DeepEval accurately calculated the percentage of correctly filled fields.
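
The video's exact code is not reproduced here; a rough, hypothetical sketch of such a pipeline (file name, model name, Pydantic fields, and prompt wording are all assumptions) could look like:

    # invoice_extraction_sketch.py -- hypothetical outline of the described pipeline
    import pdfplumber
    from pydantic import BaseModel
    from openai import OpenAI

    class Invoice(BaseModel):
        # Hypothetical fields; the real schema comes from the ground-truth CSV.
        invoice_number: str
        total_price: float
        tax: float

    def extract_invoice_text(path: str) -> str:
        # Pull the raw text (including the "all values in thousands" note) from the PDF.
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)

    def fill_invoice_model(text: str) -> Invoice:
        client = OpenAI()
        # Structured-output call that fills the Pydantic model; the model name is a placeholder.
        response = client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract the invoice fields from the provided text."},
                {"role": "user", "content": text},
            ],
            response_format=Invoice,
        )
        return response.choices[0].message.parsed

    # A custom GEval metric can then compare the filled model (actual_output)
    # against the ground-truth CSV row (expected_output), e.g. with the criteria
    # "score the fraction of fields whose values match exactly".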

Conclusion

DeepEval provides a flexible and powerful framework for evaluating LLMs using LLMs as judges. It supports a range of metrics, test case types, and evaluation datasets, and can be used to compare models, identify areas for improvement, and automate the evaluation process. The real-world example demonstrates the importance of careful evaluation and the potential pitfalls of extracting structured data from unstructured documents. The speaker uses DeepEval in a production environment to compare different models for a specific task, highlighting its practical utility.
