Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop)

Key Concepts

Large Language Models (LLMs) in production
Evaluations and Benchmarks for LLMs
Inference at scale (performance, reliability, affordability)
Bias and discrimination in LLMs
Knowledge cut-off limitation
Retrieval-Augmented Generation (RAG) systems
Agent systems
Model evaluation pyramid (system performance, formatting, factual accuracy, safety, bias, custom evaluations)
Benchmarking (latency, throughput, MMLU)
Guide LLM
VLLM inference runtime
MLEval harness
PromptFu

Setting up Generative AI Tech in Production: Challenges

Complexity: Setting up generative AI to be scalable, reliable, and safe is challenging due to the creative nature of the technology.
Incremental Approach: Enterprises should take an incremental approach to implementing GenAI, starting with automating tasks and chatbots before moving to more complex systems like RAG and agents.
Drawbacks of GenAI Models:
- Policy Restrictions: Companies often restrict the AI tools developers can use.
- Legal Exposures and Risks: Models can generate inappropriate or harmful content.
- Bias and Discrimination: Models trained on internet data can be skewed due to the eurocentric and US-based nature of the data.
- Cost and Performance: Running GenAI models at scale in production can be expensive, and performance (throughput, latency) needs to be considered.
- Knowledge Cut-off: Models have a knowledge cut-off date, requiring RAG systems to access up-to-date information.

Inference at Scale

Bottleneck: Traditional inference runtimes cannot efficiently handle the load of concurrent user requests.
Purpose-Built Inference Engines: Inference runtimes like TRT-LLM, and VLLM are needed for scale.
Pain Points:
- Evaluating model inference performance under enterprise-level workload scenarios is complicated.
- It requires manual setup of evaluation runs with various parameters.
- The compute load for performance evaluations is taxing.
- Data sets used for benchmarking must be compatible with the models.
- Resource optimization and identifying sizing for efficient hardware use is a challenge.
- Cost estimation is difficult and requires backwards math mapping inference performance to tokens.

Examples of Challenges

Stable Diffusion Bias: Most data is eurocentric, leading to bias in image generation.
The Glue Incident: An AI overview tool used satirical information from Reddit without proper mitigation, leading to incorrect suggestions.
Synthetic Data: Each generation of AI models consumes more AI-generated data, leading to a loss of output diversity and precision.

Preventing Issues at Scale

Evaluation and Benchmarking: Evaluation is a comprehensive process to assess a model end-to-end, while benchmarking is a specific comparison of models using controlled data sets and tasks.
Managing Risk: It's crucial to manage risk for customers in production environments.
Credibility: Model failures can damage a company's credibility.
Continuous Improvement: Evaluation frameworks need continuous improvement through a CI process.
System-Specific Evaluations: The type of system (RAG, agents) determines the specific metrics to evaluate.
Incremental Approach: Start with evaluating specific components (e.g., chunk retrieval in RAG) and then expand to a full system evaluation.

Model Evaluation Pyramid

Base Layer: System Performance: Ensure fast throughput, ability to handle concurrent users, and efficient GPU utilization.
Formatting: Ensure the model outputs data in the required format (e.g., JSON).
Factual Accuracy: Evaluate the model's accuracy on various subjects using benchmarks like MMLU.
Safety and Bias: Implement custom evaluations specific to the application.

Hands-on Activities

System Performance:
- Using Guide LLM to benchmark latency and throughput.
- Adjusting input and output tokens based on the use case.
- Selecting a model and data set to test.
- Visualizing results in the Guide LLM UI.
Factual Accuracy:
- Using MMLU Pro with the MLEval harness.
- Customizing evaluations with proprietary data.
Safety Evaluation:
- Using PromptFu to customize and run safety evaluations.
- Exploring examples in the PromptFu repository.

Guide LLM

A project associated with the VLLM inference runtime.
Used for system performance benchmarks like latency and throughput.
Allows users to select a model and data set, adjust input/output tokens, and visualize results.

VLLM

An inference runtime that is compatible with safe tensor formats.
Requires fewer configuration steps than TRT-LLM.
Offers various configuration options for adjusting performance.

MLEval Harness

A framework that allows users to run various evaluation benchmarks.
Can be used to customize accuracy evaluations with proprietary data.

PromptFu

A tool that allows users to customize and run their own evaluations.
Offers a variety of examples for different types of evaluations.

CI/CD and Evaluation

Implement a CI/CD framework that includes evaluation tests, similar to unit testing in software engineering.

Conclusion

Setting up generative AI in production requires careful consideration of various challenges, including policy restrictions, legal risks, bias, cost, and knowledge cut-off. Implementing a robust evaluation framework is crucial for managing risk, maintaining credibility, and continuously improving model performance. The model evaluation pyramid provides a structured approach to evaluating different aspects of the system, starting with system performance and moving up to factual accuracy, safety, and custom evaluations. Tools like Guide LLM, VLLM, MLEval harness, and PromptFu can be used to implement these evaluations.

Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith