Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

By AI Engineer

AITechnologyBusiness
Share:

Key Concepts

  • Large Language Models (LLMs) in production
  • Evaluations and Benchmarks for LLMs
  • Inference at scale (performance, reliability, affordability)
  • Bias and discrimination in LLMs
  • Knowledge cut-off limitation
  • Retrieval-Augmented Generation (RAG) systems
  • Agent systems
  • Model evaluation pyramid (system performance, formatting, factual accuracy, safety, bias, custom evaluations)
  • Benchmarking (latency, throughput, MMLU)
  • Guide LLM
  • VLLM inference runtime
  • MLEval harness
  • PromptFu

Setting up Generative AI Tech in Production: Challenges

  • Complexity: Setting up generative AI to be scalable, reliable, and safe is challenging due to the creative nature of the technology.
  • Incremental Approach: Enterprises should take an incremental approach to implementing GenAI, starting with automating tasks and chatbots before moving to more complex systems like RAG and agents.
  • Drawbacks of GenAI Models:
    • Policy Restrictions: Companies often restrict the AI tools developers can use.
    • Legal Exposures and Risks: Models can generate inappropriate or harmful content.
    • Bias and Discrimination: Models trained on internet data can be skewed due to the eurocentric and US-based nature of the data.
    • Cost and Performance: Running GenAI models at scale in production can be expensive, and performance (throughput, latency) needs to be considered.
    • Knowledge Cut-off: Models have a knowledge cut-off date, requiring RAG systems to access up-to-date information.

Inference at Scale

  • Bottleneck: Traditional inference runtimes cannot efficiently handle the load of concurrent user requests.
  • Purpose-Built Inference Engines: Inference runtimes like TRT-LLM, and VLLM are needed for scale.
  • Pain Points:
    • Evaluating model inference performance under enterprise-level workload scenarios is complicated.
    • It requires manual setup of evaluation runs with various parameters.
    • The compute load for performance evaluations is taxing.
    • Data sets used for benchmarking must be compatible with the models.
    • Resource optimization and identifying sizing for efficient hardware use is a challenge.
    • Cost estimation is difficult and requires backwards math mapping inference performance to tokens.

Examples of Challenges

  • Stable Diffusion Bias: Most data is eurocentric, leading to bias in image generation.
  • The Glue Incident: An AI overview tool used satirical information from Reddit without proper mitigation, leading to incorrect suggestions.
  • Synthetic Data: Each generation of AI models consumes more AI-generated data, leading to a loss of output diversity and precision.

Preventing Issues at Scale

  • Evaluation and Benchmarking: Evaluation is a comprehensive process to assess a model end-to-end, while benchmarking is a specific comparison of models using controlled data sets and tasks.
  • Managing Risk: It's crucial to manage risk for customers in production environments.
  • Credibility: Model failures can damage a company's credibility.
  • Continuous Improvement: Evaluation frameworks need continuous improvement through a CI process.
  • System-Specific Evaluations: The type of system (RAG, agents) determines the specific metrics to evaluate.
  • Incremental Approach: Start with evaluating specific components (e.g., chunk retrieval in RAG) and then expand to a full system evaluation.

Model Evaluation Pyramid

  • Base Layer: System Performance: Ensure fast throughput, ability to handle concurrent users, and efficient GPU utilization.
  • Formatting: Ensure the model outputs data in the required format (e.g., JSON).
  • Factual Accuracy: Evaluate the model's accuracy on various subjects using benchmarks like MMLU.
  • Safety and Bias: Implement custom evaluations specific to the application.

Hands-on Activities

  1. System Performance:
    • Using Guide LLM to benchmark latency and throughput.
    • Adjusting input and output tokens based on the use case.
    • Selecting a model and data set to test.
    • Visualizing results in the Guide LLM UI.
  2. Factual Accuracy:
    • Using MMLU Pro with the MLEval harness.
    • Customizing evaluations with proprietary data.
  3. Safety Evaluation:
    • Using PromptFu to customize and run safety evaluations.
    • Exploring examples in the PromptFu repository.

Guide LLM

  • A project associated with the VLLM inference runtime.
  • Used for system performance benchmarks like latency and throughput.
  • Allows users to select a model and data set, adjust input/output tokens, and visualize results.

VLLM

  • An inference runtime that is compatible with safe tensor formats.
  • Requires fewer configuration steps than TRT-LLM.
  • Offers various configuration options for adjusting performance.

MLEval Harness

  • A framework that allows users to run various evaluation benchmarks.
  • Can be used to customize accuracy evaluations with proprietary data.

PromptFu

  • A tool that allows users to customize and run their own evaluations.
  • Offers a variety of examples for different types of evaluations.

CI/CD and Evaluation

  • Implement a CI/CD framework that includes evaluation tests, similar to unit testing in software engineering.

Conclusion

Setting up generative AI in production requires careful consideration of various challenges, including policy restrictions, legal risks, bias, cost, and knowledge cut-off. Implementing a robust evaluation framework is crucial for managing risk, maintaining credibility, and continuously improving model performance. The model evaluation pyramid provides a structured approach to evaluating different aspects of the system, starting with system performance and moving up to factual accuracy, safety, and custom evaluations. Tools like Guide LLM, VLLM, MLEval harness, and PromptFu can be used to implement these evaluations.

Chat with this Video

AI-Powered

Hi! I can answer questions about this video "Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith". What would you like to know?

Chat is based on the transcript of this video and may not be 100% accurate.

Related Videos

Ready to summarize another video?

Summarize YouTube Video