Lessons from Trillion Token Deployments at Fortune 500s — Alessandro Cappelli, Adaptive ML

By AI Engineer

Key Concepts

  • RL Ops (Reinforcement Learning Operations): The systematic process of building, evaluating, and deploying LLMs using reinforcement learning.
  • The Last Mile Myth: The misconception that reaching an MVP is the hard part, whereas the true challenge is the continuous refinement required to reach and maintain production-grade performance.
  • Tokenomics: The economic viability of an LLM use case based on the cost of tokens relative to the value generated.
  • LLM-as-a-Judge: Using a high-performing LLM to evaluate the output of a smaller, specialized model based on predefined rubrics.
  • Synthetic Data Pipeline: Using an environment and reward function to generate high-quality training trajectories, effectively creating data where none existed.
  • PPO (Proximal Policy Optimization): An RL algorithm that is hard to run in practice because, as described in the talk, it requires orchestrating four LLMs simultaneously.
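
The Tokenomics idea above reduces to simple arithmetic. The sketch below is illustrative only: the function names and every number (token volume, per-million-token prices, monthly value) are hypothetical assumptions, not figures from the talk.

```python
def monthly_token_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost in USD of serving a given monthly token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def is_viable(tokens_per_month: float, usd_per_million_tokens: float,
              value_per_month_usd: float) -> bool:
    """A use case is viable when the value generated exceeds the token bill."""
    return value_per_month_usd > monthly_token_cost(tokens_per_month, usd_per_million_tokens)


# Hypothetical workload: one trillion tokens per month.
TOKENS = 1e12
large_model_cost = monthly_token_cost(TOKENS, 10.0)  # assumed $10 per million tokens
small_model_cost = monthly_token_cost(TOKENS, 0.5)   # assumed $0.50 per million tokens

print(f"large model: ${large_model_cost:,.0f} per month")
print(f"small model: ${small_model_cost:,.0f} per month")
# With an assumed $2M of monthly value, only the cheaper model clears the bar.
print(is_viable(TOKENS, 10.0, 2_000_000), is_viable(TOKENS, 0.5, 2_000_000))
```

Under these assumed prices the same workload is viable on the small model and not on the large one, which is the trade-off the Tokenomics definition describes.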

1. The Challenge of Production-Grade GenAI

Alessandro Cappelli argues that 95% of GenAI pilots fail to reach production because they rely on Supervised Fine-Tuning (SFT) on instruction data or on system prompting. Neither method offers a mathematical, systematic way to address defects.

  • The Problem with SFT: Iterating over datasets is expensive and unsustainable for long-term maintenance.
  • The Problem with Prompting: Changing a system prompt to fix one defect often introduces new ones, creating a "whack-a-mole" scenario without a scientific way to monitor progress.
  • The Solution: Reinforcement Learning (RL) allows for continuous, automated refinement driven by real-world feedback and business metrics.

2. Advantages of RL-Enabled Models

RL allows enterprises to move away from massive, expensive proprietary models toward smaller, specialized models.

  • Cost Efficiency: Smaller models significantly reduce the cost of token consumption at scale.
  • Latency: Smaller models (e.g., 10B parameter range) meet the strict latency requirements (ideally < 0.3 seconds) necessary for real-time applications like speech-to-speech customer support.
  • Ownership: By training on proprietary business data, companies retain control over their models, preventing performance shifts caused by updates to third-party frontier models.

3. RL for Agents

Agents introduce higher complexity because they interact with databases and tools, leaving less room for error.

  • Environment Integration: RL is naturally suited for agents because it treats the business workflow as an "environment."
  • Synthetic Data Generation: When training data for agentic workflows is unavailable, RL can produce it. The model generates "trajectories" (sequences of actions) in the environment, a reward function scores them, and rejection sampling keeps only the high-reward trajectories to bootstrap the initial training.
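
The rejection-sampling step described above can be sketched as follows. This is a minimal illustration, not the speaker's pipeline: `toy_policy`, `toy_reward`, and the action names are hypothetical stand-ins for a real model and a real business environment.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[str]  # a sequence of tool calls (hypothetical action names)


def rejection_sample(
    generate: Callable[[str], Trajectory],
    reward: Callable[[str, Trajectory], float],
    tasks: List[str],
    samples_per_task: int = 8,
    threshold: float = 1.0,
) -> List[Tuple[str, Trajectory]]:
    """Sample trajectories and keep only those whose reward meets the threshold."""
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            trajectory = generate(task)
            if reward(task, trajectory) >= threshold:
                kept.append((task, trajectory))
    return kept


# Toy stand-ins for the model and the environment's reward function.
ACTIONS = ["lookup_order", "check_policy", "issue_refund", "escalate"]
_rng = random.Random(0)


def toy_policy(task: str) -> Trajectory:
    """Pretend agent: picks three actions at random."""
    return [_rng.choice(ACTIONS) for _ in range(3)]


def toy_reward(task: str, trajectory: Trajectory) -> float:
    """1.0 only if the order is looked up before a refund is issued."""
    if "lookup_order" in trajectory and "issue_refund" in trajectory:
        ordered = trajectory.index("lookup_order") < trajectory.index("issue_refund")
        return 1.0 if ordered else 0.0
    return 0.0


bootstrap_set = rejection_sample(toy_policy, toy_reward, ["refund request"], samples_per_task=50)
print(f"kept {len(bootstrap_set)} of 50 sampled trajectories")
```

The kept pairs form a synthetic dataset that could seed a first round of supervised training before any RL updates are run.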

4. Human-in-the-Loop and Reward Signals

Cappelli emphasizes that "Human-in-the-Loop" should not mean expensive, manual annotation campaigns.

  • Systematic Rewards: Use hard KPIs (e.g., "containment rate" in customer support) or technical checks (e.g., "does the code run?").
  • LLM-as-a-Judge: Humans define the rubrics and system prompts for an LLM judge, which then evaluates the model’s behavior against business guidelines. This reduces the human effort from weeks of annotation to hours of rubric design.
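
As a rough sketch of the two reward sources above, the code below pairs a technical check ("does the code run?") with a rubric scored by a judge. The `judge` argument is a hypothetical callable standing in for a real LLM call, and `toy_judge` is a keyword matcher used only so the example runs without a model. Executing untrusted code this way is unsafe outside a sandbox.

```python
from typing import Callable, List


def code_runs(source: str) -> bool:
    """Technical check: True if the snippet compiles and executes without raising.

    Assumption: a real pipeline would run this inside an isolated sandbox.
    """
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return True
    except Exception:
        return False


def rubric_score(response: str, rubric: List[str], judge: Callable[[str], str]) -> float:
    """Fraction of human-written rubric criteria that the judge marks as PASS."""
    passed = 0
    for criterion in rubric:
        question = (
            f"Criterion: {criterion}\n"
            f"Response: {response}\n"
            "Answer PASS or FAIL."
        )
        if judge(question).strip().upper().startswith("PASS"):
            passed += 1
    return passed / len(rubric)


def toy_judge(question: str) -> str:
    """Stand-in for an LLM judge: passes any response containing 'sorry'."""
    return "PASS" if "sorry" in question.lower() else "FAIL"


rubric = ["Apologizes for the inconvenience", "Uses a polite tone"]
print(code_runs("total = sum(range(10))"))
print(code_runs("1 / 0"))
print(rubric_score("Sorry for the delay, your order ships today.", rubric, toy_judge))
```

The human effort sits in writing the `rubric` list and the judge prompt, which matches the point that rubric design replaces large annotation campaigns.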

5. The Adaptive Engine

Adaptive ML provides an "RL Ops" platform designed to handle the technical difficulty of RL.

  • Holistic Approach: The platform integrates observation, training, and serving into one lifecycle.
  • Complexity Management: Implementing algorithms like PPO is difficult because it requires orchestrating four LLMs simultaneously. The Adaptive Engine abstracts this complexity by providing pre-built recipes, allowing teams to focus on defining business goals rather than low-level RL engineering.
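
The summary does not name the four models. In the commonly described PPO setup for LLMs they are a policy, a frozen reference model, a reward model, and a value model; treating that as the intended breakdown is an assumption. The sketch below shows, for a single token, where each model's output enters the computation. It is not the Adaptive Engine's implementation, and the `beta` and `clip_eps` defaults are illustrative.

```python
import math


def kl_shaped_reward(reward_score: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Reward model score minus a penalty for drifting from the reference model."""
    return reward_score - beta * (logp_policy - logp_reference)


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Where each of the four models contributes (all numbers are made up):
logp_old = -1.2          # policy, before the update
logp_new = -0.9          # policy, after a gradient step
logp_reference = -1.0    # frozen reference model
reward_score = 0.8       # reward model (or LLM judge)
value_estimate = 0.5     # value model's prediction of the expected reward

shaped = kl_shaped_reward(reward_score, logp_old, logp_reference)
advantage = shaped - value_estimate
print(ppo_clipped_objective(logp_new, logp_old, advantage))
```

Keeping all four models loaded, synchronized, and fed with consistent batches is the orchestration burden the bullet above refers to.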

6. Q&A: Handling Implicit Feedback

In response to a question about using production data (e.g., user acceptance of a completion), Cappelli explains a two-stage approach:

  1. Early Stage: Use a large, powerful model (e.g., 235B parameters) as an LLM judge to provide feedback.
  2. Production Stage: Once sufficient production data is collected, use that data to train a Reward Model. This allows the system to scale human feedback automatically, moving from manual oversight to active, automated training.
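
The production stage can be illustrated with a deliberately tiny reward model: logistic regression fitted to accept/reject signals. Production reward models are usually LLMs with a scalar output, so this is only a stand-in to show the idea. The two features and the logged examples are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Predicted probability that a user accepts the completion."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_reward_model(examples: List[Tuple[List[float], int]],
                       epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, accepted in examples:
            error = predict(weights, bias, features) - accepted
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Hypothetical production log: features are [is_concise, cites_policy],
# label is 1 if the user accepted the completion and 0 otherwise.
production_log = [
    ([1.0, 1.0], 1),
    ([0.0, 1.0], 1),
    ([1.0, 0.0], 0),
    ([0.0, 0.0], 0),
]

weights, bias = train_reward_model(production_log)
print(predict(weights, bias, [1.0, 1.0]))  # high: resembles accepted completions
print(predict(weights, bias, [1.0, 0.0]))  # low: resembles rejected completions
```

Once trained, the model's predicted acceptance probability can serve as the reward signal in place of the large judge used in the early stage.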

Synthesis

The core takeaway is that reinforcement learning is the "missing link" for industrializing GenAI. While SFT and prompting are sufficient for demos, they fail to provide the systematic, iterative improvement required for production. By leveraging RL, enterprises can build smaller, faster, and cheaper models that are continuously refined by business-specific reward signals, ultimately solving the "last mile" problem of GenAI deployment.
