Make your LLM app a Domain Expert: How to Build an Expert System — Christopher Lovejoy, Anterior

By AI Engineer

Tags: AI, Business, Healthcare

Key Concepts:

  • Domain-native LLM applications
  • Last mile problem
  • Adaptive Domain Intelligence Engine
  • Failure mode ontology
  • Domain expert product manager
  • False approvals
  • Medical necessity reviews
  • Production data iteration
  • Domain knowledge addition

1. Introduction and Background (Christopher Lovejoy):

  • Christopher Lovejoy, a medical doctor turned AI engineer, presents a playbook for building domain-native LLM applications.
  • He has experience building AI systems that incorporate medical domain expertise at startups such as Cera Care (which reached $500 million ARR) and currently at Anterior.
  • Anterior provides clinical reasoning tools for healthcare administration, serving health insurance providers covering 50 million lives in the US.
  • The core argument: for vertical AI applications, the system you build for incorporating domain insights matters more than the sophistication of your models. The key is giving the model the right context and iterating quickly with customers.

2. The Last Mile Problem:

  • The primary challenge in applying LLMs to specialized industries is the "last mile problem": giving the AI system context and understanding of the specific workflow for a customer or industry.
  • Example: A 78-year-old female patient with right knee pain recommended for knee arthroscopy. The AI needs to determine if there's documentation of unsuccessful conservative therapy for at least six weeks.
    • Complexity: Defining "conservative therapy" (physiotherapy, weight loss, medication – ambiguous), "unsuccessful" (partial vs. full resolution of symptoms), and "documentation" (explicit vs. inferred).
  • The system is more important than the model itself: the company that wins is the one that builds the best system for capturing domain insights, translating them into context for the pipeline, and iterating quickly on improvements (a sketch of such a decomposition follows below).
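
The decomposition itself can be made explicit in the pipeline. Below is a minimal sketch, assuming a hypothetical schema (none of these class or field names come from the talk), of how an ambiguous guideline criterion might be broken into sub-questions, with domain-expert definitions attached as context for the model:

```python
from dataclasses import dataclass, field

# Hypothetical decomposition of the guideline criterion "documented
# failure of conservative therapy for at least six weeks". Each
# ambiguous term becomes an explicit sub-question the pipeline must
# answer from the medical record, with the domain-expert definition
# supplied as context.

@dataclass
class SubQuestion:
    term: str              # the ambiguous term being pinned down
    question: str          # what the model must answer from the record
    domain_context: str    # domain-expert definition injected as context

@dataclass
class Criterion:
    name: str
    sub_questions: list[SubQuestion] = field(default_factory=list)

knee_arthroscopy = Criterion(
    name="failed conservative therapy >= 6 weeks",
    sub_questions=[
        SubQuestion(
            term="conservative therapy",
            question="Which conservative therapies (if any) were tried?",
            domain_context="Includes physiotherapy, weight loss, medication.",
        ),
        SubQuestion(
            term="unsuccessful",
            question="Did symptoms fully resolve, partially resolve, or persist?",
            domain_context="Partial resolution still counts as unsuccessful.",
        ),
        SubQuestion(
            term="documentation",
            question="Is the failure stated explicitly, or only inferable?",
            domain_context="Prefer explicit statements; flag inferred ones.",
        ),
    ],
)
```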

3. Performance Saturation and the Need for Domain Context:

  • Models can reach a baseline performance (around 95% accuracy in Anterior's case for approving care requests).
  • Iterating with the "Adaptive Domain Intelligence Engine" raised accuracy to 99%.
  • Models reason well to a baseline, but achieving the final mile of performance requires providing domain-specific context.

4. Adaptive Domain Intelligence Engine:

  • This engine converts customer-specific domain insights into performance improvements.
  • Two main parts: measurement and improvement.

5. Measurement of Domain-Specific Performance:

  • Define Key Metrics: Identify what users care about most. In healthcare, customers prioritize minimizing false approvals (approving care that isn't needed).
  • Collaboration: Defining metrics should involve domain experts and customers.
  • Failure Mode Ontology: Identify all the ways the AI can fail.
    • Example: For medical necessity review, failure modes are categorized as medical record extraction, clinical reasoning, and rules interpretation, with subtypes within each.
    • Domain experts should lead this process.
  • Dashboard: A dashboard displays the patient's medical record, guidelines, and AI outputs (decision, reasoning). Domain experts mark decisions as correct/incorrect and define the failure mode.
  • Analysis: Correlate failure modes with the key metric (e.g., number of false approvals) to prioritize areas for improvement (see the sketch after this list).
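
A minimal sketch of this analysis step, assuming a hypothetical shape for a dashboard review record (the field names and failure-mode labels are illustrative, not Anterior's actual schema): each review carries the expert's verdict and a failure-mode tag, and false approvals are tallied per mode to rank what to fix first.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical shape of one domain-expert review from the dashboard:
# the expert marks the AI decision correct/incorrect and, when wrong,
# tags a failure mode from the ontology (extraction / clinical
# reasoning / rules interpretation, plus a subtype).

@dataclass
class Review:
    case_id: str
    ai_decision: str          # "approve" or "deny"
    correct: bool
    failure_mode: str | None  # e.g. "rules_interpretation/criteria_ambiguity"

def false_approvals_by_failure_mode(reviews: list[Review]) -> Counter:
    """Count false approvals (incorrect 'approve' decisions) per
    failure mode, so the worst-offending modes are fixed first."""
    counts: Counter = Counter()
    for r in reviews:
        if r.ai_decision == "approve" and not r.correct and r.failure_mode:
            counts[r.failure_mode] += 1
    return counts

reviews = [
    Review("c1", "approve", False, "medical_record_extraction/missed_note"),
    Review("c2", "approve", False, "rules_interpretation/criteria_ambiguity"),
    Review("c3", "deny", True, None),
    Review("c4", "approve", False, "rules_interpretation/criteria_ambiguity"),
]
for mode, n in false_approvals_by_failure_mode(reviews).most_common():
    print(f"{n:3d}  {mode}")
```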

6. Improvements with Domain-Specific Context:

  • Ready-made Datasets: Failure-mode labeling turns production data into ready-made datasets that are representative of real-world inputs.
  • Iteration: Engineers iterate against these datasets, tracking performance improvements for specific failure modes.
  • Domain Expert Involvement: Domain experts suggest changes to the application pipeline and provide new domain knowledge.
  • Domain Knowledge Addition: Domain experts can add domain knowledge through a dedicated button on the dashboard.
  • Data-Driven Evaluation: Domain-specific evals (failure-mode sets and generic eval sets) determine whether a domain knowledge suggestion actually improves performance.
  • Rapid Iteration: The process allows same-day fixes: add domain knowledge, prove its impact with evals, and deploy it live (see the sketch below).
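
A sketch of what that eval gate might look like, under stated assumptions: `Pipeline` stands in for the real application (a function from a case to an approve/deny decision), and none of these names are Anterior's actual API. The gate ships a knowledge addition only if the targeted failure-mode set improves while the generic set holds steady.

```python
from typing import Callable

# Hypothetical eval gate for a domain-knowledge suggestion.
Pipeline = Callable[[dict], str]

def accuracy(pipeline: Pipeline, dataset: list[dict]) -> float:
    """Fraction of cases where the pipeline's decision matches the
    domain expert's label."""
    return sum(pipeline(case) == case["label"] for case in dataset) / len(dataset)

def knowledge_addition_helps(
    baseline: Pipeline,
    with_knowledge: Pipeline,     # same pipeline, new domain knowledge injected
    failure_set: list[dict],      # cases labeled with the targeted failure mode
    regression_set: list[dict],   # generic eval set guarding overall quality
) -> bool:
    """Ship only if the targeted failure mode improves and the generic
    eval set does not regress."""
    improves = accuracy(with_knowledge, failure_set) > accuracy(baseline, failure_set)
    holds = accuracy(with_knowledge, regression_set) >= accuracy(baseline, regression_set)
    return improves and holds
```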

7. Putting It All Together:

  • Overall Flow: The production application generates AI outputs; domain experts review them and produce performance insights (metrics, failure modes); the domain expert PM prioritizes work based on this information.
  • Engineer Iteration: Engineers work to fix specific failure modes up to a defined performance threshold, using failure mode datasets and evals.
  • PM Decision: The PM decides whether to deploy changes to production based on eval metrics and the wider context (see the sketch below).
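
Read as code, the whole loop is short. A sketch with every stage abstracted as a function, since each one (expert review, PM prioritization, engineering iteration, eval gating, deployment) is its own system in practice; all names here are hypothetical:

```python
from typing import Callable

# Hypothetical glue for the overall flow; each stage is passed in as a
# function so the loop stays self-contained.
def improvement_cycle(
    review: Callable[[], list[dict]],          # domain experts label production outputs
    prioritize: Callable[[list[dict]], str],   # domain expert PM picks a failure mode
    iterate: Callable[[str], object],          # engineers fix it against the failure set
    passes_evals: Callable[[object], bool],    # failure-set + regression evals (section 6)
    deploy: Callable[[object], None],          # PM ships the change to production
) -> None:
    reviews = review()
    target_mode = prioritize(reviews)
    candidate = iterate(target_mode)
    if passes_evals(candidate):
        deploy(candidate)
```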

8. Conclusion:

  • Building a domain-native LLM application requires solving the last mile problem, which isn't just about more powerful models.
  • An "Adaptive Domain Intelligence Engine" is needed.
  • Domain experts power the system by reviewing AI outputs, generating metrics, failure modes, and suggested improvements.
  • This creates a self-improving, data-driven process managed by a domain expert PM.

9. Notable Quotes:

  • "When it comes to vertical AI applications, the system that you build for incorporating your domain insights is far more important than the sophistication of your models and your pipelines."
  • "The models reason very well, they get to a great baseline. But if you're in an industry where you really need to ek out that like final mile of performance, you need to be able to then kind of give the model give the pipeline that context."

10. Technical Terms and Concepts:

  • LLM (Large Language Model): A type of AI model trained on vast amounts of text data, capable of generating human-like text, translating languages, and answering questions.
  • ARR (Annual Recurring Revenue): A measure of the revenue that a company expects to receive from its recurring subscriptions in a year.
  • Medical Necessity Review: The process of evaluating whether a requested medical service or procedure is appropriate and necessary for a patient's condition.
  • False Approval: Incorrectly approving a medical service or procedure that is not medically necessary.
  • Failure Mode Ontology: A structured classification of the different ways in which an AI system can fail.
  • Domain Expert: An individual with extensive knowledge and experience in a specific field or industry.
  • Eval Sets: Datasets used to evaluate the performance of an AI model or system.
  • Vertical AI Applications: AI applications designed for specific industries or domains.

11. Logical Connections:

  • The presentation starts by establishing the importance of domain expertise in AI.
  • It then introduces the "last mile problem" as the key challenge in applying LLMs to specialized industries.
  • The "Adaptive Domain Intelligence Engine" is presented as a solution to this problem, with its two main components: measurement and improvement.
  • The measurement section focuses on defining key metrics and creating a failure mode ontology.
  • The improvement section describes how to use failure mode data to iterate on the AI system and involve domain experts in the process.
  • Finally, the presentation concludes by summarizing the overall flow and emphasizing the importance of a domain expert PM.
