SE Radio 661: Sunil Mallya on Small Language Models
By IEEEComputerSociety
Key Concepts: Small Language Models (SLMs), Large Language Models (LLMs), Parameters, Mixture of Experts, Training Data, Evaluation Data, Fine-tuning, Retrieval Augmented Generation (RAG), Hallucinations, Computational Footprint, Inference Latency, Tokenization, Bias, Data Curation, Deployment (On-Prem vs. SaaS).
Defining Small Language Models (SLMs)
- The definition of "small" is time-bound and relative to current hardware capabilities.
- A practical definition of SLMs focuses on resource accessibility and speed.
- Example (Early 2025): A 10 billion parameter model with a maximum context length of around 10K tokens and an inference latency of around 1 second.
Parameters in Language Models
- Parameter: A learned numeric weight in the network, loosely analogous to a connection between neurons in a biological brain. Parameters combine to synthesize the final response.
- The number of parameters directly impacts memory requirements.
- Example: A 10 billion parameter model using 32-bit representation (4 bytes per parameter) requires approximately 40 GB of memory. Half-precision (16-bit) reduces this to 20 GB.
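The memory arithmetic above can be sketched as a small helper (illustrative only; real deployments also need memory for activations and the KV cache):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold a model's weights, in GB."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

# 10 billion parameters at full (32-bit) vs. half (16-bit) precision
full = model_memory_gb(10e9, 32)   # 40.0 GB
half = model_memory_gb(10e9, 16)   # 20.0 GB
```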
SLMs vs. LLMs: Differences and Attributes
- General Purpose vs. Expert Models: LLMs are trained on broad internet data, making them general-purpose. Expert models are focused on specific domains (e.g., coding, healthcare).
- Computational Footprint: SLMs generally have a smaller computational footprint due to their smaller size (fewer parameters).
- Example: A 10 billion parameter model has a significantly smaller footprint than a 175 billion parameter model (potentially an order of magnitude or more difference in speed).
- Training Data Size: LLMs require vast amounts of training data. SLMs can be effective with less data, especially when fine-tuned from a pre-trained base model.
- Example: Llama 3 was trained on roughly 15 trillion tokens, while specialized models like Meditron (healthcare) or models from Flip AI can be trained on hundreds of billions of tokens.
Training Strategies for SLMs
- Training from Scratch: Starting with a model whose parameters are randomly initialized (not literally zero, which would prevent learning) and training it end-to-end on a dataset.
- Transfer Learning (Fine-tuning): Using a pre-trained LLM as a base and fine-tuning it on a specific task or domain.
- Model Compression Techniques:
- Pruning: Removing less important parameters from a large model.
- Distillation: Training a smaller model to mimic the behavior of a larger model.
- Quantization: Reducing the precision of the model's parameters (e.g., from 32-bit to 8-bit or 4-bit), reducing memory footprint.
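Quantization, the last technique above, can be illustrated with a minimal symmetric 8-bit scheme (a toy sketch in pure Python; production systems quantize tensors per-channel with calibrated scales):

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to ints in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard against all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the quantized ints."""
    return [q * scale for q in quantized]

weights = [0.12, -0.98, 0.55, 0.01]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)  # close to the originals, at a quarter of the storage
```

Each weight now needs 1 byte instead of 4, at the cost of a small, bounded rounding error.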
Addressing Bias in SLMs
- Data Discipline: Ensuring training and evaluation data reflect real-world diversity to mitigate decision-making biases.
- Blind Annotation Process: Using multiple annotators from diverse backgrounds to label data independently.
- Importance of Evaluation Data: Thoroughly curating evaluation data to accurately reflect real-world scenarios and potential biases.
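A blind annotation process is typically resolved by aggregating the independent labels; a minimal majority-vote sketch (function and threshold are my own illustration, not from the episode):

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote label per item from independent annotators.

    annotations: {item_id: [label from annotator 1, annotator 2, ...]}
    Items without a strict majority map to None, flagging them for re-review.
    """
    result = {}
    for item, labels in annotations.items():
        (label, count), = Counter(labels).most_common(1)
        result[item] = label if count > len(labels) / 2 else None
    return result

votes = {"doc1": ["positive", "positive", "negative"],
         "doc2": ["negative", "positive", "neutral"]}
aggregate_labels(votes)  # {"doc1": "positive", "doc2": None}
```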
Hallucinations in SLMs vs. LLMs
- Hallucination: When a language model generates plausible-sounding but incorrect or nonsensical information.
- SLMs tend to have lower hallucination rates than LLMs due to their focused training and smaller knowledge space.
- Example: An LLM might confuse a variable named "Napoleon" with historical context, while an SLM trained primarily on coding data is more likely to interpret it as a variable.
Real-World Examples of SLM Success
- Meditron (Healthcare): An open-source healthcare model that outperformed larger general-purpose models (e.g., Google's 540 billion parameter PaLM) in healthcare-specific tasks.
- Rad GPT/Rad Oncology GPT (Radiology): Achieved significantly higher accuracy (60-70%) than general-purpose LLMs (around 1%) on radiation oncology report tasks, due to specialized training data.
- DeepSeek R1: A mixture-of-experts model with approximately 37 billion active parameters per token that rivals much larger dense models such as OpenAI's GPT-4 and Claude 3 on reasoning benchmarks.
Enterprise Adoption of SLMs
- Last Mile Problem: Enterprises have unique data that LLMs haven't seen, requiring fine-tuning or RAG.
- Key Advice:
- Focus on obtaining clean and representative training data.
- Prioritize creating high-quality evaluation data.
- Optimize for accuracy vs. cost trade-offs.
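The accuracy-vs-cost trade-off can be framed as a simple selection rule; a sketch with made-up model names and numbers (none are from the episode):

```python
def pick_model(candidates, min_accuracy):
    """Pick the cheapest model that clears an accuracy bar.

    candidates: list of (name, eval_accuracy, cost_per_1k_tokens) tuples,
    where accuracy comes from your own curated evaluation set.
    Returns None if nothing is accurate enough.
    """
    viable = [c for c in candidates if c[1] >= min_accuracy]
    return min(viable, key=lambda c: c[2], default=None)

models = [("llm-175b", 0.92, 0.060),        # hypothetical general-purpose LLM
          ("slm-10b-tuned", 0.90, 0.004),   # hypothetical fine-tuned SLM
          ("slm-3b", 0.81, 0.001)]
pick_model(models, 0.88)  # ("slm-10b-tuned", 0.90, 0.004)
```

The point of the sketch: a fine-tuned SLM that is "accurate enough" can win on cost even when a larger model scores slightly higher.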
Challenges Faced by Enterprises
- Data Drift: The underlying data and user behavior change over time, requiring continuous model iteration.
- Lack of Observability: Difficulty in capturing model failures and user feedback for continuous improvement.
- Deployment Complexity: Determining the right resource footprint (GPUs), expertise in deployment frameworks (TensorFlow, PyTorch), and GPU-specific DevOps skills.
Deployment Options: On-Prem vs. SaaS
- SaaS (API-Based Services): Simple to use but lack control over SLAs, availability, and data governance.
- On-Prem (VPC): Offers greater control over compliance, data governance, and availability but requires more in-house expertise.
- Factors Driving Choice: Compliance requirements, data governance policies, desired robustness and availability guarantees, and in-house expertise.
Key Skills for Building and Deploying SLMs
- Model deployment frameworks (TensorFlow, PyTorch).
- Knowledge of GPU selection and workload optimization.
- GPU-specific DevOps skills for monitoring and managing GPU infrastructure.
- Understanding of data curation and evaluation techniques.
Retrieval Augmented Generation (RAG)
- RAG: Augmenting a base model with external knowledge retrieved from a knowledge base (e.g., company wiki) to improve performance on specific tasks.
- Alternative to Fine-tuning: RAG can be used instead of or in conjunction with fine-tuning.
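The RAG flow above, retrieve from a knowledge base, then prepend the hits to the prompt, can be sketched end to end. This toy version scores documents by keyword overlap; real systems use embedding similarity and a vector store:

```python
def retrieve(query, knowledge_base, k=2):
    """Naive keyword-overlap retrieval over a list of documents."""
    q_words = set(query.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, knowledge_base):
    """Prepend retrieved context so the model answers from company data."""
    context = "\n".join(retrieve(query, knowledge_base))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

wiki = ["Deploys run every Friday at noon.",
        "The staging cluster lives in us-east-1.",
        "Expense reports are due monthly."]
print(build_prompt("When do deploys run?", wiki))
```

The base model never needs retraining; the enterprise-specific knowledge arrives at inference time inside the prompt.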
Conclusion
SLMs offer a practical and efficient alternative to LLMs, particularly for specialized tasks and enterprise applications. Key considerations include data curation, bias mitigation, deployment strategy (on-prem vs. SaaS), and the necessary skill sets for building and maintaining these models. The trend towards mixture of experts models suggests that collections of smaller, specialized models may outperform monolithic LLMs in the future.