How to change model behavior! Context engineering, fine-tuning and more

By John Savill's Technical Training

Key Concepts

Hidden Layers/Dimensions: The internal neural network layers that transform the input step by step; "hidden size" is the width of each layer, that is, the length of the vector a layer passes along for every token.
Parameters: The weights and biases (floating-point numbers) that define the model's learned behavior.
Embeddings: High-dimensional vectors representing the semantic meaning of tokens, allowing models to understand relationships between concepts across different languages.
Inference: The process of generating output tokens based on learned statistical patterns.
Context Engineering: The practice of manipulating the input (prompts, examples, RAG) to influence model behavior.
Fine-tuning: The process of updating a model's internal weights using labeled datasets to change its style, format, or domain expertise.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that freezes the original weights and trains a small set of added low-rank matrices, significantly reducing compute and memory requirements.

1. Understanding Model Internals

Generative models function by processing input tokens through multiple hidden layers. Each layer consists of:

Weights: Floating-point numbers representing the strength of connections between neurons.
Biases: Values added to a neuron's weighted input, shifting the point at which it activates.
Matrices: Large structures (e.g., 4,096 x 4,096) that hold these weights, such as the Query, Key, and Value projections that the attention mechanism uses to relate tokens to one another.
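
To give a sense of scale, the arithmetic below counts the values held in matrices of the size quoted above. It is a rough sketch, not a description of any particular model: it assumes one square projection each for Query, Key, and Value, and one bias value per output neuron, which not every architecture uses.

```python
hidden_size = 4096  # matrix dimension quoted in the text above

# One square projection matrix holds hidden_size x hidden_size weights.
weights_per_matrix = hidden_size * hidden_size

# Assumption: one bias value per output neuron of each projection.
biases_per_matrix = hidden_size

# Query, Key, and Value each get their own projection.
qkv_weights = 3 * weights_per_matrix
qkv_biases = 3 * biases_per_matrix

print(f"weights in one matrix: {weights_per_matrix:,}")
print(f"Q + K + V weights:     {qkv_weights:,}")
print(f"Q + K + V biases:      {qkv_biases:,}")
```

A single 4,096 x 4,096 matrix already holds about 16.8 million floating-point numbers, and a model stacks many such matrices across its layers.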

The model does not "read" words; it maps tokens to high-dimensional embedding vectors that capture semantic meaning. This helps explain why models translate between languages so readily: the English word "dog" and the French word "chien" sit close together in the embedding space.
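
Closeness in the embedding space is commonly measured with cosine similarity. The sketch below uses invented 4-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and come from a trained model, and the `toy_embeddings` values here are assumptions, not real data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
toy_embeddings = {
    "dog":   [0.90, 0.80, 0.10, 0.00],
    "chien": [0.85, 0.82, 0.12, 0.05],
    "car":   [0.10, 0.00, 0.90, 0.80],
}

dog_chien = cosine_similarity(toy_embeddings["dog"], toy_embeddings["chien"])
dog_car = cosine_similarity(toy_embeddings["dog"], toy_embeddings["car"])

print(f"dog vs chien: {dog_chien:.3f}")
print(f"dog vs car:   {dog_car:.3f}")
```

With these made-up vectors, "dog" and "chien" score far higher than "dog" and "car", which is the pattern the paragraph above describes.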

2. Context Engineering: Modifying Behavior at Inference

These methods involve altering the input prompt to guide the model without changing its underlying weights.

Zero/One/Few-Shot Prompting: Providing the model with zero, one, or multiple examples of desired input-output pairs to establish a pattern.
- Pros: Easy to implement.
- Cons: Does not scale for complex logic; increases token costs for every interaction.
System Prompts: Persistent instructions (e.g., "Act as a senior enterprise architect") that define tone, style, and guardrails.
- Note: While effective for persona, their influence can fade over a long conversation, and they can conflict with instructions in user prompts.
Retrieval-Augmented Generation (RAG): Injecting external, high-quality, or private data into the prompt.
- Application: Essential for providing up-to-date information or private enterprise data that the model was not trained on. It helps reduce hallucinations by providing the model with the specific facts it needs.
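
The three techniques above all come down to assembling text before it reaches the model. The following is a minimal sketch of that assembly, under stated assumptions: `retrieve` and `build_prompt` are hypothetical helpers, the documents and examples are invented, and the word-overlap retriever is a stand-in for the embedding-based search a real RAG system would more likely use.

```python
def retrieve(query, documents, top_k=1):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())

    def overlap(document):
        return len(query_words & set(document.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:top_k]


def build_prompt(system_prompt, examples, context_chunks, question):
    """Combine system prompt, retrieved context, and few-shot examples into one prompt."""
    parts = [f"System: {system_prompt}"]
    if context_chunks:
        parts.append("Context:")
        parts.extend(f"- {chunk}" for chunk in context_chunks)
    for example_input, example_output in examples:  # few-shot pairs
        parts.append(f"User: {example_input}")
        parts.append(f"Assistant: {example_output}")
    parts.append(f"User: {question}")
    parts.append("Assistant:")
    return "\n".join(parts)


# Invented data for illustration only.
documents = [
    "Support hours are 9am to 5pm on weekdays.",
    "The refund window for online orders is 30 days.",
]
examples = [("Can I pay by invoice?", "Yes. Invoices are available on request.")]
question = "What is the refund window for online orders?"

prompt = build_prompt(
    "Act as a concise support agent.",
    examples,
    retrieve(question, documents),
    question,
)
print(prompt)
```

Every piece added here is sent again on each request, which is the token cost noted under few-shot prompting.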

3. Fine-Tuning: Modifying Model Weights

Fine-tuning involves training the model on a specific dataset of labeled examples to permanently alter its behavior.

Methodology: The model is fed pairs of inputs and desired outputs. The training process "nudges" the weights and biases to align with the new patterns.
Requirements: High-quality data is critical; even 10% poor-quality data can degrade performance. A minimum of ~500 examples is typically required for basic tasks.
Risks: Overfitting, where the model memorizes specific examples rather than learning the underlying pattern, leading to poor performance on new, unseen data.
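
The "nudging" of weights and the overfitting check can be illustrated on a deliberately tiny model. This is a sketch under loud assumptions: a single weight and bias trained by plain gradient descent stand in for a model with billions of parameters, and the data points are invented. The held-out set plays the role of "new, unseen data".

```python
def mse(w, b, data):
    """Mean squared error of the model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, b, train, lr=0.05, epochs=500):
    """Nudge w and b a little on every pass to reduce error on the training pairs."""
    n = len(train)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Invented labeled examples: inputs paired with desired outputs (y = 2x + 1).
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
held_out = [(4.0, 9.0), (5.0, 11.0)]  # never used for training

w0, b0 = 1.0, 0.0  # stand-in for the "pretrained" starting weights
w1, b1 = fine_tune(w0, b0, train)

print(f"before: w={w0:.2f}, b={b0:.2f}, held-out error={mse(w0, b0, held_out):.4f}")
print(f"after:  w={w1:.2f}, b={b1:.2f}, held-out error={mse(w1, b1, held_out):.4f}")
```

If the training error kept falling while the held-out error rose, that gap would be the signature of overfitting described above.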

LoRA (Low-Rank Adaptation)

LoRA optimizes fine-tuning by expressing the weight update as the product of two much smaller matrices, one "tall and skinny" and one "short and wide", while the original weights stay frozen.

Process: Instead of updating a full 4,096 x 4,096 matrix (about 16.8 million values), LoRA trains a low-rank update (e.g., rank 8): a 4,096 x 8 matrix and an 8 x 4,096 matrix, about 65,000 values in total.
Benefit: Drastically reduces GPU/memory requirements and allows for "composable" models where the LoRA adapter is added to the base model at inference time.
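
The savings and the "add the adapter to the base model" step can be sketched in a few lines. This is an illustrative sketch only: the matrix names `W`, `B`, and `A` and the toy 4 x 4 values are assumptions, and the scaling factor that LoRA implementations usually apply to the update is omitted for brevity.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_param_counts(d, r):
    """Trainable values: full d x d update versus a d x r and an r x d pair."""
    return d * d, d * r + r * d


full, lora = lora_param_counts(4096, 8)
print(f"full update: {full:,} values")
print(f"LoRA update: {lora:,} values ({full // lora}x fewer)")

# Toy example with d = 4 and rank r = 1.
W = [[1.0, 0.0, 0.0, 0.0],   # frozen base weights (identity, for readability)
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0], [0.0]]  # d x r: tall and skinny
A = [[0.5, 0.0, 0.0, 1.0]]        # r x d: short and wide

delta = matmul(B, A)  # full-size d x d update built from only 2 * d * r values
W_adapted = [[W[i][j] + delta[i][j] for j in range(4)] for i in range(4)]

for row in W_adapted:
    print(row)
```

Because `W` itself is never modified, the same base model can be paired with different `B` and `A` adapters, which is what makes the approach composable.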

4. Synthesis: The Relationship Between RAG and Fine-Tuning

The video emphasizes that these techniques are not mutually exclusive; they are complementary:

Fine-tuning is best for teaching the model "how to speak" (style, domain-specific language, professional tone).
RAG is best for providing "what to say" (factual, time-sensitive, or private data).

Example: In a legal context, fine-tuning creates a model that understands legal terminology and formatting, while RAG provides the specific, current case files required to answer a client's question.

Conclusion

The decision to use these methods depends on the balance of cost, complexity, and the need for consistency. While fine-tuning (specifically via LoRA) reduces token costs by shortening the required system prompts, it does not eliminate the need for context engineering. A robust system often utilizes a fine-tuned model for behavioral consistency, supplemented by RAG for factual accuracy.