Stanford CS230 | Autumn 2025 | Lecture 4: Adversarial Robustness and Generative Models
Key Concepts
- Adversarial Robustness: The study and development of AI models that are resistant to malicious attacks designed to fool them.
- Adversarial Attacks: Techniques used to manipulate AI models by introducing subtle changes to inputs, leading to incorrect outputs.
- Adversarial Examples: Inputs that have been slightly modified to cause an AI model to misclassify them.
- Data Poisoning/Backdoor Attacks: Injecting malicious data into the training set to create hidden vulnerabilities in the model.
- Prompt Injection: Manipulating Large Language Models (LLMs) by crafting specific prompts that override original instructions.
- Generative Modeling: AI models that learn the underlying distribution of data to create new, similar data.
- Generative Adversarial Networks (GANs): A framework involving two competing neural networks (generator and discriminator) to produce realistic data.
- Diffusion Models: A class of generative models that work by progressively adding noise to data and then learning to reverse this process.
- Latent Diffusion: Performing diffusion processes in a lower-dimensional latent space for computational efficiency.
- Mode Collapse: A failure mode in GANs where the generator produces a limited variety of outputs, failing to capture the full data distribution.
- Forward Diffusion Process: The process of gradually adding noise to an image over multiple time steps.
- Reverse Diffusion Process (Denoising): The process of learning to remove noise from an image to reconstruct the original data.
- Self-Supervised Learning: Training models using data where labels are generated automatically from the data itself, without human annotation.
Adversarial Robustness
Introduction to Adversarial Attacks
The lecture begins by highlighting the increasing prevalence of AI models in daily use, making them targets for attacks. This necessitates proactive defense mechanisms, driving research in adversarial attacks and defenses.
Waves of Adversarial Attacks
Over the past decade, adversarial attacks have evolved through three main waves:
- 2013 - Imperceptible Perturbations: Christian Szegedy's work demonstrated that small, often imperceptible changes to an image's pixels could fool computer vision models. These are known as adversarial examples or adversarial attacks, akin to optical illusions for neural networks.
- A Few Years Later - Backdoor Attacks/Data Poisoning: As model training became more common and web scraping prevalent, backdoor attacks emerged. Attackers embed specific triggers in online data, knowing that large foundation models will scrape and incorporate this data into their training sets, creating a hidden entry point for future attacks.
- More Recently - Prompt Injections: With the widespread use of LLMs, prompt injection and jailbreaking attacks have become prominent. These aim to override the model's intended behavior through carefully crafted prompts.
Examples of Attacks and High-Risk Use Cases
- Prompt Injection: Users can insert instructions into prompts that bypass original directives, potentially leading to information theft (passwords, PII) or dangerous outputs.
- Data Poisoning (e.g., the "Nightshade" attack): Maliciously altering training data (e.g., making an image of a cat appear to have dog-like features) to confuse the model during learning.
- High-Risk Scenarios:
- LLM Training Data Reversal: If LLMs trained on public data inadvertently memorize sensitive information such as banking numbers or social security numbers, attackers could recover that memorized data through targeted queries, posing risks to companies and users.
- Autonomous Driving: Modifying an autonomous vehicle's algorithm to misinterpret stop signs could lead to crashes and harm.
Forging Adversarial Examples in Image Space
The lecture illustrates how to create adversarial examples:
- Objective: Given a pre-trained image classification model (e.g., on ImageNet), find an input image that is classified as a specific target class (e.g., "iguana").
- Methodology (Unconstrained Attack):
- Define a Loss Function: The goal is to minimize the difference between the model's prediction ($\hat{y}$) and the target label ($y_{\text{iguana}}$). A suitable loss function is the Mean Squared Error (MSE) or L2 distance between $\hat{y}$ and $y_{\text{iguana}}$.
- Gradient Descent on Input Pixels: Instead of updating model parameters, gradients of the loss are computed with respect to the input image pixels ($X$).
- Iterative Perturbation: Using gradient descent, the pixels of an initial image (or random noise) are iteratively adjusted to minimize the loss.
- Outcome: The resulting forged image ($X$) is classified as an "iguana" with high confidence, yet typically does not resemble an iguana to human eyes. This is because the vast input space of possible pixel combinations is much larger than the space of natural images. The model can exploit patterns that are statistically associated with "iguana" in its learned distribution but are not semantically meaningful to humans.
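To make the procedure concrete, below is a minimal sketch of this unconstrained attack in PyTorch. It assumes a pre-trained torchvision classifier; the target class index, step count, and learning rate are illustrative choices rather than the lecture's exact values, and input normalization is omitted for brevity.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

# Load a pre-trained ImageNet classifier and freeze its parameters;
# only the input pixels will be optimized.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 39                 # illustrative index standing in for "iguana"
x = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([x], lr=0.01)

for step in range(200):
    optimizer.zero_grad()
    y_hat = torch.softmax(model(x), dim=1)
    y_target = torch.zeros_like(y_hat)
    y_target[0, target_class] = 1.0
    # L2 / MSE loss between the prediction and the one-hot target label.
    loss = torch.nn.functional.mse_loss(y_hat, y_target)
    loss.backward()               # gradients w.r.t. the input pixels, not the weights
    optimizer.step()
    x.data.clamp_(0.0, 1.0)       # keep pixels in a valid range
```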
Creating Realistic Adversarial Examples
To create attacks that are both effective and deceptive to humans:
- Objective: Find an input image that looks like a specific object (e.g., a cat) but is misclassified as another (e.g., an iguana).
- Methodology (Constrained Attack):
- Modified Loss Function: The loss function now includes two components:
- The original term to minimize the difference between the prediction and the target label ($y_{\text{iguana}}$).
- A regularization term that penalizes deviations from a target "real" image ($X_{\text{cat}}$). This term ensures the forged image remains visually similar to the original.
- Initialization: Start the optimization process with an image of the target object (e.g., a cat) rather than random noise.
- Outcome: The forged image will look like a cat to humans but be classified as an iguana by the model. This is significantly more dangerous as it can bypass human inspection.
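A hedged sketch of how the objective changes for the constrained attack, reusing the setup above: `x_cat` stands in for a real cat image and `lam` (the weight of the similarity penalty) is an illustrative hyperparameter.

```python
import torch

# x_cat: a real cat image of shape (1, 3, 224, 224) with values in [0, 1];
# here a random tensor stands in for it.
x_cat = torch.rand(1, 3, 224, 224)
x = x_cat.clone().requires_grad_(True)   # initialize from the cat image, not noise
lam = 0.05                               # illustrative weight of the similarity penalty

def constrained_loss(y_hat, y_target, x, x_cat, lam):
    # Term 1: push the model's prediction toward the "iguana" target.
    attack_term = torch.nn.functional.mse_loss(y_hat, y_target)
    # Term 2: keep the forged image visually close to the original cat image.
    similarity_term = torch.mean((x - x_cat) ** 2)
    return attack_term + lam * similarity_term
```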
Adversarial Patches
- Concept: A physical patch (e.g., a sticker) can be designed to fool object detectors.
- Example: Researchers created a patch that, when worn, caused a YOLOv2 object detection model to fail to detect people.
- Technical Details: The patch's pixels were optimized using a loss function that included:
- Ensuring the patch's colors were printable.
- Smoothing the colors to make the patch easier to print and less conspicuous.
- Transferability: Patches optimized for one model family (e.g., YOLOv2) can often work on other similar models, even without direct access to their parameters (black-box attack).
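As an illustration of the smoothing component, one common way to implement it is a total-variation-style penalty over the patch pixels, which discourages abrupt colour changes so the patch is easier to print; the detection-suppression and printability terms used in the actual patch work are omitted in this sketch.

```python
import torch

def total_variation(patch: torch.Tensor) -> torch.Tensor:
    """Smoothness penalty over a (3, H, W) patch: mean absolute difference
    between horizontally and vertically adjacent pixels."""
    tv_h = torch.abs(patch[:, 1:, :] - patch[:, :-1, :]).mean()
    tv_w = torch.abs(patch[:, :, 1:] - patch[:, :, :-1]).mean()
    return tv_h + tv_w

patch = torch.rand(3, 64, 64, requires_grad=True)   # illustrative patch size
smoothness = total_variation(patch)   # added to the attack objective with a small weight
```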
Why Neural Networks are Sensitive to Adversarial Attacks
- High Dimensionality: The input space for images is extremely high-dimensional. In such spaces, even small perturbations can compound and lead to significant changes in the output.
- Linearity in Practice: Despite non-linear activations, neural networks often behave linearly from input to logit, making them susceptible to additive perturbations that amplify through the linear transformations.
- Optimization on Probability/Likelihood: Models optimize for probabilities, lacking true semantic understanding. Small input changes can drastically shift these probabilities.
- Fast Gradient Sign Method (FGSM): A one-shot attack that adds a small perturbation ($\epsilon$) in the direction of the sign of the gradient of the cost function with respect to the input. This efficiently pushes pixels in a direction that maximizes the cost, leading to misclassification while keeping the image visually similar.
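In symbols, FGSM computes $x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$. A minimal sketch follows, assuming a differentiable classifier `model` and a correctly labelled batch `(x, y)`; the perturbation budget `eps` is an illustrative value.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """One-shot FGSM: perturb x by eps in the direction of the sign of the
    gradient of the loss with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```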
Defenses Against Adversarial Attacks
- Data Augmentation: Including adversarial examples in the training data.
- Input Sanitization: Pre-processing inputs to detect and remove suspicious patterns (e.g., unusual pixel values).
- Adversarial Training: Training the model on adversarial examples generated during training, using the original labels. This is a popular and effective defense (a minimal training-loop sketch follows this list).
- Red Teaming: Dedicated teams actively trying to break the model to identify vulnerabilities.
- Reinforcement Learning with Human Feedback (RLHF): Aligning model behavior with human preferences, which can implicitly include robustness.
- Constitutional AI: Using AI to enforce ethical guidelines and safety constraints.
- Output Filtering: Post-processing model outputs to detect anomalies.
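A minimal adversarial-training sketch, referenced in the Adversarial Training bullet above: each batch is augmented with FGSM examples generated on the fly and trained against the original labels. It reuses the `fgsm_attack` helper sketched earlier; the model, optimizer, and `eps` are placeholders.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    # Generate adversarial examples against the current model (FGSM, as above).
    x_adv = fgsm_attack(model, x, y, eps)
    optimizer.zero_grad()
    # Train on both the clean and the adversarial batch, keeping the original labels.
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```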
Backdoor Attacks
- Mechanism: Attackers embed a "trigger" (e.g., a specific visual pattern) into a subset of the training data and mislabel it. When the model encounters this trigger in production, it produces a specific, attacker-desired output, regardless of the actual input.
- Example: A patch on a cat image causes a model to classify it as a dog.
- Defense Challenges: Backdoor attacks are difficult to defend against, often requiring extensive red teaming and RLHF to identify and mitigate.
Prompt Injection Attacks
- Mechanism: Malicious prompts designed to override the LLM's original instructions.
- Direct Attacks: Explicit instructions like "ignore previous instructions."
- Indirect Attacks: Hidden instructions embedded in external data sources (e.g., websites) that an LLM might access via retrieval-augmented generation (RAG).
- Example: A user tricks an LLM into providing instructions for illegal activities by framing it as a role-playing scenario.
- Mitigation: While not foolproof, techniques like input sanitization and improved LLM architectures are making these attacks harder.
Generative Modeling
Use Cases for Generative Models
- Image Generation: Creating novel images (e.g., text-to-image).
- Video Generation: Producing realistic video sequences (e.g., Sora, Veo).
- Text Generation: Creating human-like text.
- Code Generation: Producing programming code.
- Privacy-Preserving Datasets: Generating synthetic data for healthcare or other sensitive domains where real data cannot be shared.
- Super-Resolution: Enhancing the resolution of low-resolution images.
- Image Inpainting: Filling in missing or removed parts of an image realistically.
- Audio Generation: Creating speech or music.
- Captioning: Generating textual descriptions for images or videos.
Discriminative vs. Generative Models
- Discriminative Models: Learn to classify or predict based on input features (e.g., classifying an image as a cat or dog). They learn the conditional probability $P(Y|X)$.
- Generative Models: Learn the underlying distribution of the data ($P(X)$) to create new samples. They are powerful for simulation, creativity, and human-AI collaboration.
Generative Adversarial Networks (GANs)
- Core Idea: A two-player game between a Generator (G) and a Discriminator (D).
- Generator (G): Takes random noise ($Z$) as input and tries to produce realistic data (e.g., images).
- Discriminator (D): Takes real data and generated data as input and tries to distinguish between them (binary classification).
- Training Process:
- G generates fake data.
- D is trained to correctly classify real data as "real" (1) and fake data as "fake" (0).
- G is trained to fool D, i.e., to produce data that D classifies as "real."
- Gradients flow from D back to G, guiding G to improve its generation.
- Loss Functions:
- Discriminator Loss: Typically binary cross-entropy, aiming to maximize $\log D(x) + \log(1 - D(G(z)))$.
- Generator Loss: Aims to minimize $\log(1 - D(G(z)))$ (minimax game), or more practically, maximize $\log D(G(z))$ (non-saturating loss) to avoid vanishing gradients early in training (a minimal training-loop sketch follows this list).
- Challenges:
- Training Instability: GANs are notoriously difficult to train due to the delicate balance between G and D.
- Mode Collapse: G might find a few ways to fool D without capturing the full data distribution, leading to limited variety in generated samples.
- Cold Start Problem: Early in training, G produces noisy outputs, and D can easily distinguish them, leading to very small gradients for G (saturating loss). The non-saturating loss helps mitigate this.
- Properties: GANs learn a latent space with approximately linear structure, allowing for meaningful interpolations and manipulations (e.g., adding sunglasses to a face by manipulating latent codes).
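The training loop sketched below illustrates the two-player game and the non-saturating generator loss on toy MLPs; the architectures, batch size, and learning rates are illustrative, and a random tensor stands in for the real-data batch.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32   # illustrative sizes

G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x_real = torch.randn(batch, data_dim)          # placeholder for a batch of real data
    z = torch.randn(batch, latent_dim)

    # Discriminator update: real -> 1, fake -> 0.
    opt_d.zero_grad()
    d_loss = bce(D(x_real), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator update (non-saturating loss): label fakes as "real" to fool D,
    # i.e. maximize log D(G(z)) instead of minimizing log(1 - D(G(z))).
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```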
Diffusion Models
- Motivation: To overcome GANs' mode collapse and training instability by using a single model and a more stable training objective.
- Core Idea:
- Forward Diffusion Process: Gradually add Gaussian noise to an image ($X_0$) over $T$ time steps, resulting in a highly noisy image ($X_T$) that resembles pure noise. The noise added at each step ($\epsilon_t$) is sampled from a Gaussian distribution.
- Reverse Diffusion Process (Denoising): Train a neural network (the diffusion model) to predict the noise ($\hat{\epsilon}$) that was added at each step. By subtracting the predicted noise from the noisy image, the model can progressively denoise it.
- Training:
- Self-Supervised: The forward diffusion process generates training data (noisy image, time step index, and the actual noise added).
- Loss Function: Typically an L2 loss (reconstruction loss) between the true noise ($\epsilon$) and the predicted noise ($\hat{\epsilon}$).
- Sampling (Inference):
- Start with random Gaussian noise ($X_T$).
- Iteratively use the trained diffusion model to predict and subtract noise, progressively denoising the image over $T$ steps until a clean image ($X_0$) is generated (a training-and-sampling sketch follows this list).
- Advantages:
- High Variety: Diffusion models generally produce more diverse outputs than GANs, avoiding mode collapse.
- Stable Training: Single model training is more stable than the adversarial game in GANs.
- Better Gradients: The denoising task provides more stable gradients.
- Computational Cost: Vanilla diffusion models are computationally expensive during sampling due to the iterative denoising process.
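A hedged DDPM-style sketch of the training objective and sampling loop described above, with a toy MLP denoiser and an illustrative linear noise schedule; real systems use U-Net or transformer denoisers and carefully tuned schedules.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Toy denoiser: predicts the noise that was added to x_t, given the time step."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))
    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)    # simple scalar time embedding
        return self.net(torch.cat([x_t, t_feat], dim=-1))

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(x0):
    # Forward diffusion in closed form: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    loss = ((model(x_t, t) - eps) ** 2).mean()    # L2 between true and predicted noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def sample(n=8, dim=64):
    x = torch.randn(n, dim)                       # start from pure Gaussian noise x_T
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((n,), t))
        # DDPM mean update: remove the predicted noise component at step t.
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```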
Latent Diffusion Models (LDMs)
- Concept: To reduce the computational cost of diffusion models, the diffusion process is performed in a lower-dimensional latent space instead of the pixel space.
- Mechanism:
- An autoencoder is used to encode the original image ($X_0$) into a latent representation ($Z_0$).
- The forward and reverse diffusion processes are applied to this latent representation ($Z_t$).
- A decoder reconstructs the final image from the denoised latent representation ($Z_0$); a minimal sketch of this pipeline follows the list.
- Benefits: Significantly reduces computational requirements for training and inference.
- Conditioning: LDMs can be conditioned on other modalities (e.g., text prompts) by encoding them and concatenating them with the latent representations during training and inference, allowing for guided generation.
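A minimal sketch of the latent-diffusion pipeline referenced above: an autoencoder maps images into a lower-dimensional latent space, the reverse diffusion (e.g., the `sample` sketch in the previous section) runs on latents, and a decoder maps the result back to pixels. All modules and shapes are illustrative; production LDMs use a pretrained VAE and a U-Net denoiser with cross-attention for text conditioning.

```python
import torch
import torch.nn as nn

img_dim, latent_dim = 3 * 64 * 64, 64             # illustrative sizes

encoder = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, img_dim))

def encode_for_diffusion(x0_images):
    """Training side: map pixel-space images (B, 3, 64, 64) into latents z_0,
    on which the forward/reverse diffusion processes then operate."""
    return encoder(x0_images.flatten(start_dim=1))

@torch.no_grad()
def generate_image(sample_latents):
    """Inference side: run reverse diffusion in latent space, then decode.
    `sample_latents` is any function returning denoised latents z_0 of shape (1, latent_dim)."""
    z0 = sample_latents(n=1, dim=latent_dim)       # cheap: diffusion happens in latent space
    return decoder(z0).view(1, 3, 64, 64)          # decode back to pixel space
```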
Video Generation with Diffusion Models
- Challenges: Video generation requires maintaining temporal consistency across frames, which is not inherently handled by image-based diffusion models.
- Approach:
- The diffusion process is extended to handle temporal dimensions. Instead of processing individual frames, sequences of frames (or "cubes" of spatio-temporal data) are processed.
- The latent representation becomes spatio-temporal, capturing relationships across frames.
- Conditioning on text prompts or other modalities is crucial for guiding video generation.
- Examples: Models like Sora and Veo leverage these principles to generate realistic videos.
Conclusion
The lecture concludes by emphasizing the rapid advancements in generative AI, particularly with diffusion models, and the impressive computational capabilities now available for generating complex media like videos within minutes. The field continues to evolve with ongoing research into model architectures, training techniques, and defense mechanisms against adversarial attacks.