Stanford CS230 | Autumn 2025 | Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning


Key Concepts

  • Model Architecture: The blueprint or skeleton of an AI model.
  • Parameters: The trainable weights and biases within an AI model, ranging from a few to billions.
  • Feature Engineering: Manually designing algorithms to detect specific features (e.g., an eye detector).
  • Feature Learning: Automatic learning of features by a neural network from data, an end-to-end process.
  • Encoding: Any vector representation derived from an input by a neural network.
  • Embedding: An encoding where the distances between vectors have semantic meaning or logic.
  • One-hot Vector: A vector with a single '1' and the rest '0's, used for categorical labels where only one category applies.
  • Multi-hot Vector: A vector with multiple '1's, used when an input can belong to multiple categories simultaneously.
  • Network Capacity: The ability of a neural network to learn complex patterns; deeper networks generally have more capacity.
  • Triplet Loss: A loss function used in face verification that minimizes the distance between an anchor and a positive example while maximizing the distance between the anchor and a negative example.
  • Self-Supervised Learning (SSL): A paradigm where a model learns from the data itself by creating supervisory signals from the data, without requiring manual labels.
  • Contrastive Learning: A self-supervised learning method that teaches a model to distinguish between similar and dissimilar pairs of data points, pushing similar embeddings closer and dissimilar ones further apart.
  • Next Token Prediction: A self-supervised learning task in natural language processing where the model predicts the next word (or token) in a sequence.
  • Emergent Behaviors: Unexpected capabilities that arise from simple training objectives at scale, without being explicitly taught or labeled.
  • Weakly Supervised Learning: Learning from naturally occurring pairings in data (e.g., images with captions) rather than explicitly hand-labeled datasets.
  • Shared Embedding Space: A common vector space where different modalities (e.g., text, image, audio) are represented, allowing for cross-modal understanding and comparison.

Comprehensive Summary of CS230 Lecture: Decision Making in AI Projects

This lecture, co-taught by Kian Katanforoosh, focuses on bringing industry-specific insights and examples to the CS230 curriculum, emphasizing practical decision-making in AI projects. The session covers a recap of deep learning fundamentals, delves into supervised learning projects, and introduces the concepts of self-supervised and weakly supervised learning.

1. Recap of Deep Learning Fundamentals

The lecture begins with a recap of core concepts learned online, including neurons, layers, and deep neural networks.

  • Traditional Supervised Learning: AI learns from data with labeled inputs and outputs. An example is classifying a "confused cat" image (input) to predict the probability (0-1) of a cat being present (output).
  • Model Components: An AI model consists of an architecture (the blueprint or skeleton) and parameters (the trainable weights and biases). When deployed, these are typically stored as two files in the cloud for inference.
  • Learning Process (Gradient Descent Optimization):
    1. An input (e.g., cat image) is fed to an untrained model, yielding an initial, likely incorrect prediction.
    2. A loss function compares this prediction to the ground truth, calculating a penalty.
    3. Gradient descent updates the model's parameters iteratively, using batches of data, to minimize this loss until the model's predictions align with the ground truth.
  • Variations in Neural Network Setup:
    • Input: Can be diverse (images, text, audio, video, structured data, spreadsheets), influencing architecture.
    • Output: Not limited to binary classification (0/1); can be regression (e.g., estimating cat's age) or generative tasks (e.g., high-resolution image from low-resolution input, where output can be larger than input).
    • Architecture: Beyond the basic multi-layer perceptron, students will learn about RNNs, CNNs, and Transformer models, all built upon fundamental neural network principles.
    • Loss Function: A critical component, its design is considered an "art" in deep learning research (e.g., YOLO's complex loss function).
  • Neurons and Multi-Animal Classification:
    • A neuron is analogous to logistic regression: it takes a flattened input vector (e.g., the RGB pixels of an image), applies a linear transformation (W^T x + b), and then an activation function (e.g., sigmoid) to produce an output, often a probability between 0 and 1 (a minimal sketch of this computation and one gradient step follows this recap).
    • For detecting multiple animals (cat, dog, giraffe), the output layer needs to be modified to have one neuron per animal.
    • Labels must be adjusted from binary (0/1) to one-hot vectors (e.g., [0,1,0] for a cat) or multi-hot vectors (e.g., [1,1,0] for a cat and a dog if both are present). A common mistake is to update data but forget to adjust labels.
  • Network Capacity and Layer Intuition:
    • Network capacity refers to a model's ability to learn complex patterns. Deeper networks have more capacity. A shallow network might lack the flexibility to learn from a large dataset, while an overly deep network might overfit, memorizing the training data rather than learning generalizable features.
    • Layer-wise Feature Learning (Facial Images Example): When a network is trained on faces, its layers learn features of increasing complexity:
      • First layers: Detect low-level features like diagonal, vertical, or horizontal edges from raw pixels.
      • Middle layers: Combine these edges to detect higher-level features like eyes, noses, or ears.
      • Deeper layers: Detect larger facial features, closer to the overall task of facial analysis.
    • This process of encoding information into vectors, especially embeddings where distances between vectors are meaningful, is crucial for tasks like database search.
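
To make the recap concrete, here is a minimal NumPy sketch of a single neuron (a linear transformation followed by a sigmoid), one hand-derived gradient descent step on the binary cross-entropy loss, and the two label formats. The input size, learning rate, and random data are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, W, b):
    # A single neuron: linear transformation W^T x + b, then a sigmoid activation.
    return sigmoid(W.T @ x + b)

def bce_loss(y_hat, y):
    # Binary cross-entropy: the penalty comparing prediction to ground truth.
    eps = 1e-8
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

rng = np.random.default_rng(0)
x = rng.random(64 * 64 * 3)               # stand-in for a flattened RGB image
y = 1.0                                   # ground truth: cat present
W = rng.normal(scale=0.01, size=x.shape)  # trainable parameters
b = 0.0
lr = 0.1                                  # illustrative learning rate

y_hat = neuron_forward(x, W, b)
loss = bce_loss(y_hat, y)

# One gradient descent step: for sigmoid + BCE, dLoss/dz = (y_hat - y).
W = W - lr * (y_hat - y) * x
b = b - lr * (y_hat - y)

# Label formats when classifying several animals (e.g., cat, dog, giraffe):
one_hot = np.array([0, 1, 0])    # exactly one class present
multi_hot = np.array([1, 1, 0])  # several classes present in the same image
```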

2. Supervised Learning Projects: Case Studies

The lecture then moves to practical decision-making through three supervised learning case studies.

2.1. Case Study 1: Day and Night Classification

  • Problem: Classify an image as "day" or "night."
  • Data Collection:
    • Initial ideas might involve feature engineering (e.g., analyzing pixel differences for color changes).
    • For neural networks, a dataset of many images (e.g., 10,000) is needed. The quantity and diversity depend on the task's scope (e.g., specific location vs. worldwide).
    • Hard Cases: Indoor pictures (lack of natural light cues), varying weather (sunny/cloudy), extreme latitudes (long day/night), and dawn/dusk (requiring precise semantic definitions of "day" and "night") pose significant challenges.
  • Input Resolution:
    • Importance: Low resolution loses critical information (e.g., a clock), while high resolution demands more compute and slows down iteration cycles.
    • Determining Resolution: A human proxy experiment is suggested: show humans images at different resolutions and identify the minimum resolution at which they can reliably classify day/night. For this task, 64x64x3 pixels (including color channels) was found effective, as colors provide inherent information.
  • Output and Architecture: The output is binary (0 or 1), using a sigmoid activation. A shallow network, likely a Convolutional Neural Network (CNN), is suitable for this image-based task (see the sketch at the end of this case study).
  • Loss Function: Logistic loss or binary cross-entropy loss is appropriate.
  • Hardware Influence: The available hardware significantly impacts decisions, particularly regarding resolution and model complexity, as it dictates the speed of iteration cycles.
  • Key Takeaways: Utilize proxy projects and human experiments to make quick, informed decisions in AI projects.
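
Below is a minimal sketch of such a shallow CNN classifier, assuming PyTorch and 64x64x3 inputs; the layer widths, batch size, and label convention are illustrative guesses rather than the lecture's exact architecture.

```python
import torch
import torch.nn as nn

class DayNightNet(nn.Module):
    """A shallow CNN for 64x64x3 day/night classification."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
        )
        self.classifier = nn.Linear(32 * 16 * 16, 1)  # a single output neuron

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))  # raw logit

model = DayNightNet()
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy on a sigmoid output

images = torch.randn(8, 3, 64, 64)            # a batch of 64x64 RGB images
labels = torch.randint(0, 2, (8, 1)).float()  # 1 = day, 0 = night (illustrative)
loss = loss_fn(model(images), labels)
loss.backward()                               # gradients for one training step
```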

2.2. Case Study 2: Trigger Word Detection

  • Context: Virtual assistants (e.g., Alexa) use a cascade of models for efficiency: a lightweight activity detection model, followed by a trigger word detection model, and then a heavier model for understanding commands. This case focuses on the trigger word model.
  • Problem: Detect the word "activate" in a 10-second audio clip.
  • Data Collection:
    • Audio is pre-processed (e.g., using Fourier transform) to extract frequencies and consider sequence length.
    • Data needs to include people saying the positive word ("activate"), negative words (e.g., "deactivate," "kitchen") to help the model differentiate, and general sentences.
    • Distribution Matters: Data diversity is crucial to handle variations in accents (e.g., German speakers struggling with early models), age (different vocal frequencies), cadence (speaking speed), gender, and background noise (e.g., metro sounds).
  • Input Resolution: A speech expert, or the hyperparameters used by similar open-source projects on GitHub, can help determine the optimal sample rate for human voice.
  • Output and Labeling Strategy:
    • Initial thought: 0/1 for the entire 10-second clip. A human experiment (with an Italian word "pomeriggio") demonstrated that labeling specific time segments where the word occurs (Scheme 2) is significantly easier for humans to discern and leads to much faster model learning, requiring less data than a simple 0/1 label for the whole clip (Scheme 1).
    • Addressing Skewed Labels: To prevent the model from always predicting "0" (no trigger word) due to data imbalance, a more balanced labeling scheme is used: the exact time steps where the positive word is spoken are labeled '1', and all others '0'. This uses a sigmoid activation at every output time step.
    • Synthetic Data Generation (sketched in code at the end of this case study):
      1. Collect three databases: positive words, negative words, and background noise (often freely available).
      2. A Python script randomly inserts positive and negative words into background noise, ensuring non-overlapping words.
      3. The script automatically labels the audio based on where it inserted the words, generating millions of data points rapidly.
      4. Data augmentation (e.g., frequency reduction/augmentation, acceleration/deceleration) further enhances the dataset.
      5. Test sets are manually labeled with real-world data for accurate evaluation.
  • Architecture and Loss Function: A Recurrent Neural Network (RNN) is likely suitable for this sequential problem. Binary cross-entropy loss is applied sequentially at every time step.
  • Key Takeaways: Data collection and labeling strategies are paramount. Human experiments provide quick insights. Expert advice can save significant development time. While architecture search is less common with foundation models today, understanding underlying principles is vital for custom constraints or fine-tuning.
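
A rough sketch of the synthetic data generation idea, assuming NumPy waveforms at a single sample rate: word clips are overlaid on background noise without overlapping each other, and per-time-step labels are produced automatically. The sample rate, clip lengths, and the size of the label grid are illustrative assumptions, not the lecture's exact numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_clip(background, positives, negatives, n_label_steps=1375):
    """Overlay word clips onto a background recording and auto-label the
    time steps covered by each positive word with 1s."""
    clip = background.copy()
    labels = np.zeros(n_label_steps)
    occupied = []  # (start, end) sample ranges already used, to avoid overlap

    def try_insert(word, is_positive):
        for _ in range(10):  # a few random placement attempts
            start = int(rng.integers(0, len(clip) - len(word)))
            end = start + len(word)
            if all(end <= s or start >= e for s, e in occupied):
                clip[start:end] += word
                occupied.append((start, end))
                if is_positive:
                    lo = int(start / len(clip) * n_label_steps)
                    hi = int(end / len(clip) * n_label_steps)
                    labels[lo:hi] = 1  # automatic labeling: we know where we put the word
                return

    for word in positives:
        try_insert(word, is_positive=True)
    for word in negatives:
        try_insert(word, is_positive=False)
    return clip, labels

# Toy waveforms standing in for real recordings (16 kHz, 10-second background).
background = rng.normal(scale=0.01, size=16000 * 10)
positives = [rng.normal(size=16000)]  # someone saying "activate"
negatives = [rng.normal(size=16000)]  # e.g., "kitchen" or "deactivate"
clip, labels = synthesize_clip(background, positives, negatives)
```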

2.3. Case Study 3: Face Verification

  • Problem: A school wants to use face verification to validate student IDs (e.g., at a gym) by comparing a student's database picture to a live camera picture.
  • Data: Student ID pictures are available.
  • Resolution: Higher resolution (e.g., 412x412x3) is needed compared to day/night classification to capture fine details like eye color. Human experiments with twins can help determine optimal resolution.
  • Output: Binary (0 or 1) – "same person" or "not same person."
  • Limitations of Basic Comparisons:
    • Pixel-wise comparison is ineffective due to variations in lighting, background, and geometric transformations (translation, rotation, scale invariance).
    • Feature engineering (manually defining features like eyes, nose) is labor-intensive and struggles with variations like glasses, hats, hairstyles, or age-related changes.
  • Solution: Encoding Network:
    1. Both the student ID picture and the live camera picture are fed through the same deep neural network.
    2. The network outputs a vector (encoding) for each image, representing facial features. These vectors are typically taken from deeper layers of the network.
    3. The distance between these two vectors is calculated.
    4. A threshold (e.g., 0.5) is set to determine if the distance is small enough to confirm the same person. This threshold balances true positives, false positives, and false negatives.
  • Training the Network (Triplet Loss):
    • Goal: The network should produce similar vectors for pictures of the same person and distinct vectors for different people.
    • Data: A dataset of triplets is created:
      • Anchor (A): A picture of a person.
      • Positive (P): Another picture of the same person.
      • Negative (N): A picture of a different person.
    • Loss Function: The Triplet Loss function is designed to minimize the distance between the anchor and the positive (distance(A, P)) and maximize the distance between the anchor and the negative (distance(A, N)). The formula max(0, distance(A, P) - distance(A, N) + alpha) includes an alpha margin for stability (a code sketch follows this list).
    • Process: The three pictures in a triplet are processed in parallel through the network, their vectors are compared using the triplet loss, and parameters are updated. This allows the model to learn meaningful encodings without explicit feature engineering.
    • Reference: This approach is based on the FaceNet paper (Schroff et al., 2015).
  • Variations and Applications:
    • Face Identification: To recognize a student without a card swipe (e.g., Global Entry), the database stores vectors of all students. A new camera picture's vector is compared to these stored vectors using a K-Nearest Neighbors (KNN) algorithm to find the closest match(es); a nearest-neighbor sketch also follows this list.
    • Face Clustering: To group pictures of the same person (e.g., in a phone's photo album), all pictures are vectorized. An unsupervised learning algorithm like K-Means clustering is then applied to group similar vectors. New pictures are compared to cluster centroids.
  • Key Takeaways: Understanding encoder networks, the triplet loss with its positive, anchor, and negative components, and the variations for face verification, identification, and clustering are crucial.
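
A minimal NumPy sketch of the triplet loss on encoding vectors; the squared Euclidean distance and the alpha = 0.2 margin follow the FaceNet formulation, while the 128-dimensional random encodings are stand-ins for a real encoder's output.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(0, d(A, P) - d(A, N) + alpha) with squared Euclidean distances."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + alpha)

rng = np.random.default_rng(0)
encode = lambda image: rng.normal(size=128)  # stand-in for the shared encoder network

A = encode("student_id_photo")            # anchor
P = encode("camera_photo_same_person")    # positive
N = encode("camera_photo_other_person")   # negative
print(triplet_loss(A, P, N))
```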
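
Building on such encodings, here is a sketch of verification (thresholded distance) and identification (nearest-neighbor search over stored encodings); the 0.5 threshold mirrors the example value above, and the 1-NN search is a simplified stand-in for KNN.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical database of stored student encodings (random stand-ins).
database = {name: rng.normal(size=128) for name in ["alice", "bob", "carol"]}

def verify(enc_a, enc_b, threshold=0.5):
    """Face verification: same person iff the encoding distance is small enough."""
    return np.linalg.norm(enc_a - enc_b) < threshold

def identify(query_encoding, database):
    """Face identification: return the stored identity with the closest encoding."""
    names = list(database)
    distances = [np.linalg.norm(query_encoding - database[name]) for name in names]
    return names[int(np.argmin(distances))]

camera_encoding = rng.normal(size=128)  # stand-in for the encoder applied to a live photo
print(identify(camera_encoding, database))
print(verify(camera_encoding, database["alice"]))
```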

3. Self-Supervised Learning and Weakly Supervised Learning

This section introduces advanced learning paradigms that address the high cost of manual labeling.

3.1. Self-Supervised Learning (SSL)

  • Core Idea: The network learns from the data itself by generating its own supervisory signals, eliminating the need for manual labels.
  • Image Example (Contrastive Learning - SimCLR; sketched in code at the end of this subsection):
    1. Take an image (e.g., a dog).
    2. Apply data augmentation (e.g., rotate 90 degrees, add noise, crop, translate) to create variations of the same image.
    3. The network is trained to produce similar embeddings for these augmented versions of the same image, effectively learning that they represent the same underlying concept.
    4. This method, called contrastive learning, allows training on billions of unlabeled images, pushing similar embeddings closer and dissimilar ones apart. This is a significant shift from supervised triplet-based methods like FaceNet.
  • Text Example (Next Token Prediction - GPT; a toy illustration appears at the end of this subsection):
    1. The task is to predict the next word (or token) in a sentence.
    2. This is self-supervised because text data can be scraped online, and the "label" (the next word) is inherently present in the sequence itself.
    3. Emergent Behaviors: This simple task, when scaled, leads to unexpected capabilities:
      • "I poured myself a cup of ___": Learns co-occurrence patterns (e.g., liquids, things that fit in cups).
      • "The capital of France is ___": Learns real-world facts.
      • "She unlocked her phone using her ___": Learns semantic understanding of unlocking mechanisms (face, fingerprint, password).
      • "The cat chased the ___": Learns probabilistic reasoning based on common actions.
      • "If it's raining I should bring an ___": Learns reasoning and inference, connecting conditions to actions.
    • These emergent behaviors are also observed in deep reinforcement learning (e.g., AlphaGo).
  • Other Modalities for SSL:
    • Audio: Mask out portions of an audio clip and predict the missing segments.
    • Video: Mask out frames and predict the missing frames.
    • Biology: Mask portions of protein or DNA structures and predict the missing parts.
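
A minimal sketch of a SimCLR-style contrastive objective (the NT-Xent loss), assuming PyTorch; the batch size, embedding dimension, and temperature are illustrative choices, and the random tensors stand in for encoder outputs on two augmented views of the same image batch.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent: each embedding's positive is the other augmented view of the
    same image; every other embedding in the batch acts as a negative."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N x dim, unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)

# Embeddings of two augmentations (crop, rotation, noise) of the same 8 images.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(contrastive_loss(z1, z2))
```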
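
And a toy illustration of why next token prediction needs no manual labels: every position in a scraped sentence supplies its own target, the following token. The whitespace tokenizer is a stand-in for a real subword tokenizer.

```python
# Build (context, next-token) training pairs directly from raw text.
text = "the capital of France is Paris"
tokens = text.split()  # toy tokenizer

training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in training_pairs:
    print(context, "->", target)
# ['the'] -> capital
# ['the', 'capital'] -> of
# ...
# ['the', 'capital', 'of', 'France', 'is'] -> Paris
```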

3.2. Weakly Supervised Learning and Multimodality

  • Core Idea: Leveraging naturally occurring pairings of different data modalities in the real world, rather than explicit manual labeling.
  • Connecting Modalities: The goal is to represent different modalities (e.g., text, images, audio) in a shared embedding space where their vectors are close if they represent semantically similar concepts.
  • Image and Text: Images with captions (e.g., Instagram posts) are a prime example of naturally occurring pairings.
  • Other Naturally Occurring Pairings:
    • Audio and Video: YouTube videos provide synchronized audio and visual data (e.g., a dog barking in both modalities).
    • Video and Text: Movies with subtitles connect visual streams to textual descriptions.
    • Music and Song Title: Audio linked to text.
    • Genotype and Phenotype: Biological data.
    • Medical Imaging and Ultrasound: Different imaging types for the same medical condition.
    • Game Footage and Keyboard Actions: Player input linked to visual gameplay.
  • Shared Embedding Space (ImageBind Example): Research like Meta's ImageBind demonstrates that modalities can be connected even indirectly (e.g., thermal data connects to images, images connect to text, thus thermal data connects to text via images). This allows for cross-modal queries, where an input from one modality (e.g., text "drums") can retrieve relevant information from other modalities (e.g., audio of drums, image of drums). A toy retrieval sketch follows this list.
  • Key Takeaways: Embeddings are crucial for representing meaning. Self-supervised learning (e.g., contrastive learning, next token prediction) enables training on vast unlabeled datasets, leading to emergent behaviors. Weakly supervised learning leverages natural data pairings to create shared embedding spaces, often with text as a central pivot, allowing for powerful multimodal understanding.
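
A toy sketch of cross-modal retrieval in a shared embedding space: the random 512-dimensional vectors stand in for embeddings produced by trained encoders (as in ImageBind), and the keys are hypothetical. With real embeddings, nearby vectors would correspond to semantically related items across modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pre-computed embeddings, all living in one shared space.
shared_space = {
    ("text", "drums"): rng.normal(size=512),
    ("audio", "drum_solo.wav"): rng.normal(size=512),
    ("image", "drum_kit.jpg"): rng.normal(size=512),
    ("image", "cat.jpg"): rng.normal(size=512),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_modal_search(query_key, k=2):
    """Return the k items (from any other modality) closest to the query."""
    query = shared_space[query_key]
    others = [key for key in shared_space if key != query_key]
    return sorted(others, key=lambda key: -cosine_similarity(query, shared_space[key]))[:k]

# With trained encoders, this query would surface the drum audio and drum image.
print(cross_modal_search(("text", "drums")))
```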

The lecture concluded by noting that adversarial attacks and defenses, though planned, would be covered in a subsequent session.
