Stanford CS231N | Spring 2025 | Lecture 4: Neural Networks and Backpropagation
Key Concepts
- Neural Networks
- Backpropagation
- Loss Functions (Softmax, Hinge Loss/SVM Loss)
- Regularization
- Optimization (Gradient Descent, SGD, Momentum, RMSProp, Adam)
- Learning Rate Scheduling
- Activation Functions (ReLU, Leaky ReLU, ELU, GeLU, SiLU, Sigmoid, Tanh)
- Fully Connected Networks (Multi-Layer Perceptrons - MLPs)
- Computational Graphs
- Local Gradient
- Upstream Gradient
- Downstream Gradient
- Jacobian Matrix
Loss Functions and Regularization
- Loss Functions: The video reiterates that while Softmax is the most widely used classification loss, it is not the only option. Hinge Loss (SVM Loss), discussed in lecture two, is an alternative that was common in earlier neural networks.
- Hinge Loss: Unlike Softmax, it does not convert scores to probabilities. It encourages the score of the correct class (s_yi) to be higher than the score of every other class (s_j) by at least a margin (typically 1): the per-example loss is the sum over j ≠ yi of max(0, s_j − s_yi + 1), so whenever the margin is violated the loss grows linearly with the size of the violation (a sketch follows this list).
- General Optimization: The goal is to find the best parameters (W) that minimize the loss landscape. This involves taking the gradient of the loss function (L) with respect to W and using it for optimization.
- Gradient Descent: Weights are updated by taking a step in the negative direction of the gradient, scaled by a step size (learning rate).
- Numerical vs. Analytical Gradients: Analytical gradients are preferred in practice, but numerical gradients can be used to check implementations.
- Mini-Batches: To address the computational cost of using the entire dataset, mini-batches (e.g., 32, 64, 128, or 256 examples) are used to estimate the gradients.
- SGD Optimizations: Momentum, RMSProp, and Adam are mentioned as refinements of Stochastic Gradient Descent (SGD); a minimal momentum update is sketched after this list.
- Learning Rate Scheduling: Starting with a larger learning rate and then decaying it is often necessary, although some modern optimizers (e.g., Adam variants) handle this automatically.
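A minimal sketch of the multiclass hinge (SVM) loss for a single example, assuming `scores` is a NumPy vector of raw class scores and `y` is the index of the correct class (function and variable names are illustrative, not the lecture's code):

```python
import numpy as np

def hinge_loss(scores, y, margin=1.0):
    """Multiclass SVM loss for one example: sum_j max(0, s_j - s_y + margin), j != y."""
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0  # the correct class does not contribute to its own loss
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])   # raw class scores
print(hinge_loss(scores, y=0))        # 2.9 = max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1)
```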
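A minimal sketch of an SGD-with-momentum step, keeping a running "velocity" that smooths the noisy mini-batch gradients (the parameter values and gradient below are made up for illustration):

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=1e-3, rho=0.9):
    """One SGD-with-momentum update."""
    v = rho * v + grad          # decaying running average of past gradients
    w = w - lr * v              # step against the smoothed gradient
    return w, v

w = np.zeros(3)                        # toy parameters
v = np.zeros_like(w)                   # velocity starts at zero
grad = np.array([0.5, -1.0, 2.0])      # gradient from one mini-batch (made up)
w, v = sgd_momentum_step(w, grad, v)
print(w)                               # [-0.0005  0.001  -0.002]
```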
Neural Networks: Architecture and Non-Linearity
- Basic Neural Network: A single layer can be represented as a linear function: W * x (weights multiplied by input).
- Multi-Layer Neural Network: A two-layer network can be defined as f = W2 * max(0, W1 * x); a shape-annotated sketch follows this list.
- Dimensionality: D is the dimensionality of the input (number of features), C is the number of classes (number of outputs), and H is the number of neurons in the hidden layer.
- Non-Linearity: The max function (ReLU) introduces non-linearity between the linear transformations done by W1 and W2. This is crucial for solving problems that are not linearly separable.
- Fully Connected Networks (MLPs): Networks built entirely from these matrix-multiplication layers, with element-wise non-linearities between them, are called fully connected networks or multi-layer perceptrons.
- Templates: Neural networks learn templates or representatives of the images from the data they are trained on. Multiple layers allow the network to create templates for parts of objects, rather than just entire objects. For example, a hidden layer with 100 neurons can create 100 templates, potentially representing parts of objects shared between classes (e.g., eyes in bird, cat, deer, dog, frog, horse).
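A shape-annotated sketch of the two-layer forward pass f = W2 * max(0, W1 * x), using NumPy and made-up sizes for D, H, and C (this is not the lecture's code):

```python
import numpy as np

D, H, C = 3072, 100, 10           # input features, hidden neurons, classes (illustrative)
x = np.random.randn(D)            # one input example
W1 = 0.01 * np.random.randn(H, D) # first-layer weights
W2 = 0.01 * np.random.randn(C, H) # second-layer weights

h = np.maximum(0, W1 @ x)         # hidden activations (H,): ReLU adds the non-linearity
scores = W2 @ h                   # class scores (C,)
print(h.shape, scores.shape)      # (100,) (10,)
```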
Activation Functions
- Role: Activation functions introduce non-linearity, which is essential for solving non-linear problems.
- ReLU (Rectified Linear Unit): A popular activation function. However, it can create "dead neurons": it outputs 0 (with zero gradient) for non-positive inputs, so a neuron that only ever receives negative inputs stops updating.
- Leaky ReLU, ELU (Exponential Linear Unit): Alternatives to ReLU that address the dead neuron problem; ELU additionally produces outputs that are closer to zero-centered.
- GeLU (Gaussian Error Linear Unit): Used more often in transformers.
- SiLU (Sigmoid Linear Unit) / Swish: Used in some modern CNN architectures, such as EfficientNet.
- Sigmoid and Tanh: Can be used as activation functions, but they can suffer from vanishing gradients because they squash values into a narrow range. They are more often used in output layers, e.g., a sigmoid to squash scores into [0, 1] for binary outputs.
- Choosing Activation Functions: The choice is largely empirical; start from the standard activation functions for a given architecture (CNNs, transformers). A few common definitions are sketched below.
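For reference, a small sketch of the activation functions listed above; the GeLU uses the common tanh approximation, and all of this is standard NumPy rather than code from the lecture:

```python
import numpy as np

def relu(x):        return np.maximum(0, x)
def leaky_relu(x):  return np.where(x > 0, x, 0.01 * x)
def elu(x, a=1.0):  return np.where(x > 0, x, a * (np.exp(x) - 1))
def sigmoid(x):     return 1 / (1 + np.exp(-x))
def silu(x):        return x * sigmoid(x)            # a.k.a. Swish
def gelu(x):        # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3, 3, 7)
print(relu(x))
print(gelu(x))
```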
Neural Network Implementation
- Python Implementation: A two-layer neural network can be implemented in Python in fewer than 20 lines of code (a sketch follows this list).
- Forward Pass: Applying weights to the inputs layer by layer to create the output (predicted y).
- Analytical Gradients: Calculating analytical gradients is the most important part of the process.
- Overfitting: More neurons often mean more capacity to learn complex functions, but it can also lead to overfitting.
- Regularization vs. Network Size: It's generally better to use a slightly larger network and then use regularization to prevent overfitting, rather than trying to fine-tune the network size itself. The regularization hyperparameter is tuned more often than the network size.
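A minimal sketch of such a two-layer network with the forward pass, analytical gradients via backprop, and a gradient-descent update. It uses random toy data, illustrative sizes, and a simple L2 loss to keep the code short; it mirrors the style of the lecture's example but is not its exact code:

```python
import numpy as np

N, D, H, C = 64, 1000, 100, 10               # batch size, input dim, hidden dim, outputs
x, y = np.random.randn(N, D), np.random.randn(N, C)
W1, W2 = np.random.randn(D, H), np.random.randn(H, C)

lr = 1e-6
for t in range(500):
    # Forward pass: linear -> ReLU -> linear, then an L2 loss against the targets y.
    h = np.maximum(0, x @ W1)                # (N, H)
    y_pred = h @ W2                          # (N, C)
    loss = np.square(y_pred - y).sum()
    if t % 100 == 0:
        print(t, loss)

    # Backward pass: analytical gradients via the chain rule.
    grad_y_pred = 2.0 * (y_pred - y)         # dL/dy_pred
    grad_W2 = h.T @ grad_y_pred              # dL/dW2
    grad_h = grad_y_pred @ W2.T              # dL/dh
    grad_h[h <= 0] = 0                       # ReLU passes gradient only where h > 0
    grad_W1 = x.T @ grad_h                   # dL/dW1

    # Gradient descent update.
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
```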
Biological Inspiration
- Neurons: The video draws a loose analogy between biological neurons and neural networks. The cell body aggregates impulses carried through dendrites, and axons carry impulses to other neurons. This is similar to how neural networks capture signals from previous layers, operate on them, and pass them to the next layer.
- Caution: The video warns against taking brain analogies too literally, as there are many differences between biological neurons and artificial neural networks.
Computational Graphs and Backpropagation
- Computational Graphs: Represent all operations in the neural network step-by-step, from inputs and parameters to the final loss.
- Backpropagation: A solution to the challenges of manually calculating derivatives, especially for complex loss functions. It's a recursive process that starts at the end of the network and backpropagates gradients.
- Example Function: f(x, y, z) = (x + y) * z; a numeric sketch of its forward and backward pass follows this list.
- Forward Pass: Calculates the output of each operation step-by-step.
- Backward Pass: Calculates the partial derivatives of the output with respect to each input, starting from the end of the network.
- Chain Rule: Used to compute derivatives through intermediate variables when an input is not directly connected to the output: the derivatives along the connecting path are multiplied together.
- Local Gradient: The gradient of the node's output with respect to its input.
- Upstream Gradient: The gradient that comes from the end of the network to the current node.
- Downstream Gradient: The product of the upstream gradient and the local gradient. It becomes the upstream gradient for the previous layer.
- Modularization: Computational graphs allow for modularization, where each node calculates its local gradients and uses the upstream gradient to calculate the downstream gradient.
- Intuition: Backpropagation propagates the gradient of the loss L back to every variable in the network, without ever writing out a closed-form derivative for the whole network.
- Sigmoid Example: The lecture also works through a function containing a sigmoid, with detailed forward and backward pass calculations; a useful shortcut is that the local gradient of σ(x) is σ(x)(1 − σ(x)).
- Patterns in Gradient Flow (common gates):
- Add Gate: gradient distributor (the upstream gradient is passed unchanged to every input).
- Multiply Gate: swap multiplier (each input's gradient is the upstream gradient scaled by the other input's value).
- Copy Gate: gradient adder (the upstream gradients from all branches are summed).
- Max Gate: gradient router (the upstream gradient flows only to the input that attained the max; the others get zero).
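A numeric sketch of backprop through f(x, y, z) = (x + y) * z with arbitrary example values, showing how upstream, local, and downstream gradients compose at each gate:

```python
# Forward pass: build the graph node by node.
x, y, z = -2.0, 5.0, -4.0
q = x + y                     # add gate:      q = 3
f = q * z                     # multiply gate: f = -12

# Backward pass: start from df/df = 1 and work backwards.
df_df = 1.0                   # upstream gradient at the output
# Multiply gate: local gradients are the "swapped" inputs.
df_dq = z * df_df             # -4
df_dz = q * df_df             #  3
# Add gate: distributes its upstream gradient unchanged to both inputs.
df_dx = 1.0 * df_dq           # -4
df_dy = 1.0 * df_dq           # -4
print(df_dx, df_dy, df_dz)    # -4.0 -4.0 3.0
```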
Backpropagation with Vectors and Matrices
- Scalar to Scalar: If x and y are scalars, the derivative is also a scalar.
- Vector to Scalar: If x is a vector and y is a scalar, the derivative is a vector.
- Vector to Vector: If x and y are vectors, the derivative is a Jacobian matrix.
- Jacobian Matrix: Entry (i, j) gives how much element i of y changes when element j of x changes by a small amount.
- Backprop with Vectors: The loss L is always a scalar. The upstream gradient (of L with respect to the node's output) is a vector, local gradients become Jacobian matrices, and the downstream gradient is the product of the upstream gradient and the local Jacobian.
- ReLU Example: Because the operation is element-wise, the Jacobian is sparse (diagonal), so in practice the backward pass is just an element-wise mask rather than an explicit matrix product.
- Matrices and Tensors: The gradient of the loss with respect to a matrix- or tensor-valued variable always has the same shape as that variable.
- Matrix Multiplication Example: Explicitly forming the Jacobian of a matrix multiplication would be enormous and computationally expensive. Instead, the backward pass is written directly as matrix products specific to that operation (see the sketch below).
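A sketch of this efficient backward pass for y = x @ w: rather than building the Jacobian, each gradient is a matrix product of the upstream gradient with the other operand, and each result has the same shape as the variable it belongs to (shapes and names below are illustrative):

```python
import numpy as np

N, D, M = 4, 5, 3
x = np.random.randn(N, D)
w = np.random.randn(D, M)

# Forward pass.
y = x @ w                      # (N, M)

# Backward pass, given an upstream gradient dL/dy of the same shape as y.
dy = np.random.randn(N, M)     # stand-in upstream gradient
dx = dy @ w.T                  # dL/dx, shape (N, D): same shape as x
dw = x.T @ dy                  # dL/dw, shape (D, M): same shape as w
print(dx.shape, dw.shape)
```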
Conclusion
The video provides a detailed explanation of neural networks and backpropagation, covering key concepts, implementation details, and mathematical foundations. It emphasizes the importance of non-linearity, activation functions, and computational graphs for building and training effective neural networks. The discussion of vectorization and matrix operations highlights the challenges and optimizations involved in scaling backpropagation to larger models and datasets. The key takeaway is that backpropagation, facilitated by computational graphs, is a fundamental algorithm that enables neural networks to learn from data by efficiently calculating and propagating gradients through multiple layers.