Ridge Regression From Scratch in Python (Mathematical)

By NeuralNine


Ridge Regression Implementation from Scratch in Python

Key Concepts:

  • Ridge Regression (L2 Regularization): A linear regression technique that adds a penalty term to the cost function, proportional to the square of the magnitude of the coefficients. This helps prevent overfitting by discouraging large coefficient values.
  • Closed-Form Solution: Directly calculating the optimal parameters (coefficients) of a linear regression model using a mathematical formula, rather than iterative methods like gradient descent.
  • L2 Norm (Euclidean Norm): A measure of the distance from the origin to a point in a vector space, calculated as the square root of the sum of the squares of the coordinates. In ridge regression, its square is used to penalize large coefficients.
  • Residual Sum of Squares (RSS): The sum of the squared differences between the predicted values and the actual values in a regression model.
  • Lambda (α/Alpha): The regularization parameter that controls the strength of the penalty term. Higher values lead to stronger regularization.
  • Matrix Notation: Representing equations and operations using matrices and vectors for efficient computation.
  • Inverse Matrix: A matrix that, when multiplied by the original matrix, results in the identity matrix. Used in the closed-form solution to solve for the coefficients.
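
To make the matrix concepts above concrete, here is a tiny NumPy demo (illustrative only, not from the video) of the L2 norm, the identity matrix, and the matrix inverse:

    import numpy as np

    b = np.array([3.0, 4.0])
    print(np.sqrt(np.sum(b ** 2)))    # L2 norm "by hand": 5.0
    print(np.linalg.norm(b))          # the same via NumPy: 5.0

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    I = np.eye(2)                     # 2x2 identity matrix
    print(np.allclose(A @ np.linalg.inv(A), I))  # A times its inverse is I: True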

1. Mathematical Derivation of Ridge Regression

The video begins by outlining the mathematical foundation of ridge regression. The core idea is to modify the ordinary least squares (OLS) cost function by adding a regularization term.

  • Ordinary Least Squares (OLS): The initial cost function is the Residual Sum of Squares (RSS): RSS = (y - Xβ)ᵀ(y - Xβ), where y is the vector of actual values, X is the input matrix, and β is the vector of coefficients.
  • Regularization Term: A penalty term is added to the RSS: λβᵀβ, where λ (alpha in the code) is the regularization parameter. This term penalizes large values of the coefficients β.
  • Modified Cost Function: The complete cost function becomes: RSS + λβᵀβ.
  • Optimal Solution Derivation: To find the optimal coefficients (β̂), the partial derivative of the modified cost function with respect to β is calculated and set to zero.
    • ∂(RSS + λβᵀβ) / ∂β = −2Xᵀ(y − Xβ̂) + 2λβ̂ = 0
    • Expanding and dividing by 2 gives: XᵀXβ̂ + λβ̂ = Xᵀy
    • Factoring out β̂ (where I is the identity matrix): (XᵀX + λI)β̂ = Xᵀy
    • Finally, the closed-form solution for β̂ is: β̂ = (XᵀX + λI)⁻¹Xᵀy
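
As a quick sanity check (an illustrative sketch, not from the video), the closed-form solution can be verified numerically on random data by confirming that β̂ satisfies (XᵀX + λI)β̂ = Xᵀy:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = rng.normal(size=100)
    lam = 1.0
    I = np.eye(X.shape[1])

    # Closed-form ridge solution: β̂ = (XᵀX + λI)⁻¹ Xᵀy
    beta_hat = np.linalg.inv(X.T @ X + lam * I) @ X.T @ y

    # The rearranged equation (XᵀX + λI)β̂ = Xᵀy should hold
    print(np.allclose((X.T @ X + lam * I) @ beta_hat, X.T @ y))  # True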

The speaker emphasizes that this derivation requires a solid understanding of linear algebra and calculus. He suggests reviewing previous videos on linear regression and gradient descent for those unfamiliar with the concepts.

2. L1 vs. L2 Regularization

A brief comparison is made between L2 (Ridge) and L1 (Lasso) regularization:

  • L2 (Ridge): Uses the sum of the squared coefficients as the penalty term (βᵀβ). It shrinks coefficients towards zero but rarely sets them exactly to zero, so less informative features lose influence while remaining in the model.
  • L1 (Lasso): Uses the sum of the absolute values of the coefficients as the penalty term (‖β‖₁). It tends to set some coefficients exactly to zero, effectively performing feature selection; in this sense its penalty is "stricter" than L2's.
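
A small scikit-learn sketch (illustrative, not from the video; the dataset parameters are arbitrary) makes the difference visible: on the same data, Lasso zeroes out some coefficients while Ridge only shrinks them:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, Lasso

    # Data where only 3 of 10 features are actually informative
    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=1.0).fit(X, y)

    print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))  # typically 0
    print("Lasso coefficients set to zero:", np.sum(lasso.coef_ == 0))  # typically > 0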

3. Python Implementation

The implementation focuses on directly applying the derived formula without using optimization techniques like gradient descent.

  • Libraries: Only NumPy is used for matrix operations. Scikit-learn is used solely for evaluation and comparison.
  • Class Structure: A RidgeRegression class is created to encapsulate the model.
    • __init__(self, alpha=1): Initializes the model with a regularization parameter alpha (equivalent to lambda).
    • fit(self, X, y): Calculates the optimal coefficients using the closed-form solution.
      • Adds a bias term (intercept) to the input data X.
      • Creates an identity matrix I of the appropriate dimension.
      • Calculates β̂ = (XᵀX + αI)⁻¹Xᵀy using NumPy's linalg.inv function for matrix inversion.
      • Stores the calculated coefficients and intercept.
    • predict(self, X): Predicts the output values for new input data X using the calculated coefficients and intercept.
  • Code Snippet (Key Calculation):
    # Closed-form solution: β̂ = (XᵀX + αI)⁻¹ Xᵀy
    a = np.linalg.inv(Xb.T @ Xb + alpha * I) @ Xb.T @ y
    self.coefficients = a[1:]  # one weight per input feature
    self.intercept = a[0]      # bias term (the first column of Xb is all ones)
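
Putting the pieces above together, a minimal sketch of the described class might look like this (variable names beyond the snippet, such as Xb, are assumptions, since the summary only shows the key calculation):

    import numpy as np

    class RidgeRegression:
        def __init__(self, alpha=1):
            self.alpha = alpha  # regularization strength (λ in the derivation)

        def fit(self, X, y):
            # Add a bias column of ones so the first parameter is the intercept
            Xb = np.c_[np.ones(X.shape[0]), X]
            I = np.eye(Xb.shape[1])  # identity matrix of the appropriate dimension
            # Closed-form solution: β̂ = (XᵀX + αI)⁻¹ Xᵀy
            a = np.linalg.inv(Xb.T @ Xb + self.alpha * I) @ Xb.T @ y
            self.coefficients = a[1:]
            self.intercept = a[0]

        def predict(self, X):
            return X @ self.coefficients + self.intercept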
    

4. Evaluation and Comparison with Scikit-learn

The implemented ridge regression model is evaluated using the California Housing dataset from Scikit-learn.

  • Dataset: The fetch_california_housing function is used to load the dataset.
  • R² Score: The R² score (coefficient of determination) is used to measure the model's performance.
  • Comparison: The R² score obtained from the implemented RidgeRegression class is compared to the R² score obtained from Scikit-learn's built-in Ridge regression model. The scores are nearly identical, validating the implementation.
  • Impact of Alpha: Changing the value of alpha (the regularization parameter) affects the R² score, demonstrating the influence of regularization on model performance.
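
A sketch of this evaluation (illustrative; the train/test split and alpha value are assumptions, and RidgeRegression refers to the class sketched in section 3):

    from sklearn.datasets import fetch_california_housing
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # From-scratch model vs. Scikit-learn's built-in Ridge
    ours = RidgeRegression(alpha=1)
    ours.fit(X_train, y_train)
    sk = Ridge(alpha=1).fit(X_train, y_train)

    print("From scratch:", r2_score(y_test, ours.predict(X_test)))
    print("Scikit-learn:", r2_score(y_test, sk.predict(X_test)))  # nearly identical, per the video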

5. Notable Quotes

  • “This video is going to be quite mathematical. So if you like it, let me know by hitting a like button and subscribing.”
  • “We’re going to directly calculate the optimal parameters considering a penalty term, a regularization term, which is the L2 norm.”
  • “Ridge is going to make certain coefficients very, very small, which means it makes certain dimensions of the input very irrelevant, and it focuses on a couple of key features, so to say, to make a prediction, thus regularizing our line.”

Conclusion:

This video provides a comprehensive, from-scratch implementation of ridge regression in Python. It meticulously derives the mathematical formula for the closed-form solution and translates it into efficient NumPy code. The comparison with Scikit-learn’s implementation confirms the accuracy of the approach. The video emphasizes the importance of understanding the underlying mathematical principles and provides a solid foundation for implementing and applying ridge regression in various machine learning tasks. The key takeaway is that ridge regression effectively prevents overfitting by penalizing large coefficients, leading to more robust and generalizable models.
