Linear Regression From Scratch in Python (Mathematical, Closed-Form)

By NeuralNine

Linear Regression from Scratch: Closed Form Solution

Key Concepts:

  • Linear Regression: A statistical method for modeling the relationship between a dependent variable and one or more independent variables.
  • Closed Form Solution: A direct, algebraic solution for calculating optimal parameters without iterative methods like gradient descent.
  • Residual Sum of Squares (RSS): The sum of the squared differences between the actual and predicted values, used as the loss function.
  • Least Squares: The principle of minimizing the RSS to find the best-fitting line.
  • Matrix Notation: Using matrices to represent and manipulate linear equations, enabling efficient calculations for multiple dimensions.
  • Coefficients (β): The parameters of the linear equation (slopes and intercept) that define the relationship between variables.
  • Intercept (b): The point where the regression line crosses the y-axis.
  • Gradient Descent (Comparison): An iterative optimization algorithm used in a previous video to find the optimal parameters, contrasted with the direct approach of the closed form solution.

1. Introduction to Linear Regression

The video begins by introducing the concept of linear regression and its application in modeling relationships between variables. A simple example is used: time spent studying (independent variable, x) and exam score (dependent variable, y). While individual data points may vary, a general trend often emerges – more study time tends to correlate with higher scores. The goal of linear regression is to find the "best-fit" line that represents this trend, enabling predictions of exam scores based on study time. This concept extends to multiple dimensions, where the "line" becomes a hyperplane in higher-dimensional space.

2. Defining the "Best" Line & Loss Function

The core challenge is defining what constitutes the "best" line. The video explains that this is typically done by minimizing the vertical distances (residuals) between the line and the data points. These distances are aggregated as the Residual Sum of Squares (RSS): the sum of the squared differences between the actual and predicted values. Squaring the differences ensures that positive and negative deviations do not cancel each other out. The RSS serves as the loss function, a measure of how poorly the line fits the data (lower is better), and minimizing it is the objective of linear regression.
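
Written out for n data points, with yᵢ the actual value and ŷᵢ the model's prediction for sample i:

RSS = Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²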

3. Mathematical Formulation & Matrix Notation

The video transitions into the mathematical formulation of linear regression. The basic equation of a line is presented: y = mx + b, where m is the slope and b is the intercept. This is then extended to multiple dimensions using vector and matrix notation.

  • Vector Representation: The line equation generalizes to ŷ = βᵀx, where β is the vector of coefficients (including the intercept) and x is the vector of input features.
  • Matrix Representation: With multiple data samples, the inputs are stacked into a matrix X in which each row is a sample and each column a feature. The RSS is then expressed in matrix form as RSS = (y - Xβ)ᵀ(y - Xβ).
  • Bias Term: The intercept is folded into the matrix equation by adding a column of ones to X, which lets it be treated as just another coefficient in the β vector (a quick NumPy illustration follows this list).
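
A minimal NumPy illustration of the bias-column trick (the array names here are illustrative, not necessarily the video's):

```python
import numpy as np

# Three samples, two features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Prepend a column of ones; the intercept then becomes the first
# entry of the beta vector and needs no special handling.
X_b = np.column_stack([np.ones(X.shape[0]), X])
print(X_b)
# [[1. 1. 2.]
#  [1. 3. 4.]
#  [1. 5. 6.]]
```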

4. Deriving the Closed Form Solution

The video details the derivation of the closed form solution for the optimal coefficients β; the full chain of equations is written out after the list. The derivation involves:

  1. Expanding the RSS equation: The matrix equation is expanded to reveal its components.
  2. Taking the partial derivative: The partial derivative of the RSS with respect to β is calculated.
  3. Setting the derivative to zero: The derivative is set to zero to find the minimum point of the RSS.
  4. Solving for β: The resulting equation is solved for β, leading to the closed form solution: β̂ = (XᵀX)⁻¹Xᵀy, where β̂ represents the optimal coefficients, Xᵀ is the transpose of X, and (XᵀX)⁻¹ is the inverse of the matrix product XᵀX. The assumption is made that (XᵀX) is invertible (full rank).
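
Written out in the matrix notation from section 3:

RSS(β) = (y - Xβ)ᵀ(y - Xβ) = yᵀy - 2βᵀXᵀy + βᵀXᵀXβ
∂RSS/∂β = -2Xᵀy + 2XᵀXβ = 0
XᵀXβ = Xᵀy   (the normal equations)
β̂ = (XᵀX)⁻¹Xᵀy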

5. Python Implementation

The theoretical concepts are then translated into a Python implementation using the NumPy library; a minimal sketch of the class appears after the list below.

  • Class Structure: A LinearRegressionClosed class is created, mirroring the structure of scikit-learn models.
  • __init__ Method: Initializes the coefficients to None and the intercept to 0.0.
  • fit Method:
    • Converts input data to NumPy arrays.
    • Adds a column of ones to the input matrix x to account for the intercept.
    • Calculates the optimal coefficients β̂ using the closed form solution formula: (XᵀX)⁻¹Xᵀy.
  • predict Method: Calculates predictions by multiplying the input matrix x with the calculated coefficients β̂ and adding the intercept.
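
A minimal sketch of such a class, reconstructed from the description above (parameter and attribute names are assumptions, not necessarily the video's exact code):

```python
import numpy as np

class LinearRegressionClosed:
    def __init__(self):
        self.coefficients = None  # slope coefficients, set by fit()
        self.intercept = 0.0

    def fit(self, x, y):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the intercept is learned as beta_0.
        X = np.column_stack([np.ones(x.shape[0]), x])
        # Closed form solution: beta = (X^T X)^(-1) X^T y.
        # np.linalg.solve(X.T @ X, X.T @ y) would avoid forming the
        # inverse explicitly and is numerically safer in practice.
        beta = np.linalg.inv(X.T @ X) @ X.T @ y
        self.intercept = beta[0]
        self.coefficients = beta[1:]
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return x @ self.coefficients + self.intercept
```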

6. Validation and Comparison with Scikit-learn

The implemented linear regression model is validated on the California Housing dataset from scikit-learn; a sketch of the validation script follows the list.

  • Dataset Loading: The dataset is loaded using fetch_california_housing.
  • Model Training: The fit method is used to train the model on the dataset.
  • Prediction: Predictions are made on the training data.
  • R² Score: The R² score (coefficient of determination) is calculated to evaluate the model's performance. An R² score of approximately 0.6 is achieved.
  • Comparison: The same process is repeated using scikit-learn's LinearRegression model, yielding the same R² score, confirming the correctness of the implementation.
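
A sketch of the validation run, assuming the LinearRegressionClosed class above (the ≈ 0.6 figure comes from the video):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load the California Housing dataset.
data = fetch_california_housing()
X, y = data.data, data.target

# The from-scratch closed form model, evaluated on the training data.
model = LinearRegressionClosed()
model.fit(X, y)
print("custom  R²:", r2_score(y, model.predict(X)))

# scikit-learn's reference implementation for comparison.
sk_model = LinearRegression().fit(X, y)
print("sklearn R²:", sk_model.score(X, y))
# Both should print the same value, roughly 0.6 on this dataset.
```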

Notable Quotes:

  • “This video is going to be quite mathematical.”
  • “We don't iteratively get closer. We just calculate it with a formula.”
  • “The problem with this is of course if I just sum them up they can cancel out.” (referring to the need for squaring the errors)
  • “This is the closed form solution to linear regression.”

Technical Terms:

  • Transpose (Xᵀ): Switching the rows and columns of a matrix.
  • Inverse (X⁻¹): A matrix that, when multiplied by the original matrix, results in the identity matrix.
  • Dot Product: A fundamental operation in linear algebra involving the sum of the products of corresponding entries in two vectors or matrices.
  • Full Rank: A matrix is said to have full rank if its rows (or columns) are linearly independent.
  • R² Score (Coefficient of Determination): A statistical measure that represents the proportion of variance in the dependent variable that can be predicted from the independent variable(s).
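
In the notation used earlier, the R² score can be written as R² = 1 - RSS/TSS, where TSS = Σᵢ (yᵢ - ȳ)² is the total sum of squares around the mean ȳ; a score of 1 indicates perfect prediction, while 0 means the model does no better than always predicting the mean.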

Logical Connections:

The video follows a logical progression: introduction to the problem, mathematical formulation, derivation of the solution, implementation in Python, and validation. Each section builds upon the previous one, culminating in a working linear regression model. The comparison with scikit-learn provides a benchmark and confirms the accuracy of the implementation.

Conclusion:

This video provides a comprehensive and detailed explanation of linear regression using the closed form solution. It emphasizes the mathematical foundations of the method and demonstrates how to implement it from scratch in Python. The use of matrix notation and the derivation of the closed form solution offer a deeper understanding of the underlying principles. The validation against scikit-learn confirms the correctness of the implementation and highlights the practical applicability of the method. The key takeaway is the ability to directly calculate optimal parameters without relying on iterative optimization techniques like gradient descent.
