Store Sales Prediction in Python - Time Series Machine Learning Project
By NeuralNine
Time Series Forecasting · Machine Learning Model Development · Data Preprocessing · Deep Learning Architectures
Key Concepts
- Time Series Prediction: Forecasting future values based on historical data.
- Temporal Convolutional Neural Network (TCN): A deep learning architecture using 1D convolutional layers for sequence modeling.
- Encoder-Decoder Architecture: A common neural network structure where an encoder processes input and a decoder generates output.
- Kaggle Store Sales Dataset: A benchmark dataset for time series forecasting, involving predicting store sales for different items and stores.
- Pandas Pivot: A data manipulation technique to reshape data from a long format to a wide format.
- StandardScaler: A preprocessing technique from scikit-learn to standardize features by removing the mean and scaling to unit variance.
- PyTorch Tensors: The fundamental data structure in PyTorch for numerical computation, similar to NumPy arrays but with GPU acceleration.
- DataLoader and Dataset: PyTorch utilities for efficient data loading and batching during model training.
- 1D Convolutional Layers: Layers that apply filters to sequential data, capturing local patterns.
- Dilation and Padding: Techniques in convolutional layers to expand the receptive field and maintain temporal dimensions.
- Mean Squared Error (MSE) / Root Mean Squared Error (RMSE): Common loss functions for regression tasks.
- Backpropagation: The process of adjusting model weights based on the calculated loss.
- Optimizer (Adam): An algorithm used to update model weights during training.
- Inference: The process of using a trained model to make predictions on new, unseen data.
- Inverse Transform: Reversing the scaling applied during preprocessing to obtain original data values.
Machine Learning Process for Time Series Prediction
This video details the end-to-end machine learning process for time series prediction, specifically using a Temporal Convolutional Neural Network (TCN) on the Kaggle Store Sales dataset. The goal is to predict store sales for the next 16 days based on the past 120 days of data.
1. Dataset Overview and Setup
- Dataset: Kaggle Store Sales dataset, a permanent competition for time series forecasting.
- Data Files:
  - `train.csv`: contains historical sales data.
  - `test.csv`: contains the structure for prediction, but without sales values.
  - `sample_submission.csv`: shows the required submission format (`id`, `sales`).
- Objective: Predict sales for each store and item category for 16 days.
- Environment Setup:
  - Utilized JupyterLab for interactive development.
  - Recommended `uv` for package management (`pip install uv`, `uv init`, `uv add pandas numpy torch scikit-learn matplotlib jupyterlab`).
  - Installed the necessary libraries: `pandas` (data manipulation), `numpy` (numerical operations), `torch` (PyTorch for deep learning), `scikit-learn` (preprocessing), and `matplotlib` (visualization).
2. Data Exploration and Preprocessing
- Loading Data: `pandas.read_csv('data/train.csv')` loads the training data.
- Data Structure: The raw data has columns such as `id`, `date`, `store_nbr`, `family`, `onpromotion`, and `sales`.
- Problem Formulation: The task requires predicting sales for each unique combination of `store_nbr` and `family` over a future time period, which means treating each combination as a separate time series.
- Data Reshaping (Pivoting):
  - Feature Engineering: A new column `store_family` was created by combining `store_nbr` and `family` using `df.apply(lambda x: f"{x['store_nbr']}_{x['family']}", axis=1)`.
  - Pivoting: `pandas.pivot` was used to transform the data into a wide format with `index='date'`, `columns='store_family'`, and `values='sales'`.
  - This resulted in a DataFrame `df_pivoted` where each row is a date and each column is a unique `store_family` time series (see the sketch after this list).
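A minimal sketch of this reshaping step, assuming the Kaggle file layout described above (columns `store_nbr`, `family`, `sales`):

```python
import pandas as pd

df = pd.read_csv("data/train.csv")

# One identifier per (store, product-family) combination.
df["store_family"] = df.apply(
    lambda x: f"{x['store_nbr']}_{x['family']}", axis=1
)

# Wide format: one row per date, one column per store_family series.
df_pivoted = df.pivot(index="date", columns="store_family", values="sales")
```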
- Data Visualization:
  - Used `matplotlib.pyplot` to plot a sample of time series (8 families × 3 stores) to understand patterns, seasonality, and potential outliers.
  - Observed varying sales patterns across different store-family combinations, including zero sales in some periods.
- Data Scaling:
  - Necessity: Neural networks are sensitive to feature scales, so standardization is crucial.
  - Method: `sklearn.preprocessing.StandardScaler` was used.
  - Train-Validation Split: The pivoted data was split into training (80%) and validation (20%) sets. Crucially, no shuffling was performed, to preserve the temporal order of the data.
  - `scaler.fit_transform(train_data)` fitted the scaler on the training data and transformed it; `scaler.transform(test_data)` transformed the validation data using the statistics fitted on the training data (see the sketch below).
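A sketch of the split and scaling, assuming `df_pivoted` from the previous step; the 80/20 boundary follows the video:

```python
from sklearn.preprocessing import StandardScaler

# Chronological 80/20 split -- no shuffling, to keep temporal order intact.
split = int(len(df_pivoted) * 0.8)
train_data = df_pivoted.iloc[:split]
test_data = df_pivoted.iloc[split:]

scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data)  # fit on training data only
test_data_scaled = scaler.transform(test_data)        # reuse training statistics
```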
- Creating Input/Output Sequences:
  - A helper function `create_xy(data, input_length, output_length)` was defined.
  - It iterates through the scaled data to create sequences of `input_length` (120 days) as input features (X) and `output_length` (16 days) as target values (Y).
  - The iteration range was `len(data) - input_length - output_length + 1`, to ensure enough data for both input and output.
  - The output sequences (Y) were the `output_length` days immediately following each input sequence.
  - The function returned X and Y as NumPy arrays.
  - It was applied to `train_data_scaled` and `test_data_scaled` with `input_length=120` and `output_length=16`, as in the sketch below.
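A sketch of the windowing helper as described; variable names mirror the summary:

```python
import numpy as np

def create_xy(data, input_length, output_length):
    """Slide a window over the series: the past input_length days become X,
    the following output_length days become Y."""
    X, Y = [], []
    for i in range(len(data) - input_length - output_length + 1):
        X.append(data[i : i + input_length])
        Y.append(data[i + input_length : i + input_length + output_length])
    return np.array(X), np.array(Y)

X_train, y_train = create_xy(train_data_scaled, input_length=120, output_length=16)
X_test, y_test = create_xy(test_data_scaled, input_length=120, output_length=16)
```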
- PyTorch Tensor Conversion:
  - Converted the NumPy arrays to PyTorch tensors: `torch.FloatTensor(numpy_array)`.
  - Moved the tensors to the GPU if available: `.to('cuda')`.
- DataLoader Setup:
  - Created `torch.utils.data.TensorDataset` objects for the training and testing data.
  - Created `torch.utils.data.DataLoader` instances with `batch_size=32` for efficient batching.
  - `shuffle=True` for the training loader to introduce randomness during training; `shuffle=False` for the test loader, since shuffling is not needed for inference. Both are shown in the sketch below.
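A sketch of the tensor conversion and loaders, assuming the arrays from `create_xy` above:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

X_train_tensor = torch.FloatTensor(X_train).to(device)
y_train_tensor = torch.FloatTensor(y_train).to(device)
X_test_tensor = torch.FloatTensor(X_test).to(device)
y_test_tensor = torch.FloatTensor(y_test).to(device)

train_loader = DataLoader(TensorDataset(X_train_tensor, y_train_tensor),
                          batch_size=32, shuffle=True)   # randomize training batches
test_loader = DataLoader(TensorDataset(X_test_tensor, y_test_tensor),
                         batch_size=32, shuffle=False)   # keep order for evaluation
```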
3. Temporal Convolutional Neural Network (TCN) Model
- Architecture: An encoder-decoder structure using 1D convolutional layers.
- Key Components:
  - `nn.Conv1d`: 1D convolutional layers.
    - `in_channels`: number of input features per time step (initially 1782, i.e., 33 families × 54 stores).
    - `out_channels`: number of filters (e.g., 64).
    - `kernel_size`: the size of the sliding window (e.g., 3).
    - `padding`: adds zero-padding to the input to maintain the temporal dimension.
    - `dilation`: spreads out the kernel to increase the receptive field without increasing kernel size or depth; the dilation rate was doubled in each subsequent layer (1, 2, 4).
  - Activation Function: `nn.ReLU` (Rectified Linear Unit) to introduce non-linearity.
  - Cropping: After convolution and activation, the padding was cropped off to restore the original temporal length; the amount cropped matched the padding applied.
  - `nn.Linear` (fully connected layer): a final layer mapping the extracted features to the desired output shape.
    - `in_features`: the number of channels from the last convolutional layer (64).
    - `out_features`: the total number of output values required, i.e., `output_length * number_of_channels` (16 days × 1782 store-family combinations).
  - `view()`: reshaped the output of the linear layer into the format `(batch_size, output_length, number_of_channels)`.
- Forward Pass:
  - The input tensor `x` was transposed to the `(batch_size, channels, sequence_length)` format expected by `nn.Conv1d`.
  - Data was passed sequentially through convolutional layers, activation functions, and cropping.
  - The output of the last convolutional block was passed through the linear layer.
  - The output was reshaped with `view()` into the final prediction format. A sketch of the full module follows.
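A sketch of a TCN module matching this description. The layer sizes follow the summary; feeding only the last time step's 64 features into the linear head is an assumption, since the summary does not say how the temporal dimension is collapsed:

```python
import torch
import torch.nn as nn

class Crop(nn.Module):
    """Crop the trailing padding off a Conv1d output to keep it causal."""
    def __init__(self, amount):
        super().__init__()
        self.amount = amount

    def forward(self, x):
        return x[:, :, :-self.amount]

class TCNModel(nn.Module):
    def __init__(self, num_channels=1782, hidden=64, kernel_size=3,
                 output_length=16):
        super().__init__()
        self.output_length = output_length
        self.num_channels = num_channels
        layers = []
        in_ch = num_channels
        for dilation in (1, 2, 4):  # dilation doubled in each layer
            padding = (kernel_size - 1) * dilation
            layers += [
                nn.Conv1d(in_ch, hidden, kernel_size,
                          padding=padding, dilation=dilation),
                nn.ReLU(),
                Crop(padding),  # crop exactly as much as was padded
            ]
            in_ch = hidden
        self.network = nn.Sequential(*layers)
        self.linear = nn.Linear(hidden, output_length * num_channels)

    def forward(self, x):
        # (batch, seq_len, channels) -> (batch, channels, seq_len) for Conv1d
        x = self.network(x.transpose(1, 2))
        # Assumption: the last time step's features feed the linear head.
        x = self.linear(x[:, :, -1])
        return x.view(-1, self.output_length, self.num_channels)
```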
4. Model Training
- Initialization:
  - Instantiated the `TCNModel`.
  - Defined the optimizer: `torch.optim.Adam` with a learning rate of `0.0001`.
  - Defined the loss function: `nn.MSELoss` (mean squared error); the square root was applied manually to obtain RMSE.
- Training Loop (see the sketch below):
  - Iterated for a fixed number of epochs (e.g., 30) to avoid overfitting.
  - Set the model to training mode: `model.train()`.
  - Iterated through batches from the `train_loader`.
  - Steps within each batch:
    - Zero the gradients: `optimizer.zero_grad()`.
    - Forward pass: `predictions = model(x_batch)`.
    - Calculate the loss: `loss = torch.sqrt(criterion(predictions, y_batch))`.
    - Backward pass: `loss.backward()`.
    - Optimizer step: `optimizer.step()`.
    - Accumulate the epoch loss.
  - Printed the epoch loss every 5 epochs.
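A sketch of the training loop, assuming `TCNModel`, `train_loader`, and `device` from the earlier sketches; the epoch count and learning rate follow the summary:

```python
import torch
import torch.nn as nn

model = TCNModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.MSELoss()

for epoch in range(30):
    model.train()
    epoch_loss = 0.0
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        predictions = model(x_batch)
        # RMSE: take the square root of the MSE criterion manually.
        loss = torch.sqrt(criterion(predictions, y_batch))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch + 1}: RMSE {epoch_loss / len(train_loader):.4f}")
```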
- Evaluation on Validation Set:
  - Set the model to evaluation mode: `model.eval()`.
  - Disabled gradient calculation: `with torch.no_grad():`.
  - Made predictions on `X_test_tensor`.
  - Calculated the test loss (RMSE), as in the sketch below.
  - Observation: The test loss was higher than the training loss, potentially due to outliers in the validation data. The speaker noted that a single large outlier in `Y_test` significantly inflated the RMSE, while the median values were similar, suggesting the model might still perform reasonably on Kaggle.
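A sketch of the evaluation step, reusing `model`, `criterion`, and the validation tensors from above:

```python
model.eval()
with torch.no_grad():
    test_predictions = model(X_test_tensor)
    test_rmse = torch.sqrt(criterion(test_predictions, y_test_tensor))
print(f"Validation RMSE: {test_rmse.item():.4f}")
```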
5. Full Dataset Training and Prediction
- Retraining on Full Data:
  - The entire training dataset (without the train-validation split) was used to retrain the model.
  - The `StandardScaler` was refitted on the entire pivoted training data.
  - The `create_xy` function was applied to the scaled full data.
  - Tensors and a DataLoader were created for the full dataset.
  - A new model (`final_model`) was initialized and trained for the same number of epochs using the `full_loader`.
- Generating Test Predictions (see the sketch below):
  - `final_model` was set to evaluation mode (`final_model.eval()`).
  - Crucial Step: The last 120 days of the full scaled training data (`full_data_scaled[-120:]`) were used as input to predict the first 16 days of the test set, because the test set's sales are unknown and the model needs historical context from the training data.
  - The input sequence was unsqueezed to add a batch dimension.
  - Predictions were generated: `predictions = final_model(last_sequence)`.
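A sketch of the inference step, assuming `final_model`, `full_data_scaled`, and `device` from the retraining step:

```python
final_model.eval()
with torch.no_grad():
    # The last 120 training days provide the context for the 16 test days;
    # unsqueeze(0) adds the batch dimension the model expects.
    last_sequence = torch.FloatTensor(full_data_scaled[-120:]).unsqueeze(0).to(device)
    predictions = final_model(last_sequence)
```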
- Post-processing Predictions (see the sketch below):
  - Predictions were moved back to the CPU (`.cpu()`) and converted to NumPy arrays (`.numpy()`).
  - The extra batch dimension was squeezed out.
  - Inverse Scaling: `scaler.inverse_transform(predictions)` converted the scaled predictions back to original sales values.
  - Capping: Negative predictions were clipped to zero with `np.maximum(predictions, 0)`.
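A sketch of the post-processing, continuing from the inference step:

```python
import numpy as np

predictions = predictions.cpu().numpy().squeeze()    # (16, 1782)
predictions = scaler.inverse_transform(predictions)  # back to original sales units
predictions = np.maximum(predictions, 0)             # sales cannot be negative
```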
- Formatting for Submission:
  - The `test.csv` file was loaded into a DataFrame (`test_df`).
  - The `store_family` feature was recreated for the test DataFrame.
  - The unique dates from the test set were extracted.
  - A prediction DataFrame (`prediction_df`) was created with the predicted sales, indexed by date and with columns representing `store_family`.
  - `prediction_df` was "stacked" using `.stack()` and then `.reset_index()` to transform it back into a long format, matching the original training data structure.
  - `test_df` and the long prediction DataFrame were merged on `date` and `store_family` to align predictions with the correct IDs.
  - The final submission DataFrame was created by selecting the `id` and `sales` columns.
  - It was saved to `submission.csv` using `to_csv(index=False)`, as in the sketch below.
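A sketch of the submission formatting, assuming `predictions` and `df_pivoted` from the earlier steps; the lowercase `id` column follows the Kaggle dataset:

```python
import pandas as pd

test_df = pd.read_csv("data/test.csv")
test_df["store_family"] = test_df.apply(
    lambda x: f"{x['store_nbr']}_{x['family']}", axis=1
)

# Wide predictions -> long format that matches the original data layout.
dates = test_df["date"].unique()
prediction_df = pd.DataFrame(predictions, index=dates, columns=df_pivoted.columns)
long_preds = prediction_df.stack().reset_index()
long_preds.columns = ["date", "store_family", "sales"]

# Align each prediction with its submission id, then write the file.
submission = test_df.merge(long_preds, on=["date", "store_family"])
submission[["id", "sales"]].to_csv("submission.csv", index=False)
```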
6. Kaggle Submission and Conclusion
- Submission: The generated `submission.csv` file was uploaded to the Kaggle Store Sales competition.
- Result: Achieved a score of 0.48, placing 94th on the leaderboard.
- Key Takeaway: The presented approach provides a solid baseline for time series prediction using TCNs. The speaker encourages further experimentation with more complex architectures, feature engineering, and hyperparameter tuning to improve the score.
This comprehensive process demonstrates how to handle time series data, build and train a TCN model, and generate predictions for submission on a platform like Kaggle.