This project involves data exploration, manipulation, modeling, and visualization of sales and location data. The goal is to analyze the data, build predictive models, and visualize the results to gain insights.
The data exploration phase is conducted in the `0_data_exploration.ipynb` notebook. This notebook includes:
- Import of the necessary libraries such as `pandas`, `plotly`, and `ydata_profiling`.
- Initial data profiling using `ProfileReport`.
- Visual inspection of the feature distributions, including notes on data quality.
- Parsing, splitting, and visualization of geometric location data (`lat`, `lon`).
- Removal of duplicates in the transaction data.
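The snippet below is a minimal sketch of these exploration steps, assuming the transactions live in a CSV at `data/transactions.csv` and the raw coordinates sit in a single comma-separated `coordinates` column; both the path and the column name are illustrative, not taken from the project.

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw transaction data (path is an assumption).
df = pd.read_csv("data/transactions.csv")

# Automated profiling report: distributions, missing values, correlations.
ProfileReport(df, title="Transactions profile").to_file("profile_report.html")

# Split an assumed comma-separated "coordinates" column into numeric lat/lon.
df[["lat", "lon"]] = df["coordinates"].str.split(",", expand=True).astype(float)

# Drop exact duplicate transactions.
df = df.drop_duplicates()
```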
The initial data manipulation processes are detailed in the `1_data_manipulation.ipynb` notebook. Key components include:
- Use of `pandas` for data manipulation tasks such as filtering, grouping, and aggregating data.
- Application of custom functions to transform data columns, such as converting timestamps to different formats.
- Creation of new features to enhance the dataset, such as extracting day, month, and year from timestamps.
- Merging datasets for analysis and modeling.
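As a rough illustration of the manipulation steps above, assuming a `timestamp` column in the transactions, a `location_id` key shared with a locations table, and a `sales` column (all names are illustrative):

```python
import pandas as pd

# Assumed input files and column names; adjust to the actual project data.
transactions = pd.read_csv("data/transactions.csv")
locations = pd.read_csv("data/locations.csv")

# Convert raw timestamps to pandas datetimes.
transactions["timestamp"] = pd.to_datetime(transactions["timestamp"])

# Derive simple calendar features from the timestamp.
transactions["day"] = transactions["timestamp"].dt.day
transactions["month"] = transactions["timestamp"].dt.month
transactions["year"] = transactions["timestamp"].dt.year

# Merge transactions with location metadata for downstream modeling.
merged = transactions.merge(locations, on="location_id", how="left")

# Example aggregation: monthly sales per location.
monthly = (
    merged.groupby(["location_id", "year", "month"])["sales"]
    .sum()
    .reset_index()
)
```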
The modeling benchmark is detailed in the `2_modelling_benchmark.ipynb` notebook. Key components include:
- Data preparation:
  - Load and preprocess transaction data.
  - Handle missing values by imputing 'Online' for missing location data.
- Use of `CatBoostRegressor` for sales forecasting.
- Evaluation of model performance using MAE and MAPE.
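A hedged sketch of the benchmark setup follows; the column names (`location`, `sales`, calendar features) and the plain train/test split are assumptions standing in for the notebook's exact protocol.

```python
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Assumed prepared dataset and column names.
df = pd.read_csv("data/prepared_transactions.csv")

# Missing location data is treated as online sales.
df["location"] = df["location"].fillna("Online")

features = ["location", "day", "month", "year"]
target = "sales"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

# CatBoost handles categorical features natively via cat_features.
model = CatBoostRegressor(cat_features=["location"], verbose=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAE: ", mean_absolute_error(y_test, pred))
print("MAPE:", mean_absolute_percentage_error(y_test, pred))
```

For forecasting, a time-based split would be more appropriate than the random split used here; the random split only keeps the sketch short.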
The `vis_utils.py` file provides utility functions for data visualization. Notably, it includes:
- Function `stat_by_month`: generates subplots of box plots for each year to visualize the distribution of a target column over time.
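The actual implementation lives in `vis_utils.py`; the sketch below only illustrates how such a function could be built with `plotly`, assuming the input frame already carries `year` and `month` columns.

```python
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots


def stat_by_month(df: pd.DataFrame, target: str) -> go.Figure:
    """Box plots of `target` per month, one subplot row per year (illustrative sketch)."""
    years = sorted(df["year"].unique())
    fig = make_subplots(rows=len(years), cols=1,
                        subplot_titles=[str(y) for y in years])
    for i, year in enumerate(years, start=1):
        yearly = df[df["year"] == year]
        fig.add_trace(
            go.Box(x=yearly["month"], y=yearly[target], name=str(year)),
            row=i, col=1,
        )
    fig.update_layout(height=300 * len(years), showlegend=False)
    return fig
```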
The `time_utils.py` file provides utility functions for time manipulation. Notably, it includes:
- Function `break_timestamp`: parses a timestamp column and creates new columns for date, day, weekday, month, season, and year.
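Again as an illustration only (the real function is defined in `time_utils.py`), one possible shape of `break_timestamp`, assuming a simple northern-hemisphere month-to-season mapping:

```python
import pandas as pd

# Assumed month-to-season mapping (northern-hemisphere convention).
_SEASONS = {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer",
            9: "autumn", 10: "autumn", 11: "autumn"}


def break_timestamp(df: pd.DataFrame, column: str = "timestamp") -> pd.DataFrame:
    """Derive date, day, weekday, month, season, and year columns (illustrative sketch)."""
    ts = pd.to_datetime(df[column])
    out = df.copy()
    out["date"] = ts.dt.date
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.day_name()
    out["month"] = ts.dt.month
    out["season"] = ts.dt.month.map(_SEASONS)
    out["year"] = ts.dt.year
    return out
```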
- The data exploration phase highlighted potential issues with data quality, particularly in location data.
- The benchmark model using CatBoost provided a reasonable starting point for sales forecasting, with MAPE ranging from 7% to 20%.
- Visualization utilities facilitate the analysis of data distributions over time, aiding in the identification of trends and anomalies.
- Thorough error analysis to identify volatile locations and potential reasons for discrepancies.
- A session with domain experts to clarify basic questions, such as how promotions data can be obtained and what the customer ID represents.
- Expand visualization capabilities to include more detailed insights into sales trends and customer behavior.
- Python 3.11.11
- Libraries: pandas, plotly, ydata_profiling, catboost, sklearn