This project involves data exploration, manipulation, modeling, and visualization of sales and location data. The goal is to analyze the data, build predictive models, and visualize the results to gain insights.
The data exploration phase is conducted in the `0_data_exploration.ipynb` notebook. This notebook includes:
- Import of the necessary libraries such as `pandas`, `plotly`, and `ydata_profiling`.
- Initial data profiling using `ProfileReport`.
- Visual inspection of the feature distributions, including notes on data quality.
- Parsing, splitting, and visualization of geometric location data (`lat`, `lon`).
- Removal of duplicates in the transaction data.
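The snippet below is a minimal sketch of these exploration steps, assuming the transactions live in a CSV at `data/transactions.csv` and the raw coordinates sit in a single comma-separated `coordinates` column; both the path and the column name are illustrative, not taken from the project.

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw transaction data (path is an assumption).
df = pd.read_csv("data/transactions.csv")

# Automated profiling report: distributions, missing values, correlations.
ProfileReport(df, title="Transactions profile").to_file("profile_report.html")

# Split an assumed comma-separated "coordinates" column into numeric lat/lon.
df[["lat", "lon"]] = df["coordinates"].str.split(",", expand=True).astype(float)

# Drop exact duplicate transactions.
df = df.drop_duplicates()
```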
The initial data manipulation processes are detailed in the `1_data_manipulation.ipynb` notebook. Key components include:
- Use of `pandas` for data manipulation tasks such as filtering, grouping, and aggregating data.
- Application of custom functions to transform data columns, such as converting timestamps to different formats.
- Creation of new features to enhance the dataset, such as extracting day, month, and year from timestamps.
- Merging datasets for analysis and modeling.
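As a rough illustration of the manipulation steps above, assuming a `timestamp` column in the transactions, a `location_id` key shared with a locations table, and a `sales` column (all names are illustrative):

```python
import pandas as pd

# Assumed input files and column names; adjust to the actual project data.
transactions = pd.read_csv("data/transactions.csv")
locations = pd.read_csv("data/locations.csv")

# Convert raw timestamps to pandas datetimes.
transactions["timestamp"] = pd.to_datetime(transactions["timestamp"])

# Derive simple calendar features from the timestamp.
transactions["day"] = transactions["timestamp"].dt.day
transactions["month"] = transactions["timestamp"].dt.month
transactions["year"] = transactions["timestamp"].dt.year

# Merge transactions with location metadata for downstream modeling.
merged = transactions.merge(locations, on="location_id", how="left")

# Example aggregation: monthly sales per location.
monthly = (
    merged.groupby(["location_id", "year", "month"])["sales"]
    .sum()
    .reset_index()
)
```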
The modeling benchmark is detailed in the `2_modelling_benchmark.ipynb` notebook. Key components include:
- Data preparation:
  - Load and preprocess transaction data.
  - Handle missing values by imputing 'Online' for missing location data.
- Use of `CatBoostRegressor` for sales forecasting.
- Evaluation of model performance using MAE and MAPE.
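A hedged sketch of the benchmark setup follows; the column names (`location`, `sales`, calendar features) and the plain train/test split are assumptions standing in for the notebook's exact protocol.

```python
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Assumed prepared dataset and column names.
df = pd.read_csv("data/prepared_transactions.csv")

# Missing location data is treated as online sales.
df["location"] = df["location"].fillna("Online")

features = ["location", "day", "month", "year"]
target = "sales"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

# CatBoost handles categorical features natively via cat_features.
model = CatBoostRegressor(cat_features=["location"], verbose=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAE: ", mean_absolute_error(y_test, pred))
print("MAPE:", mean_absolute_percentage_error(y_test, pred))
```

For forecasting, a time-based split would be more appropriate than the random split used here; the random split only keeps the sketch short.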
The `vis_utils.py` file provides utility functions for data visualization. Notably, it includes:
- Function `stat_by_month`: generates subplots of box plots for each year to visualize the distribution of a target column over time.
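The actual implementation lives in `vis_utils.py`; the sketch below only illustrates how such a function could be built with `plotly`, assuming the input frame already carries `year` and `month` columns.

```python
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots


def stat_by_month(df: pd.DataFrame, target: str) -> go.Figure:
    """Box plots of `target` per month, one subplot row per year (illustrative sketch)."""
    years = sorted(df["year"].unique())
    fig = make_subplots(rows=len(years), cols=1,
                        subplot_titles=[str(y) for y in years])
    for i, year in enumerate(years, start=1):
        yearly = df[df["year"] == year]
        fig.add_trace(
            go.Box(x=yearly["month"], y=yearly[target], name=str(year)),
            row=i, col=1,
        )
    fig.update_layout(height=300 * len(years), showlegend=False)
    return fig
```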
The `time_utils.py` file provides utility functions for time manipulation. Notably, it includes:
- Function `break_timestamp`: parses a timestamp column and creates new columns for date, day, weekday, month, season, and year.
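Again as an illustration only (the real function is defined in `time_utils.py`), one possible shape of `break_timestamp`, assuming a simple northern-hemisphere month-to-season mapping:

```python
import pandas as pd

# Assumed month-to-season mapping (northern-hemisphere convention).
_SEASONS = {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer",
            9: "autumn", 10: "autumn", 11: "autumn"}


def break_timestamp(df: pd.DataFrame, column: str = "timestamp") -> pd.DataFrame:
    """Derive date, day, weekday, month, season, and year columns (illustrative sketch)."""
    ts = pd.to_datetime(df[column])
    out = df.copy()
    out["date"] = ts.dt.date
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.day_name()
    out["month"] = ts.dt.month
    out["season"] = ts.dt.month.map(_SEASONS)
    out["year"] = ts.dt.year
    return out
```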
- The data exploration phase highlighted potential issues with data quality, particularly in location data.
- The benchmark model using CatBoost provided a reasonable starting point for sales forecasting, with MAPE ranging from 7% to 20%.
- Visualization utilities facilitate the analysis of data distributions over time, aiding in the identification of trends and anomalies.
- Thorough error analysis to identify volatile locations and potential reasons for discrepancies.
- A session with domain experts to clarify basic questions, such as how promotions data can be obtained and what the customer ID represents.
- Expand visualization capabilities to include more detailed insights into sales trends and customer behavior.
- Python 3.11.11
- Libraries: pandas, plotly, ydata_profiling, catboost, sklearn