This repository was archived by the owner on Mar 26, 2025. It is now read-only.

ShaulAb/dropit_shopping

Project Title

This project involves data exploration, manipulation, modeling, and visualization of sales and location data. The goal is to analyze the data, build predictive models, and visualize the results to gain insights.

Data Exploration

The data exploration phase is conducted in the 0_data_exploration.ipynb notebook. This notebook includes:

  • Importing the required libraries: pandas, plotly, and ydata_profiling.
  • Initial data profiling using ProfileReport.
  • Visual inspection of the feature distributions, with notes on data quality.
  • Parsing, splitting, and visualizing geometric location data (lat, lon).
  • Removal of duplicates in transaction data.
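
As a minimal sketch of the last two steps, the snippet below drops duplicate rows and splits a combined coordinate string into numeric lat and lon columns. The column names (`location_id`, `geo`) and the "lat,lon" string format are assumptions made for illustration; the notebook's actual schema may differ.

```python
import pandas as pd

# Hypothetical sample; real column names and coordinate format are assumptions.
locations = pd.DataFrame({
    "location_id": [1, 2, 2],
    "geo": ["32.0853,34.7818", "31.7683,35.2137", "31.7683,35.2137"],
})

# Drop exact duplicate rows before any further processing.
locations = locations.drop_duplicates().reset_index(drop=True)

# Split the "lat,lon" string into two numeric columns.
coords = locations["geo"].str.split(",", expand=True)
locations["lat"] = coords[0].astype(float)
locations["lon"] = coords[1].astype(float)

print(locations[["location_id", "lat", "lon"]])
```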

Data Manipulation

The initial data manipulation processes are detailed in the 1_data_manipulation.ipynb notebook. Key components include:

  • Use of pandas for data manipulation tasks such as filtering, grouping, and aggregating data.
  • Application of custom functions to transform data columns, such as converting timestamps to different formats.
  • Creation of new features to enhance the dataset, such as extracting day, month, and year from timestamps.
  • Merging datasets for analysis and modeling.
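
The grouping, aggregation, and merging steps listed above might look roughly like the following. The table and column names (`transactions`, `locations`, `amount`, `city`) are hypothetical placeholders, not the project's real schema.

```python
import pandas as pd

# Hypothetical tables; column names are illustrative assumptions.
transactions = pd.DataFrame({
    "location_id": [1, 1, 2],
    "timestamp": ["2023-01-05 10:00", "2023-01-05 15:30", "2023-02-10 09:15"],
    "amount": [20.0, 30.0, 12.5],
})
locations = pd.DataFrame({"location_id": [1, 2], "city": ["A", "B"]})

# Derive a calendar date from the raw timestamp.
transactions["date"] = pd.to_datetime(transactions["timestamp"]).dt.date

# Aggregate to one row per location and day.
daily_sales = transactions.groupby(
    ["location_id", "date"], as_index=False
)["amount"].sum()

# Attach location attributes with a left join so no sales rows are lost.
daily_sales = daily_sales.merge(locations, on="location_id", how="left")

print(daily_sales)
```

A left join is used here so that sales rows without a matching location record are kept rather than silently dropped.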

Modeling Benchmark

The modeling benchmark is detailed in the 2_modelling_benchmark.ipynb notebook. Key components include:

  • Data Preparation:
    • Load and preprocess transaction data.
    • Handle missing values by imputing 'Online' for missing location data.
  • Use of CatBoostRegressor for sales forecasting.
  • Evaluation of model performance using mean absolute error (MAE) and mean absolute percentage error (MAPE).
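
For reference, the imputation rule and the two evaluation metrics can be written in a few lines of plain Python. This is a sketch using the standard definitions, not the notebook's own code; the helper names and sample numbers are made up for illustration, and MAPE is returned as a fraction (multiply by 100 for a percentage).

```python
def impute_location(values, fill="Online"):
    """Replace missing (None) locations with a fill label."""
    return [fill if v is None else v for v in values]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; assumes no zero actuals."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only, not results from the project.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print(impute_location(["Store A", None]))
print(mae(actual, predicted))
print(mape(actual, predicted))
```

MAPE is undefined when an actual value is zero, so days with no sales would need separate handling before this metric is applied.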

Visualization Utilities

The vis_utils.py file provides utility functions for data visualization. Notably, it includes:

  • Function stat_by_month:
    • Generates subplots of box plots for each year to visualize the distribution of a target column over time.

Time Utilities

The time_utils.py file provides utility functions for time manipulation. Notably, it includes:

  • Function break_timestamp:
    • Parses a timestamp column and creates new columns for date, day, weekday, month, season, and year.
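
A rough pandas sketch of this kind of function follows. The signature, the output column names, and the month-to-season mapping (meteorological seasons, Northern Hemisphere) are assumptions; the real break_timestamp in time_utils.py may define them differently.

```python
import pandas as pd

# Assumed season mapping (meteorological, Northern Hemisphere); the original may differ.
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def break_timestamp_sketch(df, ts_col):
    """Return a copy of df with calendar parts derived from ts_col (hypothetical signature)."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["date"] = ts.dt.date
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.day_name()
    out["month"] = ts.dt.month
    out["season"] = ts.dt.month.map(SEASONS)
    out["year"] = ts.dt.year
    return out


# Illustrative data only.
events = pd.DataFrame({"ts": ["2024-07-04 12:00:00", "2024-12-25 08:30:00"]})
parts = break_timestamp_sketch(events, "ts")
print(parts)
```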

Conclusions

  • The data exploration phase highlighted potential issues with data quality, particularly in location data.
  • The benchmark model using CatBoost provided a reasonable starting point for sales forecasting, with MAPE ranging from 7% to 20%.
  • Visualization utilities facilitate the analysis of data distributions over time, aiding in the identification of trends and anomalies.

Future Work

  • Thorough error analysis to identify volatile locations and potential reasons for discrepancies.
  • Session with domain experts to answer basic questions, such as how to obtain promotions data and what the customer ID represents.
  • Expand visualization capabilities to include more detailed insights into sales trends and customer behavior.

Requirements

  • Python 3.11.11
  • Libraries: pandas, plotly, ydata_profiling, catboost, scikit-learn (imported as sklearn)
