The Software Development Lifecycle within a Modern Data Engineering Framework

For a more in-depth write-up of this project, visit my Medium page!

While the short-term goal is to learn these tools, the greater goal is to understand and flesh out what the full development and deployment cycle can look like for orchestrating a data platform and deploying custom pipelines. The transformation layer already has a great process with dbt: local development, testing, versioning/branching, CI/CD, code review, separation of dev and prod, project structure/cohesion, etc. But how can we apply that same process to the entire data platform, especially the 10-20% of ingestion jobs that cannot be handled by a managed tool like Airbyte or that are best served by a custom solution?

Current Status

(Screenshot of the current pipeline status, 2025-01-05.)

Dagster

  • Orchestrated ingest, transformation, and downstream dependencies (ML/Analytics) with Dagster - jairus-m/pokemon-dagster-dlt#2, jairus-m/pokemon-dagster-dlt#6
    • Developed in a dev environment and materialized in the local dagster dev server
    • Configured resources/credentials in a root .env file
    • Current Dagster folder structure (dependencies managed by uv) - https://github.com/jairus-m/dagster-dlt/pull/15
      • One code location: dagster_proj/
        • Assets: dagster_proj/assets/
        • Resources: dagster_proj/resources/__init__.py
        • Jobs: dagster_proj/jobs/__init__.py
        • Schedules: dagster_proj/schedules/__init__.py
        • Utils: dagster_proj/utils/__init__.py
        • Definitions: dagster_proj/__init__.py
      • The structure is experimental and based on the DagsterU courses
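
For context, here is a minimal sketch of what the Definitions wiring in dagster_proj/__init__.py can look like under this structure. The job, schedule, and resource names below are illustrative assumptions, not the repo's exact code:

```python
# dagster_proj/__init__.py (illustrative sketch)
from dagster import Definitions, load_assets_from_modules

from dagster_proj import assets
from dagster_proj.jobs import activities_job        # hypothetical job name
from dagster_proj.schedules import daily_schedule   # hypothetical schedule name
from dagster_proj.resources import strava_resource  # hypothetical resource name

# Collect every asset defined under dagster_proj/assets/
all_assets = load_assets_from_modules([assets])

defs = Definitions(
    assets=all_assets,
    jobs=[activities_job],
    schedules=[daily_schedule],
    resources={"strava": strava_resource},
)
```

Keeping a single Definitions object per code location like this is what lets dagster dev discover all assets, jobs, schedules, and resources in one place.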

dltHub
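
Per the project description, the ingestion layer uses dlt to load Strava activities into DuckDB. A hedged sketch of what such a pipeline can look like; the resource name, pagination, and token handling are illustrative assumptions, not the repo's actual ingestion code:

```python
# Illustrative dlt pipeline for Strava activities (names/endpoints are assumptions).
import dlt
from dlt.sources.helpers import requests


@dlt.resource(name="activities", write_disposition="append")
def strava_activities(access_token: str):
    """Yield pages of activities from the Strava REST API."""
    page = 1
    while True:
        response = requests.get(
            "https://www.strava.com/api/v3/athlete/activities",
            headers={"Authorization": f"Bearer {access_token}"},
            params={"page": page, "per_page": 200},
        )
        response.raise_for_status()
        activities = response.json()
        if not activities:
            break
        yield activities
        page += 1


pipeline = dlt.pipeline(
    pipeline_name="strava",
    destination="duckdb",
    dataset_name="activities",
)
load_info = pipeline.run(strava_activities(access_token="..."))
print(load_info)
```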

dbt-core

  • Built a dbt-core project to transform the activities data in analytics_dbt/models
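
A minimal sketch of how a dbt project like analytics_dbt can be loaded into Dagster as assets via dagster-dbt. The manifest path and asset function name are assumptions, not necessarily how this repo wires it up:

```python
# Illustrative dagster-dbt integration (paths and names are assumptions).
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("analytics_dbt")


@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def analytics_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Every dbt model in the manifest becomes a Dagster asset;
    # `dbt build` runs the models and streams events back to Dagster.
    yield from dbt.cli(["build"], context=context).stream()
```

Loading the models this way is what lets downstream assets (like the ML pipeline below) declare dependencies on individual dbt models.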

Sklearn ML Pipeline

  • Created a scikit-learn ML pipeline to predict energy expenditure for a given cycling activity
    • WIP, but the general flow of preprocessing, building the ML model, training, testing/evaluation, and prediction can be found in dagster_proj/assets/ml_analytics/energy_prediction.py (a simplified sketch of that flow follows this list)
    • This is a downstream dependency of a dbt asset materialized in DuckDB
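
As referenced above, a simplified sketch of that flow; the feature names, target, and regressor are placeholders, not the actual model in energy_prediction.py:

```python
# Illustrative scikit-learn pipeline (features/target/model are placeholders).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for activity features queried from DuckDB.
X = pd.DataFrame(
    {
        "distance": [10_000, 25_000, 40_000, 60_000],
        "moving_time": [1_800, 4_500, 7_200, 10_800],
        "total_elevation_gain": [120, 430, 900, 1_500],
    }
)
y = [300, 800, 1_400, 2_100]  # energy expenditure, e.g. kilojoules

# Preprocess (scale numeric features), then fit a regressor, in one Pipeline.
preprocessor = ColumnTransformer(
    [("scale", StandardScaler(), list(X.columns))],
    remainder="drop",
)
model = Pipeline(
    [
        ("preprocess", preprocessor),
        ("regressor", RandomForestRegressor(random_state=42)),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model.fit(X_train, y_train)
print(model.predict(X_test))  # predicted energy expenditure for held-out activities
```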

Analytics

Deployment Status

TODO:

  • Beef up the ML pipeline with dagster-mlflow for experiment tracking, model versioning, better model observability, etc.
  • Utilize Snowflake cloning and dbt Slim CI for CI
  • Implement partitions/backfilling with dlt/Dagster

Getting Started:

For local development only:

  1. Clone this repo locally
  2. Create a .env file at the root of the directory (a sketch of how these values can reach a Dagster resource follows this list):
# these are the config values for local dev and will change in branch/prod deployment
DBT_TARGET=dev
DAGSTER_ENVIRONMENT=dev
DUCKDB_DATABASE=data/dev/strava.duckdb

#strava
CLIENT_ID= 
CLIENT_SECRET=
REFRESH_TOKEN=
  3. Download uv and run uv sync
  4. Build the Python package in developer mode via uv pip install -e ".[dev]"
  5. Run Dagster locally via dagster dev (this starts both the webserver and the daemon)
  6. Materialize the pipeline!
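
As referenced in step 2, a minimal sketch of how the .env values can be surfaced to a Dagster resource via EnvVar. The StravaResource class is hypothetical, not the repo's actual resource; dagster dev picks up a root .env file automatically:

```python
# Illustrative resource wiring (the class and field names are assumptions).
from dagster import ConfigurableResource, EnvVar


class StravaResource(ConfigurableResource):
    client_id: str
    client_secret: str
    refresh_token: str


# EnvVar defers resolution until runtime, so secrets stay out of code
# and out of the serialized Dagster snapshot.
strava_resource = StravaResource(
    client_id=EnvVar("CLIENT_ID"),
    client_secret=EnvVar("CLIENT_SECRET"),
    refresh_token=EnvVar("REFRESH_TOKEN"),
)
```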

Additional Notes:

  • The refresh_token shown in the Strava UI produces an access_token that is limited in scope. Please follow these Strava Dev Docs to generate the proper refresh_token, which will then produce an access_token with the proper scopes (a sketch of the token exchange follows this list).
  • If you want to run the dbt project locally, outside of dagster, you need to add a DBT_PROFILES_DIR environment variable to the .env file and export it
    • For example, my local env var is: DBT_PROFILES_DIR=/Users/jairusmartinez/Desktop/dlt-strava/analytics_dbt
    • Yours will be: DBT_PROFILES_DIR=/PATH_TO_YOUR_CLONED_REPO_DIR/analytics_dbt
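
As referenced in the first note above, the refresh token is exchanged for a short-lived access token via Strava's standard OAuth endpoint. A minimal sketch of that exchange (this is the documented Strava flow, not necessarily the repo's exact helper):

```python
# Illustrative Strava token refresh using the standard OAuth exchange.
import os

import requests


def get_access_token() -> str:
    """Trade the long-lived refresh token for a short-lived access token."""
    response = requests.post(
        "https://www.strava.com/oauth/token",
        data={
            "client_id": os.environ["CLIENT_ID"],
            "client_secret": os.environ["CLIENT_SECRET"],
            "grant_type": "refresh_token",
            "refresh_token": os.environ["REFRESH_TOKEN"],
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]
```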

About

Refactoring my Strava pipeline to use dlt, dagster, duckdb, and dbt-core
