Integrate DBT to Augur for Analytical Data Transformation #3048

Open
wants to merge 7 commits into
base: dev

Conversation

AllenHsm

Description
This PR integrates dbt (Data Build Tool) into Augur to enable analytical transformations on the database. It creates a tiny "hello world" dbt project under Augur.

  • Added profiles.yml to configure Augur’s Postgres database for dbt. It reads the environment variables set by Augur.
  • Created a Hello World model and test for dbt: a model called repo_activity.sql aggregates commit and issue counts per repository, and a simple test for that model is also added.
  • Added a default dbt_project.yml, which configures dbt and guides its execution.
  • Integrated dbt with the Augur CLI to allow running dbt directly within Augur. If this step seems unnecessary, please let me know.
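
To make the aggregation concrete, here is a minimal sketch of the kind of per-repository rollup that repo_activity.sql is described as performing. It is not the model from this PR: it uses an in-memory SQLite database, and the table and column names (repo, commits, issues, repo_id, commit_count, issue_count) are simplified assumptions rather than Augur's real schema.

```python
import sqlite3


def make_activity_demo_db():
    """Create an in-memory database with hypothetical, simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE repo (repo_id INTEGER PRIMARY KEY);
        CREATE TABLE commits (commit_id INTEGER PRIMARY KEY, repo_id INTEGER);
        CREATE TABLE issues (issue_id INTEGER PRIMARY KEY, repo_id INTEGER);
        INSERT INTO repo VALUES (1), (2);
        INSERT INTO commits (repo_id) VALUES (1), (1);
        INSERT INTO issues (repo_id) VALUES (1);
        """
    )
    return conn


def build_repo_activity(conn):
    """Materialize commit and issue counts per repository as a physical table."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS repo_activity;
        CREATE TABLE repo_activity AS
        SELECT r.repo_id,
               (SELECT COUNT(*) FROM commits c WHERE c.repo_id = r.repo_id) AS commit_count,
               (SELECT COUNT(*) FROM issues i WHERE i.repo_id = r.repo_id) AS issue_count
        FROM repo r;
        """
    )
    query = "SELECT repo_id, commit_count, issue_count FROM repo_activity ORDER BY repo_id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in build_repo_activity(make_activity_demo_db()):
        print(row)
```

The counts are computed with correlated subqueries so that joining commits and issues in the same query cannot multiply rows; the actual model may well be written differently.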

This PR fixes #2295

Notes for Reviewers

  1. DBT requires a separate Python virtual environment to prevent dependency conflicts. After setting up the new virtual environment, you can run dbt in the original augur virtual environment.

    • Create a new virtual environment

      python3 -m venv ~/.virtualenvs/dbt_venv
    • Activate the virtual environment

      source ~/.virtualenvs/dbt_venv/bin/activate  # Mac/Linux
      dbt_venv\Scripts\activate  # Windows (run from the directory that contains dbt_venv)
    • Install dbt and the Postgres adapter

      pip install dbt-core dbt-postgres
  2. Before running dbt, check that profiles.yml matches your PostgreSQL setup. It is located at augur/augur/application/dbt/profiles.yml. By default, it reads environment variables such as AUGUR_DB_PASSWORD. Modify the file if needed to include your database credentials.

  3. Once the environment and profiles are set up, run DBT within Augur:

    • Check DBT configuration
      augur dbt debug    
    • Run DBT models
      augur dbt run    
    • Run DBT tests
      augur dbt test   
  4. Expected outputs:

  • Running augur dbt run should generate a table repo_activity, showing commit and issue counts per repository.
  • The results should be accessible in the public.repo_activity table inside the Augur database.
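
Because step 2 depends on environment variables, a pre-flight check can save a failed augur dbt debug run. The sketch below is only an illustration: AUGUR_DB_PASSWORD is the one variable named in this PR, and the other names in REQUIRED_VARS are assumptions that should be replaced with whatever profiles.yml actually reads.

```python
import os

# Only AUGUR_DB_PASSWORD is named in the PR description.
# The remaining names are hypothetical placeholders.
REQUIRED_VARS = (
    "AUGUR_DB_HOST",
    "AUGUR_DB_USER",
    "AUGUR_DB_PASSWORD",
    "AUGUR_DB_NAME",
)


def missing_db_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_db_vars()
    if missing:
        print("Set these variables before running dbt: " + ", ".join(missing))
    else:
        print("All expected database variables are set.")
```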

Signed commits

  • Yes, I signed my commits.

P.S. The commits are shown as "Unverified" because my local git user email was previously set to my .edu address, while my GPG key was set up with my personal email. I have since corrected the local git user email.

@AllenHsm AllenHsm marked this pull request as draft March 12, 2025 17:44
@AllenHsm AllenHsm marked this pull request as ready for review March 12, 2025 17:45
@sgoggins
Member

@AllenHsm : What is the benefit to Augur users?

@AllenHsm
Author

> @AllenHsm : What is the benefit to Augur users?

Thanks for the question! I think dbt has the potential to make it easier for users to work with Augur data, because dbt lets users build clean, reusable views so that we do not have to write complex SQL every time. It also adds testing and makes analytics easier to scale.

@sgoggins sgoggins requested a review from Copilot April 1, 2025 19:24
@sgoggins sgoggins self-assigned this Apr 1, 2025
@sgoggins sgoggins added the feature-request Request for a new feature in Augur label Apr 1, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This PR integrates DBT into Augur to enable analytical transformations on the database by adding a basic “hello world” DBT project.

  • Added configuration files (dbt_project.yml, profiles.yml) to set up DBT with Augur’s database.
  • Introduced a model (repo_activity) with accompanying tests and integrated DBT commands into the Augur CLI.

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| dbt_project.yml | Defines the DBT project configuration for the Augur project |
| augur/application/dbt/tests/repo_activity.yml | Adds a test file for the repo_activity model |
| augur/application/dbt/profiles.yml | Configures the database connection parameters using environment variables |
| augur/application/cli/dbt.py | Integrates DBT commands into the Augur CLI |

Files not reviewed (1)
  • augur/application/dbt/models/repo_activity.sql: Language not supported

@cli.command("run")
def run_dbt():
    """Run DBT models."""
    if run_dbt_command(["run", "--profiles-dir", "/augur/application/dbt"]):

Copilot AI Apr 1, 2025

The DBT commands in the CLI functions include a manually specified '--profiles-dir' parameter while run_dbt_command already appends this flag using the computed profiles path. Consider removing the hardcoded '--profiles-dir' arguments from the CLI command invocations to avoid duplication and potential conflicts.

Suggested change
- if run_dbt_command(["run", "--profiles-dir", "/augur/application/dbt"]):
+ if run_dbt_command(["run"]):

Member

@copilot

> The DBT commands in the CLI functions include a manually specified '--profiles-dir' parameter while run_dbt_command already appends this flag using the computed profiles path. Consider removing the hardcoded '--profiles-dir' arguments from the CLI command invocations to avoid duplication and potential conflicts.

I do not believe we can trust dbt to accurately compute the profile path to where we want it to be.

Author

@copilot

> > The DBT commands in the CLI functions include a manually specified '--profiles-dir' parameter while run_dbt_command already appends this flag using the computed profiles path. Consider removing the hardcoded '--profiles-dir' arguments from the CLI command invocations to avoid duplication and potential conflicts.
>
> I do not believe we can trust dbt to accurately compute the profile path to where we want it to be.

I just read through the code and found that the profile path is already appended at line 26:
result = subprocess.run([dbt_executable] + command + ["--profiles-dir", dbt_profiles_path], check=True)
So passing the path again at line 35, where run_dbt_command is called, is redundant.
I think it would be better to delete + ["--profiles-dir", dbt_profiles_path] from line 26, because passing the path explicitly in the call at line 35 is more straightforward to understand and easier to modify.
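
One way to reconcile both positions is to let the caller's explicit --profiles-dir win and only fall back to a default when none is given. The sketch below is an assumption about how that could look, not the code in this PR: build_dbt_argv and DEFAULT_PROFILES_DIR are hypothetical names, while --profiles-dir is dbt's own flag.

```python
import subprocess

# Hypothetical default; the PR computes its own dbt_profiles_path.
DEFAULT_PROFILES_DIR = "augur/application/dbt"


def build_dbt_argv(command, profiles_dir=DEFAULT_PROFILES_DIR, dbt_executable="dbt"):
    """Build the dbt argv, adding --profiles-dir only if the caller did not pass it."""
    argv = [dbt_executable, *command]
    already_set = any(
        arg == "--profiles-dir" or arg.startswith("--profiles-dir=")
        for arg in command
    )
    if not already_set:
        argv += ["--profiles-dir", str(profiles_dir)]
    return argv


def run_dbt_command(command):
    """Run dbt and return True on success, False if dbt fails or is not installed."""
    try:
        subprocess.run(build_dbt_argv(command), check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False
    return True
```

With this shape, run_dbt_command(["run"]) and run_dbt_command(["run", "--profiles-dir", "/some/path"]) both produce a command line that names the profiles directory exactly once.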

@sgoggins
Member

sgoggins commented Apr 1, 2025

@AllenHsm : thank you for addressing this issue! I am curious if there are some instructions for manifesting the "hello world" you could provide? (i.e., what are the steps for use? )

Member

@sgoggins sgoggins left a comment

Could you provide some instructions on how to use this?

@@ -0,0 +1,25 @@
-- SPDX-License-Identifier: MIT
Member

@AllenHsm : So dbt would call this function in the hello world? Or would the materialized view be created automatically?

Author

@sgoggins Yes, when the user calls "augur dbt run", dbt first reads dbt_project.yml and scans the path given at line 15: model-paths: ["augur/application/dbt/models"]. It then executes all the SQL files in the models folder.
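
As a rough illustration of that discovery step, the sketch below lists the SQL files found under a set of model paths. It is not how dbt is implemented internally, and find_model_files is a hypothetical helper; it only shows which files a model-paths setting like the one above would expose.

```python
import tempfile
from pathlib import Path


def find_model_files(project_root, model_paths):
    """Return every .sql file under the configured model paths, sorted within each path."""
    root = Path(project_root)
    found = []
    for rel in model_paths:
        base = root / rel
        if base.is_dir():  # silently skip model paths that do not exist
            found.extend(sorted(base.rglob("*.sql")))
    return found


if __name__ == "__main__":
    # Build a throwaway project layout that mirrors the path quoted above.
    with tempfile.TemporaryDirectory() as tmp:
        models = Path(tmp) / "augur/application/dbt/models"
        models.mkdir(parents=True)
        (models / "repo_activity.sql").write_text("select 1 as placeholder\n")
        for path in find_model_files(tmp, ["augur/application/dbt/models"]):
            print(path.name)
```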

@sgoggins sgoggins requested a review from Copilot April 1, 2025 19:30
@Copilot Copilot AI left a comment

Pull Request Overview

This PR integrates DBT into Augur to enable analytical data transformations by adding a basic "Hello World" DBT project and CLI commands for running, debugging, testing, and compiling DBT models.

  • Added dbt_project.yml with project configuration and model settings.
  • Added repo_activity.sql test configuration in repo_activity.yml along with a corresponding profiles.yml for database connection.
  • Integrated DBT commands into Augur CLI through a new dbt.py file.

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| dbt_project.yml | Adds the DBT project configuration and model settings. |
| augur/application/dbt/tests/repo_activity.yml | Provides a test configuration for the repo_activity model. |
| augur/application/dbt/profiles.yml | Configures database connection settings for DBT in Augur. |
| augur/application/cli/dbt.py | Introduces CLI commands to run, debug, test, and compile DBT models. |

Files not reviewed (1)
  • augur/application/dbt/models/repo_activity.sql: Language not supported

@cli.command("run")
def run_dbt():
    """Run DBT models."""
    if run_dbt_command(["run", "--profiles-dir", "/augur/application/dbt"]):

Copilot AI Apr 1, 2025

The CLI command 'run_dbt' explicitly passes a '--profiles-dir' argument even though run_dbt_command automatically appends it using a computed path. This results in the argument being specified twice, which may cause conflicts. Consider removing the redundant '--profiles-dir' parameter from the command list in this function (and similarly in the other CLI commands).

Suggested change
- if run_dbt_command(["run", "--profiles-dir", "/augur/application/dbt"]):
+ if run_dbt_command(["run"]):

@AllenHsm
Author

AllenHsm commented Apr 2, 2025

> Could you provide some instructions on how to use this?

Of course! Basically, users only need two commands:

augur dbt run

and

augur dbt test

augur dbt run

When users call augur dbt run, dbt reads the config, finds the path to the models folder, and runs all the SQL files at that location. So the SQL files in models are what actually determine the behavior of dbt.

Based on the SQL code, dbt generates a view or table in the Augur Postgres database. In this hello world project, for example, the only model, repo_activity.sql, calculates the total number of commits and issues for each repository and stores the result as a physical table in Augur's database (because the SQL declares {{ config(materialized='table') }}).

augur dbt test

For the test command, the yml files under the tests folder make assertions about the results the models generate. After you run the test command, dbt reports whether each assertion holds.
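
To show what such an assertion amounts to, here is a small sketch of a not_null style check, where the test passes only if no row violates the condition. The tests defined in repo_activity.yml are not shown in this conversation, so the choice of not_null and the column names below are assumptions, and SQLite stands in for Postgres.

```python
import sqlite3


def count_null_rows(conn, table, column):
    """Count the rows that would fail a not_null check on the given column."""
    # Identifiers are interpolated directly, so pass only trusted, hard-coded names.
    query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    return conn.execute(query).fetchone()[0]


def make_test_demo_db():
    """Create a hypothetical repo_activity table with one NULL issue_count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repo_activity "
        "(repo_id INTEGER, commit_count INTEGER, issue_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO repo_activity VALUES (?, ?, ?)",
        [(1, 10, 2), (2, 0, None)],
    )
    return conn


if __name__ == "__main__":
    conn = make_test_demo_db()
    for column in ("repo_id", "issue_count"):
        failures = count_null_rows(conn, "repo_activity", column)
        print(column, "PASS" if failures == 0 else f"FAIL ({failures} rows)")
```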

Users can learn more about creating their own models, corresponding tests, and even other features of dbt here: https://docs.getdbt.com/docs/build/sql-models

Since this PR is mostly about giving dbt a try, I did not add other dbt features such as macros. Features that work in standalone dbt should also work once dbt is integrated into Augur.

Thank you @sgoggins for reviewing my PR. I am not sure whether my explanation meets your expectations, so please let me know if any part is still too vague.

Labels
feature-request Request for a new feature in Augur