Skip to content

Let users disable datasets renaming for namespaced pipelines #4700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 21, 2025

Conversation

ankatiyar
Copy link
Contributor

@ankatiyar ankatiyar commented May 6, 2025

Description

Fix #4566 This PR is a prototype to demo the solution

Development notes

  • Add a prefix_namespace argument to pipeline() which defaults to True but can be set to False instead of automatic renaming of the datasets

What it would look like for users:

from kedro.pipeline import Pipeline, node, pipeline
from kedro_example.pipelines.data_science.nodes import train_model
from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),            
        ],
        namespace="data_processing",
        prefix_namespace=False,
    )

TODOs if we decide to go for this solution:

  • docstrings
  • Update docs
  • Add unit tests
  • Release notes
  • target branch main or develop

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested solution looks good to me!

I think one important thing not mentioned here is what will be the API for users to enable/disable this behaviour.

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding extra explanation!

For me rename flag seems a big confusing. Maybe we can consider some other names like namespace_entries/ namespace_datasets. Not a blocker as solution itself looks good.

Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @ankatiyar!

I fully support the solution - adding an option to avoid prefixing of inputs, outputs, parameters and confirms when creating a new pipeline with a namespace makes perfect sense.

A few concerns to consider in the current PR:

  1. Naming of the option ("rename")
    The current name does not seem clear enough. Imagine seeing Pipeline(..., rename=False) — it is not obvious what is not being renamed. Is it the pipeline itself, the nodes, the datasets?
    Given that this is a niche option, the name should be more descriptive and self-explanatory.

  2. Placement of the logic in _rename()
    Currently, the logic to conditionally skip renaming is embedded in the _rename() function. However, based on the name, this function should only be responsible for renaming. It is confusing to pass an option to a renaming function that effectively tells it not to rename.
    A cleaner approach would be to make this decision at a higher level, such as in _process_dataset_names(). That function can decide:

    • If renaming is needed → call _rename()
    • If renaming is not needed → skip _rename() entirely.

@merelcht
Copy link
Member

merelcht commented May 7, 2025

I like the idea of adding this argument, but similar to what @ElenaKhaustova and @DimedS have mentioned, it's not clear to me what "rename" actually means. If rename is False, I would assume no renaming is needed, but it's the other way around 😅 Namespaces is already a difficult feature to understand, so I think we need to be more verbose with the name for this argument to make it clear what the purpose is.

Does this go in main or develop?

While the way this is implemented is a non-breaking change, I'd suggest adding it to 1.0.0 to beef up the new feature set 🙂

@astrojuanlu
Copy link
Member

Instead of rename, what about the ability of setting a custom prefix? And then users could do override_prefix="" or similar ❓

@ankatiyar ankatiyar changed the base branch from main to develop May 19, 2025 09:24
@ankatiyar ankatiyar changed the title [prototype] Let users disable datasets renaming for namespaced pipelines Let users disable datasets renaming for namespaced pipelines May 19, 2025
@ankatiyar ankatiyar marked this pull request as ready for review May 19, 2025 10:24
@ankatiyar ankatiyar requested a review from merelcht as a code owner May 19, 2025 10:24
@ankatiyar
Copy link
Contributor Author

ankatiyar commented May 19, 2025

I changed the name to prefix_namespace but I'm also not sure if it is unambiguous enough. Can I get some votes/ideas?
Some ideas/options:

  • rename
  • rename_datasets
  • prefix_namespace
  • auto_prefix_namespace
  • auto_prefix
  • override_prefix

@merelcht @astrojuanlu @ElenaKhaustova @lrcouto @DimedS

@ankatiyar ankatiyar requested a review from DimedS May 19, 2025 10:26
@DimedS
Copy link
Member

DimedS commented May 19, 2025

I changed the name to prefix_namespace but I'm also not sure if it is unambiguous enough. Can I get some votes/ideas? Some ideas/options:

  • rename
  • rename_datasets
  • prefix_namespace
  • auto_prefix_namespace
  • auto_prefix
  • override_prefix

@merelcht @astrojuanlu @ElenaKhaustova @lrcouto @DimedS

thanks @ankatiyar , I like prefix_datasets_with_namespace, for me prefix_namespace sounds like we plan to prefix namespace itself

@merelcht
Copy link
Member

Initially I thought prefix_namespace is a logical name, but exploring this a bit more it seems like Dmitry's suggestion prefix_datasets_with_namespace is more accurate.

After testing, the namespace is still applied to the node name itself, so when running viz the suggestions to run a node show something like kedro run --to-nodes='data_science.evaluate_model([regressor;X_test;y_test]) -> None'

And running kedro registry describe will show something like:

Nodes:
- data_science.preprocess_data (preprocess_data)
- preprocess_data (preprocess_data)
- preprocess_shuttles (preprocess_shuttles)
- create_model_input_table (create_model_input_table)
- data_science.split_data (split_data)
- data_science.train_model (train_model)
- data_science.evaluate_model (evaluate_model)

@ankatiyar
Copy link
Contributor Author

@merelcht The idea for this feature is to not have the users rename entries in the catalog with prefixes or have to define a mapping dictionary for the inputs/outputs. I think prefixing the name of the node is still valuable because we're still namespacing the pipeline and it should show on viz that way

@merelcht
Copy link
Member

@merelcht The idea for this feature is to not have the users rename entries in the catalog with prefixes or have to define a mapping dictionary for the inputs/outputs. I think prefixing the name of the node is still valuable because we're still namespacing the pipeline and it should show on viz that way

I agree! But that's why I think Dmitry's suggestion for the argument name is the better option.

@ankatiyar
Copy link
Contributor Author

@merelcht oh yeah, I agree! Will update the PR with the suggestion!

ankatiyar and others added 2 commits May 21, 2025 11:54
Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work, @ankatiyar !

@ankatiyar
Copy link
Contributor Author

Placement of the logic in _rename() Currently, the logic to conditionally skip renaming is embedded in the _rename() function. However, based on the name, this function should only be responsible for renaming. It is confusing to pass an option to a renaming function that effectively tells it not to rename. A cleaner approach would be to make this decision at a higher level, such as in _process_dataset_names(). That function can decide:

  • If renaming is needed → call _rename()
  • If renaming is not needed → skip _rename() entirely.

@DimedS I think the placement is not ideal but the argument goes in too deep and I think _rename is still the right place for the logic because there's other rules also that apply to the dataset names
__init__->_map_nodes->_copy_node->_process_dataset_names->_rename

@ankatiyar
Copy link
Contributor Author

ankatiyar commented May 21, 2025

PS will do docs in a separate PR it's too spread across
Made some docs updates in this PR itself

Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar requested a review from yetudada as a code owner May 21, 2025 14:55
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ankatiyar this looks great 👍

Makes sense to do the docs separately. That way you can also make sure they are done with the new formatting etc. (cc: @Huongg )

@ankatiyar ankatiyar merged commit 4570923 into develop May 21, 2025
40 of 41 checks passed
@ankatiyar ankatiyar deleted the prototype-namespace-renaming branch May 21, 2025 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants