Cache dictionary inputs processing in Node class #4688


Merged

Conversation

@ArnoutVerboven (Contributor) commented Apr 25, 2025

Description

This PR addresses a performance issue in the Kedro framework related to the Node class. When nodes are provided with dictionary inputs, the inputs property recalculates _dict_inputs_to_list on every access, which becomes inefficient when running pipelines with many nodes. This optimization caches the conversion result to avoid redundant computation.

Development notes

I've made the following changes:
1. Added a _dict_inputs_as_list instance variable in Node.__init__ to cache the result of _dict_inputs_to_list for dictionary inputs
2. Modified the inputs property to use this cached value instead of recomputing it on every access
3. Updated the func setter to recalculate the cached value when the function is changed

EDIT:

  1. Changed @property to @cached_property for the inputs method to cache the result
  2. Updated the func setter to clear the cached property with a simple del self.inputs when the function changes

These changes are minimal and tested, giving a significant performance improvement in a case with ~1k nodes.
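The pattern described above can be sketched as follows. This is a minimal illustration, not Kedro's actual Node class: _dict_inputs_to_list here is a simplified stand-in that maps a {parameter: dataset} dict onto the function's signature order, and the cache invalidation uses a pop on the instance dict rather than a bare del (del self.inputs would raise AttributeError if the property was never accessed).

```python
import inspect
from functools import cached_property


def _dict_inputs_to_list(func, inputs: dict) -> list[str]:
    # Simplified stand-in for Kedro's helper: order dataset names
    # according to the function's parameter order.
    bound = inspect.signature(func).bind(**inputs)
    return [str(v) for v in bound.arguments.values()]


class Node:
    """Minimal sketch of the cached_property approach in this PR."""

    def __init__(self, func, inputs):
        self._func = func
        self._inputs = inputs

    @cached_property
    def inputs(self) -> list[str]:
        # Computed once per instance; later accesses return the cached list.
        if isinstance(self._inputs, dict):
            return _dict_inputs_to_list(self._func, self._inputs)
        return list(self._inputs) if self._inputs else []

    @property
    def func(self):
        return self._func

    @func.setter
    def func(self, func):
        self._func = func
        # Invalidate the cache: the ordering depends on the new signature.
        self.__dict__.pop("inputs", None)
```

Repeated accesses to node.inputs now return the same cached list object, and reassigning func forces a recompute on the next access.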

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@astrojuanlu (Member) commented Apr 25, 2025

First of all: thanks a lot for this PR @ArnoutVerboven ! 🙌🏼

Comment from the peanut gallery: could we just use @functools.cached_property in this case? Node objects are immutable from a public API perspective, are they not?

Comment on lines 130 to 134:

    self._inputs = inputs
    # Cache the list version of inputs dictionary to avoid recomputing in the inputs property
    self._dict_inputs_as_list = None
    if isinstance(self._inputs, dict):
        self._dict_inputs_as_list = _dict_inputs_to_list(self._func, self._inputs)
Suggested change:

    - self._inputs = inputs
    - # Cache the list version of inputs dictionary to avoid recomputing in the inputs property
    - self._dict_inputs_as_list = None
    - if isinstance(self._inputs, dict):
    -     self._dict_inputs_as_list = _dict_inputs_to_list(self._func, self._inputs)
    + self._inputs = _dict_inputs_to_list(self._func, self._inputs) if isinstance(inputs, dict) else inputs

Any reason this wouldn't just work?

@deepyaman (Member)

> Comment from the peanut gallery: could we just use @functools.cached_property in this case? Node objects are immutable from a public API perspective, are they not?

@astrojuanlu This seems fine, but I wonder if there's any benefit of doing this over just doing the conversion when setting self._inputs? I guess the main benefit would be deferring the expensive call, in case you don't need to access node.inputs in some code paths (would need to check to see what cases that might be, I feel like node.inputs gets called pretty eagerly).
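The deferral benefit deepyaman mentions can be demonstrated with a small counter. This is a toy example with hypothetical names (expensive_conversion stands in for _dict_inputs_to_list): with cached_property, constructing many nodes performs no conversions at all, and each node pays the cost exactly once, on first access.

```python
from functools import cached_property

calls = {"n": 0}


def expensive_conversion(inputs: dict) -> list[str]:
    # Stand-in for the dict-to-list conversion; counts invocations.
    calls["n"] += 1
    return list(inputs.values())


class LazyNode:
    def __init__(self, inputs):
        self._inputs = inputs

    @cached_property
    def inputs(self) -> list[str]:
        return expensive_conversion(self._inputs)


# Building 1000 nodes triggers zero conversions...
nodes = [LazyNode({"a": f"ds_{i}"}) for i in range(1000)]
assert calls["n"] == 0

# ...and two full passes over node.inputs cost one conversion per node,
# not one per access.
_ = [n.inputs for n in nodes]
_ = [n.inputs for n in nodes]
assert calls["n"] == 1000
```

Converting eagerly in __init__ would instead pay all 1000 conversions at construction time, even on code paths that never read node.inputs.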

@ArnoutVerboven (Contributor, Author) commented Apr 26, 2025

I adjusted the code according to @astrojuanlu 's suggestion of using cached_property.

> Node objects are immutable from a public API perspective, are they not?

I see there is a setter for func so I'm clearing the inputs cache in the setter.

@deepyaman I went with this approach as it still achieves the desired performance improvement with minimal changes to the other logic, but if you prefer updating _inputs in Node.__init__() directly, feel free to open it as a separate PR and close this one!

@deepyaman (Member) left a comment

> I see there is a setter for func so I'm clearing the inputs cache in the setter.
>
> @deepyaman I went with this approach as it still achieves the desired performance improvement with minimal changes to the other logic, but if you prefer updating _inputs in Node.__init__() directly, feel free to open it as a separate PR and close this one!

No super-strong preference. :) Congrats on your first contribution!

@deepyaman (Member)

CI failure is unrelated, addressing in #4695

@astrojuanlu (Member)

Thanks @ArnoutVerboven ! Did you test this code with your use case to double check that the performance is still good?

@ArnoutVerboven (Contributor, Author)

Yes @astrojuanlu! Still achieving the same speed-up using cached_property (130 s -> 3 s spent in 'inputs' across my pipeline)

@merelcht (Member) left a comment

Thanks @ArnoutVerboven this looks good to me! Loved reading your blog post too 😄

Can you add a short sentence about this change to the release notes and add yourself as a contributor?

Signed-off-by: Arnout Verboven <[email protected]>
@ArnoutVerboven (Contributor, Author)

> Thanks @ArnoutVerboven this looks good to me! Loved reading your blog post too 😄
>
> Can you add a short sentence about this change to the release notes and add yourself as a contributor?

Done :)

@merelcht merelcht enabled auto-merge (squash) May 7, 2025 11:14
@merelcht merelcht merged commit 507ebe4 into kedro-org:main May 7, 2025
40 of 41 checks passed