Skip to content

Add user-provided documentation to metadata #973

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

earangol-stripe
Copy link
Contributor

Summary

Adds CHIP to support user-provided documentation.

Why / Goal

See the CHIP.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

Checklist

  • Documentation update

Reviewers

@earangol-stripe earangol-stripe force-pushed the chip-metadata-docs branch 5 times, most recently from 55c03ef to ac6d2c2 Compare April 17, 2025 15:38
@earangol-stripe earangol-stripe marked this pull request as ready for review April 18, 2025 18:51
@earangol-stripe earangol-stripe changed the title CHIP for user-provided documentation Add user-provided documentation to metadata Apr 18, 2025
Programmatic access to feature documentation is useful for integrating with systems aimed at ML explainability (e.g. [SHAP](https://shap.readthedocs.io/en/latest/)) and feature discovery (e.g. feature catalogs).

Currently, it's possible to inspect Chronon definitions to determine _how_ a feature was computed, which is a type of documentation. However, there are other aspects of ML feature development that are not
captured and documented within the feature definition itself, for example: context, domain-specific knowledge, assumptions, caveats. This CHIP aims to fill those gaps with user-provided documentation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a valid motivation!

```thrift
struct Aggregation {
...
xx: optional MetaData metaData
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metaData has more fields that are irrelevant to the description itself.

And the original thought was to keep params non related to feature calculation in the metadata, therefore, when it coms to versioning check etc, we don't have to maintain a potential increasing list of fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metaData has more fields that are irrelevant to the description itself.

Yeah, this is something I noticed too, I added some comments about it in the section about alternatives. I think this could be solved with a separate, more lightweight and specific metadata object (perhaps FeatureMetadata?). I'm not opposed to this.

And the original thought was to keep params non related to feature calculation in the metadata

Sorry, I don't fully understand -- description is not related to feature calculation, so it would belong in metadata, no? Or do you mean that more generally you'd prefer not adding fields to the metadata?

Copy link
Collaborator

@pengyu-hou pengyu-hou Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description is not related to feature calculation, so it would belong in metadata
Yes, correct.

So we can think of a way to add the feature description to the existing metadata, rather than adding a new metadata field or feature metadata in the aggregation itself?

Copy link
Contributor Author

@earangol-stripe earangol-stripe May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like how custom tags are handled? Basically keeping the descriptions in Join metadata through the custom JSON:

def Derivation(name, expression, description):
   derivation = ttypes.Derivation(name, expression)
   if description:
      derivation.description = description
   return derivation


def Join(...):
   ...
   derivation_descriptions = {}
   for derivation in derivations:
      if derivation.description:
         derivation_descriptions[derivation.name] = derivation.description
   if derivation_descriptions:
      custom_json['derivation_descriptions'] = derivation_descriptions
   ...

I would still prefer keeping the metadata closer to the object it refers to (perhaps with a more lightweight metadata object), but I wouldn't be too opposed to this alternative approach if you like that direction better. Or let me know if you meant something else.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had a discussion with @hzding621 and @camweston-stripe today. Let's reuse the metaData struct actually to use it in the aggregation itself.

During the versioning check for join job, let's unset the meta data field before calculating the semantic hash so it won't affect the current behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the input! Could you point me to the versioning check that you're referring to? Do you mean this validator?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hzding621 hzding621 requested a review from yuli-han May 6, 2025 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants