Skip to content

Add user-provided documentation to metadata #973

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 69 additions & 0 deletions proposals/CHIP-3.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# CHIP-3: User-provided documentation for feature definitions

https://github.com/airbnb/chronon/issues/975

## Motivation

Programmatic access to feature documentation is useful for integrating with systems aimed at ML explainability (e.g. [SHAP](https://shap.readthedocs.io/en/latest/)) and feature discovery (e.g. feature catalogs).

Currently, it's possible to inspect Chronon definitions to determine _how_ a feature was computed, which is a type of documentation. However, there are other aspects of ML feature development that are not
captured and documented within the feature definition itself, for example: context, domain-specific knowledge, assumptions, caveats. This CHIP aims to fill those gaps with user-provided documentation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a valid motivation!


## Proposed Change

There are 3 main changes in this proposal:

1. Add a `description` field to `MetaData` in the Thrift API.

```thrift
struct MetaData {
...
xx: optional string description
}
```

2. Add a `metaData` field to `Aggregation` and `Derivation` in the Thrift API.

```thrift
struct Aggregation {
...
xx: optional MetaData metaData
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metaData has more fields that are irrelevant to the description itself.

And the original thought was to keep params non related to feature calculation in the metadata, therefore, when it coms to versioning check etc, we don't have to maintain a potential increasing list of fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metaData has more fields that are irrelevant to the description itself.

Yeah, this is something I noticed too, I added some comments about it in the section about alternatives. I think this could be solved with a separate, more lightweight and specific metadata object (perhaps FeatureMetadata?). I'm not opposed to this.

And the original thought was to keep params non related to feature calculation in the metadata

Sorry, I don't fully understand -- description is not related to feature calculation, so it would belong in metadata, no? Or do you mean that more generally you'd prefer not adding fields to the metadata?

Copy link
Collaborator

@pengyu-hou pengyu-hou Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description is not related to feature calculation, so it would belong in metadata
Yes, correct.

So we can think of a way to add the feature description to the existing metadata, rather than adding a new metadata field or feature metadata in the aggregation itself?

Copy link
Contributor Author

@earangol-stripe earangol-stripe May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like how custom tags are handled? Basically keeping the descriptions in Join metadata through the custom JSON:

def Derivation(name, expression, description):
   derivation = ttypes.Derivation(name, expression)
   if description:
      derivation.description = description
   return derivation


def Join(...):
   ...
   derivation_descriptions = {}
   for derivation in derivations:
      if derivation.description:
         derivation_descriptions[derivation.name] = derivation.description
   if derivation_descriptions:
      custom_json['derivation_descriptions'] = derivation_descriptions
   ...

I would still prefer keeping the metadata closer to the object it refers to (perhaps with a more lightweight metadata object), but I wouldn't be too opposed to this alternative approach if you like that direction better. Or let me know if you meant something else.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had a discussion with @hzding621 and @camweston-stripe today. Let's reuse the metaData struct actually to use it in the aggregation itself.

During the versioning check for join job, let's unset the meta data field before calculating the semantic hash so it won't affect the current behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the input! Could you point me to the versioning check that you're referring to? Do you mean this validator?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
struct Derivation {
...
xx: optional MetaData metaData
}
```

3. Update Chronon's Python API to handle an optional `description` parameter for the following objects: `Join`, `GroupBy`, `ExternalSource`, `ContextualSource`, `StagingQuery`, `Aggregation`, `Derivation`. When present, it will be passed through to the enclosed `MetaData`. Example implementation for `Derivation` (the simplest one):

```python
def Derivation(name: str, expression: str, description: Optional[str] = None) -> ttypes.Derivation:
...
metadata = ttypes.MetaData(description=description) if description else None
return ttypes.Derivation(name, expression, metadata)
```

## New or Changed Public Interfaces

The Thrift API will change. However, all the changes to the definition are additive, no existing fields will be touched.

There will be an effect (Chronon object diffs) on existing implementations that happen to coincidentally pass `description` as `kwargs`, since those arbitrary params get thrown into `MetaData.customJson`.
However, this is not a public API contract and would not be expected to have an effect on feature computation.

## Rejected Alternatives

- Support a `description` parameter at the Python API level without changes to the Thrift definitions. In this implementation, the descriptions would be collected and bubbled up to the top-level object (e.g. `Join` or `GroupBy`), similar to how `tags` are handled ([code](https://github.com/airbnb/chronon/blob/3e138e86d9922a6742709adc69b9b6ccbd18852c/api/py/ai/chronon/group_by.py#L529)).
- Pros
- Consistency with implementation for `tags`.
- No changes to the public Thrift API.
- Cons
- Obscures support for feature documentation, since data in `customJson` looks adhoc and generally less discoverable.
- Would require some sort of mapping from objects to descriptions in the top-level metadata. E.g. mapping a description to its corresponding derivation, perhaps through output columns. This could be brittle and significantly increases the size of data in `customJson`.
- Bubbling up parameters via dynamic and undocumented fields (as it's done with `tags`, which are not part of the Thrift definition for `Aggregation`) is less maintainable as things may easily break if code is moved around without handling those fields correctly.
- Defining a separate Thrift struct for holding documentation.
- Pros
- Isolation from existing `MetaData` definition, which has a lot of fields that are only relevant in specific contexts.
- Cons
- Slippery slope towards having various diffent types of metadata.
- Unclear what the actual definition would be and the boundary with the generic `MetaData` definition. After all, data description is a type of metadata.