-
Notifications
You must be signed in to change notification settings - Fork 70
Add user-provided documentation to metadata #973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
# CHIP-3: User-provided documentation for feature definitions | ||
|
||
https://github.com/airbnb/chronon/issues/975 | ||
|
||
## Motivation | ||
|
||
Programmatic access to feature documentation is useful for integrating with systems aimed at ML explainability (e.g. [SHAP](https://shap.readthedocs.io/en/latest/)) and feature discovery (e.g. feature catalogs). | ||
|
||
Currently, it's possible to inspect Chronon definitions to determine _how_ a feature was computed, which is a type of documentation. However, there are other aspects of ML feature development that are not | ||
captured and documented within the feature definition itself, for example: context, domain-specific knowledge, assumptions, caveats. This CHIP aims to fill those gaps with user-provided documentation. | ||
|
||
## Proposed Change | ||
|
||
There are 3 main changes in this proposal: | ||
|
||
1. Add a `description` field to `MetaData` in the Thrift API. | ||
|
||
```thrift | ||
struct MetaData { | ||
... | ||
xx: optional string description | ||
} | ||
``` | ||
|
||
2. Add a `metaData` field to `Aggregation` and `Derivation` in the Thrift API. | ||
|
||
```thrift | ||
struct Aggregation { | ||
... | ||
xx: optional MetaData metaData | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The And the original thought was to keep params non related to feature calculation in the metadata, therefore, when it coms to versioning check etc, we don't have to maintain a potential increasing list of fields There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yeah, this is something I noticed too, I added some comments about it in the section about alternatives. I think this could be solved with a separate, more lightweight and specific metadata object (perhaps
Sorry, I don't fully understand -- description is not related to feature calculation, so it would belong in metadata, no? Or do you mean that more generally you'd prefer not adding fields to the metadata? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
So we can think of a way to add the feature description to the existing metadata, rather than adding a new metadata field or feature metadata in the aggregation itself? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean something like how custom tags are handled? Basically keeping the descriptions in def Derivation(name, expression, description):
derivation = ttypes.Derivation(name, expression)
if description:
derivation.description = description
return derivation
def Join(...):
...
derivation_descriptions = {}
for derivation in derivations:
if derivation.description:
derivation_descriptions[derivation.name] = derivation.description
if derivation_descriptions:
custom_json['derivation_descriptions'] = derivation_descriptions
... I would still prefer keeping the metadata closer to the object it refers to (perhaps with a more lightweight metadata object), but I wouldn't be too opposed to this alternative approach if you like that direction better. Or let me know if you meant something else. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. had a discussion with @hzding621 and @camweston-stripe today. Let's reuse the During the versioning check for join job, let's unset the meta data field before calculating the semantic hash so it won't affect the current behavior. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great, thanks for the input! Could you point me to the versioning check that you're referring to? Do you mean this validator? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
} | ||
struct Derivation { | ||
... | ||
xx: optional MetaData metaData | ||
} | ||
``` | ||
|
||
3. Update Chronon's Python API to handle an optional `description` parameter for the following objects: `Join`, `GroupBy`, `ExternalSource`, `ContextualSource`, `StagingQuery`, `Aggregation`, `Derivation`. When present, it will be passed through to the enclosed `MetaData`. Example implementation for `Derivation` (the simplest one): | ||
|
||
```python | ||
def Derivation(name: str, expression: str, description: Optional[str] = None) -> ttypes.Derivation: | ||
... | ||
metadata = ttypes.MetaData(description=description) if description else None | ||
return ttypes.Derivation(name, expression, metadata) | ||
``` | ||
|
||
## New or Changed Public Interfaces | ||
|
||
The Thrift API will change. However, all the changes to the definition are additive, no existing fields will be touched. | ||
|
||
There will be an effect (Chronon object diffs) on existing implementations that happen to coincidentally pass `description` as `kwargs`, since those arbitrary params get thrown into `MetaData.customJson`. | ||
However, this is not a public API contract and would not be expected to have an effect on feature computation. | ||
|
||
## Rejected Alternatives | ||
|
||
- Support a `description` parameter at the Python API level without changes to the Thrift definitions. In this implementation, the descriptions would be collected and bubbled up to the top-level object (e.g. `Join` or `GroupBy`), similar to how `tags` are handled ([code](https://github.com/airbnb/chronon/blob/3e138e86d9922a6742709adc69b9b6ccbd18852c/api/py/ai/chronon/group_by.py#L529)). | ||
- Pros | ||
- Consistency with implementation for `tags`. | ||
- No changes to the public Thrift API. | ||
- Cons | ||
- Obscures support for feature documentation, since data in `customJson` looks adhoc and generally less discoverable. | ||
- Would require some sort of mapping from objects to descriptions in the top-level metadata. E.g. mapping a description to its corresponding derivation, perhaps through output columns. This could be brittle and significantly increases the size of data in `customJson`. | ||
- Bubbling up parameters via dynamic and undocumented fields (as it's done with `tags`, which are not part of the Thrift definition for `Aggregation`) is less maintainable as things may easily break if code is moved around without handling those fields correctly. | ||
- Defining a separate Thrift struct for holding documentation. | ||
- Pros | ||
- Isolation from existing `MetaData` definition, which has a lot of fields that are only relevant in specific contexts. | ||
- Cons | ||
- Slippery slope towards having various diffent types of metadata. | ||
- Unclear what the actual definition would be and the boundary with the generic `MetaData` definition. After all, data description is a type of metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a valid motivation!