Skip to content

Destination Snowflake: _AIRBYTE_UNIQUE_KEY dependent on order of composite primary key fields - can cause duplicates #21330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
luke-oa opened this issue Jan 12, 2023 · 1 comment

Comments

@luke-oa
Copy link

luke-oa commented Jan 12, 2023

Environment

  • Airbyte version: 0.40.15
  • Deployment: Kube
  • Destination Connector and version: Snowflake 0.4.41
  • Step where error happened: Sync Normalization

Current Behavior

Airbyte normalization dedupes data using an _AIRBYTE_UNIQUE_KEY, which is the MD5 hash of the values of the primary key(s) for the row; however, if the order in which these field names is changed in the config/catalog (only applicable to composite/multiple primary keys), the MD5 hash value (_AIRBYTE_UNIQUE_KEY) for a given row will also change, resulting in duplicate row values (only applicable when deduping data).

Example query output when producing the *_scd_dbt table

Expected Behavior

Deduplication should not be impacted by the order in which composite primary key field names are stored. Ideally a non-order dependent algorithm could be used, but this would be a breaking change in the future. More realistically, something that ensures a given sort order in the normalization query regardless of config/catalog order would solve this.

Steps to Reproduce

  1. Set a composite key for a table, sync it, and dedupe the data
  2. Change the order of the composite key fields (set catalog via API)
  3. Sync and dedupe again
  4. Check row counts based on a hash of the composite values and see there are now duplicates
@luke-oa luke-oa added needs-triage type/bug Something isn't working labels Jan 12, 2023
@igrankova igrankova changed the title _AIRBYTE_UNIQUE_KEY dependent on order of composite primary key fields - can cause duplicates Destination Snowflake: _AIRBYTE_UNIQUE_KEY dependent on order of composite primary key fields - can cause duplicates Jun 6, 2023
@evantahler
Copy link
Contributor

Closing this issue because normalization as we know it is going away shortly (and the way de-duplication works will be changing) - learn more @ #26028

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants