Rows with identical values get identical hash codes in the CSV driver #180

Closed
kolovos opened this issue May 1, 2025 · 2 comments

kolovos commented May 1, 2025

This can be problematic in other parts of Epsilon, such as ETL's default transformation strategy, which assumes that model elements have unique hash codes. Options to explore:

  • Add an ordinal number field to CSV model elements to avoid duplicate hash codes
  • Change from hash codes to system identities in ETL's FastTransformationStrategy
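
To illustrate the problem, here is a minimal standalone sketch, assuming a CSV row behaves like a `java.util.Map` from column name to cell value. The class and method names are hypothetical and this is not Epsilon's code. It shows two rows with the same cell values collapsing into one entry in a hash-based cache, while an identity-based cache keeps them apart:

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class DuplicateRowHashDemo {

    // Hypothetical stand-in for a CSV row: column name -> cell value.
    static Map<String, Object> row(String name, String city) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("name", name);
        r.put("city", city);
        return r;
    }

    // Number of cache entries when rows are keyed by equals()/hashCode().
    static int countByValue(Map<String, Object> a, Map<String, Object> b) {
        Map<Object, String> cache = new HashMap<>();
        cache.put(a, "target for a");
        cache.put(b, "target for b"); // overwrites the entry for a if a.equals(b)
        return cache.size();
    }

    // Number of cache entries when rows are keyed by reference identity.
    static int countByIdentity(Map<String, Object> a, Map<String, Object> b) {
        Map<Object, String> cache = new IdentityHashMap<>();
        cache.put(a, "target for a");
        cache.put(b, "target for b");
        return cache.size();
    }

    public static void main(String[] args) {
        Map<String, Object> first = row("Ann", "York");
        Map<String, Object> second = row("Ann", "York"); // same values, different row

        System.out.println(first.hashCode() == second.hashCode()); // true
        System.out.println(countByValue(first, second));           // 1
        System.out.println(countByIdentity(first, second));        // 2
    }
}
```

The second option above corresponds roughly to the `countByIdentity` variant: the lookup structure stops depending on the rows' own `hashCode()` and `equals()`.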

arcanefoam commented May 6, 2025 via email


agarciadom commented May 14, 2025

I'd avoid making a broad change to ETL, as it may have unintended consequences. It may be better to change CSV rows so that each row has a distinct hash code.

When I tried adding a row number to a CSV row, I realized it'd be harder than I expected to keep it up to date as rows are removed or inserted in the middle of the file. We wouldn't really want to expose such a pseudo-row number to users, as its behaviour may not be very reliable.

Why not change the internal representation of a row to a LinkedHashMap subclass which reverts hashCode+equals to be based on object identity?
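
A rough sketch of what such a subclass could look like follows. `IdentityRow` and the surrounding demo class are hypothetical names for illustration, not the CSV driver's actual classes, and the column layout is made up:

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Set;

public class IdentityRowDemo {

    // Hypothetical row type: keeps LinkedHashMap's column ordering, but
    // compares and hashes by object identity instead of by contents.
    static class IdentityRow extends LinkedHashMap<String, Object> {
        private static final long serialVersionUID = 1L;

        @Override
        public boolean equals(Object other) {
            return this == other;
        }

        @Override
        public int hashCode() {
            return System.identityHashCode(this);
        }
    }

    static IdentityRow row(String name, String city) {
        IdentityRow r = new IdentityRow();
        r.put("name", name);
        r.put("city", city);
        return r;
    }

    // Counts how many of the given rows a hash-based set treats as distinct.
    static int distinctRows(IdentityRow... rows) {
        Set<IdentityRow> seen = new HashSet<>();
        for (IdentityRow r : rows) {
            seen.add(r);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        IdentityRow first = row("Ann", "York");
        IdentityRow second = row("Ann", "York"); // same values, different row

        System.out.println(first.equals(second));             // false
        System.out.println(distinctRows(first, second));      // 2
        System.out.println(distinctRows(first, first));       // 1
    }
}
```

Two trade-offs are worth keeping in mind with this approach. It departs from the general `Map.equals` contract, so a plain map compared against an `IdentityRow` may disagree with the reverse comparison. Identity hash codes are also not guaranteed to be unique, so code that relies on hash codes alone (rather than `hashCode()` plus `equals()`) could still see rare collisions. On the upside, a row's hash code no longer changes when its cell values are edited.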
