[BUG] Spark-delta commitInfo protocol violation #2419

Open
1 of 5 tasks
ion-elgreco opened this issue Dec 30, 2023 · 0 comments
Labels
bug Something isn't working


Bug

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Describe the problem

Spark-delta cannot deserialize the history of a Delta table whose commitInfo operationMetrics contain non-string values, such as integers or nested objects (filesAdded in the trace under Observed results). This violates the Delta protocol, which does not restrict the value types stored in commitInfo, so a reader should be able to deserialize them: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#commit-provenance-information

Implementations are free to store any valid JSON-formatted data via the commitInfo action.

Steps to reproduce

Write a Delta table with delta-rs, run any operation that records operationMetrics in the commitInfo (for example OPTIMIZE, as in the trace under Observed results), then call Table.history() with spark-delta.

Observed results

Spark-delta raises the following exception because it expects every operationMetrics value to be a string, even though the protocol permits any valid JSON value.

MismatchedInputException: Cannot deserialize value of type `java.lang.String` from Object value (token `JsonToken.START_OBJECT`)
 at [Source: (String)"{"commitInfo":{"timestamp":1703277009126,"operation":"OPTIMIZE","operationParameters":{"targetSize":"1000000000"},"clientVersion":"delta-rs.0.16.5","operationMetrics":{"filesAdded":{"avg":717028931.0,"max":940352015,"min":363011010,"totalFiles":4,"totalSize":2868115724},"filesRemoved":{"avg":179660.21727615935,"max":225198,"min":132135,"totalFiles":17747,"totalSize":3188429876},"numBatches":586551,"numFilesAdded":4,"numFilesRemoved":17747,"partitionsOptimized":2,"preserveInsertionOrder":true,"to"[truncated 70 chars]; line: 1, column: 182] (through reference chain: org.apache.spark.sql.delta.actions.SingleAction["commitInfo"]->org.apache.spark.sql.delta.actions.CommitInfo["operationMetrics"]->com.fasterxml.jackson.module.scala.deser.GenericMapFactoryDeserializerResolver$BuilderWrapper["filesAdded"])
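
To see which values would trip a reader that expects a string-to-string map, the check below lists every non-string entry in operationMetrics. It is a minimal Python sketch using only the standard library. COMMIT_LINE is a trimmed copy of the commitInfo from the trace above (fields lost to the truncation are omitted), and non_string_metrics is a hypothetical helper, not part of delta-rs or spark-delta.

```python
import json
from typing import Dict

# Trimmed copy of the commitInfo shown in the trace; only fields visible there are kept.
COMMIT_LINE = json.dumps({
    "commitInfo": {
        "timestamp": 1703277009126,
        "operation": "OPTIMIZE",
        "operationParameters": {"targetSize": "1000000000"},
        "clientVersion": "delta-rs.0.16.5",
        "operationMetrics": {
            "filesAdded": {
                "avg": 717028931.0,
                "max": 940352015,
                "min": 363011010,
                "totalFiles": 4,
                "totalSize": 2868115724,
            },
            "numFilesAdded": 4,
            "numFilesRemoved": 17747,
            "partitionsOptimized": 2,
            "preserveInsertionOrder": True,
        },
    }
})


def non_string_metrics(commit_line: str) -> Dict[str, str]:
    """Map each operationMetrics key whose value is not a string to its Python type name."""
    commit_info = json.loads(commit_line).get("commitInfo", {})
    metrics = commit_info.get("operationMetrics") or {}
    return {
        key: type(value).__name__
        for key, value in metrics.items()
        if not isinstance(value, str)
    }


if __name__ == "__main__":
    for key, type_name in non_string_metrics(COMMIT_LINE).items():
        print(f"{key}: {type_name}")
```

For the trimmed commit this reports filesAdded as dict, the three counters as int, and preserveInsertionOrder as bool. The filesAdded object is the entry named at the end of the reference chain in the exception.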

Expected results

Spark-delta should deserialize any valid JSON value in commitInfo, as the protocol allows.
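
One lenient reading strategy, sketched here as an assumption and not as the fix spark-delta should or did adopt, is to keep string values unchanged and store every other value as its JSON text. The helper name stringify_metrics is hypothetical, and the sample values are taken from the trace in Observed results.

```python
import json
from typing import Any, Dict


def stringify_metrics(metrics: Dict[str, Any]) -> Dict[str, str]:
    """Return a string-to-string map; non-string values are kept as their JSON text."""
    return {
        key: value if isinstance(value, str) else json.dumps(value, sort_keys=True)
        for key, value in metrics.items()
    }


if __name__ == "__main__":
    sample = {
        "numFilesAdded": 4,
        "preserveInsertionOrder": True,
        "filesAdded": {"totalFiles": 4},
    }
    print(stringify_metrics(sample))
```

With this approach a consumer that only understands string metrics can still load the history, and the nested filesAdded statistics stay recoverable by parsing the stored JSON text.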

Further details

@ion-elgreco ion-elgreco added the bug Something isn't working label Dec 30, 2023
@ion-elgreco ion-elgreco changed the title [BUG] Spark-delta commitInfo protocol not preserved [BUG] Spark-delta commitInfo protocol violation Dec 30, 2023