Skip to content

Schema validation error collection refactor #24853

Closed
@erohmensing

Description

@erohmensing

Tell us about the problem you're trying to solve

The schema validation code has undergone a lot of changes made on some assumptions, and I find it both confusing and don't think it achieves what we want it to.

Current solution

Our current schema validation method:

  1. tries to validate a schema
  2. excepts a JsonValidationException by not using that exception at all, but rather re-calling the validator in order to access the args that were used to create the single JsonValidationException.message containing multiple error messages
  3. Uses those args to create our own validation messages, one per ValidationMessage in the JsonSchemaValidator
  4. Throws a RecordSchemaValidationException containing the validation messages and the original JsonValidationException

We then handle the exceptions by keeping a map of:
stream (key): (concatenated set of error messages, # of record validatons total that make up the errors in that list). We keep this updated so that after we've gotten validaton errors for 10 records of a given stream, we stop running validation on that stream

Issues

By not using the JsonValidationException directly and instead re-calling the validator in order to access the args that were used to create the single JsonValidationException.message, we are losing information and inserting incorrect information:

  1. We make the assumption that every error thrown by the validator is about incorrect types, when that's not the case.
  2. This led us to previously assume that every message would come with 2 args, invalidRecordDataAndType. The 2 args in the case of type mismatch errors are actually expected type and actual type, not record data and actual type. Not grabbing this info here is where we lose the expected type.
  3. Sometimes the JsonValidationException does not come with args, for example in the case of when additional properties are not allowed (0 arguments). We currently make extra checks against this, instead of handling different types of errors that the schema validator is trying to send us. We mask these potential errors as "__ is of an incorrect type" errors with no more information (and the type information is incorrect)

Crossed out the above as it was fixed here - however one issue i didn't raise before is that we were previously calling validate 3 times to get the info we need, now we only do it twice, but should be able to do it in just one go.

By limiting ourselves to 10 total validation failures, we lose some info:

  1. We may not be getting all of the type mismatch errors, so if someone fixing types fixes all the reported types, there may be new errors reported once the previous ones are fixed
  2. Since all validation errors are reported as type errors, 10 type errors could stop us from seeing other completely different types of schema validation errors on further records

What are we trying to achieve? From my understanding:

  • We want to collect and report schema validation errors
    • Ideally all of them, not just type mismatches
  • We want to collect the number of records with schema validation errors per stream in order to limit the amount of validating we do
    • ❓ previously this was due to the amount of logging and performance (I think). Is this solved by moving it to a separate thread?
  • In the case of type mismatches, we want the args (expected, actual types, path to the field) - we should be passing these around as objects and not within strings

Describe the solution you’d like

  • Refactor this in the way that we would if we were implementing it with datadog first in mind before logging.
  • Keep the JsonSchemaValidationError as the source of truth for what the errors are.
    • ❓ Can we access the ValidationMessages directly instead of relying on the JsonSchemaValdiationError raised? This would be the easiest way to know exactly what the error is and the relevant arguments. This comment makes me think it might not be possible, though 🤔
  • Remove regeneration of log messages and RecordSchemaValidationExceptions.
  • Either remove the max count of logs, or only max count type mismatch errors, so that other potential errors are still logged

Additional context

A brief history of the schema validator:

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions