Description
Tell us about the problem you're trying to solve
The schema validation code has undergone a lot of changes made on some assumptions, and I find it both confusing and don't think it achieves what we want it to.
Current solution
Our current schema validation method:
- tries to validate a schema
- excepts a
JsonValidationException
by not using that exception at all, but rather re-calling the validator in order to access the args that were used to create the singleJsonValidationException.message
containing multiple error messages - Uses those args to create our own validation messages, one per
ValidationMessage
in theJsonSchemaValidator
- Throws a
RecordSchemaValidationException
containing the validation messages and the originalJsonValidationException
We then handle the exceptions by keeping a map of:
stream (key): (concatenated set of error messages, # of record validatons total that make up the errors in that list)
. We keep this updated so that after we've gotten validaton errors for 10 records of a given stream, we stop running validation on that stream
Issues
By not using the JsonValidationException
directly and instead re-calling the validator in order to access the args that were used to create the single JsonValidationException.message
, we are losing information and inserting incorrect information:
We make the assumption that every error thrown by the validator is about incorrect types, when that's not the case.This led us to previously assume that every message would come with 2 args,invalidRecordDataAndType
. The 2 args in the case of type mismatch errors are actually expected type and actual type, not record data and actual type. Not grabbing this info here is where we lose the expected type.Sometimes theJsonValidationException
does not come with args, for example in the case of when additional properties are not allowed (0 arguments). We currently make extra checks against this, instead of handling different types of errors that the schema validator is trying to send us. We mask these potential errors as "__ is of an incorrect type" errors with no more information (and the type information is incorrect)
Crossed out the above as it was fixed here - however one issue i didn't raise before is that we were previously calling validate 3 times to get the info we need, now we only do it twice, but should be able to do it in just one go.
By limiting ourselves to 10 total validation failures, we lose some info:
- We may not be getting all of the type mismatch errors, so if someone fixing types fixes all the reported types, there may be new errors reported once the previous ones are fixed
Since all validation errors are reported as type errors, 10 type errors could stop us from seeing other completely different types of schema validation errors on further records
What are we trying to achieve? From my understanding:
- We want to collect and report schema validation errors
- Ideally all of them, not just type mismatches
- We want to collect the number of records with schema validation errors per stream in order to limit the amount of validating we do
- ❓ previously this was due to the amount of logging and performance (I think). Is this solved by moving it to a separate thread?
- In the case of type mismatches, we want the args (expected, actual types, path to the field) - we should be passing these around as objects and not within strings
Describe the solution you’d like
- Refactor this in the way that we would if we were implementing it with datadog first in mind before logging.
- Keep the
JsonSchemaValidationError
as the source of truth for what the errors are.- ❓ Can we access the
ValidationMessage
s directly instead of relying on theJsonSchemaValdiationError
raised? This would be the easiest way to know exactly what the error is and the relevant arguments. This comment makes me think it might not be possible, though 🤔
- ❓ Can we access the
- Remove regeneration of log messages and
RecordSchemaValidationException
s. - Either remove the max count of logs, or only max count type mismatch errors, so that other potential errors are still logged
Additional context
A brief history of the schema validator: