Closed
Description
- First, the test is built on the naive assumption that the results of two consecutive `full_refresh` reads must be strictly equal, or at least that one must be a subset of the other (records are compared by primary key if it is present, otherwise by record hash, with an option to ignore some fields). That is incorrect, since there is no guarantee that records will not be removed between the two reads.
- Second, to determine whether one set of records is a subset of another, the test checks that their `symmetric_difference` is empty. That is also incorrect, since an empty symmetric difference means the two sets are exactly equal. A subset check should be used instead.
- Third, when comparing records by primary key, the primary key of a configured stream defined in a `configured_catalog.json` file is often missing, and this is neither validated nor highlighted in any way because the field is optional. As a result, records are compared by hash instead of by primary key. My suggestion is to log a warning when `ConfiguredStream.Stream.source_defined_primary_key` is not falsy but `ConfiguredStream.primary_key` is.
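
To illustrate the second and third points, here is a minimal Python sketch. It is not the actual test code: `record_keys`, `is_subset`, and `warn_if_primary_key_missing` are hypothetical helpers, and the primary key is assumed to be a flat list of field names rather than whatever shape the real catalog uses. It only shows how an empty-symmetric-difference check differs from a subset check, and what the suggested warning could look like. It does not address the first point, that records may legitimately disappear between reads.

```python
import json
import logging

logger = logging.getLogger("full_refresh_check")  # hypothetical logger name


def record_keys(records, primary_key=None):
    """Return a set of comparable keys for the given records.

    Assumption: primary_key is a flat list of field names. When it is
    missing, fall back to a canonical JSON dump of the whole record,
    which stands in for the "record hash" comparison.
    """
    if primary_key:
        return {tuple(record.get(field) for field in primary_key) for record in records}
    return {json.dumps(record, sort_keys=True, default=str) for record in records}


def is_subset(first_read, second_read, primary_key=None):
    """True if every record of the first read also appears in the second read."""
    return record_keys(first_read, primary_key).issubset(record_keys(second_read, primary_key))


def warn_if_primary_key_missing(stream_name, source_defined_primary_key, configured_primary_key):
    """Log a warning when the source defines a primary key but the configured stream has none."""
    if source_defined_primary_key and not configured_primary_key:
        logger.warning(
            "Stream %s: source_defined_primary_key is set but primary_key is empty; "
            "records will be compared by hash instead of by primary key.",
            stream_name,
        )
        return True
    return False


first_read = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
second_read = first_read + [{"id": 3, "name": "c"}]  # one record added between reads

first_keys = record_keys(first_read, ["id"])
second_keys = record_keys(second_read, ["id"])

# The symmetric difference is non-empty, so an "empty symmetric difference"
# check fails here even though the first read is a subset of the second.
difference = first_keys.symmetric_difference(second_keys)
subset_ok = is_subset(first_read, second_read, ["id"])
```

With these inputs the symmetric difference contains the key of the added record while the subset check passes, which is the discrepancy described in the second point.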