Description
Consider the following users table:

id | email | name | created_at | updated_at
---|---|---|---|---
1 | [email protected] | Evan | 2022-01-01 00:00:00 | 2022-01-01 00:00:00
2 | [email protected] | Brian | 2022-01-02 00:00:00 | 2022-01-02 00:00:00
3 | [email protected] | Andy | 2022-01-03 00:00:00 | 2022-01-03 00:00:00
A common approach would be to create an incremental sync using the `updated_at` timestamp column + de-duped history. As users are added or updated, the `updated_at` column changes, and the next sync will pick them up, since the cursor can only move forward in time.
There's a problem, however, at the boundary of the sync. If the sync itself was running at 2022-01-03 00:00:00, right when a new user was added, that user would not be included in the sync:
id | email | name | created_at | updated_at
---|---|---|---|---
1 | [email protected] | Evan | 2022-01-01 00:00:00 | 2022-01-01 00:00:00
2 | [email protected] | Brian | 2022-01-02 00:00:00 | 2022-01-02 00:00:00
3 | [email protected] | Andy | 2022-01-03 00:00:00 | 2022-01-03 00:00:00
4 | [email protected] | Greg | 2022-01-03 00:00:00 | 2022-01-03 00:00:00
We won't pick up user 4 (at least until something about them changes and their `updated_at` is bumped), because subsequent syncs use a greater-than comparison to choose which records to import. This can lead to missing records whenever they are inserted while the sync is running.
Local confirmation
To confirm this behavior, create a source table with the 3 rows listed in the first example and run the sync; you will have 3 records in your destination. Then add a 4th row with an `updated_at` matching the current cursor ("2022-01-03 00:00:00" in this example). Sync again, and see that user 4 did not make it to the destination.
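The confirmation steps above can be reproduced in-process. This is a minimal sketch using an in-memory SQLite table; the `sync` helper is a hypothetical stand-in for the connector's incremental read, kept only to show the strict `>` comparison skipping the boundary row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Evan", "2022-01-01 00:00:00"),
     (2, "Brian", "2022-01-02 00:00:00"),
     (3, "Andy", "2022-01-03 00:00:00")],
)

def sync(cursor_value):
    """Incremental read using the strict '>' comparison."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (cursor_value,),
    ).fetchall()
    # Advance the cursor to the newest updated_at we saw.
    new_cursor = rows[-1][2] if rows else cursor_value
    return rows, new_cursor

# First sync: every row is newer than the empty starting cursor.
rows, cursor = sync("")
assert len(rows) == 3 and cursor == "2022-01-03 00:00:00"

# A 4th user arrives with updated_at equal to the current cursor ...
conn.execute("INSERT INTO users VALUES (4, 'Greg', '2022-01-03 00:00:00')")

# ... and the next sync misses them: '>' excludes the boundary value.
rows, cursor = sync(cursor)
print(rows)  # [] -- user 4 never reaches the destination
```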
Possible Solutions

- Rather than selecting records from SQL sources matching `"SELECT %s FROM %s WHERE %s > ?"`, we could use `>=`, e.g. `"SELECT %s FROM %s WHERE %s >= ?"`. This solution is the most robust because it also works for numeric (vs. time) cursors. It has the negative side effect of almost always creating a duplicate entry (the previous sync's most recent record is very likely to be returned again at the start of a subsequent sync), and normalization will need to deal with it. This is not a problem with de-duped history, but is weird in all other sync modes.
- Building on solution 1 above (using `>=`), we could store both the cursor value and an offset in the STATE. This solves the duplicate-row problem introduced above. The stored state would remember the previous sync's cursor and OFFSET; the next sync could then continue from the same cursor position + offset, to be sure that no new records appeared at a cursor value we have already seen:

  ```sql
  SELECT * from users LIMIT 100 # got some records
  SELECT * from users LIMIT 100 OFFSET 3 # got 0 records after using offset
  # first sync ends, cursor is {value: "2022-01-03 00:00:00", offset: 0}
  # Next sync starts
  SELECT * from users WHERE updated_at >= "2022-01-03 00:00:00" LIMIT 100 OFFSET 1 # use previous offset + 1 # 1 new row found!
  ```

  So, the state for an incremental SQL cursor source should look like:

  ```
  { users: { offset: 0, cursor: "2022-01-03 00:00:00" } }
  ```
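The cursor + offset idea can be sketched end-to-end. `sync` below is a hypothetical illustration (not the connector's actual code) that persists `{cursor, offset}` between runs and uses `>=` plus `OFFSET` to pick up boundary rows without re-emitting ones already seen; it adds an `ORDER BY` to make the offset stable, which the Misc section notes may be needed anyway:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Evan", "2022-01-01 00:00:00"),
     (2, "Brian", "2022-01-02 00:00:00"),
     (3, "Andy", "2022-01-03 00:00:00")],
)

def sync(state):
    """Read with '>=' on the cursor, skipping rows already emitted at that
    cursor value via OFFSET. Returns (new_rows, new_state)."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at >= ? ORDER BY updated_at, id LIMIT 100 OFFSET ?",
        (state["cursor"], state["offset"]),
    ).fetchall()
    if not rows:
        return rows, state
    new_cursor = rows[-1][2]
    # How many emitted rows share the final cursor value: the next sync must
    # skip exactly that many when it restarts at new_cursor with '>='.
    offset = sum(1 for r in rows if r[2] == new_cursor)
    if new_cursor == state["cursor"]:
        offset += state["offset"]  # still on the same boundary value
    return rows, {"cursor": new_cursor, "offset": offset}

state = {"cursor": "", "offset": 0}
rows, state = sync(state)  # 3 rows emitted
conn.execute("INSERT INTO users VALUES (4, 'Greg', '2022-01-03 00:00:00')")
rows, state = sync(state)  # picks up exactly user 4, no duplicates
print(rows)  # [(4, 'Greg', '2022-01-03 00:00:00')]
```

Note the offset only has to count rows at the *latest* cursor value, so it resets naturally once the cursor moves forward.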
Probably bad solutions

- If we could detect that the cursor is a time field, we could append `AND ${cursor} <= NOW() - INTERVAL '2 seconds'`. In this way, we would ignore "too recent" values that might be inserted at the time of the sync. This doesn't solve the problem for numeric cursors, only "time"-like cursors.
- Introduce a sleep and one more read before the sync is determined to be complete, e.g.:

  ```sql
  SELECT * from users LIMIT 100 # got some records
  SELECT * from users LIMIT 100 OFFSET 3 # got 0 records after using offset
  # SLEEP 2. Sleep enough time to ensure that the cursor's minimum resolution (often 1 second) has passed
  SELECT * from users LIMIT 100 OFFSET 3 # got 1 record that was inserted during the sleep
  # LOOP or END if still 0 rows
  ```

- Never send any RECORDS from the latest cursor value. You'll never see any duplicates, but the destination will always lag some amount "behind" the source.
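The sleep-and-recheck loop above can be sketched as follows. Everything here is hypothetical illustration: `read_page` stands in for the paged SELECT, a plain list stands in for the source table, and a `threading.Timer` simulates a row arriving while the sync is sleeping:

```python
import threading
import time

source_rows = [(1, "Evan"), (2, "Brian"), (3, "Andy")]

def read_page(offset, limit=100):
    """Stand-in for `SELECT ... LIMIT ? OFFSET ?` against the source."""
    return source_rows[offset:offset + limit]

def drain(settle_seconds):
    """Emit pages until empty, then sleep and re-read once more before
    declaring the sync complete (looping if the re-read found rows)."""
    emitted, offset = [], 0
    while True:
        page = read_page(offset)
        if page:
            emitted.extend(page)
            offset += len(page)
            continue
        time.sleep(settle_seconds)   # wait out the cursor's resolution
        if not read_page(offset):    # still nothing new: we're done
            return emitted

# Simulate a row being inserted while the sync is sleeping:
timer = threading.Timer(0.05, lambda: source_rows.append((4, "Greg")))
timer.start()
rows = drain(settle_seconds=0.2)
timer.join()
print(len(rows))  # 4 -- the re-read caught the row inserted during the sleep
```

The obvious downside, as noted above, is that every sync pays the sleep even when nothing was inserted, and a writer can always slip in just after the final re-read.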
Misc
This approach assumes that our source databases have an implicit, static sort that returns rows in the same order without needing an `ORDER BY`. This is true for Postgres and MySQL... is it true for all DB sources? We may also need to add an `ORDER BY` clause.