Skip to content

SQL Sources should include previous sync's cursor value in SELECT statement as not to skip rows added mid-sync #14732

Closed
@evantahler

Description

@evantahler

Description

Consider the following users table.

id email name created_at updated_at
1 [email protected] Evan 2022-01-01 00:00:00 2022-01-01 00:00:00
2 [email protected] Brian 2022-01-02 00:00:00 2022-01-02 00:00:00
3 [email protected] Andy 2022-01-03 00:00:00 2022-01-03 00:00:00

A common sync would be to create in incremental sync, using the updated_at timestamp column + de-duped history. As users are added or updated the updated_at column is changed and the next sync will pick them up as the cursor can only move forward in time.

There's a problem however at the boundary of the sync. If the sync itself was running at 2022-01-03 00:00:00, right when a new user was added, they would be not be included in the sync:

id email name created_at updated_at
1 [email protected] Evan 2022-01-01 00:00:00 2022-01-01 00:00:00
2 [email protected] Brian 2022-01-02 00:00:00 2022-01-02 00:00:00
3 [email protected] Andy 2022-01-03 00:00:00 2022-01-03 00:00:00
4 [email protected] Greg 2022-01-03 00:00:00 2022-01-03 00:00:00

We won't pick up user 4 (at least until something about them changes and their updated_at is bumped) because in subsequent syncs, we use the greater-than comparison to choose which records to import. This can lead to missing records if they were inserted as the sync was running.

Local confirmation

To confirm this behavior, create a source table with the 3 rows listed in the first example, and run the sync. You will have 3 records in your destination. Then add a 4th row with the an updated_at matching the current cursor ("2022-01-03 00:00:00" in this example). Sync again, and see that user 4 did not make it to the destination.

Possible Solutions

  1. Rather than selecting records from SQL sources matching "SELECT %s FROM %s WHERE %s > ?", we could use >=, e.g. "SELECT %s FROM %s WHERE %s >= ?". This solution is the most robust because it also works for numeric (vs time) cursors as well. This will have the negative side effect of almost always creating a duplicate entry (the previous sync's most recent record is very likely to be returned again at the start of a subsequent sync), and normalization will need to deal with it. This is not a problem with de-duped history, but is weird in all other sync modes.

  2. Building from solution 1 above (using >=), we could store both the cursor value and the offset in the STATE. This solves the duplicate row problem introduced above. The state stored could remember the previous sync's cursor and OFFSET. In this way, the next sync could continue from the same cursor position + offset to be sure that now new records appeared within the same cursor that we haven't seen yet:

SELECT * from users LIMIT 100 # got some records
SELECT * from users LIMIT 100 OFFSET 3 # got 0 records after using offset
# first sync ends, cursor is {value: "2022-01-03 00:00:00" + offset: 0}

# Next sync starts
SELECT * from users WHERE updated_at >= "2022-01-03 00:00:00" LIMIT 100 OFFSET 1 # use previous offset + 1 # 1 new row found!

So, the state for an incremental SQL cursor source should look like:

{ users: { offset: 0, cursor: "2022-01-03 00:00:00"}}

Probably bad solutions

  1. If we could detect that the cursor was a time field, we could append AND ${cursor} >= NOW() - INTERVAL '2 seconds'. In this way, we would ignore "too recent" values that might be inserted at the time of the sync. This doesn't solve the problem with numeric cursors, only "time"-like cursors.
  2. Introduce a sleep and one more read before the sync is determined to be complete, e.g.:
SELECT * from users LIMIT 100 # got some records
SELECT * from users LIMIT 100 OFFSET 3 # got 0 records after using offset
# SLEEP 2. Sleep enough time to ensure that the cursor's minimum resolution (often 1 second) has passed
SELECT * from users LIMIT 100  OFFSET 3 # got 1 record that was inserted during the sleep
# LOOP or END if still 0 rows
  1. Don't ever send any RECORDS from the latest cursor value. You'll never see any duplicates, but the destination will always be some level of "behind" the source

Misc

This approach assumes that our source database have an implicit, static sort which will return rows in the same order without needing an ORDER BY. This is true for postgres and mysql... is it true for all DB sources? We may also need to add an ORDER BY clause.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions