SQL Sources should include previous sync's cursor value in SELECT statement as not to skip rows added mid-sync

## Description

Consider the following `users` table.

| id | email             | name  | created_at          | updated_at          |
|----|-------------------|-------|---------------------|---------------------|
| 1  | evan@example.com  | Evan  | 2022-01-01 00:00:00 | 2022-01-01 00:00:00 |
| 2  | brian@example.com | Brian | 2022-01-02 00:00:00 | 2022-01-02 00:00:00 |
| 3  | andy@example.com  | Andy  | 2022-01-03 00:00:00 | 2022-01-03 00:00:00 |

A common sync would be to create in incremental sync, using the `updated_at` timestamp column + de-duped history.  As users are added or updated the `updated_at` column is changed and the next sync will pick them up as the cursor can only move forward in time.

There's a problem however at the boundary of the sync.  If the sync itself was running at `2022-01-03 00:00:00`, right when a new user was added, they would be not be included in the sync:

| id | email             | name  | created_at          | updated_at          |
|----|-------------------|-------|---------------------|---------------------|
| 1  | evan@example.com  | Evan  | 2022-01-01 00:00:00 | 2022-01-01 00:00:00 |
| 2  | brian@example.com | Brian | 2022-01-02 00:00:00 | 2022-01-02 00:00:00 |
| 3  | andy@example.com  | Andy  | 2022-01-03 00:00:00 | 2022-01-03 00:00:00 |
| 4  | greg@example.com  | Greg  | 2022-01-03 00:00:00 | 2022-01-03 00:00:00 |

We won't pick up user 4 (at least until something about them changes and their `updated_at` is bumped) because in subsequent syncs, we use the [greater-than comparison](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-jdbc/src/main/java/io/airbyte/integrations/source/jdbc/AbstractJdbcSource.java#L274-L278) to choose which records to import.  **This can lead to missing records if they were inserted as the sync was running**.

## Local confirmation

To confirm this behavior, create a source table with the 3 rows listed in the first example, and run the sync.  You will have 3 records in your destination.  Then add a 4th row with the an `updated_at` matching the current cursor ("2022-01-03 00:00:00" in this example).  Sync again, and see that user 4 did not make it to the destination. 

## Possible Solutions

1. Rather than selecting records from SQL sources matching `"SELECT %s FROM %s WHERE %s > ?"`, we could use `>=`, e.g. `"SELECT %s FROM %s WHERE %s >= ?"`.  This solution is the most robust because it also works for numeric (vs time) cursors as well. This will have the negative side effect of almost always creating a duplicate entry (the previous sync's most recent record is very likely to be returned again at the start of a subsequent sync), and normalization will need to deal with it. This is not a problem with de-duped history, but is weird in all other sync modes. 

2. Building from solution 1 above (using `>=`), we could store both the cursor value and the offset in the STATE.  This solves the duplicate row problem introduced above. The state stored could remember the previous sync's cursor and OFFSET.  In this way, the next sync could continue from the same cursor position + offset to be sure that now new records appeared within the same cursor that we haven't seen yet:

```sql
SELECT * from users LIMIT 100 # got some records
SELECT * from users LIMIT 100 OFFSET 3 # got 0 records after using offset
# first sync ends, cursor is {value: "2022-01-03 00:00:00" + offset: 0}

# Next sync starts
SELECT * from users WHERE updated_at >= "2022-01-03 00:00:00" LIMIT 100 OFFSET 1 # use previous offset + 1 # 1 new row found!
```

So, the state for an incremental SQL cursor source should look like:

```js
{ users: { offset: 0, cursor: "2022-01-03 00:00:00"}}
```

## Probably bad solutions

1. If we could detect that the cursor was a time field, we could append `AND ${cursor} >= NOW() - INTERVAL '2 seconds'`.  In this way, we would ignore "too recent" values that might be inserted at the time of the sync. This doesn't solve the problem with numeric cursors, only "time"-like cursors.
2. Introduce a sleep and one more read before the sync is determined to be complete, e.g.:
```sql
SELECT * from users LIMIT 100 # got some records
SELECT * from users LIMIT 100 OFFSET 3 # got 0 records after using offset
# SLEEP 2. Sleep enough time to ensure that the cursor's minimum resolution (often 1 second) has passed
SELECT * from users LIMIT 100  OFFSET 3 # got 1 record that was inserted during the sleep
# LOOP or END if still 0 rows
```
3. Don't ever send any RECORDS from the latest cursor value.  You'll never see any duplicates, but the destination will always be some level of "behind" the source


## Misc

This approach assumes that our source database have an implicit, static sort which will return rows in the same order without needing an `ORDER BY`.  This is true for postgres and mysql... is it true for all DB sources?  We may also need to add an `ORDER BY` clause.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

SQL Sources should include previous sync's cursor value in SELECT statement as not to skip rows added mid-sync #14732

Description

Local confirmation

Possible Solutions

Probably bad solutions

Misc

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

id	email	name	created_at	updated_at
1	[email protected]	Evan	2022-01-01 00:00:00	2022-01-01 00:00:00
2	[email protected]	Brian	2022-01-02 00:00:00	2022-01-02 00:00:00
3	[email protected]	Andy	2022-01-03 00:00:00	2022-01-03 00:00:00

SQL Sources should include previous sync's cursor value in SELECT statement as not to skip rows added mid-sync #14732

Description

Description

Local confirmation

Possible Solutions

Probably bad solutions

Misc

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions