
Commit d4174a0

Source S3: update docs (#36366)
1 parent 3eb18c9 commit d4174a0

File tree

  • docs/integrations/sources/s3.md

1 file changed: +12 -12 lines


docs/integrations/sources/s3.md (+12 -12)
@@ -125,11 +125,14 @@ For more information on managing your access keys, please refer to the
 3. Find and select **S3** from the list of available sources.
 4. Enter the name of the **Bucket** containing your files to replicate.
 5. Add a stream
-   1. Write the **File Type**
+   1. Choose the **File Format**
    2. In the **Format** box, use the dropdown menu to select the format of the files you'd like to replicate. The supported formats are **CSV**, **Parquet**, **Avro** and **JSONL**. Toggling the **Optional fields** button within the **Format** box will allow you to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the [File Format section](#file-format-settings) below.
    3. Give a **Name** to the stream
-   4. (Optional) - If you want to enforce a specific schema, you can enter an **Input schema**. By default, this value is set to `{}` and will automatically infer the schema from the file\(s\) you are replicating. For details on providing a custom schema, refer to the [User Schema section](#user-schema).
-   5. Optionally, enter the **Globs** which dictate which files will be synced. This is a regular expression that allows Airbyte to pattern match the specific files to replicate. If you are replicating all the files within your bucket, use `**` as the pattern. For more precise pattern matching options, refer to the [Path Patterns section](#path-patterns) below.
+   4. (Optional) Enter the **Globs** which dictate which files will be synced. This is a regular expression that allows Airbyte to pattern match the specific files to replicate. If you are replicating all the files within your bucket, use `**` as the pattern. For more precise pattern matching options, refer to the [Globs section](#globs) below.
+   5. (Optional) Modify the **Days To Sync If History Is Full** value. This gives you control of the lookback window that we will use to determine which files to sync if the state history is full. Details are in the [State section](#state) below.
+   6. (Optional) If you want to enforce a specific schema, you can enter an **Input schema**. By default, this value is set to `{}` and will automatically infer the schema from the file\(s\) you are replicating. For details on providing a custom schema, refer to the [User Schema section](#user-schema).
+   7. (Optional) Select the **Schemaless** option to skip all validation of the records against a schema. If this option is selected, the schema will be `{"data": "object"}` and all downstream data will be nested in a "data" field. This is a good option if the schema of your records changes frequently.
+   8. (Optional) Select a **Validation Policy** to tell Airbyte how to handle records that do not match the schema. You may choose to emit the record anyway (fields that aren't present in the schema may not arrive at the destination), skip the record altogether, or wait until the next discovery (which will happen in the next 24 hours).
 6. **To authenticate your private bucket**:
    - If using an IAM role, enter the **AWS Role ARN**.
    - If using IAM user credentials, fill the **AWS Access Key ID** and **AWS Secret Access Key** fields with the appropriate credentials.
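The renumbered steps above describe the same per-stream settings that end up in the connector's configuration. As a rough illustration only, one stream's options might be laid out as below; the key names (`globs`, `days_to_sync_if_history_is_full`, `input_schema`, `schemaless`, `validation_policy`) are assumptions inferred from the UI labels in this diff, not something the commit specifies.

```python
# Illustrative sketch only: these key names are guessed from the UI labels above
# and may not match the connector's actual configuration schema.
example_stream = {
    "name": "daily_exports",                 # step 3: Name of the stream
    "format": {"filetype": "csv"},           # steps 1-2: File Format
    "globs": ["exports/**/*.csv"],           # step 4: Globs selecting which files to sync
    "days_to_sync_if_history_is_full": 3,    # step 5: lookback window once history is full
    "input_schema": "{}",                    # step 6: "{}" means infer the schema
    "schemaless": False,                     # step 7: skip schema validation entirely if True
    "validation_policy": "Emit Record",      # step 8: what to do with non-conforming records
}
```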
@@ -166,7 +169,7 @@ There is no predefined streams. The streams are based on content of your bucket.
 
 Please let us know any specific compressions you'd like to see support for next!
 
-## Path Patterns
+## Globs
 
 \(tl;dr -> path pattern syntax using [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/). GLOBSTAR and SPLIT flags are enabled.\)
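The renamed **Globs** section still points at [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/) with the GLOBSTAR and SPLIT flags, so a short sketch of how those flags behave may help readers test their patterns; the file paths below are invented for illustration.

```python
from wcmatch import glob

# GLOBSTAR lets "**" span directory separators; SPLIT lets one pattern string
# contain several alternatives separated by "|".
FLAGS = glob.GLOBSTAR | glob.SPLIT

print(glob.globmatch("path1/path2/myFolder/part1.csv", "**/*.csv", flags=FLAGS))    # True
print(glob.globmatch("path1/another_part1.csv", "myFolder/**/*.csv", flags=FLAGS))  # False
print(glob.globmatch("logs/2024/app.jsonl", "**/*.csv|**/*.jsonl", flags=FLAGS))    # True
```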

@@ -216,6 +219,11 @@ We want to pick up part1.csv, part2.csv and part3.csv \(excluding another_part1.
 
 As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
 
+## State
+
+To perform incremental syncs, Airbyte syncs files from oldest to newest. Each file that's synced (up to 10,000 files) will be added as an entry in a "history" section of the connection's state message.
+Once the history is full, we drop the oldest entries from it and only read files that were last modified between the date of the newest file in the history and `Days to Sync if History is Full` days prior.
+
 ## User Schema
 
 Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:
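Before the schema discussion continues, the lookback behavior that the new **State** section describes can be sketched in code. This is a minimal illustration of the two sentences above, under the assumption that the history maps file paths to last-modified timestamps; it is not the connector's actual implementation.

```python
from datetime import timedelta

HISTORY_LIMIT = 10_000  # per the text above, up to 10,000 files are kept in the history


def files_to_consider(candidates, history, days_to_sync_if_history_is_full):
    """Sketch of the lookback described in the State section (assumed data shapes).

    candidates: {file_path: last_modified datetime} listed from the bucket
    history:    {file_path: last_modified datetime} recorded in the connection state
    """
    if len(history) < HISTORY_LIMIT:
        # History still has room: any file not yet recorded is eligible.
        return [path for path in candidates if path not in history]

    # History is full: only read files modified within the window ending at the
    # newest file already recorded, looking back the configured number of days.
    newest = max(history.values())
    cutoff = newest - timedelta(days=days_to_sync_if_history_is_full)
    return [path for path, modified in candidates.items() if cutoff <= modified <= newest]
```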
@@ -250,14 +258,6 @@ Please note, the S3 Source connector used to infer schemas from all the availabl
 
 - **AWS Access Key ID**: One half of the [required credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) for accessing a private bucket.
 - **AWS Secret Access Key**: The other half of the [required credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) for accessing a private bucket.
-- **Path Prefix**: An optional string that limits the files returned by AWS when listing files to only those starting with the specified prefix. This is different than the **Path Pattern**, as the prefix is applied directly to the API call made to S3, rather than being filtered within Airbyte. **This is not a regular expression** and does not accept pattern-style symbols like wildcards (`*`). We recommend using this filter to improve performance if your bucket has many folders and files that are unrelated to the data you want to replicate, and all the relevant files will always reside under the specified prefix.
-  - Together with the **Path Pattern**, there are multiple ways to specify the files to sync. For example, all the following configurations are equivalent:
-    - **Prefix** = `<empty>`, **Pattern** = `path1/path2/myFolder/**/*`
-    - **Prefix** = `path1/`, **Pattern** = `path2/myFolder/**/*.csv`
-    - **Prefix** = `path1/path2/`, **Pattern** = `myFolder/**/*.csv`
-    - **Prefix** = `path1/path2/myFolder/`, **Pattern** = `**/*.csv`
-
-  - The ability to individually configure the prefix and pattern has been included to accommodate situations where you do not want to replicate the majority of the files in the bucket. If you are unsure of the best approach, you can safely leave the **Path Prefix** field empty and just [set the Path Pattern](#path-patterns) to meet your requirements.
 - **Endpoint**: An optional parameter that enables the use of non-Amazon S3 compatible services. If you are using the default Amazon service, leave this field blank.
 - **Start Date**: An optional parameter that marks a starting date and time in UTC for data replication. Any files that have _not_ been modified since this specified date/time will _not_ be replicated. Use the provided datepicker (recommended) or enter the desired date programmatically in the format `YYYY-MM-DDTHH:mm:ssZ`. Leaving this field blank will replicate data from all files that have not been excluded by the **Path Pattern** and **Path Prefix**.