File CDK: update spec & config with new CSV options #28133

Closed
@clnoll

Description

The existing S3 connector offers a number of options for configuring connections for CSV file types. To ensure backwards compatibility, we'll want to update the AbstractFileBasedSpec and config adapter to handle them.

In #28131, we're creating a new S3 FileBasedConfig object. This ticket involves extending that object to handle CSV-specific options, and will also require the creation of a custom parser that handles the old options.
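As a rough illustration of the shape this could take, here is a minimal sketch of a CSV-specific format section for the new config object. The field names mirror the options enumerated below; the class name, defaults, and use of a plain dataclass are assumptions for illustration only (the real spec classes are pydantic models under AbstractFileBasedSpec).

```python
# Hypothetical sketch only: class name, defaults, and dataclass usage are
# assumptions; the actual file-based CDK spec is pydantic-based.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CsvFormat:
    # Options carried over directly from the legacy S3 CSV config
    delimiter: str = ","
    quote_char: str = '"'
    escape_char: Optional[str] = None
    encoding: str = "utf8"
    double_quote: bool = True
    newlines_in_values: bool = False
    # Options surfaced from the legacy additional_reader_options /
    # advanced_options JSON blobs (see below)
    null_values: Optional[List[str]] = None
    skip_rows: int = 0
```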

  • Verify that all options that we still support are appropriately mapped to the name in the file-based CDK:

    • delimiter
    • quote_char
    • escape_char
    • encoding
    • double_quote
    • newlines_in_values
  • Handle options that we don't offer in the new file-based CDK

    • infer_datatypes: Configures whether a schema for the source should be inferred from the current data. If this is set to True, we'll want to infer & cast types even if the user has not provided a schema. Unfortunately, this is set to True for the vast majority of connectors, so we should handle it rather than let its removal be a breaking change.
    • additional_reader_options: Options provided to the CSV reader. Only a handful of connectors have these set. The file-based CDK will be updated to support the following:
      • strings_can_be_null: this should always be True.
      • null_values: this should be offered as a CSV-specific config option, so the spec should be updated accordingly.
      • We should confirm that these options are not necessary: autogenerate_column_names, compression, include_missing_columns, and check_utf8.
      • One connector is using {"column_types":{"Zipcode": "string"}}. Because only a single connector uses this, we should consider deprecating the option.
      • Double-check to verify that we have a plan to either support or deprecate all additional_reader_options that are in use by connectors in cloud.
    • advanced_options: Options passed through to pyarrow, used by a handful of connectors. Instead of blindly forwarding these options to pyarrow, we should deliberately surface those that we want to support and deprecate the rest, as follows.
      • column_types: this allows us to support headerless CSVs. We should surface it as an option in the CSV-specific section of the spec.
      • skip_rows & skip_rows_after_names: select one of these and offer it as a CSV-specific config option. (For existing connectors, we should be able to support both by calculating the value for skip_rows based on skip_rows_after_names, or vice versa.)
      • Verify that encoding is already handled, and that compression will be handled by the stream reader without requiring additional config options.
      • Double-check to verify that we have a plan to either support or deprecate all advanced_options that are in use by connectors in cloud.
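The mapping above could be sketched as a small adapter that translates a legacy S3 CSV config block into the new file-based shape. The legacy key names come from the options listed above (additional_reader_options and advanced_options are JSON-encoded strings in the legacy spec); the function name and the output layout are assumptions for illustration, not the actual adapter interface.

```python
# Hypothetical adapter sketch: translates legacy S3 CSV options into the
# new file-based config shape. Function name and output keys are assumed.
import json
from typing import Any, Dict

# Options whose names map 1:1 between the legacy spec and the file-based CDK
DIRECT_FIELDS = (
    "delimiter",
    "quote_char",
    "escape_char",
    "encoding",
    "double_quote",
    "newlines_in_values",
)


def convert_legacy_csv_options(legacy: Dict[str, Any]) -> Dict[str, Any]:
    converted: Dict[str, Any] = {"filetype": "csv"}
    for field in DIRECT_FIELDS:
        if field in legacy:
            converted[field] = legacy[field]

    # additional_reader_options is a JSON-encoded string in the legacy spec;
    # surface only the keys we plan to keep supporting.
    reader_opts = json.loads(legacy.get("additional_reader_options") or "{}")
    if "null_values" in reader_opts:
        converted["null_values"] = reader_opts["null_values"]
    # Per the notes above, strings_can_be_null should always be True.
    converted["strings_can_be_null"] = True

    # advanced_options is likewise a JSON-encoded string. How skip_rows and
    # skip_rows_after_names collapse into a single option is this ticket's
    # open question; here they are simply passed through unchanged.
    advanced = json.loads(legacy.get("advanced_options") or "{}")
    for key in ("skip_rows", "skip_rows_after_names", "column_types"):
        if key in advanced:
            converted[key] = advanced[key]
    return converted
```

Any legacy keys that fall outside these mappings would be the ones flagged for deprecation or a breaking-change notice.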

Acceptance Criteria

  • The existing CSV config options are mapped and handled appropriately by the S3 connector.
  • Any options that we cannot support are identified, along with the connectors that will be impacted.
