Closed
Description
The existing S3 connector offers a number of options for configuring connections for CSV file types. To ensure backwards compatibility, we'll want to update the AbstractFileBasedSpec and config adapter to handle them.
In #28131, we're creating a new S3 FileBasedConfig object. This ticket involves extending that object to handle CSV-specific options, and will also require the creation of a custom parser that handles the old options.
- Verify that all options that we still support are appropriately mapped to the name in the file-based CDK:
  - delimiter
  - quote_char
  - escape_char
  - encoding
  - double_quote
  - newlines_in_values
- Handle options that we don't offer in the new file-based CDK:
  - infer_datatypes: Configures whether a schema for the source should be inferred from the current data or not. If this is set to True, we'll want to infer & cast types even if the user has not provided a schema. Unfortunately this is set to True for the vast majority of connectors, so it does feel like we should handle it as opposed to letting this be a breaking change.
  - additional_reader_options: Options provided to the CSV reader. There are only a handful of connectors with these set. The file-based CDK will be updated to support the following:
    - strings_can_be_null: this should always be True.
    - null_values: this should be offered as a CSV-specific config option, so the spec should be updated accordingly.
    - We should confirm that these options are not necessary: autogenerate_column_names, compression, include_missing_columns, and check_utf8.
    - One connector is using {"column_types":{"Zipcode": "string"}}. Because this is a single connector we should consider deprecating this option.
    - Double-check to verify that we have a plan to either support or deprecate all additional_reader_options that are in use by connectors in cloud.
  - advanced_options: Options provided to Pyarrow, used by a handful of connectors. Instead of blindly passing these options to pyarrow, we should deliberately surface those that we want to support, and deprecate the rest, as follows:
    - column_types: this allows us to support headerless CSVs. We should surface it as an option in the CSV-specific section of the spec.
    - skip_rows & skip_rows_after_names: select one of these and offer it as a CSV-specific config option. (For existing connectors, we should be able to support both by calculating the value for skip_rows based on skip_rows_after_names or vice-versa.)
    - Verify that encoding is already handled, and that compression will be handled by the stream reader without requiring additional config options.
    - Double-check to verify that we have a plan to either support or deprecate all advanced_options that are in use by connectors in cloud.
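As a rough illustration of the handling described above, the sketch below converts a legacy CSV format block into a new-style dictionary and reports anything it does not know how to handle. Everything beyond the option names listed in this ticket is an assumption made for illustration: the function name `convert_legacy_csv_format`, the shape of the output dictionary, the idea that carried-over options keep the same names in the file-based CDK (which is exactly what this ticket asks us to verify), and the idea that `additional_reader_options` and `advanced_options` may arrive either as JSON strings or as dicts. It is not the actual config adapter.

```python
import json
from typing import Any, Dict, List, Tuple

# Options the ticket lists as still supported. Assumption: same name on both sides.
DIRECT_OPTIONS = (
    "delimiter", "quote_char", "escape_char",
    "encoding", "double_quote", "newlines_in_values",
)
# additional_reader_options the ticket expects to be unnecessary (pending confirmation).
READER_OPTIONS_TO_DROP = {
    "autogenerate_column_names", "compression",
    "include_missing_columns", "check_utf8",
}
# advanced_options the ticket proposes to surface in the CSV-specific spec.
ADVANCED_OPTIONS_TO_SURFACE = {"column_types", "skip_rows", "skip_rows_after_names"}
# advanced_options expected to be handled elsewhere (format encoding, stream reader).
ADVANCED_OPTIONS_HANDLED_ELSEWHERE = {"encoding", "compression"}


def _as_dict(raw: Any) -> Dict[str, Any]:
    """Accept a dict, a JSON string, or an empty value (hypothetical input shapes)."""
    if not raw:
        return {}
    return json.loads(raw) if isinstance(raw, str) else dict(raw)


def convert_legacy_csv_format(legacy: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (new_format, unhandled_option_names) for one legacy CSV config."""
    new_format: Dict[str, Any] = {}
    unhandled: List[str] = []

    for name in DIRECT_OPTIONS:
        if name in legacy:
            new_format[name] = legacy[name]

    # Carried over so a custom parser can decide whether to infer and cast types.
    if "infer_datatypes" in legacy:
        new_format["infer_datatypes"] = legacy["infer_datatypes"]

    # Per the ticket, strings_can_be_null is always True, whatever the user supplied.
    new_format["strings_can_be_null"] = True
    for key, value in _as_dict(legacy.get("additional_reader_options")).items():
        if key == "strings_can_be_null" or key in READER_OPTIONS_TO_DROP:
            continue
        if key == "null_values":
            new_format["null_values"] = value
        else:
            unhandled.append(f"additional_reader_options.{key}")

    for key, value in _as_dict(legacy.get("advanced_options")).items():
        if key in ADVANCED_OPTIONS_TO_SURFACE:
            new_format[key] = value
        elif key not in ADVANCED_OPTIONS_HANDLED_ELSEWHERE:
            unhandled.append(f"advanced_options.{key}")

    return new_format, unhandled
```

The sketch surfaces both skip_rows and skip_rows_after_names unchanged; which of the two to keep, and how to derive one from the other, is left open here as it is in the ticket. Returning the unhandled names separately, rather than raising, is a design choice that makes it easier to collect the list of options that still need a support-or-deprecate decision.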
Acceptance Criteria
- The existing CSV config options are mapped and handled appropriately by the S3 connector.
- Any options that we cannot support are identified, along with the connectors that will be impacted.
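One possible way to produce the list required by the second criterion is to scan the existing connector configs and group connectors by the options they use that fall outside a supported set. The sketch below assumes a hypothetical input shape (a mapping from connector ID to its legacy CSV format block, with `additional_reader_options` and `advanced_options` given as JSON strings or dicts) and a hypothetical helper name, `find_impacted_connectors`; how the configs are actually fetched from cloud is out of scope.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List, Set

OPTION_GROUPS = ("additional_reader_options", "advanced_options")


def find_impacted_connectors(
    configs: Dict[str, Dict[str, Any]],
    supported: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Map each unsupported "group.option" name to the connectors that use it."""
    impacted: Dict[str, List[str]] = defaultdict(list)
    for connector_id, legacy_format in configs.items():
        for group in OPTION_GROUPS:
            raw = legacy_format.get(group) or {}
            # Assumption: option groups may be stored as JSON strings or as dicts.
            options = json.loads(raw) if isinstance(raw, str) else raw
            for option in options:
                if option not in supported.get(group, set()):
                    impacted[f"{group}.{option}"].append(connector_id)
    return {name: sorted(ids) for name, ids in impacted.items()}


if __name__ == "__main__":
    # Made-up connector IDs and configs, purely for illustration.
    example_configs = {
        "connector-a": {"additional_reader_options": '{"column_types": {"Zipcode": "string"}}'},
        "connector-b": {"advanced_options": {"skip_rows": 1}},
    }
    example_supported = {
        "additional_reader_options": {"strings_can_be_null", "null_values"},
        "advanced_options": {"column_types", "skip_rows", "skip_rows_after_names"},
    }
    print(find_impacted_connectors(example_configs, example_supported))
```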