File CDK: update spec & config with new CSV options #28133

Closed
@clnoll

Description

The existing S3 connector offers a number of options for configuring connections for CSV file types. To ensure backwards compatibility, we'll want to update the AbstractFileBasedSpec and config adapter to handle them.

In #28131, we're creating a new S3 FileBasedConfig object. This ticket involves extending that object to handle CSV-specific options, and will also require the creation of a custom parser that handles the old options.
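As a rough illustration of the shape this could take, here is a minimal sketch of a CSV-specific format section for the new config object. The field names mirror the options enumerated below; the class name, defaults, and use of a plain dataclass are assumptions for illustration only (the real spec classes are pydantic models under AbstractFileBasedSpec).

```python
# Hypothetical sketch only: class name, defaults, and dataclass usage are
# assumptions; the actual file-based CDK spec is pydantic-based.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CsvFormat:
    # Options carried over directly from the legacy S3 CSV config
    delimiter: str = ","
    quote_char: str = '"'
    escape_char: Optional[str] = None
    encoding: str = "utf8"
    double_quote: bool = True
    newlines_in_values: bool = False
    # Options surfaced from the legacy additional_reader_options /
    # advanced_options JSON blobs (see below)
    null_values: Optional[List[str]] = None
    skip_rows: int = 0
```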

  • Verify that all options that we still support are appropriately mapped to the name in the file-based CDK:

    • delimiter
    • quote_char
    • escape_char
    • encoding
    • double_quote
    • newlines_in_values
  • Handle options that we don't offer in the new file-based CDK

    • infer_datatypes: Configures whether a schema for the source should be inferred from the current data. If this is set to True, we'll want to infer & cast types even if the user has not provided a schema. Unfortunately, this is set to True for the vast majority of connectors, so we should handle it rather than let its removal be a breaking change.
    • additional_reader_options: Options provided to the CSV reader. Only a handful of connectors have these set. The file-based CDK will be updated to support the following:
      • strings_can_be_null: this should always be True.
      • null_values: this should be offered as a CSV-specific config option, so the spec should be updated accordingly.
      • We should confirm that these options are not necessary: autogenerate_column_names, compression, include_missing_columns, and check_utf8.
      • One connector is using {"column_types":{"Zipcode": "string"}}. Because only a single connector uses this, we should consider deprecating the option.
      • Double-check to verify that we have a plan to either support or deprecate all additional_reader_options that are in use by connectors in cloud.
    • advanced_options: Options passed through to pyarrow, used by a handful of connectors. Instead of blindly forwarding these options to pyarrow, we should deliberately surface those that we want to support and deprecate the rest, as follows.
      • column_types: this allows us to support headerless CSVs. We should surface it as an option in the CSV-specific section of the spec.
      • skip_rows & skip_rows_after_names: select one of these and offer it as a CSV-specific config option. (For existing connectors, we should be able to support both by calculating the value for skip_rows based on skip_rows_after_names, or vice versa.)
      • Verify that encoding is already handled, and that compression will be handled by the stream reader without requiring additional config options.
      • Double-check to verify that we have a plan to either support or deprecate all advanced_options that are in use by connectors in cloud.
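The mapping above could be sketched as a small adapter that translates a legacy S3 CSV config block into the new file-based shape. The legacy key names come from the options listed above (additional_reader_options and advanced_options are JSON-encoded strings in the legacy spec); the function name and the output layout are assumptions for illustration, not the actual adapter interface.

```python
# Hypothetical adapter sketch: translates legacy S3 CSV options into the
# new file-based config shape. Function name and output keys are assumed.
import json
from typing import Any, Dict

# Options whose names map 1:1 between the legacy spec and the file-based CDK
DIRECT_FIELDS = (
    "delimiter",
    "quote_char",
    "escape_char",
    "encoding",
    "double_quote",
    "newlines_in_values",
)


def convert_legacy_csv_options(legacy: Dict[str, Any]) -> Dict[str, Any]:
    converted: Dict[str, Any] = {"filetype": "csv"}
    for field in DIRECT_FIELDS:
        if field in legacy:
            converted[field] = legacy[field]

    # additional_reader_options is a JSON-encoded string in the legacy spec;
    # surface only the keys we plan to keep supporting.
    reader_opts = json.loads(legacy.get("additional_reader_options") or "{}")
    if "null_values" in reader_opts:
        converted["null_values"] = reader_opts["null_values"]
    # Per the notes above, strings_can_be_null should always be True.
    converted["strings_can_be_null"] = True

    # advanced_options is likewise a JSON-encoded string. How skip_rows and
    # skip_rows_after_names collapse into a single option is this ticket's
    # open question; here they are simply passed through unchanged.
    advanced = json.loads(legacy.get("advanced_options") or "{}")
    for key in ("skip_rows", "skip_rows_after_names", "column_types"):
        if key in advanced:
            converted[key] = advanced[key]
    return converted
```

Any legacy keys that fall outside these mappings would be the ones flagged for deprecation or a breaking-change notice.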

Acceptance Criteria

  • The existing CSV config options are mapped and handled appropriately by the S3 connector.
  • Any options that we cannot support are identified, along with the connectors that will be impacted.
