Skip to content

file cdk: handle scalar values that resolve to None #35688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 5, 2024

Conversation

erohmensing
Copy link
Contributor

@erohmensing erohmensing commented Feb 28, 2024

What

How

Handle cases where the python value of a pyarrow scalar is None. This can be due to null values in data, as well as null-like values like NaT (similar to NaN). We previously handled this for None binary types, but now handle this for None of any type.

🚨 User Impact 🚨

No breaking changes. After this CDK version is released we should update the CDK dependency in S3 and any other file sources that parse parquet

Pre-merge Actions

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Connector version is set to 0.0.1
    • Dockerfile has version 0.0.1
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog with an entry for the initial version. See changelog example
    • docs/integrations/README.md

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Unit & integration tests added

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:generateScaffolds then checking in your changes
  • Documentation which references the generator is updated as needed
Updating the Python CDK

Airbyter

Before merging:

  • Pull Request description explains what problem it is solving
  • Code change is unit tested
  • Build and my-py check pass
  • Smoke test the change on at least one affected connector
    • On Github: Run this workflow, passing --use-local-cdk --name=source-<connector> as options
    • Locally: airbyte-ci connectors --use-local-cdk --name=source-<connector> test
  • PR is reviewed and approved

After merging:

  • Publish the CDK
    • The CDK does not follow proper semantic versioning. Choose minor if this the change has significant user impact or is a breaking change. Choose patch otherwise.
    • Write a thoughtful changelog message so we know what was updated.
  • Merge the platform PR that was auto-created for updating the Connector Builder's CDK version
    • This step is optional if the change does not affect the connector builder or declarative connectors.

Copy link

vercel bot commented Feb 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Mar 5, 2024 4:57pm

Copy link
Contributor Author

erohmensing commented Feb 28, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @erohmensing and the rest of your teammates on Graphite Graphite

Comment on lines +193 to +241
@pytest.mark.parametrize(
"parquet_type, parquet_format",
[
pytest.param(pa.bool_(), _default_parquet_format, id="test_parquet_bool"),
pytest.param(pa.int8(), _default_parquet_format, id="test_parquet_int8"),
pytest.param(pa.int16(), _default_parquet_format, id="test_parquet_int16"),
pytest.param(pa.int32(), _default_parquet_format, id="test_parquet_int32"),
pytest.param(pa.int64(), _default_parquet_format, id="test_parquet_int64"),
pytest.param(pa.uint8(), _default_parquet_format, id="test_parquet_uint8"),
pytest.param(pa.uint16(), _default_parquet_format, id="test_parquet_uint16"),
pytest.param(pa.uint32(), _default_parquet_format, id="test_parquet_uint32"),
pytest.param(pa.uint64(), _default_parquet_format, id="test_parquet_uint64"),
pytest.param(pa.float16(), _default_parquet_format, id="test_parquet_float16"),
pytest.param(pa.float32(), _default_parquet_format, id="test_parquet_float32"),
pytest.param(pa.float64(), _default_parquet_format, id="test_parquet_float64"),
pytest.param(pa.time32("s"), _default_parquet_format, id="test_parquet_time32s"),
pytest.param(pa.time32("ms"), _default_parquet_format, id="test_parquet_time32ms"),
pytest.param(pa.time64("us"), _default_parquet_format, id="test_parquet_time64us"),
pytest.param(pa.time64("ns"), _default_parquet_format, id="test_parquet_time64ns"),
pytest.param(pa.timestamp("s"), _default_parquet_format, id="test_parquet_timestamps_s"),
pytest.param(pa.timestamp("ms"), _default_parquet_format, id="test_parquet_timestamp_ms"),
pytest.param(pa.timestamp("s", "utc"), _default_parquet_format, id="test_parquet_timestamps_s_with_tz"),
pytest.param(pa.timestamp("ms", "est"), _default_parquet_format, id="test_parquet_timestamps_ms_with_tz"),
pytest.param(pa.date32(), _default_parquet_format, id="test_parquet_date32"),
pytest.param(pa.date64(), _default_parquet_format, id="test_parquet_date64"),
pytest.param(pa.duration("s"), _default_parquet_format, id="test_duration_s"),
pytest.param(pa.duration("ms"), _default_parquet_format, id="test_duration_ms"),
pytest.param(pa.duration("us"), _default_parquet_format, id="test_duration_us"),
pytest.param(pa.duration("ns"), _default_parquet_format, id="test_duration_ns"),
pytest.param(pa.month_day_nano_interval(), _default_parquet_format, id="test_parquet_month_day_nano_interval"),
pytest.param(pa.binary(), _default_parquet_format, id="test_binary"),
pytest.param(pa.binary(2), _default_parquet_format, id="test_fixed_size_binary"),
pytest.param(pa.string(), _default_parquet_format, id="test_parquet_string"),
pytest.param(pa.utf8(), _default_parquet_format, id="test_utf8"),
pytest.param(pa.large_binary(), _default_parquet_format, id="test_large_binary"),
pytest.param(pa.large_string(), _default_parquet_format, id="test_large_string"),
pytest.param(pa.large_utf8(), _default_parquet_format, id="test_large_utf8"),
pytest.param(pa.dictionary(pa.int32(), pa.string()), _default_parquet_format, id="test_dictionary"),
pytest.param(pa.struct([pa.field("field", pa.int32())]), _default_parquet_format, id="test_struct"),
pytest.param(pa.list_(pa.int32()), _default_parquet_format, id="test_list"),
pytest.param(pa.large_list(pa.int32()), _default_parquet_format, id="test_large_list"),
pytest.param(pa.decimal128(2), _default_parquet_format, id="test_decimal128"),
pytest.param(pa.decimal256(2), _default_parquet_format, id="test_decimal256"),
pytest.param(pa.decimal128(2), _decimal_as_float_parquet_format, id="test_decimal128_as_float"),
pytest.param(pa.decimal256(2), _decimal_as_float_parquet_format, id="test_decimal256_as_float"),
pytest.param(pa.map_(pa.int32(), pa.int32()), _default_parquet_format, id="test_map"),
pytest.param(pa.null(), _default_parquet_format, id="test_null"),
])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😻

@erohmensing erohmensing force-pushed the ella/reproduce-parquet-issue branch from b586f42 to b6af5da Compare March 4, 2024 21:31
@erohmensing erohmensing requested a review from clnoll March 4, 2024 22:09
Copy link
Contributor Author

@clnoll thanks for looking at this! Think it's ready for review (and it's downstack PR)

@erohmensing erohmensing marked this pull request as ready for review March 4, 2024 22:53
@erohmensing erohmensing requested a review from a team March 4, 2024 22:53
Copy link
Contributor

@girarda girarda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Copy link
Contributor

@clnoll clnoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor Author

erohmensing commented Mar 5, 2024

Merge activity

  • Mar 5, 11:56 AM EST: @erohmensing started a stack merge that includes this pull request via Graphite.
  • Mar 5, 11:57 AM EST: Graphite rebased this pull request as part of a merge.
  • Mar 5, 12:07 PM EST: @erohmensing merged this pull request with Graphite.

Base automatically changed from ella/parquet-mypy to master March 5, 2024 16:56
erohmensing added a commit that referenced this pull request Mar 5, 2024
<!--
Thanks for your contribution! 
Before you submit the pull request, 
I'd like to kindly remind you to take a moment and read through our guidelines
to ensure that your contribution aligns with the type of contributions our project accepts.
All the information you need can be found here:
   https://docs.airbyte.com/contributing-to-airbyte/

We truly appreciate your interest in contributing to Airbyte,
and we're excited to see what you have to offer! 

If you have any questions or need any assistance, feel free to reach out in #contributions Slack channel.
-->

## What
* Fix typing and handling of different types in `_to_output_value` - we don't always get a `Scalar`. We already handle the different cases correctly, but the typing doesn't reflect this. 
* Splitting out the methods to do the scalar separately is a helpful precursor to #35688, as the `DictionaryArray` object doesn't have an `as_py()` method.

## 🚨 User Impact 🚨
None

## Pre-merge Actions
*Expand the relevant checklist and delete the others.*

<details><summary><strong>New Connector</strong></summary>

### Community member or Airbyter

- **Community member?** Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`.
- Connector version is set to `0.0.1`
    - `Dockerfile` has version `0.0.1`
- Documentation updated
    - Connector's `README.md`
    - Connector's `bootstrap.md`. See [description and examples](https://docs.google.com/document/d/1ypdgmwmEHWv-TrO4_YOQ7pAJGVrMp5BOkEVh831N260/edit?usp=sharing)
    - `docs/integrations/<source or destination>/<name>.md` including changelog with an entry for the initial version. See changelog [example](https://docs.airbyte.io/integrations/sources/stripe#changelog)
    - `docs/integrations/README.md`

### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Updating a connector</strong></summary>

### Community member or Airbyter

- Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added


### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Connector Generator</strong></summary>

- Issue acceptance criteria met
- PR name follows [PR naming conventions](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook)
- If adding a new generator, add it to the [list of scaffold modules being tested](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connector-templates/generator/build.gradle#L41)
- The generator test modules (all connectors with `-scaffold` in their name) have been updated with the latest scaffold by running `./gradlew :airbyte-integrations:connector-templates:generator:generateScaffolds` then checking in your changes
- Documentation which references the generator is updated as needed

</details>

<details><summary><strong>Updating the Python CDK</strong></summary>

### Airbyter

Before merging:
- Pull Request description explains what problem it is solving
- Code change is unit tested
- Build and my-py check pass
- Smoke test the change on at least one affected connector
   - On Github: Run [this workflow](https://github.com/airbytehq/airbyte/actions/workflows/connectors_tests.yml), passing `--use-local-cdk --name=source-<connector>` as options
   - Locally: `airbyte-ci connectors --use-local-cdk --name=source-<connector> test`
- PR is reviewed and approved
      
After merging:
- [Publish the CDK](https://github.com/airbytehq/airbyte/actions/workflows/publish-cdk-command-manually.yml)
   - The CDK does not follow proper semantic versioning. Choose minor if this the change has significant user impact or is a breaking change. Choose patch otherwise.
   - Write a thoughtful changelog message so we know what was updated.
- Merge the platform PR that was auto-created for updating the Connector Builder's CDK version
   - This step is optional if the change does not affect the connector builder or declarative connectors.

</details>
@erohmensing erohmensing force-pushed the ella/reproduce-parquet-issue branch from b6af5da to 2801326 Compare March 5, 2024 16:57
@erohmensing erohmensing merged commit a090088 into master Mar 5, 2024
@erohmensing erohmensing deleted the ella/reproduce-parquet-issue branch March 5, 2024 17:07
xiaohansong pushed a commit that referenced this pull request Mar 7, 2024
<!--
Thanks for your contribution! 
Before you submit the pull request, 
I'd like to kindly remind you to take a moment and read through our guidelines
to ensure that your contribution aligns with the type of contributions our project accepts.
All the information you need can be found here:
   https://docs.airbyte.com/contributing-to-airbyte/

We truly appreciate your interest in contributing to Airbyte,
and we're excited to see what you have to offer! 

If you have any questions or need any assistance, feel free to reach out in #contributions Slack channel.
-->

## What
* Fix typing and handling of different types in `_to_output_value` - we don't always get a `Scalar`. We already handle the different cases correctly, but the typing doesn't reflect this. 
* Splitting out the methods to do the scalar separately is a helpful precursor to #35688, as the `DictionaryArray` object doesn't have an `as_py()` method.

## 🚨 User Impact 🚨
None

## Pre-merge Actions
*Expand the relevant checklist and delete the others.*

<details><summary><strong>New Connector</strong></summary>

### Community member or Airbyter

- **Community member?** Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`.
- Connector version is set to `0.0.1`
    - `Dockerfile` has version `0.0.1`
- Documentation updated
    - Connector's `README.md`
    - Connector's `bootstrap.md`. See [description and examples](https://docs.google.com/document/d/1ypdgmwmEHWv-TrO4_YOQ7pAJGVrMp5BOkEVh831N260/edit?usp=sharing)
    - `docs/integrations/<source or destination>/<name>.md` including changelog with an entry for the initial version. See changelog [example](https://docs.airbyte.io/integrations/sources/stripe#changelog)
    - `docs/integrations/README.md`

### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Updating a connector</strong></summary>

### Community member or Airbyter

- Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added


### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Connector Generator</strong></summary>

- Issue acceptance criteria met
- PR name follows [PR naming conventions](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook)
- If adding a new generator, add it to the [list of scaffold modules being tested](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connector-templates/generator/build.gradle#L41)
- The generator test modules (all connectors with `-scaffold` in their name) have been updated with the latest scaffold by running `./gradlew :airbyte-integrations:connector-templates:generator:generateScaffolds` then checking in your changes
- Documentation which references the generator is updated as needed

</details>

<details><summary><strong>Updating the Python CDK</strong></summary>

### Airbyter

Before merging:
- Pull Request description explains what problem it is solving
- Code change is unit tested
- Build and my-py check pass
- Smoke test the change on at least one affected connector
   - On Github: Run [this workflow](https://github.com/airbytehq/airbyte/actions/workflows/connectors_tests.yml), passing `--use-local-cdk --name=source-<connector>` as options
   - Locally: `airbyte-ci connectors --use-local-cdk --name=source-<connector> test`
- PR is reviewed and approved
      
After merging:
- [Publish the CDK](https://github.com/airbytehq/airbyte/actions/workflows/publish-cdk-command-manually.yml)
   - The CDK does not follow proper semantic versioning. Choose minor if this the change has significant user impact or is a breaking change. Choose patch otherwise.
   - Write a thoughtful changelog message so we know what was updated.
- Merge the platform PR that was auto-created for updating the Connector Builder's CDK version
   - This step is optional if the change does not affect the connector builder or declarative connectors.

</details>
xiaohansong pushed a commit that referenced this pull request Mar 7, 2024
<!--
Thanks for your contribution! 
Before you submit the pull request, 
I'd like to kindly remind you to take a moment and read through our guidelines
to ensure that your contribution aligns with the type of contributions our project accepts.
All the information you need can be found here:
   https://docs.airbyte.com/contributing-to-airbyte/

We truly appreciate your interest in contributing to Airbyte,
and we're excited to see what you have to offer! 

If you have any questions or need any assistance, feel free to reach out in #contributions Slack channel.
-->

## What
* Closes #34151
* Closes https://github.com/airbytehq/oncall/issues/4386

## How
Handle cases where the python value of a pyarrow scalar is None. This can be due to null values in data, as well as null-like values like `NaT` (similar to `NaN`). We previously handled this for `None` binary types, but now handle this for `None` of any type.

## 🚨 User Impact 🚨
No breaking changes. After this CDK version is released we should update the CDK dependency in S3 and any other file sources that parse parquet


## Pre-merge Actions
*Expand the relevant checklist and delete the others.*

<details><summary><strong>New Connector</strong></summary>

### Community member or Airbyter

- **Community member?** Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`.
- Connector version is set to `0.0.1`
    - `Dockerfile` has version `0.0.1`
- Documentation updated
    - Connector's `README.md`
    - Connector's `bootstrap.md`. See [description and examples](https://docs.google.com/document/d/1ypdgmwmEHWv-TrO4_YOQ7pAJGVrMp5BOkEVh831N260/edit?usp=sharing)
    - `docs/integrations/<source or destination>/<name>.md` including changelog with an entry for the initial version. See changelog [example](https://docs.airbyte.io/integrations/sources/stripe#changelog)
    - `docs/integrations/README.md`

### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Updating a connector</strong></summary>

### Community member or Airbyter

- Grant edit access to maintainers ([instructions](https://docs.github.com/en/github/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests))
- Unit & integration tests added


### Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

- Create a non-forked branch based on this PR and test the below items on it
- Build is successful
- If new credentials are required for use in CI, add them to GSM. [Instructions](https://docs.airbyte.io/connector-development#using-credentials-in-ci).

</details>

<details><summary><strong>Connector Generator</strong></summary>

- Issue acceptance criteria met
- PR name follows [PR naming conventions](https://docs.airbyte.com/contributing-to-airbyte/resources/pull-requests-handbook)
- If adding a new generator, add it to the [list of scaffold modules being tested](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connector-templates/generator/build.gradle#L41)
- The generator test modules (all connectors with `-scaffold` in their name) have been updated with the latest scaffold by running `./gradlew :airbyte-integrations:connector-templates:generator:generateScaffolds` then checking in your changes
- Documentation which references the generator is updated as needed

</details>

<details><summary><strong>Updating the Python CDK</strong></summary>

### Airbyter

Before merging:
- Pull Request description explains what problem it is solving
- Code change is unit tested
- Build and my-py check pass
- Smoke test the change on at least one affected connector
   - On Github: Run [this workflow](https://github.com/airbytehq/airbyte/actions/workflows/connectors_tests.yml), passing `--use-local-cdk --name=source-<connector>` as options
   - Locally: `airbyte-ci connectors --use-local-cdk --name=source-<connector> test`
- PR is reviewed and approved
      
After merging:
- [Publish the CDK](https://github.com/airbytehq/airbyte/actions/workflows/publish-cdk-command-manually.yml)
   - The CDK does not follow proper semantic versioning. Choose minor if this the change has significant user impact or is a breaking change. Choose patch otherwise.
   - Write a thoughtful changelog message so we know what was updated.
- Merge the platform PR that was auto-created for updating the Connector Builder's CDK version
   - This step is optional if the change does not affect the connector builder or declarative connectors.

</details>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Source S3 fails to load data from parquet file if value is NoneType
4 participants