Skip to content

iceberg/rest_client: catch exception in retry() attempt #25273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 7, 2025

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Mar 6, 2025

We previously saw error logs when perfoming Iceberg snapshot removal around shutdown:

WARN  2025-03-06 16:50:15,657 [shard 1:main] http - [/iceberg/v1/main/namespaces/redpanda/tables/test~dlq] - client.cc:143 - make_request timed-out connection attempt /iceberg/v1/main/namespaces/redpanda/tables/test~dlq
WARN  2025-03-06 16:50:15,657 [shard 1:main] iceberg - rest_catalog.cc:62 - error returned when executing load_table - http_call_error: Exception during retry sleep: seastar::sleep_aborted (Sleep is aborted)
ERROR 2025-03-06 16:50:15,657 [shard 1:main] datalake - coordinator.cc:125 - Coordinator hit exception while running in term 1: seastar::abort_requested_exception (abort requested)

The issue is that the retry_chain_node::retry call checks the abort source. This commit wraps this call with a catch to return an appropriate non-exceptional error type.

Without this, datalake/schema_evolution_test.py was very flaky. With this change it at least appears to run fine locally.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

We previously saw error logs when perfoming Iceberg snapshot removal
around shutdown:

```
WARN  2025-03-06 16:50:15,657 [shard 1:main] http - [/iceberg/v1/main/namespaces/redpanda/tables/test~dlq] - client.cc:143 - make_request timed-out connection attempt /iceberg/v1/main/namespaces/redpanda/tables/test~dlq
WARN  2025-03-06 16:50:15,657 [shard 1:main] iceberg - rest_catalog.cc:62 - error returned when executing load_table - http_call_error: Exception during retry sleep: seastar::sleep_aborted (Sleep is aborted)
ERROR 2025-03-06 16:50:15,657 [shard 1:main] datalake - coordinator.cc:125 - Coordinator hit exception while running in term 1: seastar::abort_requested_exception (abort requested)
```

The issue is that the retry_chain_node::retry call checks the abort source.
This commit wraps this call with a catch to return an appropriate
non-exceptional error type.

Without this, datalake/schema_evolution_test.py was very flaky. With this
change it at least appears to run fine locally.
Copy link
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. seems like a genuinely stochastic failure, so presumably this was flaking the schema tests specifically because they are many.

@andrwng
Copy link
Contributor Author

andrwng commented Mar 6, 2025

lgtm. seems like a genuinely stochastic failure, so presumably this was flaking the schema tests specifically because they are many.

Yea this seems likely considering how many there are, plus parameterization

@andrwng andrwng enabled auto-merge March 6, 2025 23:09
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#62752
test_id test_kind job_url test_status passed
rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage.cleanup_policy=compact.delete ducktape https://buildkite.com/redpanda/redpanda/builds/62752#01956dd8-7c8f-42f6-99b7-bd044cd4a9ca FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/62752#01956dd8-7c8d-4379-9d64-f3c4d4b57212 FLAKY 1/2
rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_unavoidable_data_loss ducktape https://buildkite.com/redpanda/redpanda/builds/62752#01956dd8-7c8f-42f6-99b7-bd044cd4a9ca FLAKY 1/2
rptest.transactions.producers_api_test.ProducersAdminAPITest.test_producers_state_api_during_load ducktape https://buildkite.com/redpanda/redpanda/builds/62752#01956dbd-f322-46d6-9624-4125ed691bfd FLAKY 1/2

@andrwng andrwng merged commit 8af1891 into redpanda-data:dev Mar 7, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants