Skip to content

[v25.1.x] storage: fix index state truncate overflow #26431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: v25.1.x
Choose a base branch
from

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Jun 11, 2025

Backport of #26051
Fixes #26193

CONFLICT:

  • previous test in index_state.cc was different from dev

Similar to #25979, when an index contains state that leads it to overflow, it is unreliable. In this case, this could affect truncation, and result in us removing incorrect index entries.

So, this makes the truncate avoid removing entries if this case is detected, with the expectation that no entries are added that have overflowed.

Note, this bug doesn't appear to be as severe as #25979 in that:

  • suffix truncation is generally more rare
  • the overflowing index problem is generally a problem with:
    • large segments, much larger than the default allowed size
    • compacted segments, which we expect to be relatively rare to suffix truncate with tiered storage

Nonetheless, some kind of fix seems appropriate because it's difficult to reason about the overflow otherwise.

(cherry picked from commit e882d67)

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none

CONFLICT:
- previous test in index_state.cc was different from dev

Similar to redpanda-data#25979, when an
index contains state that leads it to overflow, it is unreliable. In
this case, this could affect truncation, and result in us removing
incorrect index entries.

So, this makes the truncate avoid removing entries if this case is
detected, with the expectation that no entries are added that have
overflowed.

Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that:
- suffix truncation is generally more rare
- the overflowing index problem is generally a problem with:
  - large segments, much larger than the default allowed size
  - compacted segments, which we expect to be relatively rare to
    suffix truncate with tiered storage

Nonetheless, some kind of fix seems appropriate because it's difficult
to reason about the overflow otherwise.

(cherry picked from commit e882d67)
@andrwng andrwng marked this pull request as ready for review June 11, 2025 19:02
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#67220
test_class test_method test_arguments test_kind job_url test_status passed reason
partition_balancer_simulator_test_rpunit partition_balancer_simulator_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/67220#01976060-3386-443a-a111-c6c3415bf805 FLAKY 1/2
CloudStorageChunkReadTest test_read_chunks ducktape https://buildkite.com/redpanda/redpanda/builds/67220#019760b7-f2a5-4dd7-accb-4b4ddfc0fe4e FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
DatalakeBatchingTest test_batching {"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "expect_large_files": true, "query_engine": "spark"} ducktape https://buildkite.com/redpanda/redpanda/builds/67220#019760bb-e80c-43d5-85a6-2d926d78b295 FLAKY 19/21 upstream reliability is '87.87878787878788'. current run reliability is '90.47619047619048'. drift is -2.5974 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": true, "with_tiered_storage": true} ducktape https://buildkite.com/redpanda/redpanda/builds/67220#019760bb-e80d-424b-9d3e-ac24a0c8e6ff FLAKY 20/21 upstream reliability is '96.96969696969697'. current run reliability is '95.23809523809523'. drift is 1.7316 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "enable_failures": true, "mixed_versions": true, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": true} ducktape https://buildkite.com/redpanda/redpanda/builds/67220#019760bb-e80c-43d5-85a6-2d926d78b295 FLAKY 17/21 upstream reliability is '100.0'. current run reliability is '80.95238095238095'. drift is 19.04762 and the allowed drift is set to 50. The test should PASS
RecreateTopicMetadataTest test_recreated_topic_metadata_are_valid {"replication_factor": 3} ducktape https://buildkite.com/redpanda/redpanda/builds/67220#019760bb-e80e-4364-a966-cc6275a8620c FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS

@andrwng andrwng requested review from dotnwat and mmaslankaprv June 13, 2025 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants