Skip to content

[v24.3.x] storage: work around segment index overflows #26022

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

vbotbuildovich
Copy link
Collaborator

Backport of PR #25979

andrwng added 2 commits May 1, 2025 22:21
Once values begin coming in whose deltas are beyond uint32_t::max, the
index is basically unusable (queries have somewhat undefined behavior).

This updates the index to skip adding values if the offset delta would
overflow.

Prior to this change, one included test would fail with:

```
src/v/storage/tests/log_segment_reader_test.cc(376): error: in "test_ghost_read_with_index_overflow": check first.last_offset() == model::offset{100} has failed [2147483746 != 100]
src/v/storage/tests/log_segment_reader_test.cc(377): error: in "test_ghost_read_with_index_overflow": check first.header().type == model::record_batch_type::raft_data has failed [batch_type::ghost_batch != batch_type::raft_data]
```

This test failing means that a recovery from such an overflowed topic
could end up missing some batches.

Another included test would fail, this one relying on behavior addressed
by redpanda-data#25974 to cause
trouble:

```
src/v/storage/tests/log_segment_reader_test.cc(382): error: in "test_read_with_index_overflow_base": check first.base_offset() == model::offset{uint32_max} has failed [8589934591 != 4294967295]
src/v/storage/tests/log_segment_reader_test.cc(383): error: in "test_read_with_index_overflow_base": check first.last_offset() == model::offset{uint32_max} has failed [8589934591 != 4294967295]
```

This test failing means that a compaction could attempt to read a whole
segment but only get a part of a segment, and we could end up removing
batches incorrectly.

Just an interesting tidbit, these tests pass if using the batch cache.
This is relevant as we have been chasing issues with the
__consumer_offsets topic, which does not use the batch cache.

(cherry picked from commit 0ed8d64)
Indexes that have overflowed should not be used, because seeking within
is somewhat undefined when considering the potential of overflowed
uint32_t offset delta values.

This commit tweaks find_nearest() to return the first indexed entry if
this is detected. This allows Redpanda to safely proceed even when there
are segment indexes that have overflowed from previous versions of
Redpanda.

Without this change, the included test would fail with:

```
src/v/storage/tests/index_state_test.cc(360): Entering test case "index_overflow"
src/v/storage/tests/index_state_test.cc(377): error: in "index_overflow": check res->offset == model::offset{0} has failed [9 != 0]
src/v/storage/tests/index_state_test.cc(378): error: in "index_overflow": check res->filepos == 1 has failed [4 != 1]
src/v/storage/tests/index_state_test.cc(360): Leaving test case "index_overflow"; testing time: 77us
```

(cherry picked from commit fc1e8bf)
@vbotbuildovich vbotbuildovich added this to the v24.3.x-next milestone May 1, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label May 1, 2025
@vbotbuildovich vbotbuildovich requested a review from andrwng May 1, 2025 22:21
@secpanda
Copy link

secpanda commented May 1, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@secpanda
Copy link

secpanda commented May 1, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@secpanda
Copy link

secpanda commented May 1, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@vbotbuildovich
Copy link
Collaborator Author

vbotbuildovich commented May 2, 2025

Retry command for Build#65428

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/compaction_recovery_test.py::CompactionRecoveryTest.test_index_recovery
tests/rptest/tests/shard_placement_test.py::ShardPlacementTest.test_node_join@{"disable_license":true}
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest.test_fuzz_admin_ops
tests/rptest/tests/topic_creation_test.py::RecreateTopicMetadataTest.test_recreated_topic_metadata_are_valid@{"replication_factor":3}

@vbotbuildovich
Copy link
Collaborator Author

CI test results

test results on build#65428
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/65428#01968e38-fa51-4a1d-9f0a-c2440c573554 FLAKY 12/21
rptest.tests.partition_balancer_test.PartitionBalancerTest.test_fuzz_admin_ops ducktape https://buildkite.com/redpanda/redpanda/builds/65428#01968e38-fa50-41df-b03b-e7ba7901caa7 FLAKY 19/21
rptest.tests.shard_placement_test.ShardPlacementTest.test_node_join.disable_license=True ducktape https://buildkite.com/redpanda/redpanda/builds/65428#01968e4b-8f6b-41c4-94fb-6c73a026f95a FLAKY 20/21

@piyushredpanda piyushredpanda merged commit afb9d59 into redpanda-data:v24.3.x May 2, 2025
17 checks passed
@piyushredpanda piyushredpanda modified the milestones: v24.3.x-next, v24.3.12 May 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda kind/backport PRs targeting a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants