Skip to content

storage: work around segment index overflows #25979

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 1, 2025

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Apr 29, 2025

The segment index on offset deltas (difference between offset and segment's base offset) could overflow, resulting in very odd behavior, including returning ghost batches incorrectly which could result in lost batches during recovery.

To fix, this PR does two things:

  • stops indexing offset deltas above uint32_t::max
  • avoids using the index for offsets when we can tell the index spans more than uint32_t::max offsets

These changes feel pretty hacky, and our segment index still has a fundamental limitation for large segments. But this at least stops the bleeding.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Bug Fixes

  • Fixes an issue with local segment indexes that could result in incorrect Raft recovery, compaction, or fetches in topic partitions with very large offsets (greater than 4294967295).

@secpanda
Copy link

secpanda commented Apr 29, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@secpanda
Copy link

secpanda commented Apr 29, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@andrwng andrwng force-pushed the storage-index-overflow branch 4 times, most recently from 1cfcd7e to c7aa42f Compare April 30, 2025 02:25
@andrwng andrwng changed the title storage/tests: index overflow test storage: work around segment index overflows Apr 30, 2025
@andrwng andrwng marked this pull request as ready for review April 30, 2025 02:29
WillemKauf
WillemKauf previously approved these changes Apr 30, 2025
Copy link
Contributor

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whew. What a doozy

@andrwng andrwng force-pushed the storage-index-overflow branch 2 times, most recently from 38f34e8 to 1bd4af7 Compare April 30, 2025 02:58
WillemKauf
WillemKauf previously approved these changes Apr 30, 2025
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Apr 30, 2025

Retry command for Build#65276

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/data_migrations_api_test.py::DataMigrationsApiTest.test_conflicting_names
tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_scaling_up_with_recovered_topic
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeDelayedEnablementTest.test_enabling_iceberg_in_existing_cluster@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/cloud_storage_timing_stress_test.py::CloudStorageTimingStressTest.test_cloud_storage@{"cleanup_policy":"compact,delete"}
tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest.test_unavailable_nodes

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Apr 30, 2025

CI test results

test results on build#65276
test_id test_kind job_url test_status passed
rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage.cleanup_policy=compact.delete ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684fc-6af1-4674-a3c1-d5fa2ccefcfd FLAKY 19/21
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_conflicting_names ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684f8-7d09-48af-9d43-46f82536f8ff FLAKY 5/21
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_conflicting_names ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684fc-6af1-4674-a3c1-d5fa2ccefcfd FLAKY 5/21
rptest.tests.datalake.datalake_e2e_test.DatalakeDelayedEnablementTest.test_enabling_iceberg_in_existing_cluster.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684fc-6af1-4674-a3c1-d5fa2ccefcfd FLAKY 19/21
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684fc-6aef-4820-9ee2-52d34a06e422 FLAKY 17/21
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684fc-6aef-4820-9ee2-52d34a06e422 FLAKY 9/21
rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684fc-6af1-4674-a3c1-d5fa2ccefcfd FLAKY 7/21
rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False ducktape https://buildkite.com/redpanda/redpanda/builds/65276#019684f8-7d08-42e4-8a0a-a5146febed3b FLAKY 16/21
test results on build#65297
test_id test_kind job_url test_status passed
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_conflicting_names ducktape https://buildkite.com/redpanda/redpanda/builds/65297#019686f6-9e6e-423d-aec7-6d336cf8b575 FLAKY 4/21
rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption ducktape https://buildkite.com/redpanda/redpanda/builds/65297#019686f6-9e71-4c7d-b78a-106963fa40c2 FLAKY 19/21
test results on build#65323
test_id test_kind job_url test_status passed
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_conflicting_names ducktape https://buildkite.com/redpanda/redpanda/builds/65323#01968837-b14a-47d2-a472-60d46184f370 FLAKY 8/21
rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_conflicting_names ducktape https://buildkite.com/redpanda/redpanda/builds/65323#0196883c-b316-49e9-952a-e5679a375bb3 FLAKY 7/21
rptest.tests.partition_balancer_test.PartitionBalancerTest.test_recovery_mode_rebalance_finish ducktape https://buildkite.com/redpanda/redpanda/builds/65323#0196883c-b313-4800-a608-c1487204c841 FLAKY 20/21
rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes ducktape https://buildkite.com/redpanda/redpanda/builds/65323#0196883c-b313-4800-a608-c1487204c841 FLAKY 16/21
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments ducktape https://buildkite.com/redpanda/redpanda/builds/65323#0196883c-b313-4800-a608-c1487204c841 FLAKY 20/21
rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_consumer_interruption ducktape https://buildkite.com/redpanda/redpanda/builds/65323#0196883c-b316-49e9-952a-e5679a375bb3 FLAKY 9/21
rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False ducktape https://buildkite.com/redpanda/redpanda/builds/65323#01968837-b147-4695-a358-3f113cdc6dd5 FLAKY 13/21
test results on build#65386
test_id test_kind job_url test_status passed
rptest.tests.datalake.compaction_test.CompactionTest.test_compaction.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.NESSIE ducktape https://buildkite.com/redpanda/redpanda/builds/65386#01968b0d-edb6-4fba-96ab-c840210cb636 FLAKY 7/21
rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes ducktape https://buildkite.com/redpanda/redpanda/builds/65386#01968b0d-edb4-4354-8609-c5e3de214319 FLAKY 9/21
rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False ducktape https://buildkite.com/redpanda/redpanda/builds/65386#01968b0d-edb6-4fba-96ab-c840210cb636 FLAKY 15/21

@WillemKauf
Copy link
Contributor

/ci-repeat 1
skip-redpanda-build
skip-units
tests/rptest/tests/data_migrations_api_test.py::DataMigrationsApiTest.test_conflicting_names
tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_scaling_up_with_recovered_topic
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeDelayedEnablementTest.test_enabling_iceberg_in_existing_cluster@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/cloud_storage_timing_stress_test.py::CloudStorageTimingStressTest.test_cloud_storage@{"cleanup_policy":"compact,delete"}
tests/rptest/tests/simple_e2e_test.py::SimpleEndToEndTest.test_consumer_interruption
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest.test_unavailable_nodes

@vbotbuildovich
Copy link
Collaborator

Retry command for Build#65297

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/data_migrations_api_test.py::DataMigrationsApiTest.test_conflicting_names

@WillemKauf
Copy link
Contributor

/ci-repeat 1
skip-redpanda-build
skip-units
tests/rptest/tests/data_migrations_api_test.py::DataMigrationsApiTest.test_conflicting_names

@andrwng
Copy link
Contributor Author

andrwng commented Apr 30, 2025

Thanks for the retries Willem!


retval = true;
auto offset_delta = batch_base_offset() - base_offset();
if (offset_delta <= std::numeric_limits<uint32_t>::max()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • is it possible for this to be false in the case where retval is true and would it matter? the initial case of the index the index being empty seems like an important edge case where we might always want the first entry?

  • so skipping an entry is always safe right? it's only real effect is coarser granularity indexing?

Copy link
Contributor Author

@andrwng andrwng Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible for this to be false in the case where retval is true and would it matter?

Ah hm, it is possible, but I don't think it matters: it just affects whether callers reset their respective accumulators, which affects the next time add_entry is evaluated, but presumably all entries at this point aren't going to be added. That said, it is an oversight to have this return true in the || retval case. Will update.

the initial case of the index the index being empty seems like an important edge case where we might always want the first entry?

Hm this is an interesting point. I think if the index stays empty, this is fine because queries into the index will return std::nullopt and callers should read from the start of the segment. And the index should stay empty because we're relying on callers to call this with monotonically increasing records.

so skipping an entry is always safe right? it's only real effect is coarser granularity indexing?

That's my read on it, yeah

} else {
// We can't index anything beyond uint32 space. Presumably no
// further entries will be added because of this same condition.
return false;
Copy link
Contributor

@bharathv bharathv Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if any instrumentation is useful around this (probably makes it easy to monitor all topics hitting such large offset spaces in one place).

edit: I'm thinking metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably add a metric for this, but I don't think one exists other than examining HWM for huge partitions.

@andrwng andrwng force-pushed the storage-index-overflow branch from 87706cd to 66d27b2 Compare May 1, 2025 06:33
@andrwng
Copy link
Contributor Author

andrwng commented May 1, 2025

Latest push: rebase

@andrwng andrwng requested review from WillemKauf and bharathv May 1, 2025 06:36
andrwng added 2 commits May 1, 2025 00:13
Once values begin coming in whose deltas are beyond uint32_t::max, the
index is basically unusable (queries have somewhat undefined behavior).

This updates the index to skip adding values if the offset delta would
overflow.

Prior to this change, one included test would fail with:

```
src/v/storage/tests/log_segment_reader_test.cc(376): error: in "test_ghost_read_with_index_overflow": check first.last_offset() == model::offset{100} has failed [2147483746 != 100]
src/v/storage/tests/log_segment_reader_test.cc(377): error: in "test_ghost_read_with_index_overflow": check first.header().type == model::record_batch_type::raft_data has failed [batch_type::ghost_batch != batch_type::raft_data]
```

This test failing means that a recovery from such an overflowed topic
could end up missing some batches.

Another included test would fail, this one relying on behavior addressed
by redpanda-data#25974 to cause
trouble:

```
src/v/storage/tests/log_segment_reader_test.cc(382): error: in "test_read_with_index_overflow_base": check first.base_offset() == model::offset{uint32_max} has failed [8589934591 != 4294967295]
src/v/storage/tests/log_segment_reader_test.cc(383): error: in "test_read_with_index_overflow_base": check first.last_offset() == model::offset{uint32_max} has failed [8589934591 != 4294967295]
```

This test failing means that a compaction could attempt to read a whole
segment but only get a part of a segment, and we could end up removing
batches incorrectly.

Just an interesting tidbit, these tests pass if using the batch cache.
This is relevant as we have been chasing issues with the
__consumer_offsets topic, which does not use the batch cache.
Indexes that have overflowed should not be used, because seeking within
is somewhat undefined when considering the potential of overflowed
uint32_t offset delta values.

This commit tweaks find_nearest() to return the first indexed entry if
this is detected. This allows Redpanda to safely proceed even when there
are segment indexes that have overflowed from previous versions of
Redpanda.

Without this change, the included test would fail with:

```
src/v/storage/tests/index_state_test.cc(360): Entering test case "index_overflow"
src/v/storage/tests/index_state_test.cc(377): error: in "index_overflow": check res->offset == model::offset{0} has failed [9 != 0]
src/v/storage/tests/index_state_test.cc(378): error: in "index_overflow": check res->filepos == 1 has failed [4 != 1]
src/v/storage/tests/index_state_test.cc(360): Leaving test case "index_overflow"; testing time: 77us
```
@andrwng andrwng force-pushed the storage-index-overflow branch from 66d27b2 to fc1e8bf Compare May 1, 2025 07:13
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#65386

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/compaction_test.py::CompactionTest.test_compaction@{"catalog_type":"nessie","cloud_storage_type":1}

@andrwng
Copy link
Contributor Author

andrwng commented May 1, 2025

/ci-repeat 1
skip-redpanda-build
skip-units
tests/rptest/tests/datalake/compaction_test.py::CompactionTest.test_compaction@{"catalog_type":"nessie","cloud_storage_type":1}

@andrwng andrwng enabled auto-merge May 1, 2025 21:57
@andrwng andrwng merged commit ca81d67 into redpanda-data:dev May 1, 2025
19 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v25.1.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.3.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-25979-v24.2.x-177 remotes/upstream/v24.2.x
git cherry-pick -x 0ed8d64254 fc1e8bfe72

Workflow run logs.

andrwng added a commit to andrwng/redpanda that referenced this pull request May 5, 2025
Similar to redpanda-data#25979, when an
index contains state that leads it to overflow, it is unreliable. In
this case, this could affect truncation, and result in us removing
incorrect index entries.

So, this makes the truncate avoid removing entries if this case is
detected, with the expectation that no entries are added that have
overflowed.

Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that:
- suffix truncation is generally more rare
- the overflowing index problem is generally a problem with:
  - large segments, much larger than the default allowed size
  - compacted segments, which we expect to be relatively rare to
    suffix truncate with tiered storage

Nonetheless, some kind of fix seems appropriate because it's difficult
to reason about the overflow otherwise.
andrwng added a commit to andrwng/redpanda that referenced this pull request May 5, 2025
Similar to redpanda-data#25979, when an
index contains state that leads it to overflow, it is unreliable. In
this case, this could affect truncation, and result in us removing
incorrect index entries.

So, this makes the truncate avoid removing entries if this case is
detected, with the expectation that no entries are added that have
overflowed.

Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that:
- suffix truncation is generally more rare
- the overflowing index problem is generally a problem with:
  - large segments, much larger than the default allowed size
  - compacted segments, which we expect to be relatively rare to
    suffix truncate with tiered storage

Nonetheless, some kind of fix seems appropriate because it's difficult
to reason about the overflow otherwise.
andrwng added a commit to andrwng/redpanda that referenced this pull request Jun 11, 2025
CONFLICT:
- previous test in index_state.cc was different from dev

Similar to redpanda-data#25979, when an
index contains state that leads it to overflow, it is unreliable. In
this case, this could affect truncation, and result in us removing
incorrect index entries.

So, this makes the truncate avoid removing entries if this case is
detected, with the expectation that no entries are added that have
overflowed.

Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that:
- suffix truncation is generally more rare
- the overflowing index problem is generally a problem with:
  - large segments, much larger than the default allowed size
  - compacted segments, which we expect to be relatively rare to
    suffix truncate with tiered storage

Nonetheless, some kind of fix seems appropriate because it's difficult
to reason about the overflow otherwise.

(cherry picked from commit e882d67)
andrwng added a commit to andrwng/redpanda that referenced this pull request Jun 11, 2025
CONFLICT:
- previous test in index_state.cc was different from dev

Similar to redpanda-data#25979, when an
index contains state that leads it to overflow, it is unreliable. In
this case, this could affect truncation, and result in us removing
incorrect index entries.

So, this makes the truncate avoid removing entries if this case is
detected, with the expectation that no entries are added that have
overflowed.

Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that:
- suffix truncation is generally more rare
- the overflowing index problem is generally a problem with:
  - large segments, much larger than the default allowed size
  - compacted segments, which we expect to be relatively rare to
    suffix truncate with tiered storage

Nonetheless, some kind of fix seems appropriate because it's difficult
to reason about the overflow otherwise.

(cherry picked from commit e882d67)
andrwng added a commit to andrwng/redpanda that referenced this pull request Jun 11, 2025
CONFLICT:
- similar to redpanda-data#26024, the
  appropriate code in this branch is in segment_index instead of
  index_state
- because of this, the index_state unit test is removed

Similar to redpanda-data#25979, when an
index contains state that leads it to overflow, it is unreliable. In
this case, this could affect truncation, and result in us removing
incorrect index entries.

So, this makes the truncate avoid removing entries if this case is
detected, with the expectation that no entries are added that have
overflowed.

Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that:
- suffix truncation is generally more rare
- the overflowing index problem is generally a problem with:
  - large segments, much larger than the default allowed size
  - compacted segments, which we expect to be relatively rare to
    suffix truncate with tiered storage

Nonetheless, some kind of fix seems appropriate because it's difficult
to reason about the overflow otherwise.

(cherry picked from commit e882d67)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants