-
Notifications
You must be signed in to change notification settings - Fork 640
storage: work around segment index overflows #25979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🎉 Snyk checks have passed. No issues have been found so far.✅ security/snyk check is complete. No issues have been found. (View Details) ✅ license/snyk check is complete. No issues have been found. (View Details) |
🎉 Snyk checks have passed. No issues have been found so far.✅ security/snyk check is complete. No issues have been found. (View Details) ✅ license/snyk check is complete. No issues have been found. (View Details) |
1cfcd7e
to
c7aa42f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whew. What a doozy
38f34e8
to
1bd4af7
Compare
Retry command for Build#65276please wait until all jobs are finished before running the slash command
|
CI test resultstest results on build#65276
test results on build#65297
test results on build#65323
test results on build#65386
|
/ci-repeat 1 |
Retry command for Build#65297please wait until all jobs are finished before running the slash command
|
/ci-repeat 1 |
Thanks for the retries Willem! |
|
||
retval = true; | ||
auto offset_delta = batch_base_offset() - base_offset(); | ||
if (offset_delta <= std::numeric_limits<uint32_t>::max()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
is it possible for this to be false in the case where retval is true and would it matter? the initial case of the index the index being empty seems like an important edge case where we might always want the first entry?
-
so skipping an entry is always safe right? it's only real effect is coarser granularity indexing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible for this to be false in the case where retval is true and would it matter?
Ah hm, it is possible, but I don't think it matters: it just affects whether callers reset their respective accumulators, which affects the next time add_entry is evaluated, but presumably all entries at this point aren't going to be added. That said, it is an oversight to have this return true in the || retval
case. Will update.
the initial case of the index the index being empty seems like an important edge case where we might always want the first entry?
Hm this is an interesting point. I think if the index stays empty, this is fine because queries into the index will return std::nullopt
and callers should read from the start of the segment. And the index should stay empty because we're relying on callers to call this with monotonically increasing records.
so skipping an entry is always safe right? it's only real effect is coarser granularity indexing?
That's my read on it, yeah
1bd4af7
to
7446cbd
Compare
} else { | ||
// We can't index anything beyond uint32 space. Presumably no | ||
// further entries will be added because of this same condition. | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious if any instrumentation is useful around this (probably makes it easy to monitor all topics hitting such large offset spaces in one place).
edit: I'm thinking metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably add a metric for this, but I don't think one exists other than examining HWM for huge partitions.
87706cd
to
66d27b2
Compare
Latest push: rebase |
Once values begin coming in whose deltas are beyond uint32_t::max, the index is basically unusable (queries have somewhat undefined behavior). This updates the index to skip adding values if the offset delta would overflow. Prior to this change, one included test would fail with: ``` src/v/storage/tests/log_segment_reader_test.cc(376): error: in "test_ghost_read_with_index_overflow": check first.last_offset() == model::offset{100} has failed [2147483746 != 100] src/v/storage/tests/log_segment_reader_test.cc(377): error: in "test_ghost_read_with_index_overflow": check first.header().type == model::record_batch_type::raft_data has failed [batch_type::ghost_batch != batch_type::raft_data] ``` This test failing means that a recovery from such an overflowed topic could end up missing some batches. Another included test would fail, this one relying on behavior addressed by redpanda-data#25974 to cause trouble: ``` src/v/storage/tests/log_segment_reader_test.cc(382): error: in "test_read_with_index_overflow_base": check first.base_offset() == model::offset{uint32_max} has failed [8589934591 != 4294967295] src/v/storage/tests/log_segment_reader_test.cc(383): error: in "test_read_with_index_overflow_base": check first.last_offset() == model::offset{uint32_max} has failed [8589934591 != 4294967295] ``` This test failing means that a compaction could attempt to read a whole segment but only get a part of a segment, and we could end up removing batches incorrectly. Just an interesting tidbit, these tests pass if using the batch cache. This is relevant as we have been chasing issues with the __consumer_offsets topic, which does not use the batch cache.
Indexes that have overflowed should not be used, because seeking within is somewhat undefined when considering the potential of overflowed uint32_t offset delta values. This commit tweaks find_nearest() to return the first indexed entry if this is detected. This allows Redpanda to safely proceed even when there are segment indexes that have overflowed from previous versions of Redpanda. Without this change, the included test would fail with: ``` src/v/storage/tests/index_state_test.cc(360): Entering test case "index_overflow" src/v/storage/tests/index_state_test.cc(377): error: in "index_overflow": check res->offset == model::offset{0} has failed [9 != 0] src/v/storage/tests/index_state_test.cc(378): error: in "index_overflow": check res->filepos == 1 has failed [4 != 1] src/v/storage/tests/index_state_test.cc(360): Leaving test case "index_overflow"; testing time: 77us ```
66d27b2
to
fc1e8bf
Compare
Retry command for Build#65386please wait until all jobs are finished before running the slash command
|
/ci-repeat 1 |
/backport v25.1.x |
/backport v24.3.x |
/backport v24.2.x |
Failed to create a backport PR to v24.2.x branch. I tried:
|
Similar to redpanda-data#25979, when an index contains state that leads it to overflow, it is unreliable. In this case, this could affect truncation, and result in us removing incorrect index entries. So, this makes the truncate avoid removing entries if this case is detected, with the expectation that no entries are added that have overflowed. Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that: - suffix truncation is generally more rare - the overflowing index problem is generally a problem with: - large segments, much larger than the default allowed size - compacted segments, which we expect to be relatively rare to suffix truncate with tiered storage Nonetheless, some kind of fix seems appropriate because it's difficult to reason about the overflow otherwise.
Similar to redpanda-data#25979, when an index contains state that leads it to overflow, it is unreliable. In this case, this could affect truncation, and result in us removing incorrect index entries. So, this makes the truncate avoid removing entries if this case is detected, with the expectation that no entries are added that have overflowed. Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that: - suffix truncation is generally more rare - the overflowing index problem is generally a problem with: - large segments, much larger than the default allowed size - compacted segments, which we expect to be relatively rare to suffix truncate with tiered storage Nonetheless, some kind of fix seems appropriate because it's difficult to reason about the overflow otherwise.
CONFLICT: - previous test in index_state.cc was different from dev Similar to redpanda-data#25979, when an index contains state that leads it to overflow, it is unreliable. In this case, this could affect truncation, and result in us removing incorrect index entries. So, this makes the truncate avoid removing entries if this case is detected, with the expectation that no entries are added that have overflowed. Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that: - suffix truncation is generally more rare - the overflowing index problem is generally a problem with: - large segments, much larger than the default allowed size - compacted segments, which we expect to be relatively rare to suffix truncate with tiered storage Nonetheless, some kind of fix seems appropriate because it's difficult to reason about the overflow otherwise. (cherry picked from commit e882d67)
CONFLICT: - previous test in index_state.cc was different from dev Similar to redpanda-data#25979, when an index contains state that leads it to overflow, it is unreliable. In this case, this could affect truncation, and result in us removing incorrect index entries. So, this makes the truncate avoid removing entries if this case is detected, with the expectation that no entries are added that have overflowed. Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that: - suffix truncation is generally more rare - the overflowing index problem is generally a problem with: - large segments, much larger than the default allowed size - compacted segments, which we expect to be relatively rare to suffix truncate with tiered storage Nonetheless, some kind of fix seems appropriate because it's difficult to reason about the overflow otherwise. (cherry picked from commit e882d67)
CONFLICT: - similar to redpanda-data#26024, the appropriate code in this branch is in segment_index instead of index_state - because of this, the index_state unit test is removed Similar to redpanda-data#25979, when an index contains state that leads it to overflow, it is unreliable. In this case, this could affect truncation, and result in us removing incorrect index entries. So, this makes the truncate avoid removing entries if this case is detected, with the expectation that no entries are added that have overflowed. Note, this bug doesn't appear to be as severe as redpanda-data#25979 in that: - suffix truncation is generally more rare - the overflowing index problem is generally a problem with: - large segments, much larger than the default allowed size - compacted segments, which we expect to be relatively rare to suffix truncate with tiered storage Nonetheless, some kind of fix seems appropriate because it's difficult to reason about the overflow otherwise. (cherry picked from commit e882d67)
The segment index on offset deltas (difference between offset and segment's base offset) could overflow, resulting in very odd behavior, including returning ghost batches incorrectly which could result in lost batches during recovery.
To fix, this PR does two things:
uint32_t::max
uint32_t::max
offsetsThese changes feel pretty hacky, and our segment index still has a fundamental limitation for large segments. But this at least stops the bleeding.
Backports Required
Release Notes
Bug Fixes