Stuck in infinite loop after disk buffer corruption #17644

Closed
NicolasFloquet opened this issue Jun 8, 2023 · 12 comments · Fixed by #17657
Labels
domain: buffers (Anything related to Vector's memory/disk buffers) · type: bug (A code related bug.)

Comments

@NicolasFloquet

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello,

We've been facing multiple buffer corruptions & partial writes since before 0.30.0, due to hardware resets of our hosts.
0.30.0 has greatly improved the situation; however, we're still facing issues from time to time.
Those issues are critical and unrecoverable without human intervention for two reasons:

  • Restarting Vector doesn't fix the issue: the buffer needs to be manually erased to recover
  • Vector stops responding to the SIGINT signal; only SIGKILL works.

What we see is that Vector quickly ends up in an infinite loop, as shown in the logs below:

2023-06-08T09:51:40.330230Z DEBUG vector::app: Internal log rate limit configured. internal_log_rate_secs=10
2023-06-08T09:51:40.331718Z  INFO vector::app: Log level is enabled. level="vector=debug,codec=debug,vrl=debug,file_source=debug,tower_limit=trace,rdkafka=debug,buffers=debug,lapin=debug,kube=debug"
2023-06-08T09:51:40.336196Z  INFO vector::app: Loading configs. paths=["config.yaml"]
2023-06-08T09:51:40.348440Z DEBUG vector::config::loading: No secret placeholder found, skipping secret resolution.
2023-06-08T09:51:40.368077Z DEBUG vector::topology::builder: Building new source. component=input
2023-06-08T09:51:40.371782Z DEBUG vector::topology::builder: Building new sink. component=output
2023-06-08T09:51:40.376510Z DEBUG vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-1.dat" file_size=2296 total_buffer_size=2296
2023-06-08T09:51:40.376541Z DEBUG vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-2.dat" file_size=6128776 total_buffer_size=6131072
2023-06-08T09:51:40.377051Z DEBUG validate_last_write: vector_buffers::variants::disk_v2::writer: Validating last written record in current data file. current_writer_data_file="./buffer/v2/output/buffer-data-2.dat"
2023-06-08T09:51:40.377155Z DEBUG validate_last_write:ensure_ready_for_write: vector_buffers::variants::disk_v2::writer: Opened data file for writing. data_file_path="./buffer/v2/output/buffer-data-2.dat" existing_file_size=6128776
2023-06-08T09:51:40.384223Z ERROR validate_last_write: vector_buffers::variants::disk_v2::writer: Last record written to data file is behind expected position. Events have likely been lost. ledger_next=666400 last_record_id=666376 record_events=23
2023-06-08T09:51:40.385170Z DEBUG vector_buffers::variants::disk_v2::reader: Seeking to last acknowledged record for reader. last_acknowledged_record_id=666399
2023-06-08T09:51:40.385743Z DEBUG vector_buffers::variants::disk_v2::reader: Opened data file for reading. data_file_path="./buffer/v2/output/buffer-data-2.dat"
2023-06-08T09:51:40.457316Z DEBUG vector_buffers::variants::disk_v2::reader: Current data file has no more data. bytes_read=6128776
2023-06-08T09:51:40.457367Z DEBUG vector_buffers::variants::disk_v2::reader: Current data file has no more data. bytes_read=6128776
2023-06-08T09:51:40.457401Z DEBUG vector_buffers::variants::disk_v2::reader: Current data file has no more data. bytes_read=6128776
2023-06-08T09:51:40.457421Z DEBUG vector_buffers::variants::disk_v2::reader: Current data file has no more data. bytes_read=6128776
2023-06-08T09:51:40.457458Z DEBUG vector_buffers::variants::disk_v2::reader: Current data file has no more data. bytes_read=6128776

… forever …

Configuration

I imported our production buffer directory into my local setup and loaded it using the following configuration:

data_dir: "./"
sources:
  input:
    type: demo_logs
    format: apache_common
    lines:
      - line1
sinks:
  output:
    type: console
    buffer:
    - type: disk
      max_size: 1073741824 # 1GiB.
      when_full: drop_newest
    encoding:
      codec: json
    inputs:
      - input

Version

0.30.0

Debug Output

2023-06-08T12:56:10.674301Z DEBUG vector::app: Internal log rate limit configured. internal_log_rate_secs=10
2023-06-08T12:56:10.674409Z  INFO vector::app: Log level is enabled. level="vector=trace,codec=trace,vrl=trace,file_source=trace,tower_limit=trace,rdkafka=trace,buffers=trace,lapin=trace,kube=trace"
2023-06-08T12:56:10.675480Z  INFO vector::app: Loading configs. paths=["config.yaml"]
2023-06-08T12:56:10.677789Z DEBUG vector::config::loading: No secret placeholder found, skipping secret resolution.
2023-06-08T12:56:10.679238Z DEBUG vector::topology::builder: Building new source. component=input
2023-06-08T12:56:10.679721Z DEBUG vector::topology::builder: Building new sink. component=output
2023-06-08T12:56:10.682285Z DEBUG vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-1.dat" file_size=2296 total_buffer_size=2296
2023-06-08T12:56:10.682334Z DEBUG vector_buffers::variants::disk_v2::ledger: Found existing data file. data_file="buffer-data-2.dat" file_size=6128776 total_buffer_size=6131072
2023-06-08T12:56:10.682386Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=0 new_buffer_size=6131072
2023-06-08T12:56:10.682446Z DEBUG validate_last_write: vector_buffers::variants::disk_v2::writer: Validating last written record in current data file. current_writer_data_file="./buffer/v2/output/buffer-data-2.dat"
2023-06-08T12:56:10.682637Z DEBUG validate_last_write:ensure_ready_for_write: vector_buffers::variants::disk_v2::writer: Opened data file for writing. data_file_path="./buffer/v2/output/buffer-data-2.dat" existing_file_size=6128776
2023-06-08T12:56:10.688705Z ERROR validate_last_write: vector_buffers::variants::disk_v2::writer: Last record written to data file is behind expected position. Events have likely been lost. ledger_next=666400 last_record_id=666376 record_events=23
2023-06-08T12:56:10.689234Z DEBUG vector_buffers::variants::disk_v2::reader: Seeking to last acknowledged record for reader. last_acknowledged_record_id=666399
2023-06-08T12:56:10.689548Z DEBUG vector_buffers::variants::disk_v2::reader: Opened data file for reading. data_file_path="./buffer/v2/output/buffer-data-2.dat"
2023-06-08T12:56:10.690606Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6131072 new_buffer_size=6083104
2023-06-08T12:56:10.690889Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6083104 new_buffer_size=6074616
2023-06-08T12:56:10.690929Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=112 added_acks=112
2023-06-08T12:56:10.690940Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666512 unclaimed_acks=112 pending_markers=0
2023-06-08T12:56:10.691082Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6074616 new_buffer_size=6066128
2023-06-08T12:56:10.691112Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=136 added_acks=24
2023-06-08T12:56:10.691121Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666536 unclaimed_acks=136 pending_markers=0
2023-06-08T12:56:10.691213Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6066128 new_buffer_size=6057968
2023-06-08T12:56:10.691235Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=160 added_acks=24
2023-06-08T12:56:10.691244Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666560 unclaimed_acks=160 pending_markers=0
2023-06-08T12:56:10.691381Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6057968 new_buffer_size=6049624
2023-06-08T12:56:10.691410Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=170 added_acks=10
2023-06-08T12:56:10.691419Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666570 unclaimed_acks=170 pending_markers=0
2023-06-08T12:56:10.691455Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6049624 new_buffer_size=6048760
2023-06-08T12:56:10.691466Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=193 added_acks=23
2023-06-08T12:56:10.691474Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666593 unclaimed_acks=193 pending_markers=0
2023-06-08T12:56:10.691535Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6048760 new_buffer_size=6044768
2023-06-08T12:56:10.691551Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=194 added_acks=1
2023-06-08T12:56:10.691559Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666594 unclaimed_acks=194 pending_markers=0
2023-06-08T12:56:10.691572Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6044768 new_buffer_size=6044248
2023-06-08T12:56:10.691582Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=195 added_acks=1
2023-06-08T12:56:10.691591Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666595 unclaimed_acks=195 pending_markers=0
2023-06-08T12:56:10.692388Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=6044248 new_buffer_size=5996280
2023-06-08T12:56:10.692504Z TRACE vector_buffers::topology::acks: Added acknowledgements. unclaimed_acks=196 added_acks=1
2023-06-08T12:56:10.692513Z TRACE vector_buffers::topology::acks: Searching for eligible marker. acked_marker_id=666400 effective_acked_marker_id=666596 unclaimed_acks=196 pending_markers=0
2023-06-08T12:56:10.692651Z TRACE vector_buffers::variants::disk_v2::ledger: Updated buffer size. previous_buffer_size=5996280 new_buffer_size=5987792


… forever …

Example Data

No response

Additional Context

Right now I can't share my buffer-data-1.dat file because it contains business-critical data, but I can try scrubbing any sensitive data if needed.

References

No response

NicolasFloquet added the type: bug label Jun 8, 2023
@NicolasFloquet
Author

So I dug a little bit into the code.
What I see is that I'm reaching this line:

if maybe_record.is_none() && self.last_reader_record_id == 0 {

In my case, maybe_record.is_none() is true, but self.last_reader_record_id is equal to ledger_last-1, so I'm not entering the if block, and the loop continues forever.

My understanding is that the reader entry in the ledger refers to a record ID that doesn't exist in the data file.

I tried removing the self.last_reader_record_id == 0 condition from the if, and my setup recovers properly and carries on sending new logs to stdout.

I guess there are two potential fixes (a simplified sketch of the loop I'm describing follows the list):

  • Remove the self.last_reader_record_id == 0 condition from the if: I'm actually not sure what nominal situation corresponds to self.last_reader_record_id != 0 in this if. Either the value is 0 because the buffer is empty, or the value is something other than the ledger value because of some weird situation.
  • Add another condition to this if to cover my situation, but I'm not exactly sure what condition can discriminate this situation other than a "none" record being returned by the next() function.
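
To make the failure mode concrete, here is a minimal, self-contained sketch of the seek logic as I understand it. The names (SeekingReader, next_record, seek_step) and the record IDs are simplified stand-ins, not Vector's actual internals: the point is that once the data file runs out of records before the acknowledged ID is reached, and last_reader_record_id is non-zero, neither branch ever terminates the seek.

// Hypothetical sketch, not Vector's real reader. The guard mirrors the line
// quoted above: "no record" is only treated as terminal when the reader has
// never read anything (a fresh/empty buffer).
struct SeekingReader {
    records: Vec<u64>,          // record IDs still present in the data file
    pos: usize,                 // read cursor
    last_reader_record_id: u64, // last record ID the reader actually saw
}

impl SeekingReader {
    fn next_record(&mut self) -> Option<u64> {
        let id = self.records.get(self.pos).copied();
        if let Some(id) = id {
            self.pos += 1;
            self.last_reader_record_id = id;
        }
        id
    }

    // Returns true once the seek is finished; false means "poll again".
    fn seek_step(&mut self, last_acknowledged_record_id: u64) -> bool {
        let maybe_record = self.next_record();

        // The condition under discussion.
        if maybe_record.is_none() && self.last_reader_record_id == 0 {
            return true;
        }

        match maybe_record {
            Some(id) => id >= last_acknowledged_record_id,
            // Truncated/corrupted file: no more records, but we already read
            // some, so this keeps returning false forever.
            None => false,
        }
    }
}

fn main() {
    // Data file stops at record 666376, but the ledger says 666399 was acked.
    let mut reader = SeekingReader {
        records: vec![666_370, 666_376],
        pos: 0,
        last_reader_record_id: 0,
    };

    for attempt in 1..=5 {
        let done = reader.seek_step(666_399);
        println!(
            "attempt {attempt}: done={done} last_seen={}",
            reader.last_reader_record_id
        );
        if done {
            return;
        }
    }
    println!("never reaches the acknowledged record -> this is the infinite loop");
}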

@tobz
Contributor

tobz commented Jun 9, 2023

Not sure what we did to get so lucky, but it's sure nice to have an issue opened that already has some debugging done on it. 🥲

I'll take a look at this today. That area of the code has caused problems in the past, and I fixed another infinite-loop-after-crash bug there recently, so I'm sad, although perhaps not surprised, that it had another one hiding in there. 😞

@NicolasFloquet
Author

No worries! As a fervent Vector user, I'm happy to contribute to improving it when I can :)
Feel free to ask if you want me to try stuff with my broken buffer.

@tobz
Contributor

tobz commented Jun 9, 2023

@NicolasFloquet If you're up for building Vector from a branch, I have what should be the fix in #17657.

Granted, this is based on a single variation of futzing with the data files that I came up with to reproduce it, so your buffer data as-is may be different, or may theoretically have been triggering the issue in a different way, and this fix may not work for you... hopefully it does, though. 😅

@NicolasFloquet
Author

At first glance it looks like your fix is working.
However, running Vector with -vv now crashes systematically:


thread 'vector-worker' panicked at 'attempt to subtract with overflow', /Users/nicolasfloquet/vector/lib/vector-buffers/src/variants/disk_v2/ledger.rs:296:31
stack backtrace:
thread 'vector-worker' panicked at 'attempt to add with overflow', /Users/nicolasfloquet/vector/lib/vector-buffers/src/variants/disk_v2/ledger.rs:286:31

The code is within a trace! macro call, so it makes sense that it only crashes with -vv.

However, it still looks like the buffer is actually fixed after that. I started 0.30.0 with the resulting buffer file, and it doesn't get stuck in the loop anymore.

@NicolasFloquet
Author

It looks like at some point decrement_total_buffer_size is called while the current buffer_size is 0. AtomicU64 wraps on add & sub operations, so the buffer size goes below 0, but the plain u64 arithmetic used for the trace crashes.
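
For illustration, here is a minimal, self-contained sketch of that failure mode; the function name and signature are simplified stand-ins, not Vector's actual code. An AtomicU64 fetch_sub wraps silently past zero, while the plain u64 subtraction done only for the trace line panics in a debug build (or with overflow-checks enabled), which would match it only crashing at higher verbosity:

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical helper mimicking the described behavior.
fn decrement_total_buffer_size(total: &AtomicU64, amount: u64, trace_enabled: bool) {
    // fetch_sub wraps around silently when `amount` exceeds the stored value.
    let previous = total.fetch_sub(amount, Ordering::AcqRel);

    if trace_enabled {
        // Plain u64 arithmetic computed only for the log line: in a debug build
        // this panics with 'attempt to subtract with overflow' when amount > previous.
        let new_size = previous - amount;
        println!("Updated buffer size. previous_buffer_size={previous} new_buffer_size={new_size}");
    }
}

fn main() {
    // Without trace-level logging the wrap goes unnoticed.
    decrement_total_buffer_size(&AtomicU64::new(0), 8488, false);

    // With trace-level logging the same call panics in a debug build.
    decrement_total_buffer_size(&AtomicU64::new(0), 8488, true);
}
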
The good news is that my buffer is now free of any sensitive data, so if needed I can send it to you for investigation.

@tobz
Contributor

tobz commented Jun 12, 2023

Yeah, feel free to send me the buffer data at [email protected].

Your thesis on why it happens makes sense, although I had run the tests in the PR at trace level when I was originally working on the fix, so it seems like your specific buffer data is likely relevant to exacerbating the overflow issue.

@NicolasFloquet
Author

Well, this is getting weirder on my side, because I'm getting this crash even without my buggy buffer (i.e. I cleaned my buffer directory and started Vector with -vvv and the configuration provided in the ticket).

@tobz
Contributor

tobz commented Jun 12, 2023

Well, that's interesting... and not great. 😂

Let me put that fix on hold for the moment and see if I can reproduce what you're seeing now.

@tobz
Contributor

tobz commented Jun 12, 2023

@NicolasFloquet So was that testing the PR branch, or the official 0.30.0 release?

I used the example configuration you pasted in the issue description, and used a clean release build of my PR, as well as the official 0.30.0 release. I couldn't reproduce a panic in the area of decrement_total_buffer_size with either version, regardless of whether the data directory was clean or not, including even if I created the buffer data directory with one version of Vector and then restarted and used the other version.

@NicolasFloquet
Author

@tobz OK, so following your comment I suspected something was wrong with my local Vector build, so I went to a clean setup and rebuilt from your PR branch.
Good news: this indeed fixes the initial issue, and I don't get any crashes from the traces. So it looks all good to me!

@tobz
Contributor

tobz commented Jun 13, 2023

Phew! Good to hear. :)

jszwedko added the domain: buffers label Jun 13, 2023
tobz closed this as completed in 37a662a Jun 13, 2023