Stuck in infinite loop after disk buffer corruption #17644
Comments
So I dug a little bit into the code.
In my case, my understanding is that the reader entry in the ledger refers to a record ID that doesn't exist in the data file. I tried removing the I guess there are two potential fixes:
|
Not sure what we did to get so lucky, but it's sure nice to open an issue that already has some debugging done on it. 🥲 I'll take a look at this today. That area of the code has caused problems in the past, and I fixed another infinite-loop-after-crash bug there recently, so I'm sad, although perhaps not surprised, that it had another one hiding in there. 😞 |
No worries! As a fervent Vector user, I'm happy to contribute to improving it when I can :) |
@NicolasFloquet If you're up for building Vector from a branch, I have what should be the fix in #17657. Granted, this is a single variation of futzing with the data files that I came up with for exacerbating it, so your buffer data as-is may be different or theoretically have been triggering the issue in a different way and this fix may not work for you... hopefully it does, though. 😅 |
At first glance it looks like your fix is working.
The code is within a However, it still looks like the buffer is actually fixed after that. I started 0.30.0 with the resulting buffer file, and it doesn't get stuck in the loop anymore. |
It looks like at some point decrement_total_buffer_size is called while the current buffer size is 0. AtomicU64 wraps on add and sub operations, so instead of going below 0 the buffer size wraps around to a huge value, but the plain u64 operation used for the trace crashes. |
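To illustrate the wrapping behaviour described in the comment above, here is a minimal Rust sketch. It is not Vector's actual code: `decrement_wrapping` and `decrement_saturating` are hypothetical helpers written for this example, and only `AtomicU64::fetch_sub`, `AtomicU64::fetch_update`, and the `u64` arithmetic methods come from the standard library. It assumes a size counter stored in an `AtomicU64` and shows that `fetch_sub` silently wraps past zero, while one possible guard is to clamp the counter at zero.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: decrement with plain `fetch_sub`, which wraps
/// around on underflow instead of panicking. Returns the new value.
fn decrement_wrapping(total: &AtomicU64, amount: u64) -> u64 {
    let previous = total.fetch_sub(amount, Ordering::AcqRel);
    // Mirror the wrapping that the atomic already applied internally.
    previous.wrapping_sub(amount)
}

/// Hypothetical helper: decrement but clamp at zero, so the counter
/// can never wrap to a huge value. Returns the new value.
fn decrement_saturating(total: &AtomicU64, amount: u64) -> u64 {
    let previous = total
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            Some(current.saturating_sub(amount))
        })
        .expect("closure always returns Some");
    previous.saturating_sub(amount)
}
```

With a counter at 0, `decrement_wrapping(&total, 10)` leaves the counter at `u64::MAX - 9`, whereas `0u64.checked_sub(10)` returns `None` and a plain `0u64 - 10` panics in builds with overflow checks enabled. Whether clamping is the right fix for the real buffer accounting is a separate design question; the sketch only shows the mechanics.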
Yeah, feel free to send me the buffer data at Your thesis on why it happens makes sense, although I had run the tests in the PR at trace level when I was originally working on the fix, so it seems like your specific buffer data is likely relevant to exacerbating the overflow issue. |
Well, this is getting weirder on my side, because I'm getting this crash even without my buggy buffer (i.e. I cleaned my buffer directory and started Vector with -vvv and the configuration provided in the ticket) |
Well, that's interesting... and not great. 😂 Let me put that fix on hold for the moment and see if I can reproduce what you're seeing now. |
@NicolasFloquet So was that testing the PR branch, or the official 0.30.0 release? I used the example configuration you pasted in the issue description, and used a clean release build of my PR, as well as the official 0.30.0 release. I couldn't reproduce a panic in the area of |
@tobz OK, so following your comment I suspected something was wrong with my local Vector build, so I went to a clean setup and rebuilt from your PR branch. |
Phew! Good to hear. :) |
A note for the community
Problem
Hello,
We've been facing multiple buffer corruptions and partial writes since before 0.30.0, due to hardware resets of our hosts.
0.30.0 has greatly improved the situation, however we're still facing issues from time to time.
Those issues are critical and unrecoverable without human intervention for two reasons:
What we see is that Vector quickly ends up in an infinite loop, as shown in the logs below:
Configuration
Version
0.30.0
Debug Output
Example Data
No response
Additional Context
Right now I can't share my buffer-data-1.dat file because it contains business-critical data, but I can try scrubbing any sensitive data if needed.
References
No response