-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flush error state before doing anything else #7890
Conversation
9ab0375
to
7f02647
Compare
There is some intricate error handling in |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7890 +/- ##
==========================================
+ Coverage 80.06% 81.97% +1.91%
==========================================
Files 190 248 +58
Lines 37181 45864 +8683
Branches 9450 11476 +2026
==========================================
+ Hits 29770 37599 +7829
- Misses 2997 3727 +730
- Partials 4414 4538 +124 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7f02647
to
672aa0c
Compare
If the function `FlushErrorState` is not called immediately after `CopyErrorData`, or at least before doing any serious work that can throw another error, it can exhaust the error stack (which is very small). This commit moved calls of `FlushErrorState` to immediately after `CopyErrorData` to reset the error stack and avoid exhausting the error stack. It also adds calls to `FreeErrorData` where applicable, i.e., where the error data is not re-thrown nor saved away.
672aa0c
to
bfbafa0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look OK from what I know about how FlushErrorState() should be used. However, as pointed out in an inline comment, I think there are bigger problems with some of the try-catch usage in this code.
I guess it is OK to defer the other changes to another PR. I am just not sure this change will solve any issues since there are no tests or reproducible cases.
ereport(LOG, | ||
(errcode(ERRCODE_INTERNAL_ERROR), | ||
errmsg("could not calculate next start on failure: resetting value"), | ||
errdetail("Error: %s.", errdata->message))); | ||
FlushErrorState(); | ||
FreeErrorData(errdata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks OK, but I question the correctness of this error handling so in the end I am not sure your change makes much of a difference.
The reason I am questioning the error handling is this:
Normally, when an error happens, the transaction is borked, so there are two options:
- Do a small amount of local cleanup and then rethrow the error
- Do cleanup and then start a new transaction.
This function does neither, AFAICT.
AFAICT, the function calculate_next_start_on_failure
seems to be called in a tuple processing loop, and when exiting here it just keeps on processing the next tuple while still in error processing. That can't be right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. In addition to the case you mention here, there is another case in #7893 where the exception handling is used to capture a syntax error in the input to an input function. This will unfortunately also capture "hard errors" like a OOM, triggering a cascade of OOM errors, which might end up with exhausting the error stack.
Yes, it is hard to reproduce reliably, but these are clear cases that can cause error stack to fill up, so we can see if this reduces the error rate. |
This release contains bug fixes since the 2.19.0 release. We recommend that you upgrade at the next available opportunity. **Bugfixes** * [#7816](#7816) Fix `ORDER BY` for direct queries on partial chunks * [#7834](#7834) Avoid unnecessary scheduler restarts * [#7837](#7837) Ignore frozen chunks in compression policy * [#7850](#7850) Add `is_current_xact_tuple` to Arrow TTS * [#7890](#7890) Flush error state before doing anything else **Thanks** * @bjornuppeke for reporting a problem with `INSERT INTO ... ON CONFLICT DO NOTHING` on compressed chunks * @kav23alex for reporting a segmentation fault on `ALTER TABLE` with `DEFAULT`
If the function
FlushErrorState
is not called immediately afterCopyErrorData
, or at least before doing any serious work that can throw another error, it can exhaust the error stack (which is very small).This commit moved calls of
FlushErrorState
to immediately afterCopyErrorData
to reset the error stack and avoid exhausting the error stack.It also adds calls to
FreeErrorData
where applicable, i.e., where the error data is not re-thrown nor saved away.