Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush error state before doing anything else #7890

Merged
merged 1 commit into from
Apr 1, 2025

Conversation

mkindahl
Copy link
Contributor

If the function FlushErrorState is not called immediately after CopyErrorData, or at least before doing any serious work that can throw another error, it can exhaust the error stack (which is very small).

This commit moved calls of FlushErrorState to immediately after CopyErrorData to reset the error stack and avoid exhausting the error stack.

It also adds calls to FreeErrorData where applicable, i.e., where the error data is not re-thrown nor saved away.

@mkindahl
Copy link
Contributor Author

There is some intricate error handling in continuous_agg_validate_query which is not entirely clear.

Copy link

codecov bot commented Mar 31, 2025

Codecov Report

Attention: Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.97%. Comparing base (59f50f2) to head (bfbafa0).
Report is 859 commits behind head on main.

Files with missing lines Patch % Lines
src/bgw/job_stat.c 0.00% 2 Missing ⚠️
src/telemetry/telemetry.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7890      +/-   ##
==========================================
+ Coverage   80.06%   81.97%   +1.91%     
==========================================
  Files         190      248      +58     
  Lines       37181    45864    +8683     
  Branches     9450    11476    +2026     
==========================================
+ Hits        29770    37599    +7829     
- Misses       2997     3727     +730     
- Partials     4414     4538     +124     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mkindahl mkindahl force-pushed the move-flush-error-state branch from 7f02647 to 672aa0c Compare March 31, 2025 12:32
@mkindahl mkindahl marked this pull request as ready for review March 31, 2025 12:33
If the function `FlushErrorState` is not called immediately after
`CopyErrorData`, or at least before doing any serious work that can
throw another error, it can exhaust the error stack (which is very
small).

This commit moved calls of `FlushErrorState` to immediately after
`CopyErrorData` to reset the error stack and avoid exhausting the error
stack.

It also adds calls to `FreeErrorData` where applicable, i.e., where the
error data is not re-thrown nor saved away.
@mkindahl mkindahl force-pushed the move-flush-error-state branch from 672aa0c to bfbafa0 Compare April 1, 2025 06:52
@mkindahl mkindahl added the bug label Apr 1, 2025
@mkindahl mkindahl added this to the v2.19.1 milestone Apr 1, 2025
Copy link
Contributor

@erimatnor erimatnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look OK from what I know about how FlushErrorState() should be used. However, as pointed out in an inline comment, I think there are bigger problems with some of the try-catch usage in this code.

I guess it is OK to defer the other changes to another PR. I am just not sure this change will solve any issues since there are no tests or reproducible cases.

ereport(LOG,
(errcode(ERRCODE_INTERNAL_ERROR),
errmsg("could not calculate next start on failure: resetting value"),
errdetail("Error: %s.", errdata->message)));
FlushErrorState();
FreeErrorData(errdata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks OK, but I question the correctness of this error handling so in the end I am not sure your change makes much of a difference.

The reason I am questioning the error handling is this:

Normally, when an error happens, the transaction is borked, so there are two options:

  1. Do a small amount of local cleanup and then rethrow the error
  2. Do cleanup and then start a new transaction.

This function does neither, AFAICT.

AFAICT, the function calculate_next_start_on_failure seems to be called in a tuple processing loop, and when exiting here it just keeps on processing the next tuple while still in error processing. That can't be right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. In addition to the case you mention here, there is another case in #7893 where the exception handling is used to capture a syntax error in the input to an input function. This will unfortunately also capture "hard errors" like a OOM, triggering a cascade of OOM errors, which might end up with exhausting the error stack.

@mkindahl
Copy link
Contributor Author

mkindahl commented Apr 1, 2025

I guess it is OK to defer the other changes to another PR. I am just not sure this change will solve any issues since there are no tests or reproducible cases.

Yes, it is hard to reproduce reliably, but these are clear cases that can cause error stack to fill up, so we can see if this reduces the error rate.

@mkindahl mkindahl enabled auto-merge (rebase) April 1, 2025 07:14
@mkindahl mkindahl merged commit 31e8298 into timescale:main Apr 1, 2025
44 of 45 checks passed
@mkindahl mkindahl deleted the move-flush-error-state branch April 1, 2025 07:22
philkra added a commit that referenced this pull request Apr 1, 2025
This release contains bug fixes since the 2.19.0 release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* [#7816](#7816) Fix `ORDER BY` for direct queries on partial chunks
* [#7834](#7834) Avoid unnecessary scheduler restarts
* [#7837](#7837) Ignore frozen chunks in compression policy
* [#7850](#7850) Add `is_current_xact_tuple` to Arrow TTS
* [#7890](#7890) Flush error state before doing anything else

**Thanks**
* @bjornuppeke for reporting a problem with `INSERT INTO ... ON CONFLICT DO NOTHING` on compressed chunks
* @kav23alex for reporting a segmentation fault on `ALTER TABLE` with `DEFAULT`
@philkra philkra mentioned this pull request Apr 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants