Flush error state before doing anything else #7890

mkindahl · 2025-03-31T12:06:10Z

If the function FlushErrorState is not called immediately after CopyErrorData, or at least before doing any serious work that can throw another error, it can exhaust the error stack (which is very small).

This commit moved calls of FlushErrorState to immediately after CopyErrorData to reset the error stack and avoid exhausting the error stack.

It also adds calls to FreeErrorData where applicable, i.e., where the error data is not re-thrown nor saved away.

mkindahl · 2025-03-31T12:10:40Z

There is some intricate error handling in continuous_agg_validate_query which is not entirely clear.

codecov · 2025-03-31T12:23:49Z

Codecov Report

Attention: Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.97%. Comparing base (59f50f2) to head (bfbafa0).
Report is 859 commits behind head on main.

Files with missing lines	Patch %	Lines
src/bgw/job_stat.c	0.00%	2 Missing ⚠️
src/telemetry/telemetry.c	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7890      +/-   ##
==========================================
+ Coverage   80.06%   81.97%   +1.91%     
==========================================
  Files         190      248      +58     
  Lines       37181    45864    +8683     
  Branches     9450    11476    +2026     
==========================================
+ Hits        29770    37599    +7829     
- Misses       2997     3727     +730     
- Partials     4414     4538     +124

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

If the function `FlushErrorState` is not called immediately after `CopyErrorData`, or at least before doing any serious work that can throw another error, it can exhaust the error stack (which is very small). This commit moved calls of `FlushErrorState` to immediately after `CopyErrorData` to reset the error stack and avoid exhausting the error stack. It also adds calls to `FreeErrorData` where applicable, i.e., where the error data is not re-thrown nor saved away.

erimatnor

The changes look OK from what I know about how FlushErrorState() should be used. However, as pointed out in an inline comment, I think there are bigger problems with some of the try-catch usage in this code.

I guess it is OK to defer the other changes to another PR. I am just not sure this change will solve any issues since there are no tests or reproducible cases.

erimatnor · 2025-03-31T14:38:42Z

src/bgw/job_stat.c

 		ereport(LOG,
 				(errcode(ERRCODE_INTERNAL_ERROR),
 				 errmsg("could not calculate next start on failure: resetting value"),
 				 errdetail("Error: %s.", errdata->message)));
-		FlushErrorState();
+		FreeErrorData(errdata);


This change looks OK, but I question the correctness of this error handling so in the end I am not sure your change makes much of a difference.

The reason I am questioning the error handling is this:

Normally, when an error happens, the transaction is borked, so there are two options:

Do a small amount of local cleanup and then rethrow the error

Do cleanup and then start a new transaction.

This function does neither, AFAICT.

AFAICT, the function calculate_next_start_on_failure seems to be called in a tuple processing loop, and when exiting here it just keeps on processing the next tuple while still in error processing. That can't be right.

Agree. In addition to the case you mention here, there is another case in #7893 where the exception handling is used to capture a syntax error in the input to an input function. This will unfortunately also capture "hard errors" like a OOM, triggering a cascade of OOM errors, which might end up with exhausting the error stack.

mkindahl · 2025-04-01T07:10:12Z

I guess it is OK to defer the other changes to another PR. I am just not sure this change will solve any issues since there are no tests or reproducible cases.

Yes, it is hard to reproduce reliably, but these are clear cases that can cause error stack to fill up, so we can see if this reduces the error rate.

@bjornuppeke

This release contains bug fixes since the 2.19.0 release. We recommend that you upgrade at the next available opportunity. **Bugfixes** * [#7816](#7816) Fix `ORDER BY` for direct queries on partial chunks * [#7834](#7834) Avoid unnecessary scheduler restarts * [#7837](#7837) Ignore frozen chunks in compression policy * [#7850](#7850) Add `is_current_xact_tuple` to Arrow TTS * [#7890](#7890) Flush error state before doing anything else **Thanks** * @bjornuppeke for reporting a problem with `INSERT INTO ... ON CONFLICT DO NOTHING` on compressed chunks * @kav23alex for reporting a segmentation fault on `ALTER TABLE` with `DEFAULT`

mkindahl requested review from fabriziomello and erimatnor March 31, 2025 12:06

mkindahl force-pushed the move-flush-error-state branch from 9ab0375 to 7f02647 Compare March 31, 2025 12:06

github-actions bot assigned mkindahl Mar 31, 2025

mkindahl force-pushed the move-flush-error-state branch from 7f02647 to 672aa0c Compare March 31, 2025 12:32

mkindahl marked this pull request as ready for review March 31, 2025 12:33

fabriziomello approved these changes Apr 1, 2025

View reviewed changes

mkindahl force-pushed the move-flush-error-state branch from 672aa0c to bfbafa0 Compare April 1, 2025 06:52

mkindahl added the bug label Apr 1, 2025

mkindahl added this to the v2.19.1 milestone Apr 1, 2025

erimatnor approved these changes Apr 1, 2025

View reviewed changes

mkindahl enabled auto-merge (rebase) April 1, 2025 07:14

mkindahl merged commit 31e8298 into timescale:main Apr 1, 2025
44 of 45 checks passed

mkindahl deleted the move-flush-error-state branch April 1, 2025 07:22

timescale-automation mentioned this pull request Apr 1, 2025

Backport to 2.19.x: #7890: Flush error state before doing anything else #7894

Merged

timescale-automation added the backported-2.19.x label Apr 1, 2025

philkra mentioned this pull request Apr 1, 2025

2.19.1 (2025-04-01) #7895

Closed

bayandin mentioned this pull request Apr 2, 2025

timescaledb 2.19.1 bayandin/homebrew-tap#262

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flush error state before doing anything else #7890

Flush error state before doing anything else #7890

mkindahl commented Mar 31, 2025

mkindahl commented Mar 31, 2025

codecov bot commented Mar 31, 2025 •

edited

Loading

erimatnor left a comment

erimatnor Mar 31, 2025

mkindahl Apr 1, 2025

mkindahl commented Apr 1, 2025

Flush error state before doing anything else #7890

Flush error state before doing anything else #7890

Conversation

mkindahl commented Mar 31, 2025

mkindahl commented Mar 31, 2025

codecov bot commented Mar 31, 2025 • edited Loading

Codecov Report

erimatnor left a comment

Choose a reason for hiding this comment

erimatnor Mar 31, 2025

Choose a reason for hiding this comment

mkindahl Apr 1, 2025

Choose a reason for hiding this comment

mkindahl commented Apr 1, 2025

codecov bot commented Mar 31, 2025 •

edited

Loading