Don't capture hard errors for old cagg format #7903

mkindahl · 2025-04-02T16:48:09Z

If an error occurs inside the deserialization function for continuous aggregates, we are using the old numeric format and this was captured inside inner_agg_deserialize and the format was "repaired". However, this captured hard errors as well, which can cause cascading errors if execution continues after the error.

This commit fixes that by capturing data exception errors, internal error (which is the default for elog()) and protocol error (c.f. pq_getmsgint64 used by numeric_deserialize), but re-throwing all other errors.

It also adds tests to check that a repair is attempted on old format, and that it does not throw en error.

mkindahl · 2025-04-03T06:28:11Z

Note that this is not capturing errors it should capture, it seems. Hence the failures. Looking into it.

codecov · 2025-04-03T08:32:25Z

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.96%. Comparing base (59f50f2) to head (673027d).
Report is 869 commits behind head on main.

Files with missing lines	Patch %	Lines
tsl/src/partialize_finalize.c	78.57%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7903      +/-   ##
==========================================
+ Coverage   80.06%   81.96%   +1.89%     
==========================================
  Files         190      249      +59     
  Lines       37181    46162    +8981     
  Branches     9450    11575    +2125     
==========================================
+ Hits        29770    37836    +8066     
- Misses       2997     3766     +769     
- Partials     4414     4560     +146

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

erimatnor

Is there anyway we can test this? Looks like error handler is not covered by codecov.

mkindahl · 2025-04-03T13:50:59Z

Is there anyway we can test this? Looks like error handler is not covered by codecov.

There is unfortunately no easy way to test this. Something like this might be doable, but it is a bigger job, so suitable for a separate PR:

Use LD_PRELOAD to replace malloc() with our own version when running a test
Allow code to be marked with something similar to our TS_DEBUG_WAITPOINT that makes the malloc() call return NULL.
Add SQL function that allow you to "turn on" such a malloc() failure.
Update the test to "set" the malloc() failure before a particular SQL statement

The same applies to #7897 and #7893.

Nevertheless, I will take a look and see if I can create something like the above with a moderate effort.

erimatnor · 2025-04-03T15:13:11Z

Is there anyway we can test this? Looks like error handler is not covered by codecov.

Nevertheless, I will take a look and see if I can create something like the above with a moderate effort.

I think it is worthwhile working on tests, especially since there are several PRs trying to address similar issues so it seems important to get this right. While the changes in the PR seem reasonable, there's no way to currently know we are actually improving something or just making it worse. So, why should we merge some changes that we don't know works for sure?

erimatnor · 2025-04-03T15:19:14Z

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

fabriziomello · 2025-04-03T15:29:54Z

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Hmmm... now I was looking to it because I'm sure we had those tests and we had: 6c38c60

But at some point later we removed it :-(

mkindahl · 2025-04-04T06:32:03Z

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Hmmm... now I was looking to it because I'm sure we had those tests and we had: 6c38c60

But at some point later we removed it :-(

It was removed in 50578e4 because it caused issues with the test suite.

mkindahl · 2025-04-04T06:33:23Z

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Yes, but that will only test one part of the code. A "hard" error should still cause an abortion, and this part is not dealt with using this approach.

mkindahl · 2025-04-04T08:39:20Z

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Hmmm... now I was looking to it because I'm sure we had those tests and we had: 6c38c60

But at some point later we removed it :-(

Looking at the test, it seems like we have no tests to check that the repair works, so added it here.

mkindahl · 2025-04-04T08:55:54Z

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

The repair can only handle very specific issues, so we cannot do random changes. I added a test that truncates the last 16 bytes of the partial to generate the "broken" partial that this code can repair.

I also added debug printouts so that we can see that the repair is attempted and actually done. Since these are debug printouts, they print a little more than just the bare minimum, but it at least checks that the repair is attempted and works.

If an error occurs inside the deserialization function for continuous aggregates, we are using the old numeric format and this was captured inside `inner_agg_deserialize` and the format was "repaired". However, this captured hard errors as well, which can cause cascading errors if execution continues after the error. This commit fixes that by capturing data exception errors, internal error (which is the default for elog()) and protocol error (c.f. `pq_getmsgint64` used by `numeric_deserialize`), but re-throwing all other errors. It also adds tests to make sure that the repair is done and not generates an error.

github-actions bot assigned mkindahl Apr 2, 2025

mkindahl requested review from erimatnor and fabriziomello April 2, 2025 16:48

mkindahl added the bug label Apr 2, 2025

mkindahl added this to the v2.19.2 milestone Apr 2, 2025

mkindahl force-pushed the soft-error-for-partialize-cagg branch from 28637ce to dc1c1ae Compare April 2, 2025 16:51

mkindahl force-pushed the soft-error-for-partialize-cagg branch from dc1c1ae to 4dc601c Compare April 3, 2025 08:20

mkindahl force-pushed the soft-error-for-partialize-cagg branch 2 times, most recently from f365ec7 to d708c4c Compare April 3, 2025 09:05

mkindahl marked this pull request as ready for review April 3, 2025 09:31

erimatnor reviewed Apr 3, 2025

View reviewed changes

fabriziomello approved these changes Apr 3, 2025

View reviewed changes

mkindahl force-pushed the soft-error-for-partialize-cagg branch 4 times, most recently from 318a081 to a6a481a Compare April 4, 2025 08:38

mkindahl force-pushed the soft-error-for-partialize-cagg branch from a6a481a to 4d62ea6 Compare April 4, 2025 14:26

mkindahl force-pushed the soft-error-for-partialize-cagg branch from 4d62ea6 to 673027d Compare April 7, 2025 06:38

philkra removed this from the v2.19.2 milestone Apr 7, 2025

philkra added this to the v2.19.3 milestone Apr 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't capture hard errors for old cagg format #7903

Don't capture hard errors for old cagg format #7903

mkindahl commented Apr 2, 2025 •

edited

Loading

mkindahl commented Apr 3, 2025

codecov bot commented Apr 3, 2025 •

edited

Loading

erimatnor left a comment •

edited

Loading

mkindahl commented Apr 3, 2025

erimatnor commented Apr 3, 2025 •

edited

Loading

erimatnor commented Apr 3, 2025

fabriziomello commented Apr 3, 2025

mkindahl commented Apr 4, 2025

mkindahl commented Apr 4, 2025

mkindahl commented Apr 4, 2025

mkindahl commented Apr 4, 2025 •

edited

Loading

Don't capture hard errors for old cagg format #7903

Are you sure you want to change the base?

Don't capture hard errors for old cagg format #7903

Conversation

mkindahl commented Apr 2, 2025 • edited Loading

mkindahl commented Apr 3, 2025

codecov bot commented Apr 3, 2025 • edited Loading

Codecov Report

erimatnor left a comment • edited Loading

Choose a reason for hiding this comment

mkindahl commented Apr 3, 2025

erimatnor commented Apr 3, 2025 • edited Loading

erimatnor commented Apr 3, 2025

fabriziomello commented Apr 3, 2025

mkindahl commented Apr 4, 2025

mkindahl commented Apr 4, 2025

mkindahl commented Apr 4, 2025

mkindahl commented Apr 4, 2025 • edited Loading

mkindahl commented Apr 2, 2025 •

edited

Loading

codecov bot commented Apr 3, 2025 •

edited

Loading

erimatnor left a comment •

edited

Loading

erimatnor commented Apr 3, 2025 •

edited

Loading

mkindahl commented Apr 4, 2025 •

edited

Loading