Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't capture hard errors for old cagg format #7903

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mkindahl
Copy link
Contributor

@mkindahl mkindahl commented Apr 2, 2025

If an error occurs inside the deserialization function for continuous aggregates, we are using the old numeric format and this was captured inside inner_agg_deserialize and the format was "repaired". However, this captured hard errors as well, which can cause cascading errors if execution continues after the error.

This commit fixes that by capturing data exception errors, internal error (which is the default for elog()) and protocol error (c.f. pq_getmsgint64 used by numeric_deserialize), but re-throwing all other errors.

It also adds tests to check that a repair is attempted on old format, and that it does not throw en error.

@mkindahl mkindahl added the bug label Apr 2, 2025
@mkindahl mkindahl added this to the v2.19.2 milestone Apr 2, 2025
@mkindahl mkindahl force-pushed the soft-error-for-partialize-cagg branch from 28637ce to dc1c1ae Compare April 2, 2025 16:51
@mkindahl
Copy link
Contributor Author

mkindahl commented Apr 3, 2025

Note that this is not capturing errors it should capture, it seems. Hence the failures. Looking into it.

@mkindahl mkindahl force-pushed the soft-error-for-partialize-cagg branch from dc1c1ae to 4dc601c Compare April 3, 2025 08:20
Copy link

codecov bot commented Apr 3, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.96%. Comparing base (59f50f2) to head (673027d).
Report is 869 commits behind head on main.

Files with missing lines Patch % Lines
tsl/src/partialize_finalize.c 78.57% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7903      +/-   ##
==========================================
+ Coverage   80.06%   81.96%   +1.89%     
==========================================
  Files         190      249      +59     
  Lines       37181    46162    +8981     
  Branches     9450    11575    +2125     
==========================================
+ Hits        29770    37836    +8066     
- Misses       2997     3766     +769     
- Partials     4414     4560     +146     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mkindahl mkindahl force-pushed the soft-error-for-partialize-cagg branch 2 times, most recently from f365ec7 to d708c4c Compare April 3, 2025 09:05
@mkindahl mkindahl marked this pull request as ready for review April 3, 2025 09:31
Copy link
Contributor

@erimatnor erimatnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway we can test this? Looks like error handler is not covered by codecov.

@mkindahl
Copy link
Contributor Author

mkindahl commented Apr 3, 2025

Is there anyway we can test this? Looks like error handler is not covered by codecov.

There is unfortunately no easy way to test this. Something like this might be doable, but it is a bigger job, so suitable for a separate PR:

  • Use LD_PRELOAD to replace malloc() with our own version when running a test
  • Allow code to be marked with something similar to our TS_DEBUG_WAITPOINT that makes the malloc() call return NULL.
  • Add SQL function that allow you to "turn on" such a malloc() failure.
  • Update the test to "set" the malloc() failure before a particular SQL statement

The same applies to #7897 and #7893.

Nevertheless, I will take a look and see if I can create something like the above with a moderate effort.

@erimatnor
Copy link
Contributor

erimatnor commented Apr 3, 2025

Is there anyway we can test this? Looks like error handler is not covered by codecov.

Nevertheless, I will take a look and see if I can create something like the above with a moderate effort.

I think it is worthwhile working on tests, especially since there are several PRs trying to address similar issues so it seems important to get this right. While the changes in the PR seem reasonable, there's no way to currently know we are actually improving something or just making it worse. So, why should we merge some changes that we don't know works for sure?

@erimatnor
Copy link
Contributor

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

@fabriziomello
Copy link
Contributor

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Hmmm... now I was looking to it because I'm sure we had those tests and we had: 6c38c60

But at some point later we removed it :-(

@mkindahl
Copy link
Contributor Author

mkindahl commented Apr 4, 2025

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Hmmm... now I was looking to it because I'm sure we had those tests and we had: 6c38c60

But at some point later we removed it :-(

It was removed in 50578e4 because it caused issues with the test suite.

@mkindahl
Copy link
Contributor Author

mkindahl commented Apr 4, 2025

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Yes, but that will only test one part of the code. A "hard" error should still cause an abortion, and this part is not dealt with using this approach.

@mkindahl mkindahl force-pushed the soft-error-for-partialize-cagg branch 4 times, most recently from 318a081 to a6a481a Compare April 4, 2025 08:38
@mkindahl
Copy link
Contributor Author

mkindahl commented Apr 4, 2025

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

Hmmm... now I was looking to it because I'm sure we had those tests and we had: 6c38c60

But at some point later we removed it :-(

Looking at the test, it seems like we have no tests to check that the repair works, so added it here.

@mkindahl
Copy link
Contributor Author

mkindahl commented Apr 4, 2025

@mkindahl Is it possible to test this by storing a partial agg result in a table, then modify that to some garbage value, then run finalize over it? It should fail in the deserialization function I guess...

The repair can only handle very specific issues, so we cannot do random changes. I added a test that truncates the last 16 bytes of the partial to generate the "broken" partial that this code can repair.

I also added debug printouts so that we can see that the repair is attempted and actually done. Since these are debug printouts, they print a little more than just the bare minimum, but it at least checks that the repair is attempted and works.

@mkindahl mkindahl force-pushed the soft-error-for-partialize-cagg branch from a6a481a to 4d62ea6 Compare April 4, 2025 14:26
If an error occurs inside the deserialization function for continuous
aggregates, we are using the old numeric format and this was captured
inside `inner_agg_deserialize` and the format was "repaired". However,
this captured hard errors as well, which can cause cascading errors if
execution continues after the error.

This commit fixes that by capturing data exception errors, internal
error (which is the default for elog()) and protocol error (c.f.
`pq_getmsgint64` used by `numeric_deserialize`), but re-throwing all
other errors.

It also adds tests to make sure that the repair is done and not
generates an error.
@mkindahl mkindahl force-pushed the soft-error-for-partialize-cagg branch from 4d62ea6 to 673027d Compare April 7, 2025 06:38
@philkra philkra removed this from the v2.19.2 milestone Apr 7, 2025
@philkra philkra added this to the v2.19.3 milestone Apr 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants