Skip to content

Improve traceability of assertion errors in common functions #130454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
oddbookworm opened this issue Feb 22, 2025 · 24 comments
Open

Improve traceability of assertion errors in common functions #130454

oddbookworm opened this issue Feb 22, 2025 · 24 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@oddbookworm
Copy link

oddbookworm commented Feb 22, 2025

Feature request

(By @picnizx)

While the C API docs explicitly state that functions should usually not be called with an exception set, else some assertion errors may occur and/or the thread state may be invalid, this means that assert(!PyErr_Occurred()) in common macros / functions such as _PyType_LookupRef would just crash. However, in a CI scenario, without a debugger, it's extremely hard to pinpoint the actual cause.

Related: #128159 would give the possibility to show the C stack trace with -X faulthandler.

A crude idea: add assert(!PyErr_Occurred()) inside public common C API functions to be sure that the assertion is checked as well, making debugging a bit easier.


Toy Scenario (original request):

I was working on freethread support for pygame-ce, and I finally need to use a python build with debug symbols, and our atexit-registered quit function was causing SIGABRT to be thrown. This took me a while to figure out. Here's all the code you need to run (or just run python -c "import pygame" since pygame.quit() is atexit-registered).

import pygame
pygame.quit()

Here's the guilty block of code in our codebase:

    funcobj = PyObject_GetAttrString(module, "_internal_mod_quit");

    /* If we could not load _internal_mod_quit, load quit function */
    if (!funcobj)
        funcobj = PyObject_GetAttrString(module, "quit");

    /* Silence errors */
    if (PyErr_Occurred())
        PyErr_Clear();

In the 3.12 docs for PyObject_GetAttrString and PyObject_GetAttr. there's no mention at all that there shouldn't be any exceptions set before calling this function, and it works on release versions of python, so nobody has ever questioned this structure. I'll be making a pull request on our end to clear any exceptions between the first call and second call if needed (or use PyObject_GetOptionalAttrString in 3.13+).

The reason for this issue is that I'm not sure throwing a SIGABRT is the best option here.

gdb backtrace on a debug python 3.13t
[gdb.txt](https://github.com/user-attachments/files/18922524/gdb.txt)

CPython versions tested on:

3.12, 3.13, 3.14

Operating systems tested on:

Linux

@oddbookworm oddbookworm added the type-bug An unexpected behavior, bug, or error label Feb 22, 2025
@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Feb 22, 2025
@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

The docs say:

Retrieve an attribute named attr_name from object o. Returns the attribute value on success, or NULL on failure.

In general, if NULL is returned, then usually an exception is set. But the docs could perhaps be improved.

there's no mention at all that there shouldn't be any exceptions set before calling this function

Normally, functions shouldn't be called with a set exception, except those clearing exceptions or handling exceptions as they could reset another exception afterwards. I don't know if we documented this or not.

@picnixz picnixz added docs Documentation in the Doc dir and removed type-bug An unexpected behavior, bug, or error interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Feb 22, 2025
@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

OTOH, if a missing attribute shouldn't be an error, you should use https://docs.python.org/3/c-api/object.html#c.PyObject_GetOptionalAttr instead as it silences the AttributeError

@picnixz picnixz added the pending The issue will be closed if no feedback is provided label Feb 22, 2025
@oddbookworm
Copy link
Author

oddbookworm commented Feb 22, 2025

Yeah, but maybe python could be a bit nicer and not throw an uncatchable critical error for this?
I mentioned using GetOptionalAttrString in 3.13+, but that doesn't help us with the 3.12 and below wheels we provide

@oddbookworm
Copy link
Author

Especially since it's something that could, in theory, be filtered out immediately

@oddbookworm
Copy link
Author

But actually, I'll probably switch our implementation to this

    if (PyObject_HasAttrString(module, "_internal_mod_quit"))
        funcobj = PyObject_GetAttrString(module, "_internal_mod_quit");
    else
    {
        funcobj = PyObject_GetAttrString(module, "quit");
    }

@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

Well.. i'm assuming that the crash says that an exception is set. However, that's how the C API usually works (this is documented in https://docs.python.org/3/c-api/intro.html#exceptions) [emphasis mine]

As a general principle, a function that calls another function to perform some task should check whether the called function raised an exception, and if so, pass the exception state on to its caller. It should discard any object references that it owns, and return an error indicator, but it should not set another exception — that would overwrite the exception that was just raised, and lose important information about the exact cause of the error.

Stated otherwise, if the first call to PyObject_GetAttrString raised an exception, you should clear it before calling PyObject_GetAttrString again:

funcobj = PyObject_GetAttrString(module, "_internal_mod_quit");

/* If we could not load _internal_mod_quit, load quit function */
if (!funcobj) {
    PyErr_Clear();
    funcobj = PyObject_GetAttrString(module, "quit");
}

@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

But actually, I'll probably switch our implementation to this

This is also an alternative.

@oddbookworm
Copy link
Author

oddbookworm commented Feb 22, 2025

Well.. i'm assuming that the crash says that an exception is set. However, that's how the C API usually works (this is documented in https://docs.python.org/3/c-api/intro.html#exceptions) [emphasis mine]

No, no it does not. It's a SIGABRT. That's a critical failure and it just instantly kills the interpreter. Which is kinda my point. It gives no info about why it's crashing. It seems really harsh to insta-kill the interpreter in a way that literally cannot be caught and gives zero indication of why without running a C debugger

(venv-freethreading) andrew@desktop ~/P/pygame-ce (fix-SYGABRT-on-debug-python-builds)> python -c "import pygame"
<frozen importlib._bootstrap>:488: RuntimeWarning: The global interpreter lock (GIL) has been enabled to load module 'pygame.base', which has not declared that it can run safely without the GIL. To override this behavior and keep the GIL disabled (at your own risk), run with PYTHON_GIL=0 or -Xgil=0.
pygame-ce 2.5.4.dev1 (SDL 2.32.0, Python 3.13.1)
python: Objects/typeobject.c:5252: _PyType_LookupRef: Assertion `!PyErr_Occurred()' failed.
fish: Job 1, 'python -c "import pygame"' terminated by signal SIGABRT (Abort)

@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

python: Objects/typeobject.c:5252: _PyType_LookupRef: Assertion `!PyErr_Occurred()' failed.

That's exactly what it says. The SIGABRT is because of an assertion failure.

@oddbookworm
Copy link
Author

Yeah, it says an assertion failure. But how did we get there? _PyType_LookupRef has a lot of potential codepaths to it. As a user, this would scare me and I would have no idea where to begin to debug this

@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

No, I mean, there are many reasons why we would get here, but the problem here is the usage. While we could change the behaviour of all functions by checking that PyErr_Occurred() is false at the very top, it's not how it should be designed. There are some assumptions that the C API, and most of the time, if CPython internal code shouldn't violate them. IOW, the issue usually stems from the user's code (although we do have issues where we violate it, to investigate, we need to have a MRE without 3rd party code).

But how did we get there?

Well... it's an API misuse, and we can't do more than writing docs to avoid that. In this case, it's mentioned that a C API function should not be called with an exception set, unless stated otherwise.

Sorry, but I think we should close this one as not an issue / not planned. Changing all C API functions to make them check for PyErr_Occurred before calling them isn't an easy task.

I'll tag some fellow C API experts @vstinner @encukou concerning whether we should have assert(PyErr_Occurred()) at the top of every public C API so that the traceback would be more precise (namely, in your case, you would have hit the assertion failure in the second call and not somewhere else).

It seems really harsh to insta-kill the interpreter in a way that literally cannot be caught and gives zero indication of why without running a C debugger

That's because there's no way to recover (properly in DEBUG mode) from such failure at this point. So we should kill the interpreter. In a release mode, that means swallowing up an exception, possibly leaking resources. In a DEBUG mode, you don't want to silently swallow an exception or leak resources without notice.

@oddbookworm
Copy link
Author

oddbookworm commented Feb 22, 2025

What am I supposed to do in this case though?

I get your points, but without some kind of traceability, it's almost impossible sometimes to actually figure out where the crash is coming from to fix it. Like this log for example. I'm fixing things on our end, and it runs fine on my local machine, and the action even runs fine locally for me, but when github runs it, now it's crashing. I can't run gdb through this, so I'm kinda SOL on fixing this...

(Granted, that was a manufactured scenario. I set that up as a demo for what happens before my fixes, but the point still stands I think)

@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

I see where you're coming from. One solution, as I said, is to add assert(!PyErr_Occurred()) to all public API functions, just to be able to recover more information but it's a very long task considering the number of functions to modify. Maybe with -X dev -X faulthandler would you have a bit more information? There is a recent PR that tries to allow to show the c_stack on error (#128159) so maybe it could help.

cc @ZeroIntensity as well

@picnixz picnixz removed the pending The issue will be closed if no feedback is provided label Feb 22, 2025
@picnixz picnixz changed the title PyObject_GetAttr throws SIGABRT on debug builds of python on a pre-existing exception Improve tracability of assertion errors in common functions Feb 22, 2025
@picnixz picnixz added type-feature A feature request or enhancement interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed docs Documentation in the Doc dir labels Feb 22, 2025
@picnixz picnixz removed this from docs issues Feb 22, 2025
@ZeroIntensity
Copy link
Member

Even with faulthandler C stacks, I'm not sure using faulthandler all the time is a good idea, especially on free-threading. It's currently prone to races in some programs, including on GILicious builds.

Rather than assert(!PyErr_Occurred()), it could be interesting to add a _Py_SanityCheck() for debugging. Something like this?

static inline void
_Py_SanityCheck(void)
{
    _Py_AssertHoldsTstate();
    assert(!PyErr_Occurred());
}

And then maybe something to sanity-check object arguments too:

static inline void
_Py_SanityCheckObjects(PyObject **objects, Py_ssize_t num)
{
    _Py_SanityCheck();
    for (Py_ssize_t i = 0; i < num; ++i)
    {
        PyObject *op = objects[i];
        assert(op != NULL);
        assert(!_PyObject_IsFreed(op));
        assert(Py_REFCNT(op) > 0);
    }
}

#define _Py_NUMARGS(...)  (sizeof((PyObject*[]) { __VA_ARGS__ }) / sizeof(PyObject*))
#define _Py_SanityCheckObjects(...) _Py_SanityCheckObjects((PyObject *[]){ __VA_ARGS__ }, _Py_NUMARGS(__VA_ARGS__))

It doesn't seem like it would be all that disruptive to add a one-liner check to stable functions, but I'm thinking out loud.

@picnixz
Copy link
Member

picnixz commented Feb 22, 2025

Rather than assert(!PyErr_Occurred()), it could be interesting to add a _Py_SanityCheck() for debugging. Something like this?

Yes, I thought of something like this that can be hidden behind an #ifndef NDEBUG and that would be a no-op macro otherwise. I don't know how long that task would be but it could definitely help users targetting where the exact failures happen. I mean, if you can use a debugger, it's fine, but if you cannot (and not "don't know"), then it's pretty hard to debug CI.

EDIT: we likely need a macro otherwise, the location will also fail I think. Because the __FILE__ and __LINE__ would point to where the assertion failed, so if it failed in a function called, it wouldn't help.

@encukou
Copy link
Member

encukou commented Feb 24, 2025

The main thing the debug build does is enable assertions; without a debugger those turn into fatal errors. Using the debug build without access to a debugger is indeed not very useful.

For GitHub actions, there are workarounds like gha-ssh. (I haven't tried it myself yet, though.)

Adding a check of the pre-requisites to each function is, I think, a neat idea, but would be a lot of work to doit in some maintainable way. IMO it goes together with cataloging what pre-requisites are. See #127443, which is more about refcounting, but “which arguments can be NULL” or “does this function need !PyErr_Occurred()/GIL state/initialized runtime” should be rather straightforward extensions.

@vstinner
Copy link
Member

No, no it does not. It's a SIGABRT. That's a critical failure and it just instantly kills the interpreter. Which is kinda my point. It gives no info about why it's crashing. It seems really harsh to insta-kill the interpreter in a way that literally cannot be caught and gives zero indication of why without running a C debugger

SIGABRT can be catched in the process or by a debugger such as gdb. You can enable faulthandler which registers a signal handler for SIGABRT and dumps the Python traceback where the signal was raised.

A C debugger is really helpful to debug C extensions, right. You should try to reproduce the issue locally.

I'll tag some fellow C API experts @vstinner @encukou concerning whether we should have assert(PyErr_Occurred()) at the top of every public C API so that the traceback would be more precise (namely, in your case, you would have hit the assertion failure in the second call and not somewhere else).

That's a lot of work and it's not doable in all cases. In practice, some C functions are called with an exception set to handle the current exception.

@oddbookworm
Copy link
Author

My point was that some of this code has run perfectly fine for well over a decade as-is, unsupported as it may be. Yes, it's violating the guidance that you shouldn't use the C-API with an exception set, and I'll work on cleaning that part up, but if it's going to throw SIGABRT when debug is present, then I would at least expect some kind of exception when compiled for release. I don't think it's unreasonable to expect switching from release to debug to just throw SIGABRT out of left field

@oddbookworm
Copy link
Author

The only reason I was using debug symbols in the first place was because I was working on free threading, and gdb wasn't being super helpful because it couldn't see all the internal symbols in the release Python build

@encukou
Copy link
Member

encukou commented Feb 24, 2025

I don't think it's unreasonable to expect switching from release to debug to just throw SIGABRT out of left field

What do you expect the debug build to do?
That is an honest question; please take it at face value. I am surprised that enabling assertions would be unreasonable.

To only get debug symbols, you can use compiler options (on GCC/clang, add -g and -O0 or -Og to CFLAGS).

@ZeroIntensity
Copy link
Member

I still do think that a sanity-check macro would be worthwhile; I do know people that test locally on debug builds, but it would be especially convenient for debugging on our end.

For this specific case: I sometimes have a workflow that runs my test suite through Valgrind to catch silent memory errors or UB, but that can double as a debugging workflow for things like this. Would you consider using something like that?

@oddbookworm
Copy link
Author

What do you expect the debug build to do?

Well, my expectation generally when I switch from release to debug mode is that behavior remains unchanged overall. Extra logged info, fewer optimizations, and sanity checks for impossible situations perhaps. Usually, I'm after the debug symbols. What I don't expect is sudden fatal exceptions that never happen in the release build. IMO assert should be reserved for checking impossible situations as a sanity check, not as a gotcha to catch users not using the API correctly

@oddbookworm
Copy link
Author

To only get debug symbols, you can use compiler options (on GCC/clang, add -g and -O0 or -Og to CFLAGS).

Not sure this works with pyenv builds. We use meson-python as our build backend, and I've never had success getting it to pick up on my self-built versions of python, so it's either install a prebuilt or use something like pyenv

@oddbookworm
Copy link
Author

For this specific case: I sometimes have a workflow that runs my test suite through Valgrind to catch silent memory errors or UB, but that can double as a debugging workflow for things like this. Would you consider using something like that?

I'll look into it. I'm kinda terrified of the attempt though because I'm afraid of what it'll find in all the legacy code that's as old as I am.

@colesbury colesbury changed the title Improve tracability of assertion errors in common functions Improve traceability of assertion errors in common functions Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants