Skip to content

Stack overflow collecting PGO data on Windows #113655

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
mdboom opened this issue Jan 2, 2024 · 41 comments · Fixed by #130007
Open

Stack overflow collecting PGO data on Windows #113655

mdboom opened this issue Jan 2, 2024 · 41 comments · Fixed by #130007
Assignees
Labels
3.12 only security fixes 3.13 bugs and security fixes build The build process and cross-build deferred-blocker OS-windows performance Performance or resource usage type-bug An unexpected behavior, bug, or error

Comments

@mdboom
Copy link
Contributor

mdboom commented Jan 2, 2024

Bug report

Bug description:

During a PGO build on Windows on main (471aa75), a test in test_functools fails with a stack overflow.

Full log of build

PCbuild\build.bat --pgo
0:02:50 load avg: 0.35 [19/44] test_functools
Windows fatal exception: stack overflow

Thread 0x00002b38 (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\libregrtest\win_utils.py", line 43 in _update_load

Current thread 0x00000e3c (most recent call first):
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1[875](https://github.com/faster-cpython/benchmarking/actions/runs/7367240496/job/20050307979#step:10:876) in fib
  File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib
  ...

C:\actions-runner\_work\benchmarking\benchmarking\cpython>"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\msbuild.exe" "C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\\pythoncore.vcxproj" /t:KillPython /nologo /v:m /clp:summary /p:Configuration=PGInstrument /p:Platform=x64 /p:KillPython=true 

  Killing any running python.exe instances...

C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pyproject.props(184,5): error : PGO run did not succeed (no python313!*.pgc files) and there is no data to merge [C:\actions-runner\_work\benchmarking\benchmarking\cpython\PCbuild\pythoncore.vcxproj]

Build FAILED.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Windows

Linked PRs

@mdboom mdboom added type-bug An unexpected behavior, bug, or error performance Performance or resource usage labels Jan 2, 2024
@mdboom mdboom self-assigned this Jan 2, 2024
@itamaro itamaro added build The build process and cross-build OS-windows labels Jan 2, 2024
@gvanrossum
Copy link
Member

Trying to repro, I can't even manage to complete a PGO build (first step above).

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

It sounds like you are repro'ing (I probably wasn't clear enough). It fails during the PGO build while running the tests that generate PGO data.

git bisect tells me the first bad commit is: 45e09f9 (which increased the recursion limit, which at least seems related).

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

It looks like 45e09f9 introduced the failing test, so it could just be that the test is incorrect (rather than a test going from passing to failing).

@gvanrossum
Copy link
Member

Ah, you're right, the output similar to yours scrolled off my screen. :-(

You should be able to experiment with some of the numbers changed by the PR to see which one affects the Windows PGO build. The stack limit is definitely an ongoing discussion.

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

Windows release buildbots are similarly failing: https://buildbot.python.org/all/#/builders/914/builds/3274/steps/4/logs/stdio

@gvanrossum
Copy link
Member

So it is crashing in this addition to test_functools.py:

    @support.skip_on_s390x
    @unittest.skipIf(support.is_wasi, "WASI has limited C stack")
    def test_lru_recursion(self):

        @self.module.lru_cache
        def fib(n):
            if n <= 1:
                return n
            return fib(n-1) + fib(n-2)

        if not support.Py_DEBUG:
            with support.infinite_recursion():
                fib(2500)

My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)

@mdboom
Copy link
Contributor Author

mdboom commented Jan 2, 2024

My conclusion is that the increased default C recursion limit is too large for the platform and we hit a stack overflow. Maybe try setting Py_C_RECURSION_LIMIT to something smaller than 8000? (I am trying now with 4000.)

Yes -- my experimentation arrived at that too. 5,000 is too large, but 4,000 seems to work. See #113668.

@mdboom mdboom added the 3.12 only security fixes label Jan 2, 2024
@gvanrossum
Copy link
Member

Also, this docstring (deleted by that PR) suggests a possible cause:

    """Set a lower limit for tests that interact with infinite recursions
    (e.g test_ast.ASTHelpers_Test.test_recursion_direct) since on some
    debug windows builds, due to not enough functions being inlined the
    stack size might not handle the default recursion limit (1000). See
    bpo-11105 for details."""

Maybe we are running into the inlining problem even in non-debug mode?

@gvanrossum
Copy link
Member

Let's put this on tomorrow's agenda.

@zooba
Copy link
Member

zooba commented Jan 5, 2024

Since Windows 8, we've had GetCurrentThreadStackLimits, which should be able to give us a better estimate of how many native frames will fit in various configurations. We can also pass in the initial stack size as a preprocessor variable, though it's very possible to create new threads with different sizes, and I believe we might use a smaller stack for new threads already?

If we have any references to stack variables that are passed between frames (e.g. a pointer to a local in EvalFrame that's available in the next EvalFrame) then we can get the size of each recursion from the current build. Of course, that'll vary based on how it recurses (and how many of our own native APIs it goes through), but it's likely better than guessing. At the very least, it might be helpful for some assertions to detect when the guesses are invalidated.

A true solution would be to use structured exception handling and catch the stack overflow in EvalFrame. That's going to leave things in a pretty messy state though (leaked references at a minimum, and likely worse), so I think it's better to just crash.

@zooba
Copy link
Member

zooba commented Jan 5, 2024

Since Windows 8, we've had GetCurrentThreadStackLimits

We could also just check this against the address of a local in a function and raise if we're within some distance from the extent of the stack. It won't be a predictable number of recursions, and it may still be possible to use native recursion to cause a crash, but it should be fairly simple to implement and isn't really any worse than the alternatives.

@mdboom mdboom changed the title Python stack overflow collecting PGO data on Windows Stack overflow collecting PGO data on Windows Jan 5, 2024
@zooba zooba added release-blocker 3.13 bugs and security fixes labels Jan 8, 2024
@mdboom
Copy link
Contributor Author

mdboom commented Jan 16, 2024

@markshannon It seems that even with #113944 merged, this bug still persists -- PGO build log. It's this line:

File "C:\actions-runner\_work\benchmarking\benchmarking\cpython\Lib\test\test_functools.py", line 1875 in fib

Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
mgorny added a commit to thesamesam/cpython that referenced this issue Sep 22, 2024
Lower the C recursion limit for PPC64 and SPARC, as they use relatively
large stack frames that cause e.g. `test_descr` to hit a stack overflow.
According to quick testing, it seems that values around 8000 are max
for PPC64 (ELFv1 ABI) and 7000 for SPARC64.  To keep things safe, let's
use 5000 for PPC64 and 4000 for SPARC.

Co-authored-by: Sam James <[email protected]>
mgorny added a commit to thesamesam/cpython that referenced this issue Sep 23, 2024
Lower the C recursion limit for HPPA, PPC64 and SPARC, as they use
relatively large stack frames that cause e.g. `test_descr` to hit
a stack overflow.  According to quick testing, it seems that values
around 8000 are max for HPPA and PPC64 (ELFv1 ABI) and 7000 for SPARC64.
To keep things safe, let's use 5000 for PPC64 and 4000 for SPARC.

Co-authored-by: Sam James <[email protected]>
vstinner pushed a commit that referenced this issue Sep 23, 2024
…4264)

Lower the C recursion limit for HPPA, PPC64 and SPARC, as they use
relatively large stack frames that cause e.g. `test_descr` to hit
a stack overflow.  According to quick testing, it seems that values
around 8000 are max for HPPA and PPC64 (ELFv1 ABI) and 7000 for SPARC64.
To keep things safe, let's use 5000 for PPC64 and 4000 for SPARC.

Co-authored-by: Michał Górny <[email protected]>
@encukou
Copy link
Member

encukou commented Jan 13, 2025

Should I look into this?
Currently I have no idea what's going on, but I can dive in.

@mdboom
Copy link
Contributor Author

mdboom commented Jan 13, 2025

I think this is resolved and should have been closed. Are you still able to reproduce it, @encukou?

@zooba
Copy link
Member

zooba commented Jan 14, 2025

I think this is resolved and should have been closed.

Where's the fix?

There's a workaround in the current build which significantly overallocates memory, otherwise we wouldn't be able to release. Build with %UseExtraStackReserve%=false to disable this and try it. (you'll find the use of this in python.vcxproj)

@vstinner
Copy link
Member

The fix is to reduce Py_C_RECURSION_LIMIT and increase the stack size (StackReserveSize), no?

There's a workaround in the current build which significantly overallocates memory

Is it a real issue to allocate 8 MB stack?

@zooba
Copy link
Member

zooba commented Jan 17, 2025

Is it a real issue to allocate 8 MB stack?

Depends how many threads you create. IIRC, all the physical memory is reserved to ensure it will be contiguous, so you will run out of RAM 4x quicker than a 2MB stack.

@encukou
Copy link
Member

encukou commented Jan 20, 2025

I'm still not sure I see how this is a blocker bug. What's being blocked?
The issue is that PGO builds need more stack memory than usual, right? Is this a (RAM) optimization issue?

@zooba
Copy link
Member

zooba commented Jan 20, 2025

The final release build requires more stack memory than usual because PGO was enabled, yes (or no, if you were thinking something different).

It's a C stack optimization issue. For whatever reason, due to how the stack ends up looking after PGO is applied (presumably to the main interpreter loop), we need to allocate more memory for every thread to account for the expected number of C recursions. Rather than revert the entire complex change that led to that regression, I chose to temporarily expand the memory used and left it marked as a deferred blocker. Without that workaround, we would've reverted a significant change (improvement, on the whole) to recursion tracking so we could release.

It would be great to handle the recursion count better so that we don't have to experimentally reduce the limit each time something changes in the interpreter loop structure. But there's a whole paid team working on that, so I was hoping that by leaving it assigned to them, they'd do it. I'm not happy with saying "let's charge every user 1MB/thread because our team couldn't fix this".

@vstinner
Copy link
Member

The stack size can be reduced on Windows release PGO build if Py_C_RECURSION_LIMIT is also reduced.

@encukou
Copy link
Member

encukou commented Feb 10, 2025

The issue with reducing Py_C_RECURSION_LIMIT is that at the point where it's defined, we don't know if we're compiling with PGO. (I tried adding a flag macro with -D in PreprocessorDefinitions, but couldn't get it to work... and then realized that since it's public API in a public header, it' not enough to change it just for the CPython build.)
Next time I boot the Windows box, I'll try changing PyConfigHText in _UpdatePyconfig.

@zooba
Copy link
Member

zooba commented Feb 10, 2025

Please don't make more build-time changes to pyconfig. That's already proven to be a mistake, and as soon as we figure out how free-threaded extension builds are going to figure things out I'm ripping it out (they currently work about 50% of the time, I've proposed they just make it an explicit option for now and detect the current running interpreter if unspecified).

If anything, we should just reduce the public macro and then optionally use a higher limit ourselves if we can. But I'd rather have some kind of approach that doesn't rely on a compile-time constant to figure out when the stack is becoming exhausted (catching the exception is not safe in our context because we don't control every point of the stack).

@markshannon
Copy link
Member

But I'd rather have some kind of approach that doesn't rely on a compile-time constant to figure out

Agreed. I think the best way to fix this is to push forwards implementing #91079 for Windows. I'd like to implement #91079 for all tier1 (maybe tier 2) platforms, but it is quite tricky for linux.

catching the exception is not safe in our context because we don't control every point of the stack

@zooba can you elaborate.

@encukou
Copy link
Member

encukou commented Feb 10, 2025

@markshannon, do you expect to implement #91079 for Windows in 3.14?

@markshannon
Copy link
Member

Unless there is a technical obstacle, yes.

We might be able to do it for linux and mac for 3.14 as well, but no promises.

@encukou
Copy link
Member

encukou commented Feb 10, 2025

OK. Thanks! I'll mark that as a blocker then. Please let me know if your plans change so I can try reducing Py_C_RECURSION_LIMIT instead.

@zooba
Copy link
Member

zooba commented Feb 10, 2025

catching the exception is not safe in our context because we don't control every point of the stack

@zooba can you elaborate.

To be clear, I mean the native SEH exception. Though I imagine that's what you would've assumed I meant?

The stack overflow exception could come from any code, not just ceval.c, and that code likely won't be expecting it and won't clean up (release locks, etc.). In that circumstance, letting the OS terminate the process is the only safe thing (this will get worse when freethreading is using locks everywhere, though right now provided nobody catches the Python exception it's likely that the stack will just unwind straight up the current thread).

e.g. the user might call into OpenSSL after Py_C_RECURSION_LIMIT-1 calls, and then halfway through handling a network call a helper function hits the hard limit and everything unwinds back to ceval.c. If we chose to catch that exception and turn it into something that can be handled, or something that can yield and allow other threads in the process to run, we turn a "safe" crash into an unsafe one.

@chris-eibl
Copy link
Member

Maybe Py_NO_INLINE could be used if the problematic function can be identified?
See #77901 when it was added to r_float_str:

/* Issue #33720: Disable inlining for reducing the C stack consumption
   on PGO builds. */
Py_NO_INLINE static double
r_float_str(RFILE *p)

@zooba
Copy link
Member

zooba commented Feb 19, 2025

Is there a plan to backport the fix to 3.12 and 3.13? Or do we need to keep the resource overallocation in those versions?

I'd prefer to backport. Or at least trim the hardcoded limit to avoid crashing.

@zooba zooba reopened this Feb 19, 2025
@encukou
Copy link
Member

encukou commented Feb 24, 2025

The fix uncovered unrelated issues. I don't think backporting it would be appropriate; definitely not for all platforms. It needs the full alpha/beta period.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 only security fixes 3.13 bugs and security fixes build The build process and cross-build deferred-blocker OS-windows performance Performance or resource usage type-bug An unexpected behavior, bug, or error
Projects
Status: In Progress
Development

Successfully merging a pull request may close this issue.

8 participants