-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
test_marshal: crash in Python 3.7b5 on Windows 10 #77901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Follow-up of bpo-33719. C:\Users\vstinner\AppData\Local\Programs\Python\Python37>python.exe -m test test_marshal -v Current thread 0x000003a0 (most recent call first): Crashes in test_marshal is on old topic: Current stack size: 2 million bytes (1.9 MiB) PCbuild/python.vcxproj: <StackReserveSize>2000000</StackReserveSize> |
I compiled the master branch of Python in release mode using VS2015 (MSC v.1912 64 bit) and I failed to reproduce the crash:
|
I also failed to reproduce the crash in the 3.7 branch. I guess that the python.org binary has been compiled differently. |
The uploaded binary is compiled with PGO enabled (and trained on most of the test suite). I'll check it out - hopefully we don't need to do anything drastic and can get away with either a compiler update or disabling optimizations on a single function. |
Ned, FYI |
I don't think that it's a release blocker. test_marshal does only crash on corner cases which should not occur on usual "valid" data. |
A crash in the test suite should be fixed, especially since we have protection against this crash (and a test that validates it). In this case, apparently the stack allocation for each frame of r_object grew and now there isn't room for 2000 calls (the value of MAX_MARSHAL_STACK_DEPTH). This isn't really a robust way of handling it anyway, so I'll check out whether there's an easy way to safely probe the stack before recursing, otherwise we'll just have to cut the number a bit further. |
I need to stop working on this right now, but here's the locals layout in a normal release build in r_object: @rdi @rdi p = 0x00000034 In the PGO build, it looks like this: I need to ping the team and figure out why the buffers are so far removed from the rest of the stack, and figure out what's in the gap. That seems to be the core of the problem. |
FYI, I posted the problem at https://developercommunity.visualstudio.com/content/problem/265778/pgo-generates-large-stack-frame.html (I _think_ it's public). Until then, it's probably best to reduce the recursion limit. Going to a depth of 1000 seems okay, and it certainly seems less bad than disabling optimisations. |
As for the test itself, the original test was added in dc78cc6. It tested that the stack overflow is not happen when unmarshal a fake code object with a deeply nested dict instead of co_code. It was renamed to test_loads_recursion in cf0fab2. In bpo-16475 (d7009c6) it was updated to mirror the changed structure of the code object in Python 3, but the original test was kept with the name test_loads_2x_code. Actually both tests work the same, because the value of few first bytes doesn't affect the result of these tests. Yet this test is fragile, it depends on the unstable structure of the code object, and it serves its purpose only because input data checking is not very strong in the current implementation. Future changes in the marshal module can make this test invalid, and this will not be noticed. In PR 7336 this test is rewritten in more reliable way. This likely is not directly related to the crash. |
Ned - PR 7401 fixes the crash at the cost of reducing the permitted recursion depth in marshal. Your call whether we take this fix or disable PGO, but I'd much rather take this fix (PGO has full-interpreter effects whereas this is very narrow). When I hear back about the compiler bug - if it is one - we should be able to raise the limit again without causing any compatibility issues. |
Steve, I'm OK with the proposed reduction as a workaround. We have other platform- and build-specific limits based on stack sizes etc. It's hard to have one-size-fits-all. |
Thanks. I merged, and miss-islington is backporting it. I guess at this stage you're manually cherry-picking into your own 3.7 branch for release? I'll merge for 3.7.1 though. |
No, not until 3.7.0rc1 next week. No need to cherry pick. Thanks! |
With the workaround merged for 3.7.0rc1, I'm downgrading from "release blocker" to "high" since I believe Steve wants to keep this issue open for a possible compiler resolution. |
Thanks. Agreed this should stay open, but it is blocked on a compiler fix. |
Do you want to have a different limit depending on the compiler and the compiler version? I'm not sure that it's worth it. I'm happy with the heuristic "use lower limit on Windows". By design, the test is fragile. I would suggest to close the issue, but you can keep it open if you prefer to increase the limit on a fixed compiler. |
What are your thoughts about rewriting the test (PR 7336)? |
The alternate solution is provided by PR 8071. The culprit of the main stack consumption is the code for unmarshalling float and complex in protocols 0 and 1 which uses 256-bytes buffers on the stack. Moving it to the separate function allows to decrease the stack consumption and increase the maximum marshal recursion depth on Windows. |
Thanks for writing that second PR, Serhiy! I'd already spotted that as the likely fix on our side, though I do want to complete the investigation into why MSVC generated unnecessarily sparse frames. But that is basically an internal discussion for me to follow up on at work. |
I added the backport to 3.7 label to your PR, as I want to restore the marshal stack depth limit in 3.7.1. |
Unfortunately PR 8071 doesn't fix the crash on PGO builds. :-( Initially I tested it only on the debug build, and now have tested on the PGO build. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: