-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: Go 1.6 regresses Windows performance due to timeBeginPeriod removal #14790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Smaller reproduction is always welcome. Thank you. Alex |
I tried investigating this, but your test crashes:
Alex |
Hmm, probably go-winio tests only run on Windows 10. Sorry about that. In any case, here's a smaller repro: http://play.golang.org/p/Zh8F6iqJOE. Run this with go test -bench ., making sure that your default timer resolution is 15ms. The results are quite stark for me: there is a 15x performance difference. |
@jstarks your last test reproduces the problem for me. Thank you. I really don't know what the problem here is. Your test uses mix of syscalls and Go channels to synchronize. I changed your test to use syscalls only (to see if problem goes away) and it does go away. And syscalls only version is much faster than your original version. I also built another version that uses channels only, and that is even faster. Here https://play.golang.org/p/JmZcyYUgTz are my test. If I run them I see this:
So in addition to your original issue, I am wondering why BenchmarkNoChannel1ms is ~500 times faster then Benchmark1ms. BenchmarkOnlyChannel1ms is even ~10 times faster, but that sounds reasonable to me (internal sync comparing to external). @dvyukov and @aclements if you can explain what is happening here. Or maybe suggestion on how to investigate this further. Thank you. Alex |
@alexbrainman Thanks. Indeed, it seems that for some reason we are sleeping for a tick instead of scheduling a ready goroutine in some code path. I played around with the runtime a bit, and I observed that if I removed the usleep in sysmon(), the performance got significantly (100x?) better. Of course, the usleep in sysmon() is there for a reason, so this is not a fix, but it might help with the investigation. I don't fully understand what sysmon's role is, but I think it's responsible for deciding that the syscall is going to take a while and that another goroutine should be scheduled to run. If that's the case, I guess the problem here is that the syscall is blocked until the goroutine gets to run, but the runtime has no way of knowing that and has to apply this (problematic) heuristic. Perhaps there should be a way to annotate a syscall as blocking, in which case the runtime could immediately schedule another goroutine instead of waiting. But again, I don't really understand the scheduler yet, so I may be off base. |
@jstarks, your explanation sounds entirely plausible. You're right that one of the roles of sysmon is to decide when a system call is probably blocked and that we should start up another thread to execute goroutines, but it's a very coarse mechanism, and the poor resolution of Windows timers really isn't helping here. This isn't usually an issue because most syscalls that block for a long time are socket IO and those go through a central select() mechanism, but you may have come across a case where it is an issue. In the runtime, we actually do have a way to indicate that a system call is blocking ( |
We have a similar mechanism on Linux etc with the "//sys" vs "//sysnb" comments which generate calls to either https://golang.org/pkg/syscall/#Syscall (maybe blocking) vs https://golang.org/pkg/syscall/#RawSyscall (not blocking). Looks like |
I don't know if we want to expose an yet another generic syscall API (CookedSyscall?), but it may make sense to mark the wrappers for things like synchronization operations as "almost certainly blocking" and internally use entersyscallblock. I really have no idea why we only use that on Solaris—it was created before my time—but it looks like it should work fine on any OS. It would be nice if we didn't need such hacks. |
I followed Brad's advice and changed my test program https://play.golang.org/p/JmZcyYUgTz to use syscall.RawSyscall in setEvent:
I also had to change runtime and syscall to implement syscall.RawSyscall:
I see not much changed:
(BenchmarkNoChannelDefaultResolution and BenchmarkNoChannel1ms are ~10% faster, but that is understandable). But main issue is still there. Alex |
I think we need to return back timeBeginPeriod. IMO the long-term solution is #7876. It would allow to raise sysmon sleep to that same 15ms. |
But UMS is only available 64-bit Windows from 7 and up. What about the
32-bit Windows port and 64-bit Windows server 2003?
|
@minux We will need the current sysmon as a backoff for OSes that does not support proper user-space scheduling. And windows implementation of that thing will still need timeBeginPeriod. UMS/sleep implementation needs to be chosen at startup depending on presence of necessary syscalls. But FWIW 2003 appeared before multi-core era and is so inefficient in lots of other fields, so I would not spend too much time on top performance on it. If you care about performance, updating to a newer OS should be the first step. |
All my benchtests ( https://play.golang.org/p/JmZcyYUgTz ) with 1ms in their name run with timeBeginPeriod set. Your suggestion still does not explain why, for example, BenchmarkNoChannel1ms is about 450 times faster than Benchmark1ms:
Why do you think this is? How can I find out?
I don't want to look for solution until I understand what my particular problem is. Alex |
What's the value of GOMAXPROCS? If it is 1, try to set it to 4. |
@alexbrainman Try the following patch:
|
GOMAXPROCS=4 makes both BenchmarkDefaultResolution and Benchmark1ms twice as fast, but still a lot slower than BenchmarkNoChannelDefaultResolution.
No change in performance as far as I can tell. Alex |
CL https://golang.org/cl/20834 mentions this issue. |
Had to blow dust off my windows machine. But my change does help when GOMAXPROCS>1: BenchmarkDefaultResolution-8 4583372 29720 -99.35% Here we are not even limited by sysmon sleeps because we have enough Ps to execute all runnable goroutines. This fix is even more important than timeBeginPeriod, because GOMAXPROCS=1 is rather esoteric today. |
CL https://golang.org/cl/20835 mentions this issue. |
Dmitry I tried your CL 20835, but I don't see any difference to my benchmarks:
The affect of CL 20834 we already know. It is difference between BenchmarkDefaultResolution and Benchmark1ms. But I am more concerned about even much larger difference between BenchmarkDefaultResolution and BenchmarkNoChannelDefaultResolution. Why does mixed (channels and syscalls) version perform so bad comparing to syscall only version? Alex |
CL https://golang.org/cl/21294 mentions this issue. |
It appears that windows osyield is just 15ms sleep on my computer (see benchmarks below). Replace NtWaitForSingleObject in osyield with SwitchToThread (as suggested by Dmitry). Also add issue #14790 related benchmarks, so we can track perfomance changes in CL 20834 and CL 20835 and beyond. Update #14790 benchmark old ns/op new ns/op delta BenchmarkChanToSyscallPing1ms 1953200 1953000 -0.01% BenchmarkChanToSyscallPing15ms 31562904 31248400 -1.00% BenchmarkSyscallToSyscallPing1ms 5247 4202 -19.92% BenchmarkSyscallToSyscallPing15ms 5260 4374 -16.84% BenchmarkChanToChanPing1ms 474 494 +4.22% BenchmarkChanToChanPing15ms 468 489 +4.49% BenchmarkOsYield1ms 980018 75.5 -99.99% BenchmarkOsYield15ms 15625200 75.8 -100.00% Change-Id: I1b4cc7caca784e2548ee3c846ca07ef152ebedce Reviewed-on: https://go-review.googlesource.com/21294 Run-TryBot: Alex Brainman <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Run-TryBot: Dmitry Vyukov <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>
@alexbrainman It's significantly faster for GOMAXPROCS > 1, as expected. For GOMAXPROCS=1, it's still much slower unless I explicitly call timeBeginPeriod in my code -- do you still intend to restore the timeBeginPeriod call in the runtime? |
Glad to hear that. Thanks for confirming.
I know that. Unfortunetly it is how Go's scheduler works at this moment. Maybe it will improve one day.
Yes. We are going back to calling timeBeginPeriod(1) as before. See https://go-review.googlesource.com/#/c/20834/ for details. Alex |
This reverts commit ab4c929. Sysmon critically depends on system timer resolution for retaking of Ps blocked in system calls. See #14790 for an example of a program where execution time goes from 2ms to 30ms if timeBeginPeriod(1) is not used. We can remove timeBeginPeriod(1) when we support UMS (#7876). Update #14790 Change-Id: I362b56154359b2c52d47f9f2468fe012b481cf6d Reviewed-on: https://go-review.googlesource.com/20834 Reviewed-by: Austin Clements <[email protected]> Run-TryBot: Dmitry Vyukov <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Alex Brainman <[email protected]>
0435e88 restored pre go1.6 behavior. I don't see what else to do here, so closing. Alex |
Thanks everyone for fixing this. The UMS approach still sounds like the best approach long term -- I may have some time to help with this in a few months if there is interest. |
I am interested. But I know zilch about UMS and Go scheduler. And it is @dvyukov's baby. But if you sent a change, I will push it along. Thank you. Alex |
What version of Go are you using (
go version
)?go version go1.6 windows/amd64
What operating system and processor architecture are you using (
go env
)?set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=
set GORACE=
set GOROOT=C:\Go
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GO15VENDOREXPERIMENT=1
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1
What did you do?
Run a benchmark from my go-winio package, which uses completion ports to provide non-thread-blocking IO on top of named pipes. It regressed significantly between go 1.5.3 and go 1.6, but only when I don't call timeBeginPeriod(1000). It seems that some aspect of the go scheduler still depends on a fast timer.
You can see this with:
What did you expect to see?
Same performance before and after setting gofasttick=1 (which makes go-winio call timeBeginPeriod(1000) before doing any IO).
What did you see instead?
I can try to get a more minimal repro if it would be helpful. This has regressed Docker client performance significantly in Windows when operating over named pipes.
The text was updated successfully, but these errors were encountered: