pgo: profile-guided optimization for golang v1.17 #49688
Conversation
Did you see #28262? Usually, big features like these are designed and discussed in the issue tracker first. This is a very large amount of code to show up all of a sudden :)
@mvdan and others: for background, the authors of this CL have been discussing their PGO implementation privately with the Go compiler team. This pull request is the mechanism they are using to share their code changes. This pull request certainly isn't going to go in as is--for one thing, it's against the 1.17 branch--but will be used for further discussion and, I hope, a later updated implementation on mainline. So, it's new to the community (and new to me too, for that matter) but it's not entirely out of the blue.
Thanks for the quick reply, @ianlancetaylor! I had some thoughts but they weren't related to the code changes, so I've left a comment on the issue instead.
Thank you for providing the context, @ianlancetaylor. This PR is still WIP as we (Uber) are working closely with @cherrymui, @aclements, and @jeremyfaller on cleaning up compiler dependencies (which would avoid a large amount of code duplication), supporting the common profiling format, adding basic-block layout optimization, and integrating our code in the main branch.
Usually changes of this magnitude start with a publicly circulated design proposal.
This will be no different. I don't think anyone is implying we would skip the proposal process. Uber's code drop is to get their work and ours on the same page, so we can build from there. I don't know what form it will take, but we have to start by sharing. :)
Minor note: @google-cla thinks there is no CLA signed yet. Per https://opensource.google/docs/cla/#external-contributors, I believe you need to get added to the appropriate "Google CLA" Google Group within Uber to be seen as from Uber by the bot. This will be needed eventually before changes can go in, though as noted above, design docs / proposals will come well before that.
@rajbarik, thanks for sharing this very interesting WIP patch! When building your feature branch at https://github.com/rajbarik/go/tree/pgo, I got the following errors:
A quick grep search shows ir.NodeFrequency is indeed missing from the ir package. Did I miss something here? Thanks.
@lni My apologies. I have updated the PR now and tested that it works. @cherrymui fyi.
Thanks for pointing this out. I am working on getting this fixed.
@rajbarik thanks a lot for the quick response. I've managed to build your patched version of Go, but interestingly I couldn't reproduce the performance improvements reported above. The LemireIterateManyb benchmark from bitset is slower when PGO is enabled. Please see my results below.
There are two observed differences in our benchmark outputs:
I've also tried on an M1-based MacBook Pro and got similar results. Did I still miss anything here? Thanks!
@lni Great to hear that you are able to build and test the patch.

Before pprofinline:
(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
(bitset) go tool pprof LemireIterate.pprof

After pprofinline:
(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt

after_log.txt content:
PprofInline: [hot] inline (*BitSet).extendSetMaybe in (*BitSet).Set with weight 5.20

Please let me know if they help.
@rajbarik Thanks for the help above; sadly, I still couldn't reproduce the performance improvements. Based on your suggestion above, I changed benchtime to 1000x and repeated the test -
Please let me know if there is anything else I should try. Thanks!
I'm trying too, but I couldn't reproduce the performance improvements either.

$ go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
goos: linux
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
BenchmarkLemireIterateManyb 1000 2482742 ns/op
PASS
ok github.com/bits-and-blooms/bitset 2.754s

$ go tool pprof LemireIterate.pprof
File: bitset.test
Type: cpu
Time: Dec 10, 2021 at 1:18am (CST)
Duration: 2.72s, Total samples = 2.60s (95.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 2.58s, 99.23% of 2.60s total
Dropped 12 nodes (cum <= 0.01s)
flat flat% sum% cum cum%
1.95s 75.00% 75.00% 2.07s 79.62% github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
0.42s 16.15% 91.15% 2.50s 96.15% github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
0.12s 4.62% 95.77% 0.12s 4.62% github.com/bits-and-blooms/bitset.trailingZeroes64 (inline)
0.09s 3.46% 99.23% 0.09s 3.46% runtime.memclrNoHeapPointers
0 0% 99.23% 0.09s 3.46% github.com/bits-and-blooms/bitset.New
0 0% 99.23% 0.09s 3.46% github.com/bits-and-blooms/bitset.TestBitSetHuge
0 0% 99.23% 0.02s 0.77% runtime.(*mcache).allocLarge
0 0% 99.23% 0.10s 3.85% runtime.makeslice
0 0% 99.23% 0.10s 3.85% runtime.mallocgc
0 0% 99.23% 0.08s 3.08% runtime.memclrNoHeapPointersChunked
0 0% 99.23% 2.49s 95.77% testing.(*B).launch
0 0% 99.23% 2.50s 96.15% testing.(*B).runN
0 0% 99.23% 0.10s 3.85% testing.tRunner
(pprof) q

$ go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt
goos: linux
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
BenchmarkLemireIterateManyb 1000 2504440 ns/op
PASS
ok github.com/bits-and-blooms/bitset 2.852s

$ go tool pprof LemireIterate.pprof
File: bitset.test
Type: cpu
Time: Dec 10, 2021 at 1:18am (CST)
Duration: 2.72s, Total samples = 2.60s (95.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 2.58s, 99.23% of 2.60s total
Dropped 12 nodes (cum <= 0.01s)
flat flat% sum% cum cum%
1.95s 75.00% 75.00% 2.07s 79.62% github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
0.42s 16.15% 91.15% 2.50s 96.15% github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
0.12s 4.62% 95.77% 0.12s 4.62% github.com/bits-and-blooms/bitset.trailingZeroes64 (inline)
0.09s 3.46% 99.23% 0.09s 3.46% runtime.memclrNoHeapPointers
0 0% 99.23% 0.09s 3.46% github.com/bits-and-blooms/bitset.New
0 0% 99.23% 0.09s 3.46% github.com/bits-and-blooms/bitset.TestBitSetHuge
0 0% 99.23% 0.02s 0.77% runtime.(*mcache).allocLarge
0 0% 99.23% 0.10s 3.85% runtime.makeslice
0 0% 99.23% 0.10s 3.85% runtime.mallocgc
0 0% 99.23% 0.08s 3.08% runtime.memclrNoHeapPointersChunked
0 0% 99.23% 2.49s 95.77% testing.(*B).launch
0 0% 99.23% 2.50s 96.15% testing.(*B).runN
0 0% 99.23% 0.10s 3.85% testing.tRunner
(pprof) q

$ cat after_log.txt | grep NextSetMany
PprofInline Caller=github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb Callee=github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany Weight=2200000000
PprofInline: can't determine the hotness, but inline trailingZeroes64 in (*BitSet).NextSetMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 28.03
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany

$ cat after_log.txt | grep "\[hot\]"
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 28.03

@rajbarik
For the full benchmark suite:

$ go test -bench=. -cpu 1 -parallel 1 -benchtime=3x -count=100 -cpuprofile all.pprof > before.txt
$ go test -bench=. -cpu 1 -parallel 1 -benchtime=3x -count=100 -gcflags="-fpprofprofileuse all.pprof -specialize -pprofinline" > after.txt
$ benchstat ./before.txt after.txt
name old time/op new time/op delta
Set 91.4ns ±52% 55.8ns ±40% -38.89% (p=0.000 n=95+95)
GetTest 81.6ns ±38% 62.2ns ±39% -23.72% (p=0.000 n=98+98)
SetExpand 1.02µs ±153% 1.95µs ±15% +90.68% (p=0.000 n=98+91)
Count 1.16µs ± 1% 1.16µs ± 2% ~ (p=0.777 n=90+96)
Iterate 13.1µs ± 0% 13.1µs ± 0% -0.06% (p=0.003 n=92+86)
SparseIterate 14.5µs ±30% 13.6µs ± 1% -6.49% (p=0.000 n=99+84)
LemireCreate 14.9ms ± 6% 15.7ms ± 4% +5.62% (p=0.000 n=95+97)
LemireCount 1.54ms ± 9% 1.51ms ± 7% -1.85% (p=0.000 n=98+100)
LemireIterate 4.20ms ± 1% 4.16ms ± 1% -1.11% (p=0.000 n=88+99)
LemireIterateb 4.22ms ± 1% 4.17ms ± 1% -1.11% (p=0.000 n=92+100)
LemireIterateManyb 2.94ms ± 5% 2.98ms ± 2% +1.05% (p=0.000 n=94+96)
FlorianUekermannIterateMany 724ns ±10% 679ns ± 9% -6.24% (p=0.000 n=89+98)
FlorianUekermannIterateManyReg 1.46µs ± 4% 1.40µs ± 4% -4.45% (p=0.000 n=100+97)
FlorianUekermannIterateManyComp 788ns ±11% 926ns ± 9% +17.47% (p=0.000 n=97+100)
FlorianUekermannLowDensityIterateMany 1.16ms ± 1% 1.19ms ± 1% +2.54% (p=0.000 n=91+97)
FlorianUekermannLowDensityIterateManyReg 1.14ms ± 1% 1.13ms ± 1% -0.74% (p=0.000 n=90+98)
FlorianUekermannLowDensityIterateManyComp 1.34ms ± 1% 1.35ms ± 1% +0.54% (p=0.000 n=94+93)
FlorianUekermannMidDensityIterateMany 12.7ms ± 1% 12.9ms ± 1% +1.79% (p=0.000 n=89+100)
FlorianUekermannMidDensityIterateManyReg 19.0ms ± 1% 18.9ms ± 0% -0.21% (p=0.000 n=94+97)
FlorianUekermannMidDensityIterateManyComp 13.8ms ± 1% 13.7ms ± 0% -0.45% (p=0.000 n=92+99)
FlorianUekermannMidStrongDensityIterateMany 38.3ms ± 1% 38.4ms ± 0% +0.36% (p=0.000 n=96+98)
FlorianUekermannMidStrongDensityIterateManyReg 71.1ms ± 0% 74.8ms ± 0% +5.26% (p=0.000 n=95+97)
FlorianUekermannMidStrongDensityIterateManyComp 46.3ms ± 2% 46.4ms ± 1% +0.21% (p=0.000 n=99+92)

Some test cases improved, but others got worse results.
The bitset version I am using is 5a829244ffd64e4120015ce1bf8285b0b6168d55, which is the latest in its default branch. Any chance you can quickly check which version of bitset you are running? Also, are there any other tests/benchmarks known to benefit from the PGO patch? I'll be happy to try those as well. Thanks!
@lni I will take a look at this next week and update you. Please stay tuned.
@rajbarik Thanks a lot for looking into this. We are building a new HTAP database in Go called MatrixOne. Our profiling results indicate that some hot functions are not inlined as expected when using the official compiler; in our experiments, we manually batched their invocations and immediately observed noticeable performance improvements. Profile-guided inlining as described in your PGO patch sounds like a pretty cool potential solution to the problem we face. Looking forward to your update; we will be very happy to cooperate further on this exciting PGO feature!
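To illustrate the batching workaround described above (a hypothetical sketch, not MatrixOne's actual code): when a small hot callee is not inlined, moving the loop into the callee pays the call overhead once per batch instead of once per element.

```go
package main

import "fmt"

// Per-element call: if the compiler declines to inline apply,
// every element pays function-call overhead.
func apply(x int64) int64 { return x*31 + 7 }

func sumPerElement(xs []int64) (s int64) {
	for _, x := range xs {
		s += apply(x)
	}
	return
}

// Batched call: one invocation processes the whole slice, so call
// overhead is amortized regardless of the inliner's decision.
func applyBatch(xs []int64) (s int64) {
	for _, x := range xs {
		s += x*31 + 7
	}
	return
}

func main() {
	xs := []int64{1, 2, 3}
	fmt.Println(sumPerElement(xs), applyBatch(xs)) // identical results
}
```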
@lni @zhouguangyuan0718 My apologies: I had modified the BenchmarkLemireIterateManyb function in bitset_benchmark_test.go to increase its execution time by setting every bit in the bitset. Only one line changed; please see the commented line below.
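(The exact diff is not preserved in this thread; the following is a hypothetical reconstruction, assuming the benchmark's usual setup loop, of what a one-line change that sets every bit could look like.)

```go
package main

import "github.com/bits-and-blooms/bitset"

func main() {
	// Hypothetical reconstruction of the one-line change described above.
	// Setting every bit (instead of a sparse subset) makes NextSetMany
	// iteration dominate the CPU profile, giving the inliner a clear signal.
	b := bitset.New(41000000)
	for v := uint(0); v < 41000000; v++ { // hypothetically was: v += 100
		b.Set(v)
	}
	_ = b
}
```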
With this change, here is what I see on my Intel Mac laptop:
Another benchmark where I see ~25% improvement is Tally. Steps below:
Please let me know if you are able to reproduce the bitset and tally results now. @lni Regarding MatrixOne, I am able to run the unit tests. Is there any particular unit test I should dig into more deeply? Are there integration/component tests?
It seems like PGO is being talked about internally again (#43930 (comment)). Is there a chance that we could have more details about the current plan and design? For those of us interested in the Go compiler while outside Google, it's been particularly hard to follow what is happening, let alone contribute in any meaningful way :) I understand that it may still be early for a formal proposal, but perhaps there could be a regular GitHub issue thread where we can subscribe for detailed updates every few weeks. Ideally the discussions about PGO would simply happen in a public place, removing the need for that back and forth between private and public communication.
It's only been two days, but my last comment is already buried by the commit push activity :) I had to click on "see more" three times to find it again. Can we please at least have one tracking issue to talk about PGO?
Oh, thanks Josh. It has been some time and I even forgot I already asked this last year :)
Fixes golang#45928.

Change-Id: Ifbb0effbca4ab7c0eb56069fee40edb564553c35
Reviewed-on: https://go-review.googlesource.com/c/go/+/410336
Reviewed-by: Cuong Manh Le <[email protected]>
Run-TryBot: Wayne Zuo <[email protected]>
Reviewed-by: David Chase <[email protected]>
Reviewed-by: Than McIntosh <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
And add some documentation for the debug query param.

Fixes golang#27737
Fixes golang#53971

Change-Id: I629aaa2d4a43175381eb04872f1caad238519a41
Reviewed-on: https://go-review.googlesource.com/c/go/+/421635
Run-TryBot: Cherry Mui <[email protected]>
Reviewed-by: Than McIntosh <[email protected]>
Reviewed-by: Daniel Martí <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Change-Id: I9ec725009376f5865adedca6c159b14140dde097
Reviewed-on: https://go-review.googlesource.com/c/go/+/426086
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Reviewed-by: Daniel Theophanes <[email protected]>
Run-TryBot: Daniel Theophanes <[email protected]>
For golang#46505.

Change-Id: I9bc9da5dd4b76cb2d8ff41390e1567678e72d88d
Reviewed-on: https://go-review.googlesource.com/c/go/+/428938
Run-TryBot: Robert Griesemer <[email protected]>
Reviewed-by: Robert Findley <[email protected]>
Auto-Submit: Robert Griesemer <[email protected]>
Reviewed-by: Robert Griesemer <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
pageAlloc.chunks used to require an atomic store when growing the heap because the scavenger would look at the list without locking the heap lock. However, the scavenger doesn't do that anymore, and it looks like nothing really does at all. This change updates the comment and makes the store non-atomic.

Change-Id: Ib452d147861060f9f6e74e2d98ee111cf89ce8f5
Reviewed-on: https://go-review.googlesource.com/c/go/+/429219
Auto-Submit: Michael Knyszek <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
All subfields use atomic types to ensure alignment, so there's no more need for these fields.

Change-Id: Iada4253f352a074073ce603f1f6b07cbd5b7c58a
Reviewed-on: https://go-review.googlesource.com/c/go/+/429220
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
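For context, the property this CL relies on can be shown in user code too; a minimal sketch (assuming Go 1.19's typed atomics, not the runtime's actual structs):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// The typed atomics in sync/atomic (Go 1.19+) guarantee correct alignment
// by construction, so a struct mixing them no longer needs hand-placed
// padding fields to keep 64-bit atomic operations safe on 32-bit platforms.
type stats struct {
	enabled atomic.Bool
	count   atomic.Uint64 // aligned automatically, even after the bool above
}

func main() {
	var s stats
	s.count.Add(1)
	fmt.Println(s.count.Load(), s.enabled.Load())
}
```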
This change fixes an old TODO that made it a uint64 because it would make alignment within mheap more complicated. Now that we don't have to worry about it since we're using atomic types as much as possible, switch to using a Uintptr. This likely will improve performance a tiny bit on 32-bit platforms, but really it's mostly cleanup.

Change-Id: Ie705799a111ccad977fc1f43de8b50cf611be303
Reviewed-on: https://go-review.googlesource.com/c/go/+/429221
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
This is an optimization that prevents N^2 behavior within a chunk, but was accidentally skipped. There should be no functional change as a result of this CL.

Fixes golang#54892.

Change-Id: I861967a2268699fdc3464bd41bc56618b5628e6b
Reviewed-on: https://go-review.googlesource.com/c/go/+/429255
Auto-Submit: Michael Knyszek <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
This was left over from the old pacer, and never removed when the old pacer was removed in Go 1.19.

Change-Id: I79e5f0420c6100c66bd06129a68f5bbab7c1ea8f
Reviewed-on: https://go-review.googlesource.com/c/go/+/429256
Reviewed-by: Keith Randall <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
For golang#46505.

Change-Id: I1a30fd895496befd16626afb48717ac837ed5778
Reviewed-on: https://go-review.googlesource.com/c/go/+/429315
Reviewed-by: Matthew Dempsky <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Reviewed-by: Robert Griesemer <[email protected]>
Because most of these APIs were only recently added, we can only make incremental improvements while preserving compatibility.

For golang#54854.

Change-Id: Id15d11288bf23902570d54eaf2704a5264210b2e
Reviewed-on: https://go-review.googlesource.com/c/go/+/429115
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: hopehook <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
FixedZone is transitively called by Time.UnmarshalJSON or Time.UnmarshalText for any RFC 3339 timestamp that is not in UTC. This function is relatively slow since it allocates 3 times. Given that RFC 3339 never has a zone name and most offsets are by the hour, we can cache unnamed zones on hour offsets.

Caching a Location should be safe since it has no exported fields or methods that can mutate the Location. It is functionally immutable. The only way to observe that the Location was cached is either by pointer comparison or by shallow copying the struct value. Neither operation seems sensible to do with a *time.Location.

Performance:
name old time/op new time/op delta
UnmarshalText 268ns ± 2% 182ns ± 1% -32.01% (p=0.000 n=10+10)

Change-Id: Iab5432f34bdbb485512bb8b5464e076c03fd106f
Reviewed-on: https://go-review.googlesource.com/c/go/+/425116
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Joseph Tsai <[email protected]>
Auto-Submit: Joseph Tsai <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
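The same memoization idea works in user code; a minimal sketch (not the stdlib's internal cache) that shares unnamed whole-hour zones, relying on the immutability argument above:

```go
package main

import (
	"fmt"
	"time"
)

// Cached unnamed *time.Location values for whole-hour UTC offsets
// from -12h to +12h. A Location is functionally immutable, so one
// instance can safely be shared across callers.
var hourZones [25]*time.Location

func init() {
	for h := -12; h <= 12; h++ {
		hourZones[h+12] = time.FixedZone("", h*3600)
	}
}

// fixedZone returns a cached Location for whole-hour offsets and falls
// back to allocating a fresh one for anything else.
func fixedZone(offsetSec int) *time.Location {
	if offsetSec%3600 == 0 {
		if h := offsetSec / 3600; h >= -12 && h <= 12 {
			return hourZones[h+12]
		}
	}
	return time.FixedZone("", offsetSec)
}

func main() {
	t := time.Date(2022, time.September, 6, 10, 0, 0, 0, fixedZone(2*3600))
	fmt.Println(t) // zero allocations for the +02:00 zone after init
}
```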
Change-Id: I213b078effa4b7049c44498d651de5b938e5404b
Reviewed-on: https://go-review.googlesource.com/c/go/+/428779
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Tobias Klauser <[email protected]>
Reviewed-by: hopehook <[email protected]>
Run-TryBot: hopehook <[email protected]>
Auto-Submit: Tobias Klauser <[email protected]>
When the type assertion fails, the test mistakenly prints the expected (rather than the actual) type. When the error string doesn't match, the test mistakenly prints the original (rather than the converted) error (although there might not be any difference in the output, the code looks wrong). Fix both issues.

Change-Id: Ia7dd0632fc677f458fec25d899c46268a12f76e7
Reviewed-on: https://go-review.googlesource.com/c/go/+/428916
Run-TryBot: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
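As a generic illustration of the first fix (a sketch of the pattern, not the actual test code being changed): on a failed type assertion, the failure message should format the dynamic type of the value you actually got, not the type you expected.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

func main() {
	var err error = errors.New("boom")

	// The buggy pattern printed the *expected* type on failure, which is
	// useless for debugging because it never changes. The fix: print %T of
	// the value actually received.
	if pe, ok := err.(*fs.PathError); !ok {
		fmt.Printf("got %T, want *fs.PathError\n", err) // actual type in the message
	} else {
		_ = pe
	}
}
```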
Change-Id: I25c8e8b701d6489f360fea30d09090826276b950
GitHub-Last-Rev: c2c8319
GitHub-Pull-Request: golang#54924
Reviewed-on: https://go-review.googlesource.com/c/go/+/428976
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Run-TryBot: Daniel Martí <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Daniel Martí <[email protected]>
It's set but the output is never used.

Change-Id: I36ecb9c5f087a85289529907ede9f9bfc295d739
Reviewed-on: https://go-review.googlesource.com/c/go/+/428637
Reviewed-by: Cherry Mui <[email protected]>
Reviewed-by: Benny Siegert <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
As discussed in CL 401434 there are substantial misuses of these in the wild, and they are a potential source of unsafety even for code that does not use them directly. Since proposal golang#53003 has already been implemented, now is the right time to deprecate reflect.{SliceHeader, StringHeader}.

For golang#53003.

Change-Id: I724cf46d4b22d2ed3cbf2b948e6aac5ee4bf0f6e
Reviewed-on: https://go-review.googlesource.com/c/go/+/428757
Run-TryBot: hopehook <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Auto-Submit: Matthew Dempsky <[email protected]>
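A minimal sketch of the replacement pattern (assuming Go 1.20+ for unsafe.String and unsafe.SliceData): instead of building reflect.SliceHeader/StringHeader by hand and casting through unsafe.Pointer, construct views with the dedicated unsafe functions.

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	data := []byte{'h', 'i'}

	// Replacement for the deprecated header structs: take a pointer to the
	// backing array and a length, and let the compiler track the aliasing.
	p := unsafe.SliceData(data)      // *byte pointing at the backing array
	s := unsafe.String(p, len(data)) // string view over the same bytes
	b := unsafe.Slice(p, len(data))  // []byte view over the same bytes

	fmt.Println(s, b)
}
```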
…ice, String}

After we deprecated reflect.{SliceHeader, StringHeader}, it is recommended to use unsafe.{Slice, String} to replace its work. However, the compiler and linker cannot be migrated for the time being. As a temporary strategy, using the "internal/unsafeheader" package like other code is the most suitable choice at present.

For golang#53003.

Change-Id: I69d0ef72e2d95caabd0706bbb247a719d225c758
Reviewed-on: https://go-review.googlesource.com/c/go/+/429755
Auto-Submit: Matthew Dempsky <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
Run-TryBot: hopehook <[email protected]>
…ckage

Follow CL 428777.

Change-Id: I5ce49322e92c5d6539bb08248e3366187c30dcd8
Reviewed-on: https://go-review.googlesource.com/c/go/+/428780
Run-TryBot: Tobias Klauser <[email protected]>
Reviewed-by: Bryan Mills <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Auto-Submit: Tobias Klauser <[email protected]>
First, we know that Go source files almost always weigh at least a few kilobytes, so we can kickstart the output buffer to be a reasonable size and reduce the initial number of incremental allocations and copies when appending bytes or strings to output.

Second, in nodeSize we use a nested printer, but we don't actually need its printed bytes - we only need to know how many bytes it prints. For that reason, use a throwaway buffer: the part of our output buffer between length and capacity, as we haven't used it yet.

Third, use a sync.Pool to reuse allocated printers. The current API doesn't allow reusing printers, and some programs like gofmt will print many files in sequence.

Those changes combined result in a modest reduction in allocations and CPU usage. The benchmark uses testdata/parser.go, which has just over two thousand lines of code, which is pretty standard size-wise. We also split the Print benchmark to cover both a medium-sized ast.File as well as a pretty small ast.Decl node. The latter is a somewhat common scenario in gopls, which has code actions which alter small bits of the AST and print them back out to rewrite only a few lines in a file.

name old time/op new time/op delta
PrintFile-16 5.43ms ± 1% 4.85ms ± 3% -10.68% (p=0.000 n=9+10)
PrintDecl-16 19.1µs ± 0% 18.5µs ± 1% -3.04% (p=0.000 n=10+10)

name old speed new speed delta
PrintFile-16 9.56MB/s ± 1% 10.69MB/s ± 3% +11.81% (p=0.000 n=8+10)
PrintDecl-16 1.67MB/s ± 0% 1.73MB/s ± 1% +3.05% (p=0.000 n=10+10)

name old alloc/op new alloc/op delta
PrintFile-16 332kB ± 0% 107kB ± 2% -67.87% (p=0.000 n=10+10)
PrintDecl-16 3.92kB ± 0% 3.28kB ± 0% -16.38% (p=0.000 n=10+10)

name old allocs/op new allocs/op delta
PrintFile-16 3.45k ± 0% 2.42k ± 0% -29.90% (p=0.000 n=10+10)
PrintDecl-16 56.0 ± 0% 46.0 ± 0% -17.86% (p=0.000 n=10+10)

Change-Id: I475a3babca77532b2d51888f49710f74763d81d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/424924
Run-TryBot: Daniel Martí <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Robert Griesemer <[email protected]>
Reviewed-by: Alan Donovan <[email protected]>
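The pooling technique is generally applicable; a minimal sketch (not go/printer's actual implementation) combining the pre-sized buffer and sync.Pool ideas from the CL:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses output buffers across calls, the same trick the CL applies
// to go/printer's internal printers: allocate once, reset, reuse. The buffer
// is pre-sized because the payloads are rarely tiny.
var bufPool = sync.Pool{
	New: func() any { return bytes.NewBuffer(make([]byte, 0, 4<<10)) },
}

func render(s string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // keep the capacity, drop the contents
		bufPool.Put(buf)
	}()
	fmt.Fprintf(buf, "formatted: %s\n", s)
	return buf.String() // copies out before the deferred Reset runs
}

func main() {
	fmt.Print(render("hello"), render("world"))
}
```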
Change-Id: Ie543e1b1df667cfaf3aafa4be727881461ee8b7d
GitHub-Last-Rev: ed993db
GitHub-Pull-Request: golang#54888
Reviewed-on: https://go-review.googlesource.com/c/go/+/428716
Auto-Submit: Bryan Mills <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Reviewed-by: Bryan Mills <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Change-Id: I9de5aafb36d05bdc90bbdba516367eb2b200a7e0
Reviewed-on: https://go-review.googlesource.com/c/go/+/428777
Auto-Submit: Tobias Klauser <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Tobias Klauser <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Implement CLONE_INTO_CGROUP feature, allowing to put a child in a specified cgroup in a clean and simple way. Note that the feature only works for cgroup v2, and requires Linux kernel 5.7 or newer. Using the feature requires a new syscall, clone3. Currently this is the only reason to use clone3, but the code is structured in a way so that other cases may be easily added in the future.

Add a test case.

While at it, try to simplify the syscall calling code in forkAndExecInChild1, which became complicated over time because:

1. It was using either rawVforkSyscall or RawSyscall6 depending on whether CLONE_NEWUSER was set.
2. On Linux/s390, the first two arguments to clone(2) system call are swapped (which deserved a mention in Linux ABI hall of shame). It was worked around in rawVforkSyscall on s390, but had to be implemented via a switch/case when using RawSyscall6, making the code less clear.

Let's
- modify rawVforkSyscall to have two arguments (which is also required for clone3);
- remove the arguments workaround from s390 asm, instead implementing arguments swap in the caller (which still looks ugly but at least it's done once and is clearly documented now);
- use rawVforkSyscall for all cases (since it is essentially similar to RawSyscall6, except for having less parameters, not returning r2, and saving/restoring the return address before/after syscall on 386 and amd64).

Updates golang#51246.

Change-Id: Ifcd418ebead9257177338ffbcccd0bdecb94474e
Reviewed-on: https://go-review.googlesource.com/c/go/+/417695
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Run-TryBot: Kirill Kolyshkin <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
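With this change, usage looks roughly like the following sketch (assuming the UseCgroupFD/CgroupFD fields this work adds to syscall.SysProcAttr; the cgroup path is hypothetical and must already exist, with cgroup v2 and kernel 5.7+):

```go
//go:build linux

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Open an existing cgroup v2 directory; /sys/fs/cgroup/mygroup is a
	// hypothetical path used only for illustration.
	cg, err := os.Open("/sys/fs/cgroup/mygroup")
	if err != nil {
		log.Fatal(err)
	}
	defer cg.Close()

	cmd := exec.Command("sleep", "10")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,         // request clone3 with CLONE_INTO_CGROUP
		CgroupFD:    int(cg.Fd()), // child starts directly inside this cgroup
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	log.Printf("started pid %d directly in the cgroup", cmd.Process.Pid)
	_ = cmd.Wait()
}
```

The point of CLONE_INTO_CGROUP is that the child is born in the target cgroup, avoiding the racy start-then-migrate dance of writing its PID to cgroup.procs afterwards.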
Change-Id: Ib9746cb4f27625cb22620271b280d2da242b2fba
Reviewed-on: https://go-review.googlesource.com/c/go/+/428437
Run-TryBot: Ian Lance Taylor <[email protected]>
Run-TryBot: Andy Pan <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Update CL 425881 and CL 428396

I browsed the source code related to copy_file_range in the kernel and found that the latest kernel may still return EXDEV errors in copy_file_range(2) due to certain cases, for details see: https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1559, https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1479, and https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1439. Therefore, the EXDEV still needs to be kept, but the ENOSYS error can be safely removed.

Change-Id: I47026b8dd33f7ffc4de1306af6b67c7b4d2062d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/428555
Run-TryBot: Ian Lance Taylor <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Andy Pan <[email protected]>
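The pattern being preserved here, shown in user-level form (a sketch using golang.org/x/sys/unix, not os's internal implementation): treat EXDEV from copy_file_range as "kernel can't handle this file pair" and fall back to a generic copy.

```go
//go:build linux

package main

import (
	"errors"
	"io"
	"os"

	"golang.org/x/sys/unix"
)

// copyFile tries copy_file_range(2) first and falls back to io.Copy when
// the kernel reports EXDEV (or ENOSYS on very old kernels): exactly the
// errors the CL above concludes must keep their fallback handling.
func copyFile(dst, src *os.File, size int64) error {
	n, err := unix.CopyFileRange(int(src.Fd()), nil, int(dst.Fd()), nil, int(size), 0)
	if errors.Is(err, unix.EXDEV) || errors.Is(err, unix.ENOSYS) {
		_, err = io.Copy(dst, src) // generic read/write fallback path
		return err
	}
	if err != nil {
		return err
	}
	_ = n // a single call may copy fewer than size bytes; real code would loop
	return nil
}

func main() {}
```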
Change-Id: I9c4c3ada3a8f5d8d198cc42a4afc06972ee00c61
GitHub-Last-Rev: 4ed8011
GitHub-Pull-Request: golang#54916
Reviewed-on: https://go-review.googlesource.com/c/go/+/428921
Run-TryBot: Ian Lance Taylor <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Change-Id: I7ea150e4896fc9b2e3a6dbdd9a1c2b651e74b844
Reviewed-on: https://go-review.googlesource.com/c/go/+/428778
Auto-Submit: Tobias Klauser <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Tobias Klauser <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
FYI, this has moved to https://go.dev/cl/429863, and there is a proposal at #55022.
This PR implements profile-guided optimization for Go v1.17. At a very high level, our implementation consumes pprof CPU profiles via the compiler flag '-fpprofprofileuse' and performs three key profile-based optimizations: code specialization ('-specialize'), inlining ('-pprofinline'), and code layout ('-codelayout'). Code specialization targets interface method calls and specializes them into direct function calls. Inlining decisions are made based on profiles. The code-layout optimization currently reorders functions based on profiles; basic-block layout optimization is currently WIP.
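To illustrate what code specialization does (a hand-written sketch of the transformation, not the compiler's actual output): a hot interface method call is rewritten into a type test plus a direct, inlinable call, with the indirect call kept as a cold fallback.

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Square struct{ side float64 }

func (s Square) Area() float64 { return s.side * s.side }

// Before: an indirect call through the interface's method table,
// which the inliner cannot see through.
func totalArea(shapes []Shape) (sum float64) {
	for _, s := range shapes {
		sum += s.Area()
	}
	return
}

// After specialization: if the profile says the receiver is almost
// always Square, guard with a type assertion and call directly; the
// direct call is then eligible for inlining.
func totalAreaSpecialized(shapes []Shape) (sum float64) {
	for _, s := range shapes {
		if sq, ok := s.(Square); ok {
			sum += sq.Area() // direct, inlinable call on the hot path
		} else {
			sum += s.Area() // cold fallback stays indirect
		}
	}
	return
}

func main() {
	shapes := []Shape{Square{2}, Square{3}}
	fmt.Println(totalArea(shapes), totalAreaSpecialized(shapes)) // identical results
}
```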
Inlining based on profiles yields decent wins (avg 15%, max 40%).
For bitset, we see ~19% gain from inlining. Please find the steps below to reproduce:
(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -cpuprofile LemireIterate.pprof
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1780585980 ns/op
PASS
ok github.com/bits-and-blooms/bitset 12.828s
(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-fpprofprofileuse LemireIterate.pprof -specialize -pprofinline"
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1597644922 ns/op
PASS
ok github.com/bits-and-blooms/bitset 10.303s
Please note: due to Go compiler dependency issues, we had to duplicate a lot of functionality from pprof and from inline/inl.go in order to make this work. We are currently working closely with the Google Go team to eliminate these redundancies.