pgo: profile-guided optimization for golang v1.17 #49688

rajbarik · 2021-11-19T18:26:42Z

This PR implements profile-guided optimization for Golang v1.17. At a very high-level, our implementation consumes PPROF CPU profiles via the compiler flag '-fpprofprofileuse' and performs three key optimizations based on profiles: code specialization ('-specialize'), inlining ('-pprofinline'), code layout ('-codelayout'). Code specialization targets interface method calls to specialize them to direct function calls. Inlining decisions are made based on profiles. Codelayout optimization currently reorders functions based on profiles. BasicBlock layout optimization is currently WIP.

Inlining based on profiles yields decent wins (avg 15%, max 40%).

For bitset, we see ~19% gain from inlining. Please find the steps below to reproduce:

(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -cpuprofile LemireIterate.pprof
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1780585980 ns/op
PASS
ok github.com/bits-and-blooms/bitset 12.828s

(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-fpprofprofileuse LemireIterate.pprof -specialize -pprofinline"
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1597644922 ns/op
PASS
ok github.com/bits-and-blooms/bitset 10.303s

Please note: due to Go compiler dependency issues, we had to duplicate a lot of functionalities from PPROF and from inline/inl.go in order to make this work. We are currently working closely with Google-go folks to eliminate these redundancies.

mvdan · 2021-11-20T16:48:05Z

Did you see #28262? Usually, big features like these are designed and discussed in the issue tracker first. This is a very large amount of code to show up with all of a sudden :)

ianlancetaylor · 2021-11-22T00:24:34Z

@mvdan and others: for background, the authors of this CL have been discussing their PGO implementation privately with the Go compiler team. This pull request is the mechanism they are using to share their code changes. This pull request certainly isn't going to go in as is--for one thing, it's against the 1.17 branch--but will be used for further discussion and, I hope, a later updated implementation on mainline.

So, it's new to the community (and new to me too, for that matter) but it's not entirely out of the blue.

mvdan · 2021-11-22T12:12:03Z

Thanks for the quick reply, @ianlancetaylor! I had some thoughts but they weren't related to the code changes, so I've left a comment on the issue instead.

rajbarik · 2021-11-22T17:24:49Z

Thank you for proving the context @ianlancetaylor. This PR is still WIP as we (Uber) are working closely with @cherrymui @aclements and @jeremyfaller on cleaning up compiler dependencies (which would avoid a large amount of code duplication), supporting the common profiling format, adding bblayout optimization, and integrating our code in the main branch.

josharian · 2021-11-22T17:46:03Z

Usually changes of this magnitude start with a publicly circulated design proposal.

jeremyfaller · 2021-11-22T17:57:06Z

Usually changes of this magnitude start with a publicly circulated design proposal.

This will be no different. I don't think anyone is implying we would skip the proposal process. Uber's code drop is to get their work and ours on the same page, so we can build from there. I don't know what form it will take, but we have to start by sharing. :)

prattmic · 2021-12-02T16:39:06Z

Minor note: @google-cla thinks there is no CLA signed yet. Per https://opensource.google/docs/cla/#external-contributors, I believe you need to get added to the appropriate "Google CLA" Google Group within Uber to be seen as from Uber by the bot. This will be needed eventually before changes can go in, though as noted above, design docs / proposals will come well before that.

lni · 2021-12-03T03:02:00Z

@rajbarik, thanks for sharing this very interesting WIP patch!

When building your feature branch at https://github.com/rajbarik/go/tree/pgo, I got the following errors

~/src/go/src$ ./make.bash
Building Go cmd/dist using /usr/local/go. (go1.17.2 darwin/arm64)
Building Go toolchain1 using /usr/local/go.
# bootstrap/cmd/compile/internal/pgo
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:230: fn.NodeFreq undefined (type *ir.Func has no field or method NodeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:230: undefined: ir.NodeFrequency
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:258: call.EdgeFreq undefined (type *ir.CallExpr has no field or method EdgeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:259: call.EdgeFreq undefined (type *ir.CallExpr has no field or method EdgeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:259: undefined: ir.EdgeFrequency
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:262: call.EdgeFreq undefined (type *ir.CallExpr has no field or method EdgeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:262: undefined: ir.EdgeFrequency
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:303: f.NodeFreq undefined (type *ir.Func has no field or method NodeFreq)
/Users/lni/src/go/src/cmd/compile/internal/pgo/irgraph.go:303: undefined: ir.NodeFrequency
go tool dist: FAILED: /usr/local/go/bin/go install -gcflags=-l -tags=math_big_pure_go compiler_bootstrap bootstrap/cmd/...: exit status 2

quick grep search shows ir.NodeFrequency is indeed missing from the ir package.

Did I miss something here? Thanks.

rajbarik · 2021-12-03T18:21:18Z

@lni My apologies. I have updated the PR now and have tested it to work.

@cherrymui fyi.

rajbarik · 2021-12-03T18:22:08Z

Minor note: @google-cla thinks there is no CLA signed yet. Per https://opensource.google/docs/cla/#external-contributors, I believe you need to get added to the appropriate "Google CLA" Google Group within Uber to be seen as from Uber by the bot. This will be needed eventually before changes can go in, though as noted above, design docs / proposals will come well before that.

Thanks for pointing this out. I am working on getting this fixed.

lni · 2021-12-04T13:56:03Z

@rajbarik thanks a lot for the quick response.

I've managed to build your patched version of go, interestingly I couldn't reproduce the above reported performance improvements. That LemireIterateManyb benchmark from bitset is slower when PGO is enabled. Please see my results below.

$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -cpuprofile LemireIterate.pprof
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	       3	   2676432 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	0.230s

$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-fpprofprofileuse LemireIterate.pprof -specialize -pprofinline"
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	       3	   2775468 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	0.233s

two observed differences in our benchmark outputs,

your ns/op is significantly higher than mine, 1780585980 ns/op vs. 2676432 ns/op
different processors, i7-8850H processor at 2.6Ghz vs. i9-9980HK at 2.4Ghz

I've tried on a M1 ARM processor based mbp as well and I got similar results.

Did I still miss anything here? Thanks!

rajbarik · 2021-12-06T23:18:09Z

@lni Great to hear that you are able to build and test the patch.
AFAICT, performance improvements may vary between hardware generations and SKUs. You may want to increase the benchtime to 10x or more to increase the execution time [in order to eliminate any noises]. Also, I recommend adding "-m" to gcflags and mining the logs to ensure inlining via pprof profiles is indeed working as expected [before and after]. Please see example below:

Before pprofinline

(bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1483586757 ns/op
PASS
ok github.com/bits-and-blooms/bitset 11.476s

(bitset) go tool pprof LemireIterate.pprof
Type: cpu
Time: Dec 6, 2021 at 9:42am (PST)
Duration: 10.74s, Total samples = 9.29s (86.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 9.21s, 99.14% of 9.29s total
Dropped 19 nodes (cum <= 0.05s)
flat flat% sum% cum cum%
3.29s 35.41% 35.41% 3.39s 36.49% github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
2.69s 28.96% 64.37% 9.11s 98.06% github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
1.57s 16.90% 81.27% 3s 32.29% github.com/bits-and-blooms/bitset.(*BitSet).Set
1.38s 14.85% 96.12% 1.43s 15.39% github.com/bits-and-blooms/bitset.(*BitSet).extendSetMaybe

After pprofinline

bitset) go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=3x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLemireIterateManyb 3 1311964698 ns/op
PASS
ok github.com/bits-and-blooms/bitset 8.782s
(bitset) go tool pprof LemireIterate_after.pprof
Type: cpu
Time: Dec 6, 2021 at 9:46am (PST)
Duration: 8.02s, Total samples = 6.97s (86.95%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 6.93s, 99.43% of 6.97s total
Dropped 9 nodes (cum <= 0.03s)
flat flat% sum% cum cum%
3.38s 48.49% 48.49% 3.40s 48.78% github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
3.31s 47.49% 95.98% 6.81s 97.70% github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb

after_log.txt content

PprofInline: [hot] inline (*BitSet).extendSetMaybe in (*BitSet).Set with weight 5.20
PprofInline: [hot] inline (*BitSet).Set in BenchmarkLemireIterateManyb with weight 10.92
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 12.34

Please let me know if they help.

lni · 2021-12-08T15:20:44Z

@rajbarik Thanks for the help above, sadly I still couldn't reproduce the performance improvements.

Based on your suggestion above, I changed benchtime to 1000x and repeated the test -

lni$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	    1000	   2275634 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	3.114s

lni$ go tool pprof LemireIterate.pprof
Type: cpu
Time: Dec 8, 2021 at 11:07pm (CST)
Duration: 2.52s, Total samples = 2.09s (82.86%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 2.07s, 99.04% of 2.09s total
Dropped 14 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.61s 77.03% 77.03%      1.63s 77.99%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.41s 19.62% 96.65%      2.05s 98.09%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.03s  1.44% 98.09%      0.03s  1.44%  runtime.asyncPreempt
     0.02s  0.96% 99.04%      0.02s  0.96%  runtime.kevent
         0     0% 99.04%      0.02s  0.96%  runtime.netpoll
         0     0% 99.04%      0.02s  0.96%  runtime.startTheWorldWithSema
         0     0% 99.04%      0.03s  1.44%  runtime.systemstack
         0     0% 99.04%      2.05s 98.09%  testing.(*B).launch
         0     0% 99.04%      2.05s 98.09%  testing.(*B).runN

lni$ /Users/lni/src/go/bin/go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt
goos: darwin
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
BenchmarkLemireIterateManyb 	    1000	   2290498 ns/op
PASS
ok  	github.com/bits-and-blooms/bitset	2.599s

lni$ go tool pprof LemireIterate_after.pprof
Type: cpu
Time: Dec 8, 2021 at 11:08pm (CST)
Duration: 2.49s, Total samples = 2.05s (82.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top40
Showing nodes accounting for 2.03s, 99.02% of 2.05s total
Dropped 9 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.48s 72.20% 72.20%      1.52s 74.15%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.48s 23.41% 95.61%      2.04s 99.51%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.05s  2.44% 98.05%      0.05s  2.44%  runtime.asyncPreempt
     0.02s  0.98% 99.02%      0.02s  0.98%  runtime.memclrNoHeapPointers
         0     0% 99.02%      0.02s  0.98%  github.com/bits-and-blooms/bitset.(*BitSet).Set
         0     0% 99.02%      0.02s  0.98%  github.com/bits-and-blooms/bitset.(*BitSet).extendSetMaybe
         0     0% 99.02%      0.03s  1.46%  runtime.makeslice
         0     0% 99.02%      0.03s  1.46%  runtime.mallocgc
         0     0% 99.02%      0.02s  0.98%  runtime.memclrNoHeapPointersChunked
         0     0% 99.02%      2.02s 98.54%  testing.(*B).launch
         0     0% 99.02%      0.02s  0.98%  testing.(*B).run1.func1
         0     0% 99.02%      2.04s 99.51%  testing.(*B).runN

lni$ cat after_log.txt | grep NextSetMany
PprofInline: can't determine the hotness, but inline trailingZeroes64 in (*BitSet).NextSetMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany

lni$ cat after_log.txt | grep "\[hot\]"
PprofInline: [hot] inline (*BitSet).extendSetMaybe in (*BitSet).Set with weight 3.57
PprofInline: [hot] inline strconv.FormatInt in (*BitSet).String with weight 3.57
PprofInline: [hot] inline (*BitSet).Set in BenchmarkLemireIterateManyb with weight 7.14
PprofInline: [hot] inline (*BitSet).String in TestStringLong with weight 3.57
PprofInline: [hot] inline New in TestBitSetHuge with weight 3.57

Please let me know if there anything else I should try. Thanks!

zhouguangyuan0718 · 2021-12-09T17:26:54Z

I‘m trying. ButI couldn't reproduce the performance improvements, too.
The result is a little different with lni.

$ go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m" -cpuprofile LemireIterate.pprof 2> before_log.txt
goos: linux
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
BenchmarkLemireIterateManyb         1000           2482742 ns/op
PASS
ok      github.com/bits-and-blooms/bitset       2.754s
$ go tool pprof LemireIterate.pprof
File: bitset.test
Type: cpu
Time: Dec 10, 2021 at 1:18am (CST)
Duration: 2.72s, Total samples = 2.60s (95.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)  top30
Showing nodes accounting for 2.58s, 99.23% of 2.60s total
Dropped 12 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.95s 75.00% 75.00%      2.07s 79.62%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.42s 16.15% 91.15%      2.50s 96.15%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.12s  4.62% 95.77%      0.12s  4.62%  github.com/bits-and-blooms/bitset.trailingZeroes64 (inline)
     0.09s  3.46% 99.23%      0.09s  3.46%  runtime.memclrNoHeapPointers
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.New
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.TestBitSetHuge
         0     0% 99.23%      0.02s  0.77%  runtime.(*mcache).allocLarge
         0     0% 99.23%      0.10s  3.85%  runtime.makeslice
         0     0% 99.23%      0.10s  3.85%  runtime.mallocgc
         0     0% 99.23%      0.08s  3.08%  runtime.memclrNoHeapPointersChunked
         0     0% 99.23%      2.49s 95.77%  testing.(*B).launch
         0     0% 99.23%      2.50s 96.15%  testing.(*B).runN
         0     0% 99.23%      0.10s  3.85%  testing.tRunner
(pprof) q
go test -bench=LemireIterateManyb -cpu 1 -parallel 1 -benchtime=1000x -gcflags="-m -fpprofprofileuse LemireIterate.pprof -pprofinline" -cpuprofile LemireIterate_after.pprof 2> after_log.txt
goos: linux
goarch: amd64
pkg: github.com/bits-and-blooms/bitset
cpu: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
BenchmarkLemireIterateManyb         1000           2504440 ns/op
PASS
ok      github.com/bits-and-blooms/bitset       2.852s
$  go tool pprof LemireIterate.pprof
File: bitset.test
Type: cpu
Time: Dec 10, 2021 at 1:18am (CST)
Duration: 2.72s, Total samples = 2.60s (95.47%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 2.58s, 99.23% of 2.60s total
Dropped 12 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     1.95s 75.00% 75.00%      2.07s 79.62%  github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany
     0.42s 16.15% 91.15%      2.50s 96.15%  github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb
     0.12s  4.62% 95.77%      0.12s  4.62%  github.com/bits-and-blooms/bitset.trailingZeroes64 (inline)
     0.09s  3.46% 99.23%      0.09s  3.46%  runtime.memclrNoHeapPointers
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.New
         0     0% 99.23%      0.09s  3.46%  github.com/bits-and-blooms/bitset.TestBitSetHuge
         0     0% 99.23%      0.02s  0.77%  runtime.(*mcache).allocLarge
         0     0% 99.23%      0.10s  3.85%  runtime.makeslice
         0     0% 99.23%      0.10s  3.85%  runtime.mallocgc
         0     0% 99.23%      0.08s  3.08%  runtime.memclrNoHeapPointersChunked
         0     0% 99.23%      2.49s 95.77%  testing.(*B).launch
         0     0% 99.23%      2.50s 96.15%  testing.(*B).runN
         0     0% 99.23%      0.10s  3.85%  testing.tRunner
(pprof) q
$  cat after_log.txt | grep NextSetMany
PprofInline Caller=github.com/bits-and-blooms/bitset.BenchmarkLemireIterateManyb Callee=github.com/bits-and-blooms/bitset.(*BitSet).NextSetMany Weight=2200000000
PprofInline: can't determine the hotness, but inline trailingZeroes64 in (*BitSet).NextSetMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 28.03
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannLowDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany
PprofInline: can't determine the hotness, but inline (*BitSet).NextSetMany in BenchmarkFlorianUekermannMidStrongDensityIterateMany
$   cat after_log.txt | grep "\[hot\]"
PprofInline: [hot] inline (*BitSet).NextSetMany in BenchmarkLemireIterateManyb with weight 28.03

@rajbarik
I want to know what else I should try, too.
Thank you.

zhouguangyuan0718 · 2021-12-10T03:43:49Z

For github.com/bits-and-blooms/bitset. I run benchmark for all the testcase. And add a test flag -count=100.

$ go test -bench=. -cpu 1 -parallel 1 -benchtime=3x -count=100 -cpuprofile all.pprof > before.txt
$ go test -bench=. -cpu 1 -parallel 1 -benchtime=3x -count=100 -gcflags="-fpprofprofileuse all.pprof -specialize -pprofinline" >after.txt
$ benchstat ./before.txt after.txt
name                                             old time/op   new time/op  delta
Set                                               91.4ns ±52%  55.8ns ±40%  -38.89%  (p=0.000 n=95+95)
GetTest                                           81.6ns ±38%  62.2ns ±39%  -23.72%  (p=0.000 n=98+98)
SetExpand                                        1.02µs ±153%  1.95µs ±15%  +90.68%  (p=0.000 n=98+91)
Count                                             1.16µs ± 1%  1.16µs ± 2%     ~     (p=0.777 n=90+96)
Iterate                                           13.1µs ± 0%  13.1µs ± 0%   -0.06%  (p=0.003 n=92+86)
SparseIterate                                     14.5µs ±30%  13.6µs ± 1%   -6.49%  (p=0.000 n=99+84)
LemireCreate                                      14.9ms ± 6%  15.7ms ± 4%   +5.62%  (p=0.000 n=95+97)
LemireCount                                       1.54ms ± 9%  1.51ms ± 7%   -1.85%  (p=0.000 n=98+100)
LemireIterate                                     4.20ms ± 1%  4.16ms ± 1%   -1.11%  (p=0.000 n=88+99)
LemireIterateb                                    4.22ms ± 1%  4.17ms ± 1%   -1.11%  (p=0.000 n=92+100)
LemireIterateManyb                                2.94ms ± 5%  2.98ms ± 2%   +1.05%  (p=0.000 n=94+96)
FlorianUekermannIterateMany                        724ns ±10%   679ns ± 9%   -6.24%  (p=0.000 n=89+98)
FlorianUekermannIterateManyReg                    1.46µs ± 4%  1.40µs ± 4%   -4.45%  (p=0.000 n=100+97)
FlorianUekermannIterateManyComp                    788ns ±11%   926ns ± 9%  +17.47%  (p=0.000 n=97+100)
FlorianUekermannLowDensityIterateMany             1.16ms ± 1%  1.19ms ± 1%   +2.54%  (p=0.000 n=91+97)
FlorianUekermannLowDensityIterateManyReg          1.14ms ± 1%  1.13ms ± 1%   -0.74%  (p=0.000 n=90+98)
FlorianUekermannLowDensityIterateManyComp         1.34ms ± 1%  1.35ms ± 1%   +0.54%  (p=0.000 n=94+93)
FlorianUekermannMidDensityIterateMany             12.7ms ± 1%  12.9ms ± 1%   +1.79%  (p=0.000 n=89+100)
FlorianUekermannMidDensityIterateManyReg          19.0ms ± 1%  18.9ms ± 0%   -0.21%  (p=0.000 n=94+97)
FlorianUekermannMidDensityIterateManyComp         13.8ms ± 1%  13.7ms ± 0%   -0.45%  (p=0.000 n=92+99)
FlorianUekermannMidStrongDensityIterateMany       38.3ms ± 1%  38.4ms ± 0%   +0.36%  (p=0.000 n=96+98)
FlorianUekermannMidStrongDensityIterateManyReg    71.1ms ± 0%  74.8ms ± 0%   +5.26%  (p=0.000 n=95+97)
FlorianUekermannMidStrongDensityIterateManyComp   46.3ms ± 2%  46.4ms ± 1%   +0.21%  (p=0.000 n=99+92)

For some testcase, it improved. But some testcases got a worse result.

lni · 2021-12-11T03:22:17Z

@rajbarik

The bitset version I am using is 5a829244ffd64e4120015ce1bf8285b0b6168d55 which is the latest in its default branch. Any chance you can quickly check which version of bitset are you running?

Also wondering is there any other tests/benchmarks that are known to benefit from the PGO patch? Will be happy to try those as well. Thanks!

rajbarik · 2021-12-16T21:22:03Z

@lni I will take a look at this next week and update you. Please stay tuned.

lni · 2021-12-17T06:21:48Z

@rajbarik Thanks a lot for looking into this.

We are building a new HTAP database in Go called MatrixOne. Our profiling results indicate that some hot functions are not inlined as expected when using the official compiler, in our experiments, we manually batched their invocations and immediately observed noticeable performance improvements. Profiling guided inlining as described in your PGO patch sounds like a pretty cool potential solution to the problem we face.

Looking forward to your update, will be very happy to further cooperate on this exciting PGO feature!

rajbarik · 2021-12-20T18:12:05Z

@lni @zhouguangyuan0718 My apologies, I had modified the BenchmarkLemireIterateManyb function in bitset_benchmark_test.go to increase its execution time by setting every bit in the bitset. I had modified one line; please see the commented line below.

// go test -bench=BenchmarkLemireIterateManyb
// see http://lemire.me/blog/2016/09/22/swift-versus-java-the-bitset-performance-test/
func BenchmarkLemireIterateManyb(b *testing.B) {
  bitmap := New(1000000000)
  for v := uint(0); v <= 1000000000; v += 1 { // This is the only change to the code which replaces 100 with 1
    bitmap.Set(v)
  }
  buffer := make([]uint, 256)
  b.ResetTimer()
  sum := uint(0)
  for i := 0; i < b.N; i++ {
    j := uint(0)
    j, buffer = bitmap.NextSetMany(j, buffer)
    for ; len(buffer) > 0; j, buffer = bitmap.NextSetMany(j, buffer) {
      for k := range buffer {
        sum += buffer[k]
      }
      j++
    }
  }

  if sum == 0 { // added just to fool ineffassign
    return
  }
}

With this change, here is what I see on my Intel Mac laptop:

go test -bench=BenchmarkLemireIterateManyb -cpu 1 -parallel 1 -count 5 -benchtime=3x -cpuprofile LemireIterate.pprof > before.txt
go test -bench=BenchmarkLemireIterateManyb -cpu 1 -parallel 1 -count 5 -benchtime=3x '-gcflags=-fpprofprofileuse LemireIterate.pprof -pprofinline' -cpuprofile LemireIterate_after.pprof > after.txt
benchstat before.txt after.txt
name                old time/op  new time/op  delta
LemireIterateManyb   1.46s ± 4%   1.31s ± 2%  -9.87%  (p=0.008 n=5+5)

Another benchmark where I see ~25% improvement is Tally. Steps below:

go test -bench=ScopeReporting/size1000000 -cpu 1 -parallel 1 -count 5 -benchtime=1x -cpuprofile scope.pprof > before.txt
go test -bench=ScopeReporting/size1000000 -cpu 1 -parallel 1 -count 5 -benchtime=1x '-gcflags=-fpprofprofileuse scope.pprof -pprofinline' -cpuprofile scope_after.pprof > after.txt
benchstat before.txt after.txt
name                        old time/op  new time/op  delta
ScopeReporting/size1000000  2.75ms ± 5%  2.04ms ± 5%  -25.75%  (p=0.008 n=5+5)

Please let me know if you are able to reproduce bitset and tally now.

@lni Regarding MatrixOne, I am able to run the unit-tests. Is there any particular unit-test I should dig deeper? Are there integration/component tests?

mvdan · 2022-08-22T09:27:42Z

It seems like PGO is being talked about internally again (#43930 (comment)). Is there a chance that we could have more details about the current plan and design? For those of us interested in the Go compiler while outside Google, it's being particularly hard to follow what is happening, let alone contribute in any meaningful way :)

I understand that it may still be early for a formal proposal, but perhaps there could be a regular GitHub issue thread where we can subscribe for detailed updates every few weeks. Ideally the discussions about PGO would simply happen in a public place, removing the need for that back and forth between private and public communication.

mvdan · 2022-08-24T18:40:23Z

It's only been two days, but my last comment is already buried by the commit push activity :) I had to click on "see more" three times to find it again. Can we please at least have one tracking issue to talk about PGO?

josharian · 2022-08-24T18:52:36Z

#28262

mvdan · 2022-08-24T18:56:18Z

Oh, thanks Josh. It has been some time and I even forgot I already asked this last year :)

Fixes golang#45928. Change-Id: Ifbb0effbca4ab7c0eb56069fee40edb564553c35 Reviewed-on: https://go-review.googlesource.com/c/go/+/410336 Reviewed-by: Cuong Manh Le <[email protected]> Run-TryBot: Wayne Zuo <[email protected]> Reviewed-by: David Chase <[email protected]> Reviewed-by: Than McIntosh <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

And add some documentation for the debug query param. Fixes golang#27737 Fixes golang#53971 Change-Id: I629aaa2d4a43175381eb04872f1caad238519a41 Reviewed-on: https://go-review.googlesource.com/c/go/+/421635 Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Than McIntosh <[email protected]> Reviewed-by: Daniel Martí <[email protected]> Reviewed-by: Cherry Mui <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

Change-Id: I9ec725009376f5865adedca6c159b14140dde097 Reviewed-on: https://go-review.googlesource.com/c/go/+/426086 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Michael Pratt <[email protected]> Reviewed-by: Daniel Theophanes <[email protected]> Run-TryBot: Daniel Theophanes <[email protected]>

For golang#46505. Change-Id: I9bc9da5dd4b76cb2d8ff41390e1567678e72d88d Reviewed-on: https://go-review.googlesource.com/c/go/+/428938 Run-TryBot: Robert Griesemer <[email protected]> Reviewed-by: Robert Findley <[email protected]> Auto-Submit: Robert Griesemer <[email protected]> Reviewed-by: Robert Griesemer <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

pageAlloc.chunks used to require an atomic store when growing the heap because the scavenger would look at the list without locking the heap lock. However, the scavenger doesn't do that anymore, and it looks like nothing really does at all. This change updates the comment and makes the store non-atomic. Change-Id: Ib452d147861060f9f6e74e2d98ee111cf89ce8f5 Reviewed-on: https://go-review.googlesource.com/c/go/+/429219 Auto-Submit: Michael Knyszek <[email protected]> Reviewed-by: Keith Randall <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Michael Knyszek <[email protected]> Reviewed-by: Keith Randall <[email protected]>

All subfields use atomic types to ensure alignment, so there's no more need for these fields. Change-Id: Iada4253f352a074073ce603f1f6b07cbd5b7c58a Reviewed-on: https://go-review.googlesource.com/c/go/+/429220 Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Auto-Submit: Michael Knyszek <[email protected]> Reviewed-by: Keith Randall <[email protected]> Reviewed-by: Keith Randall <[email protected]>

This change fixes an old TODO that made it a uint64 because it would make alignment within mheap more complicated. Now that we don't have to worry about it since we're using atomic types as much as possible, switch to using a Uintptr. This likely will improve performance a tiny bit on 32-bit platforms, but really it's mostly cleanup. Change-Id: Ie705799a111ccad977fc1f43de8b50cf611be303 Reviewed-on: https://go-review.googlesource.com/c/go/+/429221 Reviewed-by: Keith Randall <[email protected]> Reviewed-by: Keith Randall <[email protected]> Auto-Submit: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Michael Knyszek <[email protected]>

This is an optimization that prevents N^2 behavior within a chunk, but was accidentally skipped. There should be no functional change as a result of this CL. Fixes golang#54892. Change-Id: I861967a2268699fdc3464bd41bc56618b5628e6b Reviewed-on: https://go-review.googlesource.com/c/go/+/429255 Auto-Submit: Michael Knyszek <[email protected]> Reviewed-by: Keith Randall <[email protected]> Reviewed-by: Keith Randall <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Michael Knyszek <[email protected]>

This was left over from the old pacer, and never removed when the old pacer was removed in Go 1.19. Change-Id: I79e5f0420c6100c66bd06129a68f5bbab7c1ea8f Reviewed-on: https://go-review.googlesource.com/c/go/+/429256 Reviewed-by: Keith Randall <[email protected]> Run-TryBot: Michael Knyszek <[email protected]> Auto-Submit: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Keith Randall <[email protected]>

For golang#46505. Change-Id: I1a30fd895496befd16626afb48717ac837ed5778 Reviewed-on: https://go-review.googlesource.com/c/go/+/429315 Reviewed-by: Matthew Dempsky <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Reviewed-by: Robert Griesemer <[email protected]>

Because most of these APIs are recently supported, we can only do some advancement work as much as possible under the premise of compatibility. For golang#54854. Change-Id: Id15d11288bf23902570d54eaf2704a5264210b2e Reviewed-on: https://go-review.googlesource.com/c/go/+/429115 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: hopehook <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>

FixedZone is transitively called by Time.UnmarshalJSON or Time.UnmarshalText for any RFC 3339 timestamp that is not in UTC. This function is relatively slow since it allocates 3 times. Given that RFC 3339 never has a zone name and most offsets are by the hour, we can cache unnamed zones on hour offsets. Caching a Location should be safe since it has no exported fields or methods that can mutate the Location. It is functionally immutable. The only way to observe that the Location was cached is either by pointer comparison or by shallow copying the struct value. Neither operation seems sensible to do with a *time.Location. Performance: name old time/op new time/op delta UnmarshalText 268ns ± 2% 182ns ± 1% -32.01% (p=0.000 n=10+10) Change-Id: Iab5432f34bdbb485512bb8b5464e076c03fd106f Reviewed-on: https://go-review.googlesource.com/c/go/+/425116 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Run-TryBot: Joseph Tsai <[email protected]> Auto-Submit: Joseph Tsai <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]>

Change-Id: I213b078effa4b7049c44498d651de5b938e5404b Reviewed-on: https://go-review.googlesource.com/c/go/+/428779 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Run-TryBot: Tobias Klauser <[email protected]> Reviewed-by: hopehook <[email protected]> Run-TryBot: hopehook <[email protected]> Auto-Submit: Tobias Klauser <[email protected]>

When the type assertion fails, the test mistakenly prints the expected (rather than the actual) type. When the error string doesn't match, the text mistakenly prints the original (rather than the converted) error (although there might not be any difference in the output, the code looks wrong). Fix both issues. Change-Id: Ia7dd0632fc677f458fec25d899c46268a12f76e7 Reviewed-on: https://go-review.googlesource.com/c/go/+/428916 Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>

Change-Id: I25c8e8b701d6489f360fea30d09090826276b950 GitHub-Last-Rev: c2c8319 GitHub-Pull-Request: golang#54924 Reviewed-on: https://go-review.googlesource.com/c/go/+/428976 Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> Run-TryBot: Daniel Martí <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Daniel Martí <[email protected]>

It's set but the output is never used. Change-Id: I36ecb9c5f087a85289529907ede9f9bfc295d739 Reviewed-on: https://go-review.googlesource.com/c/go/+/428637 Reviewed-by: Cherry Mui <[email protected]> Reviewed-by: Benny Siegert <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]>

As discussed in CL 401434 there are substantial misuses of these in the wild, and they are a potential source of unsafety even for code that does not use them directly. Since proposal golang#53003 has already been implemented, now is the right time to deprecate reflect.{SliceHeader, StringHeader}. For golang#53003. Change-Id: I724cf46d4b22d2ed3cbf2b948e6aac5ee4bf0f6e Reviewed-on: https://go-review.googlesource.com/c/go/+/428757 Run-TryBot: hopehook <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Matthew Dempsky <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Auto-Submit: Matthew Dempsky <[email protected]>

…ice, String} After we deprecated reflect.{SliceHeader, StringHeader}, it is recommended to use unsafe.{Slice, String} to replace its work. However, the compiler and linker cannot be migrated for the time being. As a temporary strategy, using the "internal/unsafeheader" package like other code is the most suitable choice at present. For golang#53003. Change-Id: I69d0ef72e2d95caabd0706bbb247a719d225c758 Reviewed-on: https://go-review.googlesource.com/c/go/+/429755 Auto-Submit: Matthew Dempsky <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Matthew Dempsky <[email protected]> Run-TryBot: hopehook <[email protected]>

…ckage Follow CL 428777. Change-Id: I5ce49322e92c5d6539bb08248e3366187c30dcd8 Reviewed-on: https://go-review.googlesource.com/c/go/+/428780 Run-TryBot: Tobias Klauser <[email protected]> Reviewed-by: Bryan Mills <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Auto-Submit: Tobias Klauser <[email protected]>

First, we know that Go source files almost always weigh at least a few kilobytes, so we can kickstart the output buffer to be a reasonable size and reduce the initial number of incremental allocations and copies when appending bytes or strings to output. Second, in nodeSize we use a nested printer, but we don't actually need its printed bytes - we only need to know how many bytes it prints. For that reason, use a throwaway buffer: the part of our output buffer between length and capacity, as we haven't used it yet. Third, use a sync.Pool to reuse allocated printers. The current API doesn't allow reusing printers, and some programs like gofmt will print many files in sequence. Those changes combined result in a modest reduction in allocations and CPU usage. The benchmark uses testdata/parser.go, which has just over two thousand lines of code, which is pretty standard size-wise. We also split the Print benchmark to cover both a medium-sized ast.File as well as a pretty small ast.Decl node. The latter is a somewhat common scenario in gopls, which has code actions which alter small bits of the AST and print them back out to rewrite only a few lines in a file. name old time/op new time/op delta PrintFile-16 5.43ms ± 1% 4.85ms ± 3% -10.68% (p=0.000 n=9+10) PrintDecl-16 19.1µs ± 0% 18.5µs ± 1% -3.04% (p=0.000 n=10+10) name old speed new speed delta PrintFile-16 9.56MB/s ± 1% 10.69MB/s ± 3% +11.81% (p=0.000 n=8+10) PrintDecl-16 1.67MB/s ± 0% 1.73MB/s ± 1% +3.05% (p=0.000 n=10+10) name old alloc/op new alloc/op delta PrintFile-16 332kB ± 0% 107kB ± 2% -67.87% (p=0.000 n=10+10) PrintDecl-16 3.92kB ± 0% 3.28kB ± 0% -16.38% (p=0.000 n=10+10) name old allocs/op new allocs/op delta PrintFile-16 3.45k ± 0% 2.42k ± 0% -29.90% (p=0.000 n=10+10) PrintDecl-16 56.0 ± 0% 46.0 ± 0% -17.86% (p=0.000 n=10+10) Change-Id: I475a3babca77532b2d51888f49710f74763d81d2 Reviewed-on: https://go-review.googlesource.com/c/go/+/424924 Run-TryBot: Daniel Martí <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Robert Griesemer <[email protected]> Reviewed-by: Alan Donovan <[email protected]>

Change-Id: Ie543e1b1df667cfaf3aafa4be727881461ee8b7d GitHub-Last-Rev: ed993db GitHub-Pull-Request: golang#54888 Reviewed-on: https://go-review.googlesource.com/c/go/+/428716 Auto-Submit: Bryan Mills <[email protected]> Run-TryBot: Bryan Mills <[email protected]> Reviewed-by: Bryan Mills <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>

Change-Id: I9de5aafb36d05bdc90bbdba516367eb2b200a7e0 Reviewed-on: https://go-review.googlesource.com/c/go/+/428777 Auto-Submit: Tobias Klauser <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Run-TryBot: Tobias Klauser <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]>

Implement CLONE_INTO_CGROUP feature, allowing to put a child in a specified cgroup in a clean and simple way. Note that the feature only works for cgroup v2, and requires Linux kernel 5.7 or newer. Using the feature requires a new syscall, clone3. Currently this is the only reason to use clone3, but the code is structured in a way so that other cases may be easily added in the future. Add a test case. While at it, try to simplify the syscall calling code in forkAndExecInChild1, which became complicated over time because: 1. It was using either rawVforkSyscall or RawSyscall6 depending on whether CLONE_NEWUSER was set. 2. On Linux/s390, the first two arguments to clone(2) system call are swapped (which deserved a mention in Linux ABI hall of shame). It was worked around in rawVforkSyscall on s390, but had to be implemented via a switch/case when using RawSyscall6, making the code less clear. Let's - modify rawVforkSyscall to have two arguments (which is also required for clone3); - remove the arguments workaround from s390 asm, instead implementing arguments swap in the caller (which still looks ugly but at least it's done once and is clearly documented now); - use rawVforkSyscall for all cases (since it is essentially similar to RawSyscall6, except for having less parameters, not returning r2, and saving/restoring the return address before/after syscall on 386 and amd64). Updates golang#51246. Change-Id: Ifcd418ebead9257177338ffbcccd0bdecb94474e Reviewed-on: https://go-review.googlesource.com/c/go/+/417695 Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> Run-TryBot: Kirill Kolyshkin <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

Change-Id: Ib9746cb4f27625cb22620271b280d2da242b2fba Reviewed-on: https://go-review.googlesource.com/c/go/+/428437 Run-TryBot: Ian Lance Taylor <[email protected]> Run-TryBot: Andy Pan <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]>

Update CL 425881 and CL 428396 I browsed the source code related to copy_file_range in the kernel and found that the latest kernel may still return EXDEV errors in copy_file_range(2) due to certain cases, for details see: https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1559, https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1479, and https://elixir.bootlin.com/linux/v5.19.7/source/fs/read_write.c#L1439. Therefore, the EXDEV still needs to be kept, but the ENOSYS error can be safely removed. Change-Id: I47026b8dd33f7ffc4de1306af6b67c7b4d2062d2 Reviewed-on: https://go-review.googlesource.com/c/go/+/428555 Run-TryBot: Ian Lance Taylor <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Andy Pan <[email protected]>

Change-Id: I9c4c3ada3a8f5d8d198cc42a4afc06972ee00c61 GitHub-Last-Rev: 4ed8011 GitHub-Pull-Request: golang#54916 Reviewed-on: https://go-review.googlesource.com/c/go/+/428921 Run-TryBot: Ian Lance Taylor <[email protected]> Auto-Submit: Ian Lance Taylor <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

Change-Id: I7ea150e4896fc9b2e3a6dbdd9a1c2b651e74b844 Reviewed-on: https://go-review.googlesource.com/c/go/+/428778 Auto-Submit: Tobias Klauser <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Run-TryBot: Tobias Klauser <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

prattmic · 2022-09-12T16:18:17Z

FYI, this has moved to https://go.dev/cl/429863, and there is a proposal at #55022.

google-cla bot added the cla: no Used by googlebot to label PRs as having an invalid CLA. The text of this label should not change. label Nov 19, 2021

rajbarik changed the base branch from master to release-branch.go1.17 November 19, 2021 18:28

mvdan mentioned this pull request Nov 22, 2021

cmd/compile: feedback-guided optimization #28262

Closed

rajbarik force-pushed the pgo branch from 69f72c8 to 7b1c59c Compare August 24, 2022 18:23

jinlin-bayarea force-pushed the pgo branch from 7ff19b2 to ff1cb2a Compare August 25, 2022 20:40

wdvxdr1123 and others added 2 commits August 31, 2022 09:35

cuiweixie and others added 25 commits September 8, 2022 15:52

rajbarik force-pushed the pgo branch from 9faf6b6 to de641ae Compare September 9, 2022 16:00

Enables PGO in Go and performs profile-guided inlining

03e649f

rajbarik force-pushed the pgo branch from de641ae to 03e649f Compare September 9, 2022 16:22

prattmic closed this Sep 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pgo: profile-guided optimization for golang v1.17 #49688

pgo: profile-guided optimization for golang v1.17 #49688

rajbarik commented Nov 19, 2021

mvdan commented Nov 20, 2021

ianlancetaylor commented Nov 22, 2021

mvdan commented Nov 22, 2021

rajbarik commented Nov 22, 2021

josharian commented Nov 22, 2021

jeremyfaller commented Nov 22, 2021

prattmic commented Dec 2, 2021 •

edited

Loading

lni commented Dec 3, 2021

rajbarik commented Dec 3, 2021

rajbarik commented Dec 3, 2021

lni commented Dec 4, 2021

rajbarik commented Dec 6, 2021

lni commented Dec 8, 2021

zhouguangyuan0718 commented Dec 9, 2021

zhouguangyuan0718 commented Dec 10, 2021

lni commented Dec 11, 2021

rajbarik commented Dec 16, 2021

lni commented Dec 17, 2021

rajbarik commented Dec 20, 2021 •

edited

Loading

mvdan commented Aug 22, 2022

mvdan commented Aug 24, 2022

josharian commented Aug 24, 2022

mvdan commented Aug 24, 2022

prattmic commented Sep 12, 2022

pgo: profile-guided optimization for golang v1.17 #49688

pgo: profile-guided optimization for golang v1.17 #49688

Conversation

rajbarik commented Nov 19, 2021

mvdan commented Nov 20, 2021

ianlancetaylor commented Nov 22, 2021

mvdan commented Nov 22, 2021

rajbarik commented Nov 22, 2021

josharian commented Nov 22, 2021

jeremyfaller commented Nov 22, 2021

prattmic commented Dec 2, 2021 • edited Loading

lni commented Dec 3, 2021

rajbarik commented Dec 3, 2021

rajbarik commented Dec 3, 2021

lni commented Dec 4, 2021

rajbarik commented Dec 6, 2021

lni commented Dec 8, 2021

zhouguangyuan0718 commented Dec 9, 2021

zhouguangyuan0718 commented Dec 10, 2021

lni commented Dec 11, 2021

rajbarik commented Dec 16, 2021

lni commented Dec 17, 2021

rajbarik commented Dec 20, 2021 • edited Loading

mvdan commented Aug 22, 2022

mvdan commented Aug 24, 2022

josharian commented Aug 24, 2022

mvdan commented Aug 24, 2022

prattmic commented Sep 12, 2022

prattmic commented Dec 2, 2021 •

edited

Loading

rajbarik commented Dec 20, 2021 •

edited

Loading