cmd/compile: add basic block counters for PGO #65466

alexanius opened this issue Feb 2, 2024 · 27 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@alexanius
Contributor

Proposal Details

This is a proposal for implementing PGO basic block counters in the Go compiler. Issue #62463 describes profile-based optimizations that would be useful for the Go compiler. Most of them (basic block ordering, loop unrolling, PGO register allocation and others) need counters on the basic blocks. Currently, the Go compiler has only the weighted call graph, which cannot be used for such optimizations directly.

Here I propose to add basic block counters to make it possible to implement profile-guided optimizations.

General approach

The general approach is based on adding counter values to the AST and SSA IR nodes, getting these values from the pprof file and correcting them during the compilation.

Step 1. Load the counter values onto the AST nodes. The counters from the samples can easily be loaded onto the corresponding AST nodes. As we use a sampling profile, not all nodes will have values.

Step 2. Propagate the values to the remaining nodes. Here we traverse the AST and propagate existing values to the nodes that have none. This is needed for the further steps.

Step 3. Correct the values after devirtualization and inlining. The nodes of a callee function contain values summed over all of its calls, so after inlining we should re-evaluate these values according to the counter at the inline point.

Step 4. Assign counters to the basic blocks during the SSA generation.

Step 5. Correct the counters of the basic blocks if any optimization changes the control flow.

Step 6. Implement the optimizations that rely on basic block counters.
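
The correction in Step 3 can be illustrated with a little arithmetic. The following is only a sketch under the assumption that an inlined body is rescaled by the ratio of the call-site counter to the callee's entry counter; the function `scaleInlined` and its parameters are hypothetical and are not taken from the prototype.

```go
package main

import "fmt"

// scaleInlined rescales the counters of a callee body for one inline site.
// calleeBody holds counters summed over all calls of the callee,
// calleeEntry is the callee's entry counter, and callSite is the counter
// of the call being inlined. All names here are hypothetical.
func scaleInlined(calleeBody []int64, calleeEntry, callSite int64) []int64 {
	out := make([]int64, len(calleeBody))
	if calleeEntry == 0 {
		return out // callee never sampled: nothing to distribute
	}
	for i, c := range calleeBody {
		// Multiply before dividing to keep integer precision.
		out[i] = c * callSite / calleeEntry
	}
	return out
}

func main() {
	// The callee was entered 1000 times in total; its body has a hot
	// and a cold block. The call site being inlined ran 250 times.
	body := []int64{1000, 900, 100}
	fmt.Println(scaleInlined(body, 1000, 250)) // prints [250 225 25]
}
```

Linear scaling assumes the callee behaves the same at every call site, which a sampling profile cannot confirm, so the result is an estimate.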

Notes on implementation

  1. Alternative approach. The suggested approach stores and corrects the counters throughout the whole compilation pipeline. This adds a field to the IR nodes and can complicate the implementation of optimizations (at least, inlining needs additional steps). As an alternative, we could load the counters directly onto the SSA basic blocks, based on the position information of the operations. That approach has the following disadvantages: we still need counter correction based on top-down and bottom-up traversal of the control flow graph, plus additional correction based on inline tree information. If any optimization changes the control flow, we still need correction as well. Also, the "dynamic escapes on cold paths" optimization needs counters on the AST nodes. So loading the counters onto the AST nodes is no more complicated (probably even easier) and opens up more opportunities.

  2. One of the non-trivial parts is Step 2: propagating counters over the AST. This algorithm will probably be implemented as a bottom-up and a top-down walk through the tree. The exact algorithm will be designed during implementation.

  3. To make the profile more precise, we need line discriminators. Currently, the debug information in a Go binary contains only per-line information. This matters, for example, when a single "if" statement contains several conditions, but the profile will be useful even without this information. The approach for loading this information is described in issue #59612 (cmd/compile: add intra-line discrimination to PGO profiles).
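
The bottom-up and top-down walk mentioned in note 2 could look roughly like the sketch below. It is deliberately simplistic and ignores branches, loops and early returns; the `Node` type and the two rules (an unsampled parent takes the largest child counter, an unsampled child inherits its parent's counter) are assumptions made for illustration, not the algorithm of the prototype.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for an AST node with a profile counter.
type Node struct {
	Name     string
	Counter  int64
	Known    bool // true if a sample (or an earlier pass) set Counter
	Children []*Node
}

// bottomUp gives an unsampled node the largest counter found among its
// children, visiting children before their parent.
func bottomUp(n *Node) {
	var best int64
	found := false
	for _, c := range n.Children {
		bottomUp(c)
		if c.Known && (!found || c.Counter > best) {
			best, found = c.Counter, true
		}
	}
	if !n.Known && found {
		n.Counter, n.Known = best, true
	}
}

// topDown lets nodes that are still unsampled inherit the counter of
// their parent, visiting a parent before its children.
func topDown(n *Node) {
	for _, c := range n.Children {
		if !c.Known && n.Known {
			c.Counter, c.Known = n.Counter, true
		}
		topDown(c)
	}
}

func main() {
	expr := &Node{Name: "expr"}
	stmt1 := &Node{Name: "stmt1", Counter: 40, Known: true} // the only sampled node
	stmt2 := &Node{Name: "stmt2", Children: []*Node{expr}}
	fn := &Node{Name: "func", Children: []*Node{stmt1, stmt2}}

	bottomUp(fn)
	topDown(fn)
	for _, n := range []*Node{fn, stmt1, stmt2, expr} {
		fmt.Printf("%s=%d\n", n.Name, n.Counter)
	}
}
```

In this example every node ends up with the counter 40, because a single sample on one statement is spread first to the function and then down to its unsampled siblings.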

Implementation plan

I made a prototype that loads counters into the AST IR nodes, and I am going to pass them on to the SSA basic blocks. After that, I will implement Steps 2 and 3. Then I'm going to add discriminators and implement the rest. Finally, I'm going to implement some of the optimizations, such as local basic block ordering.

I would like to get feedback from the community and understand if the community finds this useful.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 2, 2024
@prattmic
Member

prattmic commented Feb 2, 2024

We certainly would like a way to precisely identify basic blocks in profiles so we can use that information for basic block level optimizations. That is what #59612 is intended to cover.

That issue doesn't cover precisely how the discriminator values are determined or how they get matched back to IR nodes and/or SSA blocks/values during the next build. It seems like this is something your design above tries to define, which is great.

That said, I don't quite follow exactly how you are defining them, or how propagation works as IR is mutated and eventually becomes SSA. Are you assigning discriminator values to high-level IR nodes (like ir.ForStmt) and then propagating through mutations and SSA?

I suspect this kind of propagation will get very complicated. I'd be tempted, at least for an initial version, to keep everything in SSA. Discriminators are based on basic block numbers, and optimizations prior to SSA simply cannot use them. Even in this case, I think tracking correlation to the profile through the different layers of SSA passes will be difficult.

I happen to be working on a prototype to assign discriminators to each PC, and plumb that into the binary metadata and the pprof profile. Once I have that done, you may be able to use this prototype to play with the PGO side of matching the samples-with-discriminators back to the build and applying basic block optimizations.

cc @cherrymui @aclements @jinlin-bayarea

@alexanius
Contributor Author

@prattmic thank you for your answer.

Yes, you are right that discriminators are based on basic blocks. My proposal needs the column of the sampled instruction, so the DWARF column information may be more suitable for it. Thanks for highlighting that. To answer the question about discriminator assignment: I do not do that for now (I do not need the discriminators themselves, only the column number).

Although the column information is useful, we can load the counters onto the AST nodes without it. The profile may be less precise, but we can still implement profile-based optimizations. My motivation for loading the counters at the AST level is the opportunity to implement AST-level optimizations other than inlining, so I think we should start by loading the profile counters at the AST level.

@gopherbot
Contributor

Change https://go.dev/cl/560781 mentions this issue: PROTOTYPE: cmd/compile,runtime: add discriminator, plumb to pprof

@prattmic
Member

prattmic commented Feb 2, 2024

https://go.dev/cl/560781 plumbs a discriminator value from the compiler to the pprof Line.Column field (even though the value is not actually the column number).

Feel free to use this if you'd like to play with using discriminators for PGO. If you'd like just the column number itself, see my comment at https://go-review.googlesource.com/c/go/+/560781/2/src/cmd/internal/obj/pcln.go.

@ianlancetaylor
Contributor

I don't see why this has to be a proposal; taking it out of the proposal process.

@mknyszek mknyszek added this to the Backlog milestone Feb 7, 2024
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 7, 2024
@gopherbot
Contributor

Change https://go.dev/cl/564055 mentions this issue: [WIP] DO NOT REVIEW cmd/compile: add basic block counters

@alexanius
Contributor Author

I implemented a proof-of-concept prototype for basic block counters. Currently done:

  • Loading counters to AST nodes by line number information
  • The AST counter propagation algorithm
  • Moving counters to SSA nodes from AST nodes
  • Adding profile usage in basic block layout pass (just proof-of-concept)
  • Print counters to the html dumps

The patch shows how we can load basic block counters from the pprof file, how we can propagate them, and how to use them. The patch was tested on the go1 benchmark set: depending on the bb-layout pass, the fannkuch test ran 3.5% to 12.2% faster on an Intel Core machine (a Xeon machine does not show this improvement).

To use basic block counters, add the -bbpgo option: go test -a -bbpgo -pgo=go1.prof

This patch just shows the main idea of basic block counters loading. Before we can use it in the compiler, the following steps should be done:

  • Improve the counter propagation on the AST. Currently it does not take into account returns from the middle of a function. Switches are also not evaluated properly.
  • Implement the counter corrections after inlining
  • Improve the moving of counters to SSA. At least we should handle PanicBounds generation properly.
  • Improve the basic block layout and register allocation passes with profile information
  • Fix formatting when generating HTML dumps

This patch itself is not for review, but I would like to get some feedback on the general approach: the propagation algorithm and the Node structure modifications themselves. Please take a look at irgraph.go and the ssa part.

cc @cherrymui @aclements @jinlin-bayarea @prattmic

@alexanius
Contributor Author

Status update:

  • Improved profile loading to AST nodes. The propagation algorithm now takes into account possible returns in the middle of a function; the counter correction for branches is improved; some other fixes make counter loading and propagation more precise.
  • Improved profile correction for AST and SSA node transformations.
  • Fixed the formatting of the HTML dumps. They look pretty good now.
  • Added an IR profile check, but it is not yet usable in the general case.

While working on this feature, I came across some problems:

  • Edge counters. I tried to add counters to the edges, as that should simplify PGO-based algorithms, but adding them breaks the compiler. Even with no reads of the edge counters (just adding a field and setting any value), the compiler breaks. I think there is a place where two edge objects are compared by value, and different counters in the edges change the result of that comparison. Unfortunately, I did not find this place.
  • ONAME and OLITERAL nodes. For each name and each literal we generate only one node and reuse it wherever it appears. This leads to an incorrect profile. For example:
if cond {
 // Hot path
 use A // ONAME node with big counter
} else {
 // Cold path
 use A // same ONAME node with the same big counter. But here it should be small
}

Currently I set zero counters on the ONAME and OLITERAL nodes, but I hope a better solution can be found.

  • IR dumps subsystem. Currently the HTML dumps are written at the ssagen stage, but the AST is transformed several times between AST creation and SSA generation. It would be useful to dump the AST IR after those passes as well, but that seems too complicated for now.
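
The ONAME and OLITERAL problem described above can be reproduced in miniature. The sketch below assumes that every use of a variable points at the same node object; the types `Name` and `Use` are hypothetical, and the per-use record is only one conceivable workaround, not what the prototype does (the prototype sets these counters to zero).

```go
package main

import "fmt"

// Name is a hypothetical stand-in for a shared ONAME node: every use of
// the variable points at the same object.
type Name struct {
	Ident   string
	Counter int64 // a counter stored here is shared by all uses
}

// Use is a hypothetical per-occurrence record that refers to the shared
// node but carries its own counter.
type Use struct {
	Node    *Name
	Counter int64
}

// sharedNodeCounters stores the counter on the shared node itself, so the
// write for the cold path overwrites the value of the hot path.
func sharedNodeCounters() (hot, cold int64) {
	a := &Name{Ident: "A"}
	hotUse, coldUse := a, a // both branches see the same node
	hotUse.Counter = 1000
	coldUse.Counter = 3
	return hotUse.Counter, coldUse.Counter
}

// perUseCounters keeps one counter per occurrence instead.
func perUseCounters() (hot, cold int64) {
	a := &Name{Ident: "A"}
	hotUse := Use{Node: a, Counter: 1000}
	coldUse := Use{Node: a, Counter: 3}
	return hotUse.Counter, coldUse.Counter
}

func main() {
	fmt.Println(sharedNodeCounters()) // prints 3 3
	fmt.Println(perUseCounters())     // prints 1000 3
}
```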

Further plans:

  • Continue improving propagation algorithm and counter correction for best precision
  • Refactor the code
  • Add tests for pgo
  • Improve optimization passes to use basic block profiling

alexanius added a commit to alexanius/go that referenced this issue Feb 29, 2024
Current patch adds the counters to the AST and SSA nodes. The counters
are loaded from the pprof file, no profile format changes needed. Currently
implemented:

 + Loading counters to AST nodes by line number information
 + The AST counter propagation algorithm
 + Moving counters to SSA nodes from AST nodes
 + Adding profile usage in basic block layout pass (just
   proof-of-concept)
 + Print counters to the html dumps

The patch shows how we can load basic block counters from the pprof
file, how we can propagate them, and how to use them. The patch was
tested on the go1 benchmark set, and it showed that depending on
bb-layout pass the test fannkuch can be faster from 3.5% to 12.2%
on the intel core machine (the xeon machine does not show this improvement).

To use basic block counters you should add an option -bbpgo:

go test -a -bbpgo -pgo=go1.prof

This patch just shows the main idea of basic block counters loading.
Before we can use it in the compiler, the following steps should be done:

 * Improve the counter propagation on the AST and SSA. Currently there
   are some propagations, but they still may be improved
 * Improve basic block layout and register allocation passes with
   profile information

Fixes golang#65466

Change-Id: I5d6be7d87f384625259a9ba794744a652060de4e
alexanius added a commit to alexanius/go that referenced this issue Mar 6, 2024
alexanius added a commit to alexanius/go that referenced this issue Mar 7, 2024
@alexanius
Contributor Author

Status update:

  • Improved the propagation algorithm. Counters are now set in the places where AST nodes are created during transformations; propagation is now launched a second time before ssagen (as a lot of new nodes have been created by then); other algorithm changes
  • Huge code refactoring and cleanup
  • Fixes for dumps of nodes with counters (for HTML and other dumps alike)
  • Added tests for counters
  • Added selection of the likely branch based on counter values
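
Choosing the likely branch from counters, as in the last item above, could be sketched as follows. The `Prediction` type and the `predict` function are hypothetical and defined locally; the rule used here (the successor with the larger counter wins, equal or missing counters give no hint) is an assumption, not necessarily the rule of the patch.

```go
package main

import "fmt"

// Prediction is a hypothetical branch hint for a two-way block.
type Prediction int

const (
	Unlikely Prediction = -1 // the second successor is expected to run
	Unknown  Prediction = 0  // no usable profile information
	Likely   Prediction = 1  // the first successor is expected to run
)

// predict compares the counters of the two successors of a conditional
// block and returns a hint for the first successor.
func predict(first, second int64) Prediction {
	switch {
	case first > second:
		return Likely
	case first < second:
		return Unlikely
	default:
		return Unknown // equal counters, including both zero
	}
}

func main() {
	fmt.Println(predict(950, 50), predict(3, 700), predict(0, 0)) // prints 1 -1 0
}
```

A real pass would probably also want a threshold, so that a small difference between two sampled counters does not flip the hint.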

There is still a lot to do, but the basic block counters subsystem can already be used by pass developers. I am asking for a review of this patch. After that I will be able to continue improving the basic block counters subsystem.

alexanius added a commit to alexanius/go that referenced this issue Mar 12, 2024
Current patch adds the counters to the AST and SSA nodes. The counters
are loaded from the pprof file, no profile format changes needed.

To use basic block counters you should add an option -bbpgo:

go test -a -bbpgo -pgo=go1.prof

Fixes golang#65466

Change-Id: I5d6be7d87f384625259a9ba794744a652060de4e
@alexanius
Contributor Author

Alternative approach by @jinlin-bayarea: https://go.dev/cl/571535

alexanius added a commit to alexanius/go that referenced this issue Mar 18, 2024
alexanius added a commit to alexanius/go that referenced this issue Mar 27, 2024
@alexanius
Contributor Author

I would like to discuss the current status of the basic block counters implementation. Issue #62463 gives a list of wanted PGO optimizations. Some of them belong to the AST level, and some of them to the SSA level. We now have two prototypes of basic block counters:

  1. Loading counters on the AST (https://go.dev/cl/564055)
  2. Loading counters on the SSA (https://go.dev/cl/571535)

Both approaches have advantages and disadvantages, and I would like to understand if I should continue improving the first one.

The advantages of loading counters on AST:

  • It allows profile-guided optimizations at the AST level
  • The counter propagation is simpler (as the tree is acyclic)
  • We can use counters in the SSA phase from the beginning

The problems of loading counters on AST:

  • We need corrections during optimizations (this can be simplified in the current implementation)
  • Converting counters to basic blocks in the ssagen pass can sometimes be tricky
  • We may need to add corrections to other passes that change the CFG

The advantages of loading counters on SSA:

  • More precise counters on basic blocks (as they are close to the blocks in the executable)
  • Simpler integration, as fewer passes need to be modified and fewer corrections need to be made

The problems of loading counters on SSA:

  • The counters have to be loaded at the end of the optimization pipeline, so fewer ideas from the "PGO umbrella" can be implemented

In general, I believe that loading the counters onto the AST is useful, as it gives us many opportunities for profile-guided optimizations in all parts of the compiler. I think we can even use both approaches at the same time, as they do not conflict: less precise counters for early optimizations and more precise counters for late ones.

Please share your thoughts on this idea; it is important to understand the community's point of view.

CC @jinlin-bayarea @cherrymui @aclements @prattmic

alexanius added a commit to alexanius/go that referenced this issue Apr 10, 2024
Current patch adds the counters to the AST and SSA nodes. The counters
are loaded from the pprof file, no profile format changes needed.

To use basic block counters you should add an option -bbpgo:

go build -a -bbpgo -pgo=file.prof

Fixes golang#65466

Change-Id: I5d6be7d87f384625259a9ba794744a652060de4e
alexanius added a commit to alexanius/go that referenced this issue Apr 11, 2024
alexanius added a commit to alexanius/go that referenced this issue Apr 18, 2024
alexanius added a commit to alexanius/go that referenced this issue Jun 6, 2024
Current patch adds the counters to the AST and SSA nodes. The counters
are loaded from the pprof file, no profile format changes needed.

To use basic block counters you should add an option -pgobb:

go build -a -pgobb -pgo=file.prof

Fixes golang#65466

Change-Id: I5d6be7d87f384625259a9ba794744a652060de4e
alexanius added a commit to alexanius/go that referenced this issue Jul 10, 2024
alexanius added a commit to alexanius/go that referenced this issue Jul 31, 2024
@gopherbot
Contributor

Change https://go.dev/cl/602015 mentions this issue: cmd/compile: add ast nodes and basic block counters

@alexanius
Contributor Author

I have uploaded a new version of basic block PGO. It now uses the preprofile subsystem, and tests for arm64 were added. The general idea is unchanged; the patch is ready for review and use.

alexanius added a commit to alexanius/go that referenced this issue Aug 13, 2024
@gopherbot
Contributor

Change https://go.dev/cl/605555 mentions this issue: cmd/compile: basic block layout algorithms based on pgobb

@alexanius
Contributor Author

Current progress of pgobb.

What it can do:

  • The basic implementation loads counters onto the AST, moves them from the AST to SSA, and corrects the likely-branch information
  • The WIP implementation uses two other bb-layout algorithms:
    • The greedy (Pettis-Hansen) algorithm from here. It works with all the main benchmarks. The results are analyzed below
    • The ext-tsp algorithm from here. It cannot yet replace the standard layout algorithm, as it causes the regalloc and schedule passes to fail. A good implementation also needs edge counters, which are not implemented yet. So there are no evaluations for this algorithm yet.
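
For readers unfamiliar with greedy layout, the sketch below shows chain merging in the spirit of the Pettis-Hansen algorithm: blocks start as single-block chains, and chains are joined along the heaviest edges first so that hot edges become fallthroughs. It is not the code of the CL. The `Edge` type and `greedyLayout` function are hypothetical, and the edge weights are simply given as input, whereas the implementation discussed here has no edge counters yet.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is a hypothetical weighted control-flow edge between block IDs.
type Edge struct {
	From, To int
	Weight   int64
}

// greedyLayout orders blocks 0..n-1 (block 0 is the entry) by merging
// chains along the heaviest edges first.
func greedyLayout(n int, edges []Edge) []int {
	chainOf := make([]int, n)  // block -> index of the chain holding it
	chains := make([][]int, n) // chain index -> blocks in layout order
	for b := 0; b < n; b++ {
		chainOf[b] = b
		chains[b] = []int{b}
	}

	// Visit edges from heaviest to lightest without modifying the input.
	sorted := append([]Edge(nil), edges...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Weight > sorted[j].Weight
	})

	for _, e := range sorted {
		src, dst := chainOf[e.From], chainOf[e.To]
		if src == dst {
			continue // both ends already share a chain
		}
		s, d := chains[src], chains[dst]
		// The edge can become a fallthrough only if From ends its chain
		// and To starts its chain. The entry block must stay first.
		if s[len(s)-1] != e.From || d[0] != e.To || e.To == 0 {
			continue
		}
		for _, b := range d {
			chainOf[b] = src
		}
		chains[src] = append(s, d...)
		chains[dst] = nil
	}

	// Emit the entry chain first, then the remaining chains in index order.
	order := append([]int(nil), chains[chainOf[0]]...)
	for i, c := range chains {
		if i != chainOf[0] {
			order = append(order, c...)
		}
	}
	return order
}

func main() {
	// A diamond: 0 branches to 1 (hot) or 2 (cold), both rejoin at 3.
	edges := []Edge{{0, 1, 90}, {0, 2, 10}, {1, 3, 90}, {2, 3, 10}}
	fmt.Println(greedyLayout(4, edges)) // prints [0 1 3 2]
}
```

In the diamond example the hot path 0, 1, 3 is laid out contiguously and the cold block 2 is moved to the end.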

Evaluation results.

Testing modes:

  • base - the baseline build: go build -a -pgo=profile.pprof
  • pgobb - basic implementation of pgobb: go build -a -pgo=profile.pprof -pgobb
  • pgogreed - basic implementation of greedy (Pettis-Hansen) algorithm: go build -a -pgo=profile.pprof -gcflags=all="-pgogreed"
  • pgobbgreed - implementation of greedy (Pettis-Hansen) algorithm with pgobb: go build -a -pgo=profile.pprof -gcflags=all="-pgogreed"
  • pgobbbbgreed - implementation of greedy (Pettis-Hansen) algorithm with pgobb with usage of bbcounters for the bb-layout weights: go build -a -pgo=profile.pprof -gcflags=all="-pgogreed -pgobbgreed"

Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz:

        |  pgobb | pgogreed | pgobbgreed | pgobbbbgreed |
      sec/op, Geomean
go1     | +0.38% |   +1.33% |     +0.30% |       +1.88% |
bent    | -0.09% |   +1.71% |     +0.43% |       +0.43% |
      req/s, Geomean
etcd    | -0.10% |   -2.02% |     -0.52% |       -0.95% |

ARMv8 Kunpeng920:

        |  pgobb | pgogreed | pgobbgreed | pgobbbbgreed | 
      sec/op, Geomean
go1     | -0.04% |   +1.43% |     +1.67% |       +2.12% |
bent    | +0.47% |   +2.16% |     +0.83% |       +0.80% |
      req/s, Geomean
etcd    | +0.30% |   -1.10% |     -0.81% |       -1.64% |
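
The geomean rows in the summaries above condense many per-benchmark results into one number. As a rough illustration of how such a figure can be derived, the sketch below computes the geometric mean of candidate/base ratios and reports it as a percent change. The function `geomeanDelta` and the sample values are invented for illustration and are not taken from the measurements in this thread.

```go
package main

import (
	"fmt"
	"math"
)

// geomeanDelta returns the geometric mean of the ratios cand[i]/base[i],
// expressed as a percent change. It returns 0 for empty or mismatched input.
func geomeanDelta(base, cand []float64) float64 {
	if len(base) == 0 || len(base) != len(cand) {
		return 0
	}
	sumLog := 0.0
	for i := range base {
		sumLog += math.Log(cand[i] / base[i])
	}
	// exp(mean(log(ratio))) is the geometric mean of the ratios.
	return (math.Exp(sumLog/float64(len(base))) - 1) * 100
}

func main() {
	// Invented numbers: one benchmark gets 10% slower, one 10% faster.
	base := []float64{100, 100}
	cand := []float64{110, 90}
	fmt.Printf("%+.2f%%\n", geomeanDelta(base, cand)) // prints -0.50%
}
```

Note that a +10% and a -10% result do not cancel exactly: the geometric mean of 1.1 and 0.9 is slightly below 1.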

The geomean results show that pgobb is currently slightly better or slightly worse than standard PGO, depending on the benchmark set, while the combinations of pgobb and greedy layout show larger dispersion and in general give worse results (a positive sec/op delta or a negative req/s delta is a slowdown). The results for all the tests are listed below:

ARMv8 Kunpeng920:

                        │ pgo.out.log │           pgobb.out.log            │           pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                        │   sec/op    │   sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │    sec/op     vs base                │
BinaryTree17-4             4.150 ± 0%    4.194 ± 0%  +1.05% (p=0.000 n=10)     4.244 ± 0%   +2.26% (p=0.000 n=10)     4.255 ± 0%   +2.52% (p=0.000 n=10)     4.194 ± 0%   +1.05% (p=0.000 n=10)
Fannkuch11-4               2.820 ± 0%    2.861 ± 0%  +1.48% (p=0.000 n=10)     2.835 ± 0%   +0.53% (p=0.000 n=10)     2.861 ± 3%   +1.47% (p=0.000 n=10)     2.850 ± 0%   +1.09% (p=0.000 n=10)
FmtFprintfEmpty-4         29.77n ± 0%   29.49n ± 0%  -0.92% (p=0.000 n=10)    29.78n ± 0%        ~ (p=0.724 n=10)    29.73n ± 0%        ~ (p=0.114 n=10)    29.58n ± 1%   -0.62% (p=0.001 n=10)
FmtFprintfString-4        56.27n ± 0%   55.99n ± 0%  -0.50% (p=0.001 n=10)    56.06n ± 1%   -0.38% (p=0.011 n=10)    56.05n ± 1%        ~ (p=0.109 n=10)    56.05n ± 1%        ~ (p=0.137 n=10)
FmtFprintfInt-4           96.67n ± 0%   96.54n ± 0%  -0.14% (p=0.005 n=10)   102.00n ± 0%   +5.51% (p=0.000 n=10)   101.25n ± 0%   +4.74% (p=0.000 n=10)   105.10n ± 1%   +8.72% (p=0.000 n=10)
FmtFprintfIntInt-4        117.8n ± 0%   115.8n ± 0%  -1.70% (p=0.000 n=10)    118.1n ± 0%   +0.30% (p=0.000 n=10)    118.7n ± 0%   +0.81% (p=0.000 n=10)    122.9n ± 0%   +4.37% (p=0.000 n=10)
FmtFprintfPrefixedInt-4   151.7n ± 0%   163.9n ± 1%  +8.11% (p=0.000 n=10)    161.6n ± 0%   +6.53% (p=0.000 n=10)    161.3n ± 0%   +6.40% (p=0.000 n=10)    157.0n ± 0%   +3.53% (p=0.000 n=10)
FmtFprintfFloat-4         226.0n ± 0%   230.3n ± 0%  +1.92% (p=0.000 n=10)    240.2n ± 0%   +6.31% (p=0.000 n=10)    239.5n ± 0%   +6.00% (p=0.000 n=10)    243.7n ± 0%   +7.83% (p=0.000 n=10)
FmtManyArgs-4             534.2n ± 0%   537.2n ± 0%  +0.57% (p=0.000 n=10)    534.5n ± 0%        ~ (p=0.134 n=10)    540.1n ± 0%   +1.11% (p=0.000 n=10)    573.5n ± 0%   +7.37% (p=0.000 n=10)
GobDecode-4               7.400m ± 0%   7.460m ± 1%  +0.81% (p=0.000 n=10)    7.463m ± 0%   +0.85% (p=0.000 n=10)    7.462m ± 0%   +0.84% (p=0.000 n=10)    7.496m ± 1%   +1.30% (p=0.000 n=10)
GobEncode-4               5.697m ± 1%   5.688m ± 0%       ~ (p=0.912 n=10)    5.711m ± 1%        ~ (p=0.315 n=10)    5.743m ± 2%        ~ (p=0.218 n=10)    5.753m ± 2%   +0.98% (p=0.043 n=10)
Gzip-4                    273.3m ± 0%   272.3m ± 0%  -0.37% (p=0.003 n=10)    272.7m ± 0%        ~ (p=0.165 n=10)    272.9m ± 0%        ~ (p=0.218 n=10)    273.2m ± 0%        ~ (p=1.000 n=10)
Gunzip-4                  30.65m ± 0%   30.81m ± 0%  +0.52% (p=0.000 n=10)    30.11m ± 0%   -1.77% (p=0.000 n=10)    30.30m ± 0%   -1.15% (p=0.000 n=10)    29.98m ± 0%   -2.21% (p=0.000 n=10)
HTTPClientServer-4        65.18µ ± 1%   65.91µ ± 1%  +1.11% (p=0.000 n=10)    67.54µ ± 1%   +3.61% (p=0.000 n=10)    64.87µ ± 0%   -0.48% (p=0.015 n=10)    67.41µ ± 0%   +3.42% (p=0.000 n=10)
JSONEncode-4              9.085m ± 0%   9.130m ± 0%  +0.51% (p=0.000 n=10)    8.952m ± 0%   -1.46% (p=0.000 n=10)    8.994m ± 0%   -1.00% (p=0.000 n=10)    9.180m ± 1%   +1.05% (p=0.000 n=10)
JSONDecode-4              48.49m ± 0%   47.40m ± 0%  -2.26% (p=0.000 n=10)    49.13m ± 0%   +1.31% (p=0.000 n=10)    49.45m ± 0%   +1.97% (p=0.000 n=10)    49.47m ± 0%   +2.02% (p=0.000 n=10)
Mandelbrot200-4           5.789m ± 0%   5.766m ± 0%  -0.40% (p=0.000 n=10)    5.770m ± 0%   -0.32% (p=0.000 n=10)    5.783m ± 0%   -0.11% (p=0.000 n=10)    5.783m ± 0%   -0.10% (p=0.000 n=10)
GoParse-4                 4.352m ± 0%   4.389m ± 0%  +0.86% (p=0.000 n=10)    4.473m ± 0%   +2.79% (p=0.000 n=10)    4.514m ± 0%   +3.73% (p=0.000 n=10)    4.474m ± 0%   +2.81% (p=0.000 n=10)
RegexpMatchEasy0_32-4     62.97n ± 0%   62.93n ± 1%       ~ (p=0.196 n=10)    62.87n ± 0%        ~ (p=0.209 n=10)    63.10n ± 0%        ~ (p=0.492 n=10)    62.77n ± 1%        ~ (p=0.072 n=10)
RegexpMatchEasy0_1K-4     282.4n ± 0%   287.6n ± 0%  +1.82% (p=0.000 n=10)    281.3n ± 1%        ~ (p=0.725 n=10)    284.4n ± 0%   +0.73% (p=0.000 n=10)    284.6n ± 0%   +0.78% (p=0.000 n=10)
RegexpMatchEasy1_32-4     57.39n ± 0%   56.38n ± 1%  -1.77% (p=0.000 n=10)    56.38n ± 0%   -1.76% (p=0.000 n=10)    56.78n ± 0%   -1.05% (p=0.000 n=10)    56.41n ± 0%   -1.71% (p=0.000 n=10)
RegexpMatchEasy1_1K-4     351.5n ± 0%   347.8n ± 1%  -1.07% (p=0.000 n=10)    346.4n ± 0%   -1.45% (p=0.000 n=10)    352.3n ± 0%   +0.24% (p=0.002 n=10)    350.3n ± 0%   -0.34% (p=0.001 n=10)
RegexpMatchMedium_32-4    879.9n ± 0%   880.9n ± 0%       ~ (p=0.699 n=10)    855.1n ± 0%   -2.82% (p=0.000 n=10)    874.1n ± 1%   -0.66% (p=0.001 n=10)    876.1n ± 0%   -0.43% (p=0.002 n=10)
RegexpMatchMedium_1K-4    26.76µ ± 0%   26.50µ ± 0%  -0.94% (p=0.000 n=10)    26.44µ ± 0%   -1.17% (p=0.000 n=10)    26.99µ ± 0%   +0.86% (p=0.000 n=10)    27.01µ ± 0%   +0.95% (p=0.000 n=10)
RegexpMatchHard_32-4      1.425µ ± 0%   1.394µ ± 1%  -2.21% (p=0.000 n=10)    1.441µ ± 0%   +1.12% (p=0.000 n=10)    1.472µ ± 0%   +3.30% (p=0.000 n=10)    1.420µ ± 0%   -0.35% (p=0.000 n=10)
RegexpMatchHard_1K-4      42.34µ ± 0%   41.27µ ± 1%  -2.53% (p=0.000 n=10)    43.08µ ± 0%   +1.75% (p=0.000 n=10)    43.37µ ± 0%   +2.44% (p=0.000 n=10)    41.92µ ± 0%   -1.00% (p=0.000 n=10)
Revcomp-4                 798.2m ± 0%   796.8m ± 1%       ~ (p=0.436 n=10)    822.6m ± 0%   +3.05% (p=0.000 n=10)    805.8m ± 0%   +0.95% (p=0.000 n=10)    808.6m ± 0%   +1.30% (p=0.000 n=10)
Template-4                62.67m ± 3%   63.44m ± 1%  +1.22% (p=0.009 n=10)    64.47m ± 1%   +2.87% (p=0.001 n=10)    63.12m ± 1%   +0.71% (p=0.011 n=10)    65.19m ± 2%   +4.01% (p=0.000 n=10)
TimeParse-4               261.2n ± 0%   253.5n ± 0%  -2.97% (p=0.000 n=10)    303.8n ± 0%  +16.27% (p=0.000 n=10)    302.1n ± 0%  +15.62% (p=0.000 n=10)    306.5n ± 0%  +17.32% (p=0.000 n=10)
TimeFormat-4              250.1n ± 0%   243.6n ± 0%  -2.56% (p=0.000 n=10)    252.8n ± 0%   +1.08% (p=0.000 n=10)    253.4n ± 0%   +1.34% (p=0.000 n=10)    258.0n ± 0%   +3.18% (p=0.000 n=10)
geomean                   51.05µ        51.03µ       -0.04%                   51.78µ        +1.43%                   51.91µ        +1.67%                   52.13µ        +2.12%

                                                        │    pgo.llog    │               pgobb.llog               │             pgogreed.llog             │             pgobbgreed.llog             │           pgobbbbgreed.llog           │
                                                        │     sec/op     │     sec/op      vs base                │    sec/op      vs base                │    sec/op      vs base                  │    sec/op      vs base                │
AppendMsgReplicateDecision-4                              0.7701n ±   0%   0.7701n ±   0%        ~ (p=0.509 n=10)   0.7701n ±  0%        ~ (p=0.875 n=10)                                             0.7701n ±  0%        ~ (p=0.973 n=10)
AppendMsgResyncTargetsInfo-4                               6.139n ±   0%    6.163n ±   0%   +0.39% (p=0.000 n=10)    6.162n ±  0%   +0.37% (p=0.000 n=10)                                              6.065n ±  0%   -1.21% (p=0.000 n=10)
Arc-4                                                      4.927µ ±   0%    4.883µ ±   0%   -0.88% (p=0.000 n=10)    4.973µ ±  0%   +0.93% (p=0.000 n=10)    5.000µ ±  0%   +1.48% (p=0.000 n=10)      4.981µ ±  0%   +1.10% (p=0.000 n=10)
BaseTest2KB-4                                              992.7n ±   0%    992.3n ±   0%   -0.04% (p=0.000 n=10)    992.4n ±  0%   -0.03% (p=0.000 n=10)   1584.0n ±  0%  +59.57% (p=0.000 n=10)     1584.0n ±  0%  +59.57% (p=0.000 n=10)
Bindata-4                                                  255.5m ±   2%    255.8m ±   0%        ~ (p=0.631 n=10)    256.3m ±  1%        ~ (p=0.315 n=10)    252.7m ±  1%        ~ (p=0.393 n=10)      256.1m ±  1%        ~ (p=0.436 n=10)
Broker_No_Metrics_Open-4                                   110.2µ ±   1%    110.6µ ±   1%        ~ (p=0.529 n=10)    112.8µ ±  1%   +2.30% (p=0.000 n=10)    112.4µ ±  1%   +1.95% (p=0.000 n=10)      113.0µ ±  1%   +2.53% (p=0.000 n=10)
Broker_Open-4                                              157.0µ ±   3%    158.7µ ±   1%        ~ (p=0.579 n=10)    162.9µ ±  1%   +3.74% (p=0.002 n=10)    156.9µ ±  2%        ~ (p=0.796 n=10)      157.0µ ±  2%        ~ (p=1.000 n=10)
BufferedWriteSyncer/write_file_with_buffer-4               210.6n ±   9%    195.8n ±   8%   -7.07% (p=0.009 n=10)    204.0n ±  4%        ~ (p=0.393 n=10)    193.5n ± 10%        ~ (p=0.218 n=10)      199.2n ± 16%   -5.41% (p=0.004 n=10)
Build-4                                                    400.6m ± 151%    423.4m ± 138%        ~ (p=0.353 n=10)    721.0m ± 41%        ~ (p=0.075 n=10)    394.7m ±  8%        ~ (p=0.853 n=10)      397.5m ±  1%        ~ (p=1.000 n=10)
BuildJSON-4                                                5.654µ ±   1%    5.676µ ±   1%   +0.39% (p=0.043 n=10)    5.788µ ±  2%   +2.37% (p=0.000 n=10)    5.621µ ±  1%        ~ (p=0.052 n=10)      5.819µ ±  0%   +2.92% (p=0.000 n=10)
BWTS-4                                                    0.8223n ±   3%   0.8252n ±   1%        ~ (p=0.109 n=10)   0.8246n ±  5%        ~ (p=0.447 n=10)   0.7661n ±  7%   -6.83% (p=0.012 n=10)     0.8204n ±  1%        ~ (p=0.617 n=10)
CombinedTestAndAdd-4                                       463.4n ±  20%    530.1n ±   3%        ~ (p=0.529 n=10)    543.2n ±  4%        ~ (p=0.165 n=10)    379.1n ± 24%        ~ (p=0.342 n=10)      390.5n ±  4%        ~ (p=0.631 n=10)
CommitAfterHashFixedSize/10K-4                             26.17m ±   2%    26.52m ±   1%        ~ (p=0.063 n=10)    26.32m ±  6%        ~ (p=0.971 n=10)    25.96m ±  4%        ~ (p=0.631 n=10)      26.62m ±  2%        ~ (p=0.052 n=10)
DasumMediumUnitaryInc-4                                    1.529µ ±   0%    1.530µ ±   0%        ~ (p=0.179 n=10)    1.531µ ±  0%   +0.13% (p=0.000 n=10)    1.531µ ±  0%   +0.13% (p=0.000 n=10)      1.529µ ±  0%        ~ (p=1.000 n=10)
DataUpdateTracker-4                                        2.726µ ±   1%    2.772µ ±   1%   +1.69% (p=0.001 n=10)    2.879µ ±  1%   +5.59% (p=0.000 n=10)                                              2.839µ ±  1%   +4.13% (p=0.000 n=10)
DecodehealingTracker-4                                     864.6n ±   1%    966.1n ±   1%  +11.74% (p=0.000 n=10)    879.9n ±  1%   +1.77% (p=0.000 n=10)                                              855.5n ±  1%   -1.05% (p=0.000 n=10)
Dgeev/Circulant100-4                                       19.01m ±   1%    18.79m ±   0%   -1.17% (p=0.000 n=10)    18.97m ±  0%        ~ (p=0.739 n=10)    18.76m ±  0%   -1.31% (p=0.000 n=10)      18.77m ±  1%   -1.29% (p=0.000 n=10)
Dgeev/Circulant10-4                                        52.45µ ±   0%    52.57µ ±   0%   +0.24% (p=0.001 n=10)    53.04µ ±  0%   +1.14% (p=0.000 n=10)    53.03µ ±  0%   +1.12% (p=0.000 n=10)      53.12µ ±  0%   +1.29% (p=0.000 n=10)
DirectSend-4                                               552.9n ±   2%    556.2n ±   1%        ~ (p=0.060 n=10)    555.0n ±  2%        ~ (p=0.896 n=10)    546.4n ±  1%   -1.17% (p=0.011 n=10)      547.7n ±  2%        ~ (p=0.210 n=10)
Dnrm2MediumPosInc-4                                        7.070µ ±   0%    7.071µ ±   0%        ~ (p=0.587 n=10)    7.071µ ±  0%        ~ (p=0.752 n=10)    7.070µ ±  0%        ~ (p=0.893 n=10)      7.072µ ±  0%        ~ (p=0.180 n=10)
Encoding4KBVerySparse-4                                    20.29µ ±   1%    20.33µ ±   1%        ~ (p=0.315 n=10)    18.60µ ±  1%   -8.30% (p=0.000 n=10)    20.23µ ±  1%        ~ (p=0.436 n=10)      20.33µ ±  1%        ~ (p=0.529 n=10)
FastTest2KB-4                                              992.4n ±   0%    992.3n ±   0%        ~ (p=0.496 n=10)    992.5n ±  0%   +0.01% (p=0.006 n=10)   1584.0n ±  0%  +59.61% (p=0.000 n=10)      992.3n ±  0%        ~ (p=0.725 n=10)
FPAQ-4                                                     25.50m ±   0%    25.58m ±   0%   +0.30% (p=0.009 n=10)    25.61m ±  0%   +0.42% (p=0.000 n=10)    25.36m ±  0%   -0.56% (p=0.000 n=10)      26.24m ±  0%   +2.92% (p=0.000 n=10)
FuturePromotion10000-4                                     3.370n ±   0%    3.370n ±   0%        ~ (p=0.387 n=10)    3.370n ±  0%        ~ (p=1.000 n=10)    3.370n ±  0%        ~ (p=0.387 n=10)      3.466n ±  0%   +2.85% (p=0.000 n=10)
GenerateKeyP256-4                                          21.30µ ±   0%    21.33µ ±   0%   +0.14% (p=0.003 n=10)    21.33µ ±  0%        ~ (p=0.171 n=10)    21.25µ ±  0%   -0.27% (p=0.001 n=10)      21.28µ ±  0%   -0.09% (p=0.008 n=10)
GenSharedKeyP256-4                                         77.53µ ±   0%    77.57µ ±   0%        ~ (p=0.928 n=10)    77.77µ ±  0%   +0.31% (p=0.001 n=10)    77.64µ ±  0%        ~ (p=0.089 n=10)      77.57µ ±  0%        ~ (p=0.739 n=10)
GenSharedKeyS256-4                                         72.73µ ±   0%    72.72µ ±   0%        ~ (p=0.565 n=10)    72.66µ ±  0%        ~ (p=0.063 n=10)    72.66µ ±  1%        ~ (p=0.469 n=10)      72.67µ ±  0%        ~ (p=0.353 n=10)
Growth_MultiSegment-4                                      20.19m ±   0%    20.38m ±   0%   +0.94% (p=0.000 n=10)    20.70m ±  0%   +2.52% (p=0.000 n=10)    20.44m ±  1%   +1.24% (p=0.000 n=10)      20.42m ±  0%   +1.13% (p=0.000 n=10)
Hash8K-4                                                   31.21µ ±   0%    31.22µ ±   0%        ~ (p=0.362 n=10)    31.34µ ±  0%   +0.42% (p=0.004 n=10)    31.18µ ±  0%   -0.10% (p=0.001 n=10)      31.18µ ±  0%   -0.10% (p=0.000 n=10)
HashFixedSize/10K-4                                        14.06m ±   2%    14.03m ±   1%        ~ (p=0.739 n=10)    14.05m ±  2%        ~ (p=0.684 n=10)    13.81m ±  2%        ~ (p=0.052 n=10)      14.12m ±  1%        ~ (p=0.315 n=10)
HashimotoLight-4                                           2.072m ±   5%    2.134m ±   1%   +2.95% (p=0.043 n=10)    2.154m ±  3%   +3.91% (p=0.043 n=10)    2.096m ±  6%        ~ (p=0.853 n=10)      2.056m ±  4%        ~ (p=1.000 n=10)
HistogramAllocation-4                                      2.889µ ±   3%    2.852µ ±   1%        ~ (p=0.072 n=10)    2.895µ ±  3%        ~ (p=0.644 n=10)    2.884µ ±  2%        ~ (p=0.631 n=10)      2.877µ ±  1%        ~ (p=0.698 n=10)
JSONLogMarshalerFunc-4                                    1000.8n ±   1%   1000.4n ±   1%        ~ (p=0.838 n=10)   1020.0n ±  2%   +1.92% (p=0.045 n=10)    987.6n ±  3%        ~ (p=0.699 n=10)     1026.5n ±  1%   +2.57% (p=0.000 n=10)
Listener-4                                                 1.092µ ±   3%    1.105µ ±   2%        ~ (p=0.093 n=10)    1.095µ ±  2%        ~ (p=0.671 n=10)    1.092µ ±  2%        ~ (p=0.644 n=10)      1.081µ ±  2%        ~ (p=0.101 n=10)
LouvainDirectedMultiplex-4                                 23.65m ±   2%    23.78m ±   4%        ~ (p=0.353 n=10)    23.90m ±  3%        ~ (p=0.052 n=10)    23.15m ±  3%        ~ (p=0.481 n=10)      23.21m ±  3%        ~ (p=0.971 n=10)
LZ-4                                                       1.466m ±   2%    1.447m ±   1%        ~ (p=0.105 n=10)    1.480m ±  1%   +0.98% (p=0.009 n=10)    1.468m ±  1%        ~ (p=0.631 n=10)      1.481m ±  1%   +1.08% (p=0.002 n=10)
MarshalMsgdataUsageCacheInfo-4                             142.1n ±   3%    144.4n ±   1%   +1.65% (p=0.000 n=10)    145.6n ±  1%   +2.46% (p=0.000 n=10)                                              143.5n ±  1%   +0.99% (p=0.000 n=10)
MergeByLanguage-4                                          997.2n ±   1%    996.7n ±   1%        ~ (p=0.469 n=10)   1022.0n ±  4%   +2.48% (p=0.035 n=10)    974.4n ±  3%        ~ (p=0.050 n=10)      982.3n ±  1%   -1.50% (p=0.007 n=10)
MTFT-4                                                     1.247m ±   0%    1.252m ±   0%   +0.40% (p=0.000 n=10)    1.231m ±  0%   -1.28% (p=0.000 n=10)    1.238m ±  0%   -0.71% (p=0.000 n=10)      1.232m ±  0%   -1.20% (p=0.000 n=10)
MultiWriteSyncer/2_discarder-4                             15.39n ±  11%    15.21n ±   6%        ~ (p=0.305 n=10)    14.77n ± 13%   -4.03% (p=0.037 n=10)    14.97n ±  7%   -2.76% (p=0.041 n=10)      14.80n ±  6%        ~ (p=0.137 n=10)
MultiWriteSyncer/4_discarder-4                             19.53n ±   7%    19.86n ±  10%        ~ (p=0.645 n=10)    19.45n ±  3%        ~ (p=0.306 n=10)    20.04n ±  3%        ~ (p=0.271 n=10)      20.07n ±  4%        ~ (p=0.838 n=10)
MultiWriteSyncer/4_discarder_with_buffer-4                 224.7n ±   4%    213.4n ±  10%        ~ (p=0.393 n=10)    205.8n ± 15%        ~ (p=0.353 n=10)    217.5n ± 12%        ~ (p=0.393 n=10)      222.2n ± 12%        ~ (p=0.280 n=10)
MulWorkspaceDense1000Hundredth-4                           324.7m ±   3%    331.9m ±   0%   +2.23% (p=0.000 n=10)    324.4m ±  1%        ~ (p=0.971 n=10)    316.1m ±  2%   -2.66% (p=0.015 n=10)      322.9m ±  1%        ~ (p=0.481 n=10)
MuxBrodcast-4                                              762.7n ±   5%    807.2n ±   7%   +5.84% (p=0.043 n=10)    815.8n ± 11%        ~ (p=0.123 n=10)    758.5n ±  5%        ~ (p=0.631 n=10)      764.7n ±  4%        ~ (p=0.796 n=10)
OpDiv128-4                                                 126.2n ±   1%    125.8n ±   0%        ~ (p=0.059 n=10)    136.0n ±  1%   +7.76% (p=0.000 n=10)    138.4n ±  1%   +9.58% (p=0.000 n=10)      135.6n ±  0%   +7.41% (p=0.000 n=10)
ParallelBrodcast-4                                         744.1n ±   0%    746.2n ±   0%   +0.28% (p=0.015 n=10)    764.3n ±  0%   +2.72% (p=0.000 n=10)    767.2n ±  0%   +3.10% (p=0.000 n=10)      771.9n ±  1%   +3.73% (p=0.000 n=10)
ParallelDirectSend-4                                       545.8n ±   3%    547.6n ±   2%        ~ (p=0.796 n=10)    557.0n ±  3%   +2.07% (p=0.015 n=10)    574.0n ±  1%   +5.18% (p=0.002 n=10)      568.5n ±  3%   +4.16% (p=0.001 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:100-4    23.95µ ±   1%    23.99µ ±   1%        ~ (p=0.616 n=10)    24.55µ ±  1%   +2.51% (p=0.000 n=10)    24.30µ ±  1%   +1.42% (p=0.009 n=10)      24.47µ ±  1%   +2.16% (p=0.000 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:10-4     36.34µ ±   1%    36.27µ ±   1%        ~ (p=0.393 n=10)    36.61µ ±  1%   +0.76% (p=0.011 n=10)    36.04µ ±  2%        ~ (p=0.247 n=10)      36.61µ ±  0%   +0.74% (p=0.000 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-4      280.7µ ±   1%    270.1µ ±   4%   -3.79% (p=0.019 n=10)    274.4µ ±  6%        ~ (p=0.123 n=10)    273.3µ ±  4%   -2.65% (p=0.035 n=10)      271.1µ ±  4%   -3.42% (p=0.007 n=10)
ParallelizeUntil/pieces:999,workers:10,chunkSize:13-4      32.11µ ±   0%    32.09µ ±   1%        ~ (p=0.971 n=10)    32.40µ ±  1%   +0.92% (p=0.002 n=10)    31.97µ ±  1%        ~ (p=0.796 n=10)      32.47µ ±  0%   +1.11% (p=0.000 n=10)
ParseBigBytes-4                                            2.061µ ±   2%    2.046µ ±   2%        ~ (p=0.353 n=10)    2.097µ ±  4%   +1.75% (p=0.005 n=10)    2.024µ ±  2%        ~ (p=0.190 n=10)      2.017µ ±  1%   -2.14% (p=0.027 n=10)
PendingDemotion10000-4                                     216.4n ±   0%    216.4n ±   0%        ~ (p=0.364 n=10)    221.2n ±  0%   +2.24% (p=0.000 n=10)    221.2n ±  0%   +2.24% (p=0.000 n=10)      221.5n ±  0%   +2.38% (p=0.000 n=10)
Pi/foo=apmckinlay/prec=100-4                               4.015µ ±   0%    3.860µ ±   0%   -3.87% (p=0.000 n=10)    4.270µ ±  0%   +6.35% (p=0.000 n=10)    4.113µ ±  0%   +2.44% (p=0.000 n=10)      4.042µ ±  0%   +0.66% (p=0.000 n=10)
Pi/foo=ericlagergren_(GDA)/prec=100-4                      391.9µ ±   0%    396.1µ ±   1%   +1.07% (p=0.015 n=10)    403.1µ ±  0%   +2.86% (p=0.000 n=10)    398.1µ ±  1%   +1.60% (p=0.000 n=10)      399.1µ ±  0%   +1.85% (p=0.000 n=10)
Pi/foo=ericlagergren_(Go)/prec=100-4                       196.1µ ±   1%    195.6µ ±   2%        ~ (p=0.481 n=10)    202.8µ ±  1%   +3.41% (p=0.000 n=10)    198.2µ ±  1%   +1.06% (p=0.009 n=10)      196.7µ ±  1%        ~ (p=0.631 n=10)
Pi/foo=float64/prec=100-4                                  5.243µ ±   0%    5.243µ ±   0%        ~ (p=0.232 n=10)    5.243µ ±  0%        ~ (p=1.000 n=10)    5.243µ ±  0%        ~ (p=0.387 n=10)      5.246µ ±  0%   +0.05% (p=0.000 n=10)
Pi/foo=go-inf/prec=100-4                                   167.6µ ±   3%    168.4µ ±   2%        ~ (p=1.000 n=10)    168.2µ ±  1%        ~ (p=0.739 n=10)    169.5µ ±  3%        ~ (p=0.218 n=10)      166.4µ ±  1%        ~ (p=0.218 n=10)
Pi/foo=shopspring/prec=100-4                               505.4µ ±   2%    501.2µ ±   1%        ~ (p=0.075 n=10)    506.7µ ±  2%        ~ (p=0.971 n=10)    505.0µ ±  2%        ~ (p=1.000 n=10)      493.1µ ±  2%   -2.44% (p=0.001 n=10)
Polygon-4                                                  9.138µ ±   0%    9.133µ ±   0%        ~ (p=0.361 n=10)    9.252µ ±  0%   +1.24% (p=0.000 n=10)    9.422µ ±  0%   +3.10% (p=0.000 n=10)      9.451µ ±  0%   +3.43% (p=0.000 n=10)
PoolBatchInsert10000-4                                      1.069 ±   0%     1.071 ±   0%        ~ (p=0.089 n=10)     1.084 ±  0%   +1.38% (p=0.000 n=10)     1.076 ±  1%   +0.65% (p=0.011 n=10)       1.065 ±  0%   -0.34% (p=0.019 n=10)
PoolBatchLocalInsert10000-4                                 1.041 ±   1%     1.045 ±   0%        ~ (p=0.165 n=10)     1.058 ±  1%   +1.61% (p=0.000 n=10)     1.051 ±  1%   +0.94% (p=0.009 n=10)       1.043 ±  0%        ~ (p=0.684 n=10)
RecursiveFunc-4                                            15.88µ ±   1%    17.27µ ±   1%   +8.80% (p=0.000 n=10)    17.66µ ±  1%  +11.20% (p=0.000 n=10)    17.29µ ±  1%   +8.91% (p=0.000 n=10)      17.75µ ±  0%  +11.76% (p=0.000 n=10)
ReflectorResyncChanMany-4                                  882.4n ±   2%    878.8n ±   1%        ~ (p=0.812 n=10)    890.8n ±  2%   +0.95% (p=0.045 n=10)    867.5n ±  2%        ~ (p=0.075 n=10)      870.6n ±  2%        ~ (p=0.218 n=10)
RegexMatch-4                                               1.686µ ±   1%    1.583µ ±   0%   -6.14% (p=0.000 n=10)    1.741µ ±  0%   +3.26% (p=0.000 n=10)    1.675µ ±  0%   -0.68% (p=0.019 n=10)      1.686µ ±  0%        ~ (p=0.926 n=10)
RenderSpec-4                                               7.361m ±   1%    7.328m ±   3%        ~ (p=0.853 n=10)    7.618m ±  1%   +3.49% (p=0.000 n=10)    7.462m ±  3%   +1.37% (p=0.029 n=10)      7.446m ±  3%   +1.16% (p=0.001 n=10)
RenderSpecBlackFriday2-4                                   5.267m ±   2%    5.309m ±   1%        ~ (p=0.436 n=10)    5.431m ±  2%   +3.11% (p=0.002 n=10)    5.287m ±  2%        ~ (p=0.684 n=10)      5.349m ±  1%        ~ (p=0.063 n=10)
RenderSpecNoHTML-4                                         7.348m ±   2%    7.299m ±   1%        ~ (p=0.853 n=10)    7.571m ±  1%   +3.03% (p=0.000 n=10)    7.371m ±  3%        ~ (p=0.190 n=10)      7.415m ±  1%   +0.91% (p=0.015 n=10)
RepeatExecProgram-4                                        17.61µ ±   2%    17.51µ ±   1%        ~ (p=0.123 n=10)    17.68µ ±  1%        ~ (p=0.055 n=10)    17.23µ ±  2%   -2.19% (p=0.029 n=10)      17.53µ ±  0%        ~ (p=0.315 n=10)
RepeatIOExecProgram-4                                      30.27µ ±   4%    30.56µ ±   2%        ~ (p=0.247 n=10)    30.94µ ±  2%        ~ (p=0.075 n=10)    30.62µ ±  3%        ~ (p=0.853 n=10)      30.25µ ±  2%        ~ (p=0.971 n=10)
RepeatIONew-4                                              1.342µ ±   1%    1.324µ ±   1%   -1.34% (p=0.001 n=10)    1.367µ ±  0%   +1.90% (p=0.000 n=10)    1.347µ ±  1%        ~ (p=0.127 n=10)      1.399µ ±  0%   +4.29% (p=0.000 n=10)
RepeatNew-4                                                163.1n ±   1%    155.6n ±   2%   -4.57% (p=0.000 n=10)    157.6n ±  0%   -3.34% (p=0.000 n=10)    155.2n ±  0%   -4.84% (p=0.000 n=10)      160.4n ±  1%   -1.66% (p=0.001 n=10)
ReplaceShortcodeTokens-4                                   13.11µ ±   8%    13.82µ ±   7%        ~ (p=0.247 n=10)    14.18µ ±  1%   +8.14% (p=0.007 n=10)    12.38µ ±  2%   -5.56% (p=0.000 n=10)      12.56µ ±  1%   -4.20% (p=0.001 n=10)
ResourceChainPostProcess-4                                 93.41m ±   3%    95.76m ±   2%   +2.51% (p=0.023 n=10)    96.38m ±  1%   +3.17% (p=0.002 n=10)    93.14m ±  4%        ~ (p=0.579 n=10)      93.41m ±  1%        ~ (p=0.853 n=10)
Run/10k/1-4                                                 34.07 ±   3%     34.20 ±   1%        ~ (p=0.853 n=10)     35.21 ±  1%   +3.34% (p=0.000 n=10)     34.03 ±  1%        ~ (p=0.853 n=10)       33.77 ±  2%        ~ (p=0.971 n=10)
Run/10k/16-4                                                14.12 ±   2%     14.23 ±   1%        ~ (p=0.631 n=10)     14.55 ±  2%   +3.02% (p=0.003 n=10)     14.27 ±  2%        ~ (p=0.247 n=10)       13.99 ±  3%        ~ (p=0.315 n=10)
Sampler_Check/100_keys-4                                   23.77n ±   3%    23.72n ±   0%   -0.19% (p=0.001 n=10)    23.69n ±  0%   -0.32% (p=0.000 n=10)    23.73n ±  0%   -0.17% (p=0.035 n=10)      23.71n ±  0%   -0.21% (p=0.001 n=10)
Sampler_Check/50_keys-4                                    23.76n ±   0%    23.73n ±   2%        ~ (p=0.125 n=10)    23.70n ±  0%   -0.25% (p=0.000 n=10)    23.69n ±  0%   -0.29% (p=0.000 n=10)      23.70n ±  0%   -0.27% (p=0.001 n=10)
Sampler_Check/7_keys-4                                     21.15n ±   0%    21.22n ±   0%   +0.33% (p=0.000 n=10)    21.18n ±  0%   +0.14% (p=0.003 n=10)    21.17n ±  2%   +0.09% (p=0.023 n=10)      21.17n ±  0%   +0.09% (p=0.045 n=10)
Sampler_CheckWithHook/100_keys-4                           47.15n ±   5%    48.75n ±   0%        ~ (p=0.092 n=10)    48.00n ±  3%        ~ (p=0.617 n=10)    48.32n ±  2%        ~ (p=0.138 n=10)      47.32n ±  2%        ~ (p=0.670 n=10)
Sampler_CheckWithHook/50_keys-4                            47.15n ±   5%    48.80n ±   0%   +3.51% (p=0.022 n=10)    48.11n ±  3%        ~ (p=0.210 n=10)    48.48n ±  2%   +2.82% (p=0.050 n=10)      47.52n ±  2%        ~ (p=0.617 n=10)
Sampler_CheckWithHook/7_keys-4                             49.34n ±   5%    50.93n ±   1%   +3.22% (p=0.034 n=10)    50.31n ±  3%        ~ (p=0.051 n=10)    50.86n ±  2%   +3.07% (p=0.042 n=10)      49.93n ±  2%        ~ (p=0.224 n=10)
ScaleVec10000Inc20-4                                       32.02µ ±  10%    30.37µ ±   3%        ~ (p=0.218 n=10)    30.78µ ±  4%        ~ (p=0.280 n=10)    34.05µ ±  8%        ~ (p=0.393 n=10)      35.06µ ±  4%        ~ (p=0.089 n=10)
ScopeTaggedNoCachedSubscopes-4                             6.150µ ±   1%    5.950µ ±   4%   -3.26% (p=0.001 n=10)    6.042µ ±  5%   -1.76% (p=0.035 n=10)    6.213µ ±  5%        ~ (p=0.436 n=10)      6.254µ ±  2%   +1.68% (p=0.005 n=10)
SeparateTestAndAdd-4                                       536.6n ±  12%    595.3n ±   4%        ~ (p=0.075 n=10)    605.4n ±  4%  +12.82% (p=0.028 n=10)    486.8n ±  9%        ~ (p=1.000 n=10)      485.8n ±  1%        ~ (p=0.796 n=10)
StandardJSON-4                                             1.336µ ±   9%    1.365µ ±   5%        ~ (p=0.494 n=10)    1.328µ ±  6%        ~ (p=0.493 n=10)    1.351µ ±  5%        ~ (p=0.342 n=10)      1.404µ ±  5%        ~ (p=0.123 n=10)
StdlibJSON-4                                               3.783µ ±   0%    3.805µ ±   1%   +0.58% (p=0.011 n=10)    3.888µ ±  0%   +2.78% (p=0.000 n=10)    3.811µ ±  0%   +0.74% (p=0.000 n=10)      3.906µ ±  0%   +3.24% (p=0.000 n=10)
TarjanSCCGnp_1000_half-4                                   123.1m ±   1%    121.0m ±   1%   -1.67% (p=0.000 n=10)    122.8m ±  2%        ~ (p=0.436 n=10)    122.6m ±  2%        ~ (p=0.280 n=10)      124.8m ±  1%   +1.39% (p=0.015 n=10)
TarjanSCCGnp_10_tenth-4                                    13.61µ ±   2%    13.59µ ±   2%        ~ (p=0.684 n=10)    14.24µ ±  2%   +4.63% (p=0.000 n=10)    13.90µ ±  2%   +2.13% (p=0.019 n=10)      13.96µ ±  1%   +2.59% (p=0.000 n=10)
TeeCheck-4                                                 291.8n ±   2%    302.2n ±   1%   +3.58% (p=0.000 n=10)    296.1n ±  2%   +1.46% (p=0.012 n=10)    292.9n ±  1%        ~ (p=0.671 n=10)      295.3n ±  2%        ~ (p=0.079 n=10)
TextMovementBetweenSegments-4                              853.8µ ±   2%    885.5µ ±   4%        ~ (p=0.143 n=10)    898.1µ ±  7%        ~ (p=0.089 n=10)    842.7µ ±  5%        ~ (p=0.796 n=10)      850.0µ ±  1%        ~ (p=1.000 n=10)
ValidateVersionTildeFail-4                                 929.3n ±   1%    949.1n ±   1%   +2.14% (p=0.000 n=10)    983.0n ±  1%   +5.78% (p=0.000 n=10)    932.1n ±  1%        ~ (p=0.404 n=10)      942.4n ±  1%   +1.41% (p=0.001 n=10)
WalkAllBreadthFirstGnp_1000_tenth-4                        18.49m ±   1%    18.14m ±   1%   -1.92% (p=0.002 n=10)    19.06m ±  2%   +3.04% (p=0.002 n=10)    18.63m ±  1%        ~ (p=0.280 n=10)      18.24m ±  0%   -1.40% (p=0.035 n=10)
WalkAllBreadthFirstGnp_10_tenth-4                          6.759µ ±   2%    6.635µ ±   1%   -1.84% (p=0.007 n=10)    6.975µ ±  1%   +3.19% (p=0.000 n=10)    6.866µ ±  1%   +1.58% (p=0.000 n=10)      6.845µ ±  0%   +1.27% (p=0.000 n=10)
WriteSyncer/write_file_with_no_buffer-4                    943.8n ±   2%    931.2n ±   1%   -1.32% (p=0.007 n=10)    932.5n ±  1%   -1.20% (p=0.007 n=10)    932.5n ±  1%   -1.19% (p=0.001 n=10)      950.1n ±  1%        ~ (p=0.288 n=10)
ZapConsole-4                                               948.9n ±   6%    917.8n ±  10%        ~ (p=0.724 n=10)    906.5n ±  3%        ~ (p=0.063 n=10)    930.8n ±  7%        ~ (p=0.684 n=10)      913.6n ±  9%        ~ (p=0.670 n=10)
ZapJSON-4                                                  628.1n ±   3%    621.5n ±   9%        ~ (p=0.796 n=10)    677.5n ±  8%   +7.86% (p=0.002 n=10)    647.5n ±  9%        ~ (p=0.089 n=10)      635.2n ± 13%        ~ (p=0.247 n=10)
geomean                                                    19.10µ           19.19µ          +0.47%                   19.52µ         +2.16%                   26.14µ         +0.83%                ¹    19.26µ         +0.80%
¹ benchmark set differs from baseline; geomeans may not be comparable
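
The footnote matters for the `pgobbgreed.llog` column: several rows are empty there, so its absolute geomean (26.14µ) is taken over a different set of benchmarks than the 19.10µ baseline, while the +0.83% next to it is presumably computed only over the benchmarks both columns share. A minimal Go sketch of why the absolute values stop being comparable is below. It is not benchstat's implementation; the `geomean` helper and the timings are hypothetical and are not taken from the tables.

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of strictly positive values.
// It averages logarithms instead of multiplying, to avoid overflow.
// Hypothetical helper for illustration, not benchstat code.
func geomean(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

func main() {
	// Made-up timings in ns/op for three benchmarks in the baseline.
	base := []float64{1, 10, 1000}

	// The same timings, but the fastest benchmark produced no result,
	// as with the empty cells in the table above.
	subset := []float64{10, 1000}

	// Every shared benchmark is unchanged, yet the absolute geomean
	// rises because a small value dropped out of the set.
	fmt.Printf("baseline geomean: %.2f ns/op\n", geomean(base))
	fmt.Printf("subset geomean:   %.2f ns/op\n", geomean(subset))

	// Restricting the baseline to the shared benchmarks makes the
	// two numbers comparable again.
	fmt.Printf("shared-set ratio: %.4f\n", geomean(subset)/geomean(base[1:]))
}
```

In this sketch the missing row is a nanosecond-scale benchmark, which mirrors the table: the rows absent from `pgobbgreed.llog` (for example `AppendMsgReplicateDecision-4`) are among the smallest timings, so dropping them pulls the geomean up without implying any slowdown.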

                 │ 2024.08.16_13.32.server.go.pgo.stat │ 2024.08.16_13.32.server.go.pgobb.stat │ 2024.08.16_13.32.server.go.pgogreed.stat │ 2024.08.16_13.32.server.go.pgobbgreed.stat │ 2024.08.16_13.32.server.go.pgobbbbgreed.stat │
                 │                req/s                │     req/s       vs base               │       req/s        vs base               │        req/s         vs base               │       req/s         vs base                  │
Etcd_w                                     4.513k ± 0%      4.526k ± 1%  +0.28% (p=0.043 n=10)         4.448k ± 1%  -1.45% (p=0.000 n=10)           4.495k ± 0%  -0.41% (p=0.000 n=10)          4.450k ± 0%  -1.40% (p=0.000 n=10+20)
Etcd_l10c50w                               37.51k ± 1%      37.85k ± 1%       ~ (p=0.063 n=10)         37.28k ± 2%       ~ (p=0.143 n=10)           37.64k ± 1%       ~ (p=0.796 n=10)          37.01k ± 1%  -1.32% (p=0.000 n=10+20)
Etcd_r                                     5.574k ± 1%      5.601k ± 0%  +0.48% (p=0.019 n=10)         5.519k ± 1%  -0.99% (p=0.000 n=10)           5.590k ± 2%       ~ (p=0.280 n=10)          5.496k ± 0%  -1.40% (p=0.000 n=10+20)
Etcd_l1c1w                                 4.434k ± 1%      4.450k ± 1%       ~ (p=0.247 n=10)         4.379k ± 1%  -1.23% (p=0.005 n=10)           4.418k ± 1%       ~ (p=0.165 n=10)          4.377k ± 0%  -1.29% (p=0.000 n=10+20)
Etcd_l100c1000w                            44.49k ± 1%      44.40k ± 1%       ~ (p=0.579 n=10)         44.08k ± 1%  -0.94% (p=0.029 n=10)           43.75k ± 1%  -1.67% (p=0.000 n=10)          43.51k ± 1%  -2.21% (p=0.000 n=10+20)
Etcd_l1c1lr                                5.568k ± 1%      5.599k ± 1%       ~ (p=0.631 n=10)         5.518k ± 1%  -0.90% (p=0.043 n=10)           5.535k ± 2%       ~ (p=0.436 n=10)          5.498k ± 1%  -1.27% (p=0.000 n=10+20)
Etcd_l100c1000lr                           72.65k ± 2%      72.47k ± 3%       ~ (p=1.000 n=10)         71.50k ± 2%       ~ (p=0.089 n=10)           70.31k ± 2%  -3.22% (p=0.002 n=10)          70.76k ± 1%  -2.60% (p=0.000 n=10+20)
geomean                                    13.34k           13.38k       +0.30%                        13.20k       -1.10%                          13.24k       -0.81%                         13.12k       -1.64%

Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz:

                        │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                        │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
BinaryTree17-4              3.231 ± 0%    3.258 ± 1%  +0.83% (p=0.000 n=10)     3.338 ± 1%  +3.31% (p=0.000 n=10)     3.270 ± 1%   +1.20% (p=0.000 n=10)     3.235 ± 1%        ~ (p=0.190 n=10)
Fannkuch11-4                2.915 ± 0%    2.919 ± 0%  +0.12% (p=0.000 n=10)     2.907 ± 0%  -0.29% (p=0.002 n=10)     2.909 ± 0%   -0.21% (p=0.000 n=10)     2.869 ± 0%   -1.60% (p=0.000 n=10)
FmtFprintfEmpty-4          29.77n ± 0%   29.97n ± 1%  +0.64% (p=0.017 n=10)    29.92n ± 1%       ~ (p=0.492 n=10)    29.71n ± 1%        ~ (p=0.060 n=10)    29.60n ± 1%   -0.59% (p=0.003 n=10)
FmtFprintfString-4         54.27n ± 1%   59.17n ± 1%  +9.04% (p=0.000 n=10)    53.28n ± 0%  -1.82% (p=0.000 n=10)    52.42n ± 5%        ~ (p=0.063 n=10)    54.57n ± 1%        ~ (p=0.060 n=10)
FmtFprintfInt-4            63.61n ± 0%   64.50n ± 1%  +1.39% (p=0.000 n=10)    65.12n ± 0%  +2.38% (p=0.000 n=10)    63.65n ± 0%        ~ (p=0.237 n=10)    68.21n ± 0%   +7.23% (p=0.000 n=10)
FmtFprintfIntInt-4        100.50n ± 0%   99.68n ± 0%  -0.82% (p=0.000 n=10)   105.80n ± 0%  +5.27% (p=0.000 n=10)   100.10n ± 0%   -0.40% (p=0.001 n=10)   112.50n ± 0%  +11.94% (p=0.000 n=10)
FmtFprintfPrefixedInt-4    125.1n ± 4%   132.8n ± 0%  +6.11% (p=0.000 n=10)    130.5n ± 0%  +4.27% (p=0.000 n=10)    128.7n ± 0%        ~ (p=0.468 n=10)    131.9n ± 0%   +5.39% (p=0.000 n=10)
FmtFprintfFloat-4          191.4n ± 0%   192.2n ± 0%  +0.39% (p=0.000 n=10)    205.6n ± 0%  +7.42% (p=0.000 n=10)    208.2n ± 0%   +8.78% (p=0.000 n=10)    204.4n ± 0%   +6.79% (p=0.000 n=10)
FmtManyArgs-4              441.1n ± 0%   429.0n ± 1%  -2.75% (p=0.000 n=10)    448.9n ± 0%  +1.75% (p=0.000 n=10)    428.5n ± 0%   -2.86% (p=0.000 n=10)    458.8n ± 0%   +4.00% (p=0.000 n=10)
GobDecode-4                5.404m ± 1%   5.272m ± 0%  -2.46% (p=0.000 n=10)    5.571m ± 1%  +3.09% (p=0.000 n=10)    5.478m ± 2%   +1.36% (p=0.002 n=10)    5.570m ± 2%   +3.07% (p=0.000 n=10)
GobEncode-4                4.461m ± 0%   4.377m ± 1%  -1.88% (p=0.000 n=10)    4.438m ± 1%       ~ (p=0.165 n=10)    4.540m ± 1%   +1.76% (p=0.000 n=10)    4.533m ± 1%   +1.60% (p=0.002 n=10)
Gzip-4                     297.5m ± 0%   300.7m ± 0%  +1.08% (p=0.000 n=10)    303.1m ± 1%  +1.90% (p=0.000 n=10)    302.6m ± 0%   +1.72% (p=0.000 n=10)    301.9m ± 1%   +1.51% (p=0.000 n=10)
Gunzip-4                   30.15m ± 0%   30.16m ± 1%       ~ (p=0.684 n=10)    30.36m ± 0%  +0.71% (p=0.000 n=10)    30.46m ± 1%   +1.04% (p=0.000 n=10)    29.83m ± 1%   -1.05% (p=0.000 n=10)
HTTPClientServer-4         103.5µ ± 1%   111.3µ ± 2%  +7.45% (p=0.000 n=10)    105.4µ ± 1%  +1.83% (p=0.000 n=10)    110.0µ ± 1%   +6.19% (p=0.000 n=10)    107.4µ ± 1%   +3.69% (p=0.000 n=10)
JSONEncode-4               8.196m ± 0%   8.208m ± 1%       ~ (p=0.481 n=10)    8.048m ± 0%  -1.80% (p=0.000 n=10)    8.416m ± 0%   +2.68% (p=0.000 n=10)    8.424m ± 0%   +2.78% (p=0.000 n=10)
JSONDecode-4               39.04m ± 0%   37.81m ± 0%  -3.15% (p=0.000 n=10)    39.45m ± 0%  +1.05% (p=0.000 n=10)    38.06m ± 0%   -2.51% (p=0.000 n=10)    37.90m ± 0%   -2.93% (p=0.000 n=10)
Mandelbrot200-4            4.930m ± 0%   4.935m ± 0%  +0.09% (p=0.000 n=10)    4.929m ± 0%  -0.03% (p=0.019 n=10)    4.932m ± 0%   +0.04% (p=0.019 n=10)    4.929m ± 0%   -0.02% (p=0.009 n=10)
GoParse-4                  3.487m ± 0%   3.421m ± 0%  -1.87% (p=0.000 n=10)    3.489m ± 0%       ~ (p=0.684 n=10)    3.526m ± 0%   +1.12% (p=0.000 n=10)    3.583m ± 0%   +2.77% (p=0.000 n=10)
RegexpMatchEasy0_32-4      62.46n ± 3%   65.90n ± 1%  +5.49% (p=0.000 n=10)    59.36n ± 1%  -4.98% (p=0.000 n=10)    60.64n ± 2%   -2.91% (p=0.000 n=10)    63.32n ± 1%        ~ (p=0.072 n=10)
RegexpMatchEasy0_1K-4      209.7n ± 1%   209.6n ± 2%       ~ (p=0.970 n=10)    209.8n ± 2%       ~ (p=0.753 n=10)    211.7n ± 1%   +0.93% (p=0.034 n=10)    210.8n ± 0%        ~ (p=0.109 n=10)
RegexpMatchEasy1_32-4      56.26n ± 2%   57.36n ± 3%  +1.96% (p=0.003 n=10)    55.28n ± 2%       ~ (p=0.089 n=10)    55.95n ± 2%        ~ (p=0.436 n=10)    55.56n ± 1%        ~ (p=0.218 n=10)
RegexpMatchEasy1_1K-4      293.0n ± 2%   299.8n ± 2%  +2.30% (p=0.001 n=10)    295.7n ± 1%  +0.92% (p=0.002 n=10)    291.8n ± 1%        ~ (p=0.616 n=10)    298.3n ± 2%   +1.79% (p=0.005 n=10)
RegexpMatchMedium_32-4     981.7n ± 5%   887.5n ± 0%  -9.59% (p=0.000 n=10)   1001.0n ± 0%       ~ (p=0.102 n=10)    859.2n ± 0%  -12.47% (p=0.000 n=10)    894.0n ± 6%   -8.93% (p=0.000 n=10)
RegexpMatchMedium_1K-4     27.15µ ± 0%   27.85µ ± 7%  +2.61% (p=0.000 n=10)    27.34µ ± 7%  +0.73% (p=0.000 n=10)    27.80µ ± 0%   +2.40% (p=0.001 n=10)    28.03µ ± 6%   +3.28% (p=0.000 n=10)
RegexpMatchHard_32-4       1.444µ ± 0%   1.412µ ± 1%  -2.18% (p=0.000 n=10)    1.482µ ± 0%  +2.67% (p=0.000 n=10)    1.430µ ± 0%   -0.94% (p=0.000 n=10)    1.425µ ± 1%   -1.28% (p=0.000 n=10)
RegexpMatchHard_1K-4       43.41µ ± 2%   42.45µ ± 0%  -2.21% (p=0.000 n=10)    44.90µ ± 0%  +3.45% (p=0.000 n=10)    43.41µ ± 0%        ~ (p=0.447 n=10)    43.05µ ± 0%   -0.82% (p=0.000 n=10)
Revcomp-4                  467.0m ± 0%   466.8m ± 1%       ~ (p=0.684 n=10)    453.4m ± 1%  -2.92% (p=0.000 n=10)    455.5m ± 0%   -2.46% (p=0.000 n=10)    457.7m ± 2%   -1.99% (p=0.011 n=10)
Template-4                 52.07m ± 2%   51.86m ± 0%  -0.40% (p=0.002 n=10)    51.87m ± 0%  -0.38% (p=0.011 n=10)    52.09m ± 0%        ~ (p=0.853 n=10)    52.24m ± 0%   +0.33% (p=0.023 n=10)
TimeParse-4                249.1n ± 0%   252.4n ± 0%  +1.37% (p=0.000 n=10)    265.5n ± 0%  +6.59% (p=0.000 n=10)    263.0n ± 0%   +5.60% (p=0.000 n=10)    271.5n ± 0%   +9.01% (p=0.000 n=10)
TimeFormat-4               226.0n ± 0%   224.9n ± 0%  -0.49% (p=0.000 n=10)    238.8n ± 0%  +5.66% (p=0.000 n=10)    232.0n ± 0%   +2.65% (p=0.000 n=10)    252.6n ± 0%  +11.77% (p=0.000 n=10)
geomean                    45.69µ        45.87µ       +0.38%                   46.30µ       +1.33%                   45.83µ        +0.30%                   46.55µ        +1.88%


                                                        │   pgo.llog    │              pgobb.llog              │             pgogreed.llog             │           pgobbgreed.llog#0           │           pgobbgreed.llog#1           │
                                                        │    sec/op     │    sec/op      vs base               │    sec/op      vs base                │    sec/op      vs base                │    sec/op      vs base                │
Arc-4                                                      4.275µ ±  1%    4.318µ ±  1%  +1.02% (p=0.001 n=10)    4.329µ ±  1%   +1.26% (p=0.000 n=10)    4.315µ ±  1%   +0.95% (p=0.001 n=10)    4.315µ ±  1%   +0.95% (p=0.001 n=10)
BaseTest2KB-4                                              801.7n ±  0%    801.5n ±  0%       ~ (p=0.378 n=10)   1012.0n ±  0%  +26.23% (p=0.000 n=10)   1012.0n ±  0%  +26.23% (p=0.000 n=10)   1012.0n ±  0%  +26.23% (p=0.000 n=10)
Broker_No_Metrics_Open-4                                   124.4µ ±  1%    124.6µ ±  1%       ~ (p=0.631 n=10)    126.0µ ±  1%   +1.24% (p=0.000 n=10)    124.7µ ±  1%        ~ (p=0.247 n=10)    124.7µ ±  1%        ~ (p=0.247 n=10)
Broker_Open-4                                              190.0µ ±  2%    178.3µ ±  2%  -6.16% (p=0.000 n=10)    190.5µ ±  3%        ~ (p=0.971 n=10)    182.1µ ±  2%   -4.14% (p=0.000 n=10)    182.1µ ±  2%   -4.14% (p=0.000 n=10)
BufferedWriteSyncer/write_file_with_buffer-4               232.3n ±  4%    242.7n ±  8%       ~ (p=0.089 n=10)    231.6n ±  3%        ~ (p=0.971 n=10)    234.0n ±  9%        ~ (p=0.564 n=10)    234.0n ±  9%        ~ (p=0.564 n=10)
BuildJSON-4                                                5.088µ ±  0%    5.046µ ±  0%  -0.82% (p=0.000 n=10)    5.181µ ±  0%   +1.84% (p=0.000 n=10)    5.092µ ±  0%        ~ (p=0.196 n=10)    5.092µ ±  0%        ~ (p=0.196 n=10)
BWTS-4                                                    0.7847n ±  1%   0.7799n ±  0%  -0.61% (p=0.002 n=10)   0.7896n ±  0%   +0.63% (p=0.000 n=10)   0.7805n ±  0%   -0.53% (p=0.015 n=10)   0.7805n ±  0%   -0.53% (p=0.015 n=10)
CombinedTestAndAdd-4                                       347.7n ±  1%    347.8n ±  0%       ~ (p=0.897 n=10)    346.6n ±  0%   -0.29% (p=0.027 n=10)    348.1n ±  1%        ~ (p=0.616 n=10)    348.1n ±  1%        ~ (p=0.616 n=10)
CommitAfterHashFixedSize/10K-4                             16.58m ±  1%    16.51m ±  1%       ~ (p=0.218 n=10)    16.68m ±  1%        ~ (p=0.089 n=10)    16.61m ±  1%        ~ (p=1.000 n=10)    16.61m ±  1%        ~ (p=1.000 n=10)
DasumMediumUnitaryInc-4                                    400.8n ±  0%    401.2n ±  0%       ~ (p=0.541 n=10)    398.8n ±  0%   -0.47% (p=0.001 n=10)    398.8n ±  0%   -0.49% (p=0.001 n=10)    398.8n ±  0%   -0.49% (p=0.001 n=10)
Dgeev/Circulant100-4                                       20.70m ±  0%    21.05m ±  0%  +1.71% (p=0.000 n=10)    20.89m ±  0%   +0.91% (p=0.000 n=10)    21.10m ±  0%   +1.94% (p=0.000 n=10)    21.10m ±  0%   +1.94% (p=0.000 n=10)
Dgeev/Circulant10-4                                        56.81µ ±  0%    56.50µ ±  0%  -0.55% (p=0.000 n=10)    57.58µ ±  1%   +1.35% (p=0.000 n=10)    56.57µ ±  0%   -0.43% (p=0.005 n=10)    56.57µ ±  0%   -0.43% (p=0.005 n=10)
DirectSend-4                                               420.3n ±  2%    424.1n ±  3%       ~ (p=0.143 n=10)    421.7n ±  1%        ~ (p=0.447 n=10)    428.6n ±  2%        ~ (p=0.072 n=10)    428.6n ±  2%        ~ (p=0.072 n=10)
Dnrm2MediumPosInc-4                                        5.418µ ±  0%    5.419µ ±  0%       ~ (p=0.439 n=10)    5.418µ ±  0%        ~ (p=0.828 n=10)    5.417µ ±  0%        ~ (p=0.213 n=10)    5.417µ ±  0%        ~ (p=0.213 n=10)
Encoding4KBVerySparse-4                                    18.71µ ±  0%    18.76µ ±  0%  +0.28% (p=0.008 n=10)    20.52µ ±  0%   +9.70% (p=0.000 n=10)    18.79µ ±  0%   +0.41% (p=0.000 n=10)    18.79µ ±  0%   +0.41% (p=0.000 n=10)
FastTest2KB-4                                              167.0n ±  0%    167.0n ±  0%       ~ (p=0.720 n=10)    168.6n ±  0%   +0.93% (p=0.000 n=10)    168.6n ±  0%   +0.93% (p=0.000 n=10)    168.6n ±  0%   +0.93% (p=0.000 n=10)
FPAQ-4                                                     24.18m ±  0%    24.14m ±  0%       ~ (p=0.247 n=10)    24.37m ±  0%   +0.79% (p=0.000 n=10)    24.14m ±  0%        ~ (p=0.143 n=10)    24.14m ±  0%        ~ (p=0.143 n=10)
FuturePromotion10000-4                                     3.093n ±  0%    3.093n ±  0%       ~ (p=0.911 n=10)    3.093n ±  0%        ~ (p=0.806 n=10)    3.479n ±  0%  +12.48% (p=0.000 n=10)    3.479n ±  0%  +12.48% (p=0.000 n=10)
GenerateKeyP256-4                                          23.95µ ±  0%    23.91µ ±  0%  -0.18% (p=0.000 n=10)    24.03µ ±  0%   +0.33% (p=0.000 n=10)    23.98µ ±  0%   +0.15% (p=0.034 n=10)    23.98µ ±  0%   +0.15% (p=0.034 n=10)
GenSharedKeyP256-4                                         95.43µ ±  0%    95.41µ ±  0%       ~ (p=0.060 n=10)    95.37µ ±  0%   -0.07% (p=0.041 n=10)    95.43µ ±  0%        ~ (p=0.424 n=10)    95.43µ ±  0%        ~ (p=0.424 n=10)
GenSharedKeyS256-4                                         92.23µ ±  0%    92.33µ ±  0%       ~ (p=0.305 n=10)    92.19µ ±  0%        ~ (p=0.578 n=10)    92.14µ ±  0%        ~ (p=0.118 n=10)    92.14µ ±  0%        ~ (p=0.118 n=10)
Growth_MultiSegment-4                                      15.22m ±  0%    15.21m ±  0%       ~ (p=0.579 n=10)    15.34m ±  0%   +0.80% (p=0.000 n=10)    15.19m ±  0%   -0.20% (p=0.000 n=10)    15.19m ±  0%   -0.20% (p=0.000 n=10)
HashFixedSize/10K-4                                        10.89m ±  1%    10.89m ±  1%       ~ (p=0.739 n=10)    11.03m ±  1%   +1.20% (p=0.004 n=10)    10.86m ±  1%        ~ (p=0.631 n=10)    10.86m ±  1%        ~ (p=0.631 n=10)
HashimotoLight-4                                           1.520m ±  1%    1.517m ±  1%       ~ (p=0.315 n=10)    1.523m ±  1%        ~ (p=0.436 n=10)    1.522m ±  0%        ~ (p=0.481 n=10)    1.522m ±  0%        ~ (p=0.481 n=10)
HistogramAllocation-4                                      2.056µ ±  1%    2.041µ ±  3%       ~ (p=0.481 n=10)    2.086µ ±  5%        ~ (p=0.868 n=10)    2.012µ ±  6%        ~ (p=0.165 n=10)    2.012µ ±  6%        ~ (p=0.165 n=10)
JSONLogMarshalerFunc-4                                     645.6n ±  0%    647.0n ±  1%       ~ (p=0.393 n=10)    648.1n ±  0%        ~ (p=0.052 n=10)    644.3n ±  1%        ~ (p=0.280 n=10)    644.3n ±  1%        ~ (p=0.280 n=10)
Listener-4                                                 1.072µ ±  2%    1.089µ ±  3%       ~ (p=0.591 n=10)    1.095µ ±  4%        ~ (p=0.211 n=10)    1.093µ ±  3%        ~ (p=0.110 n=10)    1.093µ ±  3%        ~ (p=0.110 n=10)
LouvainDirectedMultiplex-4                                 23.62m ±  1%    23.62m ±  1%       ~ (p=0.912 n=10)    24.20m ±  0%   +2.44% (p=0.000 n=10)    23.61m ±  0%        ~ (p=0.436 n=10)    23.61m ±  0%        ~ (p=0.436 n=10)
LZ-4                                                       1.738m ±  1%    1.719m ±  1%  -1.12% (p=0.002 n=10)    1.724m ±  1%   -0.81% (p=0.007 n=10)    1.723m ±  2%   -0.87% (p=0.029 n=10)    1.723m ±  2%   -0.87% (p=0.029 n=10)
MTFT-4                                                     1.327m ±  1%    1.326m ±  2%       ~ (p=0.853 n=10)    1.295m ±  1%   -2.44% (p=0.000 n=10)    1.285m ±  0%   -3.22% (p=0.002 n=10)    1.285m ±  0%   -3.22% (p=0.002 n=10)
MultiWriteSyncer/2_discarder-4                             13.00n ±  1%    12.56n ±  1%  -3.35% (p=0.000 n=10)    13.34n ±  1%   +2.65% (p=0.000 n=10)    13.72n ±  1%   +5.58% (p=0.000 n=10)    13.72n ±  1%   +5.58% (p=0.000 n=10)
MultiWriteSyncer/4_discarder-4                             17.14n ±  0%    16.68n ±  1%  -2.71% (p=0.000 n=10)    18.01n ±  0%   +5.05% (p=0.000 n=10)    17.33n ±  0%   +1.08% (p=0.000 n=10)    17.33n ±  0%   +1.08% (p=0.000 n=10)
MultiWriteSyncer/4_discarder_with_buffer-4                 222.3n ± 16%    234.7n ±  3%  +5.58% (p=0.018 n=10)    223.9n ±  7%        ~ (p=0.436 n=10)    227.0n ±  5%        ~ (p=0.210 n=10)    227.0n ±  5%        ~ (p=0.210 n=10)
MulWorkspaceDense1000Hundredth-4                           246.4m ±  0%    246.6m ±  0%       ~ (p=0.315 n=10)    253.4m ±  0%   +2.82% (p=0.000 n=10)    246.6m ±  0%        ~ (p=0.436 n=10)    246.6m ±  0%        ~ (p=0.436 n=10)
MuxBrodcast-4                                              679.8n ±  8%    642.1n ±  5%       ~ (p=0.726 n=10)    699.5n ±  8%   +2.89% (p=0.023 n=10)    655.3n ±  6%        ~ (p=0.912 n=10)    655.3n ±  6%        ~ (p=0.912 n=10)
ParallelBrodcast-4                                         631.0n ±  0%    628.8n ±  0%       ~ (p=0.093 n=10)    639.7n ±  1%   +1.38% (p=0.000 n=10)    627.9n ±  1%        ~ (p=0.404 n=10)    627.9n ±  1%        ~ (p=0.404 n=10)
ParallelDirectSend-4                                       422.2n ±  1%    423.9n ±  2%       ~ (p=0.986 n=10)    422.8n ±  3%        ~ (p=0.869 n=10)    419.8n ±  2%        ~ (p=0.671 n=10)    419.8n ±  2%        ~ (p=0.671 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:100-4    40.07µ ±  2%    41.10µ ±  2%  +2.57% (p=0.002 n=10)    41.92µ ±  2%   +4.62% (p=0.000 n=10)    41.60µ ±  2%   +3.83% (p=0.002 n=10)    41.60µ ±  2%   +3.83% (p=0.002 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:10-4     55.91µ ±  1%    58.00µ ±  3%  +3.74% (p=0.000 n=10)    55.52µ ±  1%        ~ (p=0.353 n=10)    56.92µ ±  3%        ~ (p=0.075 n=10)    56.92µ ±  3%        ~ (p=0.075 n=10)
ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-4      317.3µ ±  2%    313.5µ ±  2%       ~ (p=0.105 n=10)    316.3µ ±  2%        ~ (p=0.684 n=10)    313.5µ ±  1%        ~ (p=0.143 n=10)    313.5µ ±  1%        ~ (p=0.143 n=10)
ParallelizeUntil/pieces:999,workers:10,chunkSize:13-4      51.70µ ±  2%    53.37µ ±  1%  +3.25% (p=0.001 n=10)    50.90µ ±  1%   -1.54% (p=0.035 n=10)    52.29µ ±  4%        ~ (p=0.393 n=10)    52.29µ ±  4%        ~ (p=0.393 n=10)
ParseBigBytes-4                                            1.695µ ±  0%    1.669µ ±  0%  -1.53% (p=0.000 n=10)    1.695µ ±  0%        ~ (p=0.563 n=10)    1.667µ ±  0%   -1.62% (p=0.000 n=10)    1.667µ ±  0%   -1.62% (p=0.000 n=10)
PendingDemotion10000-4                                     161.4n ±  1%    158.1n ±  0%  -2.07% (p=0.000 n=10)    168.6n ±  0%   +4.43% (p=0.000 n=10)    162.2n ±  0%   +0.43% (p=0.015 n=10)    162.2n ±  0%   +0.43% (p=0.015 n=10)
Polygon-4                                                  7.874µ ±  2%    7.881µ ±  3%       ~ (p=1.000 n=10)    7.981µ ±  1%   +1.37% (p=0.015 n=10)    7.849µ ±  0%   -0.30% (p=0.007 n=10)    7.849µ ±  0%   -0.30% (p=0.007 n=10)
PoolBatchInsert10000-4                                      1.243 ±  0%     1.239 ±  0%  -0.37% (p=0.000 n=10)     1.264 ±  0%   +1.65% (p=0.000 n=10)     1.244 ±  0%        ~ (p=0.353 n=10)     1.244 ±  0%        ~ (p=0.353 n=10)
PoolBatchLocalInsert10000-4                                 1.234 ±  0%     1.235 ±  0%       ~ (p=0.971 n=10)     1.260 ±  0%   +2.07% (p=0.000 n=10)     1.216 ±  0%   -1.43% (p=0.000 n=10)     1.216 ±  0%   -1.43% (p=0.000 n=10)
RecursiveFunc-4                                            16.22µ ±  0%    15.95µ ±  1%  -1.67% (p=0.000 n=10)    16.18µ ±  0%   -0.23% (p=0.001 n=10)    14.97µ ±  1%   -7.73% (p=0.000 n=10)    14.97µ ±  1%   -7.73% (p=0.000 n=10)
ReflectorResyncChanMany-4                                  605.5n ±  3%    605.4n ±  2%       ~ (p=0.481 n=10)    613.4n ±  2%        ~ (p=0.436 n=10)    598.7n ±  5%        ~ (p=0.218 n=10)    598.7n ±  5%        ~ (p=0.218 n=10)
RegexMatch-4                                               1.350µ ±  2%    1.362µ ±  2%       ~ (p=0.239 n=10)    1.389µ ±  1%   +2.89% (p=0.001 n=10)    1.374µ ±  1%        ~ (p=0.085 n=10)    1.374µ ±  1%        ~ (p=0.085 n=10)
RepeatExecProgram-4                                        15.40µ ±  1%    15.43µ ±  1%       ~ (p=0.280 n=10)    15.88µ ±  1%   +3.09% (p=0.000 n=10)    15.44µ ±  1%        ~ (p=0.928 n=10)    15.44µ ±  1%        ~ (p=0.928 n=10)
RepeatIOExecProgram-4                                      26.27µ ±  1%    26.23µ ±  2%       ~ (p=0.853 n=10)    28.74µ ±  7%   +9.41% (p=0.000 n=10)    26.33µ ±  2%        ~ (p=1.000 n=10)    26.33µ ±  2%        ~ (p=1.000 n=10)
RepeatIONew-4                                              1.082µ ±  0%    1.101µ ±  0%  +1.80% (p=0.000 n=10)    1.126µ ±  0%   +4.07% (p=0.000 n=10)    1.110µ ±  0%   +2.59% (p=0.000 n=10)    1.110µ ±  0%   +2.59% (p=0.000 n=10)
RepeatNew-4                                                110.0n ±  2%    101.0n ±  2%  -8.18% (p=0.000 n=10)    114.1n ±  0%   +3.73% (p=0.000 n=10)    112.0n ±  2%   +1.86% (p=0.049 n=10)    112.0n ±  2%   +1.86% (p=0.049 n=10)
Run/10k/1-4                                                 31.44 ±  0%     31.29 ±  1%  -0.48% (p=0.015 n=10)     31.57 ±  1%        ~ (p=0.218 n=10)     31.24 ±  1%   -0.66% (p=0.002 n=10)     31.24 ±  1%   -0.66% (p=0.002 n=10)
Run/10k/16-4                                                11.44 ±  1%     11.39 ±  1%       ~ (p=0.315 n=10)     11.58 ±  1%   +1.21% (p=0.005 n=10)     11.42 ±  1%        ~ (p=0.393 n=10)     11.42 ±  1%        ~ (p=0.393 n=10)
Sampler_Check/100_keys-4                                   39.74n ±  2%    40.08n ±  2%       ~ (p=0.838 n=10)    39.98n ±  1%        ~ (p=0.579 n=10)    39.66n ±  1%        ~ (p=0.447 n=10)    39.66n ±  1%        ~ (p=0.447 n=10)
Sampler_Check/50_keys-4                                    40.31n ±  2%    40.36n ±  1%       ~ (p=0.796 n=10)    40.70n ±  2%        ~ (p=0.631 n=10)    40.04n ±  1%        ~ (p=0.105 n=10)    40.04n ±  1%        ~ (p=0.105 n=10)
Sampler_Check/7_keys-4                                     37.88n ±  6%    36.34n ± 14%       ~ (p=0.190 n=10)    38.95n ±  4%        ~ (p=0.393 n=10)    39.51n ±  8%        ~ (p=0.315 n=10)    39.51n ±  8%        ~ (p=0.315 n=10)
Sampler_CheckWithHook/100_keys-4                           92.89n ±  7%    98.77n ± 22%       ~ (p=0.895 n=10)    94.59n ± 19%        ~ (p=0.971 n=10)    92.81n ± 18%        ~ (p=0.481 n=10)    92.81n ± 18%        ~ (p=0.481 n=10)
Sampler_CheckWithHook/50_keys-4                            95.25n ±  6%   100.40n ± 24%       ~ (p=0.898 n=10)    96.34n ± 21%        ~ (p=1.000 n=10)    95.36n ± 21%        ~ (p=0.739 n=10)    95.36n ± 21%        ~ (p=0.739 n=10)
Sampler_CheckWithHook/7_keys-4                             90.03n ±  5%    93.94n ± 21%       ~ (p=0.971 n=10)    90.95n ± 19%        ~ (p=0.796 n=10)    90.13n ± 19%        ~ (p=0.739 n=10)    90.13n ± 19%        ~ (p=0.739 n=10)
ScaleVec10000Inc20-4                                       28.92µ ±  1%    28.80µ ±  1%       ~ (p=0.927 n=10)    28.91µ ±  1%        ~ (p=0.897 n=10)    28.91µ ±  1%        ~ (p=1.000 n=10)    28.91µ ±  1%        ~ (p=1.000 n=10)
ScopeTaggedNoCachedSubscopes-4                             4.376µ ±  2%    4.342µ ±  2%       ~ (p=0.165 n=10)    4.461µ ±  1%   +1.94% (p=0.002 n=10)    4.366µ ±  1%        ~ (p=0.481 n=10)    4.366µ ±  1%        ~ (p=0.481 n=10)
SeparateTestAndAdd-4                                       438.1n ±  0%    440.0n ±  0%  +0.45% (p=0.002 n=10)    438.6n ±  0%        ~ (p=0.435 n=10)    437.5n ±  0%        ~ (p=0.078 n=10)    437.5n ±  0%        ~ (p=0.078 n=10)
StandardJSON-4                                             1.104µ ±  1%    1.100µ ±  1%       ~ (p=0.084 n=10)    1.116µ ±  0%   +1.04% (p=0.005 n=10)    1.099µ ±  1%   -0.50% (p=0.018 n=10)    1.099µ ±  1%   -0.50% (p=0.018 n=10)
StdlibJSON-4                                               3.288µ ±  0%    3.212µ ±  0%  -2.31% (p=0.000 n=10)    3.366µ ±  0%   +2.36% (p=0.000 n=10)    3.208µ ±  0%   -2.42% (p=0.000 n=10)    3.208µ ±  0%   -2.42% (p=0.000 n=10)
TarjanSCCGnp_1000_half-4                                   115.4m ±  1%    114.6m ±  0%  -0.70% (p=0.029 n=10)    119.2m ±  1%   +3.26% (p=0.000 n=10)    114.2m ±  1%   -1.04% (p=0.002 n=10)    114.2m ±  1%   -1.04% (p=0.002 n=10)
TarjanSCCGnp_10_tenth-4                                    10.23µ ±  1%    10.24µ ±  1%       ~ (p=0.839 n=10)    10.28µ ±  1%   +0.47% (p=0.041 n=10)    10.17µ ±  1%        ~ (p=0.075 n=10)    10.17µ ±  1%        ~ (p=0.075 n=10)
TeeCheck-4                                                 322.6n ±  4%    309.5n ±  5%  -4.06% (p=0.046 n=10)    318.2n ±  5%        ~ (p=0.305 n=10)    314.9n ±  4%        ~ (p=0.393 n=10)    314.9n ±  4%        ~ (p=0.393 n=10)
TextMovementBetweenSegments-4                              516.4µ ±  0%    515.1µ ±  0%       ~ (p=0.052 n=10)    514.8µ ±  0%   -0.30% (p=0.035 n=10)    515.5µ ±  0%        ~ (p=0.247 n=10)    515.5µ ±  0%        ~ (p=0.247 n=10)
ValidateVersionTildeFail-4                                 750.5n ±  0%    750.3n ±  0%       ~ (p=0.839 n=10)    758.3n ±  0%   +1.05% (p=0.000 n=10)    751.2n ±  0%        ~ (p=0.109 n=10)    751.2n ±  0%        ~ (p=0.109 n=10)
WalkAllBreadthFirstGnp_1000_tenth-4                        15.93m ±  0%    15.85m ±  0%  -0.48% (p=0.000 n=10)    16.45m ±  1%   +3.24% (p=0.000 n=10)    15.78m ±  1%   -0.95% (p=0.015 n=10)    15.78m ±  1%   -0.95% (p=0.015 n=10)
WalkAllBreadthFirstGnp_10_tenth-4                          4.831µ ±  1%    4.773µ ±  1%  -1.19% (p=0.015 n=10)    4.970µ ±  1%   +2.88% (p=0.000 n=10)    4.755µ ±  1%   -1.56% (p=0.015 n=10)    4.755µ ±  1%   -1.56% (p=0.015 n=10)
WriteSyncer/write_file_with_no_buffer-4                    1.093µ ±  1%    1.098µ ±  1%       ~ (p=0.403 n=10)    1.108µ ±  1%   +1.42% (p=0.001 n=10)    1.088µ ±  1%        ~ (p=0.403 n=10)    1.088µ ±  1%        ~ (p=0.403 n=10)
ZapConsole-4                                               849.5n ±  1%    856.0n ±  1%  +0.77% (p=0.041 n=10)    871.4n ±  1%   +2.58% (p=0.000 n=10)    866.7n ±  1%   +2.02% (p=0.000 n=10)    866.7n ±  1%   +2.02% (p=0.000 n=10)
ZapJSON-4                                                  535.6n ±  1%    540.6n ±  1%  +0.93% (p=0.012 n=10)    552.0n ±  1%   +3.05% (p=0.000 n=10)    549.6n ±  1%   +2.60% (p=0.000 n=10)    549.6n ±  1%   +2.60% (p=0.000 n=10)
geomean                                                    14.03µ          14.02µ        -0.09%                   14.27µ         +1.71%                   14.09µ         +0.43%                   14.09µ         +0.43%

                 │ 2024.08.16_14.27.server.go.pgo.stat │ 2024.08.16_14.27.server.go.pgobb.stat │ 2024.08.16_14.27.server.go.pgogreed.stat │ 2024.08.16_14.27.server.go.pgobbgreed.stat │ 2024.08.16_14.27.server.go.pgobbbbgreed.stat │
                 │                req/s                │     req/s       vs base               │       req/s        vs base               │        req/s         vs base               │       req/s         vs base                  │
Etcd_l10c50w                               3.884k ± 1%      3.893k ± 1%       ~ (p=0.353 n=10)         3.799k ± 2%  -2.19% (p=0.002 n=10)           3.908k ± 2%       ~ (p=0.315 n=10)          3.865k ± 1%       ~ (p=0.530 n=10+20)
Etcd_r                                     3.883k ± 0%      3.893k ± 1%       ~ (p=0.143 n=10)         3.834k ± 0%  -1.26% (p=0.000 n=10)           3.854k ± 1%  -0.73% (p=0.002 n=10)          3.844k ± 0%  -1.00% (p=0.000 n=10+20)
Etcd_l1c1w                                  137.4 ± 1%       134.5 ± 2%  -2.07% (p=0.007 n=10)          133.1 ± 2%  -3.11% (p=0.000 n=10)            138.8 ± 2%       ~ (p=0.631 n=10)           137.6 ± 1%       ~ (p=0.914 n=10+20)
Etcd_l100c1000w                            24.03k ± 1%      24.05k ± 2%       ~ (p=0.529 n=10)         23.38k ± 1%  -2.70% (p=0.000 n=10)           23.44k ± 2%  -2.45% (p=0.019 n=10)          23.26k ± 1%  -3.20% (p=0.000 n=10+20)
Etcd_l1c1lr                                3.865k ± 0%      3.886k ± 1%       ~ (p=0.089 n=10)         3.798k ± 0%  -1.73% (p=0.000 n=10)           3.818k ± 0%  -1.21% (p=0.000 n=10)          3.817k ± 0%  -1.25% (p=0.000 n=10+20)
Etcd_l100c1000lr                           48.59k ± 0%      48.78k ± 2%       ~ (p=0.218 n=10)         48.07k ± 1%  -1.08% (p=0.011 n=10)           48.44k ± 1%       ~ (p=0.393 n=10)          48.66k ± 1%       ~ (p=0.588 n=10+20)
geomean                                    4.590k           4.585k       -0.10%                        4.497k       -2.02%                          4.566k       -0.52%                         4.546k       -0.95%

Some notes on the results:

                        │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                        │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
BinaryTree17 (arm)         4.150 ± 0%    4.194 ± 0%  +1.05% (p=0.000 n=10)     4.244 ± 0%   +2.26% (p=0.000 n=10)     4.255 ± 0%   +2.52% (p=0.000 n=10)     4.194 ± 0%   +1.05% (p=0.000 n=10)
BinaryTree17 (x86)         3.231 ± 0%    3.258 ± 1%  +0.83% (p=0.000 n=10)     3.338 ± 1%   +3.31% (p=0.000 n=10)     3.270 ± 1%   +1.20% (p=0.000 n=10)     3.235 ± 1%        ~ (p=0.190 n=10)
Fannkuch11 (arm)           2.820 ± 0%    2.861 ± 0%  +1.48% (p=0.000 n=10)     2.835 ± 0%   +0.53% (p=0.000 n=10)     2.861 ± 3%   +1.47% (p=0.000 n=10)     2.850 ± 0%   +1.09% (p=0.000 n=10)
Fannkuch11 (x86)           2.915 ± 0%    2.919 ± 0%  +0.12% (p=0.000 n=10)     2.907 ± 0%   -0.29% (p=0.002 n=10)     2.909 ± 0%   -0.21% (p=0.000 n=10)     2.869 ± 0%   -1.60% (p=0.000 n=10)

On Kunpeng, BinaryTree17 and Fannkuch11 depend strongly on code alignment, so the results may vary from version to version. On x86, Fannkuch11 improves, and the best result (-1.60%) comes from the interaction between pgobb and layout (pgobbbbgreed).

                        │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                        │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
GobDecode (arm)            7.400m ± 0%   7.460m ± 1%  +0.81% (p=0.000 n=10)    7.463m ± 0%  +0.85% (p=0.000 n=10)    7.462m ± 0%   +0.84% (p=0.000 n=10)    7.496m ± 1%   +1.30% (p=0.000 n=10)
GobDecode (x86)            5.404m ± 1%   5.272m ± 0%  -2.46% (p=0.000 n=10)    5.571m ± 1%  +3.09% (p=0.000 n=10)    5.478m ± 2%   +1.36% (p=0.002 n=10)    5.570m ± 2%   +3.07% (p=0.000 n=10)
GobEncode (arm)            5.697m ± 1%   5.688m ± 0%       ~ (p=0.912 n=10)    5.711m ± 1%       ~ (p=0.315 n=10)    5.743m ± 2%        ~ (p=0.218 n=10)    5.753m ± 2%   +0.98% (p=0.043 n=10)
GobEncode (x86)            4.461m ± 0%   4.377m ± 1%  -1.88% (p=0.000 n=10)    4.438m ± 1%       ~ (p=0.165 n=10)    4.540m ± 1%   +1.76% (p=0.000 n=10)    4.533m ± 1%   +1.60% (p=0.002 n=10)

Here we see that Gob improves with pgobb on x86 and mostly degrades with the other combinations on x86. The arm results are almost unchanged.

                        │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                        │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
Gzip (arm)                273.3m ± 0%   272.3m ± 0%   -0.37% (p=0.003 n=10)    272.7m ± 0%        ~ (p=0.165 n=10)    272.9m ± 0%        ~ (p=0.218 n=10)    273.2m ± 0%        ~ (p=1.000 n=10)
Gzip (x86)                297.5m ± 0%   300.7m ± 0%   +1.08% (p=0.000 n=10)    303.1m ± 1%   +1.90% (p=0.000 n=10)    302.6m ± 0%   +1.72% (p=0.000 n=10)    301.9m ± 1%   +1.51% (p=0.000 n=10)
Gunzip (arm)               30.65m ± 0%   30.81m ± 0%  +0.52% (p=0.000 n=10)    30.11m ± 0%   -1.77% (p=0.000 n=10)    30.30m ± 0%   -1.15% (p=0.000 n=10)    29.98m ± 0%   -2.21% (p=0.000 n=10)
Gunzip (x86)               30.15m ± 0%   30.16m ± 1%       ~ (p=0.684 n=10)    30.36m ± 0%   +0.71% (p=0.000 n=10)    30.46m ± 1%   +1.04% (p=0.000 n=10)    29.83m ± 1%   -1.05% (p=0.000 n=10)

We see that Gzip is slower on x86 in every mode, but the pgobbbbgreed mode improves Gunzip on both x86 and arm.

                        │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                        │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
TimeParse (arm)           261.2n ± 0%   253.5n ± 0%  -2.97% (p=0.000 n=10)    303.8n ± 0%  +16.27% (p=0.000 n=10)    302.1n ± 0%  +15.62% (p=0.000 n=10)    306.5n ± 0%  +17.32% (p=0.000 n=10)
TimeParse (x86)           249.1n ± 0%   252.4n ± 0%  +1.37% (p=0.000 n=10)    265.5n ± 0%   +6.59% (p=0.000 n=10)    263.0n ± 0%   +5.60% (p=0.000 n=10)    271.5n ± 0%   +9.01% (p=0.000 n=10)
TimeFormat (arm)          250.1n ± 0%   243.6n ± 0%  -2.56% (p=0.000 n=10)    252.8n ± 0%   +1.08% (p=0.000 n=10)    253.4n ± 0%   +1.34% (p=0.000 n=10)    258.0n ± 0%   +3.18% (p=0.000 n=10)
TimeFormat (x86)          226.0n ± 0%   224.9n ± 0%  -0.49% (p=0.000 n=10)    238.8n ± 0%   +5.66% (p=0.000 n=10)    232.0n ± 0%   +2.65% (p=0.000 n=10)    252.6n ± 0%  +11.77% (p=0.000 n=10)

Here we see that plain pgobb shows an improvement (mostly on arm), while the greedy modes degrade. This is an issue for investigation.

                           │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                           │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
BaseTest2KB (arm)           992.7n ± 0%    992.3n ±  0%   -0.04% (p=0.000 n=10)    992.4n  ±  0%   -0.03% (p=0.000 n=10)   1584.0n ±  0%  +59.57% (p=0.000 n=10)   1584.0n  ±  0%  +59.57% (p=0.000 n=10)
BaseTest2KB (x86)           801.7n ± 0%    801.5n ±  0%       ~  (p=0.378 n=10)   1012.0n  ±  0%  +26.23% (p=0.000 n=10)   1012.0n ±  0%  +26.23% (p=0.000 n=10)   1012.0n  ±  0%  +26.23% (p=0.000 n=10)
Encoding4KBVerySparse (arm) 20.29µ ± 1%    20.33µ ±  1%       ~  (p=0.315 n=10)     18.60µ ±  1%   -8.30% (p=0.000 n=10)    20.23µ ±  1%        ~ (p=0.436 n=10)     20.33µ ±  1%        ~ (p=0.529 n=10)
Encoding4KBVerySparse (x86) 18.71µ ± 0%    18.76µ ±  0%   +0.28% (p=0.008 n=10)     20.52µ ±  0%   +9.70% (p=0.000 n=10)    18.79µ ±  0%   +0.41% (p=0.000 n=10)     18.79µ ±  0%   +0.41% (p=0.000 n=10)
FastTest2KB (arm)           992.4n ± 0%    992.3n ±  0%       ~  (p=0.496 n=10)    992.5n  ±  0%   +0.01% (p=0.006 n=10)   1584.0n ±  0%  +59.61% (p=0.000 n=10)    992.3n  ±  0%        ~ (p=0.725 n=10)
FastTest2KB (x86)           167.0n ± 0%    167.0n ±  0%       ~  (p=0.720 n=10)    168.6n  ±  0%   +0.93% (p=0.000 n=10)    168.6n ±  0%   +0.93% (p=0.000 n=10)    168.6n  ±  0%   +0.93% (p=0.000 n=10)

On Kunpeng this test depends strongly on code alignment, but we see a big degradation even on x86 with the greedy algorithm. This is an issue for investigation.

                           │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                           │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
RecursiveFunc (arm)         15.88µ ±  1%    17.27µ ± 1%   +8.80% (p=0.000 n=10)    17.66µ ±  1%  +11.20% (p=0.000 n=10)    17.29µ ±  1%   +8.91% (p=0.000 n=10)    17.75µ ±  0%  +11.76% (p=0.000 n=10)
RecursiveFunc (x86)         16.22µ ±  0%    15.95µ ± 1%   -1.67% (p=0.000 n=10)    16.18µ ±  0%   -0.23% (p=0.001 n=10)    14.97µ ±  1%   -7.73% (p=0.000 n=10)    14.97µ ±  1%   -7.73% (p=0.000 n=10)

We see a degradation on Kunpeng and an improvement on x86, probably because of a code alignment issue. There is also a good improvement for the pgobb*greed modes on x86.

                           │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                           │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
FuturePromotion10000 (arm)   3.370n ±  0%    3.370n ±  0%     ~ (p=0.387 n=10)    3.370n ±  0%      ~ (p=1.000 n=10)    3.370n ±  0%        ~ (p=0.387 n=10)    3.466n ±  0%   +2.85% (p=0.000 n=10)
FuturePromotion10000 (x86)   3.093n ±  0%    3.093n ±  0%     ~ (p=0.911 n=10)    3.093n ±  0%      ~ (p=0.806 n=10)    3.479n ±  0%  +12.48% (p=0.000 n=10)    3.479n ±  0%  +12.48% (p=0.000 n=10)

This test degrades on x86 with pgobbgreed. This is an issue for investigation.

                           │ pgo.out.log  │           pgobb.out.log            │          pgogreed.out.log           │          pgobbgreed.out.log          │         pgobbbbgreed.out.log         │
                           │    sec/op    │   sec/op     vs base               │    sec/op     vs base               │    sec/op     vs base                │    sec/op     vs base                │
RepeatNew (arm)              163.1n ± 1%    155.6n ± 2%  -4.57% (p=0.000 n=10)    157.6n ±  0%   -3.34% (p=0.000 n=10)    155.2n ±  0%   -4.84% (p=0.000 n=10)    160.4n ±  1%   -1.66% (p=0.001 n=10)
RepeatNew (x86)              110.0n ± 2%    101.0n ± 2%  -8.18% (p=0.000 n=10)    114.1n ±  0%   +3.73% (p=0.000 n=10)    112.0n ±  2%   +1.86% (p=0.049 n=10)    112.0n ±  2%   +1.86% (p=0.049 n=10)

Good improvements for pgobb on both platforms, but degradations for the greedy algorithm on x86.

DecodehealingTracker (arm)                                 864.6n ±   1%    966.1n ±   1%  +11.74% (p=0.000 n=10)    879.9n ±  1%   +1.77% (p=0.000 n=10)                                              855.5n ±  1%   -1.05% (p=0.000 n=10)
OpDiv128-4                                                 126.2n ±   1%    125.8n ±   0%        ~ (p=0.059 n=10)    136.0n ±  1%   +7.76% (p=0.000 n=10)    138.4n ±  1%   +9.58% (p=0.000 n=10)      135.6n ±  0%   +7.41% (p=0.000 n=10)

We also use pgobb in our production code, where it shows a 1.5% performance improvement, which is a good result.

Summary

pgobb itself is not an optimization but a framework for profile-guided optimizations, although it already shows improvements on some tests. We see that the average benchmark results are not as good as they could be, and that should be investigated. I believe that improving pgobb, and PGO in general, will bring us more benefit.

@prattmic
Member

Interesting results, thanks for sharing! I recommend also benchmarking with sweet. Bent has a big set of benchmarks, but mostly microbenchmarks, so they aren't as useful for PGO evaluation. Sweet is "integration" benchmarks: larger applications where it IMO makes more sense to apply PGO.

@jinlin-bayarea
Contributor

The porting of the ext-stp algorithm is incomplete. I do have a data structure to store the edge counter information. In addition, you did not port the frequency propagation file. Please remove the incomplete ext-stp implementation from https://go-review.googlesource.com/c/go/+/605555.

@alexanius
Contributor Author

alexanius commented Aug 20, 2024

> The porting of ext-stp algorithm is incomplete. I do have data structure to store the edge counter information. In addition, you did not port the frequency propagation file. Please remove the incomplete ext-stp implementation from https://go-review.googlesource.com/c/go/+/605555.

Hello. Yes, I mentioned that it is incomplete, and I do not make measurements with it. If you insist, I will remove it, but this patch is in WIP status. Is it possible to keep it until I finish?

UPD: removed

alexanius added a commit to alexanius/go that referenced this issue Sep 11, 2024
Current patch adds the counters to the AST and SSA nodes. The counters
are loaded from the pprof file, no profile format changes needed.

To use basic block counters you should add an option -pgobb:

go build -a -pgobb -pgo=file.prof

Fixes golang#65466

Change-Id: I5d6be7d87f384625259a9ba794744a652060de4e
@gopherbot
Contributor

Change https://go.dev/cl/626996 mentions this issue: cmd/compile,runtime: add column number(discriminator), plumb to pprof

@alexanius
Contributor Author

alexanius commented Nov 27, 2024

Sorry for the delay. Here are the sweet results:

Xeon E5-2690 (x86-64):

                       │ pgo_base.pgo.results │          pgobb.pgo.results          │
                       │        sec/op        │    sec/op     vs base               │
BleveIndexBatch100-4              5.711 ±  3%    5.677 ±  3%       ~ (p=0.912 n=10)
ESBuildThreeJS-4                 696.3m ±  1%   698.5m ±  1%       ~ (p=0.143 n=10)
ESBuildRomeTS-4                  183.0m ±  1%   183.6m ±  1%       ~ (p=0.481 n=10)
EtcdPut-4                        193.9m ±  6%   189.5m ±  7%       ~ (p=0.280 n=10)
EtcdSTM-4                        567.6m ± 17%   559.8m ± 10%       ~ (p=0.529 n=10)
GoBuildKubelet-4                  140.8 ±  0%    140.8 ±  0%       ~ (p=0.481 n=10)
GoBuildKubeletLink-4              10.49 ±  0%    10.50 ±  0%       ~ (p=0.739 n=10)
GoBuildIstioctl-4                 112.4 ±  0%    112.4 ±  0%       ~ (p=0.631 n=10)
GoBuildIstioctlLink-4             10.75 ±  0%    10.77 ±  0%       ~ (p=0.315 n=10)
GoBuildFrontend-4                 40.58 ±  0%    40.67 ±  0%       ~ (p=0.165 n=10)
GoBuildFrontendLink-4             1.563 ±  1%    1.576 ±  0%  +0.80% (p=0.009 n=10)
GopherLuaKNucleotide-4            27.45 ±  3%    27.29 ±  1%  -0.60% (p=0.043 n=10)
MarkdownRenderXHTML-4            254.1m ±  0%   248.3m ±  0%  -2.28% (p=0.000 n=10)
Tile38QueryLoad-4                567.6µ ±  1%   559.1µ ±  1%  -1.50% (p=0.004 n=10)
geomean                           2.115          2.105        -0.49%

                       │ pgo_base.pgo.results │            pgobb.pgo.results             │
                       │  average-RSS-bytes   │ average-RSS-bytes  vs base               │
BleveIndexBatch100-4             202.6Mi ± 1%        202.5Mi ± 2%       ~ (p=0.912 n=10)
EtcdPut-4                        109.7Mi ± 2%        110.8Mi ± 2%       ~ (p=0.052 n=10)
EtcdSTM-4                        100.5Mi ± 1%        102.3Mi ± 1%  +1.79% (p=0.015 n=10)
GopherLuaKNucleotide-4           34.62Mi ± 2%        34.72Mi ± 2%       ~ (p=0.739 n=10)
MarkdownRenderXHTML-4            20.08Mi ± 4%        20.06Mi ± 4%       ~ (p=0.811 n=10)
Tile38QueryLoad-4                5.752Gi ± 1%        5.736Gi ± 0%       ~ (p=0.218 n=10)
geomean                          144.6Mi             145.2Mi       +0.43%

                       │ pgo_base.pgo.results │           pgobb.pgo.results           │
                       │    peak-RSS-bytes    │ peak-RSS-bytes  vs base               │
BleveIndexBatch100-4             290.9Mi ± 2%     291.9Mi ± 2%       ~ (p=0.853 n=10)
EtcdPut-4                        150.2Mi ± 2%     147.2Mi ± 3%       ~ (p=0.052 n=10)
EtcdSTM-4                        126.3Mi ± 2%     128.4Mi ± 2%       ~ (p=0.089 n=10)
GopherLuaKNucleotide-4           37.50Mi ± 0%     37.45Mi ± 1%       ~ (p=0.271 n=10)
MarkdownRenderXHTML-4            21.34Mi ± 1%     21.38Mi ± 1%       ~ (p=0.725 n=10)
Tile38QueryLoad-4                5.846Gi ± 1%     5.843Gi ± 0%       ~ (p=0.631 n=10)
geomean                          172.6Mi          172.6Mi       -0.01%

                       │ pgo_base.pgo.results │          pgobb.pgo.results           │
                       │    peak-VM-bytes     │ peak-VM-bytes  vs base               │
BleveIndexBatch100-4             1.933Gi ± 0%    1.933Gi ± 0%       ~ (p=0.269 n=10)
EtcdPut-4                        11.26Gi ± 0%    11.26Gi ± 0%       ~ (p=0.560 n=10)
EtcdSTM-4                        11.26Gi ± 0%    11.26Gi ± 0%       ~ (p=0.069 n=10)
GopherLuaKNucleotide-4           1.174Gi ± 0%    1.174Gi ± 0%  +0.00% (p=0.011 n=10)
MarkdownRenderXHTML-4            1.174Gi ± 0%    1.174Gi ± 0%  -0.00% (p=0.049 n=10)
Tile38QueryLoad-4                6.995Gi ± 0%    6.995Gi ± 0%       ~ (p=0.271 n=10)
geomean                          3.649Gi         3.649Gi       -0.00%

                  │ pgo_base.pgo.results │           pgobb.pgo.results            │
                  │   p50-latency-sec    │ p50-latency-sec  vs base               │
EtcdPut-4                   185.8m ±  6%      179.0m ±  7%       ~ (p=0.218 n=10)
EtcdSTM-4                   364.2m ± 16%      359.4m ± 11%       ~ (p=0.631 n=10)
Tile38QueryLoad-4           266.8µ ±  0%      263.4µ ±  0%  -1.24% (p=0.000 n=10)
geomean                     26.23m            25.69m        -2.07%

                  │ pgo_base.pgo.results │           pgobb.pgo.results            │
                  │   p90-latency-sec    │ p90-latency-sec  vs base               │
EtcdPut-4                   274.4m ±  7%       269.9m ± 9%       ~ (p=0.529 n=10)
EtcdSTM-4                    1.208 ± 17%        1.192 ± 9%       ~ (p=0.579 n=10)
Tile38QueryLoad-4           918.7µ ±  1%       875.0µ ± 1%  -4.76% (p=0.000 n=10)
geomean                     67.28m             65.54m       -2.59%

                  │ pgo_base.pgo.results │           pgobb.pgo.results            │
                  │   p99-latency-sec    │ p99-latency-sec  vs base               │
EtcdPut-4                   348.9m ±  9%      342.3m ± 14%       ~ (p=0.218 n=10)
EtcdSTM-4                    2.875 ± 17%       2.765 ± 10%       ~ (p=0.436 n=10)
Tile38QueryLoad-4           5.234m ±  2%      5.244m ±  2%       ~ (p=0.912 n=10)
geomean                     173.8m            170.6m        -1.85%

                  │ pgo_base.pgo.results │         pgobb.pgo.results          │
                  │        ops/s         │    ops/s     vs base               │
EtcdPut-4                   5.111k ±  5%   5.215k ± 7%       ~ (p=0.280 n=10)
EtcdSTM-4                   1.744k ± 15%   1.766k ± 9%       ~ (p=0.529 n=10)
Tile38QueryLoad-4           5.285k ±  1%   5.365k ± 1%  +1.52% (p=0.004 n=10)
geomean                     3.612k         3.669k       +1.59%

Kunpeng920 (AArch64):

                       │ pgo_base.pgo.results │         pgobb.pgo.results          │
                       │        sec/op        │   sec/op     vs base               │
BleveIndexBatch100-4               8.247 ± 2%    8.113 ± 2%  -1.63% (p=0.005 n=10)
ESBuildThreeJS-4                  892.4m ± 1%   888.4m ± 1%       ~ (p=0.123 n=10)
ESBuildRomeTS-4                   207.0m ± 1%   207.1m ± 1%       ~ (p=0.971 n=10)
EtcdPut-4                         50.80m ± 1%   50.44m ± 1%       ~ (p=0.123 n=10)
EtcdSTM-4                         276.8m ± 1%   270.8m ± 0%  -2.18% (p=0.000 n=10)
GoBuildKubelet-4                   182.1 ± 6%    183.2 ± 7%       ~ (p=0.912 n=10)
GoBuildKubeletLink-4               18.42 ± 6%    18.55 ± 6%       ~ (p=1.000 n=10)
GoBuildIstioctl-4                  136.8 ± 6%    137.4 ± 5%       ~ (p=0.739 n=10)
GoBuildIstioctlLink-4              11.93 ± 7%    11.94 ± 9%       ~ (p=0.853 n=10)
GoBuildFrontend-4                  48.22 ± 1%    48.32 ± 1%       ~ (p=0.247 n=10)
GoBuildFrontendLink-4              2.219 ± 2%    2.206 ± 4%  -0.59% (p=0.043 n=10)
GopherLuaKNucleotide-4             33.64 ± 1%    33.54 ± 0%  -0.31% (p=0.011 n=10)
MarkdownRenderXHTML-4             278.3m ± 0%   278.1m ± 0%       ~ (p=0.481 n=10)
Tile38QueryLoad-4                 1.031m ± 1%   1.025m ± 1%  -0.56% (p=0.043 n=10)
geomean                            2.307         2.300       -0.31%

                       │ pgo_base.pgo.results │            pgobb.pgo.results             │
                       │  average-RSS-bytes   │ average-RSS-bytes  vs base               │
BleveIndexBatch100-4             187.6Mi ± 1%        187.3Mi ± 1%       ~ (p=0.739 n=10)
EtcdPut-4                        107.3Mi ± 3%        110.0Mi ± 2%  +2.49% (p=0.011 n=10)
EtcdSTM-4                        97.81Mi ± 1%       101.04Mi ± 1%  +3.31% (p=0.000 n=10)
GopherLuaKNucleotide-4           34.35Mi ± 3%        34.36Mi ± 3%       ~ (p=0.739 n=10)
MarkdownRenderXHTML-4            20.03Mi ± 4%        20.12Mi ± 3%       ~ (p=0.529 n=10)
Tile38QueryLoad-4                5.878Gi ± 1%        5.874Gi ± 2%       ~ (p=0.912 n=10)
geomean                          141.9Mi             143.3Mi       +0.99%

                       │ pgo_base.pgo.results │           pgobb.pgo.results           │
                       │    peak-RSS-bytes    │ peak-RSS-bytes  vs base               │
BleveIndexBatch100-4             272.6Mi ± 2%     271.6Mi ± 1%       ~ (p=0.218 n=10)
EtcdPut-4                        147.4Mi ± 4%     149.8Mi ± 4%  +1.64% (p=0.015 n=10)
EtcdSTM-4                        122.4Mi ± 4%     126.7Mi ± 2%  +3.45% (p=0.002 n=10)
GopherLuaKNucleotide-4           36.89Mi ± 3%     36.26Mi ± 3%       ~ (p=0.529 n=10)
MarkdownRenderXHTML-4            21.18Mi ± 1%     21.22Mi ± 1%       ~ (p=0.781 n=10)
Tile38QueryLoad-4                5.974Gi ± 2%     5.969Gi ± 2%       ~ (p=0.853 n=10)
geomean                          169.3Mi          170.1Mi       +0.51%

                       │ pgo_base.pgo.results │          pgobb.pgo.results           │
                       │    peak-VM-bytes     │ peak-VM-bytes  vs base               │
BleveIndexBatch100-4             1.869Gi ± 3%    1.869Gi ± 3%       ~ (p=0.339 n=10)
EtcdPut-4                        11.26Gi ± 1%    11.29Gi ± 0%       ~ (p=0.070 n=10)
EtcdSTM-4                        11.25Gi ± 0%    11.25Gi ± 0%  +0.00% (p=0.000 n=10)
GopherLuaKNucleotide-4           1.174Gi ± 0%    1.174Gi ± 0%  +0.00% (p=0.011 n=10)
MarkdownRenderXHTML-4            1.174Gi ± 0%    1.174Gi ± 0%       ~ (p=0.883 n=10)
Tile38QueryLoad-4                7.121Gi ± 2%    7.122Gi ± 1%       ~ (p=0.912 n=10)
geomean                          3.639Gi         3.641Gi       +0.05%

                  │ pgo_base.pgo.results │           pgobb.pgo.results            │
                  │   p50-latency-sec    │ p50-latency-sec  vs base               │
EtcdPut-4                    47.85m ± 2%       47.69m ± 1%       ~ (p=0.353 n=10)
EtcdSTM-4                    199.7m ± 2%       194.9m ± 1%  -2.44% (p=0.000 n=10)
Tile38QueryLoad-4            342.7µ ± 1%       340.4µ ± 1%  -0.67% (p=0.019 n=10)
geomean                      14.85m            14.68m       -1.15%

                  │ pgo_base.pgo.results │           pgobb.pgo.results            │
                  │   p90-latency-sec    │ p90-latency-sec  vs base               │
EtcdPut-4                    74.33m ± 2%       73.63m ± 1%       ~ (p=0.052 n=10)
EtcdSTM-4                    553.2m ± 1%       541.4m ± 0%  -2.14% (p=0.000 n=10)
Tile38QueryLoad-4            1.394m ± 1%       1.382m ± 1%  -0.84% (p=0.004 n=10)
geomean                      38.55m            38.05m       -1.31%

                  │ pgo_base.pgo.results │           pgobb.pgo.results            │
                  │   p99-latency-sec    │ p99-latency-sec  vs base               │
EtcdPut-4                    103.2m ± 2%       105.3m ± 4%       ~ (p=0.165 n=10)
EtcdSTM-4                     1.100 ± 1%        1.076 ± 1%  -2.22% (p=0.000 n=10)
Tile38QueryLoad-4            11.39m ± 1%       11.33m ± 1%       ~ (p=0.393 n=10)
geomean                      108.9m            108.7m       -0.23%

                  │ pgo_base.pgo.results │         pgobb.pgo.results          │
                  │        ops/s         │    ops/s     vs base               │
EtcdPut-4                    18.66k ± 1%   18.81k ± 1%       ~ (p=0.075 n=10)
EtcdSTM-4                    3.534k ± 1%   3.602k ± 0%  +1.90% (p=0.000 n=10)
Tile38QueryLoad-4            2.910k ± 1%   2.927k ± 1%  +0.58% (p=0.037 n=10)
geomean                      5.768k        5.831k       +1.09%

CockroachDB is not included here because it behaved strangely; I will re-evaluate it.

UPD:

I tried to re-evaluate CockroachDB, but it is very unstable even with the compiler version that does not include my patches.

For x86 I get the following error:

2024/12/02 10:39:36 running benchmark tool
/home/amarkin/repo/benchmarks/sweet/ttt2/cockroachdb/pgo_base.profile/bin/cockroach workload run kv --read-percent=50 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --scatter --splits=5 --ramp=15s --duration=1m postgres://root@localhost:26257?sslmode=disable
=== Benchmarking tool stderr ===
I241202 07:39:34.225286 2 workload/cli/run.go:640  [-] 1  random seed: 3228517985462086202
I241202 07:39:36.187696 2 workload/cli/run.go:640  [-] 1  random seed: -3317581241380407703
I241202 07:39:36.187834 2 workload/cli/run.go:432  [-] 2  creating load generator...
I241202 07:39:36.762913 2 workload/cli/run.go:471  [-] 3  creating load generator... done (took 575.076142ms)
W241202 07:39:36.768173 2 workload/pgx_helpers.go:240  [-] 4  error preparing statement. name=kv-2 sql=SELECT k, v FROM kv AS OF SYSTEM TIME follower_read_timestamp() WHERE k IN ($1) ERROR: database "kv" does not exist (SQLSTATE 3D000)
W241202 07:39:36.768319 2 workload/pgx_helpers.go:240  [-] 5  error preparing statement. name=kv-2 sql=SELECT k, v FROM kv AS OF SYSTEM TIME follower_read_timestamp() WHERE k IN ($1) ERROR: database "kv" does not exist (SQLSTATE 3D000)
...
W241202 07:39:39.749456 2 workload/pgx_helpers.go:240  [-] 5637  error preparing statement. name=kv-2 sql=SELECT k, v FROM kv AS OF SYSTEM TIME follower_read_timestamp() WHERE k IN ($1) ERROR: database "kv" does not exist (SQLSTATE 3D000)
Error: failed to connect to `host=localhost user=root database=kv`: dial error (timeout: dial tcp 127.0.0.1:26257: connect: connection timed out)

2024/12/02 10:40:23 shutting down cluster
=== Instance "roach-node" stdout+stderr ===

On ARM some runs fail, but we can still see the results:

                           │ results/cockroachdb/pgo_base.pgo.results │ results/cockroachdb/pgobb.pgo.results │
                           │                  sec/op                  │     sec/op       vs base              │
CockroachDBkv0/nodes=1-48                                53.58µ ± 11%      55.76µ ± 16%       ~ (p=0.382 n=8)
CockroachDBkv50/nodes=1-48                               47.97µ ± 13%      44.66µ ± 11%       ~ (p=0.959 n=8)
CockroachDBkv95/nodes=1-48                               35.36µ ±  1%      35.25µ ±  1%       ~ (p=0.111 n=8)
CockroachDBkv0/nodes=3-48                                56.48µ ± 14%      57.98µ ± 12%       ~ (p=0.505 n=8)
CockroachDBkv50/nodes=3-48                               46.93µ ±  4%      47.67µ ± 14%       ~ (p=0.878 n=8)
CockroachDBkv95/nodes=3-48                               35.45µ ±  4%      35.20µ ±  1%       ~ (p=0.161 n=8)
geomean                                                  45.21µ            45.21µ        +0.00%

                           │ results/cockroachdb/pgo_base.pgo.results │  results/cockroachdb/pgobb.pgo.results  │
                           │            average-RSS-bytes             │ average-RSS-bytes  vs base              │
CockroachDBkv0/nodes=1-48                                9.318Gi ± 8%       9.362Gi ±  5%       ~ (p=0.798 n=8)
CockroachDBkv50/nodes=1-48                               8.696Gi ± 9%       8.991Gi ± 10%       ~ (p=0.195 n=8)
CockroachDBkv95/nodes=1-48                               6.026Gi ± 2%       6.005Gi ±  2%       ~ (p=0.505 n=8)
CockroachDBkv0/nodes=3-48                                9.250Gi ± 4%       9.425Gi ±  4%  +1.89% (p=0.050 n=8)
CockroachDBkv50/nodes=3-48                               8.815Gi ± 4%       8.675Gi ±  9%       ~ (p=0.065 n=8)
CockroachDBkv95/nodes=3-48                               6.170Gi ± 4%       6.167Gi ±  4%       ~ (p=0.959 n=8)
geomean                                                  7.914Gi            7.963Gi        +0.62%

                           │ results/cockroachdb/pgo_base.pgo.results │ results/cockroachdb/pgobb.pgo.results │
                           │              peak-RSS-bytes              │ peak-RSS-bytes   vs base              │
CockroachDBkv0/nodes=1-48                               14.45Gi ± 13%     14.41Gi ±  5%       ~ (p=0.798 n=8)
CockroachDBkv50/nodes=1-48                              12.91Gi ±  7%     13.30Gi ± 10%       ~ (p=0.279 n=8)
CockroachDBkv95/nodes=1-48                              9.199Gi ±  6%     8.949Gi ±  5%       ~ (p=0.161 n=8)
CockroachDBkv0/nodes=3-48                               13.93Gi ±  5%     14.73Gi ±  6%       ~ (p=0.161 n=8)
CockroachDBkv50/nodes=3-48                              13.05Gi ±  5%     12.89Gi ±  9%       ~ (p=0.195 n=8)
CockroachDBkv95/nodes=3-48                              9.457Gi ±  8%     9.335Gi ±  6%       ~ (p=0.505 n=8)
geomean                                                 11.98Gi           12.04Gi        +0.50%

                           │ results/cockroachdb/pgo_base.pgo.results │ results/cockroachdb/pgobb.pgo.results │
                           │              peak-VM-bytes               │  peak-VM-bytes   vs base              │
CockroachDBkv0/nodes=1-48                                17.05Gi ± 9%      16.80Gi ± 4%       ~ (p=0.328 n=8)
CockroachDBkv50/nodes=1-48                               15.22Gi ± 4%      15.54Gi ± 6%       ~ (p=0.234 n=8)
CockroachDBkv95/nodes=1-48                               11.52Gi ± 6%      11.20Gi ± 4%       ~ (p=0.161 n=8)
CockroachDBkv0/nodes=3-48                                16.52Gi ± 4%      16.84Gi ± 4%       ~ (p=0.234 n=8)
CockroachDBkv50/nodes=3-48                               15.36Gi ± 4%      15.16Gi ± 5%       ~ (p=0.065 n=8)
CockroachDBkv95/nodes=3-48                               11.79Gi ± 8%      11.66Gi ± 4%       ~ (p=0.442 n=8)
geomean                                                  14.41Gi           14.34Gi       -0.45%

                           │ results/cockroachdb/pgo_base.pgo.results │    results/cockroachdb/pgobb.pgo.results    │
                           │          write-avg-latency-sec           │ write-avg-latency-sec  vs base              │
CockroachDBkv0/nodes=1-48                                509.9m ± 30%            531.9m ± 36%       ~ (p=0.382 n=8)
CockroachDBkv50/nodes=1-48                               621.0m ± 44%            572.7m ± 27%       ~ (p=0.878 n=8)
CockroachDBkv95/nodes=1-48                               344.8m ± 11%            318.8m ± 12%       ~ (p=0.234 n=8)
CockroachDBkv0/nodes=3-48                                527.2m ± 33%            555.2m ± 14%       ~ (p=0.382 n=8)
CockroachDBkv50/nodes=3-48                               602.4m ±  6%            603.9m ± 46%       ~ (p=0.798 n=8)
CockroachDBkv95/nodes=3-48                               382.4m ± 37%            360.8m ± 16%       ~ (p=0.505 n=8)
geomean                                                  486.5m                  476.8m        -1.99%

                           │ results/cockroachdb/pgo_base.pgo.results │    results/cockroachdb/pgobb.pgo.results     │
                           │          write-p100-latency-sec          │ write-p100-latency-sec  vs base              │
CockroachDBkv0/nodes=1-48                                 6.845 ± 25%              7.114 ± 21%       ~ (p=0.364 n=8)
CockroachDBkv50/nodes=1-48                                8.724 ± 29%              8.187 ± 25%       ~ (p=0.694 n=8)
CockroachDBkv95/nodes=1-48                                4.832 ± 11%              4.698 ± 14%       ~ (p=0.350 n=8)
CockroachDBkv0/nodes=3-48                                 6.845 ± 25%              7.919 ± 19%       ~ (p=0.195 n=8)
CockroachDBkv50/nodes=3-48                                8.858 ±  9%              8.187 ± 15%       ~ (p=0.111 n=8)
CockroachDBkv95/nodes=3-48                                5.235 ± 33%              4.563 ± 12%       ~ (p=0.151 n=8)
geomean                                                   6.714                    6.577        -2.04%

                           │ results/cockroachdb/pgo_base.pgo.results │    results/cockroachdb/pgobb.pgo.results    │
                           │          write-p50-latency-sec           │ write-p50-latency-sec  vs base              │
CockroachDBkv0/nodes=1-48                               37.75m ± 100%           21.50m ±  75%       ~ (p=0.069 n=8)
CockroachDBkv50/nodes=1-48                              22.00m ± 110%           23.60m ± 175%       ~ (p=0.591 n=8)
CockroachDBkv95/nodes=1-48                              2.100m ±  10%           1.950m ±  28%       ~ (p=0.418 n=8)
CockroachDBkv0/nodes=3-48                               25.70m ±  39%           27.25m ±  31%       ~ (p=0.937 n=8)
CockroachDBkv50/nodes=3-48                              22.55m ±  63%           20.45m ±  71%       ~ (p=0.555 n=8)
CockroachDBkv95/nodes=3-48                              2.100m ±  24%           2.200m ±  18%       ~ (p=0.738 n=8)
geomean                                                 11.34m                  10.33m         -8.90%

                           │ results/cockroachdb/pgo_base.pgo.results │    results/cockroachdb/pgobb.pgo.results    │
                           │          write-p95-latency-sec           │ write-p95-latency-sec  vs base              │
CockroachDBkv0/nodes=1-48                                 2.416 ± 22%             2.416 ± 22%       ~ (p=0.483 n=8)
CockroachDBkv50/nodes=1-48                                3.087 ± 26%             2.752 ± 17%       ~ (p=0.437 n=8)
CockroachDBkv95/nodes=1-48                                2.550 ±  5%             2.550 ±  5%       ~ (p=0.689 n=8)
CockroachDBkv0/nodes=3-48                                 2.416 ± 17%             2.483 ±  8%       ~ (p=0.324 n=8)
CockroachDBkv50/nodes=3-48                                2.953 ±  5%             2.819 ± 24%  -4.54% (p=0.042 n=8)
CockroachDBkv95/nodes=3-48                                2.617 ± 13%             2.550 ±  5%       ~ (p=0.438 n=8)
geomean                                                   2.661                   2.591        -2.63%

                           │ results/cockroachdb/pgo_base.pgo.results │    results/cockroachdb/pgobb.pgo.results    │
                           │          write-p99-latency-sec           │ write-p99-latency-sec  vs base              │
CockroachDBkv0/nodes=1-48                                 3.221 ± 21%             3.288 ± 27%       ~ (p=0.426 n=8)
CockroachDBkv50/nodes=1-48                                3.892 ± 21%             3.691 ± 24%       ~ (p=0.486 n=8)
CockroachDBkv95/nodes=1-48                                3.423 ±  6%             3.490 ± 12%       ~ (p=0.613 n=8)
CockroachDBkv0/nodes=3-48                                 3.154 ± 19%             3.423 ±  6%       ~ (p=0.135 n=8)
CockroachDBkv50/nodes=3-48                                3.825 ±  5%             3.624 ± 22%  -5.26% (p=0.027 n=8)
CockroachDBkv95/nodes=3-48                                3.624 ± 15%             3.355 ±  8%       ~ (p=0.212 n=8)
geomean                                                   3.512                   3.476        -1.03%

                           │ results/cockroachdb/pgo_base.pgo.results │ results/cockroachdb/pgobb.pgo.results │
                           │              write-ops/sec               │  write-ops/sec   vs base              │
CockroachDBkv0/nodes=1-48                                18.66k ± 14%      17.93k ± 10%       ~ (p=0.382 n=8)
CockroachDBkv50/nodes=1-48                               10.48k ± 14%      11.25k ± 12%       ~ (p=0.878 n=8)
CockroachDBkv95/nodes=1-48                               1.440k ±  1%      1.446k ±  1%       ~ (p=0.168 n=8)
CockroachDBkv0/nodes=3-48                                17.70k ± 14%      17.24k ± 11%       ~ (p=0.505 n=8)
CockroachDBkv50/nodes=3-48                               10.72k ±  7%      10.54k ±  5%       ~ (p=0.878 n=8)
CockroachDBkv95/nodes=3-48                               1.434k ±  2%      1.455k ±  1%  +1.46% (p=0.010 n=8)
geomean                                                  6.517k            6.525k        +0.12%

                           │ results/cockroachdb/pgo_base.pgo.results │ results/cockroachdb/pgobb.pgo.results │
                           │                write-ops                 │    write-ops     vs base              │
CockroachDBkv0/nodes=1-48                                1.120M ± 14%      1.076M ± 10%       ~ (p=0.382 n=8)
CockroachDBkv50/nodes=1-48                               629.0k ± 14%      675.2k ± 12%       ~ (p=0.959 n=8)
CockroachDBkv95/nodes=1-48                               86.40k ±  1%      86.84k ±  1%       ~ (p=0.161 n=8)
CockroachDBkv0/nodes=3-48                                1.062M ± 14%      1.035M ± 11%       ~ (p=0.505 n=8)
CockroachDBkv50/nodes=3-48                               643.6k ±  7%      632.9k ±  5%       ~ (p=0.878 n=8)
CockroachDBkv95/nodes=3-48                               86.10k ±  2%      87.32k ±  1%  +1.42% (p=0.010 n=8)
geomean                                                  391.2k            391.7k        +0.12%

                           │ results/cockroachdb/pgo_base.pgo.results │   results/cockroachdb/pgobb.pgo.results    │
                           │           read-avg-latency-sec           │ read-avg-latency-sec  vs base              │
CockroachDBkv50/nodes=1-48                               117.4m ± 40%           110.7m ± 17%       ~ (p=1.000 n=8)
CockroachDBkv95/nodes=1-48                               76.30m ± 17%           69.35m ± 13%       ~ (p=0.290 n=8)
CockroachDBkv50/nodes=3-48                               111.1m ±  9%           105.5m ± 28%       ~ (p=0.130 n=8)
CockroachDBkv95/nodes=3-48                               73.50m ± 37%           71.35m ± 19%       ~ (p=0.382 n=8)
geomean                                                  92.46m                 87.19m        -5.70%

                           │ results/cockroachdb/pgo_base.pgo.results │    results/cockroachdb/pgobb.pgo.results    │
                           │          read-p100-latency-sec           │ read-p100-latency-sec  vs base              │
CockroachDBkv50/nodes=1-48                                7.785 ± 21%             7.382 ± 24%       ~ (p=0.588 n=8)
CockroachDBkv95/nodes=1-48                                4.832 ±  6%             4.966 ±  8%       ~ (p=0.847 n=8)
CockroachDBkv50/nodes=3-48                                8.322 ± 10%             8.053 ± 17%       ~ (p=0.127 n=8)
CockroachDBkv95/nodes=3-48                                5.100 ± 26%             4.698 ± 11%       ~ (p=0.126 n=8)
geomean                                                   6.321                   6.102        -3.46%

                           │ results/cockroachdb/pgo_base.pgo.results │   results/cockroachdb/pgobb.pgo.results    │
                           │           read-p50-latency-sec           │ read-p50-latency-sec  vs base              │
CockroachDBkv50/nodes=1-48                              2.800m ± 182%          3.200m ± 153%       ~ (p=0.663 n=8)
CockroachDBkv95/nodes=1-48                              1.000m ±   0%          1.000m ±  10%       ~ (p=0.785 n=8)
CockroachDBkv50/nodes=3-48                              2.600m ±  46%          3.050m ±  57%       ~ (p=1.000 n=8)
CockroachDBkv95/nodes=3-48                              1.000m ±  10%          1.000m ±  10%       ~ (p=0.949 n=8)
geomean                                                 1.643m                 1.768m         +7.60%

                           │ results/cockroachdb/pgo_base.pgo.results │    results/cockroachdb/pgobb.pgo.results    │
                           │           read-p95-latency-sec           │ read-p95-latency-sec  vs base               │
CockroachDBkv50/nodes=1-48                               511.7m ± 52%           444.6m ± 28%        ~ (p=1.000 n=8)
CockroachDBkv95/nodes=1-48                               285.2m ± 26%           243.3m ± 34%        ~ (p=0.584 n=8)
CockroachDBkv50/nodes=3-48                               427.8m ± 26%           402.7m ± 33%        ~ (p=0.368 n=8)
CockroachDBkv95/nodes=3-48                               293.6m ± 46%           264.2m ± 33%        ~ (p=0.878 n=8)
geomean                                                  368.0m                 327.5m        -10.99%

                           │ results/cockroachdb/pgo_base.pgo.results │   results/cockroachdb/pgobb.pgo.results    │
                           │           read-p99-latency-sec           │ read-p99-latency-sec  vs base              │
CockroachDBkv50/nodes=1-48                                2.349 ± 26%            2.080 ± 26%       ~ (p=0.448 n=8)
CockroachDBkv95/nodes=1-48                                1.946 ± 14%            1.879 ±  4%       ~ (p=0.115 n=8)
CockroachDBkv50/nodes=3-48                                2.349 ±  9%            2.282 ± 32%       ~ (p=0.103 n=8)
CockroachDBkv95/nodes=3-48                                1.913 ± 30%            1.845 ± 13%       ~ (p=0.163 n=8)
geomean                                                   2.129                  2.014        -5.38%

                           │ results/cockroachdb/pgo_base.pgo.results │ results/cockroachdb/pgobb.pgo.results │
                           │               read-ops/sec               │  read-ops/sec    vs base              │
CockroachDBkv50/nodes=1-48                               10.36k ± 14%      11.14k ± 12%       ~ (p=0.959 n=8)
CockroachDBkv95/nodes=1-48                               26.83k ±  0%      26.92k ±  1%       ~ (p=0.195 n=8)
CockroachDBkv50/nodes=3-48                               10.58k ±  7%      10.44k ±  4%       ~ (p=0.798 n=8)
CockroachDBkv95/nodes=3-48                               26.77k ±  1%      26.96k ±  1%       ~ (p=0.161 n=8)
geomean                                                  16.75k            17.04k        +1.75%

                           │ results/cockroachdb/pgo_base.pgo.results │ results/cockroachdb/pgobb.pgo.results │
                           │                 read-ops                 │    read-ops      vs base              │
CockroachDBkv50/nodes=1-48                               621.7k ± 14%      668.3k ± 12%       ~ (p=0.959 n=8)
CockroachDBkv95/nodes=1-48                               1.610M ±  0%      1.615M ±  1%       ~ (p=0.195 n=8)
CockroachDBkv50/nodes=3-48                               634.8k ±  7%      626.7k ±  4%       ~ (p=0.878 n=8)
CockroachDBkv95/nodes=3-48                               1.607M ±  1%      1.617M ±  1%       ~ (p=0.161 n=8)
geomean                                                  1.005M            1.023M        +1.75%

@alexanius
Contributor Author

@prattmic Hello. I had some issues launching the PGO benchmarks in sweet, and I also made a few improvements to pgobb, but now that is finished. It would be great to get your feedback on the results and on the patch in general.

@prattmic
Member

prattmic commented Dec 3, 2024

@JunyangShao has also been prototyping with PGO basic block layout in https://go.dev/cl/628655, and I think has been seeing similar results to you.

@alexanius Do I understand correctly that https://go.dev/cl/605555 is your primary CL?

It would be good to compare the approaches and see if they can learn from one another.

@alexanius
Contributor Author

> @JunyangShao has also been prototyping with PGO basic block layout in https://go.dev/cl/628655, and I think has been seeing similar results to you.

Our approaches differ somewhat. In @JunyangShao's approach, as I understand it, the basic block counters are loaded on the SSA form just before basic block layout. This is the most precise way to load counters, but in this case only the layout and schedule passes can take advantage of the basic block profile.

In my approach, we load counters on the AST together with the regular profile. This gives less precise counters on the AST nodes (the precision may be improved with discriminators), but it allows us to implement all the optimizations from #62463. For example, we ran some experiments with partial inlining based on pgobb.
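To illustrate why line-level counters on the AST are less precise, here is a minimal sketch. The `Node` type, `loadCounters`, and the sample numbers are hypothetical and are not taken from the CL; the sketch only assumes that profile samples are keyed by source line.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for an AST node; only its source line matters here.
type Node struct {
	Name    string
	Line    int
	Counter int64
}

// loadCounters gives every node the sample count recorded for its source line.
// Nodes on lines without samples keep a zero counter.
func loadCounters(nodes []*Node, samplesByLine map[int]int64) {
	for _, n := range nodes {
		n.Counter = samplesByLine[n.Line]
	}
}

func main() {
	// Imagine `if c { hot() } else { cold() }` written on a single line 10.
	nodes := []*Node{
		{Name: "cond", Line: 10},
		{Name: "then", Line: 10},
		{Name: "else", Line: 10},
		{Name: "after", Line: 11},
	}
	loadCounters(nodes, map[int]int64{10: 900, 11: 300})
	for _, n := range nodes {
		fmt.Printf("%-5s line %d counter %d\n", n.Name, n.Line, n.Counter)
	}
}
```

In this sketch both branches receive the same counter because they share a line, so the profile cannot tell which one is hot. A finer key, such as line plus column, would separate them.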

> @alexanius Do I understand correctly that https://go.dev/cl/605555 is your primary CL?

No, my primary CL is https://go.dev/cl/602015. This CL implements pgobb itself and corrects the likely/unlikely information. It adds no new basic block reordering algorithms. The results above were measured with this CL, so pgobb gives an improvement even with the existing layout and scheduling algorithms.
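As a rough illustration of what correcting likely/unlikely information from counters can mean, here is a minimal sketch. The `Prediction` type and the `predict` function are hypothetical and do not reproduce the logic in the CL; the sketch simply assumes that each successor of a conditional branch has a counter.

```go
package main

import "fmt"

// Prediction is a hypothetical three-state branch hint.
type Prediction int8

const (
	Unlikely Prediction = -1
	Unknown  Prediction = 0
	Likely   Prediction = 1
)

// predict derives a hint for the "taken" successor from the two successor counters.
// With equal counters (including no samples at all) it stays Unknown.
func predict(taken, notTaken int64) Prediction {
	switch {
	case taken > notTaken:
		return Likely
	case taken < notTaken:
		return Unlikely
	default:
		return Unknown
	}
}

func main() {
	fmt.Println(predict(900, 100), predict(5, 70), predict(0, 0))
}
```

A real implementation would probably want a threshold so that nearly equal counters do not flip the hint; that choice is left open here.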

About https://go.dev/cl/605555: that was an experiment in which I combined my pgobb approach with two different basic block layout algorithms. One is the greedy algorithm from https://go.dev/cl/572975. It showed degradations, but I did not investigate the reason. The other CL is https://go.dev/cl/602015, where the ext-tsp basic block sorting was implemented. That was not a correct experiment, because the algorithm needs edge counters, and I have not implemented them yet.

It would be good to compare the approaches and see if they can learn from one another.

I think it is possible. As I said above, we load counters in different compilation passes, and each approach has its own advantages.

@JunyangShao
Contributor

@alexanius I also prototyped a CL that generates the exact edge profile with instrumentation. It could serve as an edge profile ground truth for our passes. :D

https://go.dev/cl/632515

@alexanius
Contributor Author

@JunyangShao I think edge instrumentation is a great idea! Profiling can benefit a lot from it. The lack of it has been a pain point in PGO.

It would be great to get some information about your instrumentation approach:

  • Is there any proposal, design, or discussion for basic block PGO in Go (besides this issue)?
  • Do we need a classical profile-generate + profile-use build, as in gcc/llvm PGO, or is this instrumentation always applied?
  • You increment the edge counter without atomics. Do you plan to add them?
  • Is the Go team ready to introduce an instrumentation approach in the compiler?

@cherrymui
Member

The purpose of this instrumentation is to get more information about a program's edge frequencies, in order to evaluate and verify various approaches to computing them and doing block layout from CPU samples.

Currently there is no plan to add an instrumentation-based approach to the compiler or the general PGO workflow, although the future possibility is open.

The plan for directly supporting edge profiles is to add support for hardware LBR profiles (see also #36821). This has the advantage of an easy-to-use AutoFDO-style workflow, with more accurate edge information.

we load counters on different compilation passes and have different advantages.

Totally agree. Another possibility is to load the counters on demand at different stages. When the SSA backend needs the counters, it loads them directly in SSA. If the frontend needs them on the AST, it loads them directly to the AST. There might be some duplicate work, e.g. applying fixups for inlining and devirtualization. Or it might be possible to keep the counters in a side table and apply the fixups there. I am not sure whether this is more complex than propagating and maintaining the counters throughout the passes.
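As a rough sketch of the side-table idea, assuming a table keyed by source position and a simple proportional fixup for inlining: the `pos` and `counterTable` types and the `scaleForInline` function are hypothetical, and the scaling rule (callee counter times call-site count divided by callee entry count) is only one possible way to re-evaluate callee counters at an inline point.

```go
package main

import "fmt"

// pos is a hypothetical stand-in for a source position.
type pos struct {
	fn   string
	line int
}

// counterTable is a hypothetical side table from source position to counter.
type counterTable map[pos]int64

// scaleForInline returns a copy of the callee's counters rescaled for one
// call site: each counter is multiplied by callSite/calleeEntry. The
// original table is left untouched, so other call sites can reuse it.
func scaleForInline(callee counterTable, calleeEntry, callSite int64) counterTable {
	out := make(counterTable, len(callee))
	for p, c := range callee {
		if calleeEntry == 0 {
			out[p] = 0 // no samples in the callee: nothing to distribute
			continue
		}
		out[p] = c * callSite / calleeEntry
	}
	return out
}

func main() {
	callee := counterTable{
		{"f", 10}: 1000, // entry of f, summed over all callers
		{"f", 12}: 250,  // a colder branch inside f
	}
	// This call site accounts for 200 of the 1000 entries into f.
	inlined := scaleForInline(callee, 1000, 200)
	fmt.Println(inlined[pos{"f", 10}], inlined[pos{"f", 12}]) // 200 50
}
```

Because the fixup produces a new table per call site, both the frontend and the SSA backend could consult it without counters having to be carried on the IR nodes themselves.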

Thanks.

@JunyangShao
Contributor

@alexanius

  • Is there any proposal, design, or discussion for basic block PGO in Go (besides this issue)?
  • Do we need a classical profile-generate + profile-use build, as in gcc/llvm PGO, or is this instrumentation always applied?
  • Is the Go team ready to introduce an instrumentation approach in the compiler?
    We don't have a plan to merge this yet. This CL is only used to generate an edge profile ground truth for testing the pprof-based PGOBB passes.
  • You increment the edge counter without atomics. Do you plan to add them?
    Yes!
