I used Hyperfine to run a bunch of benchmarks on my Monte Carlo Pi simulation in Rust, and I wanted to share what I learned.
These tests were done on my October 2020 iMac with a 3.6 GHz 10-Core Intel Core i9 processor and 32 GB of RAM.
The column to pay attention to is "relative", where 1.00 represents the best run. I will also have that line bolded.
First, the development binary, 1 million points on a 1 million x 1 million grid:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/debug/monte_carlo -c 1000000 -b 1000 -n 1 -g 1000000 | 893.1 ± 8.2 | 887.0 | 915.7 | 3.77 ± 0.09 |
target/debug/monte_carlo -c 1000000 -b 1000 -n 2 -g 1000000 | 449.7 ± 3.5 | 446.2 | 456.1 | 1.90 ± 0.04 |
target/debug/monte_carlo -c 1000000 -b 1000 -n 3 -g 1000000 | 305.4 ± 4.5 | 302.1 | 315.6 | 1.29 ± 0.03 |
**target/debug/monte_carlo -c 1000000 -b 1000 -n 4 -g 1000000** | **236.7 ± 5.0** | **230.7** | **245.6** | **1.00** |
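For anyone who hasn't seen the simulation, here is a minimal sketch of the kind of work being timed. It is not the actual monte_carlo source: the function names, the tiny xorshift PRNG (used only so the sketch needs no external crates), and the way `count`, `batch`, `threads` and `grid` map onto the `-c`, `-b`, `-n` and `-g` flags are all assumptions made for illustration.

```rust
use std::thread;

/// Tiny xorshift64 PRNG so the sketch has no dependencies.
/// (Assumption: the real program most likely uses a proper RNG crate.)
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Count how many of `batch` random grid points fall inside the quarter
/// circle whose radius is the grid size.
fn hits_in_batch(rng: &mut XorShift64, batch: u64, grid: u64) -> u64 {
    let r2 = (grid as u128) * (grid as u128);
    let mut hits = 0;
    for _ in 0..batch {
        let x = (rng.next_u64() % grid) as u128;
        let y = (rng.next_u64() % grid) as u128;
        if x * x + y * y <= r2 {
            hits += 1;
        }
    }
    hits
}

/// Estimate Pi from `count` points split evenly across `threads` workers,
/// each of which works through its share in batches of `batch` points.
fn estimate_pi(count: u64, batch: u64, threads: u64, grid: u64) -> f64 {
    let per_thread = count / threads;
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            thread::spawn(move || {
                // Give every thread its own non-zero seed.
                let mut rng = XorShift64((t + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15));
                let mut hits = 0;
                let mut remaining = per_thread;
                while remaining > 0 {
                    let n = remaining.min(batch);
                    hits += hits_in_batch(&mut rng, n, grid);
                    remaining -= n;
                }
                hits
            })
        })
        .collect();
    let hits: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    // Area of quarter circle / area of square = Pi / 4.
    4.0 * hits as f64 / (per_thread * threads) as f64
}

fn main() {
    // Roughly the shape of `-c 1000000 -b 1000 -n 4 -g 1000000`.
    let pi = estimate_pi(1_000_000, 1_000, 4, 1_000_000);
    println!("pi is roughly {}", pi);
}
```

Each thread only reports a hit count at the end, so the threads never wait on each other while sampling, which is why this kind of workload tends to scale well with thread count.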
Now let's try the release/production binary with the same settings:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 1000000 -b 1000 -n 1 -g 1000000 | 16.7 ± 0.8 | 15.8 | 23.9 | 2.56 ± 0.23 |
target/release/monte_carlo -c 1000000 -b 1000 -n 2 -g 1000000 | 10.2 ± 0.3 | 9.6 | 11.6 | 1.56 ± 0.13 |
target/release/monte_carlo -c 1000000 -b 1000 -n 3 -g 1000000 | 7.8 ± 0.9 | 7.1 | 19.3 | 1.19 ± 0.17 |
**target/release/monte_carlo -c 1000000 -b 1000 -n 4 -g 1000000** | **6.5 ± 0.5** | **5.9** | **9.3** | **1.00** |
Whoa! Much faster! That last run took just 6.5 milliseconds, about 36x faster than the same four-thread run with the debug binary (236.7 ms)!
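One way to read these tables is in terms of speedup and parallel efficiency. The sketch below is just arithmetic on the mean times from the two tables above; the helper names are my own and not part of Hyperfine.

```rust
/// Speedup of an n-thread run relative to the single-thread run.
fn speedup(t1_ms: f64, tn_ms: f64) -> f64 {
    t1_ms / tn_ms
}

/// Parallel efficiency: speedup divided by thread count (1.0 = perfect scaling).
fn efficiency(t1_ms: f64, tn_ms: f64, threads: u32) -> f64 {
    speedup(t1_ms, tn_ms) / threads as f64
}

fn main() {
    // (label, mean ms with 1 thread, mean ms with 4 threads), from the tables above.
    let runs = [("debug", 893.1, 236.7), ("release", 16.7, 6.5)];
    for &(name, t1, t4) in runs.iter() {
        println!(
            "{}: speedup {:.2}x, efficiency {:.0}%",
            name,
            speedup(t1, t4),
            100.0 * efficiency(t1, t4, 4)
        );
    }
}
```

The debug build scales almost perfectly (about 3.77x on four threads, or 94% efficiency), while the release build only reaches about 2.57x (64%). At 6.5 ms total, fixed costs such as process startup and thread spawning are plausibly a much bigger share of the run, though I haven't measured that directly.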
Going forward, we'll stick with the release binary, and let's try 10 million points:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 1000 -n 1 -g 1000000 | 139.3 ± 1.3 | 136.9 | 142.2 | 3.32 ± 0.20 |
target/release/monte_carlo -c 10000000 -b 1000 -n 2 -g 1000000 | 78.6 ± 1.8 | 76.9 | 87.9 | 1.87 ± 0.12 |
target/release/monte_carlo -c 10000000 -b 1000 -n 3 -g 1000000 | 54.2 ± 3.3 | 51.8 | 74.4 | 1.29 ± 0.11 |
**target/release/monte_carlo -c 10000000 -b 1000 -n 4 -g 1000000** | **42.0 ± 2.6** | **39.3** | **58.2** | **1.00** |
Now let's try upping the batch size (the number of points generated per batch in a thread) from 1000 to 10,000:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 1000000 | 135.4 ± 1.7 | 132.7 | 139.5 | 3.40 ± 0.13 |
target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 1000000 | 71.5 ± 2.7 | 69.8 | 87.0 | 1.80 ± 0.09 |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 1000000 | 49.9 ± 1.0 | 48.2 | 52.3 | 1.25 ± 0.05 |
**target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 1000000** | **39.8 ± 1.4** | **37.6** | **45.2** | **1.00** |
We got a slight speedup, so let's try a batch size of 100,000 points:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 100000 -n 1 -g 1000000 | 150.8 ± 4.8 | 145.9 | 168.5 | 3.22 ± 0.23 |
target/release/monte_carlo -c 10000000 -b 100000 -n 2 -g 1000000 | 80.8 ± 2.0 | 77.6 | 84.4 | 1.72 ± 0.12 |
target/release/monte_carlo -c 10000000 -b 100000 -n 3 -g 1000000 | 58.1 ± 2.9 | 54.2 | 72.9 | 1.24 ± 0.10 |
**target/release/monte_carlo -c 10000000 -b 100000 -n 4 -g 1000000** | **46.8 ± 3.0** | **43.9** | **64.9** | **1.00** |
Oops--that got worse. Let's go back to 10,000 points per batch, and try Turbo Mode:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 1000000 --turbo | 129.4 ± 1.9 | 127.3 | 135.7 | 3.37 ± 0.21 |
target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 1000000 --turbo | 68.3 ± 1.7 | 66.9 | 78.4 | 1.78 ± 0.12 |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 1000000 --turbo | 48.0 ± 1.4 | 46.2 | 53.1 | 1.25 ± 0.08 |
**target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 1000000 --turbo** | **38.4 ± 2.3** | **35.8** | **53.1** | **1.00** |
That was a small but consistent boost: roughly 4% at every thread count (for example, 135.4 ms down to 129.4 ms with one thread).
Now let's try caching:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 1000000 --cache | 225.9 ± 4.0 | 220.1 | 234.4 | 2.24 ± 0.07 |
target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 1000000 --cache | 146.4 ± 2.7 | 142.7 | 155.0 | 1.45 ± 0.05 |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 1000000 --cache | 116.9 ± 6.5 | 111.7 | 142.4 | 1.16 ± 0.07 |
**target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 1000000 --cache** | **101.0 ± 2.7** | **96.9** | **107.0** | **1.00** |
Huh, that took much longer: with four threads, about 2.5x as long as the same run without the cache (101.0 ms versus 39.8 ms)!
Let's try pre-computing the cache instead:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 1000000 --cache-precompute | 220.5 ± 7.4 | 210.3 | 240.9 | 2.35 ± 0.14 |
target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 1000000 --cache-precompute | 139.1 ± 3.5 | 134.2 | 145.7 | 1.48 ± 0.08 |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 1000000 --cache-precompute | 109.6 ± 3.4 | 104.8 | 116.6 | 1.17 ± 0.07 |
**target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 1000000 --cache-precompute** | **93.7 ± 4.5** | **89.7** | **114.9** | **1.00** |
Well, it got a little better, but with four threads it's still roughly 2.4x the time it took in turbo mode (93.7 ms versus 38.4 ms). It seems that the extra clock cycles spent consulting the cache just aren't worth it.
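The cache code itself isn't shown in this post, so treat the following as a rough illustration only. It assumes the cache is a table holding the square of every possible coordinate, so that `x * x` turns into an array lookup. The function names and that design are hypothetical, not taken from the real program.

```rust
/// Direct test: two multiplications on values already in registers.
fn inside_direct(x: u64, y: u64, r2: u64) -> bool {
    x * x + y * y <= r2
}

/// Cached test: two reads from a table with one entry per coordinate value.
/// (Hypothetical design; the real cache may store something else.)
fn inside_cached(x: u64, y: u64, r2: u64, squares: &[u64]) -> bool {
    squares[x as usize] + squares[y as usize] <= r2
}

fn main() {
    let grid: u64 = 1_000_000; // same grid size as -g 1000000
    // "Pre-computing" here means filling the whole table before sampling starts.
    let squares: Vec<u64> = (0..grid).map(|v| v * v).collect();
    let r2 = grid * grid;

    // Both tests must agree on every point; here is one spot check.
    let (x, y) = (600_000, 800_000);
    println!(
        "direct: {}, cached: {}",
        inside_direct(x, y, r2),
        inside_cached(x, y, r2, &squares)
    );
}
```

Under that assumption the result makes sense: an integer multiplication costs only a few cycles, while a lookup at a random index in a table this large will often miss the fastest CPU caches, so the "shortcut" ends up being the slower path.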
Now let's try jacking up the grid from 1 million to 100 million points per axis:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 100000000 | 204.5 ± 2.9 | 202.0 | 213.3 | 3.53 ± 0.09 |
target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 100000000 | 106.8 ± 2.8 | 105.1 | 119.0 | 1.84 ± 0.06 |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 100000000 | 75.1 ± 6.0 | 71.7 | 108.4 | 1.29 ± 0.11 |
**target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 100000000** | **58.0 ± 1.2** | **56.3** | **61.0** | **1.00** |
Not too much longer than the earlier run on the 1 million grid with the same settings (58.0 ms versus 39.8 ms with four threads), but not much more accurate either: I'm still only seeing Pi calculated correctly to three decimal places.
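The lack of improvement is what the statistics predict. Each random point is a hit with probability Pi/4, so the expected error of the estimate depends on how many points are sampled, not on how fine the grid is (once the grid is fine enough). A small sketch of that standard formula, with a function name of my own choosing:

```rust
use std::f64::consts::PI;

/// One-sigma error of the estimate 4 * hits / n.
/// Each point is a hit with probability p = PI / 4, so the variance of the
/// estimate is 16 * p * (1 - p) / n = PI * (4 - PI) / n.
fn pi_std_error(n: u64) -> f64 {
    (PI * (4.0 - PI) / n as f64).sqrt()
}

fn main() {
    for &n in [1_000_000u64, 10_000_000, 100_000_000].iter() {
        println!("{:>11} points: expected error about {:.5}", n, pi_std_error(n));
    }
}
```

With 10 million points the expected error is about 0.0005, which fits seeing roughly three correct decimal places. Since the error shrinks with the square root of the point count, each extra decimal place costs about 100x more points.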
Now let's try that run again with the cache:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 100000000 --cache | 1.043 ± 0.009 | 1.032 | 1.062 | 1.07 ± 0.02 |
**target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 100000000 --cache** | **0.977 ± 0.015** | **0.963** | **1.010** | **1.00** |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 100000000 --cache | 1.016 ± 0.022 | 0.980 | 1.058 | 1.04 ± 0.03 |
target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 100000000 --cache | 1.118 ± 0.182 | 1.047 | 1.635 | 1.14 ± 0.19 |
WOW, not only did that take much longer (5x for a single thread), adding more threads didn't help: two threads was only slightly faster than one, and four threads was the slowest of all. This is probably because each core has its own cache and there was excessive memory usage for an array with 100 million elements. Also, Rust can have an array with 100 million elements!
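To put a number on "excessive memory usage", here is a quick footprint calculation. The element type of the real cache isn't stated in this post, so the sketch shows two hypothetical choices, `u64` and `u8`; the helper name is also made up.

```rust
use std::mem::size_of;

/// Bytes needed for a table with one entry of type T per grid coordinate.
fn table_bytes<T>(entries: usize) -> usize {
    entries * size_of::<T>()
}

fn main() {
    for &entries in [1_000_000usize, 100_000_000].iter() {
        println!(
            "{:>11} entries: {:>4} MB as u64, {:>4} MB as u8",
            entries,
            table_bytes::<u64>(entries) / 1_000_000,
            table_bytes::<u8>(entries) / 1_000_000
        );
    }
}
```

Even with one byte per entry, a 100-million-entry table is 100 MB, and with 8-byte entries it is 800 MB. Either is far larger than the on-chip caches of a desktop CPU, so random lookups mostly end up going to main memory. If each thread keeps its own copy (an assumption on my part), multiply that by the thread count.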
Okay, let's try pre-computing the cache:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
**target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 100000000 --cache-precompute** | **1.089 ± 0.010** | **1.072** | **1.108** | **1.00** |
target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 100000000 --cache-precompute | 1.112 ± 0.012 | 1.093 | 1.138 | 1.02 ± 0.01 |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 100000000 --cache-precompute | 1.253 ± 0.010 | 1.238 | 1.266 | 1.15 ± 0.01 |
target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 100000000 --cache-precompute | 1.431 ± 0.018 | 1.408 | 1.472 | 1.31 ± 0.02 |
Oh god. The numbers got even worse. I suppose pre-populating an array with 100 million elements wasn't the best idea.
Time to get stupid. Let's try a grid with 1 billion points:
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 10000000 -b 10000 -n 1 -g 1000000000 | 143.6 ± 2.9 | 140.0 | 150.4 | 3.42 ± 0.18 |
target/release/monte_carlo -c 10000000 -b 10000 -n 2 -g 1000000000 | 76.3 ± 1.6 | 74.1 | 81.6 | 1.82 ± 0.10 |
target/release/monte_carlo -c 10000000 -b 10000 -n 3 -g 1000000000 | 53.3 ± 2.4 | 51.7 | 67.9 | 1.27 ± 0.08 |
**target/release/monte_carlo -c 10000000 -b 10000 -n 4 -g 1000000000** | **42.0 ± 2.0** | **40.1** | **56.4** | **1.00** |
That actually worked!
Okay, let's get more stupid. Stupider, if you will. Let's try generating 100 million points:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
target/release/monte_carlo -c 100000000 -b 10000 -n 1 -g 1000000000 | 1.388 ± 0.019 | 1.369 | 1.426 | 3.63 ± 0.06 |
target/release/monte_carlo -c 100000000 -b 10000 -n 2 -g 1000000000 | 0.721 ± 0.007 | 0.713 | 0.738 | 1.89 ± 0.03 |
target/release/monte_carlo -c 100000000 -b 10000 -n 3 -g 1000000000 | 0.496 ± 0.007 | 0.487 | 0.508 | 1.30 ± 0.02 |
**target/release/monte_carlo -c 100000000 -b 10000 -n 4 -g 1000000000** | **0.382 ± 0.004** | **0.376** | **0.388** | **1.00** |
Once again, Rust performed like a champ. I'm now getting Pi correct to four decimal places.
Not bad for about 0.4 seconds of work.
Rust code is really fast!