Skip to content

Directly Build UstrMap For Properties #525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

krakow10
Copy link
Contributor

@krakow10 krakow10 commented May 3, 2025

If there's no reason why InstanceBuilder.properties can't be directly assigned to Instance.properties, then let's do that. The same logic is applied to building the properties in rbx_xml. The "Miner's Haven/Deserialize" benchmark says throughput is increased by 30%.

@Dekkonot
Copy link
Member

Dekkonot commented May 3, 2025

For what it's worth, this was a performance decision because a lot of time is spent inserting into that hashmap: #467

On my machine the change in this PR actually degrades performance by quite a lot:
image

Are you able to replicate the boost in performance? What are your system specs like?

@krakow10
Copy link
Contributor Author

krakow10 commented May 4, 2025

Excellent, I assumed it must be that way for performance reasons, but I still don't understand how it's faster to build a vec and then build a hashmap instead of just directly building a hashmap. Is is because of rehashing while allocating more capacity? I'm running the benchmarks on Linux using a 5950X like I mentioned on #529.

I reran the benchmarks first on master and then on properties branch, here is the full output for the properties branch:

     Running benches/suite/main.rs (target/release/deps/suite-4220ebfc15de915e)
Gnuplot not found, using plotters backend
100 Folders/Serialize   time:   [20.673 µs 20.685 µs 20.699 µs]
                        thrpt:  [36.445 MiB/s 36.469 MiB/s 36.490 MiB/s]
                 change:
                        time:   [+0.2859% +0.3785% +0.4745%] (p = 0.00 < 0.05)
                        thrpt:  [-0.4723% -0.3771% -0.2850%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
100 Folders/Deserialize time:   [34.843 µs 34.905 µs 34.985 µs]
                        thrpt:  [20.771 MiB/s 20.819 MiB/s 20.857 MiB/s]
                 change:
                        time:   [-6.5392% -6.4015% -6.2569%] (p = 0.00 < 0.05)
                        thrpt:  [+6.6745% +6.8393% +6.9967%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

100 Deep Folders/Serialize
                        time:   [22.916 µs 22.935 µs 22.952 µs]
                        thrpt:  [32.866 MiB/s 32.891 MiB/s 32.918 MiB/s]
                 change:
                        time:   [-2.3817% -2.0812% -1.8326%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8668% +2.1254% +2.4398%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
100 Deep Folders/Deserialize
                        time:   [40.367 µs 40.395 µs 40.423 µs]
                        thrpt:  [18.001 MiB/s 18.013 MiB/s 18.026 MiB/s]
                 change:
                        time:   [-5.9860% -5.8948% -5.8109%] (p = 0.00 < 0.05)
                        thrpt:  [+6.1694% +6.2640% +6.3671%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

100 100-line ModuleScripts/Serialize
                        time:   [87.191 µs 87.308 µs 87.453 µs]
                        thrpt:  [42.181 MiB/s 42.251 MiB/s 42.307 MiB/s]
                 change:
                        time:   [-0.9348% -0.0077% +1.3494%] (p = 0.98 > 0.05)
                        thrpt:  [-1.3314% +0.0077% +0.9436%]
                        No change in performance detected.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) high mild
  16 (16.00%) high severe
100 100-line ModuleScripts/Deserialize
                        time:   [148.15 µs 148.28 µs 148.43 µs]
                        thrpt:  [24.576 MiB/s 24.601 MiB/s 24.622 MiB/s]
                 change:
                        time:   [-4.6385% -4.1208% -3.6891%] (p = 0.00 < 0.05)
                        thrpt:  [+3.8305% +4.2979% +4.8642%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

1,000 Parts/Serialize   time:   [1.7785 ms 1.7821 ms 1.7863 ms]
                        thrpt:  [2.3432 MiB/s 2.3487 MiB/s 2.3535 MiB/s]
                 change:
                        time:   [+0.8976% +1.1602% +1.4339%] (p = 0.00 < 0.05)
                        thrpt:  [-1.4136% -1.1469% -0.8896%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
1,000 Parts/Deserialize time:   [1.9735 ms 1.9768 ms 1.9807 ms]
                        thrpt:  [2.0516 MiB/s 2.0556 MiB/s 2.0591 MiB/s]
                 change:
                        time:   [+6.9658% +7.2007% +7.4589%] (p = 0.00 < 0.05)
                        thrpt:  [-6.9412% -6.7170% -6.5122%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Benchmarking Miner's Haven/Serialize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 14.3s, or reduce sample count to 30.
Miner's Haven/Serialize time:   [142.46 ms 142.93 ms 143.44 ms]
                        thrpt:  [52.796 MiB/s 52.986 MiB/s 53.159 MiB/s]
                 change:
                        time:   [-9.0355% -8.5930% -8.1495%] (p = 0.00 < 0.05)
                        thrpt:  [+8.8726% +9.4008% +9.9330%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
Benchmarking Miner's Haven/Deserialize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 18.0s, or reduce sample count to 20.
Miner's Haven/Deserialize
                        time:   [178.69 ms 179.64 ms 180.70 ms]
                        thrpt:  [41.957 MiB/s 42.205 MiB/s 42.428 MiB/s]
                 change:
                        time:   [-26.671% -25.882% -25.149%] (p = 0.00 < 0.05)
                        thrpt:  [+33.600% +34.920% +36.371%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

Seems positive and reproducible.

@kennethloeffler
Copy link
Member

kennethloeffler commented May 5, 2025

This change overall degrades performance on my M1 MacBook Air. Here is the deserialize Miner's Haven benchmark plotted (n = 1000, red is latest master, blue is this branch):
image

I'm not entirely sure why there is such a significant difference, but I did some cursory profiling and noticed that these changes cause ~10% more L1 misses and ~12% more L2 misses when deserializing miners-haven.rbxl on my machine. So, one reason could be that using a scratch Vec for instance builder properties somehow improves cache locality over UstrMap (at least, on my machine and on this workload).

I highly recommend resolving #529 before moving on with any performance work! The benchmarks are only as good as the data they can collect - if the data is polluted with e.g. CPU boosts then it's much, much harder to get a good signal

@krakow10
Copy link
Contributor Author

krakow10 commented May 6, 2025

I looked up how to set the cpu performance governor, and it can be done using cpupower like so:

sudo cpupower frequency-set -g performance

Which supposedly just sets it to always run at the maximum frequency. The default governer was called "powersave".

Running benchmarks repeatedly on the properties branch has been consistent within 5% across 10 runs, but running 5 benchmarks on the master or entries branch both elicit the 70% variance issue, going by feeling I'd say it's fast 50% of the time and slow 50% of the time. The properties branch's consistent benchmark is 10% slower than master for the "fast" variant of the "1,000 Parts/Deserialize" benchmark.

Ustr uses precomputed hashes which I assume are the same for every hashmap. Could it be accidental quadratic behaviour when doing new_map = map.iter().collect();?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants