compare with jetscii aarch64 simd #178

Dr-Emann · 2025-05-21T06:40:42Z

The new jetscii aarch64 algorithm supports an arbitrary set of bytes (though currently limited to 16 to match the existing limitation of the x86 implementation).

It seems to be pretty competitive with memchr3: a bit faster on smaller haystacks, or when the bytes being searched for are more common. I think this is largely because iteration keeps a 64-bit bitset of positions already identified as matches, rather than restarting the search after every match, and because it can process 64 bytes at a time without any fixups when matches are found.
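To illustrate the bitset idea, here is a minimal scalar sketch. It is not the jetscii code: `MatchIter` and its fields are hypothetical names, and the per-byte loop stands in for what the NEON version would do with vector compares. It only shows how a 64-bit mask lets iteration resume from already-found positions instead of rescanning.

```rust
/// Hypothetical iterator over match positions; a sketch, not jetscii's API.
struct MatchIter<'a> {
    haystack: &'a [u8],
    needles: [bool; 256], // lookup table: true if the byte is in the set
    chunk_start: usize,   // offset of the chunk that `mask` describes
    next_chunk: usize,    // offset of the first byte not yet scanned
    mask: u64,            // bit i set => haystack[chunk_start + i] matches
}

impl<'a> MatchIter<'a> {
    fn new(haystack: &'a [u8], needles: &[u8]) -> Self {
        let mut table = [false; 256];
        for &b in needles {
            table[b as usize] = true;
        }
        MatchIter { haystack, needles: table, chunk_start: 0, next_chunk: 0, mask: 0 }
    }
}

impl<'a> Iterator for MatchIter<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        // Only scan a new chunk once every match in the current mask
        // has been handed out.
        while self.mask == 0 {
            if self.next_chunk >= self.haystack.len() {
                return None;
            }
            self.chunk_start = self.next_chunk;
            let end = (self.chunk_start + 64).min(self.haystack.len());
            // Scalar stand-in for the SIMD compare: build the 64-bit mask.
            for (i, &b) in self.haystack[self.chunk_start..end].iter().enumerate() {
                if self.needles[b as usize] {
                    self.mask |= 1u64 << i;
                }
            }
            self.next_chunk = end;
        }
        // Lowest set bit is the next match; clear it so the following
        // call continues from the same mask.
        let bit = self.mask.trailing_zeros() as usize;
        self.mask &= self.mask - 1;
        Some(self.chunk_start + bit)
    }
}
```

With this shape, `MatchIter::new(b"a,b;c", b",;")` yields positions 1 and 3 from a single scan of the chunk, which is where I'd expect the advantage on haystacks with frequent matches to come from.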

I'd like to improve it by using aligned loads like memchr does (with a possibly unaligned load at the start and at the end).
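Roughly, the shape I have in mind is sketched below. This is a scalar illustration under assumptions, not memchr's or jetscii's actual code: `find_aligned`, `chunk_find`, and the 16-byte `CHUNK` are hypothetical, and each slice scan stands in for one vector load plus compare.

```rust
const CHUNK: usize = 16; // assumed vector width, for illustration only

/// Stand-in for "load one vector and compare"; hypothetical helper.
fn chunk_find(chunk: &[u8], needle: u8) -> Option<usize> {
    chunk.iter().position(|&b| b == needle)
}

/// Sketch: unaligned head, aligned body, overlapping unaligned tail.
fn find_aligned(haystack: &[u8], needle: u8) -> Option<usize> {
    let len = haystack.len();
    if len < CHUNK {
        // Too short for a full chunk; fall back to a plain scan.
        return chunk_find(haystack, needle);
    }

    // 1. Head: one possibly unaligned chunk at offset 0.
    if let Some(i) = chunk_find(&haystack[..CHUNK], needle) {
        return Some(i);
    }

    // 2. Body: start at the first CHUNK-aligned address after the start.
    //    `pos` is in 1..=CHUNK, so it may overlap bytes the head already
    //    checked, which is harmless because they contained no match.
    let addr = haystack.as_ptr() as usize;
    let mut pos = CHUNK - (addr % CHUNK);
    while pos + CHUNK <= len {
        if let Some(i) = chunk_find(&haystack[pos..pos + CHUNK], needle) {
            return Some(pos + i);
        }
        pos += CHUNK;
    }

    // 3. Tail: one possibly unaligned chunk ending exactly at `len`,
    //    overlapping the last aligned chunk instead of a byte-wise loop.
    let tail = len - CHUNK;
    chunk_find(&haystack[tail..], needle).map(|i| tail + i)
}
```

The overlap in steps 2 and 3 only rescans bytes already known not to match, so the first hit reported is still the first match in the haystack.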

benchmark                        rust/jetscii/memchr3  rust/memchr/memchr3  rust/memchr/memchr3/fallback  rust/memchr/memchr3/naive
---------                        --------------------  -------------------  ----------------------------  -------------------------
memchr/sherlock/common/huge3     2.1 GB/s (1.00x)      659.4 MB/s (3.23x)   346.0 MB/s (6.16x)            567.4 MB/s (3.76x)
memchr/sherlock/common/small3    7.5 GB/s (1.00x)      759.3 MB/s (10.05x)  1518.6 MB/s (5.02x)           1688.6 MB/s (4.52x)
memchr/sherlock/never/huge3      21.7 GB/s (1.36x)     29.4 GB/s (1.00x)    8.2 GB/s (3.58x)              1795.7 MB/s (16.78x)
memchr/sherlock/never/small3     14.7 GB/s (1.02x)     15.1 GB/s (1.00x)    7.5 GB/s (2.02x)              1688.6 MB/s (9.15x)
memchr/sherlock/never/tiny3      64.3 GB/s (1.00x)     64.3 GB/s (1.00x)    64.3 GB/s (1.00x)             1566.8 MB/s (42.00x)
memchr/sherlock/never/empty3     1.00ns (1.00x)        1.00ns (1.00x)       1.00ns (1.00x)                1.00ns (1.00x)
memchr/sherlock/rare/huge3       20.6 GB/s (1.08x)     22.2 GB/s (1.00x)    7.5 GB/s (2.95x)              1770.3 MB/s (12.86x)
memchr/sherlock/rare/small3      14.7 GB/s (1.02x)     15.1 GB/s (1.00x)    7.5 GB/s (2.02x)              1522.2 MB/s (10.15x)
memchr/sherlock/rare/tiny3       64.3 GB/s (1.00x)     64.3 GB/s (1.00x)    64.3 GB/s (1.00x)             1566.8 MB/s (42.00x)
memchr/sherlock/uncommon/huge3   5.9 GB/s (1.00x)      1812.7 MB/s (3.34x)  1593.0 MB/s (3.80x)           1291.7 MB/s (4.69x)
memchr/sherlock/uncommon/small3  14.7 GB/s (1.00x)     3.7 GB/s (3.98x)     3.7 GB/s (3.98x)              1688.6 MB/s (8.93x)
memchr/sherlock/uncommon/tiny3   64.3 GB/s (1.00x)     792.8 MB/s (83.00x)  1566.8 MB/s (42.00x)          1566.8 MB/s (42.00x)

compare with jetscii aarch64 simd

cf752ff

Dr-Emann force-pushed the jetscii_aarch_simd branch from 384a92e to cf752ff Compare May 21, 2025 06:42

Dr-Emann mentioned this pull request May 21, 2025

aarch64 simd implementation shepmaster/jetscii#61

Open
