small non-fixed-size bytewise copy is transformed to much slower `memcpy` #87440

iximeow · 2024-04-03T01:28:15Z

a bytewise copy of small but non-constant size with non-aliasing src/dest is transformed by LoopIdiomRecognize into an intrinsic memcpy. because the size is non-constant, neither InstCombine nor SelectionDAG transforms the small copy back into an appropriate series of loads and stores, so the intrinsic typically ends up as a call to memcpy. for small copies (<8 bytes as a fairly unscientific threshold) the library call is much slower than doing the copy with a short loop or inlined instructions. for size-optimized code, at least for x86 targets, a library call is also just larger.

i noticed this in some Rust (godbolt) but it's pretty apparent with restrict arguments in C as well (clang godbolt).
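for readers without the godbolt links handy, the C shape being described is roughly the following. this is a minimal sketch and not the exact code from the links; the function name `copy_small` is made up for illustration. the `restrict` qualifiers are what give the non-aliasing guarantee mentioned above.

```c
#include <stddef.h>

/* Bytewise copy of a small, runtime-sized buffer.
   `restrict` promises that dst and src do not alias, which is the
   property that allows the loop to be recognized as a memcpy idiom.
   `copy_small` is a hypothetical name used only for this sketch. */
void copy_small(unsigned char *restrict dst,
                const unsigned char *restrict src,
                size_t len) {
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}
```

compiled with optimizations on, a loop like this is the kind that can end up as a `memcpy` call even when callers only ever pass a handful of bytes.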

it seems like handling dynamic-but-small-sized memcpy is just particularly tricky, so maybe there's not much we can do here. i didn't see an existing issue similar to this, at least...

i'm not very familiar with how symbolic information is retained in LLVM. it seems that ideally i could write if (Size.isNotConstantButSmallerThan(16)) and decide to insert something better than a memcpy library call, but i can't tell if the max trip count of the original loop is retained as a hint on the memcpy size later, or if it's totally lost by virtue of being non-constant.
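to make "something better than a memcpy library call" concrete, here is one possible source-level shape for a copy whose size is dynamic but known to be at most 16 bytes. this is only a sketch under that assumption, not something LLVM emits today and not a proposed lowering; `copy_upto16` is a hypothetical helper. it uses the common trick of two overlapping fixed-size copies, where each fixed-size `memcpy` is expected to compile to a plain load or store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes where the caller guarantees
   len <= 16 and that dst and src do not overlap.
   Each branch loads a fixed-size head and a fixed-size tail; the two
   may overlap in the middle, which is harmless because both loads
   happen before either store. */
static inline void copy_upto16(void *restrict dst,
                               const void *restrict src,
                               size_t len) {
    assert(len <= 16);
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (len >= 8) {            /* 8..16 bytes: two 8-byte pieces */
        uint64_t head, tail;
        memcpy(&head, s, 8);
        memcpy(&tail, s + len - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + len - 8, &tail, 8);
    } else if (len >= 4) {     /* 4..7 bytes: two 4-byte pieces */
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + len - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + len - 4, &tail, 4);
    } else if (len >= 2) {     /* 2..3 bytes: two 2-byte pieces */
        uint16_t head, tail;
        memcpy(&head, s, 2);
        memcpy(&tail, s + len - 2, 2);
        memcpy(d, &head, 2);
        memcpy(d + len - 2, &tail, 2);
    } else if (len == 1) {     /* single byte */
        d[0] = s[0];
    }                          /* len == 0: nothing to do */
}
```

whether a branchy sequence like this beats a short loop or a library call depends on the target and on how predictable the size is, so treat it as an illustration of the idea rather than a recommendation.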

even then, in some target-specific cases there are specific instruction sequences that are more profitable than a memcpy - x86 FSRM (already handled in x86 SelectionDAG) is the example i know. so i'm not sure that it is always profitable to inline a small-but-dynamic-size memcpy?

i also couldn't figure out if a non-constant SDValue might still have range information associated with it, which i'd need to try anything in X86SelectionDAGInfo.cpp. did i miss a detail, or is SelectionDAG too late in the process to have range information? maybe an appropriate thing here would be a flag on memcpy to hint later that we knew a memcpy's max size is "small"? (and in that case, is "dynamic but low-upper-bound" something LLVM could determine in LoopIdiomRecognize when creating the memcpy in the first place?)

i was hoping to put together a patch to propose too, but as-is i have no idea what an appropriate change would be 😅 hopefully someone has a better idea?


@5225225

this empty commit reproduces a github comment that describes the work on commits from this point back to, roughly, 1.2.2. since many commits between these two points are interesting in the context of performance optimization (especially uarch-relevant tweaks), many WIP commits are preserved. as a result there is no clear squash merge, and this commit will be the next best thing.

on Rust 1.68.0 and a Xeon E3-1230 V2, relative changes are measured roughly as:

starting at ed4f238:
- non-fmt ns/decode: 15ns
- non-fmt instructions/decode: 94.6
- non-fmt IPC: 1.71
- fmt ns/decode+display: 91ns
- fmt instructions/decode+display: 683.8
- fmt IPC: 2.035

ending at 6a5ea10:
- non-fmt ns/decode: 15ns
- non-fmt instructions/decode: 94.6
- non-fmt IPC: 1.71
- fmt ns/decode+display: 47ns
- fmt instructions/decode+display: 329.6
- fmt IPC: 1.898

for an overall ~50% reduction in runtimes to display instructions. writing into InstructionTextBuffer reduces overhead another ~10%.

-- original message follows --

this is where much of iximeow/yaxpeax-arch#7 originated. `std::fmt` as a primary writing mechanism has.. some limitations:
* rust-lang/rust#92993 (comment)
* llvm/llvm-project#87440
* rust-lang/rust#122770

and some more interesting more fundamental limitations - writing to a `T: fmt::Write` means implementations don't know if it's possible to write bytes in reverse order (useful for printing digits) or if it's OK to write too many bytes and then only advance `len` by the correct amount (useful for copying variable-length-but-short strings like register names). these are both perfectly fine to a `String` or `Vec`, less fine to do to a file descriptor like stdout.

at the same time, `Colorize` and traits depending on it are very broken, for reasons described in yaxpeax-arch.

so, this adapts `yaxpeax-x86` to use the new `DisplaySink` type for writing, with optimizations where appropriate and output spans for certain kinds of tokens - registers, integers, opcodes, etc. it's not a perfect replacement for Colorize-to-ANSI-supporting-outputs but it's more flexible and i think can be made right.

along the way this completes the move of `safer_unchecked` out to yaxpeax-arch (ty @5225225 it's still so useful), cleans up some docs, and comes with a few new test cases.

because of the major version bump of yaxpeax-arch, and because this removes most functionality of the Colorize impl - it prints the correct words, just without coloring - this is itself a major version bump to 2.0.0. yay! this in turn is a good point to change the `Opcode` enums from being tuple-like to struct-like, and i've done so in 1b8019d.

full notes in CHANGELOG ofc. this is notes for myself when i'm trying to remember any of this in two years :)

github-actions bot added the new issue label Apr 3, 2024

EugeneZelenko added llvm:optimizations missed-optimization and removed new issue labels Apr 3, 2024

iximeow mentioned this issue Jun 24, 2024

significantly improve instruction printing efficiency iximeow/yaxpeax-x86#34

Merged
