slow generic implementation #259

loveshack opened this issue Sep 28, 2018 · 115 comments

Comments

@loveshack
Contributor

I was assuming that BLIS is generally better than reference BLAS, so substituting the latter with BLIS in the OS packages I'm working on would always be sensible. However, I found that a "generic" BLIS build is more than two times slower than the system reference BLAS package for medium-sized dgemm on x86_64/RHEL7 (the reference package should be built with -O2 -mtune=generic, not -O3). I can't usefully test an architecture without a tuned implementation, but I see no reason to think the result would be much different there, though I haven't looked into the GCC optimization.

Is that expected, or might it be something worth investigating?

@devinamatthews
Member

The generic implementation will have better cache behavior than netlib BLAS, but will also do packing which will slow things down for small and medium-sized matrices. It's not totally clear from your comment whether or not this is the configuration that BLIS is using, please correct me if I am mistaken.
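To illustrate the packing overhead mentioned above, here is a minimal, hypothetical sketch (not BLIS's actual packing routine; the function name and mr parameter are made up): copying A into contiguous, zero-padded micro-panels costs O(m*k) extra memory traffic, which the O(m*n*k) flops amortize only when the problem is large.

```c
/* Hypothetical sketch of packing a column-major m x k block of A into
   contiguous micro-panels of mr rows each, zero-padded at the ragged
   edge. For small and medium matrices, this extra O(m*k) copy traffic
   is a noticeable fraction of the total work. */
void pack_a_sketch(int m, int k, const double *a, int lda,
                   double *abuf, int mr)
{
    for (int i0 = 0; i0 < m; i0 += mr)          /* one micro-panel per mr rows */
    {
        int ib = (m - i0 < mr) ? (m - i0) : mr; /* rows actually present */
        for (int l = 0; l < k; ++l)
            for (int i = 0; i < mr; ++i)        /* zero-pad the ragged edge */
                *abuf++ = (i < ib) ? a[(i0 + i) + l * lda] : 0.0;
    }
}
```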

@jeffhammond
Member

@devinamatthews It may also be that Fortran is better than C :trollface:

@loveshack
Contributor Author

loveshack commented Sep 28, 2018 via email

@loveshack
Contributor Author

loveshack commented Sep 28, 2018 via email

@devinamatthews
Member

OK, I guess I'm not really clear why you care about the performance of the BLIS generic configuration. Even with cache blocking it will never be "high performance".

@cdluminate
Contributor

At least it is true that builds on non-x86_64 architectures are slow, due to the slow tests.
https://launchpad.net/~lumin0/+archive/ubuntu/ppa/+sourcepub/9451410/+listing-archive-extra
Click on a build to see the time elapsed for the whole compile+test process.

@fgvanzee
Member

fgvanzee commented Sep 29, 2018

@cdluminate I took a look at some of the build times, as you suggest. It is true that the build time is excessive for your s390x build, for example (50 minutes, if I'm reading the output correctly). Much of that can be attributed to the fact that we do not have optimized kernels for every architecture. s390x is one of those unoptimized architectures. Still, this does feel a bit slow.

(Digression: If you would like to reduce the total build time, I recommend running the "fast" version of the BLIS testsuite, which is almost surely where most of the time is being spent. Right now, make test triggers the BLAS test drivers + the full BLIS testsuite. You can instead use make check, which runs the BLAS test drivers + a shortened version of the BLIS testsuite.)

However, strangely, your amd64 build still requires almost 19 minutes. That is still quite long. I just did a quick test on my 3.6GHz Broadwell. Targeting x86_64 at configure-time, I found that:

  • The library build itself takes only 55 seconds.
  • The full BLIS testsuite (build and run) takes about 3 minutes.
  • The BLAS test drivers (build and run) add another 10 seconds.
    Note that no multithreading was used during the execution of any of the BLAS test drivers or BLIS testsuite, though all compilation was done with the -j4 argument to make.

Perhaps your build hardware for the amd64 build is old? Or maybe oversubscribed?

An unrelated question: I assume that the name of your amd64 build refers generically to "the build for x86_64 microarchitectures," as it does in the Gentoo Linux world, and not AMD-specific hardware. Am I correct?

@cdluminate
Contributor

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution, it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their buildfarm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

@fgvanzee
Member

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution, it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

That's fine. I often prefer the full testsuite in my own development, too, but I thought I would offer the faster alternative since many people in the past have been happy with avoiding many tests that are nearly identical to each other if it saves them 5-10x time.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their buildfarm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

Good, that's what I thought/expected. Thanks.

@cdluminate
Contributor

cdluminate commented Sep 30, 2018

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Just nitpicking: Launchpad, i.e. the PPA, is Ubuntu's infrastructure, backed by the company Canonical. Debian is supported by an independent community that theoretically doesn't rely on Ubuntu or Canonical.

The pages you see are not powered by Debian's build hardware. What I'm doing there is abusing Ubuntu's free build machines to build things on Ubuntu cosmic for testing Debian packages. (Ubuntu cosmic, i.e. Ubuntu 18.10, is very close to Debian unstable, so testing Debian packages on Ubuntu machines sometimes makes sense.)

@fgvanzee
Member

Just nitpicking: ...

Unlike most people, I will almost never be bothered by nitpicking! I like and appreciate nuance. :) Thanks for those details.

@fgvanzee
Member

BTW, since I don't use Debian, I have to rely on people like you and @nschloe for your expertise on these topics (understanding how we fit into the Debian/Ubuntu universes). Thanks again.

@jeffhammond
Member

jeffhammond commented Sep 30, 2018 via email

@fgvanzee
Member

@jeffhammond In principle, I agree with you. However, this is the sort of thing that is not as practical now that our group is so small. (It also doesn't help that maintaining machines in our department comes with a non-trivial amount of cost and of red tape.) Instead, I'm going to channel you circa 2010 and say, "we look forward to your patch." And by that I mean, "someone doing it for us."

@devinamatthews
Member

@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully-unrolled branch and temporary storage of C to the kernel, e.g.:

...
if (m == MR && n == NR)
{
    // unroll all MR*NR FMAs into temporaries
}
else
{
    // as usual
}
...
// accumulate at the end instead of along the way

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.
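The branch structure above might look like the following when filled in (a hypothetical sketch with assumed MR = NR = 4 and made-up names, not the actual BLIS reference kernel; packed-panel storage conventions are assumed in the comments):

```c
enum { MR = 4, NR = 4 };   /* assumed register blocksizes */

/* Sketch: C += A*B for a packed micro-panel of A (MR x k, stored with
   stride MR per k-iteration) and of B (k x NR, stride NR). The fast
   path accumulates the full MR x NR product in local temporaries,
   which the compiler can keep in registers, and touches C only once. */
void gemm_ukr_sketch(int m, int n, int k,
                     const double *restrict a,
                     const double *restrict b,
                     double *restrict c, int rs_c, int cs_c)
{
    if (m == MR && n == NR)
    {
        double ab[MR * NR] = { 0.0 };   /* temporary AB product */

        for (int l = 0; l < k; ++l)
            for (int j = 0; j < NR; ++j)
                for (int i = 0; i < MR; ++i)   /* constant bounds: unrollable */
                    ab[j * MR + i] += a[l * MR + i] * b[l * NR + j];

        /* accumulate into C at the end instead of along the way */
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                c[i * rs_c + j * cs_c] += ab[j * MR + i];
    }
    else
    {
        /* edge case: general triple loop, as usual */
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
            {
                double ab = 0.0;
                for (int l = 0; l < k; ++l)
                    ab += a[l * MR + i] * b[l * NR + j];
                c[i * rs_c + j * cs_c] += ab;
            }
    }
}
```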

@loveshack
Contributor Author

loveshack commented Oct 1, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 1, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 1, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 3, 2018 via email

@devinamatthews
Member

@loveshack What architectures in particular are you having a problem with?

@fgvanzee
Member

fgvanzee commented Oct 3, 2018

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.

@devinamatthews It's not clear from context if you were under the impression that reference kernels were not already compiled with architecture-specific flags, but indeed they are. (Or maybe you are referring to a different variety of flags than I am.) Either way, make V=1 would confirm.

Or did you mention architecture-specific flags because you knew that @loveshack could not use -march=native and the like for packaging purposes?

@devinamatthews
Member

@fgvanzee I was mostly talking about the actual generic configuration vs. the reference kernel being used in a particular configuration.

@fgvanzee
Member

fgvanzee commented Oct 3, 2018

@devinamatthews Ah, makes sense. Thanks for clarifying. Yeah, generic doesn't do jack except use -O3, which I'm guessing in our world doesn't do much either.

@jeffhammond
Member

jeffhammond commented Oct 5, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 5, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 5, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 5, 2018 via email

@devinamatthews
Member

devinamatthews commented Oct 5, 2018

i686, ppc64, ppc64le, and s390x

@loveshack For which of those architectures can we assume vectorization with the default flags?

@fgvanzee
Member

fgvanzee commented Oct 6, 2018

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by including vectorization.

I might be willing to add such a flag or flags if you can recommend some that are relatively portable. And ideally, you would tell me the analogues of such flags on clang and icc, if applicable.

@devinamatthews
Member

@fgvanzee I would suggest:

  1. Changing the default MR and NR to 4x16, 4x8, 4x8, 4x4 (sdcz).
  2. Rewriting the reference gemm kernel to:
    a. be row-major,
    b. be fully unrolled in the k loop (this means you wouldn't be able to change MR/NR without writing a custom kernel but that seems reasonable),
    c. use temporary variables for C, and
    d. use restrict.
  3. Adding configurations for whatever is missing for packaging (s390x, ppc64, etc.) to get at least baseline vectorization flags for the reference kernels.

Rationale: rewriting the reference kernel this way should allow for a reasonable degree of auto-vectorization given the right flags. The larger kernel sizes and row-major layout would allow for 128b and 256b vectorization with higher bandwidth from L1 than L2. I measure up to a 6x increase in performance for AVX2 in a quick mock test.
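Suggestions 2a-2d might combine into something like this sketch (assumed 4x8 double-precision micro-tile and illustrative names, not the actual BLIS code):

```c
enum { MR_D = 4, NR_D = 8 };   /* assumed double micro-tile, per suggestion 1 */

/* Row-major, restrict-qualified micro-kernel sketch: the AB accumulator
   is MR_D rows of NR_D contiguous doubles, so an auto-vectorizer can use
   full-width loads and FMAs on the inner j loop, whose bound is a
   compile-time constant. The k loop can then be unrolled by the
   compiler since the body is branch-free. */
void dgemm_ukr_rowmajor_sketch(int k,
                               const double *restrict a,  /* MR_D x k panel */
                               const double *restrict b,  /* k x NR_D panel */
                               double *restrict c, int ldc)
{
    double ab[MR_D][NR_D] = { { 0.0 } };   /* temporaries for C, suggestion 2c */

    for (int l = 0; l < k; ++l)
        for (int i = 0; i < MR_D; ++i)
            for (int j = 0; j < NR_D; ++j)   /* constant bound: vectorizable */
                ab[i][j] += a[l * MR_D + i] * b[l * NR_D + j];

    for (int i = 0; i < MR_D; ++i)
        for (int j = 0; j < NR_D; ++j)
            c[i * ldc + j] += ab[i][j];
}
```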

@fgvanzee
Member

I'm not seeing much of a difference when inserting the prefetch builtins, except for smaller problem sizes. Specifically, I tried splitting the k loop into two loops, such that the second loop executes the last 16 iterations. (The prefetches reside between the loops.) Performance seems to plateau around 16.x GFLOPS, so maybe a marginal 2-3% increase at the high end. Performance does ramp up more quickly, though.

fgvanzee added a commit that referenced this issue Jan 24, 2019
Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
  indexing annotated by the #pragma omp simd directive, which a compiler
  can use to vectorize certain constant-bounded loops. (The new kernels
  actually use _Pragma("omp simd") since the kernels are defined via
  templatizing macros.) Modest speedup was observed in most cases using
  gcc 5.4.0, which may improve with newer versions. Thanks to Devin
  Matthews for suggesting this via issue #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
  be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
  respectively, with a default row preference for the gemm ukernel. Also
  updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
  respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
  option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
  which compiler is detected (via macros such as __GNUC__). These
  prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
  build/templates and removed AMD as a default copyright holder.
@fgvanzee
Member

I've hopefully addressed this via bdd46f9.

This commit still lacks configurations for the off-beat architectures mentioned earlier in this issue. However, the new kernels, including the #pragma omp simd directives, are used by the generic configuration, which is what these architectures would need to use in the meantime. The commit also contains configure logic that verifies that -fopenmp-simd is a valid compiler flag.

@devinamatthews Please take a look at a sampling of the newly rewritten reference kernels (say, axpyv, axpyf, and gemm) and comment at your convenience.
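For readers unfamiliar with the _Pragma trick mentioned in the commit message: #pragma cannot appear inside a macro body, so the C99 _Pragma operator is used instead. A heavily simplified, hypothetical sketch of the pattern (not the actual BLIS templatizing macros):

```c
/* Simplified sketch of a templatizing kernel macro that annotates its
   loop with the OpenMP simd directive. _Pragma("omp simd") expands to
   the same thing as "#pragma omp simd", but is legal inside a macro.
   Without -fopenmp-simd the pragma is simply ignored. */
#define GEN_ADDV_KERNEL(ctype, ch) \
void ch##addv_sketch(int n, const ctype *restrict x, ctype *restrict y) \
{ \
    _Pragma("omp simd") \
    for (int i = 0; i < n; ++i) \
        y[i] += x[i]; \
}

GEN_ADDV_KERNEL(float, s)    /* generates saddv_sketch */
GEN_ADDV_KERNEL(double, d)   /* generates daddv_sketch */
```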

@fgvanzee
Member

Note: bdd46f9 had a couple bugs, which I subsequently fixed in 180f8e4 and 26c5cf4.

Main takeaway: we have to be very careful, particularly with trsm, about mixing optimized kernels with the new reference kernels, which use different register blocksizes encoded in their constant loop bounds.

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@devinamatthews
Member

@loveshack re gcc performance with the #pragma omp simd included: it seems that gcc is vectorizing it just fine (we have to do some additional cajoling to get it to use fma over mul+add, but that is just C99); the problem is that it keeps the temporary AB product on the stack. For really small kernels (4x4 or so) it will keep it in registers, but anything larger goes on the stack. There are in fact enough registers for up to 6x8, but it seems to be too conservative in allocating them.

clang doesn't seem to do a proper vectorization; IIRC it produced a bizarre mishmash of scalar and vector instructions, while also using the stack.

icc produced a "butterfly-style" kernel (@fgvanzee can explain further) which is very interesting in and of itself, but not at all what I was going for.

@fgvanzee
Member

I don't understand. The point of the generic configuration from my point of view is support for things without (micro-)architecture-specific kernels.

That's right. My only point was that in the absence of an s390x subconfiguration (for example), which would allow (but not require) the use of optimized kernels for an s390x-type system, any hardware that would want such a subconfiguration would have to use the generic configuration instead (for now). But that's okay since this thread's discussion is centered around speeding up the reference kernels. (Reminder: reference kernels are used (a) in whole by the generic configuration and (b) in part by more well-supported subconfigs such as haswell or skx which lack optimized versions of less important kernels such as subv or invertv.)

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

This is what I was talking about yesterday, rather long with the included
data...

I've not understood the reported issues with GCC vectorization, and I'm not convinced by the current generic implementation with the simd pragmas (though thanks for re-working it). The pragmas actually hurt performance, at least when testing on an AVX2 system.

I'm running Debian stable, with GCC6, on a "Intel(R) Core(TM) i5-6200U CPU @
2.30GHz" (with a full desktop, non-ideally). I'm testing serial square DGEMM
with the OpenBLAS benchmark, using LD_PRELOAD to switch libraries. Obviously
it's not really useful on x86, but I hope the target doesn't make a dramatic
difference to the vectorizer, and it's easiest to try locally before looking
at POWER8, which is the only architecture of interest I can use interactively
(though I can try to get on an aarch64 HPC system).

Using the Debian openblas package (0.2.19, i.e. rather old, pthreaded, so with
OPENBLAS_NUM_THREADS=1) as a reference I get this, with variance of a few
percent between runs:

   SIZE          Flops          Time
    500x500 :    30592.27 MFlops   0.008172 sec
   1000x1000 :    34156.51 MFlops   0.058554 sec
   1500x1500 :    35022.54 MFlops   0.192733 sec
   2000x2000 :    36214.66 MFlops   0.441810 sec
   2500x2500 :    36594.05 MFlops   0.853964 sec
   3000x3000 :    37024.47 MFlops   1.458495 sec
   3500x3500 :    36498.85 MFlops   2.349389 sec
   4000x4000 :    37323.58 MFlops   3.429467 sec
   4500x4500 :    37336.93 MFlops   4.881226 sec
   5000x5000 :    37499.07 MFlops   6.666832 sec

With current BLIS master configured "auto", i.e. haswell in this case:

   SIZE          Flops          Time
    500x500 :    29958.06 MFlops   0.008345 sec
   1000x1000 :    32551.55 MFlops   0.061441 sec
   1500x1500 :    32454.42 MFlops   0.207984 sec
   2000x2000 :    32845.58 MFlops   0.487128 sec
   2500x2500 :    32811.36 MFlops   0.952414 sec
   3000x3000 :    34308.11 MFlops   1.573972 sec
   3500x3500 :    34567.29 MFlops   2.480669 sec
   4000x4000 :    34128.36 MFlops   3.750546 sec
   4500x4500 :    34702.73 MFlops   5.251748 sec
   5000x5000 :    34731.16 MFlops   7.198147 sec

BLIS configured "generic" plus CFLAGS -march=native, in the absence of target
clones:

   SIZE          Flops          Time
    500x500 :    12022.12 MFlops   0.020795 sec
   1000x1000 :    12490.48 MFlops   0.160122 sec
   1500x1500 :    12644.14 MFlops   0.533844 sec
   2000x2000 :    12630.93 MFlops   1.266732 sec
   2500x2500 :    12541.94 MFlops   2.491640 sec
   3000x3000 :    12639.86 MFlops   4.272199 sec
   3500x3500 :    12630.47 MFlops   6.789137 sec
   4000x4000 :    12724.47 MFlops  10.059355 sec
   4500x4500 :    12687.22 MFlops  14.364844 sec
   5000x5000 :    12716.78 MFlops  19.659068 sec

[Omitting -march=native, gives about 7900.]

Now without the SIMD pragma (which requires modifying configure, as
-fno-openmp-simd in CFLAGS gets overridden), but with -march=native
-ffast-math, I get an encouraging ~65% of the tuned version:

   SIZE          Flops          Time
    500x500 :    19809.83 MFlops   0.012620 sec
   1000x1000 :    21244.05 MFlops   0.094144 sec
   1500x1500 :    22450.31 MFlops   0.300664 sec
   2000x2000 :    21960.41 MFlops   0.728584 sec
   2500x2500 :    22413.71 MFlops   1.394236 sec
   3000x3000 :    22560.82 MFlops   2.393530 sec
   3500x3500 :    22559.64 MFlops   3.801036 sec
   4000x4000 :    22522.04 MFlops   5.683323 sec
   4500x4500 :    22396.37 MFlops   8.137478 sec
   5000x5000 :    22538.35 MFlops  11.092205 sec

[-fprefetch-loop-arrays hurts performance in this case.]

I assumed the forced vectorization isn't all profitable, although
-Wopenmp-simd doesn't complain. However, thinking about it, perhaps the
pragma uses avx but not fma; I haven't tried to check.

For what it's worth, here's the difference in opt-info between using
openmp-simd and just Ofast, i.e. -O3 -ffast-math. (I don't understand the
unrolling note, as --help=optimizers say -funroll-loops is disabled.)

$ diff <(2>&1 gcc -fopt-info -O3 -march=native -fPIC -std=c99 -D_POSIX_C_SOURCE=200112L -Iinclude/generic -I./frame/3/ -I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/ -I./frame/include -DBLIS_VERSION_STRING=\"0.5.1-36\" -fopenmp-simd -DBLIS_CNAME=generic -DBLIS_IS_BUILDING_LIBRARY -c ref_kernels/3/bli_gemm_ref.c -o obj/generic/ref_kernels/generic/3/bli_gemm_generic_ref.o) <(2>&1 gcc -fopt-info -Ofast -march=native -fPIC -std=c99 -D_POSIX_C_SOURCE=200112L -Iinclude/generic -I./frame/3/ -I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/ -I./frame/include -DBLIS_VERSION_STRING=\"0.5.1-36\" -DBLIS_CNAME=generic -DBLIS_IS_BUILDING_LIBRARY -c ref_kernels/3/bli_gemm_ref.c -o obj/generic/ref_kernels/generic/3/bli_gemm_generic_ref.o -Wno-unknown-pragmas)
9,10c9,11
< ref_kernels/3/bli_gemm_ref.c:159:1: note: Loop 13 distributed: split to 0 loops and 1 library calls.
< ref_kernels/3/bli_gemm_ref.c:159:1: note: loop vectorized
---
> ref_kernels/3/bli_gemm_ref.c:159:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:159:1: note: loop with 17 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:159:1: note: Loop 1 distributed: split to 0 loops and 1 library calls.
15,16d15
< ref_kernels/3/bli_gemm_ref.c:159:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:159:1: note: loop turned into non-loop; it never loops.
18a18
> ref_kernels/3/bli_gemm_ref.c:159:1: note: basic block vectorized
27c27,31
< ref_kernels/3/bli_gemm_ref.c:160:1: note: Loop 13 distributed: split to 0 loops and 1 library calls.
---
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 9 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 5 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:160:1: note: Loop 1 distributed: split to 0 loops and 1 library calls.
40,43c44
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 4 iterations completely unrolled
---
> ref_kernels/3/bli_gemm_ref.c:160:1: note: basic block vectorized
53c54,55
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop vectorized
---
> ref_kernels/3/bli_gemm_ref.c:161:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:161:1: note: loop with 9 iterations completely unrolled
63,64d64
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop turned into non-loop; it never loops.
66,67d65
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop with 4 iterations completely unrolled
77c75,78
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop vectorized
---
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 5 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 5 iterations completely unrolled
91,92d91
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
94,95c93
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 4 iterations completely unrolled
---
> ref_kernels/3/bli_gemm_ref.c:162:1: note: basic block vectorized

I could see what MAQAO makes of the generated code in each case, but I
don't know whether it's worth the effort.

I also tried the native compiler (gcc 4.8) on EL7, which doesn't
support the simd pragma. Bizarrely, -march=native on haswell kills
performance (down to ~1500 from ~5000 without -march).

Using GCC target_clones isn't as straightforward as I hoped; I'm investigating.

@devinamatthews
Member

It sounds like:

  1. Recent gcc is much better at automatic vectorization (yay)
  2. #pragma omp simd does force vectorization, but maybe it does not play as nicely with other optimizations

I wonder if the vectorization when using omp simd is tuned for BLAS L1-like kernels and gets confused by the much more computationally dense GEMM kernel? Can you send the assembly for the kernel with and without omp simd?

@fgvanzee
Member

@loveshack Thanks for sharing these detailed results, and for going to the trouble.

Before commenting on your results, I would be curious to isolate the impact of -ffast-math, which, as of gcc 5.4.0, was shorthand for -fno-math-errno -funsafe-math-optimizations -ffinite-math-only -fno-rounding-math -fno-signaling-nans -fcx-limited-range. The man page description for the option warns that

           This option is not turned on by any -O option besides -Ofast since
           it can result in incorrect output for programs that depend on an
           exact implementation of IEEE or ISO rules/specifications for math
           functions. It may, however, yield faster code for programs that do
           not require the guarantees of these specifications.

Generally speaking, I would consider such options to be off-limits for our purposes. Now, the increase from ~12 to ~22 GFLOPS may have been attributable to something else, e.g. the removal of the pragmas rather than the use of -ffast-math. But it would be good to isolate these so we can evaluate them independently.

@loveshack
Contributor Author

loveshack commented Feb 13, 2019 via email

@devinamatthews
Member

Even when I used gcc 5 from Ubuntu 16.04 recently, it beat the hand-written intrinsics in the how-to-optimize-blas tutorial when I added "restrict".

If this is what you are observing then we have had very different experiences... At this point, I am happy with the current reference kernels either with or without omp simd. I think the most important things are the vectorization flags, fma (using the C99 fma function or adding flags), and the way the kernel is written, which I think is close to optimal now.

@loveshack
Contributor Author

loveshack commented Feb 13, 2019 via email

@fgvanzee
Member

I can't remember if I checked in this case, but in general you need -ffast-math for vectorization. (I think someone else also remarked on that, and I pointed out up-thread that icc defaults to it, which is probably a reason for its supposedly much better vectorization.)

As for my general aversion to "fast math" style options, perhaps I am being too conservative. Hopefully others can comment on the potential numerical risk of using -ffast-math/-Ofast.

Also, in my experience, you don't need -ffast-math in order for the compiler to emit vectorized object code; even with older versions of gcc such as 5.4, I've seen AVX (though not FMA) vector code emitted via pragma omp simd. (Maybe you only meant that -ffast-math was needed for better vectorized code?)

@loveshack
Contributor Author

loveshack commented Feb 14, 2019 via email

@loveshack
Contributor Author

loveshack commented Feb 14, 2019 via email

@devinamatthews
Member

@fgvanzee @loveshack re -ffast-math for BLAS: I think we are probably doing all of these unsafe optimizations by hand anyway ((a*b)*c = a*(b*c), FMA, complex multiplication, a/b = a*(1/b), etc.). For LAPACK I imagine there are some places where more care would be needed.

@loveshack re "the remaining performance lag relative to the tuned version": this is mostly going to be prefetch, plus a little unrolling, instruction reordering, etc. In my experience, the one thing the compiler did not do well was vector register allocation: the AB product must be kept in registers for highest performance.

@loveshack
Contributor Author

For info, here's the additional vectorization that -Ofast gives with gcc-6 compared with -O3 (not implying that it's all important).
It was generated by configuring with -fopt-info-vec -march=haswell and either -O3 or -Ofast, grepping the output for "vectorized" through sort -u, and diffing the results.
It's just occurred to me, though, that this actually under-counts the number of extra loops that may have been vectorized, since the line numbers reported are where macros are instantiated, typically as multiple loops. At least the level-1 and level-3 ref_kernels below don't get any vectorization without fast-math, cf. the optimized zen/haswell versions.

> frame/2/trmv/bli_trmv_unf_var1.c:218:1: note: loop vectorized
> frame/2/trsv/bli_trsv_unf_var1.c:232:1: note: loop vectorized
> frame/compat/bla_dot.c:139:2: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1152:3: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1171:3: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1545:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1588:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:392:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:435:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:796:3: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:815:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:294:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:347:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:645:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:698:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:251:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:297:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:552:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:598:3: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1001:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1348:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1369:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1391:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1412:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:937:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:958:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:980:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1342:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1362:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1386:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1406:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:927:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:947:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:971:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:991:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1157:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1175:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1197:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1216:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:809:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:827:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:849:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:868:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1152:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1171:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1192:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1211:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:801:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:820:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:841:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:860:7: note: loop vectorized
> frame/util/bli_util_unb_var1.c:265:1: note: loop vectorized
> frame/util/bli_util_unb_var1.c:481:1: note: loop vectorized
> frame/util/bli_util_unb_var1.c:84:1: note: loop vectorized
> ref_kernels/1/bli_dotv_ref.c:118:1: note: loop vectorized
> ref_kernels/1/bli_dotxv_ref.c:127:1: note: loop vectorized
> ref_kernels/1f/bli_dotaxpyv_ref.c:163:1: note: loop vectorized
> ref_kernels/3/bli_trsm_ref.c:247:1: note: loop vectorized
> ref_kernels/3/bli_trsm_ref.c:329:1: note: loop vectorized
> ref_kernels/ind/bli_trsm1m_ref.c:241:1: note: loop vectorized
> ref_kernels/ind/bli_trsm1m_ref.c:447:1: note: loop vectorized
> ref_kernels/ind/bli_trsm3m1_ref.c:159:1: note: loop vectorized
> ref_kernels/ind/bli_trsm3m1_ref.c:283:1: note: loop vectorized
> ref_kernels/ind/bli_trsm4m1_ref.c:168:1: note: loop vectorized
> ref_kernels/ind/bli_trsm4m1_ref.c:284:1: note: loop vectorized

@devinamatthews
Member

@loveshack to expand on the prefetching, this is what the hand-tuned kernel does:

  1. Prefetch C into L1 (with a write hint if possible) ~200 cycles ahead of time, but not so far ahead that loads of A and/or B will flush it out, and spaced such that the prefetcher does not run out of slots. For haswell this is just a block of prefetches at the start, but for skylake and knl it is much more complicated. It is also important to prefetch the address of the last element in each row/column of C.

  2. Prefetch A (for row-major kernels) or B (for column-major kernels) into L1 during the iterations, about ~30 cycles ahead of time. The "next" panel of the other operand (B or A, respectively) can also be prefetched into L2, but usually only about 1/4 of the panel is "warmed up" this way.

  3. For skylake and knl, we can't keep anything resident in L1, so both A and B have to be prefetched into L1 during the iterations.
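In C, such prefetching is typically expressed with the GCC/Clang __builtin_prefetch builtin. A sketch of point 1 (illustrative function name and distances, not the tuned BLIS kernels):

```c
/* __builtin_prefetch(addr, rw, locality): rw = 1 hints an upcoming
   write, locality 3 asks to keep the line in all cache levels (L1).
   This sketch issues the block of C prefetches described in point 1:
   for each row of an mr x nr tile of C, touch both the first and the
   last element, in case the row spans two cache lines. */
void prefetch_c_sketch(const double *c, int rs_c, int nr, int mr)
{
    for (int i = 0; i < mr; ++i)
    {
        __builtin_prefetch(c + i * rs_c, 1, 3);            /* start of row i  */
        __builtin_prefetch(c + i * rs_c + nr - 1, 1, 3);   /* end of row i    */
    }
}
```

Issuing these at the top of the micro-kernel, before the k loop, is what gives the "~200 cycles ahead" lead time described above.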

devinamatthews added a commit that referenced this issue Feb 6, 2022
The gemm reference kernel now uses the configuration-dependent BLIS_MR_x/BLIS_NR_x macros to control unrolling, rather than fixed values. This fixes #259 and replaces PR #547.
8 participants