slow generic implementation #259

loveshack opened this issue Sep 28, 2018 · 115 comments

Comments

@loveshack
Contributor

I was assuming that BLIS is generally better than reference BLAS, so substituting the latter with BLIS in the OS packages I'm working on would always be sensible. However, I found that a "generic" BLIS build is more than two times slower than the system reference BLAS package for medium-sized dgemm on x86_64/RHEL7 (the reference package should be built with -O2 -mtune=generic, not -O3). I can't usefully test an architecture without a tuned implementation, but I see no reason to think the result would be much different there, though I haven't looked into the GCC optimization.

Is that expected, or might it be something worth investigating?

@devinamatthews
Member

The generic implementation will have better cache behavior than netlib BLAS, but will also do packing which will slow things down for small and medium-sized matrices. It's not totally clear from your comment whether or not this is the configuration that BLIS is using, please correct me if I am mistaken.
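To illustrate the packing overhead mentioned above, here is a minimal, hypothetical sketch (not BLIS's actual packing routine; the function name and mr parameter are made up): copying A into contiguous, zero-padded micro-panels costs O(m*k) extra memory traffic, which the O(m*n*k) flops amortize only when the problem is large.

```c
/* Hypothetical sketch of packing a column-major m x k block of A into
   contiguous micro-panels of mr rows each, zero-padded at the ragged
   edge. For small and medium matrices, this extra O(m*k) copy traffic
   is a noticeable fraction of the total work. */
void pack_a_sketch(int m, int k, const double *a, int lda,
                   double *abuf, int mr)
{
    for (int i0 = 0; i0 < m; i0 += mr)          /* one micro-panel per mr rows */
    {
        int ib = (m - i0 < mr) ? (m - i0) : mr; /* rows actually present */
        for (int l = 0; l < k; ++l)
            for (int i = 0; i < mr; ++i)        /* zero-pad the ragged edge */
                *abuf++ = (i < ib) ? a[(i0 + i) + l * lda] : 0.0;
    }
}
```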

@jeffhammond
Member

@devinamatthews It may also be that Fortran is better than C :trollface:

@loveshack
Contributor Author

loveshack commented Sep 28, 2018 via email

@loveshack
Contributor Author

loveshack commented Sep 28, 2018 via email

@devinamatthews
Member

OK, I guess I'm not really clear why you care about the performance of the BLIS generic configuration. Even with cache blocking it will never be "high performance".

@cdluminate
Contributor

At least it is true that builds on non-x86_64 architectures are slow, due to the slow tests.
https://launchpad.net/~lumin0/+archive/ubuntu/ppa/+sourcepub/9451410/+listing-archive-extra
Click on a build to see the time elapsed for the whole compile+test process.

@fgvanzee
Member

fgvanzee commented Sep 29, 2018

@cdluminate I took a look at some of the build times, as you suggest. It is true that the build time is excessive for your s390x build, for example (50 minutes, if I'm reading the output correctly). Much of that can be attributed to the fact that we do not have optimized kernels for every architecture. s390x is one of those unoptimized architectures. Still, this does feel a bit slow.

(Digression: If you would like to reduce the total build time, I recommend running the "fast" version of the BLIS testsuite, which is almost surely where most of the time is being spent. Right now, make test triggers the BLAS test drivers + the full BLIS testsuite. You can instead use make check, which runs the BLAS test drivers + a shortened version of the BLIS testsuite.)

However, strangely, your amd64 build still requires almost 19 minutes. That is still quite long. I just did a quick test on my 3.6GHz Broadwell. Targeting x86_64 at configure-time, I found that:

  • The library build itself takes only 55 seconds.
  • The full BLIS testsuite (build and run) takes about 3 minutes.
  • The BLAS test drivers (build and run) add another 10 seconds.
    Note that no multithreading was used during the execution of any of the BLAS test drivers or BLIS testsuite, though all compilation was done with the -j4 argument to make.

Perhaps your build hardware for the amd64 build is old? Or maybe oversubscribed?

An unrelated question: I assume that the name of your amd64 build refers generically to "the build for x86_64 microarchitectures," as it does in the Gentoo Linux world, and not AMD-specific hardware. Am I correct?

@cdluminate
Contributor

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution, it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their buildfarm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

@fgvanzee
Member

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution, it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

That's fine. I often prefer the full testsuite in my own development, too, but I thought I would offer the faster alternative since many people in the past have been happy with avoiding many tests that are nearly identical to each other if it saves them 5-10x time.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their buildfarm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

Good, that's what I thought/expected. Thanks.

@cdluminate
Contributor

cdluminate commented Sep 30, 2018

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Just nitpicking: Launchpad, i.e. the PPA, is Ubuntu's infrastructure, backed by the company Canonical. Debian is supported by an independent community that theoretically doesn't rely on Ubuntu or Canonical.

The pages you see are not powered by Debian's build hardware. What I'm doing there is abusing Ubuntu's free build machines to build things on Ubuntu cosmic for testing Debian packages. (Ubuntu cosmic, i.e. Ubuntu 18.10, is very close to Debian unstable, so testing Debian packages on Ubuntu machines sometimes makes sense.)

@fgvanzee
Member

Just nitpicking: ...

Unlike most people, I will almost never be bothered by nitpicking! I like and appreciate nuance. :) Thanks for those details.

@fgvanzee
Member

BTW, since I don't use Debian, I have to rely on people like you and @nschloe for your expertise on these topics (understanding how we fit into the Debian/Ubuntu universes). Thanks again.

@jeffhammond
Member

jeffhammond commented Sep 30, 2018 via email

@fgvanzee
Member

@jeffhammond In principle, I agree with you. However, this is the sort of thing that is not as practical now that our group is so small. (It also doesn't help that maintaining machines in our department comes with a non-trivial amount of cost and of red tape.) Instead, I'm going to channel you circa 2010 and say, "we look forward to your patch." And by that I mean, "someone doing it for us."

@devinamatthews
Member

@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully-unrolled branch and temporary storage of C to the kernel, e.g.:

...
if (m == MR && n == NR)
{
    // unroll all MR*NR FMAs into temporaries
}
else
{
    // as usual
}
...
// accumulate at the end instead of along the way

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.
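The branch structure above might look like the following when filled in (a hypothetical sketch with assumed MR = NR = 4 and made-up names, not the actual BLIS reference kernel; packed-panel storage conventions are assumed in the comments):

```c
enum { MR = 4, NR = 4 };   /* assumed register blocksizes */

/* Sketch: C += A*B for a packed micro-panel of A (MR x k, stored with
   stride MR per k-iteration) and of B (k x NR, stride NR). The fast
   path accumulates the full MR x NR product in local temporaries,
   which the compiler can keep in registers, and touches C only once. */
void gemm_ukr_sketch(int m, int n, int k,
                     const double *restrict a,
                     const double *restrict b,
                     double *restrict c, int rs_c, int cs_c)
{
    if (m == MR && n == NR)
    {
        double ab[MR * NR] = { 0.0 };   /* temporary AB product */

        for (int l = 0; l < k; ++l)
            for (int j = 0; j < NR; ++j)
                for (int i = 0; i < MR; ++i)   /* constant bounds: unrollable */
                    ab[j * MR + i] += a[l * MR + i] * b[l * NR + j];

        /* accumulate into C at the end instead of along the way */
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                c[i * rs_c + j * cs_c] += ab[j * MR + i];
    }
    else
    {
        /* edge case: general triple loop, as usual */
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
            {
                double ab = 0.0;
                for (int l = 0; l < k; ++l)
                    ab += a[l * MR + i] * b[l * NR + j];
                c[i * rs_c + j * cs_c] += ab;
            }
    }
}
```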

@loveshack
Contributor Author

loveshack commented Oct 1, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 1, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 1, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 3, 2018 via email

@devinamatthews
Member

@loveshack What architectures in particular are you having a problem with?

@fgvanzee
Member

fgvanzee commented Oct 3, 2018

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.

@devinamatthews It's not clear from context if you were under the impression that reference kernels were not already compiled with architecture-specific flags, but indeed they are. (Or maybe you are referring to a different variety of flags than I am.) Either way, make V=1 would confirm.

Or did you mention architecture-specific flags because you knew that @loveshack could not use -march=native and the like for packaging purposes?

@devinamatthews
Member

@fgvanzee I was mostly talking about the actual generic configuration vs. the reference kernel being used in a particular configuration.

@fgvanzee
Member

fgvanzee commented Oct 3, 2018

@devinamatthews Ah, makes sense. Thanks for clarifying. Yeah, generic doesn't do jack except use -O3, which I'm guessing in our world doesn't do much either.

@jeffhammond
Member

jeffhammond commented Oct 5, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 5, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 5, 2018 via email

@loveshack
Contributor Author

loveshack commented Oct 5, 2018 via email

@devinamatthews
Member

devinamatthews commented Oct 5, 2018

i686, ppc64, ppc64le, and s390x

@loveshack For which of those architectures can we assume vectorization with the default flags?

@fgvanzee
Member

fgvanzee commented Oct 6, 2018

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by including vectorization.

I might be willing to add such a flag or flags if you can recommend some that are relatively portable. And ideally, you would tell me the analogues of such flags on clang and icc, if applicable.

@devinamatthews
Member

@fgvanzee I would suggest:

  1. Changing the default MR and NR to 4x16, 4x8, 4x8, 4x4 (sdcz).
  2. Rewriting the reference gemm kernel to:
    a. be row-major,
    b. be fully unrolled in the k loop (this means you wouldn't be able to change MR/NR without writing a custom kernel but that seems reasonable),
    c. use temporary variables for C, and
    d. use restrict.
  3. Adding configurations for whatever is missing for packaging (s390x, ppc64, etc.) to get at least baseline vectorization flags for the reference kernels.

Rationale: rewriting the reference kernel this way should allow for a reasonable degree of auto-vectorization given the right flags. The larger kernel sizes and row-major layout would allow for 128b and 256b vectorization with higher bandwidth from L1 than L2. I measure up to a 6x increase in performance for AVX2 in a quick mock test.
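Suggestions 2a-2d might combine into something like this sketch (assumed 4x8 double-precision micro-tile and illustrative names, not the actual BLIS code):

```c
enum { MR_D = 4, NR_D = 8 };   /* assumed double micro-tile, per suggestion 1 */

/* Row-major, restrict-qualified micro-kernel sketch: the AB accumulator
   is MR_D rows of NR_D contiguous doubles, so an auto-vectorizer can use
   full-width loads and FMAs on the inner j loop, whose bound is a
   compile-time constant. The k loop can then be unrolled by the
   compiler since the body is branch-free. */
void dgemm_ukr_rowmajor_sketch(int k,
                               const double *restrict a,  /* MR_D x k panel */
                               const double *restrict b,  /* k x NR_D panel */
                               double *restrict c, int ldc)
{
    double ab[MR_D][NR_D] = { { 0.0 } };   /* temporaries for C, suggestion 2c */

    for (int l = 0; l < k; ++l)
        for (int i = 0; i < MR_D; ++i)
            for (int j = 0; j < NR_D; ++j)   /* constant bound: vectorizable */
                ab[i][j] += a[l * MR_D + i] * b[l * NR_D + j];

    for (int i = 0; i < MR_D; ++i)
        for (int j = 0; j < NR_D; ++j)
            c[i * ldc + j] += ab[i][j];
}
```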

@fgvanzee
Member

I'm not seeing much of a difference when inserting the prefetch builtins, except for smaller problem sizes. Specifically, I tried splitting the k loop into two loops, such that the second loop executes the last 16 iterations. (The prefetches reside between the loops.) Performance seems to plateau around 16.x GFLOPS, so maybe a marginal 2-3% increase at the high end. Performance does ramp up more quickly, though.

fgvanzee added a commit that referenced this issue Jan 24, 2019
Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
  indexing annotated by the #pragma omp simd directive, which a compiler
  can use to vectorize certain constant-bounded loops. (The new kernels
  actually use _Pragma("omp simd") since the kernels are defined via
  templatizing macros.) Modest speedup was observed in most cases using
  gcc 5.4.0, which may improve with newer versions. Thanks to Devin
  Matthews for suggesting this via issue #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
  be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
  respectively, with a default row preference for the gemm ukernel. Also
  updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
  respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
  option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
  which compiler is detected (via macros such as __GNUC__). These
  prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
  build/templates and removed AMD as a default copyright holder.
@fgvanzee
Member

I've hopefully addressed this via bdd46f9.

This commit still lacks configurations for the off-beat architectures mentioned earlier in this issue. However, the new kernels, including the #pragma omp simd directives, are used by the generic configuration, which is what these architectures would need to use in the meantime. The commit also contains configure logic that verifies that -fopenmp-simd is a valid compiler flag.

@devinamatthews Please take a look at a sampling of the newly rewritten reference kernels (say, axpyv, axpyf, and gemm) and comment at your convenience.
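For readers unfamiliar with the _Pragma trick mentioned in the commit message: #pragma cannot appear inside a macro body, so the C99 _Pragma operator is used instead. A heavily simplified, hypothetical sketch of the pattern (not the actual BLIS templatizing macros):

```c
/* Simplified sketch of a templatizing kernel macro that annotates its
   loop with the OpenMP simd directive. _Pragma("omp simd") expands to
   the same thing as "#pragma omp simd", but is legal inside a macro.
   Without -fopenmp-simd the pragma is simply ignored. */
#define GEN_ADDV_KERNEL(ctype, ch) \
void ch##addv_sketch(int n, const ctype *restrict x, ctype *restrict y) \
{ \
    _Pragma("omp simd") \
    for (int i = 0; i < n; ++i) \
        y[i] += x[i]; \
}

GEN_ADDV_KERNEL(float, s)    /* generates saddv_sketch */
GEN_ADDV_KERNEL(double, d)   /* generates daddv_sketch */
```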

@fgvanzee
Member

Note: bdd46f9 had a couple bugs, which I subsequently fixed in 180f8e4 and 26c5cf4.

Main takeaway: we have to be very careful, particularly with trsm, about mixing optimized kernels with the new reference kernels, which use different register blocksizes encoded in their constant loop bounds.

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@devinamatthews
Member

@loveshack re gcc performance with the #pragma omp simd included: it seems that gcc is vectorizing it just fine (we have to do some additional cajoling to get it to use fma over mul+add, but that is just C99); the problem is that it keeps the temporary AB product on the stack. For really small kernels (4x4 or so) it will keep it in registers, but anything larger goes on the stack. There are in fact enough registers for up to 6x8, but it seems to be too conservative in allocating them.

clang doesn't seem to do a proper vectorization; IIRC it produced a bizarre mishmash of scalar and vector instructions, while also using the stack.

icc produced a "butterfly-style" kernel (@fgvanzee can explain further) which is very interesting in and of itself, but not at all what I was going for.

@fgvanzee
Member

I don't understand. The point of the generic configuration from my point of view is support for things without (micro-)architecture-specific kernels.

That's right. My only point was that in the absence of an s390x subconfiguration (for example), which would allow (but not require) the use of optimized kernels for an s390x-type system, any hardware that would want such a subconfiguration would have to use the generic configuration instead (for now). But that's okay since this thread's discussion is centered around speeding up the reference kernels. (Reminder: reference kernels are used (a) in whole by the generic configuration and (b) in part by more well-supported subconfigs such as haswell or skx which lack optimized versions of less important kernels such as subv or invertv.)

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

loveshack commented Jan 25, 2019 via email

@loveshack
Contributor Author

This is what I was talking about yesterday, rather long with the included
data...

I've not understood the reported issues with GCC vectorization, and I'm not convinced by the current generic implementation with the simd pragmas (though thanks for re-working it). The pragmas actually hurt performance, at least when testing on an AVX2 system.

I'm running Debian stable, with GCC6, on a "Intel(R) Core(TM) i5-6200U CPU @
2.30GHz" (with a full desktop, non-ideally). I'm testing serial square DGEMM
with the OpenBLAS benchmark, using LD_PRELOAD to switch libraries. Obviously
it's not really useful on x86, but I hope the target doesn't make a dramatic
difference to the vectorizer, and it's easiest to try locally before looking
at POWER8, which is the only architecture of interest I can use interactively
(though I can try to get on an aarch64 HPC system).

Using the Debian openblas package (0.2.19, i.e. rather old, pthreaded, so with
OPENBLAS_NUM_THREADS=1) as a reference I get this, with variance of a few
percent between runs:

   SIZE          Flops          Time
    500x500 :    30592.27 MFlops   0.008172 sec
   1000x1000 :    34156.51 MFlops   0.058554 sec
   1500x1500 :    35022.54 MFlops   0.192733 sec
   2000x2000 :    36214.66 MFlops   0.441810 sec
   2500x2500 :    36594.05 MFlops   0.853964 sec
   3000x3000 :    37024.47 MFlops   1.458495 sec
   3500x3500 :    36498.85 MFlops   2.349389 sec
   4000x4000 :    37323.58 MFlops   3.429467 sec
   4500x4500 :    37336.93 MFlops   4.881226 sec
   5000x5000 :    37499.07 MFlops   6.666832 sec

With current BLIS master configured "auto", i.e. haswell in this case:

   SIZE          Flops          Time
    500x500 :    29958.06 MFlops   0.008345 sec
   1000x1000 :    32551.55 MFlops   0.061441 sec
   1500x1500 :    32454.42 MFlops   0.207984 sec
   2000x2000 :    32845.58 MFlops   0.487128 sec
   2500x2500 :    32811.36 MFlops   0.952414 sec
   3000x3000 :    34308.11 MFlops   1.573972 sec
   3500x3500 :    34567.29 MFlops   2.480669 sec
   4000x4000 :    34128.36 MFlops   3.750546 sec
   4500x4500 :    34702.73 MFlops   5.251748 sec
   5000x5000 :    34731.16 MFlops   7.198147 sec

BLIS configured "generic" plus CFLAGS -march=native, in the absence of target
clones:

   SIZE          Flops          Time
    500x500 :    12022.12 MFlops   0.020795 sec
   1000x1000 :    12490.48 MFlops   0.160122 sec
   1500x1500 :    12644.14 MFlops   0.533844 sec
   2000x2000 :    12630.93 MFlops   1.266732 sec
   2500x2500 :    12541.94 MFlops   2.491640 sec
   3000x3000 :    12639.86 MFlops   4.272199 sec
   3500x3500 :    12630.47 MFlops   6.789137 sec
   4000x4000 :    12724.47 MFlops  10.059355 sec
   4500x4500 :    12687.22 MFlops  14.364844 sec
   5000x5000 :    12716.78 MFlops  19.659068 sec

[Omitting -march=native, gives about 7900.]

Now without the SIMD pragma (which requires modifying configure, as
-fno-openmp-simd in CFLAGS gets overridden), but with -march=native
-ffast-math, I get an encouraging ~65% of the tuned version:

   SIZE          Flops          Time
    500x500 :    19809.83 MFlops   0.012620 sec
   1000x1000 :    21244.05 MFlops   0.094144 sec
   1500x1500 :    22450.31 MFlops   0.300664 sec
   2000x2000 :    21960.41 MFlops   0.728584 sec
   2500x2500 :    22413.71 MFlops   1.394236 sec
   3000x3000 :    22560.82 MFlops   2.393530 sec
   3500x3500 :    22559.64 MFlops   3.801036 sec
   4000x4000 :    22522.04 MFlops   5.683323 sec
   4500x4500 :    22396.37 MFlops   8.137478 sec
   5000x5000 :    22538.35 MFlops  11.092205 sec

[-fprefetch-loop-arrays hurts performance in this case.]

I assumed the forced vectorization isn't all profitable, although
-Wopenmp-simd doesn't complain. However, thinking about it, perhaps the
pragma uses avx but not fma; I haven't tried to check.

For what it's worth, here's the difference in opt-info between using
openmp-simd and just Ofast, i.e. -O3 -ffast-math. (I don't understand the
unrolling note, as --help=optimizers say -funroll-loops is disabled.)

$ diff <(2>&1 gcc -fopt-info -O3 -march=native -fPIC -std=c99 -D_POSIX_C_SOURCE=200112L -Iinclude/generic -I./frame/3/ -I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/ -I./frame/include -DBLIS_VERSION_STRING=\"0.5.1-36\" -fopenmp-simd -DBLIS_CNAME=generic -DBLIS_IS_BUILDING_LIBRARY -c ref_kernels/3/bli_gemm_ref.c -o obj/generic/ref_kernels/generic/3/bli_gemm_generic_ref.o) <(2>&1 gcc -fopt-info -Ofast -march=native -fPIC -std=c99 -D_POSIX_C_SOURCE=200112L -Iinclude/generic -I./frame/3/ -I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/ -I./frame/include -DBLIS_VERSION_STRING=\"0.5.1-36\" -DBLIS_CNAME=generic -DBLIS_IS_BUILDING_LIBRARY -c ref_kernels/3/bli_gemm_ref.c -o obj/generic/ref_kernels/generic/3/bli_gemm_generic_ref.o -Wno-unknown-pragmas)
9,10c9,11
< ref_kernels/3/bli_gemm_ref.c:159:1: note: Loop 13 distributed: split to 0 loops and 1 library calls.
< ref_kernels/3/bli_gemm_ref.c:159:1: note: loop vectorized
---
> ref_kernels/3/bli_gemm_ref.c:159:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:159:1: note: loop with 17 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:159:1: note: Loop 1 distributed: split to 0 loops and 1 library calls.
15,16d15
< ref_kernels/3/bli_gemm_ref.c:159:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:159:1: note: loop turned into non-loop; it never loops.
18a18
> ref_kernels/3/bli_gemm_ref.c:159:1: note: basic block vectorized
27c27,31
< ref_kernels/3/bli_gemm_ref.c:160:1: note: Loop 13 distributed: split to 0 loops and 1 library calls.
---
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 9 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 5 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:160:1: note: Loop 1 distributed: split to 0 loops and 1 library calls.
40,43c44
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:160:1: note: loop with 4 iterations completely unrolled
---
> ref_kernels/3/bli_gemm_ref.c:160:1: note: basic block vectorized
53c54,55
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop vectorized
---
> ref_kernels/3/bli_gemm_ref.c:161:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:161:1: note: loop with 9 iterations completely unrolled
63,64d64
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop turned into non-loop; it never loops.
66,67d65
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:161:1: note: loop with 4 iterations completely unrolled
77c75,78
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop vectorized
---
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 5 iterations completely unrolled
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
> ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 5 iterations completely unrolled
91,92d91
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 2 iterations completely unrolled
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
94,95c93
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop turned into non-loop; it never loops.
< ref_kernels/3/bli_gemm_ref.c:162:1: note: loop with 4 iterations completely unrolled
---
> ref_kernels/3/bli_gemm_ref.c:162:1: note: basic block vectorized

I could see what MAQAO makes of the generated code in each case, but I
don't know whether it's worth the effort.

I also tried the native compiler (gcc 4.8) on EL7, which doesn't
support the simd pragma. Bizarrely, -march=native on haswell kills
performance (down to ~1500 from ~5000 without -march).

Using GCC target_clones isn't as straightforward as I hoped; I'm investigating.

@devinamatthews
Member

It sounds like:

  1. Recent gcc is much better at automatic vectorization (yay)
  2. #pragma omp simd does force vectorization, but maybe it does not play as nicely with other optimizations

I wonder if the vectorization when using omp simd is tuned for BLAS L1-like kernels and gets confused by the much more computationally dense GEMM kernel? Can you send the assembly for the kernel with and without omp simd?

@fgvanzee
Member

@loveshack Thanks for sharing these detailed results, and for going to the trouble.

Before commenting on your results, I would be curious to isolate the impact of -ffast-math, which, as of gcc 5.4.0, was shorthand for -fno-math-errno -funsafe-math-optimizations -ffinite-math-only -fno-rounding-math -fno-signaling-nans -fcx-limited-range. The man page description for the option warns that

           This option is not turned on by any -O option besides -Ofast since
           it can result in incorrect output for programs that depend on an
           exact implementation of IEEE or ISO rules/specifications for math
           functions. It may, however, yield faster code for programs that do
           not require the guarantees of these specifications.

Generally speaking, I would consider such options to be off-limits for our purposes. Now, the increase from ~12 to ~22 GFLOPS may have been attributable to something else, e.g. the removal of the pragmas rather than the use of -ffast-math. But it would be good to isolate these so we can evaluate them independently.

@loveshack
Contributor Author

loveshack commented Feb 13, 2019 via email

@devinamatthews
Member

Even when I used gcc 5 from Ubuntu 16.04 recently, it beat the hand-written intrinsics in the how-to-optimize-blas tutorial when I added "restrict".

If this is what you are observing then we have had very different experiences... At this point, I am happy with the current reference kernels either with or without omp simd. I think the most important things are the vectorization flags, fma (using the C99 fma function or adding flags), and the way the kernel is written, which I think is close to optimal now.

@loveshack
Contributor Author

loveshack commented Feb 13, 2019 via email

@fgvanzee
Member

I can't remember if I checked in this case, but in general you need -ffast-math for vectorization. (I think someone else also remarked on that, and I pointed out up-thread that icc defaults to it, which is probably a reason for its supposedly much better vectorization.)

As for my general aversion to "fast math" style options, perhaps I am being too conservative. Hopefully others can comment on the potential numerical risk of using -ffast-math/-Ofast.

Also, in my experience, you don't need -ffast-math in order for the compiler to emit vectorized object code; even with older versions of gcc such as 5.4, I've seen AVX (though not FMA) vector code emitted via pragma omp simd. (Maybe you only meant that -ffast-math was needed for better vectorized code?)

@loveshack
Contributor Author

loveshack commented Feb 14, 2019 via email

@loveshack
Contributor Author

loveshack commented Feb 14, 2019 via email

@devinamatthews
Member

@fgvanzee @loveshack re -ffast-math for BLAS: I think we are probably doing all of these unsafe optimizations by hand anyway ((a*b)*c = a*(b*c), FMA, complex multiplication, a/b = a*(1/b), etc.). For LAPACK I imagine there are some places where more care would be needed.

@loveshack re "the remaining performance lag relative to the tuned version": this is mostly going to be prefetch, plus a little unrolling, instruction reordering, etc. In my experience, the one thing the compiler did not do well was vector register allocation: the AB product must be kept in registers for highest performance.

@loveshack
Contributor Author

For info, here's the additional vectorization that -Ofast gives with gcc-6 compared with -O3 (not implying that it's all important).
It was generated by configuring with -fopt-info-vec -march=haswell and either -O3 or -Ofast, grepping the output for "vectorized" through sort -u, and diffing the results.
It's just occurred to me, though, that this actually under-counts the number of extra loops that may have been vectorized, since the line numbers reported are where macros are instantiated, typically as multiple loops. At least the level-1 and level-3 ref_kernels below don't get any vectorization without fast-math, cf. the optimized zen/haswell versions.

> frame/2/trmv/bli_trmv_unf_var1.c:218:1: note: loop vectorized
> frame/2/trsv/bli_trsv_unf_var1.c:232:1: note: loop vectorized
> frame/compat/bla_dot.c:139:2: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1152:3: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1171:3: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1545:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:1588:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:392:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:435:7: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:796:3: note: loop vectorized
> frame/compat/f2c/bla_gbmv.c:815:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:294:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:347:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:645:3: note: loop vectorized
> frame/compat/f2c/bla_sbmv.c:698:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:251:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:297:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:552:3: note: loop vectorized
> frame/compat/f2c/bla_spmv.c:598:3: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1001:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1348:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1369:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1391:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:1412:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:937:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:958:7: note: loop vectorized
> frame/compat/f2c/bla_tbmv.c:980:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1342:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1362:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1386:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:1406:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:927:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:947:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:971:7: note: loop vectorized
> frame/compat/f2c/bla_tbsv.c:991:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1157:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1175:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1197:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:1216:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:809:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:827:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:849:7: note: loop vectorized
> frame/compat/f2c/bla_tpmv.c:868:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1152:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1171:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1192:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:1211:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:801:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:820:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:841:7: note: loop vectorized
> frame/compat/f2c/bla_tpsv.c:860:7: note: loop vectorized
> frame/util/bli_util_unb_var1.c:265:1: note: loop vectorized
> frame/util/bli_util_unb_var1.c:481:1: note: loop vectorized
> frame/util/bli_util_unb_var1.c:84:1: note: loop vectorized
> ref_kernels/1/bli_dotv_ref.c:118:1: note: loop vectorized
> ref_kernels/1/bli_dotxv_ref.c:127:1: note: loop vectorized
> ref_kernels/1f/bli_dotaxpyv_ref.c:163:1: note: loop vectorized
> ref_kernels/3/bli_trsm_ref.c:247:1: note: loop vectorized
> ref_kernels/3/bli_trsm_ref.c:329:1: note: loop vectorized
> ref_kernels/ind/bli_trsm1m_ref.c:241:1: note: loop vectorized
> ref_kernels/ind/bli_trsm1m_ref.c:447:1: note: loop vectorized
> ref_kernels/ind/bli_trsm3m1_ref.c:159:1: note: loop vectorized
> ref_kernels/ind/bli_trsm3m1_ref.c:283:1: note: loop vectorized
> ref_kernels/ind/bli_trsm4m1_ref.c:168:1: note: loop vectorized
> ref_kernels/ind/bli_trsm4m1_ref.c:284:1: note: loop vectorized

@devinamatthews
Member

@loveshack to expand on the prefetching, this is what the hand-tuned kernel does:

  1. Prefetch C into L1 (with a write hint if possible) ~200 cycles ahead of time, but not so far ahead that loads of A and/or B will flush it out, and spaced such that the prefetcher does not run out of slots. For haswell this is just a block of prefetches at the start, but for skylake and knl it is much more complicated. It is also important to prefetch the address of the last element in each row/column of C.

  2. Prefetch A (for row-major kernels) or B (for column-major kernels) into L1 during the iterations, about ~30 cycles ahead of time. The "next" panel of the other operand (B or A, respectively) can also be prefetched into L2, but usually only about 1/4 of the panel is "warmed up" this way.

  3. For skylake and knl, we can't keep anything resident in L1, so both A and B have to be prefetched into L1 during the iterations.
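In C, such prefetching is typically expressed with the GCC/Clang __builtin_prefetch builtin. A sketch of point 1 (illustrative function name and distances, not the tuned BLIS kernels):

```c
/* __builtin_prefetch(addr, rw, locality): rw = 1 hints an upcoming
   write, locality 3 asks to keep the line in all cache levels (L1).
   This sketch issues the block of C prefetches described in point 1:
   for each row of an mr x nr tile of C, touch both the first and the
   last element, in case the row spans two cache lines. */
void prefetch_c_sketch(const double *c, int rs_c, int nr, int mr)
{
    for (int i = 0; i < mr; ++i)
    {
        __builtin_prefetch(c + i * rs_c, 1, 3);            /* start of row i  */
        __builtin_prefetch(c + i * rs_c + nr - 1, 1, 3);   /* end of row i    */
    }
}
```

Issuing these at the top of the micro-kernel, before the k loop, is what gives the "~200 cycles ahead" lead time described above.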

devinamatthews added a commit that referenced this issue Feb 6, 2022
The gemm reference kernel now uses the configuration-dependent BLIS_MR_x/BLIS_NR_x macros to control unrolling, rather than fixed values. This fixes #259 and replaces PR #547.
8 participants