slow generic implementation #259
The `generic` implementation will have better cache behavior than netlib BLAS, but will also do packing which will slow things down for small and medium-sized matrices.

---
@devinamatthews It may also be that Fortran is better than C.

---
You wrote:
> The `generic` implementation will have better cache behavior than
> netlib BLAS,

That's what I thought.

> but will also do packing which will slow things down for
> small and medium-sized matrices.

but I wasn't asking about that. At what sort of size would that stop hurting (and I wonder if it could usefully be adaptive)? I tried 2000×2000 to run a few goes in a reasonable time. I've just tried 4000 square, which looks about the same.

> It's not totally clear from your
> comment whether or not this is the configuration that BLIS is using,
> please correct me if I am mistaken.

I built a BLIS dynamic library with default flags and the generic target. I took the openblas dgemm benchmark (which actually linked against openblas), and ran it with either BLIS or reference BLAS LD_PRELOADed. Is that clearer?

I could examine the compilation results and profiles at some stage when I have more time, but thought it was worth asking the experts first -- thanks.

---
You wrote:
> @devinamatthews It may also be that Fortran is better than C

Of course, but a sometime GNU Fortran maintainer knows how :-/.

---
OK, I guess I'm not really clear why you care about the performance of the BLIS `generic` configuration. Even with cache blocking it will never be "high performance".

---
At least it is true that the builds on non-x86_64 architectures are slow due to the slow tests.

---
@cdluminate I took a look at some of the build times, as you suggest. It is true that the build time is excessive for your [...] builds.

(Digression: If you would like to reduce the total build time, I recommend running the "fast" version of the BLIS testsuite, which is almost surely where most of the time is being spent. Right now, [...].)

However, strangely, your [...].

Perhaps your build hardware for the [...].

An unrelated question: I assume that the name of your `amd64` builds refers to `x86_64` generally, whatever the CPU vendor?

---
Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

As for Debian's term `amd64`: it always equals `x86_64`, no matter what brand the physical CPU is.

---
That's fine. I often prefer the full testsuite in my own development, too, but I thought I would offer the faster alternative since many people in the past have been happy with avoiding many tests that are nearly identical to each other if it saves them a 5-10x factor in time.
I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.
Good, that's what I thought/expected. Thanks.

---
Just nitpicking: Launchpad, with its PPAs, is Ubuntu's infrastructure, supported by the company Canonical. Debian is supported by an independent community that theoretically doesn't rely on Ubuntu or Canonical. The pages you see are not powered by Debian's build hardware. What I'm doing there is abusing Ubuntu's free build machines to build stuff on Ubuntu cosmic for testing Debian packages. (Ubuntu cosmic, i.e. Ubuntu 18.10, is very close to Debian unstable, so testing Debian packages on an Ubuntu machine sometimes makes sense.)

---
Unlike most people, I will almost never be bothered by nitpicking! I like and appreciate nuance. :) Thanks for those details.

---
BTW, since I don't use Debian, I have to rely on people like you and @nschloe for your expertise on these topics (understanding how we fit into the Debian/Ubuntu universes). Thanks again.

---
Field:

Next time a vendor offers to donate hardware, you might ask for a big SSD so you can set up a virtual machine for every Linux distro. Just a thought.

---
@jeffhammond In principle, I agree with you. However, this is the sort of thing that is not as practical now that our group is so small. (It also doesn't help that maintaining machines in our department comes with a non-trivial amount of cost and of red tape.) Instead, I'm going to channel you circa 2010 and say, "we look forward to your patch." And by that I mean, "someone doing it for us."

---
@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully-unrolled branch and temporary storage of C to the kernel, e.g.:

```C
...
if (m == MR && n == NR)
{
    // unroll all MR*NR FMAs into temporaries
}
else
{
    // as usual
}
...
// accumulate at the end instead of along the way
```

**and** arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual `generic` configuration will probably still be very slow because it gets very conservative compiler flags.
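As a concrete (and hypothetical) illustration of the sketch above: a 4x4 double-precision microkernel whose fast path is written so that the compiler can keep the whole microtile in registers. The function name, panel layouts, and calling convention here are assumptions for illustration, not BLIS's actual kernel API.

```C
#define MR 4
#define NR 4

// Hypothetical 4x4 microkernel: C += A*B for an MR x k panel of A
// (column-stored) and a k x NR panel of B (row-stored).
void dgemm_ukr_sketch(int k, int m, int n,
                      const double* restrict a,
                      const double* restrict b,
                      double* restrict c, int rs_c, int cs_c)
{
    if (m == MR && n == NR)
    {
        // Fast path: constant trip counts let the compiler fully unroll
        // all MR*NR multiply-adds and keep the accumulators in registers.
        double ab[MR * NR] = { 0.0 };

        for (int p = 0; p < k; ++p)
            for (int i = 0; i < MR; ++i)
                for (int j = 0; j < NR; ++j)
                    ab[i * NR + j] += a[p * MR + i] * b[p * NR + j];

        // Accumulate into C once at the end instead of along the way.
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                c[i * rs_c + j * cs_c] += ab[i * NR + j];
    }
    else
    {
        // Edge cases (m < MR or n < NR): general loops, as usual.
    }
}
```

---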
You wrote:
> Debian tries to help upstream spot problems, not to build software as
> fast as possible. In order to build a reliable Linux distribution it's
> not a good idea to skip too many tests. Hence the full testsuite is
> preferred for packaging.

For what it's worth, that's not what's normally done for Fedora. On the slower build platforms it would likely time out, and can perturb mass rebuilds considerably. I consider the "check" step in rpm builds basically a sanity check, especially as in cases like this you can't test a relevant range of micro-architectures. [The make check target was added for that, but I also test the Fortran interface with gfortran, rather than relying on the f2c'd versions.]

For Fedora, I don't care about build times unless they're pathological, especially as they're very variable on the build VMs.

> Debian's term `amd64` always equals `x86_64`, no matter what brand
> the physical CPU is.

[And for confusion, Fedora just uses x86_64 (which is probably less correct).]
---
You wrote:
> OK, I guess I'm not really clear why you care about the performance of
> the BLIS `generic` configuration. Even with cache blocking it will
> never be "high performance".

This is for OS packaging purposes. I assumed I could say that using BLIS would be strictly better than reference BLAS, i.e. the reference blas package is redundant for any platform not supported by the blis or openblas packages (apart from compatibility tests).

---
You wrote:
> Field:
> Next time a vendor offers to donate hardware, you might ask for a big SSD
> so you can set up a virtual machine for every Linux distro. Just a thought.

For what it's worth, I frequently spin up VMs with vagrant, which is mostly practical at least up to a cluster of three or so, on an 8GB/HDD laptop.

However, it's reasonable to leave specific distribution work to packagers, as long as the basic build system doesn't put obstacles in the way, and I think we've already got the relevant hooks like xFLAGS. Thanks.

Also for what it's worth, I've tested rpm packaging for SuSE in the configurations supported by Fedora's copr, as well as for the range of supported RHEL/Fedora targets, and on my amd64 Debian desktop.

---
You wrote:
> @loveshack Returning to the original question: I think one way to make
> the "generic" implementation faster would be to add a fully-unrolled
> branch and temporary storage of C to the kernel, e.g.:
>
> ```C
> ...
> if (m == MR && n == NR)
> {
>     // unroll all MR*NR FMAs into temporaries
> }
> else
> {
>     // as usual
> }
> ...
> // accumulate at the end instead of along the way
> ```
>
> **and** arrange for the reference kernel to be compiled with
> architecture-appropriate flags. The second issue means that e.g. a
> configuration without an optimized kernel would possibly run faster
> because of auto-vectorization, but that the actual `generic`
> configuration will probably still be very slow because it gets very
> conservative compiler flags.
I haven't had a chance to investigate further, but I did find that building generic with -march=native -Ofast -funroll-loops doesn't make a dramatic difference, not that -march=native can be used for packaging anyhow. (Part of the reason I expected BLIS to do better is that the -O3 it uses enables vectorization -- though only SSE2 with -march=generic -- cf. the -O2 used for the reference blas package.) Then again, I've never understood why compilers do so badly on, say, matmul.

---
@loveshack What architectures in particular are you having a problem with?

---
@devinamatthews It's not clear from context if you were under the impression that reference kernels were not already compiled with architecture-specific flags, but indeed they are. (Or maybe you are referring to a different variety of flags than I am.) Either way, [...]. Or did you mention architecture-specific flags because you knew that @loveshack could not use [...]?

---
@fgvanzee I was mostly talking about the actual `generic` configuration [...].

---
@devinamatthews Ah, makes sense. Thanks for clarifying. Yeah, `generic` doesn't do jack except use `-O3`, which I'm guessing in our world doesn't do much either.

---
It might be interesting to see if simd pragmas cause anything better to happen with the reference kernel. I’ve got a list of all of those, in addition to the obvious OpenMP one.
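For readers unfamiliar with the pragma in question, a minimal sketch (illustrative only, not BLIS's actual reference kernel): annotating a loop with `#pragma omp simd` asks the compiler to vectorize it, and with gcc/clang the `-fopenmp-simd` flag honors the pragma without pulling in the OpenMP runtime.

```C
#include <stddef.h>

// Illustrative axpy-style loop: y := y + alpha*x.
// Compile with, e.g., gcc -O2 -fopenmp-simd.
void daxpy_ref(size_t n, double alpha,
               const double* restrict x, double* restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```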
---
You wrote:
> @loveshack What architectures in particular are you having a problem with?

The Fedora architectures that BLIS doesn't support are, I think, i686, ppc64, ppc64le, and s390x; there will be more in Debian. (OpenBLAS does those apart from ppc64, so we can at least use a free BLAS on most Fedora architectures.)

---
You wrote:
> @devinamatthews Ah, makes sense. Thanks for clarifying. Yeah,
> `generic` doesn't do jack except use `-O3`, which I'm guessing in our
> world doesn't do much either.

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by enabling vectorization.

---
You wrote:
> It might be interesting to see if simd pragmas cause anything better to
> happen with the reference kernel. I've got a list of all of those, in
> addition to the obvious OpenMP one.

Yes, but I guess the first thing to do is to consult a detailed profile and gcc's optimization report. I'll have a look at it eventually, but I don't know whether results from x86_64 would be representative of other architectures I can't currently access. (I'll try to get on aarch64 and power8 at some stage.)

---
@loveshack For which of those architectures can we assume vectorization with the default flags?

---
I might be willing to add such a flag or flags if you can recommend some that are relatively portable. And ideally, you would tell me the analogues of such flags on clang and icc, if applicable.

---
@fgvanzee I would suggest: [...]

Rationale: rewriting the reference kernel this way (see the sketch below) should allow for a reasonable degree of auto-vectorization given the right flags. The larger kernel sizes and row-major layout would allow for 128b and 256b vectorization, with higher bandwidth from L1 than from L2. I measured up to a 6x increase in performance for AVX2 in a quick mock test.
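A rough sketch of the kind of rewrite being suggested (the 6x8 double-precision shape is an assumption based on the register counts discussed later in this thread; the layout and names are illustrative): a row-major microtile with constant loop bounds gives the compiler unit-stride inner loops it can auto-vectorize.

```C
enum { MR = 6, NR = 8 };

// Illustrative row-major reference microkernel:
// c (MR x NR, row-major) += a (MR x k panel) * b (k x NR panel).
void dgemm_ukr_rowmajor(int k,
                        const double* restrict a,
                        const double* restrict b,
                        double* restrict c)
{
    double ab[MR][NR] = { { 0.0 } };

    for (int p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i)
        {
            // Unit stride in j: amenable to 128b/256b vectorization.
            #pragma omp simd
            for (int j = 0; j < NR; ++j)
                ab[i][j] += a[p * MR + i] * b[p * NR + j];
        }

    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * NR + j] += ab[i][j];
}
```

---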
I'm not seeing much of a difference when inserting the prefetch builtins, except for smaller problem sizes. Specifically, I tried splitting the k loop into two loops, such that the second loop executes the last 16 iterations. (The prefetches reside between the loops; a sketch follows below.) Performance seems to plateau around 16.x GFLOPS, so maybe a marginal 2-3% increase at the high end. Performance does ramp up more quickly, though.
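A minimal sketch of that experiment, under stated assumptions (the split distance of 16 comes from the comment above; prefetching the C microtile and everything else is illustrative). `__builtin_prefetch` is a GCC/Clang builtin; a second argument of 1 requests a prefetch with write intent.

```C
// Sketch: split the k loop so prefetches of the C microtile sit between
// the main loop and the last 16 iterations.
void dgemm_ukr_split(int k, int mr, int nr,
                     const double* restrict a,
                     const double* restrict b,
                     double* restrict c, int rs_c, int cs_c)
{
    const int k_pre = (k > 16) ? k - 16 : 0;

    for (int p = 0; p < k_pre; ++p)
    {
        // ... rank-1 update as usual ...
    }

    // Prefetch each row of the C microtile for writing, so the final
    // accumulation does not stall on memory.
    for (int i = 0; i < mr; ++i)
        __builtin_prefetch(&c[i * rs_c], 1);

    for (int p = k_pre; p < k; ++p)
    {
        // ... the last 16 iterations ...
    }
}
```

---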
Details:

- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified indexing annotated by the `#pragma omp simd` directive, which a compiler can use to vectorize certain constant-bounded loops. (The new kernels actually use `_Pragma("omp simd")` since the kernels are defined via templatizing macros.) Modest speedup was observed in most cases using gcc 5.4.0, which may improve with newer versions. Thanks to Devin Matthews for suggesting this via issues #286 and #259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex, and dcomplex, respectively, with a default row preference for the gemm ukernel. Also updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4, respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to which compiler is detected (via macros such as __GNUC__). These prefetch macros are not yet employed anywhere, though; a hypothetical sketch follows.
- Updated the year in copyrights of template license headers in build/templates and removed AMD as a default copyright holder.
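Regarding the prefetch-macro bullet, a hypothetical sketch of what such a compiler-dispatched header might look like (the macro names here are invented for illustration; the actual names in the commit may differ):

```C
// Map portable prefetch macros onto compiler builtins when available.
#if defined(__GNUC__) || defined(__clang__)
  #define PREFETCH_R(addr) __builtin_prefetch((addr), 0)  // read intent
  #define PREFETCH_W(addr) __builtin_prefetch((addr), 1)  // write intent
#else
  #define PREFETCH_R(addr) ((void)0)  // no-op on unknown compilers
  #define PREFETCH_W(addr) ((void)0)
#endif
```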
---

I've hopefully addressed this via bdd46f9. This commit still lacks configurations for the off-beat architectures mentioned earlier in this issue. However, the new kernels, including the `#pragma omp simd` directives, are used by the `generic` configuration, which is what these architectures would need to use in the meantime.

@devinamatthews Please take a look at a sampling of the newly rewritten reference kernels (say, [...]).

---
You wrote:
> The trick will be (3): creating sub-configurations with the
> appropriate optimization/vectorization flags for these off-beat
> architectures. We'll need people like @loveshack to chime in with that
> information as we iterate towards something that works as desired.

[Rather late, sorry.]

I don't know what sort of thing you need, but I don't think gcc flags will be very architecture-specific. Obviously you need a list of appropriate micro-architectures to clone, but that's in the gcc doc (obviously more in higher gcc versions). I'm happy to provide what info I can, but I can only potentially help with (versions of) GCC, although I guess I can run whatever clang is in Debian and EPEL7.

I still don't understand the compiler issues and would like to. When I tried, -fopt-info told me gcc was vectorizing what it apparently wasn't for Devin absent omp simd, but I'm not up to interpreting the assembly. One thing that makes a difference is -ffast-math -- icc defaults to the equivalent option in typical fast-but-incorrect-by-default style -- and I don't know if that was used. Also, is there an advantage to unrolling by hand rather than having the compiler do it?
---
You wrote:
> > `__builtin_prefetch(addr, 1)` is what you want.
>
> I guess this assumes GCC and maybe Clang/ICC. I said preprocessor because I
> don't know if Cray, PGI or IBM (non-Clang front-end) supports this, but
> those may not matter in this context.

For what it's worth, icc is documented as defaulting to the prefetch option.
---
You wrote:
> This commit still lacks configurations for the off-beat architectures
> mentioned earlier in this issue.

The only architectures documented as having target attributes in GCC 6 are arm, x86, and power(pc) -- apart from whatever "NIOS II" is -- if that's what you mean. GCC 8 also has aarch64 and s390. Note that these don't seem to work properly in pragmas, at least for x86_64 (about which I should raise an issue), but I don't have cross-compilers for the other targets to test.

> However, the new kernels, including
> the `#pragma omp simd` directives, are used by the `generic`
> configuration, which is what these architectures would need to use in
> the meantime.

I don't understand. The point of the generic configuration, from my point of view, is support for things without (micro-)architecture-specific kernels.

I may be able to look at this further some time next week, if it's clear what's needed.
---
@loveshack re gcc performance with the `#pragma omp simd` included: it seems that gcc is vectorizing it just fine (we have to do some additional cajoling to get it to use fma over mul+add, but that is just C99), but the problem is that it is keeping the temporary AB product on the stack. For really small kernels (4x4 or so) it will keep it in registers, but anything larger goes on the stack. There are in fact enough registers for up to 6x8, but it seems to be too conservative in allocating them.

clang doesn't seem to actually do a proper vectorization; IIRC it produced a bizarre mishmash of scalar and vector instructions, while also utilizing the stack. icc produced a "butterfly-style" kernel (@fgvanzee can explain further) which is very interesting in and of itself, but not at all what I was going for.

---
That's right. My only point was that in the absence of an `s390x` subconfiguration (for example), which would allow (but not require) the use of optimized kernels for an `s390x`-type system, any hardware that would *want* such a subconfiguration would have to use the generic configuration instead (for now).

---
You wrote:
> @loveshack re gcc performance with the `#pragma omp simd` included: it
> seems that gcc is vectorizing it just fine (we have to do some
> additional cajoling to get it to use fma over mul+add, but that is just
> C99), but the problem is that it is keeping the temporary AB product
> on the stack. For really small kernels (4x4 or so) it will keep it in
> registers, but anything larger goes on the stack. There are in fact
> enough registers for up to 6x8, but it seems to be too conservative in
> allocating them.

What I meant is that I saw the same optimization report with the pragma and without on one of your examples (once I fixed it to compile). If you can provide a reproducible example (with gcc version and all flags) I'm happy to try to take it up with gcc developers, though that won't help in the short term unless someone suggests different flags.

Anyway, if the generic version is a few times faster now, that's good.
---
You wrote:
> That's right. My only point was that in the absence of an `s390x`
> subconfiguration (for example), which would allow (but not require)
> the use of optimized kernels for an `s390x`-type system, any hardware
> that would *want* such a subconfiguration would have to use the
> generic configuration instead (for now).

Assuming the one kernel is OK for different cache sizes etc., I'd just expect the generic config to specialize for appropriate cases with GCC. Off the top of my head, and in haste, I'd expect something like
```C
#if __GNUC__
// I'm not sure about unlimited, and maybe you want explicit options
// for vectorization, not O3
#pragma GCC optimize ("fast-math,O3,vect-cost-model=unlimited")
// I don't know if the arch macros are correct off-hand
#if __ppc64le__
#define clones __attribute__ ((target_clones ("default,cpu=power8,cpu=power9")))
#elif __s390__
...
#endif
#else
#define clones
#endif
```
and then add "clones" as a function attribute where it's relevant.
---
Actually, that example needs to check the gcc version for support of the
specific architectures. I think such attributes are also supported by
appropriate versions of clang.
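For illustration, a sketch of such a version check (a rough sketch only: the version cutoffs and clone lists below are assumptions, not verified against GCC release notes):

```C
// Gate target_clones on compiler and version, per architecture.
#if defined(__GNUC__) && !defined(__clang__)
  #if defined(__powerpc64__) && __GNUC__ >= 6
    #define clones __attribute__ ((target_clones ("default,cpu=power8,cpu=power9")))
  #elif defined(__s390__) && __GNUC__ >= 8
    #define clones __attribute__ ((target_clones ("default,arch=z13")))
  #endif
#endif
#ifndef clones
  #define clones   // fall back to no clones
#endif
```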
---
This is what I was talking about yesterday; rather long, with the included [...].

I've not understood the reported issues with GCC vectorization, but I'm running Debian stable, with GCC 6, on an "Intel(R) Core(TM) i5-6200U CPU @ [...]".

Using the Debian openblas package (0.2.19, i.e. rather old, pthreaded, so with [...]):

[benchmark output elided]

With current BLIS master configured "auto", i.e. haswell in this case:

[benchmark output elided]

BLIS configured "generic" plus CFLAGS -march=native, in the absence of target [...]:

[benchmark output elided]

[Omitting -march=native gives about 7900.]

Now without the SIMD pragma (which requires modifying configure, as [...]):

[benchmark output elided]

[prefetch-loop-arrays hurts performance in this case.]

I assumed the forced vectorization isn't all profitable, although [...]. For what it's worth, here's the difference in opt-info between using [...]:

[output elided]

I could see what MAQAO makes of the generated code in each case, but I [...]. I also tried the native compiler (gcc 4.8) on EL7, which doesn't [...]. Using GCC target_clones isn't as straightforward as I hoped; I'm investigating.

---
It sounds like:

1. Recent gcc is much better at automatic vectorization (yay)
2. `#pragma omp simd` does force vectorization, but maybe it does not play as nice with other optimizations

I wonder if the vectorization when using omp simd is tuned for BLAS L1-like kernels and gets confused by the much more computationally dense GEMM kernel? Can you send the assembly for the kernel with and without omp simd?

---
@loveshack Thanks for sharing these detailed results, and for going to the trouble. Before commenting on your results, I would be curious to isolate the impact of `-ffast-math`/`-Ofast`.

Generally speaking, I would consider such options to be off-limits for our purposes. Now, the increase from ~12 to ~22 GFLOPS may have been attributable to something else, e.g. the presence of the pragmas rather than the use of `-ffast-math`. But it would be good to isolate these so we can evaluate them independently.

---
You wrote:
> It sounds like:
> 1) Recent gcc is much better at automatic vectorization (yay)

6 isn't terribly recent, but better than what? (Even when I used 5 from Ubuntu 16.04 recently, it beat the hand-written intrinsics in the how-to-optimize-blas tutorial when I added "restrict".) I wanted to know exactly how vectorization failures were happening, especially as they may be for versions where -fopenmp-simd isn't available anyhow. [Unfortunately for EL7 package-building at least, the add-on recent compilers with -fopenmp-simd aren't available for relevant architectures.]

> 2) `#pragma omp simd` does force vectorization, but maybe it does not
> play as nice with other optimizations

It seems to be like I guessed -- the pragma uses avx, but not fma as far as I can tell by grepping for "vfm" instructions in each case.

> I wonder if the vectorization when using omp simd is tuned for BLAS
> L1-like kernels and gets confused by the much more computationally
> dense GEMM kernel? Can you send the assembly for the kernel with and
> without omp simd?

From which version of gcc and for which (x86_64) target? GCC 8 gives different -- better, one hopes -- optimization reports from 6 in cases I've tried, but isn't widely available.
---
If this is what you are observing then we have had very different experiences... At this point, I am happy with the current reference kernels either with or without `omp simd`. I think the most important things are the vectorization flags, fma (using the C99 fma function or adding flags), and the way the kernel is written, which I think is close to optimal now.

---
You wrote:
> Generally speaking, I would consider such options to be off-limits for
> our purposes. Now, the increase from ~12 to ~22 GFLOPS may have been
> attributable to something else, e.g. the presence of the pragmas
> rather than the use of `-ffast-math`. But it would be good to isolate
> these so we can evaluate them independently.

I can't remember if I checked in this case, but in general you need -ffast-math for vectorization. (I think someone else also remarked on that, and I pointed out up-thread that icc defaults to it, which is probably a reason for its supposedly much better vectorization.) I'd assume that the assembler kernels would have similar properties, but I don't know. Anyhow, make test passes with the generic kernels and -Ofast. I think the factor of two is just down to fma, but I haven't checked -mavx2 vs. -march=haswell.
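As an aside on the fma point: a minimal sketch, assuming C99, of getting hardware FMA without `-ffast-math` by calling `fma()` from `<math.h>`. With a suitable `-march`/`-mfma`, gcc lowers this to vfmadd instructions; without FMA hardware it falls back to the libm routine, so link with `-lm`.

```C
#include <math.h>

// y := alpha*x + y using an explicit fused multiply-add.
void daxpy_fma(int n, double alpha,
               const double* restrict x, double* restrict y)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = fma(alpha, x[i], y[i]);
}
```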
---
As for my general aversion to "fast math" style options, perhaps I am being too conservative. Hopefully others can comment on the potential numerical risk of using `-ffast-math`/`-Ofast`.

Also, in my experience, you don't *need* `-ffast-math` in order for the compiler to emit vectorized object code; even with older versions of gcc such as 5.4, I've seen AVX (though not FMA) vector code emitted via `pragma omp simd`. (Maybe you only meant that `-ffast-math` was needed for *better* vectorized code?)

---
You wrote:
> If this is what you are observing then we have had very different
> experiences...

I would like to understand why, given the range of GCC versions people deal with. I guess I could post results to the tutorial tracker.

> At this point, I am happy with the current reference
> kernels either with or without `omp simd`. I think the most important
> things are the vectorization flags, fma (using the C99 fma function or
> adding flags), and the way the kernel is written, which I think is
> close to optimal now.

Yes, this basically solves the issue; thanks. I should try to check it on non-x86_64, though, where it's relevant. Is the remaining performance lag relative to the tuned version likely to be down to block size?

I think the remaining task is to get GCC target clones working, so we can potentially optimize amongst micro-architectures, but I need to seek advice on how to make that work in cases like this. (For what it's worth, the problem is getting the resolver function generated and retained in the library.) However, I suspect selecting clones is subject to the same issues as family support for ARM, for instance.
---
You wrote:
> As for my general aversion to "fast math" style options, perhaps I am
> being too conservative. Hopefully others can comment on the potential
> numerical risk of using `-ffast-math`/`-Ofast`.

I should perhaps have said something other than fast-math, but I'm not sure which of the sub-options are relevant for vectorization. (Such flags could be confined to specific kernel source with a GCC pragma; see the sketch below.)
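A minimal sketch of that confinement, assuming gcc (the option names are real gcc flags, though how reliably `#pragma GCC optimize` applies them varies across gcc versions):

```C
// Scope unsafe-math options to this translation unit only (GCC-specific).
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("unsafe-math-optimizations", "tree-vectorize")
#endif

void scale_accumulate(int n, double alpha,
                      const double* restrict x, double* restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```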
> Also, in my experience, you don't *need* `-ffast-math` in order for
> the compiler to emit vectorized object code; even with older versions
> of gcc such as 5.4, I've seen AVX (though not FMA) vector code emitted
> via `pragma omp simd`. (Maybe you only meant that `-ffast-math` was
> needed for *better* vectorized code?)

In the case of NEON, the GCC manual says that -funsafe-math-optimizations is required for auto-vectorization. The icc doc lists the diagnostics for auto-vectorization, but it's tedious to find the examples that fail with a (non-default) correct fp-model option. I've seen reductions as the canonical example, and I thought there were similar issues with parallelization, but I've looked unsuccessfully for a good exposition. Probably GCC should document better what the simd pragma does.

For bli_gemm_ref, I do see the same optimization report from -O3 and -Ofast when filtered through "sort -u -n -k2 -t:" (to account for different message repetitions in each case). I suppose I could do the experiment for the rest of the code.
---
@fgvanzee @loveshack re [...].

@loveshack re "the remaining performance lag relative to the tuned version": this is mostly going to be prefetch, but a little bit of unrolling, instruction reordering, etc. In my experience the one thing the compiler did not do well was vector register allocation: the AB product must be kept in registers for highest performance.

---
For info, here's additional vectorization that -Ofast gives with gcc-6 compared with -O3 (not implying that it's all important):

[opt-info output elided]

---
@loveshack to expand on the prefetching, this is what the hand-tuned kernel does:
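[The listing itself was lost from this export. As a loose illustration only — shapes, distances, and locality hints here are assumptions, not the actual hand-tuned kernel — the general pattern is: prefetch the C microtile with write intent on entry, then prefetch the A and B panels a fixed distance ahead inside the k loop.]

```C
#define PF_DIST 64  // prefetch distance in bytes; illustrative only

void ukr_prefetch_pattern(int k, int mr, int nr,
                          const double* restrict a,
                          const double* restrict b,
                          double* restrict c, int rs_c)
{
    // 1. Prefetch the C microtile with write intent and high locality,
    //    so the final accumulation does not stall on memory.
    for (int i = 0; i < mr; ++i)
        __builtin_prefetch(&c[i * rs_c], 1, 3);

    for (int p = 0; p < k; ++p)
    {
        // 2. Stream A and B panel data into cache ahead of its use.
        __builtin_prefetch((const char*)&a[p * mr] + PF_DIST, 0, 3);
        __builtin_prefetch((const char*)&b[p * nr] + PF_DIST, 0, 3);

        // ... rank-1 update of the microtile ...
    }
}
```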
---
I was assuming that BLIS is generally better than reference BLAS, so substituting the latter with BLIS in the OS packages I'm working on would always be sensible. However, I found BLIS more than twice as slow for medium-sized dgemm on x86_64/RHEL7 for a "generic" build, compared with the system reference blas package (which should be built with -O2 -mtune=generic, not -O3). I can't usefully test an architecture without a tuned implementation, but I don't see any reason to think that would be much different, though I haven't looked into the gcc optimization.
Is that expected, or might it be something worth investigating?