This repository was archived by the owner on May 7, 2024. It is now read-only.

Review testsuite and clean up #8

Conversation

@kito-cheng (Collaborator) commented Oct 13, 2016

In this change set, I've reviewed some failing cases and marked them as skipped where that is reasonable for RISC-V.

Testsuite after this patch set:

                                    rv32i           rv32g           rv64i           rv64g
gcc       # of expected failures      127 (    0)     127 (    0)     128 (    0)     128 (    0)   
          # of expected passes      82378 (   -2)   82382 (   -2)   83797 (    0)   83817 (    0)   
          # of unexpected failures     62 (   -5)      77 (   -5)      77 (   -7)      92 (   -7)   
          # of unexpected successes     0 (   -1)       0 (   -1)       0 (   -1)       0 (   -1)   
          # of unresolved testcases    10 (    0)      10 (    0)      10 (    0)      10 (    0)   
          # of unsupported tests     1645 (    2)    1636 (    2)    1515 (    2)    1498 (    2)   
g++       # of expected failures      303 (    0)     303 (    0)     303 (    0)     303 (    0)   
          # of expected passes      90602 (    0)   90597 (    0)   92481 (    5)   92482 (    6)   
          # of unexpected failures     30 (   -2)      35 (   -2)      23 (   -4)      22 (   -5)   
          # of unexpected successes     2 (    0)       2 (    0)       2 (    0)       2 (    0)   
          # of unresolved testcases     0 (    0)       0 (    0)       0 (    0)       0 (    0)   
          # of unsupported tests     3876 (    0)    3876 (    0)    3853 (    0)    3853 (    0)   

@sorear commented Oct 13, 2016

Is this based on qemu results?

@kito-cheng (Collaborator, Author)

@sorear Yes, it's based on riscvarchive/riscv-qemu@9bfcd47.

@kito-cheng force-pushed the review-testsuite-and-clean-up branch from 116b016 to f70433c on October 13, 2016 10:54
@kito-cheng (Collaborator, Author)

Updated the testsuite results and fixed the wrong syntax for the dg-skip directives [1][2] (a sketch of the directive syntax follows the results below).

                                    rv32i           rv32g           rv64i           rv64g
gcc       # of expected failures      127 (    0)     127 (    0)     129 (    1)     129 (    1)   
          # of expected passes      82381 (    1)   82385 (    1)   83797 (    0)   83817 (    0)   
          # of unexpected failures     62 (   -5)      77 (   -5)      77 (   -7)      92 (   -7)   
          # of unexpected successes     0 (   -1)       0 (   -1)       0 (   -1)       0 (   -1)   
          # of unresolved testcases    10 (    0)      10 (    0)      10 (    0)      10 (    0)   
          # of unsupported tests     1645 (    2)    1636 (    2)    1516 (    3)    1499 (    3)   
g++       # of expected failures      303 (    0)     303 (    0)     303 (    0)     303 (    0)   
          # of expected passes      90602 (    0)   90597 (    0)   92482 (    6)   92482 (    6)   
          # of unexpected failures     30 (   -2)      35 (   -2)      22 (   -5)      22 (   -5)   
          # of unexpected successes     2 (    0)       2 (    0)       2 (    0)       2 (    0)   
          # of unresolved testcases     0 (    0)       0 (    0)       0 (    0)       0 (    0)   
          # of unsupported tests     3876 (    0)    3876 (    0)    3853 (    0)    3853 (    0)   

[1] f0397f1
[2] 22bb0dc
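
For reference, the skip markers use DejaGnu's dg-skip-if directive. A minimal, purely illustrative test file (the test body, selector, and reason string below are hypothetical, not taken from this pull request) looks like:

/* { dg-do run } */
/* Skip on RISC-V targets; the reason and selector here are examples only.  */
/* { dg-skip-if "test not applicable to RISC-V" { riscv*-*-* } { "*" } { "" } } */

int
main (void)
{
  return 0;
}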

@palmer-dabbelt merged commit f70433c into riscvarchive:riscv-gcc-6.1.0 on Oct 13, 2016
@palmer-dabbelt (Contributor)

Thanks!

@kito-cheng deleted the review-testsuite-and-clean-up branch on July 18, 2019 06:47
cliffordwolf pushed a commit that referenced this pull request Sep 29, 2019
Like the logical operations, expand all shifts early rather than only
sometimes.  The Neon shift expansions are never emitted (not even with
-fneon-for-64bits), so they are not useful.  So all the late expansions
and Neon shift patterns can be removed, and shifts are more optimized
as a result.  Since some extend patterns use Neon DImode shifts, remove
the Neon extend variants and related splits.

A simple example now generates the same efficient code after this
patch with -mfpu=neon and -mfpu=vfp (previously just the fact of
having Neon enabled resulted in inefficient code for no reason).

unsigned long long f(unsigned long long x, unsigned long long y)
{ return x & (y >> 33); }

Before:
	strd    r4, r5, [sp, #-8]!
	lsr     r4, r3, #1
	mov     r5, #0
	and     r1, r1, r5
	and     r0, r0, r4
	ldrd    r4, r5, [sp]
	add     sp, sp, #8
	bx      lr

After:
	and     r0, r0, r3, lsr #1
	mov     r1, #0
	bx      lr

Bootstrap and regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57

    gcc/
	* config/arm/iterators.md (qhs_extenddi_cstr): Update.
	(qhs_extenddi_cstr): Likewise.
	* config/arm/arm.md (ashldi3): Always expand early.
	(ashlsi3): Likewise.
	(ashrsi3): Likewise.
	(zero_extend<mode>di2): Remove Neon variants.
	(extend<mode>di2): Likewise.
	* config/arm/neon.md (ashldi3_neon_noclobber): Remove.
	(signed_shift_di3_neon): Likewise.
	(unsigned_shift_di3_neon): Likewise.
	(ashrdi3_neon_imm_noclobber): Likewise.
	(lshrdi3_neon_imm_noclobber): Likewise.
	(<shift>di3_neon): Likewise.
	(split extend): Remove DI extend split patterns.

   gcc/testsuite/
	* gcc.target/arm/neon-extend-1.c: Remove test.
	* gcc.target/arm/neon-extend-2.c: Remove test.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@274824 138bc75d-0d04-0410-961f-82ee72b054a4
cliffordwolf pushed a commit that referenced this pull request Sep 29, 2019
…ested function

In FDPIC mode, the trampoline generated to support pointers to nested
functions looks like:

	   .word	trampoline address
	   .word	trampoline GOT address
	   ldr 		r12, [pc, #8]
	   ldr 		r9, [pc, #8]
	   ldr		pc, [pc, #8]
	   .word	static chain value
	   .word	GOT address
	   .word	function's address

Because in FDPIC mode function pointers are actually pointers to function
descriptors, we have to generate a function descriptor for the trampoline
itself.
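
To make the connection to source code concrete, here is a hedged, minimal example (not taken from the patch) of the GNU C construct that makes GCC emit such a trampoline: taking the address of a nested function.

typedef int (*callback) (int);

static int
apply (callback fn, int x)
{
  return fn (x);
}

int
outer (int bias)
{
  /* GNU C nested function; its address escapes below, so GCC builds a
     trampoline (and, under FDPIC, a function descriptor for it).  */
  int add_bias (int v) { return v + bias; }

  return apply (add_bias, 7);
}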

2019-09-10  Christophe Lyon  <[email protected]>
	Mickaël Guêné <[email protected]>

	gcc/
	* config/arm/arm.c (arm_asm_trampoline_template): Add FDPIC
	support.
	(arm_trampoline_init): Likewise.
	(arm_trampoline_adjust_address): Likewise.
	* config/arm/arm.h (TRAMPOLINE_SIZE): Likewise.



git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@275571 138bc75d-0d04-0410-961f-82ee72b054a4
kito-cheng pushed a commit that referenced this pull request Mar 18, 2020
When using `check-function-bodies`, the subroutine `parse_function_bodies` uses
the `fluff` regexp to remove uninteresting assembly lines.

Arm targets generate assembly with some lines prefixed by `@`, and these
lines are left behind by this process.

As an example of some lines prefixed by `@`: the assembly output from the
`stacktest1` function in "bfloat16_simd_3_1.c" is:

        .align  2
        .global stacktest1
        .arch armv8.2-a
        .syntax unified
        .arm
        .fpu neon-fp-armv8
        .type   stacktest1, %function
stacktest1:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #8
        add     r3, sp, #6
        vst1.16 {d0[0]}, [r3]
        vld1.16 {d0[0]}, [r3]
        add     sp, sp, #8
        @ sp needed
        bx      lr
        .size   stacktest1, .-stacktest1

It seems that previous uses of `check-function-bodies` in the arm backend have
avoided problems with such lines since they use the `...` regexp in each place
such fluff occurs.

I'm currently writing a patch in which I'd like to match the entire function
body, so I'd like to remove such `@` lines automatically.
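
For context, here is a sketch of how `check-function-bodies` is used in a test; the function name, options and expected lines are illustrative, not taken from bfloat16_simd_3_1.c:

/* { dg-do compile } */
/* { dg-options "-O2" } */
/* { dg-final { check-function-bodies "**" "" } } */

int
add_one (int x)
{
  return x + 1;
}

/* The expected body follows the function; a line containing only "..."
   matches any run of lines, which is how earlier arm tests tolerated the
   `@`-prefixed fluff.  */
/*
** add_one:
**	...
**	bx	lr
*/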

gcc/testsuite/ChangeLog:

2020-03-11  Matthew Malcomson  <[email protected]>

	* lib/scanasm.exp (parse_function_bodies): Lines starting with '@' also
	counted as fluff.
kito-cheng pushed a commit to sifive/riscv-gcc that referenced this pull request Aug 14, 2020
The stack-protector-1.c test fails when compiled for Cortex-M:
- for Cortex-M0/M1, str r0, [sp, #-8]! is not supported
- for Cortex-M3/M4..., the assembler complains that "use of r13 is
  deprecated"

This patch replaces the str instruction with
     sub   sp, sp, #8
     str r0, [sp]
and removes the check for r13, which is unlikely to leak the canary
value.

2020-08-11  Christophe Lyon  <[email protected]>

	gcc/testsuite/
	* gcc.target/arm/stack-protector-1.c: Adapt code to Cortex-M
	restrictions.
kito-cheng pushed a commit that referenced this pull request Jun 23, 2021
This patch fixes PR99748 which shows us trying to pass the argument to
__aeabi_f2iz in the VFP register s0 when the library function is
expecting to use the GPR r0. It also fixes the __aeabi_f2uiz case which
was broken in the same way.

For the testcase in the PR, here is the code we generate before the
patch (with -mfloat-abi=hard -march=armv8.1-m.main+mve -O0):

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    vldr.32 s0, [r7, #4]
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

This becomes:

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    ldr     r0, [r7, #4]    @ float
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

after the patch. We see a similar change for the same testcase with a
cast to unsigned instead of int.
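
The commit message does not quote the source; a hedged reconstruction of the shape of the testcase (the exact code is in PR99748, not here) is simply a float-to-int conversion whose result is checked:

int
main (void)
{
  float f = 1.0f;       /* the 1065353216 literal in the assembly above */
  if ((int) f != 1)     /* lowered to a call to __aeabi_f2iz at -O0 */
    __builtin_abort ();
  return 0;
}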

gcc/ChangeLog:

	PR target/99748
	* config/arm/arm.c (arm_libcall_uses_aapcs_base): Also use base
	PCS for [su]fix_optab.
scpcom pushed a commit to scpcom/riscv-gcc that referenced this pull request Dec 28, 2021
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern uses always the same size
(6) although it can emit a shorter sequence when possible. This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences from symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication by using a sequence of
add/shift rather than using the multiply instruction, so we skip it in
presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c fails because
code like:
static char *p = "Hello World";
char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code)
          .section        .rodata
	  .LC0:
	  .ascii  "Hello World\000"
	  .data
	  p:
	  .word   .LC0
	  .section        .rodata
	  .LC2:
	  .word   p
	  .section .text,"0x20000006",%progbits
	  testchar:
	  push    {r7, lr}
	  add     r7, sp, #0
	  movs    r3, #:upper8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:upper0_7:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower0_7:#.LC2
	  ldr     r3, [r3]
	  ldr     r3, [r3]
	  adds    r3, r3, #4
	  movs    r0, r3
	  mov     sp, r7
	  @ sp needed
	  pop     {r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:
        .section        .rodata
	.LC0:
	.ascii  "Hello World\000"
	.data
	p:
	.word   .LC0
	testchar:
	push    {r7}
	add     r7, sp, #0
	movw    r3, #:lower16:p
	movt    r3, #:upper16:p
	ldr     r3, [r3]
	adds    r3, r3, #4
	mov     r0, r3
	mov     sp, r7
	pop     {r7}
	bx      lr

I haven't found yet how to make code for cortex-m0 apply upper/lower
relocations to "p" instead of .LC2. The current code looks functional,
but could be improved.

2020-02-25  Christophe Lyon  <[email protected]>

	Backport from mainline
	2019-10-18  Christophe Lyon  <[email protected]>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for
	__fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.
pz9115 pushed a commit to plctlab/riscv-gcc that referenced this pull request Mar 7, 2022
…04617]

On
 #define A(n) int foo1##n(void) { return 1##n; }
 #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
 #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
 #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
 #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
 E(0) E(1) E(2) D(30) D(31) C(320) C(321) C(322) C(323) C(324) C(325)
 B(3260) B(3261) B(3262) B(3263) A(32640) A(32641) A(32642)
testcase with
./xgcc -B ./ -c -g -fpic -ffat-lto-objects -flto  -O0 -o foo1.o foo1.c -ffunction-sections
./xgcc -B ./ -shared -g -fpic -flto -O0 -o foo1.so foo1.o
/tmp/ccTW8mBm.debug.temp.o: file not recognized: file format not recognized
(testcase too slow to be included into testsuite).
The problem is clearly reported by readelf:
readelf: foo1.o.debug.temp.o: Warning: Section 2 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 5 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 10 has an out of range sh_link value of 65323
readelf: foo1.o.debug.temp.o: Warning: [ 2]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [ 5]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [10]: Link field (65323) should index a string section.
because simple_object_elf_copy_lto_debug_sections doesn't adjust sh_info and
sh_link fields in ElfNN_Shdr if they are in between SHN_{LO,HI}RESERVE
inclusive.  Not adjusting those is incorrect though, SHN_{LO,HI}RESERVE
range is only relevant to the 16-bit fields, mainly st_shndx in ElfNN_Sym
where if one needs >= SHN_LORESERVE section number, SHN_XINDEX should be
used instead and .symtab_shndx section should contain the real section
index, and in ElfNN_Ehdr e_shnum and e_shstrndx fields, where if >=
SHN_LORESERVE value is needed it should put those into
Shdr[0].sh_{size,link}.  But, sh_{link,info} are 32-bit fields which can
contain any section index.

Note, as simple-object-elf.c mentions, binutils from 2.12 to 2.18 (so before
2011) used to mishandle the > 63.75K sections case and assumed there is a
hole in between the sections, but what
simple_object_elf_copy_lto_debug_sections does wouldn't help in that case
for the debug temp object creation, we'd need to detect the case also in
that routine and take it into account in the remapping etc.  I think
it is not worth it given that it is over 10 years, if somebody needs
63.75K or more sections, better use more recent binutils.
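
A minimal C sketch of the distinction being described (assumed helper names; this is not the libiberty code):

#include <stdint.h>

#define SHN_LORESERVE 0xff00u
#define SHN_XINDEX    0xffffu

/* st_shndx is a 16-bit field, so indices >= SHN_LORESERVE are escaped:
   SHN_XINDEX means "read the real index from .symtab_shndx".  */
static uint32_t
symbol_section_index (uint16_t st_shndx, uint32_t symtab_shndx_entry)
{
  return st_shndx == SHN_XINDEX ? symtab_shndx_entry : st_shndx;
}

/* sh_link and sh_info are 32-bit fields: every value is a real section
   index, so it must be remapped even when it happens to fall in the
   SHN_LORESERVE .. SHN_HIRESERVE numeric range.  */
static uint32_t
remap_sh_link (uint32_t sh_link, const uint32_t *index_map)
{
  return index_map[sh_link];
}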

2022-02-22  Jakub Jelinek  <[email protected]>

	PR lto/104617
	* simple-object-elf.c (simple_object_elf_match): Fix up URL
	in comment.
	(simple_object_elf_copy_lto_debug_sections): Remap sh_info and
	sh_link even if they are in the SHN_LORESERVE .. SHN_HIRESERVE
	range (inclusive).
kito-cheng pushed a commit that referenced this pull request Aug 15, 2022
This patch extends the fix for PR106253 to AArch32.  As with AArch64,
we were using ACLE intrinsics to vectorise scalar built-ins, even
though the two sometimes have different ECF_* flags.  (That in turn
is because the ACLE intrinsics should follow the instruction semantics
as closely as possible, whereas the scalar built-ins follow language
specs.)

The patch also removes the copysignf built-in, which only existed
for this purpose and wasn't a “real” arm_neon.h built-in.

Doing this also has the side-effect of enabling vectorisation of
rint and roundeven.  Logically that should be a separate patch,
but making it one would have meant adding a new int iterator
for the original set of instructions and then removing it again
when including new functions.
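
As an illustration (not one of the tests added by the patch), a loop of this shape can now be vectorised on AArch32 Neon, given options that permit Neon floating-point vectorisation:

void
vec_rint (float *__restrict out, const float *__restrict in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = __builtin_rintf (in[i]);
}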

I've restricted the bswap tests to little-endian because we end
up with excessive spilling on big-endian.  E.g.:

        sub     sp, sp, #8
        vstr    d1, [sp]
        vldr    d16, [sp]
        vrev16.8        d16, d16
        vstr    d16, [sp]
        vldr    d0, [sp]
        add     sp, sp, #8
        @ sp needed
        bx      lr

Similarly, the copysign tests require little-endian because on
big-endian we unnecessarily load the constant from the constant pool:

        vldr.32 s15, .L3
        vdup.32 d0, d7[1]
        vbsl    d0, d2, d1
        bx      lr
.L3:
        .word   -2147483648

gcc/
	PR target/106253
	* config/arm/arm-builtins.cc (arm_builtin_vectorized_function):
	Delete.
	* config/arm/arm-protos.h (arm_builtin_vectorized_function): Delete.
	* config/arm/arm.cc (TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION):
	Delete.
	* config/arm/arm_neon_builtins.def (copysignf): Delete.
	* config/arm/iterators.md (nvrint_pattern): New attribute.
	* config/arm/neon.md (<NEON_VRINT:nvrint_pattern><VCVTF:mode>2):
	New pattern.
	(l<NEON_VCVT:nvrint_pattern><su_optab><VCVTF:mode><v_cmp_result>2):
	Likewise.
	(neon_copysignf<mode>): Rename to...
	(copysign<mode>3): ...this.

gcc/testsuite/
	PR target/106253
	* gcc.target/arm/vect_unary_1.c: New test.
	* gcc.target/arm/vect_binary_1.c: Likewise.
kito-cheng pushed a commit that referenced this pull request Sep 2, 2022
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutations
needed or (in the best case) removing the need for the permutations
entirely.  This patch tries to extend it as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permutation.

- Treat the placement of permutations as a two-way dataflow problem:
  after propagating information from leaves to roots (as now), propagate
  information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutations inside loops have a higher
  cost than permutations outside loops.

- Try to reduce the total number of permutations when optimising for
  size, even if that increases the number of permutations on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".

To give some examples of what the patch does:

int f1(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b+d and then
permute the result of the arithmetic into the same order as a:

        ldr     q1, [x2]
        ldr     q0, [x1]
        ext     v1.16b, v1.16b, v1.16b, #8     // <------
        sshl    v0.4s, v0.4s, v1.4s
        ldr     q1, [x3]
        sub     v0.4s, v0.4s, v1.4s
        rev64   v0.4s, v0.4s                   // <------
        str     q0, [x0]
        ret

The following function:

int f2(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[3] << c[3]) - d[3];
  a[1] = (b[2] << c[2]) - d[2];
  a[2] = (b[1] << c[1]) - d[1];
  a[3] = (b[0] << c[0]) - d[0];
}

continues to push the reverse down to just before the store,
like the previous code did.

In:

int f3(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = (a[0] + c[3]);
      a[1] = (a[1] + c[2]);
      a[2] = (a[2] + c[1]);
      a[3] = (a[3] + c[0]);
      c += 4;
    }
}

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.  When optimising for
speed, we prefer to have the loop operate on the reversed layout,
changing on entry and exit from the loop:

        mov     x3, x0
        adrp    x0, .LC0
        add     x1, x2, 1600
        ldr     q2, [x0, #:lo12:.LC0]
        ldr     q0, [x3]
        mov     v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        .p2align 3,,7
.L6:
        ldr     q1, [x2], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x2, x1
        bne     .L6
        mov     v1.16b, v0.16b
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        str     q0, [x3]
        ret

Similarly, for the very artificial testcase:

int f4(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  int a0 = a[0];
  int a1 = a[1];
  int a2 = a[2];
  int a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 ^= c[0];
      a1 ^= c[1];
      a2 ^= c[2];
      a3 ^= c[3];
      c += 4;
      for (int j = 0; j < 100; ++j)
	{
	  a0 += d[1];
	  a1 += d[0];
	  a2 += d[3];
	  a3 += d[2];
	  d += 4;
	}
      b[0] = a0;
      b[1] = a1;
      b[2] = a2;
      b[3] = a3;
      b += 4;
    }
  a[0] = a0;
  a[1] = a1;
  a[2] = a2;
  a[3] = a3;
}

the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
even though it's part of an SCC that includes the outer loop.
In other words, this is a motivating case for not assigning
permutes at SCC granularity.  The code we get is:

        ldr     q0, [x0]
        mov     x4, x1
        mov     x5, x0
        add     x1, x3, 1600
        add     x3, x4, 1600
        .p2align 3,,7
.L11:
        ldr     q1, [x2], 16
        sub     x0, x1, #1600
        eor     v0.16b, v1.16b, v0.16b
        rev64   v0.4s, v0.4s              // <---
        .p2align 3,,7
.L10:
        ldr     q1, [x0], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x0, x1
        bne     .L10
        rev64   v0.4s, v0.4s              // <---
        add     x1, x0, 1600
        str     q0, [x4], 16
        cmp     x3, x4
        bne     .L11
        str     q0, [x5]
        ret

bb-slp-layout-17.c is a collection of compile tests for problems
I hit with earlier versions of the patch.  The same problems might
show up elsewhere, but it seemed worth having the test anyway.

In slp-11b.c we previously pushed the permutation of the in[i*4]
group down from the load to just before the store.  That didn't
reduce the number or frequency of the permutations (or increase
them either).  But separating the permute from the load meant
that we could no longer use load/store lanes.

Whether load/store lanes are a good idea here is another question.
If there were two sets of loads, and if we could use a single
permutation instead of one per load, then avoiding load/store
lanes should be a good thing even under the current abstract
cost model.  But I think under the current model we should
try to avoid splitting up potential load/store lanes groups
if there is no specific benefit to the split.

Preferring load/store lanes is still a source of missed optimisations
that we should fix one day...

gcc/
	* params.opt (-param=vect-max-layout-candidates=): New parameter.
	* doc/invoke.texi (vect-max-layout-candidates): Document it.
	* tree-vectorizer.h (auto_lane_permutation_t): New typedef.
	(auto_load_permutation_t): Likewise.
	* tree-vect-slp.cc (vect_slp_node_weight): New function.
	(slpg_layout_cost): New class.
	(slpg_vertex): Replace perm_in and perm_out with partition,
	out_degree, weight and out_weight.
	(slpg_partition_info, slpg_partition_layout_costs): New classes.
	(vect_optimize_slp_pass): Likewise, cannibalizing some part of
	the previous vect_optimize_slp.
	(vect_optimize_slp): Use it.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_var_shift):
	Return true for aarch64.
	* gcc.dg/vect/bb-slp-layout-1.c: New test.
	* gcc.dg/vect/bb-slp-layout-2.c: New test.
	* gcc.dg/vect/bb-slp-layout-3.c: New test.
	* gcc.dg/vect/bb-slp-layout-4.c: New test.
	* gcc.dg/vect/bb-slp-layout-5.c: New test.
	* gcc.dg/vect/bb-slp-layout-6.c: New test.
	* gcc.dg/vect/bb-slp-layout-7.c: New test.
	* gcc.dg/vect/bb-slp-layout-8.c: New test.
	* gcc.dg/vect/bb-slp-layout-9.c: New test.
	* gcc.dg/vect/bb-slp-layout-10.c: New test.
	* gcc.dg/vect/bb-slp-layout-11.c: New test.
	* gcc.dg/vect/bb-slp-layout-13.c: New test.
	* gcc.dg/vect/bb-slp-layout-14.c: New test.
	* gcc.dg/vect/bb-slp-layout-15.c: New test.
	* gcc.dg/vect/bb-slp-layout-16.c: New test.
	* gcc.dg/vect/bb-slp-layout-17.c: New test.
	* gcc.dg/vect/slp-11b.c: XFAIL SLP test for load-lanes targets.
zhongjuzhe pushed a commit that referenced this pull request Jan 31, 2023
The aarch64 ISA specification allows a left shift amount to be applied
after extension in the range of 0 to 4 (encoded in the imm3 field).

This is true for at least the following instructions:

 * ADD (extend register)
 * ADDS (extended register)
 * SUB (extended register)

The result of this patch can be seen, when compiling the following code:

uint64_t myadd(uint64_t a, uint64_t b)
{
    return a+(((uint8_t)b)<<4);
}

Without the patch the following sequence will be generated:

0000000000000000 <myadd>:
   0:	d37c1c21 	ubfiz	x1, x1, #4, #8
   4:	8b000020 	add	x0, x1, x0
   8:	d65f03c0 	ret

With the patch the ubfiz will be merged into the add instruction:

0000000000000000 <myadd>:
   0:	8b211000 	add	x0, x0, w1, uxtb #4
   4:	d65f03c0 	ret

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_uxt_size): fix an
	off-by-one in checking the permissible shift-amount.
kito-cheng pushed a commit that referenced this pull request Mar 24, 2023
…hook [PR108583]

This replaces the custom division hook with just an implementation through
add_highpart.  For NEON we implement the add highpart (Addition + extraction of
the upper highpart of the register in the same precision) as ADD + LSR.

This representation allows us to easily optimize the sequence using existing
sequences. This gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b

To get the most optimal sequence however we match (a + ((b + c) >> n)) where n
is half the precision of the mode of the operation into addhn + uaddw which is
a general good optimization on its own and gets us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4

For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

There are multiple RTL representations possible for these optimizations, I did
not represent them using a zero_extend because we seem very inconsistent in this
in the backend.  Since they are unspecs we won't match them from vector ops
anyway. I figured maintainers would prefer this, but my maintainer ouija board
is still out for repairs :)

There are no new tests, as new correctness tests were added to the mid-end and
the existing codegen tests for this already exist.
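
For reference, a purely illustrative kernel of the kind this targets (not one of the existing codegen tests): an unsigned widening multiply followed by a narrowing division by 0xff, as in pixel scaling:

void
scale (unsigned char *__restrict dst, const unsigned char *__restrict src,
       unsigned char level, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (src[i] * level) / 0xff;
}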

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
Incarnation-p-lee pushed a commit that referenced this pull request Apr 28, 2023
This patch adds support for xstormy16's swpb (swap bytes) and swpw (swap
words) instructions.  The most obvious application of these is to implement
the __builtin_bswap16 and __builtin_bswap32 intrinsics.
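
The C source is implied but not quoted in the commit message; a function of this shape (a sketch using the standard __builtin_bswap16 intrinsic) produces the foo sequences shown below:

unsigned short
foo (unsigned short x)
{
  return __builtin_bswap16 (x);
}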

Currently, __builtin_bswap16 is implemented as:
foo:    mov r7,r2
        shl r7,#8
        shr r2,#8
        or r2,r7
        ret

but with this patch becomes:
foo:	swpb r2
	ret

Likewise, __builtin_bswap32 now becomes:
foo:	swpb r2 | swpb r3 | swpw r2,r3
        ret

Finally, the swpw instruction on its own can be used to exchange
two word mode registers without a temporary, so a new pattern and
peephole2 have been added to catch this.  As described in the
PR rtl-optimization/106518, register allocation can (in theory)
be more efficient on targets that provide a swap/exchange instruction.
The slightly unusual swap<mode> naming matches that used in i386.md.

2023-04-26  Roger Sayle  <[email protected]>

gcc/ChangeLog
	* config/stormy16/stormy16.md (bswaphi2): New define_insn.
	(bswapsi2): New define_insn.
	(swaphi): New define_insn to exchange two registers (swpw).
	(define_peephole2): Recognize exchange of registers as swaphi.

gcc/testsuite/ChangeLog
	* gcc.target/xstormy16/bswap16.c: New test case.
	* gcc.target/xstormy16/bswap32.c: Likewise.
	* gcc.target/xstormy16/swpb.c: Likewise.
	* gcc.target/xstormy16/swpw-1.c: Likewise.
	* gcc.target/xstormy16/swpw-2.c: Likewise.