This repository was archived by the owner on May 7, 2024. It is now read-only.
Review testsuite and clean up #8
Merged: palmer-dabbelt merged 9 commits into riscvarchive:riscv-gcc-6.1.0 from kito-cheng:review-testsuite-and-clean-up on Oct 13, 2016
Conversation
Is this based on qemu results?
@sorear yes, it's based on riscvarchive/riscv-qemu@9bfcd47
- Seems like alpha and powerpc.
- Same as mips64 since TRULY_NOOP_TRUNCATION (sizeof (int) * CHAR_BIT, sizeof (long long) * CHAR_BIT) is 0 for RV64.
- Testcase: g++.dg/cpp0x/constexpr-rom.C
116b016 to f70433c
Updated the testsuite results and fixed the wrong syntax for dg-skip [1][2].
Thanks!
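For readers unfamiliar with the directive being fixed here, dg-skip-if lives in a test's comment header. The following is only a hedged sketch of the syntax; the reason string, target triple, and option lists are illustrative, not the exact edits made in this PR.

/* { dg-do compile } */
/* Skip this test on RISC-V targets.  The four arguments are the reason
   string, the target selector, and the include/exclude option lists.  */
/* { dg-skip-if "not applicable to RISC-V" { riscv*-*-* } "*" "" } */

int
foo (void)
{
  return 0;
}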
cliffordwolf pushed a commit that referenced this pull request on Sep 29, 2019
Like the logical operations, expand all shifts early rather than only
sometimes.  The Neon shift expansions are never emitted (not even with
-fneon-for-64bits), so they are not useful.  So all the late expansions
and Neon shift patterns can be removed, and shifts are more optimized
as a result.  Since some extend patterns use Neon DImode shifts, remove
the Neon extend variants and related splits.

A simple example now generates the same efficient code after this patch
with -mfpu=neon and -mfpu=vfp (previously just the fact of having Neon
enabled resulted in inefficient code for no reason).

unsigned long long f(unsigned long long x, unsigned long long y)
{
  return x & (y >> 33);
}

Before:
	strd	r4, r5, [sp, #-8]!
	lsr	r4, r3, #1
	mov	r5, #0
	and	r1, r1, r5
	and	r0, r0, r4
	ldrd	r4, r5, [sp]
	add	sp, sp, #8
	bx	lr

After:
	and	r0, r0, r3, lsr #1
	mov	r1, #0
	bx	lr

Bootstrap and regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57

gcc/
	* config/arm/iterators.md (qhs_extenddi_cstr): Update.
	(qhs_extenddi_cstr): Likewise.
	* config/arm/arm.md (ashldi3): Always expand early.
	(ashlsi3): Likewise.
	(ashrsi3): Likewise.
	(zero_extend<mode>di2): Remove Neon variants.
	(extend<mode>di2): Likewise.
	* config/arm/neon.md (ashldi3_neon_noclobber): Remove.
	(signed_shift_di3_neon): Likewise.
	(unsigned_shift_di3_neon): Likewise.
	(ashrdi3_neon_imm_noclobber): Likewise.
	(lshrdi3_neon_imm_noclobber): Likewise.
	(<shift>di3_neon): Likewise.
	(split extend): Remove DI extend split patterns.

gcc/testsuite/
	* gcc.target/arm/neon-extend-1.c: Remove test.
	* gcc.target/arm/neon-extend-2.c: Remove test.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@274824 138bc75d-0d04-0410-961f-82ee72b054a4
cliffordwolf pushed a commit that referenced this pull request on Sep 29, 2019
…ested function

In FDPIC mode, the trampoline generated to support pointers to nested
functions looks like:

	.word	trampoline address
	.word	trampoline GOT address
	ldr	r12, [pc, #8]
	ldr	r9, [pc, #8]
	ldr	pc, [pc, #8]
	.word	static chain value
	.word	GOT address
	.word	function's address

Because in FDPIC function pointers are actually pointers to function
descriptors, we have to actually generate a function descriptor for
the trampoline.

2019-09-10  Christophe Lyon  <[email protected]>
	    Mickaël Guêné  <[email protected]>

	gcc/
	* config/arm/arm.c (arm_asm_trampoline_template): Add FDPIC
	support.
	(arm_trampoline_init): Likewise.
	(arm_trampoline_adjust_address): Likewise.
	* config/arm/arm.h (TRAMPOLINE_SIZE): Likewise.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@275571 138bc75d-0d04-0410-961f-82ee72b054a4
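As background: trampolines are only materialised when the address of a GNU C nested function escapes. The code below is a minimal, hedged illustration of that situation, not a test from the commit; the function names are made up for the example.

static int
call_with_callback (int (*cb) (int), int x)
{
  return cb (x);
}

int
outer (int bias)
{
  /* GNU C nested function: it reads 'bias' from the enclosing frame, so
     taking its address forces GCC to emit a trampoline (and, in FDPIC
     mode, a function descriptor for that trampoline).  */
  int inner (int v)
  {
    return v + bias;
  }
  return call_with_callback (inner, 42);
}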
kito-cheng pushed a commit that referenced this pull request on Mar 18, 2020
When using `check-function-bodies`, the subroutine `parse_function_bodies`
uses the `fluff` regexp to remove uninteresting assembly lines.

Arm targets generate assembly with some lines prefixed by `@`; these
lines are left in by this process.

As an example of some lines prefixed by `@`, the assembly output from
the `stacktest1` function in "bfloat16_simd_3_1.c" is:

	.align	2
	.global	stacktest1
	.arch armv8.2-a
	.syntax unified
	.arm
	.fpu neon-fp-armv8
	.type	stacktest1, %function
stacktest1:
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	sub	sp, sp, #8
	add	r3, sp, #6
	vst1.16	{d0[0]}, [r3]
	vld1.16	{d0[0]}, [r3]
	add	sp, sp, #8
	@ sp needed
	bx	lr
	.size	stacktest1, .-stacktest1

It seems that previous uses of `check-function-bodies` in the arm
backend have avoided problems with such lines since they use the `...`
regexp in each place such fluff occurs.

I'm currently writing a patch where I'd like to match the entire
function body, so I'd like to remove such `@` lines automatically.

gcc/testsuite/ChangeLog:

2020-03-11  Matthew Malcomson  <[email protected]>

	* lib/scanasm.exp (parse_function_bodies): Lines starting with
	'@' also counted as fluff.
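For context, a check-function-bodies test is an ordinary C file whose expected assembly is written in a special comment block. The sketch below is hedged and purely illustrative (the function and the expected instructions are not part of this commit); it just shows where the '@' fluff lines would otherwise have had to be matched.

/* { dg-final { check-function-bodies "**" "" } } */

/* With the scanasm.exp change above, the '@ args = ...' style comment
   lines emitted by the Arm backend count as fluff and do not need to
   appear in the expected body.  */
/*
** add_one:
**	adds	r0, r0, #1
**	bx	lr
*/
int
add_one (int x)
{
  return x + 1;
}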
kito-cheng pushed a commit to sifive/riscv-gcc that referenced this pull request on Aug 14, 2020
The stack-protector-1.c test fails when compiled for Cortex-M:
- for Cortex-M0/M1, str r0, [sp, #-8]! is not supported
- for Cortex-M3/M4..., the assembler complains that "use of r13 is
  deprecated"

This patch replaces the str instruction with
	sub	sp, sp, #8
	str	r0, [sp]
and removes the check for r13, which is unlikely to leak the canary
value.

2020-08-11  Christophe Lyon  <[email protected]>

	gcc/testsuite/
	* gcc.target/arm/stack-protector-1.c: Adapt code to Cortex-M
	restrictions.
kito-cheng pushed a commit that referenced this pull request on Jun 23, 2021
This patch fixes PR99748 which shows us trying to pass the argument to
__aeabi_f2iz in the VFP register s0 when the library function is
expecting to use the GPR r0.  It also fixes the __aeabi_f2uiz case
which was broken in the same way.

For the testcase in the PR, here is the code we generate before the
patch (with -mfloat-abi=hard -march=armv8.1-m.main+mve -O0):

main:
	push	{r7, lr}
	sub	sp, sp, #8
	add	r7, sp, #0
	mov	r3, #1065353216
	str	r3, [r7, #4]	@ float
	vldr.32	s0, [r7, #4]
	bl	__aeabi_f2iz
	mov	r3, r0
	cmp	r3, #1
	[...]

This becomes:

main:
	push	{r7, lr}
	sub	sp, sp, #8
	add	r7, sp, #0
	mov	r3, #1065353216
	str	r3, [r7, #4]	@ float
	ldr	r0, [r7, #4]	@ float
	bl	__aeabi_f2iz
	mov	r3, r0
	cmp	r3, #1
	[...]

after the patch. We see a similar change for the same testcase with a
cast to unsigned instead of int.

gcc/ChangeLog:

	PR target/99748
	* config/arm/arm.c (arm_libcall_uses_aapcs_base): Also use base
	PCS for [su]fix_optab.
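A hedged sketch of the kind of source that exercises this fix is shown below. It is not the exact testcase from PR99748; it is just a float-to-int and float-to-unsigned conversion that, with the options quoted above, is lowered to the __aeabi_f2iz / __aeabi_f2uiz libcalls whose float argument must travel in r0 under the base PCS.

int
main (void)
{
  volatile float f = 1.0f;            /* volatile keeps the conversions live at -O0 */
  int i = (int) f;                    /* lowered to a __aeabi_f2iz libcall here   */
  unsigned int u = (unsigned int) f;  /* lowered to __aeabi_f2uiz                 */
  return !(i == 1 && u == 1u);
}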
scpcom pushed a commit to scpcom/riscv-gcc that referenced this pull request on Dec 28, 2021
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern always uses the same size
(6), although it can emit a shorter sequence when possible.  This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences from symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication by using a sequence of
add/shift rather than using the multiply instruction, so we skip it
in the presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c test fails
because code like:

static char *p = "Hello World";

char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code):

	.section	.rodata
.LC0:
	.ascii	"Hello World\000"
	.data
p:
	.word	.LC0
	.section	.rodata
.LC2:
	.word	p
	.section	.text,"0x20000006",%progbits
testchar:
	push	{r7, lr}
	add	r7, sp, #0
	movs	r3, #:upper8_15:#.LC2
	lsls	r3, #8
	adds	r3, #:upper0_7:#.LC2
	lsls	r3, #8
	adds	r3, #:lower8_15:#.LC2
	lsls	r3, #8
	adds	r3, #:lower0_7:#.LC2
	ldr	r3, [r3]
	ldr	r3, [r3]
	adds	r3, r3, #4
	movs	r0, r3
	mov	sp, r7
	@ sp needed
	pop	{r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:

	.section	.rodata
.LC0:
	.ascii	"Hello World\000"
	.data
p:
	.word	.LC0
testchar:
	push	{r7}
	add	r7, sp, #0
	movw	r3, #:lower16:p
	movt	r3, #:upper16:p
	ldr	r3, [r3]
	adds	r3, r3, #4
	mov	r0, r3
	mov	sp, r7
	pop	{r7}
	bx	lr

I haven't found yet how to make the code for cortex-m0 apply
upper/lower relocations to "p" instead of .LC2.  The current code
looks functional, but could be improved.

2020-02-25  Christophe Lyon  <[email protected]>

	Backport from mainline
	2019-10-18  Christophe Lyon  <[email protected]>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new
	prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for __fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.
pz9115 pushed a commit to plctlab/riscv-gcc that referenced this pull request on Mar 7, 2022
…04617]

On

#define A(n) int foo1##n(void) { return 1##n; }
#define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
#define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
#define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
#define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
E(0) E(1) E(2) D(30) D(31) C(320) C(321) C(322) C(323) C(324) C(325)
B(3260) B(3261) B(3262) B(3263) A(32640) A(32641) A(32642)

testcase with

./xgcc -B ./ -c -g -fpic -ffat-lto-objects -flto -O0 -o foo1.o foo1.c -ffunction-sections
./xgcc -B ./ -shared -g -fpic -flto -O0 -o foo1.so foo1.o
/tmp/ccTW8mBm.debug.temp.o: file not recognized: file format not recognized

(testcase too slow to be included into testsuite).

The problem is clearly reported by readelf:

readelf: foo1.o.debug.temp.o: Warning: Section 2 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 5 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 10 has an out of range sh_link value of 65323
readelf: foo1.o.debug.temp.o: Warning: [ 2]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [ 5]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [10]: Link field (65323) should index a string section.

because simple_object_elf_copy_lto_debug_sections doesn't adjust
sh_info and sh_link fields in ElfNN_Shdr if they are in between
SHN_{LO,HI}RESERVE inclusive.  Not adjusting those is incorrect
though, SHN_{LO,HI}RESERVE range is only relevant to the 16-bit
fields, mainly st_shndx in ElfNN_Sym where if one needs
>= SHN_LORESERVE section number, SHN_XINDEX should be used instead
and .symtab_shndx section should contain the real section index, and
in ElfNN_Ehdr e_shnum and e_shstrndx fields, where if >= SHN_LORESERVE
value is needed it should put those into Shdr[0].sh_{size,link}.
But, sh_{link,info} are 32-bit fields which can contain any section
index.

Note, as simple-object-elf.c mentions, binutils from 2.12 to 2.18 (so
before 2011) used to mishandle the > 63.75K sections case and assumed
there is a hole in between the sections, but what
simple_object_elf_copy_lto_debug_sections does wouldn't help in that
case for the debug temp object creation, we'd need to detect the case
also in that routine and take it into account in the remapping etc.
I think it is not worth it given that it is over 10 years, if somebody
needs 63.75K or more sections, better use more recent binutils.

2022-02-22  Jakub Jelinek  <[email protected]>

	PR lto/104617
	* simple-object-elf.c (simple_object_elf_match): Fix up URL
	in comment.
	(simple_object_elf_copy_lto_debug_sections): Remap sh_info and
	sh_link even if they are in the SHN_LORESERVE .. SHN_HIRESERVE
	range (inclusive).
kito-cheng pushed a commit that referenced this pull request on Aug 15, 2022
This patch extends the fix for PR106253 to AArch32.  As with AArch64,
we were using ACLE intrinsics to vectorise scalar built-ins, even
though the two sometimes have different ECF_* flags.  (That in turn is
because the ACLE intrinsics should follow the instruction semantics as
closely as possible, whereas the scalar built-ins follow language
specs.)

The patch also removes the copysignf built-in, which only existed for
this purpose and wasn't a “real” arm_neon.h built-in.

Doing this also has the side-effect of enabling vectorisation of rint
and roundeven.  Logically that should be a separate patch, but making
it one would have meant adding a new int iterator for the original set
of instructions and then removing it again when including new
functions.

I've restricted the bswap tests to little-endian because we end up
with excessive spilling on big-endian.  E.g.:

	sub	sp, sp, #8
	vstr	d1, [sp]
	vldr	d16, [sp]
	vrev16.8	d16, d16
	vstr	d16, [sp]
	vldr	d0, [sp]
	add	sp, sp, #8
	@ sp needed
	bx	lr

Similarly, the copysign tests require little-endian because on
big-endian we unnecessarily load the constant from the constant pool:

	vldr.32	s15, .L3
	vdup.32	d0, d7[1]
	vbsl	d0, d2, d1
	bx	lr
.L3:
	.word	-2147483648

gcc/
	PR target/106253
	* config/arm/arm-builtins.cc (arm_builtin_vectorized_function):
	Delete.
	* config/arm/arm-protos.h (arm_builtin_vectorized_function):
	Delete.
	* config/arm/arm.cc (TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION):
	Delete.
	* config/arm/arm_neon_builtins.def (copysignf): Delete.
	* config/arm/iterators.md (nvrint_pattern): New attribute.
	* config/arm/neon.md (<NEON_VRINT:nvrint_pattern><VCVTF:mode>2):
	New pattern.
	(l<NEON_VCVT:nvrint_pattern><su_optab><VCVTF:mode><v_cmp_result>2):
	Likewise.
	(neon_copysignf<mode>): Rename to...
	(copysign<mode>3): ...this.

gcc/testsuite/
	PR target/106253
	* gcc.target/arm/vect_unary_1.c: New test.
	* gcc.target/arm/vect_binary_1.c: Likewise.
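The following is a hedged sketch of the kind of scalar loops this change lets the AArch32 vectoriser handle; the functions are illustrative only and are not the new testcases listed in the ChangeLog.

void
vect_copysign (float *restrict out, const float *restrict x,
               const float *restrict y, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = __builtin_copysignf (x[i], y[i]);  /* now matched via copysign<mode>3 */
}

void
vect_bswap16 (unsigned short *restrict out, const unsigned short *restrict in,
              int n)
{
  for (int i = 0; i < n; i++)
    out[i] = __builtin_bswap16 (in[i]);         /* vectorised on little-endian */
}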
kito-cheng pushed a commit that referenced this pull request on Sep 2, 2022
Currently SLP tries to force permute operations "down" the graph from
loads in the hope of reducing the total number of permutations needed
or (in the best case) removing the need for the permutations entirely.
This patch tries to extend it as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permutation.

- Treat the placement of permutations as a two-way dataflow problem:
  after propagating information from leaves to roots (as now),
  propagate information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutations inside loops have a higher
  cost than permutations outside loops.

- Try to reduce the total number of permutations when optimising for
  size, even if that increases the number of permutations on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for a detailed
description.

The original motivation for doing this was to add a framework that
would allow other layout differences in future.  The two main ones
are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this by
  representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".

To give some examples of what the patch does:

int f1(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising for
size we instead permute c into the same order as b+d and then permute
the result of the arithmetic into the same order as a:

	ldr	q1, [x2]
	ldr	q0, [x1]
	ext	v1.16b, v1.16b, v1.16b, #8     // <------
	sshl	v0.4s, v0.4s, v1.4s
	ldr	q1, [x3]
	sub	v0.4s, v0.4s, v1.4s
	rev64	v0.4s, v0.4s                   // <------
	str	q0, [x0]
	ret

The following function:

int f2(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[3] << c[3]) - d[3];
  a[1] = (b[2] << c[2]) - d[2];
  a[2] = (b[1] << c[1]) - d[1];
  a[3] = (b[0] << c[0]) - d[0];
}

continues to push the reverse down to just before the store, like the
previous code did.

In:

int f3(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = (a[0] + c[3]);
      a[1] = (a[1] + c[2]);
      a[2] = (a[2] + c[1]);
      a[3] = (a[3] + c[0]);
      c += 4;
    }
}

the loads of a are hoisted and the stores of a are sunk, so that only
the load from c happens in the loop.  When optimising for speed, we
prefer to have the loop operate on the reversed layout, changing on
entry and exit from the loop:

	mov	x3, x0
	adrp	x0, .LC0
	add	x1, x2, 1600
	ldr	q2, [x0, #:lo12:.LC0]
	ldr	q0, [x3]
	mov	v1.16b, v0.16b
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b   // <--------
	.p2align 3,,7
.L6:
	ldr	q1, [x2], 16
	add	v0.4s, v0.4s, v1.4s
	cmp	x2, x1
	bne	.L6
	mov	v1.16b, v0.16b
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b   // <--------
	str	q0, [x3]
	ret

Similarly, for the very artificial testcase:

int f4(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  int a0 = a[0];
  int a1 = a[1];
  int a2 = a[2];
  int a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 ^= c[0];
      a1 ^= c[1];
      a2 ^= c[2];
      a3 ^= c[3];
      c += 4;
      for (int j = 0; j < 100; ++j)
	{
	  a0 += d[1];
	  a1 += d[0];
	  a2 += d[3];
	  a3 += d[2];
	  d += 4;
	}
      b[0] = a0;
      b[1] = a1;
      b[2] = a2;
      b[3] = a3;
      b += 4;
    }
  a[0] = a0;
  a[1] = a1;
  a[2] = a2;
  a[3] = a3;
}

the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
even though it's part of an SCC that includes the outer loop.  In
other words, this is a motivating case for not assigning permutes at
SCC granularity.  The code we get is:

	ldr	q0, [x0]
	mov	x4, x1
	mov	x5, x0
	add	x1, x3, 1600
	add	x3, x4, 1600
	.p2align 3,,7
.L11:
	ldr	q1, [x2], 16
	sub	x0, x1, #1600
	eor	v0.16b, v1.16b, v0.16b
	rev64	v0.4s, v0.4s              // <---
	.p2align 3,,7
.L10:
	ldr	q1, [x0], 16
	add	v0.4s, v0.4s, v1.4s
	cmp	x0, x1
	bne	.L10
	rev64	v0.4s, v0.4s              // <---
	add	x1, x0, 1600
	str	q0, [x4], 16
	cmp	x3, x4
	bne	.L11
	str	q0, [x5]
	ret

bb-slp-layout-17.c is a collection of compile tests for problems I hit
with earlier versions of the patch.  The same problems might show up
elsewhere, but it seemed worth having the test anyway.

In slp-11b.c we previously pushed the permutation of the in[i*4] group
down from the load to just before the store.  That didn't reduce the
number or frequency of the permutations (or increase them either).
But separating the permute from the load meant that we could no longer
use load/store lanes.

Whether load/store lanes are a good idea here is another question.  If
there were two sets of loads, and if we could use a single permutation
instead of one per load, then avoiding load/store lanes should be a
good thing even under the current abstract cost model.  But I think
under the current model we should try to avoid splitting up potential
load/store lanes groups if there is no specific benefit to the split.

Preferring load/store lanes is still a source of missed optimisations
that we should fix one day...

gcc/
	* params.opt (-param=vect-max-layout-candidates=): New parameter.
	* doc/invoke.texi (vect-max-layout-candidates): Document it.
	* tree-vectorizer.h (auto_lane_permutation_t): New typedef.
	(auto_load_permutation_t): Likewise.
	* tree-vect-slp.cc (vect_slp_node_weight): New function.
	(slpg_layout_cost): New class.
	(slpg_vertex): Replace perm_in and perm_out with partition,
	out_degree, weight and out_weight.
	(slpg_partition_info, slpg_partition_layout_costs): New classes.
	(vect_optimize_slp_pass): Likewise, cannibalizing some part of
	the previous vect_optimize_slp.
	(vect_optimize_slp): Use it.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_var_shift):
	Return true for aarch64.
	* gcc.dg/vect/bb-slp-layout-1.c: New test.
	* gcc.dg/vect/bb-slp-layout-2.c: New test.
	* gcc.dg/vect/bb-slp-layout-3.c: New test.
	* gcc.dg/vect/bb-slp-layout-4.c: New test.
	* gcc.dg/vect/bb-slp-layout-5.c: New test.
	* gcc.dg/vect/bb-slp-layout-6.c: New test.
	* gcc.dg/vect/bb-slp-layout-7.c: New test.
	* gcc.dg/vect/bb-slp-layout-8.c: New test.
	* gcc.dg/vect/bb-slp-layout-9.c: New test.
	* gcc.dg/vect/bb-slp-layout-10.c: New test.
	* gcc.dg/vect/bb-slp-layout-11.c: New test.
	* gcc.dg/vect/bb-slp-layout-13.c: New test.
	* gcc.dg/vect/bb-slp-layout-14.c: New test.
	* gcc.dg/vect/bb-slp-layout-15.c: New test.
	* gcc.dg/vect/bb-slp-layout-16.c: New test.
	* gcc.dg/vect/bb-slp-layout-17.c: New test.
	* gcc.dg/vect/slp-11b.c: XFAIL SLP test for load-lanes targets.
zhongjuzhe pushed a commit that referenced this pull request on Jan 31, 2023
The aarch64 ISA specification allows a left shift amount to be applied
after extension in the range of 0 to 4 (encoded in the imm3 field).

This is true for at least the following instructions:

 * ADD (extend register)
 * ADDS (extended register)
 * SUB (extended register)

The result of this patch can be seen when compiling the following code:

uint64_t myadd(uint64_t a, uint64_t b)
{
  return a+(((uint8_t)b)<<4);
}

Without the patch the following sequence will be generated:

0000000000000000 <myadd>:
   0:	d37c1c21	ubfiz	x1, x1, #4, #8
   4:	8b000020	add	x0, x1, x0
   8:	d65f03c0	ret

With the patch the ubfiz will be merged into the add instruction:

0000000000000000 <myadd>:
   0:	8b211000	add	x0, x0, w1, uxtb #4
   4:	d65f03c0	ret

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_uxt_size): Fix an off-by-one
	in checking the permissible shift-amount.
kito-cheng pushed a commit that referenced this pull request on Mar 24, 2023
…hook [PR108583]

This replaces the custom division hook with just an implementation
through add_highpart.  For NEON we implement the add highpart
(addition + extraction of the upper highpart of the register in the
same precision) as ADD + LSR.

This representation allows us to easily optimize the sequence using
existing sequences.  This gets us a pretty decent sequence using SRA:

	umull	v1.8h, v0.8b, v3.8b
	umull2	v0.8h, v0.16b, v3.16b
	add	v5.8h, v1.8h, v2.8h
	add	v4.8h, v0.8h, v2.8h
	usra	v1.8h, v5.8h, 8
	usra	v0.8h, v4.8h, 8
	uzp2	v1.16b, v1.16b, v0.16b

To get the most optimal sequence however we match (a + ((b + c) >> n))
where n is half the precision of the mode of the operation into
addhn + uaddw, which is a generally good optimization on its own and
gets us back to:

.L4:
	ldr	q0, [x3]
	umull	v1.8h, v0.8b, v5.8b
	umull2	v0.8h, v0.16b, v5.16b
	addhn	v3.8b, v1.8h, v4.8h
	addhn	v2.8b, v0.8h, v4.8h
	uaddw	v1.8h, v1.8h, v3.8b
	uaddw	v0.8h, v0.8h, v2.8b
	uzp2	v1.16b, v1.16b, v0.16b
	str	q1, [x3], 16
	cmp	x3, x4
	bne	.L4

For SVE2 we optimize the initial sequence to the same ADD + LSR, which
gets us:

.L3:
	ld1b	z0.h, p0/z, [x0, x3]
	mul	z0.h, p1/m, z0.h, z2.h
	add	z1.h, z0.h, z3.h
	usra	z0.h, z1.h, #8
	lsr	z0.h, z0.h, #8
	st1b	z0.h, p0, [x0, x3]
	inch	x3
	whilelo	p0.h, w3, w2
	b.any	.L3
.L1:
	ret

and to get the most optimal sequence I match (a + b) >> n (same
constraint on n) to addhnb, which gets us to:

.L3:
	ld1b	z0.h, p0/z, [x0, x3]
	mul	z0.h, p1/m, z0.h, z2.h
	addhnb	z1.b, z0.h, z3.h
	addhnb	z0.b, z0.h, z1.h
	st1b	z0.h, p0, [x0, x3]
	inch	x3
	whilelo	p0.h, w3, w2
	b.any	.L3

There are multiple RTL representations possible for these
optimizations; I did not represent them using a zero_extend because we
seem very inconsistent in this in the backend.  Since they are unspecs
we won't match them from vector ops anyway.  I figured maintainers
would prefer this, but my maintainer ouija board is still out for
repairs :)

There are no new tests, as new correctness tests were added to the
mid-end and the existing codegen tests for this already exist.

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3):
	Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
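As a hedged illustration of the scalar pattern behind these sequences, the loop below performs a widening multiply followed by an unsigned division by 2^8 - 1; it is assumed to be representative of the kind of code the division-as-shifts lowering targets, and is not one of the existing codegen tests mentioned above.

void
scale_bytes (unsigned char *restrict x, unsigned char factor, int n)
{
  for (int i = 0; i < n; i++)
    /* Widen to 16 bits, multiply, then divide by 0xff; the division is
       what the addhn/addhnb sequences above implement.  */
    x[i] = ((unsigned short) x[i] * factor) / 0xff;
}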
Incarnation-p-lee pushed a commit that referenced this pull request on Apr 28, 2023
This patch adds support for xstormy16's swpb (swap bytes) and swpw
(swap words) instructions.  The most obvious application of these is
to implement the __builtin_bswap16 and __builtin_bswap32 intrinsics.

Currently, __builtin_bswap16 is implemented as:

foo:	mov r7,r2
	shl r7,#8
	shr r2,#8
	or r2,r7
	ret

but with this patch becomes:

foo:	swpb r2
	ret

Likewise, __builtin_bswap32 now becomes:

foo:	swpb r2 | swpb r3 | swpw r2,r3
	ret

Finally, the swpw instruction on its own can be used to exchange two
word mode registers without a temporary, so a new pattern and
peephole2 have been added to catch this.  As described in PR
rtl-optimization/106518, register allocation can (in theory) be more
efficient on targets that provide a swap/exchange instruction.  The
slightly unusual swap<mode> naming matches that used in i386.md.

2024-04-26  Roger Sayle  <[email protected]>

gcc/ChangeLog
	* config/stormy16/stormy16.md (bswaphi2): New define_insn.
	(bswapsi2): New define_insn.
	(swaphi): New define_insn to exchange two registers (swpw).
	(define_peephole2): Recognize exchange of registers as swaphi.

gcc/testsuite/ChangeLog
	* gcc.target/xstormy16/bswap16.c: New test case.
	* gcc.target/xstormy16/bswap32.c: Likewise.
	* gcc.target/xstormy16/swpb.c: Likewise.
	* gcc.target/xstormy16/swpw-1.c: Likewise.
	* gcc.target/xstormy16/swpw-2.c: Likewise.
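For reference, the scalar sources that exercise these patterns are simply the byte-swap builtins; the sketch below is illustrative and is not the exact content of the new test files listed in the ChangeLog.

/* Each function should now compile to the swpb/swpw forms shown above.  */
unsigned short
bswap16 (unsigned short x)
{
  return __builtin_bswap16 (x);
}

unsigned int
bswap32 (unsigned int x)
{
  return __builtin_bswap32 (x);
}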
In this change set, I've reviewed the failing cases and marked them as skipped where that is reasonable for RISC-V.
Testsuite after this patch set: