[WIP] [Deepin-Kernel-SIG] [linux 6.6-y] [Deepin] Loongarch: optimize syscall reg save #881
base: linux-6.6.y
deepin inclusion
category: performance
This saves one st.d in the hot syscall path and, by moving the zeroing into C, lets the compiler optimize it, which improves syscall performance slightly.
Tested on a 3A6000 with UnixBench, after and before the patch; the full output is reproduced below.
Signed-off-by: Wentao Guan <[email protected]>
Reviewer's Guide
This patch optimizes the LoongArch system-call hot path by eliminating an unnecessary assembly-level store of the zero register and relocating that initialization into C, thus saving one memory write and enabling better compiler optimization for faster syscalls.
Sequence diagram for optimized syscall register initialization in LoongArch
sequenceDiagram
participant User as actor User Process
participant Kernel as Kernel (entry.S)
participant SyscallC as do_syscall (syscall.c)
User->>Kernel: Trigger syscall (trap)
Kernel->>Kernel: Save registers (except PT_R0)
Kernel->>SyscallC: Call do_syscall(regs)
SyscallC->>SyscallC: regs->regs[0] = 0 (now in C, not asm)
SyscallC->>SyscallC: if (nr < NR_syscalls) regs->regs[0] = nr + 1
SyscallC-->>Kernel: Return from do_syscall
Kernel-->>User: Return to user mode
Class diagram for pt_regs and syscall register handling changes
classDiagram
class pt_regs {
+unsigned long regs[32]
+unsigned long prmd
+unsigned long crmd
...
}
class do_syscall {
+void do_syscall(struct pt_regs *regs)
// now sets regs->regs[0] = 0 in C
}
pt_regs <.. do_syscall : uses
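The flow in the two diagrams can be sketched as a small user-space model in C. This is not the kernel source: `struct pt_regs_model`, `NR_SYSCALLS_MODEL`, `mark_syscall` and `mark_syscall_single_store` are hypothetical names for illustration, the syscall limit is an arbitrary placeholder, and reading the syscall number from slot 11 (a7) is an assumption taken from the LoongArch syscall convention rather than from this page.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's NR_syscalls; the real value differs. */
#define NR_SYSCALLS_MODEL 450UL

/* Reduced model of pt_regs: only the 32 general-purpose slots matter here. */
struct pt_regs_model {
    unsigned long regs[32];
};

/*
 * Models the start of do_syscall() after the patch: slot 0 is cleared in C
 * (previously done by "st.d zero, sp, PT_R0" in entry.S) and then overwritten
 * with nr + 1 when the syscall number is in range.
 */
static unsigned long mark_syscall(struct pt_regs_model *regs)
{
    unsigned long nr = regs->regs[11]; /* assumption: syscall number in a7 */

    regs->regs[0] = 0;
    if (nr < NR_SYSCALLS_MODEL)
        regs->regs[0] = nr + 1;
    return regs->regs[0];
}

/* Equivalent single-store form that a compiler is free to generate. */
static unsigned long mark_syscall_single_store(struct pt_regs_model *regs)
{
    unsigned long nr = regs->regs[11];

    regs->regs[0] = (nr < NR_SYSCALLS_MODEL) ? nr + 1 : 0;
    return regs->regs[0];
}

/* Checks that both forms agree for a given syscall number. */
static void check_forms_agree(unsigned long nr)
{
    struct pt_regs_model a = { { 0 } }, b = { { 0 } };

    a.regs[0] = b.regs[0] = 0xdeadUL; /* stale value left in the slot */
    a.regs[11] = b.regs[11] = nr;
    assert(mark_syscall(&a) == mark_syscall_single_store(&b));
}
```

Because both the clear and the conditional update are now visible to the compiler in one function, it may fold them into a single store, which is the effect the patch description appears to rely on. Whether that happens depends on the compiler and optimization level.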
[APPROVALNOTIFIER] This PR is NOT APPROVED
deepin pr auto review
Code review comments:
In summary, it is recommended to ...
Pull Request Overview
This PR optimizes the LoongArch syscall path by eliminating a redundant zero store in the assembly entry and moving the zero-initialization of `regs->regs[0]`, the `pt_regs` slot for register r0, into the C-level `do_syscall`, removing a memory write from the hot path.
- Initialize `regs->regs[0]` to 0 in `do_syscall` instead of in assembly
- Comment out the redundant `st.d zero, sp, PT_R0` in the assembly entry
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description |
---|---|
arch/loongarch/kernel/syscall.c | Add regs->regs[0] = 0; in do_syscall and a comment |
arch/loongarch/kernel/entry.S | Comment out the redundant zero store in handle_syscall |
Comments suppressed due to low confidence (2)
arch/loongarch/kernel/syscall.c:47
- [nitpick] Consider clarifying this comment to explain why moving the zero-initialization here improves performance, e.g., "Initialize regs->regs[0] to 0 in C to eliminate a redundant store in the assembly entry and reduce memory access."
// Move from handle_syscall macro to save a memio
arch/loongarch/kernel/entry.S:34
- [nitpick] Remove the commented-out `st.d zero, sp, PT_R0` instead of leaving dead code in the assembly file, to keep the entry path concise.
# st.d zero, sp, PT_R0
deepin inclusion
category: performance
This saves one st.d in the hot syscall path and, by moving
the zeroing into C, lets the compiler optimize it,
which improves syscall performance slightly.
Tested on a 3A6000.
After patch:
Benchmark Run: Mon Jun 16 2025 20:38:10 - 20:47:09
8 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 47066632.8 lps (10.0 s, 2 samples)
Double-Precision Whetstone 5036.1 MWIPS (10.0 s, 2 samples)
Execl Throughput 4484.2 lps (29.2 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks 656586.0 KBps (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks 175086.0 KBps (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks 1998702.0 KBps (30.0 s, 1 samples)
Pipe Throughput 1365130.7 lps (10.0 s, 2 samples)
Pipe-based Context Switching 126232.9 lps (10.0 s, 2 samples)
Process Creation 9202.7 lps (30.0 s, 1 samples)
Shell Scripts (1 concurrent) 12501.2 lpm (60.0 s, 1 samples)
Shell Scripts (8 concurrent) 4974.9 lpm (60.0 s, 1 samples)
System Call Overhead 1467021.7 lps (10.0 s, 2 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 47066632.8 4033.1
Double-Precision Whetstone 55.0 5036.1 915.7
Execl Throughput 43.0 4484.2 1042.8
File Copy 1024 bufsize 2000 maxblocks 3960.0 656586.0 1658.0
File Copy 256 bufsize 500 maxblocks 1655.0 175086.0 1057.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 1998702.0 3446.0
Pipe Throughput 12440.0 1365130.7 1097.4
Pipe-based Context Switching 4000.0 126232.9 315.6
Process Creation 126.0 9202.7 730.4
Shell Scripts (1 concurrent) 42.4 12501.2 2948.4
Shell Scripts (8 concurrent) 6.0 4974.9 8291.5
System Call Overhead 15000.0 1467021.7 978.0
========
System Benchmarks Index Score 1510.2
------------------------------------------------------------------------
Benchmark Run: Mon Jun 16 2025 20:47:09 - 20:56:08
8 CPUs in system; running 8 parallel copies of tests
Dhrystone 2 using register variables 221748966.2 lps (10.0 s, 2 samples)
Double-Precision Whetstone 37218.5 MWIPS (10.0 s, 2 samples)
Execl Throughput 24364.4 lps (29.0 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks 3681637.0 KBps (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks 1020033.0 KBps (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks 8054794.0 KBps (30.0 s, 1 samples)
Pipe Throughput 8209249.1 lps (10.0 s, 2 samples)
Pipe-based Context Switching 1058150.7 lps (10.0 s, 2 samples)
Process Creation 49636.4 lps (30.0 s, 1 samples)
Shell Scripts (1 concurrent) 43521.6 lpm (60.0 s, 1 samples)
Shell Scripts (8 concurrent) 5672.4 lpm (60.0 s, 1 samples)
System Call Overhead 9407101.4 lps (10.0 s, 2 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 221748966.2 19001.6
Double-Precision Whetstone 55.0 37218.5 6767.0
Execl Throughput 43.0 24364.4 5666.2
File Copy 1024 bufsize 2000 maxblocks 3960.0 3681637.0 9297.1
File Copy 256 bufsize 500 maxblocks 1655.0 1020033.0 6163.3
File Copy 4096 bufsize 8000 maxblocks 5800.0 8054794.0 13887.6
Pipe Throughput 12440.0 8209249.1 6599.1
Pipe-based Context Switching 4000.0 1058150.7 2645.4
Process Creation 126.0 49636.4 3939.4
Shell Scripts (1 concurrent) 42.4 43521.6 10264.5
Shell Scripts (8 concurrent) 6.0 5672.4 9454.0
System Call Overhead 15000.0 9407101.4 6271.4
========
System Benchmarks Index Score 7335.3
Before patch:
Benchmark Run: Mon Jun 16 2025 22:58:12 - 23:07:11
8 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 41001790.5 lps (10.0 s, 2 samples)
Double-Precision Whetstone 5036.1 MWIPS (10.0 s, 2 samples)
Execl Throughput 4482.0 lps (29.6 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks 654904.0 KBps (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks 173158.0 KBps (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks 2008222.0 KBps (30.0 s, 1 samples)
Pipe Throughput 1370314.7 lps (10.0 s, 2 samples)
Pipe-based Context Switching 126314.0 lps (10.0 s, 2 samples)
Process Creation 9063.9 lps (30.0 s, 1 samples)
Shell Scripts (1 concurrent) 12506.3 lpm (60.0 s, 1 samples)
Shell Scripts (8 concurrent) 4972.7 lpm (60.0 s, 1 samples)
System Call Overhead 1448942.6 lps (10.0 s, 2 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 41001790.5 3513.4
Double-Precision Whetstone 55.0 5036.1 915.7
Execl Throughput 43.0 4482.0 1042.3
File Copy 1024 bufsize 2000 maxblocks 3960.0 654904.0 1653.8
File Copy 256 bufsize 500 maxblocks 1655.0 173158.0 1046.3
File Copy 4096 bufsize 8000 maxblocks 5800.0 2008222.0 3462.5
Pipe Throughput 12440.0 1370314.7 1101.5
Pipe-based Context Switching 4000.0 126314.0 315.8
Process Creation 126.0 9063.9 719.4
Shell Scripts (1 concurrent) 42.4 12506.3 2949.6
Shell Scripts (8 concurrent) 6.0 4972.7 8287.8
System Call Overhead 15000.0 1448942.6 966.0
========
System Benchmarks Index Score 1488.9
------------------------------------------------------------------------
Benchmark Run: Mon Jun 16 2025 23:07:11 - 23:16:11
8 CPUs in system; running 8 parallel copies of tests
Dhrystone 2 using register variables 221753204.3 lps (10.0 s, 2 samples)
Double-Precision Whetstone 37215.6 MWIPS (10.0 s, 2 samples)
Execl Throughput 24319.0 lps (30.0 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks 3656936.0 KBps (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks 1016886.0 KBps (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks 7966493.0 KBps (30.0 s, 1 samples)
Pipe Throughput 8211487.8 lps (10.0 s, 2 samples)
Pipe-based Context Switching 1066013.7 lps (10.0 s, 2 samples)
Process Creation 50743.5 lps (30.0 s, 1 samples)
Shell Scripts (1 concurrent) 43664.4 lpm (60.0 s, 1 samples)
Shell Scripts (8 concurrent) 5674.7 lpm (60.0 s, 1 samples)
System Call Overhead 9320000.0 lps (10.0 s, 2 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 221753204.3 19002.0
Double-Precision Whetstone 55.0 37215.6 6766.5
Execl Throughput 43.0 24319.0 5655.6
File Copy 1024 bufsize 2000 maxblocks 3960.0 3656936.0 9234.7
File Copy 256 bufsize 500 maxblocks 1655.0 1016886.0 6144.3
File Copy 4096 bufsize 8000 maxblocks 5800.0 7966493.0 13735.3
Pipe Throughput 12440.0 8211487.8 6600.9
Pipe-based Context Switching 4000.0 1066013.7 2665.0
Process Creation 126.0 50743.5 4027.3
Shell Scripts (1 concurrent) 42.4 43664.4 10298.2
Shell Scripts (8 concurrent) 6.0 5674.7 9457.8
System Call Overhead 15000.0 9320000.0 6213.3
========
System Benchmarks Index Score 7336.1
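To put the System Call Overhead rows in proportion, the relative change can be computed directly from the figures above. The following is a minimal sketch in C; `percent_change`, `index_value` and `report_syscall_overhead` are hypothetical helper names, and the `result / baseline * 10` index formula is inferred from the rows shown here rather than taken from the benchmark's documentation.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Index value as it appears in the tables above: result / baseline * 10. */
static double index_value(double baseline, double result)
{
    return result / baseline * 10.0;
}

/* Prints the System Call Overhead comparison using the figures quoted above. */
static void report_syscall_overhead(void)
{
    /* lps values copied from the "Before patch" and "After patch" runs. */
    const double before_1 = 1448942.6, after_1 = 1467021.7; /* 1 copy   */
    const double before_8 = 9320000.0, after_8 = 9407101.4; /* 8 copies */

    /* Sanity check: the inferred formula reproduces the INDEX column (978.0). */
    double idx = index_value(15000.0, after_1);
    assert(idx > 977.9 && idx < 978.1);

    printf("1 copy:   %+.2f%%\n", percent_change(before_1, after_1));
    printf("8 copies: %+.2f%%\n", percent_change(before_8, after_8));
}
```

By this calculation the System Call Overhead gain is roughly 1.2% with one copy and 0.9% with eight copies. The single-copy overall score moves from 1488.9 to 1510.2, but most of that difference comes from the Dhrystone row (3513.4 to 4033.1), a user-space test, so the System Call Overhead row is the more direct indicator. With only 2 samples per test, differences of this size may also be within run-to-run noise.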
Summary by Sourcery
Optimize the LoongArch syscall path by removing a redundant assembly store and moving the zero-initialization of `regs->regs[0]` (the `pt_regs` slot for register r0) into the C-level `do_syscall`, removing a memory write and slightly improving syscall performance.
Enhancements: