CANN: Add support for async operator submission #12864

Open · wants to merge 3 commits into master
Conversation

@hipudding (Collaborator) commented on Apr 10, 2025:

Submit operators using asynchronous threads to improve performance.

Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.

Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
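The switch described above is an environment-variable gate. A minimal sketch of how such a flag is commonly read (illustrative only; the helper name and parsing rules here are assumptions, not the PR's actual code):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: treats GGML_CANN_ASYNC_MODE as enabled when it is
// set to any non-empty value other than "0". Unset means disabled, which
// matches the PR's "disabled by default" behavior.
static bool cann_async_mode_enabled() {
    const char * v = std::getenv("GGML_CANN_ASYNC_MODE");
    return v != nullptr && *v != '\0' && std::strcmp(v, "0") != 0;
}
```

With a gate like this, async submission would be enabled per run, e.g. `GGML_CANN_ASYNC_MODE=1 ./llama-cli ...`.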

SYNC_MODE

llama_perf_sampler_print:    sampling time =      76.81 ms /   316 runs   (    0.24 ms per token,  4113.94 tokens per second)
llama_perf_context_print:        load time =    2880.65 ms
llama_perf_context_print: prompt eval time =      23.05 ms /    27 tokens (    0.85 ms per token,  1171.11 tokens per second)
llama_perf_context_print:        eval time =    6727.99 ms /   288 runs   (   23.36 ms per token,    42.81 tokens per second)
llama_perf_context_print:       total time =    7838.36 ms /   315 tokens

ASYNC_MODE

llama_perf_sampler_print:    sampling time =      51.17 ms /   220 runs   (    0.23 ms per token,  4299.73 tokens per second)
llama_perf_context_print:        load time =    2751.20 ms
llama_perf_context_print: prompt eval time =      17.26 ms /    27 tokens (    0.64 ms per token,  1563.95 tokens per second)
llama_perf_context_print:        eval time =    3037.53 ms /   192 runs   (   15.82 ms per token,    63.21 tokens per second)
llama_perf_context_print:       total time =    3343.86 ms /   219 tokens

@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Apr 10, 2025
@hipudding self-assigned this on Apr 11, 2025
@hipudding added the "Ascend NPU" label (issues specific to Ascend NPUs) on Apr 11, 2025
@hipudding changed the title from "CANN: add async task submit" to "CANN: Add support for async operator submission" on Apr 15, 2025
@hipudding marked this pull request as ready for review on Apr 15, 2025 at 03:22
#include <thread>
#include <unistd.h>
#include <functional>
#include <deque>
@hipudding (author) commented:
Unused header file.

@noemotiovon (Contributor) left a comment:
This PR improves NPU utilization through asynchronous dispatching — impressive work!


if (!running_) {
thread_ = std::thread(&cann_task_queue::execute, this);
running_ = true;
Contributor commented:
Is there a potential multithreading concurrency issue here?

@hipudding (author) replied:

Yes, fixed.
