DAXPY outperforms DSCAL in multi-threaded environments

Issue Description:

I'm observing a significant performance gap between dscal and daxpy when scaling a vector by a scalar on an Intel(R) Xeon(R) Platinum 8378C CPU @ 2.80GHz. My code computes y=ax, where x is a vector of length 80,000.

Observed Behavior:

Whether I set the OPENBLAS_NUM_THREADS environment variable to 1 or to 16, the execution time for dscal is unchanged, which suggests that it is not using multiple cores.

However, when I replace dscal with an equivalent operation using daxpy, specifically y=(a−1)x+x (at the cost of some precision), I observe a multi-fold performance improvement in the multi-core scenario.
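To make the substitution concrete, here is a minimal sketch in plain Python of the two formulations. It does not call OpenBLAS: `ref_dscal` and `ref_daxpy` are hypothetical reference loops that only mirror the unit-stride semantics of dscal (x := a·x) and daxpy (y := a·x + y), and the value of `a` and the vector contents are made up for illustration. It shows the rounding difference between the two forms, not their speed.

```python
def ref_dscal(alpha, x):
    """In-place x := alpha * x (reference semantics of dscal, unit stride)."""
    for i in range(len(x)):
        x[i] = alpha * x[i]


def ref_daxpy(alpha, x, y):
    """In-place y := alpha * x + y (reference semantics of daxpy, unit stride)."""
    for i in range(len(y)):
        y[i] = alpha * x[i] + y[i]


a = 0.1          # illustrative scalar, not taken from the report
n = 80_000       # vector length from the report
base = [1.0 + i / 7.0 for i in range(n)]  # illustrative data

# Direct scaling: x := a * x
scaled = list(base)
ref_dscal(a, scaled)

# Workaround: x := (a - 1) * x + x, passing the same vector as x and y
workaround = list(base)
ref_daxpy(a - 1.0, workaround, workaround)

# The two results agree only up to floating-point rounding.
max_rel_diff = max(abs(s - w) / abs(s) for s, w in zip(scaled, workaround))
print(f"max relative difference: {max_rel_diff:.3e}")
```

The workaround rounds three times per element (a−1, the product, and the sum) where direct scaling rounds once, which is where the extra error comes from.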

Problem:

Given that dscal and daxpy have very similar computational patterns, I'm seeking to understand why there's such a substantial difference in their multi-core performance. This behavior suggests that dscal is not effectively leveraging the available CPU cores, unlike daxpy.

DAXPY outperforms DSCAL in multi-threaded environments #5328