DAXPY outperforms DSCAL in multi-threaded environments #5328

Open
@HKUST-Yujun

Issue Description:

I'm observing a significant performance disparity between dscal and daxpy when performing vector-scalar multiplication on an Intel(R) Xeon(R) Platinum 8378C CPU @ 2.80GHz. My code involves the operation y=ax, where x is a vector of length 80,000.

Observed Behavior:

Whether I set the OPENBLAS_NUM_THREADS environment variable to 1 or 16, the execution time for dscal is unchanged, which suggests that it is not using multiple cores.

However, when I replace dscal with a mathematically equivalent operation using daxpy, specifically y=(a−1)x+x (at the cost of some precision), I observe a multi-fold performance improvement in the multi-core scenario.
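
To make the two formulations concrete, here is a minimal sketch of their arithmetic in plain Python. It is not the benchmark code and it does not call OpenBLAS: ref_dscal, ref_daxpy, scale_via_daxpy and max_rel_diff are hypothetical stand-ins that only mirror the BLAS semantics (dscal: x := a*x, daxpy: y := a*x + y), and the test values are assumptions chosen for illustration. It shows why the daxpy form can differ slightly from the direct scaling: it rounds twice per element instead of once.

```python
# Reference semantics only: plain-Python stand-ins, NOT OpenBLAS calls.
# All function names here are hypothetical and chosen for illustration.

def ref_dscal(a, x):
    """Return a*x elementwise (BLAS dscal does this in place: x := a*x)."""
    return [a * xi for xi in x]


def ref_daxpy(a, x, y):
    """Return a*x + y elementwise (BLAS daxpy updates y in place: y := a*x + y)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def scale_via_daxpy(a, x):
    """Workaround from the report: start from y = x, then y := (a - 1)*x + y."""
    return ref_daxpy(a - 1.0, x, list(x))


def max_rel_diff(u, v):
    """Largest relative difference between two vectors, skipping zeros in u."""
    return max(abs(ui - vi) / abs(ui) for ui, vi in zip(u, v) if ui != 0.0)


if __name__ == "__main__":
    n = 80_000                      # vector length from the report
    a = 1.1                         # assumed scalar, for illustration only
    x = [(i + 1) / n for i in range(n)]

    direct = ref_dscal(a, x)        # one rounding per element
    workaround = scale_via_daxpy(a, x)  # multiply, then add: two roundings

    print("max relative difference:", max_rel_diff(direct, workaround))
```

The difference between the two results is expected to be on the order of the double-precision rounding error, which is the precision loss mentioned above. This sketch says nothing about timing; the performance gap itself can only be reproduced with the real dscal and daxpy routines.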

Problem:

Given that dscal and daxpy have very similar computational patterns, I'm seeking to understand why there's such a substantial difference in their multi-core performance. This behavior suggests that dscal is not effectively leveraging the available CPU cores, unlike daxpy.
