Issue Description:
I'm observing a significant performance disparity between dscal and daxpy when performing vector-scalar multiplication on an Intel(R) Xeon(R) Platinum 8378C CPU @ 2.80GHz. My code computes y = ax, where a is a scalar and x is a vector of length 80,000.
Observed Behavior:
Whether I set the OPENBLAS_NUM_THREADS environment variable to 1 or to 16, the execution time of dscal is unchanged, which suggests that it is not using multiple cores.
However, when I replace dscal with an equivalent daxpy call, specifically y=(a−1)x+x with y initialized to a copy of x (at the cost of some precision from the extra rounding step), I observe a several-fold speedup in the multi-core case.
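For reference, here is a minimal sketch of the rewrite described above. It does not call OpenBLAS at all: `ref_dscal` and `ref_daxpy` are hypothetical pure-Python stand-ins that only mirror the documented semantics of the two routines (x := a·x and y := alpha·x + y), and the scalar 1.1 and the input values are made up for illustration. It shows why the two forms agree up to rounding, not how fast either one runs.

```python
def ref_dscal(a, x):
    """Stand-in for BLAS dscal semantics: x := a * x, in place."""
    for i in range(len(x)):
        x[i] = a * x[i]


def ref_daxpy(alpha, x, y):
    """Stand-in for BLAS daxpy semantics: y := alpha * x + y, in place."""
    for i in range(len(x)):
        y[i] = alpha * x[i] + y[i]


def scale_via_dscal(a, x):
    """Compute a * x with a single scaling pass (one rounding per element)."""
    y = list(x)
    ref_dscal(a, y)
    return y


def scale_via_daxpy(a, x):
    """Compute a * x as (a - 1) * x + x (a multiply and an add per element)."""
    y = list(x)               # y starts out as a copy of x
    ref_daxpy(a - 1.0, x, y)  # y := (a - 1) * x + x
    return y


def max_rel_diff(u, v):
    """Largest relative difference between two vectors, skipping zeros in u."""
    return max(abs(p - q) / abs(p) for p, q in zip(u, v) if p != 0.0)


if __name__ == "__main__":
    n = 80_000                       # vector length from the report
    a = 1.1                          # illustrative scalar, not from the report
    x = [1.0 + i / n for i in range(n)]
    direct = scale_via_dscal(a, x)
    rewritten = scale_via_daxpy(a, x)
    print("max relative difference:", max_rel_diff(direct, rewritten))
```

The daxpy form performs two floating-point operations per element instead of one, which is where the small precision loss comes from; any difference should stay within a few units in the last place.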
Problem:
Given that dscal and daxpy have very similar computational patterns (a single pass over the vector with one multiply, or one multiply-add, per element), I'd like to understand why their multi-core performance differs so substantially. This behavior suggests that dscal, unlike daxpy, is not being run across the available CPU cores.