Add a public API for copying a table_view to device array #18450
Base: branch-25.06
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test
Initial feedback attached. Thanks for your work on this!
Benchmark Results: table_to_array on [0] NVIDIA RTX 5880 Ada Generation and [1] NVIDIA RTX 5880 Ada Generation
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceMemcpy::Batched(d_temp_storage,
Please benchmark `cub::DeviceMemcpy::Batched` against `cudaMemcpyBatchAsync` on CUDA 12.8. I'd like to see if there is a benefit to using the new API where it is supported; I think it should be more efficient.

I also want to check the performance against a multi-stream copy like the one I implemented for `gather` in #14162, but that can be done as a follow-up.
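For readers unfamiliar with the CUB idiom in the excerpt above: `cub::DeviceMemcpy::Batched` follows the usual two-pass pattern, where a first call with a null `d_temp_storage` only computes the required temporary storage size and a second call performs the actual copies. The sketch below uses placeholder names (`batched_copy_sketch`, `d_srcs`, `d_dsts`, `d_sizes`), not the PR's code.

```cpp
#include <cub/device/device_memcpy.cuh>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <cstdint>

// Sketch only: d_srcs/d_dsts are device arrays of device pointers and
// d_sizes is a device array of byte counts, one entry per buffer.
void batched_copy_sketch(void** d_srcs,
                         void** d_dsts,
                         std::size_t* d_sizes,
                         std::uint32_t num_buffers,
                         cudaStream_t stream)
{
  // Pass 1: query how much temporary storage CUB needs.
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceMemcpy::Batched(
    d_temp_storage, temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers, stream);

  // Allocate the temporary storage (RMM shown here; any device allocator works).
  rmm::device_buffer temp_storage(temp_storage_bytes, stream);

  // Pass 2: perform the batched copy.
  cub::DeviceMemcpy::Batched(
    temp_storage.data(), temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers, stream);
}
```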
I wasn't able to benchmark against `cudaMemcpyBatchAsync`. I get errors like:

cudaErrorInvalidValue: invalid argument
cudaErrorInvalidDevice: invalid device ordinal
cudaErrorInvalidResourceHandle: invalid resource handle

Do you have any ideas on root causes? I've already checked for null pointers. The next thing I'll try is changing the source access order in the `attrs` arg (currently set to `cudaMemcpySrcAccessOrderStream`).

Regardless, I did benchmark `cub::DeviceMemcpy::Batched` against `num_columns` `cudaMemcpyAsync` calls (a sketch of that baseline follows the results below), and it generally underperforms when only two buffers are copied. Performance looks good when 10 and 100 buffers are copied.
$ python nvbench/scripts/nvbench_compare.py table_to_array_bench_cudaMemcpyBatchAsync.json table_to_array_bench_cub::DeviceMemcpy::Batched.json
table_to_array
[0] NVIDIA RTX 5880 Ada Generation
num_rows | columns | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|
262144 | 2 | 11.118 us | 11.28% | 19.879 us | 4.33% | 8.761 us | 78.80% | SLOW |
2097152 | 2 | 43.316 us | 2.46% | 59.083 us | 2.05% | 15.767 us | 36.40% | SLOW |
16777216 | 2 | 321.465 us | 0.54% | 356.736 us | 0.84% | 35.272 us | 10.97% | SLOW |
32768 | 10 | 27.827 us | 4.91% | 18.861 us | 4.22% | -8.966 us | -32.22% | FAST |
262144 | 10 | 44.987 us | 2.13% | 41.143 us | 2.56% | -3.844 us | -8.54% | FAST |
2097152 | 10 | 201.164 us | 0.66% | 228.313 us | 1.05% | 27.149 us | 13.50% | SLOW |
16777216 | 10 | 1.589 ms | 0.27% | 1.668 ms | 0.92% | 79.189 us | 4.98% | SLOW |
32768 | 100 | 229.088 us | 1.71% | 50.000 us | 2.46% | -179.088 us | -78.17% | FAST |
262144 | 100 | 397.733 us | 0.36% | 283.406 us | 0.91% | -114.328 us | -28.74% | FAST |
2097152 | 100 | 1.929 ms | 0.25% | 2.060 ms | 1.36% | 130.913 us | 6.78% | SLOW |
16777216 | 100 | 15.836 ms | 0.09% | 16.030 ms | 0.18% | 193.286 us | 1.22% | SLOW |
32768 | 2 | 7.285 us | 9.26% | 31.464 us | 3.99% | 24.179 us | 331.92% | SLOW |
262144 | 2 | 11.670 us | 6.06% | 35.361 us | 4.10% | 23.690 us | 203.00% | SLOW |
2097152 | 2 | 43.570 us | 1.85% | 245.818 us | 3.46% | 202.247 us | 464.19% | SLOW |
16777216 | 2 | 321.276 us | 0.52% | 544.230 us | 1.25% | 222.954 us | 69.40% | SLOW |
32768 | 10 | 25.450 us | 3.65% | 32.252 us | 3.21% | 6.802 us | 26.73% | SLOW |
262144 | 10 | 44.983 us | 1.88% | 56.335 us | 2.68% | 11.352 us | 25.24% | SLOW |
2097152 | 10 | 201.004 us | 0.68% | 417.523 us | 2.45% | 216.519 us | 107.72% | SLOW |
16777216 | 10 | 1.595 ms | 0.30% | 1.872 ms | 12.15% | 276.805 us | 17.35% | SLOW |
32768 | 100 | 230.065 us | 1.79% | 64.702 us | 2.72% | -165.363 us | -71.88% | FAST |
262144 | 100 | 394.490 us | 0.40% | 298.147 us | 0.95% | -96.343 us | -24.42% | FAST |
2097152 | 100 | 1.939 ms | 0.32% | 2.270 ms | 5.17% | 331.012 us | 17.07% | SLOW |
16777216 | 100 | 15.895 ms | 0.09% | 16.164 ms | 1.25% | 269.368 us | 1.69% | SLOW |
Summary
- Total Matches: 69
- Pass (diff <= min_noise): 0
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 69
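For context, the `num_columns`-`cudaMemcpyAsync` baseline referred to above would look roughly like the sketch below. The names (`per_column_copy_sketch`, `col_ptrs`, `col_sizes`, `dst_base`) and the contiguous column-major destination are illustrative assumptions, not the PR's benchmark code.

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <vector>

// Sketch of the per-column baseline: one cudaMemcpyAsync per column into a
// preallocated, contiguous column-major device array.
void per_column_copy_sketch(std::vector<void const*> const& col_ptrs,   // device pointers, one per column
                            std::vector<std::size_t> const& col_sizes,  // bytes per column
                            void* dst_base,                             // preallocated device output
                            cudaStream_t stream)
{
  std::size_t offset = 0;
  for (std::size_t i = 0; i < col_ptrs.size(); ++i) {
    cudaMemcpyAsync(static_cast<char*>(dst_base) + offset,
                    col_ptrs[i],
                    col_sizes[i],
                    cudaMemcpyDeviceToDevice,
                    stream);
    offset += col_sizes[i];
  }
}
```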
I haven't looked at the code yet, but maybe using our wrapper `batched_memcpy_async` would be helpful here; it simplifies the use a bit.

My bad, I thought the issue was with `cub::DeviceMemcpy::Batched`.
@elstehle we're seeing a great perf boost when copying a hundred buffers using `cub::DeviceMemcpy::Batched` instead of a set of `memcpy`s. However, the performance of the batched memcpy is significantly worse when only two buffers are copied. Is this expected?
Sorry for joining the discussion a bit late.

> @elstehle we're seeing a great perf boost when copying a hundred buffers using `cub::DeviceMemcpy::Batched` instead of a set of `memcpy`s. However, the performance of the batched memcpy is significantly worse when only two buffers are copied. Is this expected?

If both the number of columns and the total time to copy are small, like ~10s of microseconds, I think that slowdown comes from kernel launch overhead, since `DeviceMemcpy::Batched` launches a couple of kernels.

Another factor may be underutilized SMs. If the total number of bytes being copied is too small to saturate all SMs, I expect using the copy engines is more efficient.

> I wasn't able to benchmark against `cudaMemcpyBatchAsync`. I get errors like `cudaErrorInvalidValue: invalid argument`, `cudaErrorInvalidDevice: invalid device ordinal`, `cudaErrorInvalidResourceHandle: invalid resource handle`. Do you have any ideas on root causes? I've already checked for null pointers. The next thing I'll try is changing the source access order in the `attrs` arg (currently set to `cudaMemcpySrcAccessOrderStream`).

I'm not sure if you had accounted for this, but IIRC `cudaMemcpyBatchAsync` expects a host array of device pointers (aka host-pointers-to-device-pointers). If that doesn't help, could you try running your benchmarks on just a single device to see if your issues relate to `currentDevice`? If these are nvbench benchmarks, you can just pass `mybench --device 0`.
Thanks @elstehle

> I'm not sure if you had accounted for this, but iirc, cudaMemcpyBatchAsync expects a host array of device pointers (aka host-pointers-to-device-pointers).

Yup, I added the version of `table_to_array_impl` that uses `cudaMemcpyBatchAsync`. I'm passing a host array of device pointers.

> If that doesn't help, could you try running your benchmarks just on a single device to see if your issues relate to currentDevice? If these are nvbench, you can just pass mybench --device 0.

I tried running on a single device and I get the same errors.
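To make the pointer layout discussed above explicit, here is a minimal sketch (placeholder names, not the PR's code) of the host-side arrays that `cudaMemcpyBatchAsync` reportedly wants: the vectors themselves live in host memory while every element is a device pointer. `cub::DeviceMemcpy::Batched`, by contrast, takes device-accessible iterators over the same source/destination/size sequences.

```cpp
#include <cstddef>
#include <vector>

// Host arrays whose elements are device pointers: the
// "host-pointers-to-device-pointers" layout described above.
struct batch_args {
  std::vector<void*> srcs;         // host vector of device source pointers
  std::vector<void*> dsts;         // host vector of device destination pointers
  std::vector<std::size_t> sizes;  // host vector of byte counts
};

batch_args make_batch_args(std::vector<void*> const& col_ptrs,        // device pointers to column data
                           std::vector<std::size_t> const& col_sizes,  // bytes per column
                           void* dst_base)                             // device pointer to the output array
{
  batch_args args;
  std::size_t offset = 0;
  for (std::size_t i = 0; i < col_ptrs.size(); ++i) {
    args.srcs.push_back(col_ptrs[i]);
    args.dsts.push_back(static_cast<char*>(dst_base) + offset);
    args.sizes.push_back(col_sizes[i]);
    offset += col_sizes[i];
  }
  return args;  // srcs.data()/dsts.data()/sizes.data() are what the batched copy API receives
}
```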
Some TODOs:
Description

Contributes to #16483.

This PR adds a new libcudf API, `cudf::table_to_array`, which copies data from a `table_view` into a preallocated column-major device array using `cub::DeviceMemcpy::Batched`.

The primary use case for this API is to accelerate the conversion of a `cudf.DataFrame` to a CuPy array when users access `DataFrame.values` in Python. In a follow-up PR, I'll integrate this API into the cudf Python layer.
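As a hedged illustration only, the snippet below shows roughly how the new API might be called from C++. The actual `cudf::table_to_array` declaration is part of this PR and is not shown in the excerpt above, so the call shape `(input table, destination pointer, destination size, stream)` is an assumption, as are the helper name `to_device_array` and the single-dtype, fixed-width layout math; only the surrounding utilities (`cudf::size_of`, `rmm::device_buffer`) are existing cudf/rmm APIs.

```cpp
// Hypothetical usage sketch; the real signature added by this PR may differ.
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/type_dispatcher.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>

rmm::device_buffer to_device_array(cudf::table_view const& tbl, rmm::cuda_stream_view stream)
{
  // Column-major layout for same-type, fixed-width columns:
  // column j occupies bytes [j * col_bytes, (j + 1) * col_bytes) of the output.
  auto const num_rows    = static_cast<std::size_t>(tbl.num_rows());
  auto const col_bytes   = num_rows * cudf::size_of(tbl.column(0).type());
  auto const total_bytes = col_bytes * static_cast<std::size_t>(tbl.num_columns());

  rmm::device_buffer out(total_bytes, stream);  // preallocated destination

  // Assumed call shape: (input table, destination pointer, destination size, stream).
  cudf::table_to_array(tbl, out.data(), total_bytes, stream);

  return out;
}
```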