
Add a public API for copying a table_view to device array #18450

Open · wants to merge 22 commits into branch-25.06

Conversation

@Matt711 Matt711 commented Apr 7, 2025

Description

Contributes to #16483.

This PR adds a new libcudf API: cudf::table_to_array, which copies data from a table_view into a preallocated column-major device array using cub::DeviceMemcpy::Batched.

The primary use case for this API is to accelerate the conversion of a cudf.DataFrame to a CuPy array when users access DataFrame.values in Python.

In a follow-up PR, I'll integrate this API into the cudf Python layer.
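
As a quick illustration, here is a minimal sketch of how such an API might be called from C++. The exact cudf::table_to_array signature isn't spelled out in this thread, so the parameters below (a table_view, a span over preallocated device memory, and a stream) are assumptions based on the description above.

```cpp
// Hypothetical usage sketch; the real cudf::table_to_array signature may differ.
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cuda/std/cstddef>
#include <cstddef>

void copy_to_column_major_array(cudf::table_view const& tbl, rmm::cuda_stream_view stream)
{
  // Preallocate a column-major device array large enough to hold every column,
  // assuming all columns share one fixed-width type (float64 here).
  auto const bytes =
    static_cast<std::size_t>(tbl.num_rows()) * tbl.num_columns() * sizeof(double);
  rmm::device_buffer out(bytes, stream);

  // Assumed call: copy each column's data contiguously into `out`, column by column.
  cudf::table_to_array(
    tbl,
    cudf::device_span<cuda::std::byte>{static_cast<cuda::std::byte*>(out.data()), bytes},
    stream);
}
```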

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Apr 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Apr 7, 2025
@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Apr 7, 2025

Matt711 commented Apr 7, 2025

/ok to test

@Matt711 Matt711 marked this pull request as ready for review April 8, 2025 03:51
@Matt711 Matt711 requested review from a team as code owners April 8, 2025 03:51
@Matt711 Matt711 requested review from bdice and vuule April 8, 2025 03:51

@bdice bdice left a comment

Initial feedback attached. Thanks for your work on this!

Matt711 commented Apr 8, 2025

Benchmark Results

table_to_array

[0] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|----------|---------|---------|----------|-------|----------|-------|--------------|--------|
| 262144 | 2 | 24544x | 24.381 us | 20.36% | 20.378 us | 5.97% | 205.824 GB/s | 21.44% |
| 2097152 | 2 | 8656x | 61.408 us | 6.84% | 57.816 us | 2.96% | 580.363 GB/s | 60.45% |
| 16777216 | 2 | 1424x | 357.499 us | 1.32% | 353.912 us | 0.85% | 758.480 GB/s | 79.00% |
| 32768 | 10 | 25488x | 23.607 us | 21.45% | 19.629 us | 6.31% | 133.547 GB/s | 13.91% |
| 262144 | 10 | 11792x | 46.189 us | 9.27% | 42.504 us | 3.45% | 493.399 GB/s | 51.39% |
| 2097152 | 10 | 2192x | 233.055 us | 1.91% | 229.399 us | 1.06% | 731.356 GB/s | 76.18% |
| 16777216 | 10 | 544x | 1.666 ms | 1.23% | 1.663 ms | 1.21% | 807.174 GB/s | 84.07% |
| 32768 | 100 | 9744x | 55.100 us | 8.09% | 51.326 us | 3.50% | 510.739 GB/s | 53.20% |
| 262144 | 100 | 1744x | 290.759 us | 1.57% | 287.065 us | 0.91% | 730.551 GB/s | 76.09% |
| 2097152 | 100 | 352x | 2.059 ms | 1.30% | 2.056 ms | 1.29% | 816.148 GB/s | 85.01% |
| 16777216 | 100 | 32x | 16.006 ms | 0.26% | 16.002 ms | 0.25% | 838.756 GB/s | 87.36% |

[1] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|----------|---------|---------|----------|-------|----------|-------|--------------|--------|
| 32768 | 2 | 15344x | 36.213 us | 13.56% | 32.615 us | 8.08% | 16.075 GB/s | 1.67% |
| 262144 | 2 | 14064x | 39.508 us | 12.78% | 35.582 us | 6.11% | 117.878 GB/s | 12.28% |
| 2097152 | 2 | 6496x | 80.498 us | 4.96% | 77.064 us | 2.18% | 435.412 GB/s | 45.35% |
| 16777216 | 2 | 1648x | 546.025 us | 1.37% | 542.291 us | 1.18% | 495.003 GB/s | 51.56% |
| 32768 | 10 | 13840x | 39.317 us | 9.62% | 36.142 us | 4.01% | 72.532 GB/s | 7.55% |
| 262144 | 10 | 8544x | 61.967 us | 6.53% | 58.526 us | 2.85% | 358.326 GB/s | 37.32% |
| 2097152 | 10 | 2064x | 246.372 us | 2.02% | 242.862 us | 1.07% | 690.814 GB/s | 71.95% |
| 16777216 | 10 | 928x | 1.877 ms | 2.02% | 1.873 ms | 2.01% | 716.410 GB/s | 74.62% |
| 32768 | 100 | 7632x | 68.997 us | 5.91% | 65.532 us | 2.68% | 400.023 GB/s | 41.66% |
| 262144 | 100 | 1680x | 303.596 us | 1.48% | 300.142 us | 0.93% | 698.719 GB/s | 72.78% |
| 2097152 | 100 | 576x | 2.117 ms | 0.80% | 2.114 ms | 0.78% | 793.753 GB/s | 82.67% |
| 16777216 | 100 | 31x | 16.148 ms | 0.26% | 16.144 ms | 0.26% | 831.395 GB/s | 86.60% |

@Matt711 Matt711 requested review from davidwendt and bdice April 9, 2025 01:41

```cpp
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceMemcpy::Batched(d_temp_storage,
```

Contributor

Please benchmark cub::DeviceMemcpy::Batched against cudaMemcpyBatchAsync on CUDA 12.8. I'd like to see if there is a benefit to using the new API where it is supported; I think it should be more efficient.

I also want to check the performance against a multi-stream copy like I implemented for gather in #14162, but that can be done for a follow-up.
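
For reference, the usual two-phase pattern for cub::DeviceMemcpy::Batched (query temp storage, allocate, run) looks roughly like the sketch below. The pointer-array arguments and the rmm-based temp-storage allocation are illustrative, not the exact code in this PR.

```cpp
// Sketch of the two-phase cub::DeviceMemcpy::Batched pattern.
#include <cub/device/device_memcpy.cuh>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <cstdint>

void batched_copy(void** d_srcs,          // device array of source pointers
                  void** d_dsts,          // device array of destination pointers
                  std::size_t* d_sizes,   // device array of buffer sizes in bytes
                  std::uint32_t num_buffers,
                  rmm::cuda_stream_view stream)
{
  // Phase 1: query the required temporary storage size.
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceMemcpy::Batched(
    d_temp_storage, temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers, stream.value());

  // Phase 2: allocate the temporary storage and perform the batched copy.
  rmm::device_buffer temp_storage(temp_storage_bytes, stream);
  cub::DeviceMemcpy::Batched(temp_storage.data(),
                             temp_storage_bytes,
                             d_srcs,
                             d_dsts,
                             d_sizes,
                             num_buffers,
                             stream.value());
}
```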

@Matt711 Matt711 Apr 10, 2025

I wasn't able to do the benchmark against cudaMemcpyBatchAsync. I get errors like...

cudaErrorInvalidValue invalid argument
cudaErrorInvalidDevice: invalid device ordinal
cudaErrorInvalidResourceHandle: invalid resource handle

Do you have any ideas on root causes? I've already checked for null pointers. And the next thing I'll try is changing the source access order in the attrs arg (currently set to cudaMemcpySrcAccessOrderStream).

Regardless, I did benchmark cub::DeviceMemcpy::Batched against num_columns cudaMemcpyAsync calls, and it generally underperforms when only two buffers are copied. Performance looks good when 10 or 100 buffers are copied.
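
For context, the per-column baseline used as the reference here is essentially a loop of plain cudaMemcpyAsync calls, along the lines of the sketch below (the helper name and the fixed-width, equal-size-columns assumption are mine):

```cpp
// Illustrative per-column baseline: one cudaMemcpyAsync per column into the
// column-major output array. Assumes fixed-width columns of equal byte size.
#include <cudf/table/table_view.hpp>
#include <cuda_runtime.h>

#include <cstddef>

void per_column_copy(cudf::table_view const& tbl,
                     std::byte* d_out,               // preallocated device output
                     std::size_t column_size_bytes,  // bytes per column
                     cudaStream_t stream)
{
  for (cudf::size_type i = 0; i < tbl.num_columns(); ++i) {
    cudaMemcpyAsync(d_out + i * column_size_bytes,
                    static_cast<std::byte const*>(tbl.column(i).head()),  // column device data
                    column_size_bytes,
                    cudaMemcpyDeviceToDevice,
                    stream);
  }
}
```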

$ python nvbench/scripts/nvbench_compare.py table_to_array_bench_cudaMemcpyBatchAsync.json table_to_array_bench_cub::DeviceMemcpy::Batched.json

table_to_array

[0] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|---------|----------|-----------|----------|-----------|------|-------|--------|
| 262144 | 2 | 11.118 us | 11.28% | 19.879 us | 4.33% | 8.761 us | 78.80% | SLOW |
| 2097152 | 2 | 43.316 us | 2.46% | 59.083 us | 2.05% | 15.767 us | 36.40% | SLOW |
| 16777216 | 2 | 321.465 us | 0.54% | 356.736 us | 0.84% | 35.272 us | 10.97% | SLOW |
| 32768 | 10 | 27.827 us | 4.91% | 18.861 us | 4.22% | -8.966 us | -32.22% | FAST |
| 262144 | 10 | 44.987 us | 2.13% | 41.143 us | 2.56% | -3.844 us | -8.54% | FAST |
| 2097152 | 10 | 201.164 us | 0.66% | 228.313 us | 1.05% | 27.149 us | 13.50% | SLOW |
| 16777216 | 10 | 1.589 ms | 0.27% | 1.668 ms | 0.92% | 79.189 us | 4.98% | SLOW |
| 32768 | 100 | 229.088 us | 1.71% | 50.000 us | 2.46% | -179.088 us | -78.17% | FAST |
| 262144 | 100 | 397.733 us | 0.36% | 283.406 us | 0.91% | -114.328 us | -28.74% | FAST |
| 2097152 | 100 | 1.929 ms | 0.25% | 2.060 ms | 1.36% | 130.913 us | 6.78% | SLOW |
| 16777216 | 100 | 15.836 ms | 0.09% | 16.030 ms | 0.18% | 193.286 us | 1.22% | SLOW |
| 32768 | 2 | 7.285 us | 9.26% | 31.464 us | 3.99% | 24.179 us | 331.92% | SLOW |
| 262144 | 2 | 11.670 us | 6.06% | 35.361 us | 4.10% | 23.690 us | 203.00% | SLOW |
| 2097152 | 2 | 43.570 us | 1.85% | 245.818 us | 3.46% | 202.247 us | 464.19% | SLOW |
| 16777216 | 2 | 321.276 us | 0.52% | 544.230 us | 1.25% | 222.954 us | 69.40% | SLOW |
| 32768 | 10 | 25.450 us | 3.65% | 32.252 us | 3.21% | 6.802 us | 26.73% | SLOW |
| 262144 | 10 | 44.983 us | 1.88% | 56.335 us | 2.68% | 11.352 us | 25.24% | SLOW |
| 2097152 | 10 | 201.004 us | 0.68% | 417.523 us | 2.45% | 216.519 us | 107.72% | SLOW |
| 16777216 | 10 | 1.595 ms | 0.30% | 1.872 ms | 12.15% | 276.805 us | 17.35% | SLOW |
| 32768 | 100 | 230.065 us | 1.79% | 64.702 us | 2.72% | -165.363 us | -71.88% | FAST |
| 262144 | 100 | 394.490 us | 0.40% | 298.147 us | 0.95% | -96.343 us | -24.42% | FAST |
| 2097152 | 100 | 1.939 ms | 0.32% | 2.270 ms | 5.17% | 331.012 us | 17.07% | SLOW |
| 16777216 | 100 | 15.895 ms | 0.09% | 16.164 ms | 1.25% | 269.368 us | 1.69% | SLOW |

Summary

  • Total Matches: 69
    • Pass (diff <= min_noise): 0
    • Unknown (infinite noise): 0
    • Failure (diff > min_noise): 69

@vuule vuule Apr 10, 2025

I haven't looked at the code yet, but maybe using our wrapper batched_memcpy_async would be helpful here; it simplifies the usage a bit.
My bad, I thought the issue was with cub::DeviceMemcpy::Batched.
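
For reference, the wrapper mentioned above is used roughly like this; the header path and argument order are my best guess at cudf::detail::batched_memcpy_async, not code from this PR:

```cpp
// Sketch of the cudf detail wrapper around cub::DeviceMemcpy::Batched; it hides
// the temp-storage query/allocate/run steps. Header path and signature are assumed.
#include <cudf/detail/utilities/batched_memcpy.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstddef>

void copy_buffers(void** d_srcs,         // device array of source pointers
                  void** d_dsts,         // device array of destination pointers
                  std::size_t* d_sizes,  // device array of buffer sizes in bytes
                  std::size_t num_buffers,
                  rmm::cuda_stream_view stream)
{
  cudf::detail::batched_memcpy_async(d_srcs, d_dsts, d_sizes, num_buffers, stream);
}
```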

Contributor

@elstehle we're seeing a great perf boost when copying a hundred buffers using cub::DeviceMemcpy::Batched instead of a set of memcpys. However, the performance of the batched memcpy is significantly worse when only two buffers are copied. Is this expected?

@elstehle elstehle Apr 16, 2025

Sorry for joining the discussion a bit late.

@elstehle we're seeing a great perf boost when copying a hundred buffers using cub::DeviceMemcpy::Batched instead of a set of memcpys. However, the performance of the batched memcpy is significantly worse when only two buffers are copied. Is this expected?

If both the number of columns and the total copy time are small, like ~10s of microseconds, I think the slowdown comes from kernel launch overhead, since DeviceMemcpy::Batched launches a couple of kernels.

Another factor may be underutilized SMs. If the total number of bytes being copied is too small to saturate all the SMs, I expect using the copy engines to be more efficient.

I wasn't able to do the benchmark against cudaMemcpyBatchAsync. I get errors like...

cudaErrorInvalidValue invalid argument
cudaErrorInvalidDevice: invalid device ordinal
cudaErrorInvalidResourceHandle: invalid resource handle

Do you have any ideas on root causes? I've already checked for null pointers. And the next thing I'll try is changing the source access order in the attrs arg (currently set to cudaMemcpySrcAccessOrderStream).

I'm not sure if you had accounted for this, but iirc, cudaMemcpyBatchAsync expects a host array of device pointers (aka host-pointers-to-device-pointers). If that doesn't help, could you try running your benchmarks just on a single device to see if your issues relate to currentDevice? If these are nvbench, you can just pass mybench --device 0.

@Matt711 Matt711 Apr 21, 2025

Thanks @elstehle

I'm not sure if you had accounted for this, but iirc, cudaMemcpyBatchAsync expects a host array of device pointers (aka host-pointers-to-device-pointers).

Yup, I added the version of table_to_array_impl that uses cudaMemcpyBatchAsync. I'm passing a host array of device pointers.

If that doesn't help, could you try running your benchmarks just on a single device to see if your issues relate to currentDevice? If these are nvbench, you can just pass mybench --device 0.

I tried running on a single device and I get the same errors.


@Matt711 Matt711 changed the title Add a public API for converting a table_view to device array Add a public API for copying a table_view to device array Apr 15, 2025

Matt711 commented Apr 15, 2025

Some TODOs:

  • Check the performance against a multi-stream copy, e.g. the stream-pool approach used for gather in #14162 (Use stream pool for gather/scatter); see the sketch after this list.
  • Benchmark cudaMemcpyBatchAsync vs. cub::DeviceMemcpy::Batched. cub::DeviceMemcpy::Batched has already been benchmarked against a set of cudaMemcpyAsync calls; see this table.
  • Support columns with nulls.
  • Support casting columns of different types to a common type on different streams.
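
For the first TODO, a multi-stream variant could look roughly like the sketch below. The fork_streams/join_streams helpers and their header are my assumptions about the stream-pool utilities used in #14162; the copy itself reuses the simple per-column cudaMemcpyAsync pattern.

```cpp
// Sketch of a multi-stream per-column copy using libcudf's stream pool.
// Helper names and header path are assumptions, not code from this PR.
#include <cudf/detail/utilities/stream_pool.hpp>
#include <cudf/table/table_view.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime.h>

#include <cstddef>

void multi_stream_copy(cudf::table_view const& tbl,
                       std::byte* d_out,               // preallocated device output
                       std::size_t column_size_bytes,  // bytes per (fixed-width) column
                       rmm::cuda_stream_view stream)
{
  // Fork a small set of streams from the pool and round-robin the column copies.
  auto streams = cudf::detail::fork_streams(stream, 4);
  for (cudf::size_type i = 0; i < tbl.num_columns(); ++i) {
    cudaMemcpyAsync(d_out + i * column_size_bytes,
                    static_cast<std::byte const*>(tbl.column(i).head()),
                    column_size_bytes,
                    cudaMemcpyDeviceToDevice,
                    streams[i % streams.size()].value());
  }
  // Have the original stream wait on all forked streams before returning.
  cudf::detail::join_streams(streams, stream);
}
```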
