
Add a public API for copying a table_view to device array #18450

Open · wants to merge 22 commits into branch-25.06

Conversation

@Matt711 Matt711 commented Apr 7, 2025

Description

Contributes to #16483.

This PR adds a new libcudf API: cudf::table_to_array, which copies data from a table_view into a preallocated column-major device array using cub::DeviceMemcpy::Batched.

The primary use case for this API is to accelerate the conversion of a cudf.DataFrame to a CuPy array when users access DataFrame.values in Python.

In a follow-up PR, I'll integrate this API into the cudf Python layer.
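
As a quick illustration, here is a minimal sketch of how such an API might be called from C++. The exact cudf::table_to_array signature isn't spelled out in this thread, so the parameters below (a table_view, a span over preallocated device memory, and a stream) are assumptions based on the description above.

```cpp
// Hypothetical usage sketch; the real cudf::table_to_array signature may differ.
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cuda/std/cstddef>
#include <cstddef>

void copy_to_column_major_array(cudf::table_view const& tbl, rmm::cuda_stream_view stream)
{
  // Preallocate a column-major device array large enough to hold every column,
  // assuming all columns share one fixed-width type (float64 here).
  auto const bytes =
    static_cast<std::size_t>(tbl.num_rows()) * tbl.num_columns() * sizeof(double);
  rmm::device_buffer out(bytes, stream);

  // Assumed call: copy each column's data contiguously into `out`, column by column.
  cudf::table_to_array(
    tbl,
    cudf::device_span<cuda::std::byte>{static_cast<cuda::std::byte*>(out.data()), bytes},
    stream);
}
```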

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Apr 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Apr 7, 2025
@Matt711 Matt711 added feature request New feature or request non-breaking Non-breaking change labels Apr 7, 2025

Matt711 commented Apr 7, 2025

/ok to test

@Matt711 Matt711 marked this pull request as ready for review April 8, 2025 03:51
@Matt711 Matt711 requested review from a team as code owners April 8, 2025 03:51
@Matt711 Matt711 requested review from bdice and vuule April 8, 2025 03:51

@bdice bdice left a comment

Initial feedback attached. Thanks for your work on this!

Matt711 commented Apr 8, 2025

Benchmark Results

table_to_array

[0] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|----------|---------|---------|----------|-------|----------|-------|--------------|--------|
| 262144 | 2 | 24544x | 24.381 us | 20.36% | 20.378 us | 5.97% | 205.824 GB/s | 21.44% |
| 2097152 | 2 | 8656x | 61.408 us | 6.84% | 57.816 us | 2.96% | 580.363 GB/s | 60.45% |
| 16777216 | 2 | 1424x | 357.499 us | 1.32% | 353.912 us | 0.85% | 758.480 GB/s | 79.00% |
| 32768 | 10 | 25488x | 23.607 us | 21.45% | 19.629 us | 6.31% | 133.547 GB/s | 13.91% |
| 262144 | 10 | 11792x | 46.189 us | 9.27% | 42.504 us | 3.45% | 493.399 GB/s | 51.39% |
| 2097152 | 10 | 2192x | 233.055 us | 1.91% | 229.399 us | 1.06% | 731.356 GB/s | 76.18% |
| 16777216 | 10 | 544x | 1.666 ms | 1.23% | 1.663 ms | 1.21% | 807.174 GB/s | 84.07% |
| 32768 | 100 | 9744x | 55.100 us | 8.09% | 51.326 us | 3.50% | 510.739 GB/s | 53.20% |
| 262144 | 100 | 1744x | 290.759 us | 1.57% | 287.065 us | 0.91% | 730.551 GB/s | 76.09% |
| 2097152 | 100 | 352x | 2.059 ms | 1.30% | 2.056 ms | 1.29% | 816.148 GB/s | 85.01% |
| 16777216 | 100 | 32x | 16.006 ms | 0.26% | 16.002 ms | 0.25% | 838.756 GB/s | 87.36% |

[1] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Samples | CPU Time | Noise | GPU Time | Noise | GlobalMem BW | BWUtil |
|----------|---------|---------|----------|-------|----------|-------|--------------|--------|
| 32768 | 2 | 15344x | 36.213 us | 13.56% | 32.615 us | 8.08% | 16.075 GB/s | 1.67% |
| 262144 | 2 | 14064x | 39.508 us | 12.78% | 35.582 us | 6.11% | 117.878 GB/s | 12.28% |
| 2097152 | 2 | 6496x | 80.498 us | 4.96% | 77.064 us | 2.18% | 435.412 GB/s | 45.35% |
| 16777216 | 2 | 1648x | 546.025 us | 1.37% | 542.291 us | 1.18% | 495.003 GB/s | 51.56% |
| 32768 | 10 | 13840x | 39.317 us | 9.62% | 36.142 us | 4.01% | 72.532 GB/s | 7.55% |
| 262144 | 10 | 8544x | 61.967 us | 6.53% | 58.526 us | 2.85% | 358.326 GB/s | 37.32% |
| 2097152 | 10 | 2064x | 246.372 us | 2.02% | 242.862 us | 1.07% | 690.814 GB/s | 71.95% |
| 16777216 | 10 | 928x | 1.877 ms | 2.02% | 1.873 ms | 2.01% | 716.410 GB/s | 74.62% |
| 32768 | 100 | 7632x | 68.997 us | 5.91% | 65.532 us | 2.68% | 400.023 GB/s | 41.66% |
| 262144 | 100 | 1680x | 303.596 us | 1.48% | 300.142 us | 0.93% | 698.719 GB/s | 72.78% |
| 2097152 | 100 | 576x | 2.117 ms | 0.80% | 2.114 ms | 0.78% | 793.753 GB/s | 82.67% |
| 16777216 | 100 | 31x | 16.148 ms | 0.26% | 16.144 ms | 0.26% | 831.395 GB/s | 86.60% |

@Matt711 Matt711 requested review from davidwendt and bdice April 9, 2025 01:41

```cpp
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceMemcpy::Batched(d_temp_storage,
```

Contributor

Please benchmark cub::DeviceMemcpy::Batched against cudaMemcpyBatchAsync on CUDA 12.8. I'd like to see if there is a benefit to using the new API where it is supported; I think it should be more efficient.

I also want to check the performance against a multi-stream copy like I implemented for gather in #14162, but that can be done for a follow-up.
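
For reference, the usual two-phase pattern for cub::DeviceMemcpy::Batched (query temp storage, allocate, run) looks roughly like the sketch below. The pointer-array arguments and the rmm-based temp-storage allocation are illustrative, not the exact code in this PR.

```cpp
// Sketch of the two-phase cub::DeviceMemcpy::Batched pattern.
#include <cub/device/device_memcpy.cuh>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <cstdint>

void batched_copy(void** d_srcs,          // device array of source pointers
                  void** d_dsts,          // device array of destination pointers
                  std::size_t* d_sizes,   // device array of buffer sizes in bytes
                  std::uint32_t num_buffers,
                  rmm::cuda_stream_view stream)
{
  // Phase 1: query the required temporary storage size.
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceMemcpy::Batched(
    d_temp_storage, temp_storage_bytes, d_srcs, d_dsts, d_sizes, num_buffers, stream.value());

  // Phase 2: allocate the temporary storage and perform the batched copy.
  rmm::device_buffer temp_storage(temp_storage_bytes, stream);
  cub::DeviceMemcpy::Batched(temp_storage.data(),
                             temp_storage_bytes,
                             d_srcs,
                             d_dsts,
                             d_sizes,
                             num_buffers,
                             stream.value());
}
```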

@Matt711 Matt711 Apr 10, 2025

I wasn't able to do the benchmark against cudaMemcpyBatchAsync. I get errors like...

cudaErrorInvalidValue invalid argument
cudaErrorInvalidDevice: invalid device ordinal
cudaErrorInvalidResourceHandle: invalid resource handle

Do you have any ideas on root causes? I've already checked for null pointers. And the next thing I'll try is changing the source access order in the attrs arg (currently set to cudaMemcpySrcAccessOrderStream).

Regardless, I did benchmark cub::DeviceMemcpy::Batched against num_columns cudaMemcpyAsync calls, and it generally underperforms when only two buffers are copied. Performance looks good when 10 or 100 buffers are copied.
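
For context, the per-column baseline used as the reference here is essentially a loop of plain cudaMemcpyAsync calls, along the lines of the sketch below (the helper name and the fixed-width, equal-size-columns assumption are mine):

```cpp
// Illustrative per-column baseline: one cudaMemcpyAsync per column into the
// column-major output array. Assumes fixed-width columns of equal byte size.
#include <cudf/table/table_view.hpp>
#include <cuda_runtime.h>

#include <cstddef>

void per_column_copy(cudf::table_view const& tbl,
                     std::byte* d_out,               // preallocated device output
                     std::size_t column_size_bytes,  // bytes per column
                     cudaStream_t stream)
{
  for (cudf::size_type i = 0; i < tbl.num_columns(); ++i) {
    cudaMemcpyAsync(d_out + i * column_size_bytes,
                    static_cast<std::byte const*>(tbl.column(i).head()),  // column device data
                    column_size_bytes,
                    cudaMemcpyDeviceToDevice,
                    stream);
  }
}
```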

$ python nvbench/scripts/nvbench_compare.py table_to_array_bench_cudaMemcpyBatchAsync.json table_to_array_bench_cub::DeviceMemcpy::Batched.json

table_to_array

[0] NVIDIA RTX 5880 Ada Generation

| num_rows | columns | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|---------|----------|-----------|----------|-----------|------|-------|--------|
| 262144 | 2 | 11.118 us | 11.28% | 19.879 us | 4.33% | 8.761 us | 78.80% | SLOW |
| 2097152 | 2 | 43.316 us | 2.46% | 59.083 us | 2.05% | 15.767 us | 36.40% | SLOW |
| 16777216 | 2 | 321.465 us | 0.54% | 356.736 us | 0.84% | 35.272 us | 10.97% | SLOW |
| 32768 | 10 | 27.827 us | 4.91% | 18.861 us | 4.22% | -8.966 us | -32.22% | FAST |
| 262144 | 10 | 44.987 us | 2.13% | 41.143 us | 2.56% | -3.844 us | -8.54% | FAST |
| 2097152 | 10 | 201.164 us | 0.66% | 228.313 us | 1.05% | 27.149 us | 13.50% | SLOW |
| 16777216 | 10 | 1.589 ms | 0.27% | 1.668 ms | 0.92% | 79.189 us | 4.98% | SLOW |
| 32768 | 100 | 229.088 us | 1.71% | 50.000 us | 2.46% | -179.088 us | -78.17% | FAST |
| 262144 | 100 | 397.733 us | 0.36% | 283.406 us | 0.91% | -114.328 us | -28.74% | FAST |
| 2097152 | 100 | 1.929 ms | 0.25% | 2.060 ms | 1.36% | 130.913 us | 6.78% | SLOW |
| 16777216 | 100 | 15.836 ms | 0.09% | 16.030 ms | 0.18% | 193.286 us | 1.22% | SLOW |
| 32768 | 2 | 7.285 us | 9.26% | 31.464 us | 3.99% | 24.179 us | 331.92% | SLOW |
| 262144 | 2 | 11.670 us | 6.06% | 35.361 us | 4.10% | 23.690 us | 203.00% | SLOW |
| 2097152 | 2 | 43.570 us | 1.85% | 245.818 us | 3.46% | 202.247 us | 464.19% | SLOW |
| 16777216 | 2 | 321.276 us | 0.52% | 544.230 us | 1.25% | 222.954 us | 69.40% | SLOW |
| 32768 | 10 | 25.450 us | 3.65% | 32.252 us | 3.21% | 6.802 us | 26.73% | SLOW |
| 262144 | 10 | 44.983 us | 1.88% | 56.335 us | 2.68% | 11.352 us | 25.24% | SLOW |
| 2097152 | 10 | 201.004 us | 0.68% | 417.523 us | 2.45% | 216.519 us | 107.72% | SLOW |
| 16777216 | 10 | 1.595 ms | 0.30% | 1.872 ms | 12.15% | 276.805 us | 17.35% | SLOW |
| 32768 | 100 | 230.065 us | 1.79% | 64.702 us | 2.72% | -165.363 us | -71.88% | FAST |
| 262144 | 100 | 394.490 us | 0.40% | 298.147 us | 0.95% | -96.343 us | -24.42% | FAST |
| 2097152 | 100 | 1.939 ms | 0.32% | 2.270 ms | 5.17% | 331.012 us | 17.07% | SLOW |
| 16777216 | 100 | 15.895 ms | 0.09% | 16.164 ms | 1.25% | 269.368 us | 1.69% | SLOW |

Summary

  • Total Matches: 69
    • Pass (diff <= min_noise): 0
    • Unknown (infinite noise): 0
    • Failure (diff > min_noise): 69

@vuule vuule Apr 10, 2025

I haven't looked at the code yet, but maybe using our wrapper batched_memcpy_async would be helpful here; it simplifies the usage a bit.
My bad, I thought the issue was with cub::DeviceMemcpy::Batched.
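
For reference, the wrapper mentioned above is used roughly like this; the header path and argument order are my best guess at cudf::detail::batched_memcpy_async, not code from this PR:

```cpp
// Sketch of the cudf detail wrapper around cub::DeviceMemcpy::Batched; it hides
// the temp-storage query/allocate/run steps. Header path and signature are assumed.
#include <cudf/detail/utilities/batched_memcpy.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cstddef>

void copy_buffers(void** d_srcs,         // device array of source pointers
                  void** d_dsts,         // device array of destination pointers
                  std::size_t* d_sizes,  // device array of buffer sizes in bytes
                  std::size_t num_buffers,
                  rmm::cuda_stream_view stream)
{
  cudf::detail::batched_memcpy_async(d_srcs, d_dsts, d_sizes, num_buffers, stream);
}
```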

Contributor

@elstehle we're seeing a great perf boost when copying a hundred buffers using cub::DeviceMemcpy::Batched instead of a set of memcpys. However, the performance of the batched memcpy is significantly worse when only two buffers are copied. Is this expected?

@elstehle elstehle Apr 16, 2025

Sorry for joining the discussion a bit late.

@elstehle we're seeing a great perf boost when copying a hundred buffers using cub::DeviceMemcpy::Batched instead of a set of memcpys. However, the performance of the batched memcpy is significantly worse when only two buffers are copied. Is this expected?

If both the number of columns and the total copy time are small, like ~10s of microseconds, I think the slowdown comes from kernel launch overhead, since DeviceMemcpy::Batched launches a couple of kernels.

Another factor may be underutilized SMs. If the total number of bytes being copied is too small to saturate all the SMs, I expect using the copy engines to be more efficient.

I wasn't able to do the benchmark against cudaMemcpyBatchAsync. I get errors like...

cudaErrorInvalidValue invalid argument
cudaErrorInvalidDevice: invalid device ordinal
cudaErrorInvalidResourceHandle: invalid resource handle

Do you have any ideas on root causes? I've already checked for null pointers. And the next thing I'll try is changing the source access order in the attrs arg (currently set to cudaMemcpySrcAccessOrderStream).

I'm not sure if you had accounted for this, but iirc, cudaMemcpyBatchAsync expects a host array of device pointers (aka host-pointers-to-device-pointers). If that doesn't help, could you try running your benchmarks just on a single device to see if your issues relate to currentDevice? If these are nvbench, you can just pass mybench --device 0.

@Matt711 Matt711 Apr 21, 2025

Thanks @elstehle

I'm not sure if you had accounted for this, but iirc, cudaMemcpyBatchAsync expects a host array of device pointers (aka host-pointers-to-device-pointers).

Yup, I added the version of table_to_array_impl that uses cudaMemcpyBatchAsync. I'm passing a host array of device pointers.

If that doesn't help, could you try running your benchmarks just on a single device to see if your issues relate to currentDevice? If these are nvbench, you can just pass mybench --device 0.

I tried running on a single device and I get the same errors.


@Matt711 Matt711 changed the title Add a public API for converting a table_view to device array Add a public API for copying a table_view to device array Apr 15, 2025

Matt711 commented Apr 15, 2025

Some TODOs:

  • Check the performance against a multi-stream copy, e.g. the stream-pool approach used for gather in #14162 (Use stream pool for gather/scatter); see the sketch after this list.
  • Benchmark cudaMemcpyBatchAsync vs. cub::DeviceMemcpy::Batched. cub::DeviceMemcpy::Batched has already been benchmarked against a set of cudaMemcpyAsync calls; see this table.
  • Support columns with nulls.
  • Support casting columns of different types to a common type on different streams.
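
For the first TODO, a multi-stream variant could look roughly like the sketch below. The fork_streams/join_streams helpers and their header are my assumptions about the stream-pool utilities used in #14162; the copy itself reuses the simple per-column cudaMemcpyAsync pattern.

```cpp
// Sketch of a multi-stream per-column copy using libcudf's stream pool.
// Helper names and header path are assumptions, not code from this PR.
#include <cudf/detail/utilities/stream_pool.hpp>
#include <cudf/table/table_view.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime.h>

#include <cstddef>

void multi_stream_copy(cudf::table_view const& tbl,
                       std::byte* d_out,               // preallocated device output
                       std::size_t column_size_bytes,  // bytes per (fixed-width) column
                       rmm::cuda_stream_view stream)
{
  // Fork a small set of streams from the pool and round-robin the column copies.
  auto streams = cudf::detail::fork_streams(stream, 4);
  for (cudf::size_type i = 0; i < tbl.num_columns(); ++i) {
    cudaMemcpyAsync(d_out + i * column_size_bytes,
                    static_cast<std::byte const*>(tbl.column(i).head()),
                    column_size_bytes,
                    cudaMemcpyDeviceToDevice,
                    streams[i % streams.size()].value());
  }
  // Have the original stream wait on all forked streams before returning.
  cudf::detail::join_streams(streams, stream);
}
```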
