Commit 22dcaf1

NB4444, yssamtu, MyNameIsTrez, parbenc, Naraenda authored
Develop stream 2025-03-05 (#697)
* Resolve "Extract common utilities for benchmarks and tests"
* Add missing error check
* implement dynamic dispatch for batch_copy and batch_memcpy
* Reduce merge inplace memory footprint
* Resolve "block disconinuity test values are converted to long long before comparing"
* Resolve "Remove debug builds and combine benchmark and tests in Windows CI"
* Resolve "Broadcast algorithm in warp_scan MI300X / MI325X"
* Resolve "hipGetDeviceProperties is slow"
* Resolve "Move device_ptr to common utilities"
* Resolve "Use existing enums in thread_load_kernel() and thread_store_kernel()"
* Resolve "Generate testing data using rocRAND"
* Resolve "Explicitly declared values in enum for thread_load and thread_store modifiers"
* Reduce lds conflicts shared histogram
* Reset the user's active device on error
* Resolve "Remove comment in test_device_histogram"
* Resolve "Calculating the individual positions of each item dynamically, rather than storing them in registers"
* Resolve "Implementation Proposal Virtual Shared Memory"
* Resolve "fix precision check per operations in scan and reduce"
* Use number_of_runs in loops for tests
* Resolve "update CHANGELOG.md to include changes by new tests data generation"
* Added warning for deprecation of c++14
* Remove duplicate include

---------

Co-authored-by: Yung-Sheng Tu
Co-authored-by: Sander Bos
Co-authored-by: Bence Parajdi
Co-authored-by: Nara Prasetya
Co-authored-by: Robin Voetter
Co-authored-by: Enrico Degregori
Co-authored-by: Milo Lurati
Co-authored-by: Borys Petrov
Co-authored-by: Mátyás Aradi
1 parent 07fda10 commit 22dcaf1

215 files changed: +8876 -4914 lines changed

.gitlab-ci.yml

Lines changed: 5 additions & 4 deletions
@@ -188,6 +188,7 @@ build:cmake-minimum-apt:
       -D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror $EXTRA_CMAKE_CXX_FLAGS"
       -D CMAKE_BUILD_TYPE="$BUILD_TYPE"
       -D BUILD_$BUILD_TARGET=ON
+      -D WITH_ROCRAND=ON
       -D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
       -D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
       -D BUILD_EXAMPLE=ON
@@ -206,6 +207,7 @@ build:cmake-minimum-apt:
       - $BUILD_DIR/gtest/
       - $BUILD_DIR/test/CTestTestfile.cmake
       - $BUILD_DIR/test/rocprim/CTestTestfile.cmake
+      - $BUILD_DIR/deps/rocrand/
       - $BUILD_DIR/test/rocprim/test_*
       - $BUILD_DIR/test/rocprim/libtest_*
       - $BUILD_DIR/test/test_*
@@ -271,14 +273,14 @@ build:windows:
     - .deps:visual-studio-devshell
   parallel:
     matrix:
-      - BUILD_TYPE: [Debug, Release]
-        BUILD_TARGET: [BENCHMARK, TEST]
+      - BUILD_TYPE: Release
   script:
     - mkdir -p $CI_PROJECT_DIR/build
     - cmake -G Ninja
       -S $CI_PROJECT_DIR
       -B $CI_PROJECT_DIR/build
-      -D BUILD_$BUILD_TARGET=ON
+      -D BUILD_TEST=ON
+      -D BUILD_BENCHMARK=ON
       -D AMDGPU_TARGETS=$GPU_TARGET
       -D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
       -D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
@@ -427,7 +429,6 @@ test-windows-release:
   parallel:
     matrix:
       - BUILD_TYPE: Release
-        BUILD_TARGET: TEST

 .test-package:
   script:

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projec
 * Added `rocprim::int128_t` to the supported autotuning types to improve performance for 128-bit integers.
 * Added the `rocprim::merge_inplace` function for merging in-place.
 * Added initial value support for warp- and block-level inclusive scan.
+* Added support for building tests with device-side random data generation, making them finish faster. This requires rocRAND, and is enabled with the `WITH_ROCRAND=ON` build flag.

 ### Changed

CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -68,6 +68,7 @@ include(CMakeDependentOption)
 # Disables building tests, benchmarks, examples
 option(ONLY_INSTALL "Only install" OFF)
 cmake_dependent_option(BUILD_TEST "Build tests (requires googletest)" OFF "NOT ONLY_INSTALL" OFF)
+option(WITH_ROCRAND "Build tests with device-side data generation(requires rocRAND)" OFF)
 cmake_dependent_option(BUILD_BENCHMARK "Build benchmarks" OFF "NOT ONLY_INSTALL" OFF)
 cmake_dependent_option(BUILD_EXAMPLE "Build examples" OFF "NOT ONLY_INSTALL" OFF)
 option(BUILD_NAIVE_BENCHMARK "Build naive benchmarks" OFF)
@@ -131,6 +132,7 @@ set(BUILD_SHARED_LIBS OFF) # don't build client dependencies as shared

 # Get dependencies (required here to get rocm-cmake)
 include(cmake/Dependencies.cmake)
+
 # Use target ID syntax if supported for GPU_TARGETS
 if(USE_HIPCXX)
   enable_language(HIP)

CONTRIBUTING.md

Lines changed: 2 additions & 0 deletions
@@ -41,6 +41,8 @@ and performs well across a variety of input types and sizes. More specifically:
 - Tests and benchmarks must be instantiated with all supported data types.
 - If the algorithm uses multiple data types (for instance, if it uses different types for input and output), a selected and representative few combinations should be tested instead of the full combination matrix.

+Any utility needed by the tests **and** benchmarks must be added to the appropriate header within the `common` folder. Non-common utilities may be hosted in the corresponding headers from the `test` or `benchmark` folders. For a more detailed description of the cases to be considered for adding new utilities, please check [common](/common/README.md).
+
 We also employ automated testing and benchmarking via checks that are run when a pull request is created.
 These checks:
 - test all algorithms for correctness across a variety of input configurations (eg. types, sizes, etc.)

benchmark/ConfigAutotuneSettings.cmake

Lines changed: 3 additions & 3 deletions
@@ -29,10 +29,10 @@ set(LIMITED_TUNING_TYPES "rocprim::int128_t int64_t int short int8_t")

 function(read_config_autotune_settings file list_across_names list_across output_pattern_suffix)
   if(file STREQUAL "benchmark_device_adjacent_difference")
-    set(list_across_names "DataType;Left;InPlace;BlockSize" PARENT_SCOPE)
+    set(list_across_names "DataType;Left;Aliasing;BlockSize" PARENT_SCOPE)
     set(list_across "${TUNING_TYPES};\
-true;false true;32 64 128 256 512 1024" PARENT_SCOPE)
-    set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@" PARENT_SCOPE)
+true;no_alias in_place;32 64 128 256 512 1024" PARENT_SCOPE)
+    set(output_pattern_suffix "@DataType@_@Left@_@Aliasing@_@BlockSize@" PARENT_SCOPE)
   elseif(file STREQUAL "benchmark_device_adjacent_find")
     set(list_across_names "InputType;BlockSize" PARENT_SCOPE)
     set(list_across "${TUNING_TYPES};64 128 256 512 1024" PARENT_SCOPE)

benchmark/benchmark_block_exchange.cpp

Lines changed: 5 additions & 3 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2017-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -24,6 +24,8 @@
 #include "benchmark_utils.hpp"
 #include "cmdparser.hpp"

+#include "../common/utils_custom_type.hpp"
+
 // Google Benchmark
 #include <benchmark/benchmark.h>

@@ -312,8 +314,8 @@ void add_benchmarks(const std::string& name,
                     const managed_seed& seed,
                     hipStream_t stream)
 {
-    using custom_float2 = custom_type<float, float>;
-    using custom_double2 = custom_type<double, double>;
+    using custom_float2 = common::custom_type<float, float>;
+    using custom_double2 = common::custom_type<double, double>;

     std::vector<benchmark::internal::Benchmark*> bs = {BENCHMARK_TYPE(int, 256),
                                                        BENCHMARK_TYPE(int8_t, 256),

benchmark/benchmark_block_radix_rank.cpp

Lines changed: 5 additions & 3 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -24,6 +24,8 @@
 // CmdParser
 #include "cmdparser.hpp"

+#include "../common/utils_data_generation.hpp"
+
 // Google Benchmark
 #include <benchmark/benchmark.h>

@@ -110,8 +112,8 @@ void run_benchmark(benchmark::State& state,
     const unsigned int size = items_per_block * grid_size;

     std::vector<T> input = get_random_data<T>(size,
-                                              generate_limits<T>::min(),
-                                              generate_limits<T>::max(),
+                                              common::generate_limits<T>::min(),
+                                              common::generate_limits<T>::max(),
                                               seed.get_0());

     T* d_input;
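
The hunk above only moves `generate_limits` into the `common` namespace; its definition is not part of this excerpt. As a rough illustration of what such a range trait could look like, the sketch below defines a hypothetical `example::generate_limits`. The namespace, the floating-point bounds, and the choice of specializations are all assumptions made for illustration, not the rocPRIM implementation.

```cpp
#include <limits>
#include <type_traits>

namespace example // hypothetical namespace, not rocPRIM's `common`
{

// Primary template is left undefined so unsupported types fail to compile.
template<class T, class Enable = void>
struct generate_limits;

// Integral types: use the full representable range.
template<class T>
struct generate_limits<T, std::enable_if_t<std::is_integral<T>::value>>
{
    static constexpr T min() { return std::numeric_limits<T>::min(); }
    static constexpr T max() { return std::numeric_limits<T>::max(); }
};

// Floating-point types: an assumed, deliberately narrow range so that
// generated values stay far away from overflow in later arithmetic.
template<class T>
struct generate_limits<T, std::enable_if_t<std::is_floating_point<T>::value>>
{
    static constexpr T min() { return T(-1000); }
    static constexpr T max() { return T(1000); }
};

} // namespace example
```

A caller would then pass `example::generate_limits<T>::min()` and `example::generate_limits<T>::max()` as the bounds of a data generator, in the same way the benchmark passes them to `get_random_data<T>`.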

benchmark/benchmark_block_radix_sort.cpp

Lines changed: 8 additions & 5 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2017-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -24,6 +24,9 @@
 // CmdParser
 #include "cmdparser.hpp"

+#include "../common/utils_custom_type.hpp"
+#include "../common/utils_data_generation.hpp"
+
 // Google Benchmark
 #include <benchmark/benchmark.h>

@@ -54,7 +57,7 @@ enum class benchmark_kinds
 };

 template<typename T>
-using select_decomposer_t = std::conditional_t<is_custom_type<T>::value,
+using select_decomposer_t = std::conditional_t<common::is_custom_type<T>::value,
                                                custom_type_decomposer<T>,
                                                rocprim::identity_decomposer>;

@@ -140,8 +143,8 @@ void run_benchmark(benchmark::State& state,
     const auto size = items_per_block * ((N + items_per_block - 1) / items_per_block);

     std::vector<T> input = get_random_data<T>(size,
-                                              generate_limits<T>::min(),
-                                              generate_limits<T>::max(),
+                                              common::generate_limits<T>::min(),
+                                              common::generate_limits<T>::max(),
                                               seed.get_0());

     T* d_input;
@@ -230,7 +233,7 @@ void add_benchmarks(benchmark_kinds benchmark_kind
                     const managed_seed& seed,
                     hipStream_t stream)
 {
-    using custom_int_type = custom_type<int, int>;
+    using custom_int_type = common::custom_type<int, int>;

     std::vector<benchmark::internal::Benchmark*> bs = {BENCHMARK_TYPE(int, 64, 3),
                                                        BENCHMARK_TYPE(int, 512, 3),
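
The `select_decomposer_t` alias in this file picks a decomposer type at compile time depending on whether the key is a `custom_type`. The definitions of `common::custom_type` and `common::is_custom_type` are not shown in this excerpt, so the sketch below uses simplified, hypothetical stand-ins in an `example` namespace to show how such a trait-driven selection could work.

```cpp
#include <type_traits>

namespace example // hypothetical stand-ins, not the rocPRIM definitions
{

// Simplified two-member aggregate standing in for common::custom_type.
template<class T, class U>
struct custom_type
{
    T x;
    U y;
};

// Trait: false by default, true only for custom_type instantiations.
template<class T>
struct is_custom_type : std::false_type
{};

template<class T, class U>
struct is_custom_type<custom_type<T, U>> : std::true_type
{};

// Placeholder decomposers; the real ones describe how to split a key
// into sortable parts.
struct identity_decomposer
{};

template<class T>
struct custom_type_decomposer
{};

// Same shape as the alias in the diff: choose the decomposer from the trait.
template<class T>
using select_decomposer_t = std::conditional_t<is_custom_type<T>::value,
                                               custom_type_decomposer<T>,
                                               identity_decomposer>;

} // namespace example
```

With these stand-ins, `example::select_decomposer_t<int>` is `example::identity_decomposer`, while `example::select_decomposer_t<example::custom_type<int, int>>` is the matching `custom_type_decomposer` specialization.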

benchmark/benchmark_block_reduce.cpp

Lines changed: 5 additions & 3 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2017-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -24,6 +24,8 @@
 // CmdParser
 #include "cmdparser.hpp"

+#include "../common/utils_custom_type.hpp"
+
 // Google Benchmark
 #include <benchmark/benchmark.h>

@@ -170,8 +172,8 @@ void add_benchmarks(std::vector<benchmark::internal::Benchmark*>& benchmarks,
                     hipStream_t stream,
                     size_t bytes)
 {
-    using custom_float2 = custom_type<float, float>;
-    using custom_double2 = custom_type<double, double>;
+    using custom_float2 = common::custom_type<float, float>;
+    using custom_double2 = common::custom_type<double, double>;

     std::vector<benchmark::internal::Benchmark*> new_benchmarks
         = {// When block size is less than or equal to warp size

benchmark/benchmark_block_run_length_decode.cpp

Lines changed: 6 additions & 4 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2021-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2021-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -23,6 +23,8 @@
 #include "benchmark_utils.hpp"
 #include "cmdparser.hpp"

+#include "../common/utils_data_generation.hpp"
+
 #include <benchmark/benchmark.h>

 #include <rocprim/block/block_load_func.hpp>
@@ -115,10 +117,10 @@ void run_benchmark(benchmark::State& state,

     engine_type prng(seed.get_0());
     using ItemDistribution = std::conditional_t<rocprim::is_integral<ItemT>::value,
-                                                uniform_int_distribution<ItemT>,
+                                                common::uniform_int_distribution<ItemT>,
                                                 std::uniform_real_distribution<ItemT>>;
-    ItemDistribution                  run_item_dist(0, 100);
-    uniform_int_distribution<OffsetT> run_length_dist(MinRunLength, MaxRunLength);
+    ItemDistribution                          run_item_dist(0, 100);
+    common::uniform_int_distribution<OffsetT> run_length_dist(MinRunLength, MaxRunLength);

     for(size_t i = 0; i < num_runs; ++i)
     {
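
The `ItemDistribution` alias in this hunk selects an integer or a real distribution depending on the item type. The sketch below shows the same compile-time selection using only the standard library. It substitutes `std::uniform_int_distribution` for `common::uniform_int_distribution`, whose definition is not shown here, and the names `distribution_for_t` and `draw` are hypothetical. Note that the standard integer distribution is not specified for 8-bit types, which may be one reason a project would provide its own; this excerpt does not say.

```cpp
#include <random>
#include <type_traits>

namespace example // hypothetical helpers for illustration only
{

// Integral T selects an integer distribution, everything else a real one.
// Only the selected branch is ever instantiated.
template<class T>
using distribution_for_t = std::conditional_t<std::is_integral<T>::value,
                                              std::uniform_int_distribution<T>,
                                              std::uniform_real_distribution<T>>;

// Draws one value from [lo, hi] for integers or [lo, hi) for reals.
template<class T, class Engine>
T draw(Engine& engine, T lo, T hi)
{
    distribution_for_t<T> dist(lo, hi);
    return dist(engine);
}

} // namespace example
```

For example, with `std::mt19937 engine(42);`, the call `example::draw<int>(engine, 0, 100)` returns an integer between 0 and 100 inclusive, and `example::draw<double>(engine, 0.0, 100.0)` returns a real value in the half-open range.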

benchmark/benchmark_block_scan.cpp

Lines changed: 5 additions & 3 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2017-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -24,6 +24,8 @@
 // CmdParser
 #include "cmdparser.hpp"

+#include "../common/utils_custom_type.hpp"
+
 // Google Benchmark
 #include <benchmark/benchmark.h>

@@ -205,8 +207,8 @@ void add_benchmarks(std::vector<benchmark::internal::Benchmark*>& benchmarks,
                     hipStream_t stream,
                     size_t bytes)
 {
-    using custom_float2 = custom_type<float, float>;
-    using custom_double2 = custom_type<double, double>;
+    using custom_float2 = common::custom_type<float, float>;
+    using custom_double2 = common::custom_type<double, double>;

     std::vector<benchmark::internal::Benchmark*> new_benchmarks
         = {// When block size is less than or equal to warp size

benchmark/benchmark_block_sort.parallel.hpp

Lines changed: 8 additions & 5 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2019-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -25,6 +25,8 @@

 #include "benchmark_utils.hpp"

+#include "../common/utils_data_generation.hpp"
+
 // Google Benchmark
 #include <benchmark/benchmark.h>

@@ -243,10 +245,11 @@ struct block_sort_benchmark : public config_autotune_interface

         const auto size = items_per_block * ((N + items_per_block - 1) / items_per_block);

-        std::vector<KeyType> input = get_random_data<KeyType>(size,
-                                                              generate_limits<KeyType>::min(),
-                                                              generate_limits<KeyType>::max(),
-                                                              seed.get_0());
+        std::vector<KeyType> input
+            = get_random_data<KeyType>(size,
+                                       common::generate_limits<KeyType>::min(),
+                                       common::generate_limits<KeyType>::max(),
+                                       seed.get_0());

         KeyType* d_input;
         KeyType* d_output;
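
The context line computing `size` rounds `N` up to the next multiple of `items_per_block`, so that every block receives a full tile of items. A minimal sketch of that arithmetic follows; the helper name `round_up_to_multiple` is hypothetical and does not appear in the source.

```cpp
#include <cstddef>

namespace example // hypothetical helper for illustration only
{

// Rounds n up to the nearest multiple of block. block must be non-zero,
// and n + block - 1 must not overflow std::size_t.
constexpr std::size_t round_up_to_multiple(std::size_t n, std::size_t block)
{
    // Integer division truncates, so adding block - 1 first turns it
    // into a ceiling division.
    return block * ((n + block - 1) / block);
}

} // namespace example
```

For instance, `example::round_up_to_multiple(1000, 256)` evaluates to 1024, while an input that is already a multiple, such as 1024, is returned unchanged.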

benchmark/benchmark_device_adjacent_difference.cpp

Lines changed: 12 additions & 7 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -23,6 +23,11 @@
 #include "benchmark_device_adjacent_difference.parallel.hpp"
 #include "benchmark_utils.hpp"

+#ifndef BENCHMARK_CONFIG_TUNING
+    #include "../common/device_adjacent_difference.hpp"
+    #include "../common/utils_custom_type.hpp"
+#endif
+
 // Google Benchmark
 #include <benchmark/benchmark.h>

@@ -56,10 +61,10 @@ constexpr std::size_t DEFAULT_BYTES = 1024LL * 1024LL * 1024LL * 2LL;

 // clang-format off
 #define CREATE_BENCHMARKS(T) \
-    CREATE_BENCHMARK(T, true, false) \
-    CREATE_BENCHMARK(T, true, true) \
-    CREATE_BENCHMARK(T, false, false) \
-    CREATE_BENCHMARK(T, false, true)
+    CREATE_BENCHMARK(T, true, common::api_variant::no_alias) \
+    CREATE_BENCHMARK(T, true, common::api_variant::in_place) \
+    CREATE_BENCHMARK(T, false, common::api_variant::no_alias) \
+    CREATE_BENCHMARK(T, false, common::api_variant::in_place)
 // clang-format on

 int main(int argc, char* argv[])
@@ -112,8 +117,8 @@ int main(int argc, char* argv[])
                                                     seed,
                                                     stream);
 #else // BENCHMARK_CONFIG_TUNING
-    using custom_float2 = custom_type<float, float>;
-    using custom_double2 = custom_type<double, double>;
+    using custom_float2 = common::custom_type<float, float>;
+    using custom_double2 = common::custom_type<double, double>;
     // Add benchmarks
     CREATE_BENCHMARKS(int)
     CREATE_BENCHMARKS(std::int64_t)
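
This file replaces the boolean `InPlace` template argument with a named `common::api_variant` enumerator, which makes each `CREATE_BENCHMARK` line self-describing. The enumerators `no_alias` and `in_place` appear in the diff, but the enum's definition does not, so the sketch below declares a hypothetical `example::api_variant` with just those two values; the real enum may contain more. The `adjacent_difference_benchmark` struct is likewise an illustrative stand-in.

```cpp
namespace example // hypothetical stand-ins, not the rocPRIM definitions
{

// Assumed shape of the enum; only these two enumerators are confirmed
// by the diff.
enum class api_variant
{
    no_alias,
    in_place
};

// A benchmark parameterised on the variant instead of a bare bool.
// Compare <int, true, true> with <int, true, api_variant::in_place>:
// the second form states what the last argument means.
template<class T, bool Left, api_variant Aliasing>
struct adjacent_difference_benchmark
{
    static constexpr bool left() { return Left; }
    static constexpr bool in_place() { return Aliasing == api_variant::in_place; }
};

} // namespace example
```

A scoped enum also prevents the two adjacent `bool` arguments from being swapped silently, since `api_variant` does not convert implicitly to or from `bool`.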

benchmark/benchmark_device_adjacent_difference.parallel.cpp.in

Lines changed: 8 additions & 6 deletions
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2025 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -20,18 +20,20 @@
 // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 // SOFTWARE.

-#include "benchmark_utils.hpp"
 #include "benchmark_device_adjacent_difference.parallel.hpp"
+#include "benchmark_utils.hpp"
+
+#include "../common/device_adjacent_difference.hpp"

 #include <rocprim/types.hpp>

 #include <stdint.h>

 namespace {
     auto benchmarks = config_autotune_register::create_bulk(
-        device_adjacent_difference_benchmark_generator<
-            @DataType@,
+        device_adjacent_difference_benchmark_generator<
+            @DataType@,
             @BlockSize@,
-            @Left@,
-            @InPlace@>::create);
+            @Left@,
+            common::api_variant::@Aliasing@>::create);
 }
