
[ROCm] add support for ROCm/HIP device #6086


Open
wants to merge 36 commits into base: master

Conversation

jeffdaily

To build for ROCm:

./helpers/hipify.sh
mkdir build
cd build
cmake -DUSE_ROCM=1 ..

CUDA source files are hipified in-place using the helper script before running cmake. The "cuda" device is re-used for rocm, so device=cuda will work the same for rocm builds.
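
For readers unfamiliar with hipification, here is a minimal sketch of the kind of in-place rewrite such a helper typically performs on CUDA sources. This is illustrative code, not taken from the PR, and it assumes the script wraps the standard hipify tooling; the hip* names in the comments are the usual HIP equivalents.

// illustrative CUDA snippet (not from LightGBM); the comments show how a
// hipify pass would rewrite it in place for a ROCm build
#include <cuda_runtime.h>                                     // rewritten to <hip/hip_runtime.h>

int main() {
  float* ptr = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&ptr), 1024 * sizeof(float));  // rewritten to hipMalloc(...)
  cudaFree(ptr);                                              // rewritten to hipFree(...)
  return 0;
}

Because the rewrite happens at the source level, the user-facing device name does not change, which is why device=cuda still selects this backend on a ROCm build.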

Summary of changes:

  • CMakeLists.txt ROCm updates, also replace glob with explicit file list
  • support both warpSize 32 and 64 (see the sketch below)
  • helpers/hipify.sh script added
  • .gitignore to ignore generated hip source files *.prehip
  • disable compiler warnings
  • move __device__ template function PercentileDevice into header
  • bug fixes for __host__ __define__

- CMakeLists.txt ROCm updates, also replace glob with explicit file list
- initial warpSize interop changes
- helpers/hipify.sh script added
- .gitignore to ignore generated hip source files
- disable compiler warnings
- move PercentileDevice __device__ template function into header
- bug fixes for __host__ __define__ and __HIP__ preprocessor symbols
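
As a rough sketch of the warpSize item in the summary above (illustrative names only: USE_ROCM mirrors the cmake option, and WARPSIZE is a hypothetical compile-time constant, not necessarily how this PR spells it), the idea is that AMD GPUs targeted by ROCm commonly execute 64-wide wavefronts while NVIDIA GPUs use 32-wide warps, so warp-width assumptions cannot hard-code 32:

#include <cuda_runtime.h>

// pick the assumed warp width at compile time
#ifdef USE_ROCM
#define WARPSIZE 64   // assumption: wavefront-64 AMD hardware
#else
#define WARPSIZE 32   // NVIDIA warps
#endif

__global__ void CheckWarpSizeKernel(int* ok) {
  // warpSize is the device built-in; WARPSIZE is the compile-time assumption
  if (threadIdx.x == 0) {
    *ok = (warpSize == WARPSIZE) ? 1 : 0;
  }
}
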
@jameslamb (Collaborator) left a comment

Thanks for your interest in LightGBM. Since I'm not aware of any prior conversation in this project about adding support like this, we have some questions before spending time supporting this.

  • what is ROCm/HIP? Where can we read to learn more?
  • what is the value of this addition to LightGBM's users? What does this offer that the OpenCL-based and CUDA-based builds of LightGBM don't already offer?
    • this project's OpenCL-based GPU build is already struggling from a severe lack of maintenance... I'm very skeptical of taking on a third GPU build
  • how might we test this? What types of devices should we expect to be supported?

@shiyu1994 (Contributor)

@jeffdaily Thank you, this is very exciting! @jameslamb ROCm is the counterpart of CUDA for AMD GPUs. I haven't had any prior discussion with @jeffdaily about this, but it would be great if we could broaden the set of devices LightGBM supports.

@jeffdaily (Author)

Apologies for coming out of nowhere with this. We use LightGBM; the OpenCL-based 'gpu' device already works on our AMD GPUs, but we were curious whether we could get better performance by porting the 'cuda' device to AMD GPUs. This started as a proof of concept, but it seemed useful to share even in its current state.

Using the GPU-Tutorial, here are my results on our MI210.

| what is evaluated | CPU | GPU (OpenCL) | "cuda" (really ROCm) |
| --- | --- | --- | --- |
| correctness | auc: 0.821268, 18.547533 seconds | auc: 0.821268, 20.386780 seconds | auc: 0.821268, 9.049307 seconds |
| speed, objective=binary metric=auc | 22.604444 seconds | 18.028674 seconds | 7.787303 seconds |
| speed, objective=regression_l2 metric=l2 | 18.961535 seconds | 14.491217 seconds | 7.871302 seconds |

@jeffdaily (Author)

  • what is ROCm/HIP? Where can we read to learn more?

https://rocm.docs.amd.com/en/latest/rocm.html

  • what is the value of this addition to LightGBM's users? What does this offer that the OpenCL-based and CUDA-based builds of LightGBM don't already offer?

See the perf results from the comment above.

  • this project's OpenCL-based GPU build is already struggling from a severe lack of maintenance... I'm very skeptical of taking on a third GPU build
  • how might we test this? What types of devices should we expect to be supported?

Here is the current list of supported AMD GPUs.

To test this, you'll need to run on one of the supported AMD GPUs. How is the cuda device currently tested?

@ibustany commented Sep 8, 2023

Thank you and kudos Jeff!
This work has been much needed!
Best regards,
Ismail

@jameslamb (Collaborator)

To test this, you'll need to run on one of the supported AMD GPUs. How is the cuda device currently tested?

We run a VM in Azure with a Tesla V100 on it, and schedule jobs onto it via GitHub Actions.

Are you aware of any free CI service supporting AMD GPUs? Otherwise, since I see you work for AMD and since merging this might further AMD's interests... would AMD maybe be willing to fund testing resources for this project? Maybe that's something you and @shiyu1994 (the only maintainer here who's employed by Microsoft) could coordinate?

@jeffdaily (Author)

Are you aware of any free CI service supporting AMD GPUs? Otherwise, since I see you work for AMD and since merging this might further AMD's interests... would AMD maybe be willing to fund testing resources for this project? Maybe that's something you and @shiyu1994 (the only maintainer here who's employed by Microsoft) could coordinate?

Microsoft does have an AMD GPU deployment. I'm aware of it being used for onnxruntime CI purposes. I wonder if some of those resources could be used here? @shiyu1994?

@jeffdaily (Author)

Noting that the only current CI failure is unrelated to my changes. It seems to be an environment setup issue for that job, perhaps a temporary one.

@shiyu1994 (Contributor)

I have access to some AMD MI100 GPUs, but we would still need a separate budget for an agent with an AMD GPU if we want to run tests automatically in CI. Do you think it is acceptable if I run the tests for AMD GPUs offline, without an additional CI agent, given that the GPU code is shared by the CUDA and ROCm builds? @jameslamb @guolinke @jeffdaily

@jameslamb (Collaborator)

Do you think it is acceptable if I run the tests for AMD GPUs offline, without an additional CI agent?

If you feel confident in these changes based on that, and you think the added complexity in the CUDA code is worth it, that's fine with me. I'll defer to your opinion.

But without a CI job, there's a high risk that future refactorings will break this support again.

@jameslamb jameslamb dismissed their stale review September 13, 2023 14:11

dismissing

@jameslamb (Collaborator)

I dismissed my review, so that it doesn't block merging. My initial questions have been answered, thanks very much for those links and all that information!

@shiyu1994 and @guolinke seem excited about this addition... that's good enough for me 😊

I'll defer to them to review the code, as I know very little about CUDA.

@shiyu1994 (Contributor)

@jeffdaily Thanks for the great work! I'll review this in the next few days.

@StrikerRUS (Collaborator) left a comment

Thanks for updating the code!

I think this PR is blocked by #6766.

Also, I searched the CUDA code in the repo for the literals 32 and 64 and left some comments in places where warpsize can potentially be adjusted.

@@ -167,7 +169,7 @@ void CUDASingleGPUTreeLearner::LaunchReduceLeafStatKernel(

 template <typename T, bool IS_INNER>
 __global__ void CalcBitsetLenKernel(const CUDASplitInfo* best_split_info, size_t* out_len_buffer) {
-  __shared__ size_t shared_mem_buffer[32];
+  __shared__ size_t shared_mem_buffer[WARPSIZE];
Collaborator

Should we also adjust the code that relies on warpSize always being 32? For example, here:

len = (val / 32) + 1;
}
const size_t block_max_len = ShuffleReduceMax<size_t>(len, shared_mem_buffer, blockDim.x);

Contributor

Done in 3cd34a2.

Collaborator

Could you please explain how exactly my comment was addressed in 3cd34a2?

@@ -354,7 +358,7 @@ void CUDALambdarankNDCG::LaunchGetGradientsKernel(const double* score, score_t*
     }
   } else {
     BitonicArgSortItemsGlobal(score, num_queries_, cuda_query_boundaries_, cuda_item_indices_buffer_.RawData());
-    if (num_rank_label <= 32) {
+    if (num_rank_label <= 32 && device_prop.warpSize == 32) {
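
As an aside on the device_prop.warpSize check added above, a minimal host-side sketch (not from the PR) shows where that value comes from: the CUDA runtime reports the warp width of the installed GPU, and the hipified equivalent reports 64 on wavefront-64 AMD hardware.

#include <cuda_runtime.h>
#include <cstdio>

// illustrative only: query the warp width that guards like the one above consult
int main() {
  cudaDeviceProp device_prop;
  cudaGetDeviceProperties(&device_prop, 0);   // assumes device 0 is present
  std::printf("warpSize = %d\n", device_prop.warpSize);
  return 0;
}
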
Collaborator

Should we adjust the following code for a warpSize other than 32?

// assert that warpSize == 32
__shared__ double shared_buffer[32];

// assert that warpSize == 32, so we use buffer size 1024 / 32 = 32
__shared__ double shared_buffer[32];

Contributor

Done in 3cd34a2.

Collaborator

I guess that, according to the initial comment, it should be shared_buffer[1024 / WARPSIZE], not shared_buffer[WARPSIZE].
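
To spell out the sizing argument with a sketch (illustrative names, not LightGBM's actual ShuffleReduce helpers): the shared buffer holds one partial result per warp, and a block of at most 1024 threads contains at most 1024 / WARPSIZE warps, so a size of WARPSIZE is only coincidentally correct when the warp width is 32 (1024 / 32 = 32).

#include <cuda_runtime.h>

#ifndef WARPSIZE
#define WARPSIZE 32   // illustration; a wavefront-64 ROCm build would use 64 and HIP's shuffle intrinsics
#endif

__device__ double WarpReduceSumSketch(double val) {
  // shuffle-down tree reduction within one warp
  for (int offset = WARPSIZE / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffffu, val, offset);
  }
  return val;
}

// assumes blockDim.x is a multiple of WARPSIZE, at most 1024, and that
// `in` has gridDim.x * blockDim.x elements
__global__ void BlockReduceSumSketch(const double* in, double* out) {
  __shared__ double shared_buffer[1024 / WARPSIZE];   // one slot per warp
  double val = WarpReduceSumSketch(in[blockIdx.x * blockDim.x + threadIdx.x]);
  if (threadIdx.x % WARPSIZE == 0) {
    shared_buffer[threadIdx.x / WARPSIZE] = val;      // lane 0 of each warp stores its partial sum
  }
  __syncthreads();
  if (threadIdx.x < WARPSIZE) {
    // at most 1024 / WARPSIZE partial sums remain; they fit in a single warp
    const unsigned int num_warps = blockDim.x / WARPSIZE;
    val = (threadIdx.x < num_warps) ? shared_buffer[threadIdx.x] : 0.0;
    val = WarpReduceSumSketch(val);
    if (threadIdx.x == 0) {
      out[blockIdx.x] = val;
    }
  }
}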

shiyu1994 and others added 4 commits February 5, 2025 17:25
@shiyu1994 shiyu1994 requested a review from StrikerRUS February 26, 2025 03:12
@shiyu1994 (Contributor)

@StrikerRUS @jameslamb @guolinke Could you help to review this again when you have time? Thanks.

@StrikerRUS (Collaborator) left a comment

@shiyu1994 Thank you for considering my comments! I think that two of them were not addressed by your recent refactoring. Please check #6086 (comment) and #6086 (comment).

@shiyu1994 (Contributor)

shiyu1994 commented Jun 10, 2025

@shiyu1994 Thank you for considering my comments! I think that two of them were not addressed by your recent refactoring. Please check #6086 (comment) and #6086 (comment).

Thank you for the very careful check. I've done the fixes in 28d4648. Could you please review it again? @StrikerRUS

@StrikerRUS (Collaborator) left a comment

@shiyu1994

I've done the fixes in 28d4648.

I believe the right commit would be 6732b79.

I left my last suggestion for this PR below.

@@ -405,7 +409,7 @@ __global__ void GetGradientsKernel_RankXENDCG_SharedMemory(
   const data_size_t block_reduce_size = query_item_count >= 1024 ? 1024 : query_item_count;
   __shared__ double shared_rho[SHARED_MEMORY_SIZE];
   // assert that warpSize == 32
-  __shared__ double shared_buffer[32];
+  __shared__ double shared_buffer[WARPSIZE];
Collaborator

I guess it should be 1024 / WARPSIZE similarly to L530 in this file.

Contributor

Done via d4676d9.
Please check.
Thanks for the finding!

@shiyu1994 (Contributor)

@StrikerRUS Thanks for your review. Could you please check this PR again? If there are no other problems, let's merge this.

shiyu1994 added a commit that referenced this pull request Jun 17, 2025
@shiyu1994 (Contributor)

/AzurePipelines run

@StrikerRUS (Collaborator) left a comment

@jeffdaily Thank you so much for proposing this PR!

And thanks a lot @shiyu1994 for finishing it!

@StrikerRUS (Collaborator)

@jameslamb Could you please refresh your blocking review?

#6086 (review)
