Skip to content

Commit a83a3b6

Browse files
dzzz2001caic99
andauthored
Feature: enable multi-k calculation for CUDA version of module_gint (deepmodeling#4839)
* add find_matrix_offset function to Class hcontainer * modify dm matrix in gint_rho_gpu.cu * modify dm matrix in gint_force_gpu.cu * remove GPU restriction in multi-k * modify some functions name * modify hRGint in gint_vl_gpu.cu * enable mult-k calculation in gint_vl_gpu * add const * modify related doc * add two testing cases for multi-k calculation * fix an error * replace a test case * remove parameter “gamma_only" in get_device_flag * modify cuda.md * modify cuda.md again * Update docs/advanced/acceleration/cuda.md Co-authored-by: Chun Cai <[email protected]> * Update docs/advanced/acceleration/cuda.md Co-authored-by: Chun Cai <[email protected]> * Update docs/advanced/input_files/input-main.md Co-authored-by: Chun Cai <[email protected]> * Update cuda.md * Update cuda.md --------- Co-authored-by: Chun Cai <[email protected]>
1 parent f7ca386 commit a83a3b6

33 files changed

+462
-380
lines changed

docs/advanced/acceleration/cuda.md

+5-8
Original file line numberDiff line numberDiff line change
@@ -6,13 +6,13 @@ In ABACUS, we provide the option to use GPU devices to accelerate performance. T
66

77
- **Electronic state data**: (e.g. electronic density) are moved from the GPU to the CPU(s) every scf step.
88

9-
- **Acclerated by the NVIDIA libraries**: `cuBLAS` for common linear algebra calculations, `cuSolver` for eigen values/vectors, and `cuFFT` for the conversions between the real and recip spaces.
9+
- **Accelerated by the NVIDIA libraries**: `cuBLAS` for common linear algebra calculations, `cuSolver` for eigen values/vectors, and `cuFFT` for the conversions between the real and recip spaces.
1010

1111
- **Multi GPU supprted**: Using multiple MPI tasks will often give the best performance. Note each MPI task will be bind to a GPU device with automatically computing load balancing.
1212

1313
- **Parallel strategy**: K point parallel.
1414

15-
Unlike PW basis, only the grid integration module (module_gint) and the diagonalization of the Hamiltonian matrix (module_hsolver) have been implemented with GPU acceleration under LCAO basis, and the acceleration is limited to gamma only calculation. Additionally, LCAO basis does not support multi-GPU acceleration. Both the grid integration module and the Hamiltonian matrix solver only support acceleration on a single GPU.
15+
Unlike PW basis, only the grid integration module (module_gint) and the diagonalization of the Hamiltonian matrix (module_hsolver) have been implemented with GPU acceleration under LCAO basis.
1616

1717
## Required hardware/software
1818

@@ -31,17 +31,14 @@ Check the [Advanced Installation Options](https://abacus-rtd.readthedocs.io/en/l
3131

3232
## Run with the GPU support by editing the INPUT script:
3333

34-
In `INPUT` file we need to set the value keyword [device](../input_files/input-main.md#device) to be `gpu`.
34+
In `INPUT` file we need to set the input parameter [device](../input_files/input-main.md#device) to `gpu`. If this parameter is not set, ABACUS will try to determine if there are available GPUs.
35+
- Set `ks_solver`: For the PW basis, CG, BPCG and Davidson methods are supported on GPU; set the input parameter [ks_solver](../input_files/input-main.md#ks_solver) to `cg`, `bpcg` or `dav`. For the LCAO basis, `cusolver` is supported on GPU.
36+
- **multi-card**: ABACUS allows for multi-GPU acceleration. If you have multiple GPU cards, you can run ABACUS with several MPI processes, and each process will utilize one GPU card. For example, the command `mpirun -n 2 abacus` will by default launch two GPUs for computation. If you only have one card, this command will only start one GPU.
3537

3638
## Examples
3739
We provides [examples](https://github.com/deepmodeling/abacus-develop/tree/develop/examples/gpu) of gpu calculations.
3840

3941
## Known limitations
4042
PW basis:
41-
- CG, BPCG and Davidson methods are supported, so the input keyword `ks_solver` can take the values `cg`, `bpcg` or `dav`.
4243
- Only k point parallelization is supported, so the input keyword `kpar` will be set to match the number of MPI tasks automatically.
4344
- By default, CUDA architectures 60, 70, 75, 80, 86, and 89 are compiled (if supported). It can be overriden using the CMake variable [`CMAKE_CUDA_ARCHITECTURES`](https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_ARCHITECTURES.html) or the environmental variable [`CUDAARCHS`](https://cmake.org/cmake/help/latest/envvar/CUDAARCHS.html).
44-
45-
LCAO basis:
46-
- Does not support multi-k calculation, so if the input keyword `device` is set to `gpu`, the input keyword `gamma_only` can only take the value `1`.
47-
- Does not support multi-GPU acceleration.

docs/advanced/input_files/input-main.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -629,7 +629,7 @@ If only one value is set (such as `kspacing 0.5`), then kspacing values of a/b/c
629629
- cpu: for CPUs via Intel, AMD, or Other supported CPU devices
630630
- gpu: for GPUs via CUDA or ROCm.
631631

632-
Known limitations: If using the pw basis, the ks_solver must be cg/bpcg/dav to support `gpu` acceleration. If using the lcao basis, `gamma_only` must be set to `1`, as multi-k calculation is currently not supported for `gpu`. lcao_in_pw also does not support `gpu`.
632+
Known limitations: `ks_solver` must also be set to the algorithms supported. lcao_in_pw currently does not support `gpu`.
633633

634634
- **Default**: cpu
635635

source/module_base/module_device/device.cpp

+1-6
Original file line numberDiff line numberDiff line change
@@ -149,8 +149,7 @@ int set_device_by_rank(const MPI_Comm mpi_comm) {
149149

150150
std::string get_device_flag(const std::string &device,
151151
const std::string &ks_solver,
152-
const std::string &basis_type,
153-
const bool &gamma_only) {
152+
const std::string &basis_type) {
154153
if (device == "cpu") {
155154
return "cpu"; // no extra checks required
156155
}
@@ -181,10 +180,6 @@ if (basis_type == "lcao_in_pw") {
181180
error_message +=
182181
"The GPU currently does not support the basis type \"lcao_in_pw\"!";
183182
}
184-
if (basis_type == "lcao" && gamma_only == false) {
185-
error_message += "The GPU currently does not support the basis type "
186-
"\"lcao\" with \"gamma_only\" set to \"0\"!";
187-
}
188183
if(error_message.empty())
189184
{
190185
return "gpu"; // possibly automatically set to GPU

source/module_base/module_device/device.h

+1-2
Original file line numberDiff line numberDiff line change
@@ -48,8 +48,7 @@ int get_device_kpar(const int& kpar);
4848
*/
4949
std::string get_device_flag(const std::string& device,
5050
const std::string& ks_solver,
51-
const std::string& basis_type,
52-
const bool& gamma_only);
51+
const std::string& basis_type);
5352

5453
#if __MPI
5554
int get_node_rank();

source/module_hamilt_lcao/module_gint/gint.cpp

+4-4
Original file line numberDiff line numberDiff line change
@@ -45,16 +45,16 @@ void Gint::cal_gint(Gint_inout* inout) {
4545
}
4646
if (max_size > 0) {
4747
#ifdef __CUDA
48-
if (GlobalV::device_flag == "gpu" && GlobalV::GAMMA_ONLY_LOCAL
48+
if (GlobalV::device_flag == "gpu"
4949
&& (inout->job == Gint_Tools::job_type::vlocal
5050
|| inout->job == Gint_Tools::job_type::rho
5151
|| inout->job == Gint_Tools::job_type::force)) {
5252
if (inout->job == Gint_Tools::job_type::vlocal) {
53-
gamma_gpu_vlocal_interface(inout);
53+
gpu_vlocal_interface(inout);
5454
} else if (inout->job == Gint_Tools::job_type::rho) {
55-
gamma_gpu_rho_interface(inout);
55+
gpu_rho_interface(inout);
5656
} else if (inout->job == Gint_Tools::job_type::force) {
57-
gamma_gpu_force_interface(inout);
57+
gpu_force_interface(inout);
5858
}
5959
} else
6060
#endif

source/module_hamilt_lcao/module_gint/gint.h

+3-3
Original file line numberDiff line numberDiff line change
@@ -71,11 +71,11 @@ class Gint {
7171
int ny, nplane, startz_current; // from rhopw
7272

7373
// in cal_gint_gpu.cpp
74-
void gamma_gpu_vlocal_interface(Gint_inout* inout);
74+
void gpu_vlocal_interface(Gint_inout* inout);
7575

76-
void gamma_gpu_rho_interface(Gint_inout* inout);
76+
void gpu_rho_interface(Gint_inout* inout);
7777

78-
void gamma_gpu_force_interface(Gint_inout* inout);
78+
void gpu_force_interface(Gint_inout* inout);
7979

8080
// in cal_gint_cpu.cpp
8181

source/module_hamilt_lcao/module_gint/gint_force_gpu.cu

+9-32
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ namespace GintKernel
2020
* 6. force dot on the GPU.
2121
* 7. Copy the results back to the host.
2222
*/
23-
void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
23+
void gint_fvl_gpu(const hamilt::HContainer<double>* dm,
2424
const double* vlocal,
2525
double* force_in,
2626
double* stress_in,
@@ -82,36 +82,12 @@ void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
8282
Cuda_Mem_Wrapper<double> force(3 * nat, num_streams, true);
8383
Cuda_Mem_Wrapper<double> stress(6, num_streams, true);
8484

85-
Cuda_Mem_Wrapper<double> dm_matrix(lgd * lgd, 1, true);
86-
for (int iat1 = 0; iat1 < ucell.nat; iat1++)
87-
{
88-
for (int iat2 = 0; iat2 < ucell.nat; iat2++)
89-
{
90-
const int it1 = ucell.iat2it[iat1];
91-
const int it2 = ucell.iat2it[iat2];
92-
const int lo1
93-
= gridt.trace_lo[ucell.itiaiw2iwt(it1, ucell.iat2ia[iat1], 0)];
94-
const int lo2
95-
= gridt.trace_lo[ucell.itiaiw2iwt(it2, ucell.iat2ia[iat2], 0)];
96-
97-
hamilt::AtomPair<double>* tmp_ap = dm->find_pair(iat1, iat2);
98-
int orb_index = 0;
99-
if (tmp_ap == NULL)
100-
{
101-
continue;
102-
}
103-
for (int orb_i = 0; orb_i < tmp_ap->get_row_size(); orb_i++)
104-
{
105-
for (int orb_j = 0; orb_j < tmp_ap->get_col_size(); orb_j++)
106-
{
107-
dm_matrix.get_host_pointer()[(lo1 + orb_i) * lgd + (lo2 + orb_j)]
108-
= tmp_ap->get_pointer(0)[orb_index];
109-
orb_index++;
110-
}
111-
}
112-
}
113-
}
114-
dm_matrix.copy_host_to_device_sync();
85+
Cuda_Mem_Wrapper<double> dm_matrix(dm->get_nnr(), 1, false);
86+
// retrieve the density matrix on the host
87+
checkCuda(cudaMemcpy(dm_matrix.get_device_pointer(),
88+
dm->get_wrapper(),
89+
dm->get_nnr() * sizeof(double),
90+
cudaMemcpyHostToDevice));
11591

11692
#pragma omp parallel for num_threads(num_streams) collapse(2)
11793
for (int i = 0; i < gridt.nbx; i++)
@@ -142,7 +118,8 @@ void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
142118
dr_part.get_host_pointer(sid),
143119
vldr3.get_host_pointer(sid));
144120

145-
alloc_mult_force(gridt,
121+
alloc_mult_force(dm,
122+
gridt,
146123
ucell,
147124
grid_index_ij,
148125
max_atom,

source/module_hamilt_lcao/module_gint/gint_force_gpu.h

+3-2
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@
55
#include "module_hamilt_lcao/module_gint/grid_technique.h"
66
namespace GintKernel
77
{
8-
void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
8+
void gint_fvl_gpu(const hamilt::HContainer<double>* dm,
99
const double* vlocal,
1010
double* force_in,
1111
double* stress_in,
@@ -29,7 +29,8 @@ void gtask_force(const Grid_Technique& gridt,
2929
double* dr_part,
3030
double* vldr3);
3131

32-
void alloc_mult_force(const Grid_Technique& gridt,
32+
void alloc_mult_force(const hamilt::HContainer<double>* dm,
33+
const Grid_Technique& gridt,
3334
const UnitCell& ucell,
3435
const int grid_index_ij,
3536
const int max_atom,

source/module_hamilt_lcao/module_gint/gint_gpu_interface.cpp

+15-12
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@
55
#include "module_base/memory.h"
66
#include "module_base/timer.h"
77

8-
void Gint::gamma_gpu_vlocal_interface(Gint_inout* inout) {
8+
void Gint::gpu_vlocal_interface(Gint_inout* inout) {
99
ModuleBase::TITLE("Gint_interface", "cal_gint_vlocal");
1010
ModuleBase::timer::tick("Gint_interface", "cal_gint_vlocal");
1111

@@ -17,19 +17,22 @@ void Gint::gamma_gpu_vlocal_interface(Gint_inout* inout) {
1717
ylmcoef[i] = ModuleBase::Ylm::ylmcoef[i];
1818
}
1919

20-
GintKernel::gint_gamma_vl_gpu(this->hRGint,
21-
inout->vl,
22-
ylmcoef,
23-
dr,
24-
this->gridt->rcuts.data(),
25-
*this->gridt,
26-
ucell);
20+
double* pvpR = GlobalV::GAMMA_ONLY_LOCAL ? nullptr : this->pvpR_reduced[inout->ispin];
21+
GintKernel::gint_vl_gpu(this->hRGint,
22+
inout->vl,
23+
ylmcoef,
24+
dr,
25+
this->gridt->rcuts.data(),
26+
*this->gridt,
27+
ucell,
28+
pvpR,
29+
GlobalV::GAMMA_ONLY_LOCAL);
2730

2831
ModuleBase::TITLE("Gint_interface", "cal_gint_vlocal");
2932
ModuleBase::timer::tick("Gint_interface", "cal_gint_vlocal");
3033
}
3134

32-
void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
35+
void Gint::gpu_rho_interface(Gint_inout* inout) {
3336
ModuleBase::TITLE("Gint_interface", "cal_gint_rho");
3437
ModuleBase::timer::tick("Gint_interface", "cal_gint_rho");
3538

@@ -43,7 +46,7 @@ void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
4346
int nrxx = this->gridt->ncx * this->gridt->ncy * this->nplane;
4447
for (int is = 0; is < GlobalV::NSPIN; ++is) {
4548
ModuleBase::GlobalFunc::ZEROS(inout->rho[is], nrxx);
46-
GintKernel::gint_gamma_rho_gpu(this->DMRGint[is],
49+
GintKernel::gint_rho_gpu(this->DMRGint[is],
4750
ylmcoef,
4851
dr,
4952
this->gridt->rcuts.data(),
@@ -55,7 +58,7 @@ void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
5558
ModuleBase::timer::tick("Gint_interface", "cal_gint_rho");
5659
}
5760

58-
void Gint::gamma_gpu_force_interface(Gint_inout* inout) {
61+
void Gint::gpu_force_interface(Gint_inout* inout) {
5962
ModuleBase::TITLE("Gint_interface", "cal_gint_force");
6063
ModuleBase::timer::tick("Gint_interface", "cal_gint_force");
6164

@@ -74,7 +77,7 @@ void Gint::gamma_gpu_force_interface(Gint_inout* inout) {
7477
if (isforce || isstress) {
7578
std::vector<double> force(nat * 3, 0.0);
7679
std::vector<double> stress(6, 0.0);
77-
GintKernel::gint_fvl_gamma_gpu(this->DMRGint[inout->ispin],
80+
GintKernel::gint_fvl_gpu(this->DMRGint[inout->ispin],
7881
inout->vl,
7982
force.data(),
8083
stress.data(),

source/module_hamilt_lcao/module_gint/gint_rho_gpu.cu

+7-29
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
namespace GintKernel
1010
{
1111

12-
void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
12+
void gint_rho_gpu(const hamilt::HContainer<double>* dm,
1313
const double* ylmcoef_now,
1414
const double dr,
1515
const double* rcut,
@@ -60,35 +60,12 @@ void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
6060
Cuda_Mem_Wrapper<double> rho_g(num_mcell_on_proc, 1, false);
6161
Cuda_Mem_Wrapper<double*> dot_product(nbzp * gridt.bxyz, num_streams, true);
6262

63-
Cuda_Mem_Wrapper<double> dm_matrix(lgd * lgd, 1, true);
63+
Cuda_Mem_Wrapper<double> dm_matrix(dm->get_nnr(), 1, false);
6464
// retrieve the density matrix on the host
65-
for (int iat1 = 0; iat1 < ucell.nat; iat1++)
66-
{
67-
for (int iat2 = 0; iat2 < ucell.nat; iat2++)
68-
{
69-
const int it1 = ucell.iat2it[iat1];
70-
const int it2 = ucell.iat2it[iat2];
71-
const int lo1 = gridt.trace_lo[ucell.itiaiw2iwt(it1, ucell.iat2ia[iat1], 0)];
72-
const int lo2 = gridt.trace_lo[ucell.itiaiw2iwt(it2, ucell.iat2ia[iat2], 0)];
73-
74-
hamilt::AtomPair<double>* tmp_ap = dm->find_pair(iat1, iat2);
75-
int orb_index = 0;
76-
if (tmp_ap == NULL)
77-
{
78-
continue;
79-
}
80-
for (int orb_i = 0; orb_i < tmp_ap->get_row_size(); orb_i++)
81-
{
82-
for (int orb_j = 0; orb_j < tmp_ap->get_col_size(); orb_j++)
83-
{
84-
dm_matrix.get_host_pointer()[(lo1 + orb_i) * lgd + (lo2 + orb_j)]
85-
= tmp_ap->get_pointer(0)[orb_index];
86-
orb_index++;
87-
}
88-
}
89-
}
90-
}
91-
dm_matrix.copy_host_to_device_sync();
65+
checkCuda(cudaMemcpy(dm_matrix.get_device_pointer(),
66+
dm->get_wrapper(),
67+
dm->get_nnr() * sizeof(double),
68+
cudaMemcpyHostToDevice));
9269

9370
// calculate the rho for every nbzp bigcells
9471
#pragma omp parallel for num_threads(num_streams) collapse(2)
@@ -120,6 +97,7 @@ void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
12097
atoms_per_z);
12198

12299
alloc_mult_dot_rho(
100+
dm,
123101
gridt,
124102
ucell,
125103
grid_index_ij,

source/module_hamilt_lcao/module_gint/gint_rho_gpu.h

+3-32
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ namespace GintKernel
2121
* @param ucell UnitCell.
2222
* @param rho rho.
2323
*/
24-
void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
24+
void gint_rho_gpu(const hamilt::HContainer<double>* dm,
2525
const double* ylmcoef_now,
2626
const double dr,
2727
const double* rcut,
@@ -37,37 +37,8 @@ void gtask_rho(const Grid_Technique& gridt,
3737
int* atoms_num_info,
3838
int& atoms_per_z);
3939

40-
/**
41-
* Allocate resources and perform matrix multiplication and vector dot products
42-
* for computing the rho.
43-
*
44-
* @param gridt Grid_Technique object containing grid information.
45-
* @param ucell UnitCell object containing unit cell information.
46-
* @param gpu_mat_cal_flag Flags indicating which parts of the calculation will use the GPU.
47-
* @param grid_index_ij Combined X and Y indices of the bigcell.
48-
* @param max_size Maximum number of atoms in a meshcell.
49-
* @param lgd lgd.
50-
* @param nczp Number of meshcells along the z-axis on this processor.
51-
* @param psir_ylm_g One-dimensional array storing psir.
52-
* @param psir_dm_g One-dimensional array storing psir_dm.
53-
* @param dm_matrix_g One-dimensional array storing mat_dm.
54-
* @param mat_alpha Alpha values for matrix multiplication.
55-
* @param mat_m Number of rows in mat_dm.
56-
* @param mat_n Number of columns in mat_psir.
57-
* @param mat_k Number of columns in mat_dm, which equals the number of rows in mat_psir.
58-
* @param mat_lda Leading dimension of mat_dm.
59-
* @param mat_ldb Leading dimension of mat_psir.
60-
* @param mat_ldc Leading dimension of mat_psir_dm.
61-
* @param mat_A Pointers to mat_dm.
62-
* @param mat_B Pointers to mat_psir.
63-
* @param mat_C Pointers to mat_psir_dm.
64-
* @param max_m Maximum value of m.
65-
* @param max_n Maximum value of n.
66-
* @param atom_pair_num Total count of atom pairs, which is also the number of matrix multiplication operations.
67-
* @param rho_g Rho.
68-
* @param dot_product Pointers to the results of dot products.
69-
*/
70-
void alloc_mult_dot_rho(const Grid_Technique& gridt,
40+
void alloc_mult_dot_rho(const hamilt::HContainer<double>* dm,
41+
const Grid_Technique& gridt,
7142
const UnitCell& ucell,
7243
const int grid_index_ij,
7344
const int max_atom,

source/module_hamilt_lcao/module_gint/gint_tools.cpp

-7
Original file line numberDiff line numberDiff line change
@@ -256,11 +256,4 @@ void cal_dpsirr_ylm(
256256
return psir_vlbr3;
257257
}
258258

259-
260-
// calculating (psi_DMR)_mu = sum_nu DMR_mu,nu psi_nu
261-
// note : there is a difference between rho and force
262-
// in calculating rho, due to symmetry, the summation over mu,nu
263-
// can be done as sum_mu,mu + 2 sum_mu<nu, saving some time
264-
// but for force, we cannot exchange the index
265-
266259
} // namespace Gint_Tools

0 commit comments

Comments
 (0)