sunliang98
diff --git a/‎docs/advanced/acceleration/cuda.md
+5-8 b/‎docs/advanced/acceleration/cuda.md
+5-8
diff --git a/‎docs/advanced/input_files/input-main.md
+1-1 b/‎docs/advanced/input_files/input-main.md
+1-1
diff --git a/‎source/module_base/module_device/device.cpp
+1-6 b/‎source/module_base/module_device/device.cpp
+1-6
diff --git a/‎source/module_base/module_device/device.h
+1-2 b/‎source/module_base/module_device/device.h
+1-2
diff --git a/‎source/module_hamilt_lcao/module_gint/gint.cpp
+4-4 b/‎source/module_hamilt_lcao/module_gint/gint.cpp
+4-4
diff --git a/‎source/module_hamilt_lcao/module_gint/gint.h
+3-3 b/‎source/module_hamilt_lcao/module_gint/gint.h
+3-3
diff --git a/‎source/module_hamilt_lcao/module_gint/gint_force_gpu.cu
+9-32 b/‎source/module_hamilt_lcao/module_gint/gint_force_gpu.cu
+9-32
diff --git a/‎source/module_hamilt_lcao/module_gint/gint_force_gpu.h
+3-2 b/‎source/module_hamilt_lcao/module_gint/gint_force_gpu.h
+3-2
diff --git a/‎source/module_hamilt_lcao/module_gint/gint_gpu_interface.cpp
+15-12 b/‎source/module_hamilt_lcao/module_gint/gint_gpu_interface.cpp
+15-12
diff --git a/‎source/module_hamilt_lcao/module_gint/gint_rho_gpu.cu
+7-29 b/‎source/module_hamilt_lcao/module_gint/gint_rho_gpu.cu
+7-29
diff --git a/‎source/module_hamilt_lcao/module_gint/gint_rho_gpu.h
+3-32 b/‎source/module_hamilt_lcao/module_gint/gint_rho_gpu.h
+3-32
diff --git a/‎source/module_hamilt_lcao/module_gint/gint_tools.cpp
-7 b/‎source/module_hamilt_lcao/module_gint/gint_tools.cpp
-7
@@ -6,13 +6,13 @@ In ABACUS, we provide the option to use GPU devices to accelerate performance. T
 
 - **Electronic state data**: (e.g. electronic density) are moved from the GPU to the CPU(s) every scf step.
 
-- **Acclerated by the NVIDIA libraries**: `cuBLAS` for common linear algebra calculations, `cuSolver` for eigen values/vectors, and `cuFFT` for the conversions between the real and recip spaces.
+- **Accelerated by the NVIDIA libraries**: `cuBLAS` for common linear algebra calculations, `cuSolver` for eigen values/vectors, and `cuFFT` for the conversions between the real and recip spaces.
 
 - **Multi GPU supprted**: Using multiple MPI tasks will often give the best performance. Note each MPI task will be bind to a GPU device with automatically computing load balancing.
 
 - **Parallel strategy**: K point parallel.
 
-Unlike PW basis, only the grid integration module (module_gint) and the diagonalization of the Hamiltonian matrix (module_hsolver) have been implemented with GPU acceleration under LCAO basis, and the acceleration is limited to gamma only calculation. Additionally, LCAO basis does not support multi-GPU acceleration. Both the grid integration module and the Hamiltonian matrix solver only support acceleration on a single GPU.
+Unlike PW basis, only the grid integration module (module_gint) and the diagonalization of the Hamiltonian matrix (module_hsolver) have been implemented with GPU acceleration under LCAO basis.
 
 ## Required hardware/software
 
@@ -31,17 +31,14 @@ Check the [Advanced Installation Options](https://abacus-rtd.readthedocs.io/en/l
 
 ## Run with the GPU support by editing the INPUT script:
 
-In `INPUT` file we need to set the value keyword [device](../input_files/input-main.md#device) to be `gpu`.
+In `INPUT` file we need to set the input parameter [device](../input_files/input-main.md#device) to `gpu`. If this parameter is not set, ABACUS will try to determine if there are available GPUs.
+- Set `ks_solver`: For the PW basis, CG, BPCG and Davidson methods are supported on GPU; set the input parameter [ks_solver](../input_files/input-main.md#ks_solver) to `cg`, `bpcg` or `dav`. For the LCAO basis, `cusolver` is supported on GPU.
+- **multi-card**: ABACUS allows for multi-GPU acceleration. If you have multiple GPU cards, you can run ABACUS with several MPI processes, and each process will utilize one GPU card. For example, the command `mpirun -n 2 abacus` will by default launch two GPUs for computation. If you only have one card, this command will only start one GPU.
 
 ## Examples
 We provides [examples](https://github.com/deepmodeling/abacus-develop/tree/develop/examples/gpu) of gpu calculations.
 
 ## Known limitations
 PW basis:
-- CG, BPCG and Davidson methods are supported, so the input keyword `ks_solver` can take the values `cg`, `bpcg` or `dav`.
 - Only k point parallelization is supported, so the input keyword `kpar` will be set to match the number of MPI tasks automatically.
 - By default, CUDA architectures 60, 70, 75, 80, 86, and 89 are compiled (if supported). It can be overriden using the CMake variable [`CMAKE_CUDA_ARCHITECTURES`](https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_ARCHITECTURES.html) or the environmental variable [`CUDAARCHS`](https://cmake.org/cmake/help/latest/envvar/CUDAARCHS.html).
-
-LCAO basis:
-- Does not support multi-k calculation, so if the input keyword `device` is set to `gpu`, the input keyword `gamma_only` can only take the value `1`.
-- Does not support multi-GPU acceleration.
@@ -629,7 +629,7 @@ If only one value is set (such as `kspacing 0.5`), then kspacing values of a/b/c
   - cpu: for CPUs via Intel, AMD, or Other supported CPU devices
   - gpu: for GPUs via CUDA or ROCm.
 
-  Known limitations: If using the pw basis, the ks_solver must be cg/bpcg/dav to support `gpu` acceleration. If using the lcao basis, `gamma_only` must be set to `1`, as multi-k calculation is currently not supported for `gpu`. lcao_in_pw also does not support `gpu`.
+  Known limitations: `ks_solver` must also be set to the algorithms supported. lcao_in_pw currently does not support `gpu`.
 
 - **Default**: cpu
 
 
@@ -149,8 +149,7 @@ int set_device_by_rank(const MPI_Comm mpi_comm) {
 
 std::string get_device_flag(const std::string &device,
                             const std::string &ks_solver,
-                            const std::string &basis_type,
-                            const bool &gamma_only) {
+                            const std::string &basis_type) {
 if (device == "cpu") {
   return "cpu"; // no extra checks required
 }
@@ -181,10 +180,6 @@ if (basis_type == "lcao_in_pw") {
   error_message +=
       "The GPU currently does not support the basis type \"lcao_in_pw\"!";
 }
-if (basis_type == "lcao" && gamma_only == false) {
-  error_message += "The GPU currently does not support the basis type "
-                    "\"lcao\" with \"gamma_only\" set to \"0\"!";
-}
 if(error_message.empty())
 {
   return "gpu"; // possibly automatically set to GPU
 
@@ -48,8 +48,7 @@ int get_device_kpar(const int& kpar);
  */
 std::string get_device_flag(const std::string& device,
                             const std::string& ks_solver,
-                            const std::string& basis_type,
-                            const bool& gamma_only);
+                            const std::string& basis_type);
 
 #if __MPI
 int get_node_rank();
 
@@ -45,16 +45,16 @@ void Gint::cal_gint(Gint_inout* inout) {
     }
     if (max_size > 0) {
 #ifdef __CUDA
-        if (GlobalV::device_flag == "gpu" && GlobalV::GAMMA_ONLY_LOCAL
+        if (GlobalV::device_flag == "gpu"
             && (inout->job == Gint_Tools::job_type::vlocal
                 || inout->job == Gint_Tools::job_type::rho
                 || inout->job == Gint_Tools::job_type::force)) {
             if (inout->job == Gint_Tools::job_type::vlocal) {
-                gamma_gpu_vlocal_interface(inout);
+                gpu_vlocal_interface(inout);
             } else if (inout->job == Gint_Tools::job_type::rho) {
-                gamma_gpu_rho_interface(inout);
+                gpu_rho_interface(inout);
             } else if (inout->job == Gint_Tools::job_type::force) {
-                gamma_gpu_force_interface(inout);
+                gpu_force_interface(inout);
             }
         } else
 #endif
 
@@ -71,11 +71,11 @@ class Gint {
     int ny, nplane, startz_current; // from rhopw
 
     // in cal_gint_gpu.cpp
-    void gamma_gpu_vlocal_interface(Gint_inout* inout);
+    void gpu_vlocal_interface(Gint_inout* inout);
 
-    void gamma_gpu_rho_interface(Gint_inout* inout);
+    void gpu_rho_interface(Gint_inout* inout);
 
-    void gamma_gpu_force_interface(Gint_inout* inout);
+    void gpu_force_interface(Gint_inout* inout);
 
     // in cal_gint_cpu.cpp
 
 
@@ -20,7 +20,7 @@ namespace GintKernel
  * 6. force dot on the GPU.
  * 7. Copy the results back to the host.
  */
-void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
+void gint_fvl_gpu(const hamilt::HContainer<double>* dm,
                         const double* vlocal,
                         double* force_in,
                         double* stress_in,
@@ -82,36 +82,12 @@ void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
     Cuda_Mem_Wrapper<double> force(3 * nat, num_streams, true);
     Cuda_Mem_Wrapper<double> stress(6, num_streams, true);
 
-    Cuda_Mem_Wrapper<double> dm_matrix(lgd * lgd, 1, true);
-    for (int iat1 = 0; iat1 < ucell.nat; iat1++)
-    {
-        for (int iat2 = 0; iat2 < ucell.nat; iat2++)
-        {
-            const int it1 = ucell.iat2it[iat1];
-            const int it2 = ucell.iat2it[iat2];
-            const int lo1
-                = gridt.trace_lo[ucell.itiaiw2iwt(it1, ucell.iat2ia[iat1], 0)];
-            const int lo2
-                = gridt.trace_lo[ucell.itiaiw2iwt(it2, ucell.iat2ia[iat2], 0)];
-
-            hamilt::AtomPair<double>* tmp_ap = dm->find_pair(iat1, iat2);
-            int orb_index = 0;
-            if (tmp_ap == NULL)
-            {
-                continue;
-            }
-            for (int orb_i = 0; orb_i < tmp_ap->get_row_size(); orb_i++)
-            {
-                for (int orb_j = 0; orb_j < tmp_ap->get_col_size(); orb_j++)
-                {
-                    dm_matrix.get_host_pointer()[(lo1 + orb_i) * lgd + (lo2 + orb_j)]
-                        = tmp_ap->get_pointer(0)[orb_index];
-                    orb_index++;
-                }
-            }
-        }
-    }
-    dm_matrix.copy_host_to_device_sync();
+    Cuda_Mem_Wrapper<double> dm_matrix(dm->get_nnr(), 1, false);
+    // retrieve the density matrix on the host
+    checkCuda(cudaMemcpy(dm_matrix.get_device_pointer(),
+                         dm->get_wrapper(),
+                         dm->get_nnr() * sizeof(double),
+                         cudaMemcpyHostToDevice));
 
     #pragma omp parallel for num_threads(num_streams) collapse(2)
     for (int i = 0; i < gridt.nbx; i++)
@@ -142,7 +118,8 @@ void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
                         dr_part.get_host_pointer(sid),
                         vldr3.get_host_pointer(sid));
 
-            alloc_mult_force(gridt,
+            alloc_mult_force(dm,
+                             gridt,
                              ucell, 
                              grid_index_ij,
                              max_atom,
 
@@ -5,7 +5,7 @@
 #include "module_hamilt_lcao/module_gint/grid_technique.h"
 namespace GintKernel
 {
-void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
+void gint_fvl_gpu(const hamilt::HContainer<double>* dm,
                         const double* vlocal,
                         double* force_in,
                         double* stress_in,
@@ -29,7 +29,8 @@ void gtask_force(const Grid_Technique& gridt,
                  double* dr_part,
                  double* vldr3);
 
-void alloc_mult_force(const Grid_Technique& gridt,
+void alloc_mult_force(const hamilt::HContainer<double>* dm,
+                      const Grid_Technique& gridt,
                       const UnitCell& ucell,
                       const int grid_index_ij,
                       const int max_atom,
 
@@ -5,7 +5,7 @@
 #include "module_base/memory.h"
 #include "module_base/timer.h"
 
-void Gint::gamma_gpu_vlocal_interface(Gint_inout* inout) {
+void Gint::gpu_vlocal_interface(Gint_inout* inout) {
     ModuleBase::TITLE("Gint_interface", "cal_gint_vlocal");
     ModuleBase::timer::tick("Gint_interface", "cal_gint_vlocal");
 
@@ -17,19 +17,22 @@ void Gint::gamma_gpu_vlocal_interface(Gint_inout* inout) {
         ylmcoef[i] = ModuleBase::Ylm::ylmcoef[i];
     }
 
-    GintKernel::gint_gamma_vl_gpu(this->hRGint,
-                                  inout->vl,
-                                  ylmcoef,
-                                  dr,
-                                  this->gridt->rcuts.data(),
-                                  *this->gridt,
-                                  ucell);
+    double* pvpR = GlobalV::GAMMA_ONLY_LOCAL ? nullptr : this->pvpR_reduced[inout->ispin];
+    GintKernel::gint_vl_gpu(this->hRGint,
+                            inout->vl,
+                            ylmcoef,
+                            dr,
+                            this->gridt->rcuts.data(),
+                            *this->gridt,
+                            ucell,
+                            pvpR,
+                            GlobalV::GAMMA_ONLY_LOCAL);
 
     ModuleBase::TITLE("Gint_interface", "cal_gint_vlocal");
     ModuleBase::timer::tick("Gint_interface", "cal_gint_vlocal");
 }
 
-void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
+void Gint::gpu_rho_interface(Gint_inout* inout) {
     ModuleBase::TITLE("Gint_interface", "cal_gint_rho");
     ModuleBase::timer::tick("Gint_interface", "cal_gint_rho");
 
@@ -43,7 +46,7 @@ void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
     int nrxx = this->gridt->ncx * this->gridt->ncy * this->nplane;
     for (int is = 0; is < GlobalV::NSPIN; ++is) {
         ModuleBase::GlobalFunc::ZEROS(inout->rho[is], nrxx);
-        GintKernel::gint_gamma_rho_gpu(this->DMRGint[is],
+        GintKernel::gint_rho_gpu(this->DMRGint[is],
                                        ylmcoef,
                                        dr,
                                        this->gridt->rcuts.data(),
@@ -55,7 +58,7 @@ void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
     ModuleBase::timer::tick("Gint_interface", "cal_gint_rho");
 }
 
-void Gint::gamma_gpu_force_interface(Gint_inout* inout) {
+void Gint::gpu_force_interface(Gint_inout* inout) {
     ModuleBase::TITLE("Gint_interface", "cal_gint_force");
     ModuleBase::timer::tick("Gint_interface", "cal_gint_force");
 
@@ -74,7 +77,7 @@ void Gint::gamma_gpu_force_interface(Gint_inout* inout) {
     if (isforce || isstress) {
         std::vector<double> force(nat * 3, 0.0);
         std::vector<double> stress(6, 0.0);
-        GintKernel::gint_fvl_gamma_gpu(this->DMRGint[inout->ispin],
+        GintKernel::gint_fvl_gpu(this->DMRGint[inout->ispin],
                                        inout->vl,
                                        force.data(),
                                        stress.data(),
 
@@ -9,7 +9,7 @@
 namespace GintKernel
 {
 
-void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
+void gint_rho_gpu(const hamilt::HContainer<double>* dm,
                         const double* ylmcoef_now,
                         const double dr,
                         const double* rcut,
@@ -60,35 +60,12 @@ void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
     Cuda_Mem_Wrapper<double> rho_g(num_mcell_on_proc, 1, false);
     Cuda_Mem_Wrapper<double*> dot_product(nbzp * gridt.bxyz, num_streams, true);
 
-    Cuda_Mem_Wrapper<double> dm_matrix(lgd * lgd, 1, true);
+    Cuda_Mem_Wrapper<double> dm_matrix(dm->get_nnr(), 1, false);
     // retrieve the density matrix on the host
-    for (int iat1 = 0; iat1 < ucell.nat; iat1++)
-    {
-        for (int iat2 = 0; iat2 < ucell.nat; iat2++)
-        {
-            const int it1 = ucell.iat2it[iat1];
-            const int it2 = ucell.iat2it[iat2];
-            const int lo1 = gridt.trace_lo[ucell.itiaiw2iwt(it1, ucell.iat2ia[iat1], 0)];
-            const int lo2 = gridt.trace_lo[ucell.itiaiw2iwt(it2, ucell.iat2ia[iat2], 0)];
-
-            hamilt::AtomPair<double>* tmp_ap = dm->find_pair(iat1, iat2);
-            int orb_index = 0;
-            if (tmp_ap == NULL)
-            {
-                continue;
-            }
-            for (int orb_i = 0; orb_i < tmp_ap->get_row_size(); orb_i++)
-            {
-                for (int orb_j = 0; orb_j < tmp_ap->get_col_size(); orb_j++)
-                {
-                    dm_matrix.get_host_pointer()[(lo1 + orb_i) * lgd + (lo2 + orb_j)]
-                        = tmp_ap->get_pointer(0)[orb_index];
-                    orb_index++;
-                }
-            }
-        }
-    }
-    dm_matrix.copy_host_to_device_sync();
+    checkCuda(cudaMemcpy(dm_matrix.get_device_pointer(),
+                         dm->get_wrapper(),
+                         dm->get_nnr() * sizeof(double),
+                         cudaMemcpyHostToDevice));
 
 // calculate the rho for every nbzp bigcells
 #pragma omp parallel for num_threads(num_streams) collapse(2)
@@ -120,6 +97,7 @@ void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
                       atoms_per_z);
 
             alloc_mult_dot_rho(
+                dm,
                 gridt,
                 ucell,
                 grid_index_ij,
 
@@ -21,7 +21,7 @@ namespace GintKernel
  * @param ucell UnitCell.
  * @param rho rho.
  */
-void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
+void gint_rho_gpu(const hamilt::HContainer<double>* dm,
                         const double* ylmcoef_now,
                         const double dr,
                         const double* rcut,
@@ -37,37 +37,8 @@ void gtask_rho(const Grid_Technique& gridt,
                int* atoms_num_info,
                int& atoms_per_z);
 
-/**
- * Allocate resources and perform matrix multiplication and vector dot products 
- * for computing the rho.
- *
- * @param gridt Grid_Technique object containing grid information.
- * @param ucell UnitCell object containing unit cell information.
- * @param gpu_mat_cal_flag Flags indicating which parts of the calculation will use the GPU.
- * @param grid_index_ij Combined X and Y indices of the bigcell.
- * @param max_size Maximum number of atoms in a meshcell.
- * @param lgd lgd.
- * @param nczp Number of meshcells along the z-axis on this processor.
- * @param psir_ylm_g One-dimensional array storing psir.
- * @param psir_dm_g One-dimensional array storing psir_dm.
- * @param dm_matrix_g One-dimensional array storing mat_dm.
- * @param mat_alpha Alpha values for matrix multiplication.
- * @param mat_m Number of rows in mat_dm.
- * @param mat_n Number of columns in mat_psir.
- * @param mat_k Number of columns in mat_dm, which equals the number of rows in mat_psir.
- * @param mat_lda Leading dimension of mat_dm.
- * @param mat_ldb Leading dimension of mat_psir.
- * @param mat_ldc Leading dimension of mat_psir_dm.
- * @param mat_A Pointers to mat_dm.
- * @param mat_B Pointers to mat_psir.
- * @param mat_C Pointers to mat_psir_dm.
- * @param max_m Maximum value of m.
- * @param max_n Maximum value of n.
- * @param atom_pair_num Total count of atom pairs, which is also the number of matrix multiplication operations.
- * @param rho_g Rho.
- * @param dot_product Pointers to the results of dot products.
- */
-void alloc_mult_dot_rho(const Grid_Technique& gridt,
+void alloc_mult_dot_rho(const hamilt::HContainer<double>* dm,
+                        const Grid_Technique& gridt,
                         const UnitCell& ucell,
                         const int grid_index_ij,
                         const int max_atom,
 
@@ -256,11 +256,4 @@ void cal_dpsirr_ylm(
 		return psir_vlbr3;
 	}
 
-
-// calculating (psi_DMR)_mu = sum_nu DMR_mu,nu psi_nu
-// note : there is a difference between rho and force
-// in calculating rho, due to symmetry, the summation over mu,nu
-// can be done as sum_mu,mu + 2 sum_mu<nu, saving some time
-// but for force, we cannot exchange the index
-
 } // namespace Gint_Tools
Original file line number	Diff line number	Diff line change
`@@ -149,8 +149,7 @@ int set_device_by_rank(const MPI_Comm mpi_comm) {`
`149`	`149`
`150`	`150`	`std::string get_device_flag(const std::string &device,`
`151`	`151`	`const std::string &ks_solver,`
`152`		`- const std::string &basis_type,`
`153`		`- const bool &gamma_only) {`
	`152`	`+ const std::string &basis_type) {`
`154`	`153`	`if (device == "cpu") {`
`155`	`154`	`return "cpu"; // no extra checks required`
`156`	`155`	`}`
`@@ -181,10 +180,6 @@ if (basis_type == "lcao_in_pw") {`
`181`	`180`	`error_message +=`
`182`	`181`	`"The GPU currently does not support the basis type \"lcao_in_pw\"!";`
`183`	`182`	`}`
`184`		`-if (basis_type == "lcao" && gamma_only == false) {`
`185`		`- error_message += "The GPU currently does not support the basis type "`
`186`		`- "\"lcao\" with \"gamma_only\" set to \"0\"!";`
`187`		`-}`
`188`	`183`	`if(error_message.empty())`
`189`	`184`	`{`
`190`	`185`	`return "gpu"; // possibly automatically set to GPU`