Description
Hello,
I recently started using PEPSKit. Kudos to all of you for all this work. I run finite-temperature simulations and benchmark them against my own Python code (https://github.com/ogauthe/frostspin). I was happy to observe that the results are nearly the same between frostspin and PEPSKit. However, the performance was not as good as I was hoping for. I really think it can be improved, as I expect TensorKit to be much more efficient for the `permutedims` part.
Here I propose to give some benchmark points and see whether some bottlenecks in PEPSKit can be improved.
Setup: I use the simple update code to run a finite-temperature simulation of the Heisenberg model on the square lattice. I add an ancilla leg to my PEPS tensor. As #184 is still a work in progress, I merge the ancilla and physical legs together and define a Hamiltonian acting on both spaces to reuse PEPSKit code. I start from an infinite-temperature product state and use simple update for imaginary-time evolution. Once a given beta is reached, I use CTMRG to contract the tensor network and compute observables. I impose SU(2) symmetry. I can share the script if you want to reproduce.
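To make the merged-leg construction concrete, here is a minimal dense NumPy sketch. It is an illustration only: it ignores SU(2) symmetry, assumes spin-1/2 with a two-site Heisenberg term, and the helper names `extend_to_ancilla` and `infinite_temperature_site` are hypothetical, not PEPSKit or frostspin API.

```python
import numpy as np

d = 2  # spin-1/2 physical dimension (illustrative assumption)

# Spin-1/2 operators
sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]]) / 2

# Two-site Heisenberg term S1.S2 on physical legs: rows (p1, p2), columns (p1', p2')
h = sum(np.kron(s, s) for s in (sx, sy, sz)).real


def extend_to_ancilla(h, d):
    """Lift h to merged (physical, ancilla) legs, acting as identity on the ancillas."""
    h4 = h.reshape(d, d, d, d)  # legs (p1, p2, p1', p2')
    eye = np.eye(d)
    # output legs: (p1, a1, p2, a2, p1', a1', p2', a2')
    h_ext = np.einsum("ijkl,ab,ce->iajckble", h4, eye, eye)
    return h_ext.reshape(d**4, d**4)


def infinite_temperature_site(d):
    """Merged-leg vector sum_p |p>|p> / sqrt(d); tracing the ancilla gives identity / d."""
    return np.eye(d).reshape(d * d) / np.sqrt(d)


h_ext = extend_to_ancilla(h, d)
site = infinite_temperature_site(d)
psi = np.kron(site, site)  # two-site product state, merged legs (p1, a1, p2, a2)
energy = psi @ h_ext @ psi  # equals Tr(h) / d**2, which vanishes for a traceless h
print(f"infinite-temperature bond energy: {energy:.3e}")
```

The merged leg has dimension d**2 per site, so every operator on it is d**2 times larger in each index than its physical-only counterpart.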
Now there are of course several differences between the two codes. I think the most important ones are
- in frostspin I use an AB//BA unit cell, with only 2 sites.
- in the CTMRG, I cache enlarged corners to reuse them for the next iteration (PEPSKit simultaneous CTMRG looks a bit similar)
- I use a Lanczos based SVD and compute relatively few singular vectors
- when computing observables, I trace over the ancilla leg while contracting bra and ket tensors
- when computing observables, I compute reduced density matrices and normalize their trace (no norm computation)
- I avoid Zygote constraints on allocating memory (but I do not aggressively try to reuse it)
- due to different convergence criteria, PEPSKit does more iterations
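The two observable-related points above can be illustrated on a toy two-site purification. This is a dense NumPy sketch under simplifying assumptions (no environment tensors, no symmetry, spin-1/2), and the function names are hypothetical rather than taken from frostspin. The ancilla legs are summed while contracting bra and ket, and dividing by the trace replaces a separate norm computation.

```python
import numpy as np

d = 2  # spin-1/2 physical dimension (illustrative assumption)

sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]]) / 2
# Two-site Heisenberg term on physical legs only, shape (d*d, d*d)
h = sum(np.kron(s, s) for s in (sx, sy, sz)).real


def reduced_density_matrix(psi, d):
    """Trace out the ancillas of a two-site purification.

    psi has legs (p1, a1, p2, a2) and does not need to be normalized.
    Returns a unit-trace matrix on the physical legs, shape (d*d, d*d).
    """
    # ket legs (p1, a1, p2, a2), bra legs (q1, a1, q2, a2): ancillas are summed
    rho = np.einsum("iajb,kalb->ijkl", psi, psi.conj()).reshape(d * d, d * d)
    return rho / np.trace(rho)  # trace normalization, no separate <psi|psi>


def expectation(rho, op):
    """Expectation value Tr(rho op) for a unit-trace rho."""
    return np.trace(rho @ op).real


rng = np.random.default_rng(0)
psi = rng.normal(size=(d, d, d, d)) + 1j * rng.normal(size=(d, d, d, d))
rho = reduced_density_matrix(psi, d)
print("bond energy from normalized rdm:", expectation(rho, h))
```

Here the operator keeps its physical-only size d**2 x d**2, in contrast to an operator defined on merged physical-ancilla legs.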
Below are the logs for D=11, chi=121 to give precise values.
logs_profile_julia_1stnei_genx_CTM_SU2_D11_chi122_11b43a279c_c1.log
logs_profile_1stnei_genx_CTM_SU2_D11_chi121_rdm2nd_c7cb4eea37_c1.log
For D=11, chi=121, once compilation and unitary construction are done:
- 1 CTMRG step: frostspin takes ~9 sec for 4 sweeps (N-E-S-W) x 2 sites in the unit cell. PEPSKit takes ~42 sec, with a 4-site unit cell
- first neighbor bond energy: less than 1 sec on frostspin, 12 sec on PEPSKit
- second neighbor bond energy: 2 sec on frostspin, 25 sec on PEPSKit
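For a rough comparison, the CTMRG timings can be normalized by the number of sites in the unit cell. The sketch below only does that arithmetic on the numbers quoted above; it assumes the cost of one CTMRG step scales linearly with the unit cell size, which is an approximation and not something measured here.

```python
# Timings quoted above for D=11, chi=121, in seconds.
# Each entry: (time for one CTMRG step, number of sites in the unit cell)
ctmrg = {"frostspin": (9.0, 2), "PEPSKit": (42.0, 4)}


def per_site(time_s, n_sites):
    """Time per unit-cell site, assuming cost is linear in the unit cell size."""
    return time_s / n_sites


per_site_times = {name: per_site(t, n) for name, (t, n) in ctmrg.items()}
raw_ratio = ctmrg["PEPSKit"][0] / ctmrg["frostspin"][0]
site_ratio = per_site_times["PEPSKit"] / per_site_times["frostspin"]

# Second neighbor bond energy: 25 sec on PEPSKit, 2 sec on frostspin
second_neighbor_ratio = 25.0 / 2.0

print(f"CTMRG step, raw ratio:      {raw_ratio:.2f}")
print(f"CTMRG step, per-site ratio: {site_ratio:.2f}")
print(f"2nd neighbor energy ratio:  {second_neighbor_ratio:.2f}")
```

Under that assumption, the per-site gap in CTMRG is about half the raw one, while the observable timings differ by an order of magnitude.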
The CTMRG iteration part looks reasonable, but finer profiling is needed. In frostspin, the bottleneck is always `permute`. Contractions are non-negligible yet always cheaper. SVDs are usually negligible. As I expect TensorKit `permute` to be much faster, there may be a hidden bottleneck. On the other hand, observable computation is way slower than it has to be, although part of it is probably due to the merged physical-ancilla legs.
A few comments:
- frostspin uses nearly 3 times less memory than PEPSKit (memory maximum is reached in 2nd neighbor density matrix computation).
- I get the same results with U(1) symmetry, but the frostspin abelian code has poor performance, so those benchmarks are less relevant
- the same simple update code with the Trivial sector gives incorrect results on PEPSKit. Still investigating.
- I plan to also benchmark the J1-J2 model using "Add 3-site simple update (aka 3-site cluster update)" #171, but I have not been able to reproduce my previous results yet (this may be a PEPSKit bug, more probably just a script issue)
- I would be happy to provide more details on the algorithm or the benchmark
- My next step is to run a profiler on PEPSKit to get a better idea of where the time is spent.