Memory usage of mbCG #1453
-
Hi everyone, I hope this is the right place to ask this question; if not, I can also open an issue. I have a question regarding the memory usage of mbCG, or maybe of GPyTorch in general.

In the appendix of the BBMM paper, Observation 1 states that CG has a space complexity of O(n). From my understanding this is not the case for mbCG, since the matrix-vector multiplication is replaced by a matrix-matrix multiplication. Is this true? If not, how can I make use of the memory efficiency? From some rudimentary tests I found that during a single computation of, say, the marginal likelihood, the kernel's forward method is called exactly once, which to my understanding hints at the whole covariance matrix being computed, and thus at a space complexity of O(n^2). Or is the column-wise access used in CG/MVM controlled by the lazy tensors?

I think you have created a masterful package, but trying to understand which parts of the paper are used at which point is often really hard, due to the very nested structure of the implementation. I'm looking forward to your answer!

Cheers and stay safe
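To illustrate what I mean, here is roughly how I probed the laziness of the kernel matrix -- a minimal sketch, assuming the GPyTorch 1.x API where calling a kernel returns a `LazyEvaluatedKernelTensor` and `.evaluate()` materializes it (newer versions renamed this to `.to_dense()`):

```python
import torch
import gpytorch

x = torch.randn(1000, 3)
kernel = gpytorch.kernels.RBFKernel()

# Calling the kernel returns a lazy wrapper, not a dense matrix:
K = kernel(x)
print(type(K))   # LazyEvaluatedKernelTensor -- no O(n^2) buffer allocated yet
print(K.shape)   # torch.Size([1000, 1000]) -- shape metadata only

# Only an explicit evaluation materializes the full n x n matrix:
K_dense = K.evaluate()  # this is where O(n^2) memory would actually be spent
```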
-
It's O(n) because the output of CG (a single vector or a small set of vectors) only needs that much space, unlike Cholesky, and you don't actually need to store the entire kernel matrix in memory at the same time to compute it -- you can compute MVMs in a map-reduce fashion.

The original GPyTorch paper exclusively dealt with the setting where we store the entire kernel matrix in memory to compute MVMs, but this paper directly centers around extending this to O(n) storage, and we have a few example notebooks that do this.
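As a rough illustration of the map-reduce idea, here is a minimal sketch in plain PyTorch (not GPyTorch's actual implementation -- the RBF kernel, `block_size`, and function name are made up for the example) that computes K @ v without ever holding the full n x n matrix:

```python
import torch

def blocked_rbf_mvm(x, v, lengthscale=1.0, block_size=256):
    # Computes (K @ v) for an RBF kernel K without materializing K.
    # "Map": build one block of rows of K at a time; "reduce": multiply it
    # into v and write the partial result. Peak memory is O(block_size * n).
    v = v.reshape(x.shape[0], -1)                    # (n, t)
    out = torch.empty(x.shape[0], v.shape[-1])       # (n, t)
    for start in range(0, x.shape[0], block_size):
        rows = x[start:start + block_size]           # (b, d)
        sq_dist = torch.cdist(rows, x).pow(2)        # (b, n) block of K's rows
        K_block = torch.exp(-0.5 * sq_dist / lengthscale ** 2)
        out[start:start + block_size] = K_block @ v  # partial MVM result
    return out

# Sanity check against the dense computation on a small problem:
x = torch.randn(500, 3)
v = torch.randn(500, 4)
K = torch.exp(-0.5 * torch.cdist(x, x).pow(2))
assert torch.allclose(blocked_rbf_mvm(x, v), K @ v, atol=1e-4)
```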
-
Hi everyone, I also had a look at the KeOps approach, but it appears I would have to rewrite my custom kernel to be able to use it. Would this help with the lack of GPU memory?

Cheers
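For context, this is the kind of swap I mean -- a minimal sketch, assuming the `gpytorch.kernels.keops` module, which as far as I can tell only ships KeOps versions of the built-in kernels, so my custom kernel has no such counterpart:

```python
import gpytorch
from gpytorch.kernels.keops import RBFKernel as KeOpsRBFKernel

# Stock kernel: the n x n matrix is (lazily) materialized for MVMs.
dense_kernel = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

# KeOps-backed drop-in (requires pykeops): kernel entries are recomputed
# on the fly during each MVM, so the full matrix never sits in GPU memory.
keops_kernel = gpytorch.kernels.ScaleKernel(KeOpsRBFKernel())
```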