
Commit f73fa7d

Deprecate checkpointing (#2361)
* Deprecate checkpointing
* Remove checkpointing from test
1 parent 1be177e commit f73fa7d

5 files changed: +23 -115 lines changed

examples/02_Scalable_Exact_GPs/Exact_GP_Posterior_Sampling_with_CIQ.ipynb

+1-1
@@ -8,7 +8,7 @@
 "\n",
 "This notebook demonstrates the most simple usage of contour integral quadrature with msMINRES as described [here](https://arxiv.org/pdf/2006.11267.pdf) to sample from the predictive distribution of an exact GP.\n",
 "\n",
-"Note that to achieve results where Cholesky would run the GPU out of memory, you'll either need to have KeOps installed (see our KeOps tutorial in this same folder), or use the `checkpoint_kernel` beta feature. Despite this, on this relatively simple example with 1000 training points but seeking to sample at 20000 test points in 1D, we will achieve significant speed ups over Cholesky."
+"Note that to achieve results where Cholesky would run the GPU out of memory, you'll need to have KeOps installed (see our KeOps tutorial in this same folder). Despite this, on this relatively simple example with 1000 training points but seeking to sample at 20000 test points in 1D, we will achieve significant speed ups over Cholesky."
 ]
 },
 {
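For orientation, the sampling pattern this notebook describes looks roughly like the following. This is a hedged sketch, not part of the commit: it assumes an already-trained `model` and test inputs `test_x`, and uses `gpytorch.settings.ciq_samples` to route posterior sampling through CIQ with msMINRES.

```python
import torch
import gpytorch

# Hypothetical trained `model` and `test_x` (not defined in this diff).
# ciq_samples(True) draws posterior samples via contour integral
# quadrature + msMINRES rather than a Cholesky root decomposition.
with torch.no_grad(), gpytorch.settings.ciq_samples(True), gpytorch.settings.fast_pred_var():
    pred = model(test_x)
    samples = pred.rsample(torch.Size([16]))  # 16 draws from the predictive
```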

examples/02_Scalable_Exact_GPs/Simple_MultiGPU_GP_Regression.ipynb

+9-85
@@ -4,14 +4,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Exact GP Regression with Multiple GPUs and Kernel Partitioning\n",
+"# Exact GP Regression with Multiple GPUs\n",
 "## Introduction\n",
-"In this notebook, we'll demonstrate training exact GPs on large datasets using two key features from the paper https://arxiv.org/abs/1903.08114: \n",
+"In this notebook, we'll demonstrate training exact GPs on large datasets by distributing the kernel matrix across multiple GPUs, for additional parallelism.\n",
 "\n",
-"1. The ability to distribute the kernel matrix across multiple GPUs, for additional parallelism.\n",
-"2. Partitioning the kernel into chunks computed on-the-fly when performing each MVM to reduce memory usage.\n",
+"**NOTE**: Kernel partitioning (another memory-saving mechanism introduced in https://arxiv.org/abs/1903.08114) is no longer supported for multiple GPUs. If your kernel matrix is too big to fit on your available GPUs, please use the [GPyTorch KeOps integration](./KeOps_GP_Regression.ipynb) for kernel partitioning.\n",
 "\n",
-"We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a significant effect on training time."
+"We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: the number of GPUs available has a significant effect on training time."
 ]
 },
 {
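Since the rewritten intro points readers to the KeOps integration, a minimal sketch of that route (not part of this commit, and it assumes pykeops is installed): KeOps-backed kernels mirror the standard kernel API while computing kernel MVMs without materializing the full kernel matrix in GPU memory.

```python
import gpytorch

# Sketch only (assumes pykeops is installed): a drop-in replacement for
# gpytorch.kernels.RBFKernel that never forms the full n-by-n matrix.
covar_module = gpytorch.kernels.ScaleKernel(
    gpytorch.kernels.keops.RBFKernel()
)
```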
@@ -164,7 +163,6 @@
 " train_y,\n",
 " n_devices,\n",
 " output_device,\n",
-" checkpoint_size,\n",
 " preconditioner_size,\n",
 " n_training_iter,\n",
 "):\n",
@@ -178,8 +176,7 @@
 " mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)\n",
 "\n",
 " \n",
-" with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \\\n",
-" gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
+" with gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
 "\n",
 " def closure():\n",
 " optimizer.zero_grad()\n",
@@ -208,78 +205,6 @@
 " return model, likelihood"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Automatically determining GPU Settings\n",
-"\n",
-"In the next cell, we automatically determine a roughly reasonable partition or *checkpoint* size that will allow us to train without using more memory than the GPUs available have. Not that this is a coarse estimate of the largest possible checkpoint size, and may be off by as much as a factor of 2. A smarter search here could make up to a 2x performance improvement."
-]
-},
-{
-"cell_type": "code",
-"execution_count": 6,
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Number of devices: 2 -- Kernel partition size: 0\n",
-"RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 1; 10.73 GiB total capacity; 7.48 GiB already allocated; 2.46 GiB free; 21.49 MiB cached)\n",
-"Number of devices: 2 -- Kernel partition size: 18292\n",
-"RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 10.73 GiB total capacity; 6.37 GiB already allocated; 448.94 MiB free; 1.30 GiB cached)\n",
-"Number of devices: 2 -- Kernel partition size: 9146\n",
-"Iter 1/1 - Loss: 0.893 lengthscale: 0.486 noise: 0.248\n",
-"Finished training on 36584 data points using 2 GPUs.\n"
-]
-}
-],
-"source": [
-"import gc\n",
-"\n",
-"def find_best_gpu_setting(train_x,\n",
-" train_y,\n",
-" n_devices,\n",
-" output_device,\n",
-" preconditioner_size\n",
-"):\n",
-" N = train_x.size(0)\n",
-" \n",
-" # Find the optimum partition/checkpoint size by decreasing in powers of 2\n",
-" # Start with no partitioning (size = 0)\n",
-" settings = [0] + [int(n) for n in np.ceil(N / 2**np.arange(1, np.floor(np.log2(N))))]\n",
-"\n",
-" for checkpoint_size in settings:\n",
-" print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))\n",
-" try:\n",
-" # Try a full forward and backward pass with this setting to check memory usage\n",
-" _, _ = train(train_x, train_y,\n",
-" n_devices=n_devices, output_device=output_device,\n",
-" checkpoint_size=checkpoint_size,\n",
-" preconditioner_size=preconditioner_size, n_training_iter=1)\n",
-" \n",
-" # when successful, break out of for-loop and jump to finally block\n",
-" break\n",
-" except RuntimeError as e:\n",
-" print('RuntimeError: {}'.format(e))\n",
-" except AttributeError as e:\n",
-" print('AttributeError: {}'.format(e))\n",
-" finally:\n",
-" # handle CUDA OOM error\n",
-" gc.collect()\n",
-" torch.cuda.empty_cache()\n",
-" return checkpoint_size\n",
-"\n",
-"# Set a large enough preconditioner size to reduce the number of CG iterations run\n",
-"preconditioner_size = 100\n",
-"checkpoint_size = find_best_gpu_setting(train_x, train_y,\n",
-" n_devices=n_devices, \n",
-" output_device=output_device,\n",
-" preconditioner_size=preconditioner_size)"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -309,7 +234,6 @@
 "source": [
 "model, likelihood = train(train_x, train_y,\n",
 " n_devices=n_devices, output_device=output_device,\n",
-" checkpoint_size=10000,\n",
 " preconditioner_size=100,\n",
 " n_training_iter=20)"
 ]
@@ -331,7 +255,7 @@
 "model.eval()\n",
 "likelihood.eval()\n",
 "\n",
-"with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
+"with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
 " # Make predictions on a small number of test points to get the test time caches computed\n",
 " latent_pred = model(test_x[:2, :])\n",
 " del latent_pred # We don't care about these predictions, we really just want the caches."
@@ -360,7 +284,7 @@
 }
 ],
 "source": [
-"with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
+"with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
 " %time latent_pred = model(test_x)\n",
 " \n",
 "test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))\n",
@@ -385,7 +309,7 @@
 "metadata": {
 "anaconda-cloud": {},
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -399,7 +323,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.8.0"
 }
 },
 "nbformat": 4,

gpytorch/beta_features.py

+10

@@ -39,6 +39,16 @@ class checkpoint_kernel(_value_context):

     _global_value = 0

+    def __enter__(self, *args, **kwargs):
+        warnings.warn(
+            "The checkpointing feature is deprecated and will be removed in the next version. "
+            "If your data cannot fit on a single GPU, we recommend using the GPyTorch KeOps integration. "
+            "(The KeOps integration accomplishes the same thing that our checkpointing feature did, but better!) "
+            "See the KeOps example in the GPyTorch documentation at docs.gpytorch.ai",
+            DeprecationWarning,
+        )
+        return super().__enter__(*args, **kwargs)
+

 class default_preconditioner(_feature_flag):
     """
test/examples/test_simple_gp_regression.py

+3-6
@@ -216,7 +216,7 @@ def test_gp_posterior_single_training_point_smoke_test(self):

     def test_posterior_latent_gp_and_likelihood_with_optimization(self, cuda=False, checkpoint=0):
         train_x, test_x, train_y, test_y = self._get_data(
-            cuda=cuda, num_data=(1000 if checkpoint else 11), add_noise=bool(checkpoint)
+            cuda=cuda, num_data=(11), add_noise=bool(checkpoint)
         )
         # We're manually going to set the hyperparameters to something they shouldn't be
         likelihood = GaussianLikelihood(noise_prior=SmoothedBoxPrior(exp(-3), exp(3), sigma=0.1))
@@ -234,8 +234,8 @@ def test_posterior_latent_gp_and_likelihood_with_optimization(self, cuda=False,
         gp_model.train()
         likelihood.train()
         optimizer = optim.Adam(gp_model.parameters(), lr=0.15)
-        with gpytorch.beta_features.checkpoint_kernel(checkpoint), gpytorch.settings.fast_pred_var():
-            for _ in range(20 if checkpoint else 50):
+        with gpytorch.settings.fast_pred_var():
+            for _ in range(50):
                 optimizer.zero_grad()
                 output = gp_model(train_x)
                 loss = -mll(output, train_y)
@@ -256,9 +256,6 @@ def test_posterior_latent_gp_and_likelihood_with_optimization(self, cuda=False,

         self.assertLess(mean_abs_error.item(), 0.05)

-    def test_gp_with_checkpointing(self, cuda=False):
-        return self.test_posterior_latent_gp_and_likelihood_with_optimization(cuda=cuda, checkpoint=250)
-
     def test_fantasy_updates_cuda(self):
         if torch.cuda.is_available():
             with least_used_cuda_device():

test/lazy/test_lazy_evaluated_kernel_tensor.py

-23
@@ -112,29 +112,6 @@ def _test_inv_matmul(self, rhs, lhs=None, cholesky=False):
         else:
             self.assertFalse(linear_cg_mock.called)

-    def test_inv_matmul_matrix_with_checkpointing(self):
-        # Add one checkpointing test
-        lazy_tensor = self.create_linear_op().requires_grad_(True)
-        lazy_tensor_copy = lazy_tensor.clone().detach_().requires_grad_(True)
-        evaluated = self.evaluate_linear_op(lazy_tensor_copy)
-
-        test_vector = torch.randn(2, 5, 6)
-        test_vector_copy = test_vector.clone()
-        with gpytorch.beta_features.checkpoint_kernel(2):
-            res = lazy_tensor.solve(test_vector)
-        actual = evaluated.inverse().matmul(test_vector_copy)
-        self.assertLess(((res - actual).abs() / actual.abs().clamp(1, 1e5)).max().item(), 3e-1)
-
-        grad = torch.randn_like(res)
-        res.backward(gradient=grad)
-        actual.backward(gradient=grad)
-
-        for param, param_copy in zip(lazy_tensor.kernel.parameters(), lazy_tensor_copy.kernel.parameters()):
-            self.assertAllClose(param.grad, param_copy.grad, rtol=1e-3)
-        self.assertAllClose(
-            lazy_tensor.x1.grad + lazy_tensor.x2.grad, lazy_tensor_copy.x1.grad + lazy_tensor_copy.x2.grad, rtol=1e-3
-        )
-
     def test_batch_getitem(self):
         """Indexing was wrong when the kernel had more batch dimensions than the
         data"""
