4 | 4 | "cell_type": "markdown",
5 | 5 | "metadata": {},
6 | 6 | "source": [
7 |   | - "# Exact GP Regression with Multiple GPUs and Kernel Partitioning\n",
  | 7 | + "# Exact GP Regression with Multiple GPUs\n",
8 | 8 | "## Introduction\n",
9 |   | - "In this notebook, we'll demonstrate training exact GPs on large datasets using two key features from the paper https://arxiv.org/abs/1903.08114: \n",
  | 9 | + "In this notebook, we'll demonstrate training exact GPs on large datasets by distributing the kernel matrix across multiple GPUs for additional parallelism.\n",
10 | 10 | "\n",
11 |    | - "1. The ability to distribute the kernel matrix across multiple GPUs, for additional parallelism.\n",
12 |    | - "2. Partitioning the kernel into chunks computed on-the-fly when performing each MVM to reduce memory usage.\n",
   | 11 | + "**NOTE**: Kernel partitioning (another memory-saving mechanism introduced in https://arxiv.org/abs/1903.08114) is no longer supported for multiple GPUs. If your kernel matrix is too big to fit on your available GPUs, please use the [GPyTorch KeOps integration](./KeOps_GP_Regression.ipynb) for kernel partitioning.\n",
13 | 12 | "\n",
14 |    | - "We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a significant effect on training time."
   | 13 | + "We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: the number of GPUs has a significant effect on training time."
15 | 14 | ]
16 | 15 | },
17 | 16 | {

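For context on what "distributing the kernel matrix across multiple GPUs" looks like in code: the model definition itself sits in cells this diff leaves untouched. The sketch below, assuming the notebook's usual names (`n_devices`, an RBF base kernel), shows the standard GPyTorch pattern of wrapping a base kernel in `MultiDeviceKernel`; treat it as illustrative rather than the diffed source.

```python
import torch
import gpytorch

# Sketch: MultiDeviceKernel splits each kernel matrix-vector multiply
# across the listed GPUs and gathers the result on output_device.
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module,
            device_ids=range(n_devices),
            output_device=torch.device('cuda:0'),
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```
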
164 | 163 | " train_y,\n",
165 | 164 | " n_devices,\n",
166 | 165 | " output_device,\n",
167 |     | - " checkpoint_size,\n",
168 | 166 | " preconditioner_size,\n",
169 | 167 | " n_training_iter,\n",
170 | 168 | "):\n",

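The `train` signature above takes `n_devices` and `output_device`, which are defined outside the diffed hunks. A plausible setup, as a sketch (not the diffed source):

```python
import torch

# Sketch: use every visible GPU and gather results on the first one.
n_devices = torch.cuda.device_count()
output_device = torch.device('cuda:0')
print(f"Planning to run on {n_devices} GPUs.")
```
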
178 | 176 | " mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)\n",
179 | 177 | "\n",
180 | 178 | " \n",
181 |     | - " with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \\\n",
182 |     | - " gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
    | 179 | + " with gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
183 | 180 | "\n",
184 | 181 | " def closure():\n",
185 | 182 | " optimizer.zero_grad()\n",

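The rest of `train` is unchanged and elided from this diff. To make the `closure` fragment above readable in isolation, here is a hedged sketch of a closure-driven optimization step under the `max_preconditioner_size` context; the notebook's actual optimizer may differ, and `torch.optim.LBFGS` is used purely as a stand-in:

```python
# Sketch: L-BFGS re-evaluates the closure during its line search, so the
# loss computation and backward pass live inside the closure.
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)

with gpytorch.settings.max_preconditioner_size(preconditioner_size):

    def closure():
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)  # negative marginal log likelihood
        loss.backward()
        return loss

    for i in range(n_training_iter):
        loss = optimizer.step(closure)
        print(f"Iter {i + 1}/{n_training_iter} - Loss: {loss.item():.3f}")
```
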
208 | 205 | " return model, likelihood"
209 | 206 | ]
210 | 207 | },
211 |     | - {
212 |     | - "cell_type": "markdown",
213 |     | - "metadata": {},
214 |     | - "source": [
215 |     | - "## Automatically determining GPU Settings\n",
216 |     | - "\n",
217 |     | - "In the next cell, we automatically determine a roughly reasonable partition or *checkpoint* size that will allow us to train without using more memory than the GPUs available have. Not that this is a coarse estimate of the largest possible checkpoint size, and may be off by as much as a factor of 2. A smarter search here could make up to a 2x performance improvement."
218 |     | - ]
219 |     | - },
220 |     | - {
221 |     | - "cell_type": "code",
222 |     | - "execution_count": 6,
223 |     | - "metadata": {},
224 |     | - "outputs": [
225 |     | - {
226 |     | - "name": "stdout",
227 |     | - "output_type": "stream",
228 |     | - "text": [
229 |     | - "Number of devices: 2 -- Kernel partition size: 0\n",
230 |     | - "RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 1; 10.73 GiB total capacity; 7.48 GiB already allocated; 2.46 GiB free; 21.49 MiB cached)\n",
231 |     | - "Number of devices: 2 -- Kernel partition size: 18292\n",
232 |     | - "RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 10.73 GiB total capacity; 6.37 GiB already allocated; 448.94 MiB free; 1.30 GiB cached)\n",
233 |     | - "Number of devices: 2 -- Kernel partition size: 9146\n",
234 |     | - "Iter 1/1 - Loss: 0.893 lengthscale: 0.486 noise: 0.248\n",
235 |     | - "Finished training on 36584 data points using 2 GPUs.\n"
236 |     | - ]
237 |     | - }
238 |     | - ],
239 |     | - "source": [
240 |     | - "import gc\n",
241 |     | - "\n",
242 |     | - "def find_best_gpu_setting(train_x,\n",
243 |     | - " train_y,\n",
244 |     | - " n_devices,\n",
245 |     | - " output_device,\n",
246 |     | - " preconditioner_size\n",
247 |     | - "):\n",
248 |     | - " N = train_x.size(0)\n",
249 |     | - " \n",
250 |     | - " # Find the optimum partition/checkpoint size by decreasing in powers of 2\n",
251 |     | - " # Start with no partitioning (size = 0)\n",
252 |     | - " settings = [0] + [int(n) for n in np.ceil(N / 2**np.arange(1, np.floor(np.log2(N))))]\n",
253 |     | - "\n",
254 |     | - " for checkpoint_size in settings:\n",
255 |     | - " print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))\n",
256 |     | - " try:\n",
257 |     | - " # Try a full forward and backward pass with this setting to check memory usage\n",
258 |     | - " _, _ = train(train_x, train_y,\n",
259 |     | - " n_devices=n_devices, output_device=output_device,\n",
260 |     | - " checkpoint_size=checkpoint_size,\n",
261 |     | - " preconditioner_size=preconditioner_size, n_training_iter=1)\n",
262 |     | - " \n",
263 |     | - " # when successful, break out of for-loop and jump to finally block\n",
264 |     | - " break\n",
265 |     | - " except RuntimeError as e:\n",
266 |     | - " print('RuntimeError: {}'.format(e))\n",
267 |     | - " except AttributeError as e:\n",
268 |     | - " print('AttributeError: {}'.format(e))\n",
269 |     | - " finally:\n",
270 |     | - " # handle CUDA OOM error\n",
271 |     | - " gc.collect()\n",
272 |     | - " torch.cuda.empty_cache()\n",
273 |     | - " return checkpoint_size\n",
274 |     | - "\n",
275 |     | - "# Set a large enough preconditioner size to reduce the number of CG iterations run\n",
276 |     | - "preconditioner_size = 100\n",
277 |     | - "checkpoint_size = find_best_gpu_setting(train_x, train_y,\n",
278 |     | - " n_devices=n_devices, \n",
279 |     | - " output_device=output_device,\n",
280 |     | - " preconditioner_size=preconditioner_size)"
281 |     | - ]
282 |     | - },
283 | 208 | {
284 | 209 | "cell_type": "markdown",
285 | 210 | "metadata": {},

309 | 234 | "source": [
310 | 235 | "model, likelihood = train(train_x, train_y,\n",
311 | 236 | " n_devices=n_devices, output_device=output_device,\n",
312 |     | - " checkpoint_size=10000,\n",
313 | 237 | " preconditioner_size=100,\n",
314 | 238 | " n_training_iter=20)"
315 | 239 | ]

331 | 255 | "model.eval()\n",
332 | 256 | "likelihood.eval()\n",
333 | 257 | "\n",
334 |     | - "with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
    | 258 | + "with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
335 | 259 | " # Make predictions on a small number of test points to get the test time caches computed\n",
336 | 260 | " latent_pred = model(test_x[:2, :])\n",
337 | 261 | " del latent_pred # We don't care about these predictions, we really just want the caches."

360 | 284 | }
361 | 285 | ],
362 | 286 | "source": [
363 |     | - "with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
    | 287 | + "with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
364 | 288 | " %time latent_pred = model(test_x)\n",
365 | 289 | " \n",
366 | 290 | "test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))\n",

385 | 309 | "metadata": {
386 | 310 | "anaconda-cloud": {},
387 | 311 | "kernelspec": {
388 |     | - "display_name": "Python 3",
    | 312 | + "display_name": "Python 3 (ipykernel)",
389 | 313 | "language": "python",
390 | 314 | "name": "python3"
391 | 315 | },

399 | 323 | "name": "python",
400 | 324 | "nbconvert_exporter": "python",
401 | 325 | "pygments_lexer": "ipython3",
402 |     | - "version": "3.7.1"
    | 326 | + "version": "3.8.0"
403 | 327 | }
404 | 328 | },
405 | 329 | "nbformat": 4,