4 | 4 | "cell_type": "markdown",
5 | 5 | "metadata": {},
6 | 6 | "source": [
7 |   | - "# Exact GP Regression with Multiple GPUs and Kernel Partitioning\n",
  | 7 | + "# Exact GP Regression with Multiple GPUs\n",
8 | 8 | "## Introduction\n",
9 |   | - "In this notebook, we'll demonstrate training exact GPs on large datasets using two key features from the paper https://arxiv.org/abs/1903.08114: \n",
  | 9 | + "In this notebook, we'll demonstrate training exact GPs on large datasets by distributing the kernel matrix across multiple GPUs for additional parallelism.\n",
10 | 10 | "\n",
11 |    | - "1. The ability to distribute the kernel matrix across multiple GPUs, for additional parallelism.\n",
12 |    | - "2. Partitioning the kernel into chunks computed on-the-fly when performing each MVM to reduce memory usage.\n",
   | 11 | + "**NOTE**: Kernel partitioning (another memory-saving mechanism introduced in https://arxiv.org/abs/1903.08114) is no longer supported for multiple GPUs. If your kernel matrix is too big to fit on your available GPUs, please use the [GPyTorch KeOps integration](./KeOps_GP_Regression.ipynb) for kernel partitioning.\n",
13 | 12 | "\n",
14 |    | - "We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a significant effect on training time."
   | 13 | + "We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: the number of GPUs has a significant effect on training time."
15 | 14 | ]
16 | 15 | },
17 | 16 | {

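For context on what "distributing the kernel matrix across multiple GPUs" looks like in code: the model definition itself sits in cells this diff leaves untouched. The sketch below, assuming the notebook's usual names (`n_devices`, an RBF base kernel), shows the standard GPyTorch pattern of wrapping a base kernel in `MultiDeviceKernel`; treat it as illustrative rather than the diffed source.

```python
import torch
import gpytorch

# Sketch: MultiDeviceKernel splits each kernel matrix-vector multiply
# across the listed GPUs and gathers the result on output_device.
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module,
            device_ids=range(n_devices),
            output_device=torch.device('cuda:0'),
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```
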
164 | 163 | " train_y,\n",
165 | 164 | " n_devices,\n",
166 | 165 | " output_device,\n",
167 |     | - " checkpoint_size,\n",
168 | 166 | " preconditioner_size,\n",
169 | 167 | " n_training_iter,\n",
170 | 168 | "):\n",

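The `train` signature above takes `n_devices` and `output_device`, which are defined outside the diffed hunks. A plausible setup, as a sketch (not the diffed source):

```python
import torch

# Sketch: use every visible GPU and gather results on the first one.
n_devices = torch.cuda.device_count()
output_device = torch.device('cuda:0')
print(f"Planning to run on {n_devices} GPUs.")
```
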
178 | 176 | " mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)\n",
179 | 177 | "\n",
180 | 178 | " \n",
181 |     | - " with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \\\n",
182 |     | - " gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
    | 179 | + " with gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
183 | 180 | "\n",
184 | 181 | " def closure():\n",
185 | 182 | " optimizer.zero_grad()\n",

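The rest of `train` is unchanged and elided from this diff. To make the `closure` fragment above readable in isolation, here is a hedged sketch of a closure-driven optimization step under the `max_preconditioner_size` context; the notebook's actual optimizer may differ, and `torch.optim.LBFGS` is used purely as a stand-in:

```python
# Sketch: L-BFGS re-evaluates the closure during its line search, so the
# loss computation and backward pass live inside the closure.
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)

with gpytorch.settings.max_preconditioner_size(preconditioner_size):

    def closure():
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)  # negative marginal log likelihood
        loss.backward()
        return loss

    for i in range(n_training_iter):
        loss = optimizer.step(closure)
        print(f"Iter {i + 1}/{n_training_iter} - Loss: {loss.item():.3f}")
```
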
208 | 205 | " return model, likelihood"
209 | 206 | ]
210 | 207 | },
211 |     | - {
212 |     | - "cell_type": "markdown",
213 |     | - "metadata": {},
214 |     | - "source": [
215 |     | - "## Automatically determining GPU Settings\n",
216 |     | - "\n",
217 |     | - "In the next cell, we automatically determine a roughly reasonable partition or *checkpoint* size that will allow us to train without using more memory than the GPUs available have. Not that this is a coarse estimate of the largest possible checkpoint size, and may be off by as much as a factor of 2. A smarter search here could make up to a 2x performance improvement."
218 |     | - ]
219 |     | - },
220 |     | - {
221 |     | - "cell_type": "code",
222 |     | - "execution_count": 6,
223 |     | - "metadata": {},
224 |     | - "outputs": [
225 |     | - {
226 |     | - "name": "stdout",
227 |     | - "output_type": "stream",
228 |     | - "text": [
229 |     | - "Number of devices: 2 -- Kernel partition size: 0\n",
230 |     | - "RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 1; 10.73 GiB total capacity; 7.48 GiB already allocated; 2.46 GiB free; 21.49 MiB cached)\n",
231 |     | - "Number of devices: 2 -- Kernel partition size: 18292\n",
232 |     | - "RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 10.73 GiB total capacity; 6.37 GiB already allocated; 448.94 MiB free; 1.30 GiB cached)\n",
233 |     | - "Number of devices: 2 -- Kernel partition size: 9146\n",
234 |     | - "Iter 1/1 - Loss: 0.893 lengthscale: 0.486 noise: 0.248\n",
235 |     | - "Finished training on 36584 data points using 2 GPUs.\n"
236 |     | - ]
237 |     | - }
238 |     | - ],
239 |     | - "source": [
240 |     | - "import gc\n",
241 |     | - "\n",
242 |     | - "def find_best_gpu_setting(train_x,\n",
243 |     | - " train_y,\n",
244 |     | - " n_devices,\n",
245 |     | - " output_device,\n",
246 |     | - " preconditioner_size\n",
247 |     | - "):\n",
248 |     | - " N = train_x.size(0)\n",
249 |     | - " \n",
250 |     | - " # Find the optimum partition/checkpoint size by decreasing in powers of 2\n",
251 |     | - " # Start with no partitioning (size = 0)\n",
252 |     | - " settings = [0] + [int(n) for n in np.ceil(N / 2**np.arange(1, np.floor(np.log2(N))))]\n",
253 |     | - "\n",
254 |     | - " for checkpoint_size in settings:\n",
255 |     | - " print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))\n",
256 |     | - " try:\n",
257 |     | - " # Try a full forward and backward pass with this setting to check memory usage\n",
258 |     | - " _, _ = train(train_x, train_y,\n",
259 |     | - " n_devices=n_devices, output_device=output_device,\n",
260 |     | - " checkpoint_size=checkpoint_size,\n",
261 |     | - " preconditioner_size=preconditioner_size, n_training_iter=1)\n",
262 |     | - " \n",
263 |     | - " # when successful, break out of for-loop and jump to finally block\n",
264 |     | - " break\n",
265 |     | - " except RuntimeError as e:\n",
266 |     | - " print('RuntimeError: {}'.format(e))\n",
267 |     | - " except AttributeError as e:\n",
268 |     | - " print('AttributeError: {}'.format(e))\n",
269 |     | - " finally:\n",
270 |     | - " # handle CUDA OOM error\n",
271 |     | - " gc.collect()\n",
272 |     | - " torch.cuda.empty_cache()\n",
273 |     | - " return checkpoint_size\n",
274 |     | - "\n",
275 |     | - "# Set a large enough preconditioner size to reduce the number of CG iterations run\n",
276 |     | - "preconditioner_size = 100\n",
277 |     | - "checkpoint_size = find_best_gpu_setting(train_x, train_y,\n",
278 |     | - " n_devices=n_devices, \n",
279 |     | - " output_device=output_device,\n",
280 |     | - " preconditioner_size=preconditioner_size)"
281 |     | - ]
282 |     | - },
283 | 208 | {
284 | 209 | "cell_type": "markdown",
285 | 210 | "metadata": {},

309 | 234 | "source": [
310 | 235 | "model, likelihood = train(train_x, train_y,\n",
311 | 236 | " n_devices=n_devices, output_device=output_device,\n",
312 |     | - " checkpoint_size=10000,\n",
313 | 237 | " preconditioner_size=100,\n",
314 | 238 | " n_training_iter=20)"
315 | 239 | ]

331 | 255 | "model.eval()\n",
332 | 256 | "likelihood.eval()\n",
333 | 257 | "\n",
334 |     | - "with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
    | 258 | + "with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
335 | 259 | " # Make predictions on a small number of test points to get the test time caches computed\n",
336 | 260 | " latent_pred = model(test_x[:2, :])\n",
337 | 261 | " del latent_pred # We don't care about these predictions, we really just want the caches."

360 | 284 | }
361 | 285 | ],
362 | 286 | "source": [
363 |     | - "with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
    | 287 | + "with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
364 | 288 | " %time latent_pred = model(test_x)\n",
365 | 289 | " \n",
366 | 290 | "test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))\n",

385 | 309 | "metadata": {
386 | 310 | "anaconda-cloud": {},
387 | 311 | "kernelspec": {
388 |     | - "display_name": "Python 3",
    | 312 | + "display_name": "Python 3 (ipykernel)",
389 | 313 | "language": "python",
390 | 314 | "name": "python3"
391 | 315 | },

399 | 323 | "name": "python",
400 | 324 | "nbconvert_exporter": "python",
401 | 325 | "pygments_lexer": "ipython3",
402 |     | - "version": "3.7.1"
    | 326 | + "version": "3.8.0"
403 | 327 | }
404 | 328 | },
405 | 329 | "nbformat": 4,