[Bug] PiecewisePolynomialKernel fails to put all tensors on the same GPU device #2199
Comments
Taking a glance at PiecewisePolynomialKernel, I'm pretty sure the issue is here:
We are instantiating a torch.tensor(0.0) without setting it to the same dtype or device as r. I think a fix should be as simple as using torch.tensor(0.0, dtype=r.dtype, device=r.device). Can you try making that change to the source code and, if it works, submit a bug fix PR? |
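For reference, here is a minimal sketch of the suggested change; the helper name fmax and its body are an approximation of the kernel's clamping step, not the exact source.
Code sketch
import torch

def fmax(r, j, q):
    # Before (approximate): torch.tensor(0.0) is created on the CPU with the default
    # dtype, which can mismatch r when r lives on a GPU or uses float64.
    # return torch.max(torch.tensor(0.0), 1 - r).pow(j + q)
    # Suggested fix: create the constant with r's dtype and device.
    return torch.max(torch.tensor(0.0, dtype=r.dtype, device=r.device), 1 - r).pow(j + q)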
Thank you for the answer, but unfortunately this fix doesn't work. According to the error message, the device mismatch happens during backpropagation rather than during the forward pass. Strangely, the inputs and outputs of the model, as well as the loss, are all on the correct device, as can be seen in the logs.
Code snippet
import torch
import gpytorch
import os
import numpy as np
import urllib.request
from scipy.io import loadmat

dataset = 'protein'
if not os.path.isfile(f'../../datasets/UCI/{dataset}.mat'):
    print(f'Downloading \'{dataset}\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1nRb8e7qooozXkNghC5eQS0JeywSXGX2S',
                               f'../../datasets/UCI/{dataset}.mat')

data = torch.Tensor(loadmat(f'../../datasets/UCI/{dataset}.mat')['data'])
n_train = 4000
train_x, train_y = data[:n_train, :-1], data[:n_train, -1]

output_device = torch.device('cuda:0')
train_x, train_y = train_x.contiguous().to(output_device), train_y.contiguous().to(output_device)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.PiecewisePolynomialKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        print(f"mean_x.device: {mean_x.device} - {mean_x.size()}")
        print(f"covar_x.device: {covar_x.device} - {covar_x.size()}")
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
model = ExactGPModel(train_x, train_y, likelihood).to(output_device)

model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

print(f"train device: x: {train_x.device}, y: {train_y.device}")
optimizer.zero_grad()
output = model(train_x)
loss = -mll(output, train_y)
print(f"loss.device: {loss.device}")
loss.backward()
optimizer.step()
Log output
Error message
|
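One way to narrow this down (a hypothetical check, not part of the original report) is to print the device of every parameter and buffer before calling loss.backward(), since the backward pass touches the kernel's learned parameters and any constants captured in the autograd graph.
Code sketch
# Assumes `model` from the snippet above; run this just before loss.backward().
for name, param in model.named_parameters():
    print(f"param {name}: {param.device}")
for name, buf in model.named_buffers():
    print(f"buffer {name}: {buf.device}")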
Hmm, I can take a look later. Mind renaming the issue to reflect that this isn't a MultiDeviceKernel issue? |
Sure, I have renamed this issue to be related to PiecewisePolynomialKernel. |
🐛 Bug
I was experimenting with the Exact GP with multiple GPUs tutorial here. However, when the base kernel was changed from the RBF kernel to the piecewise polynomial kernel, an error showed up saying that the tensors are not on the same device.
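For illustration, a minimal sketch of the kernel swap described above, assuming n_devices and output_device are set as in the multi-GPU tutorial.
Code sketch
base_covar_module = gpytorch.kernels.ScaleKernel(
    gpytorch.kernels.PiecewisePolynomialKernel()  # was: gpytorch.kernels.RBFKernel()
)
covar_module = gpytorch.kernels.MultiDeviceKernel(
    base_covar_module, device_ids=range(n_devices), output_device=output_device
)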
To reproduce
Code snippet to reproduce
Stack trace/error message
System information
Please complete the following information:
Additional context
I further experimented with the training size; a similar issue showed up with n_train = 100 when using the RBF kernel. Please see the error message below.