TorchSnooper
Debug PyTorch code using PySnooper.
Do you want to look at the shape, dtype, and device of every step of your model, but are tired of manually writing print statements?
Are you bothered by errors like RuntimeError: Expected object of scalar type Double but got scalar type Float, and want to quickly figure out the problem?
TorchSnooper is a PySnooper extension that helps you debug these errors.
To use TorchSnooper, you just use it like you would use PySnooper: simply replace pysnooper.snoop with torchsnooper.snoop in your code.
To install:
pip install torchsnooper
TorchSnooper also supports snoop. To use TorchSnooper with snoop, simply execute
torchsnooper.register_snoop()
or
torchsnooper.register_snoop(verbose=True)
at the beginning, and then use snoop normally.
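For instance, after registering, functions can be traced with the snoop decorator as usual (a minimal sketch, assuming the snoop package is installed; myfunc here is just a made-up example):

import snoop
import torch
import torchsnooper

# Teach snoop how to format PyTorch tensors.
torchsnooper.register_snoop()

@snoop
def myfunc(x):
    # Tensors in the trace now show up as tensor<shape, dtype, device>.
    return x * 2

myfunc(torch.zeros(3))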
Example 1: Monitoring device and dtype
We are writing a simple function:
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y
and use it as follows:
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)
The code above seems correct, but unfortunately we get the following error:
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #2 'mask'
What went wrong? Let's snoop it! Decorate our function with torchsnooper.snoop():
import torch
import torchsnooper

@torchsnooper.snoop()
def myfunc(mask, x):
    y = torch.zeros(6)
    y.masked_scatter_(mask, x)
    return y

mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda')
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)
Running our script, we will see:
Starting var:.. mask = tensor<(6,), int64, cuda:0>
Starting var:.. x = tensor<(3,), float32, cuda:0>
21:41:42.941668 call 5 def myfunc(mask, x):
21:41:42.941834 line 6 y = torch.zeros(6)
New var:....... y = tensor<(6,), float32, cpu>
21:41:42.943443 line 7 y.masked_scatter_(mask, x)
21:41:42.944404 exception 7 y.masked_scatter_(mask, x)
Now pay attention to the device of the tensors; we notice
New var:....... y = tensor<(6,), float32, cpu>
Now it is clear that the problem is that y is a tensor on the CPU; that is, we forgot to specify the device in y = torch.zeros(6). Changing that line to
y = torch.zeros(6, device='cuda')
solves this problem.
But when we run the script again, we get another error:
RuntimeError: Expected object of scalar type Byte but got scalar type Long for argument #2 'mask'
Looking at the trace above again and paying attention to the dtype of the variables, we notice
Starting var:.. mask = tensor<(6,), int64, cuda:0>
OK, the problem is that we did not make the mask in the input a byte tensor. Changing the line to
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)
solves the problem.
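Putting the two fixes together, a working version of the example looks like this (a sketch, assuming a CUDA device is available):

import torch

def myfunc(mask, x):
    # Fix 1: create y on the same device as mask and x.
    y = torch.zeros(6, device='cuda')
    y.masked_scatter_(mask, x)
    return y

# Fix 2: make the mask a byte tensor instead of the default int64.
mask = torch.tensor([0, 1, 0, 1, 1, 0], device='cuda', dtype=torch.uint8)
source = torch.tensor([1.0, 2.0, 3.0], device='cuda')
y = myfunc(mask, source)

Note that recent PyTorch versions deprecate uint8 masks in favor of torch.bool, so on a newer install you may want dtype=torch.bool instead.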
Example 2: Monitoring shape
We are building a linear model:
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def forward(self, x):
        return self.layer(x)
We want to fit y = x1 + 2 * x2 + 3, so we create a dataset:
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([3.0, 5.0, 4.0, 6.0])
We train our model on this dataset using the SGD optimizer:
model = Model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(10):
    optimizer.zero_grad()
    pred = model(x)
    squared_diff = (y - pred) ** 2
    loss = squared_diff.mean()
    print(loss.item())
    loss.backward()
    optimizer.step()
But unfortunately, the loss does not drop to a low enough value.
What's wrong? Let's snoop it! Put the training loop inside snoop:
with torchsnooper.snoop():
    for _ in range(100):
        optimizer.zero_grad()
        pred = model(x)
        squared_diff = (y - pred) ** 2
        loss = squared_diff.mean()
        print(loss.item())
        loss.backward()
        optimizer.step()
Part of the trace looks like:
New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model( (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0 dampening: 0 lr: 0....omentum: 0 nesterov: False weight_decay: 0)
22:27:01.024233 line 21 for _ in range(100):
New var:....... _ = 0
22:27:01.024439 line 22 optimizer.zero_grad()
22:27:01.024574 line 23 pred = model(x)
New var:....... pred = tensor<(4, 1), float32, cpu, grad>
22:27:01.026442 line 24 squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4, 4), float32, cpu, grad>
22:27:01.027369 line 25 loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:27:01.027616 line 26 print(loss.item())
22:27:01.027793 line 27 loss.backward()
22:27:01.050189 line 28 optimizer.step()
We notice that y has shape (4,), while pred has shape (4, 1). As a result, due to broadcasting, squared_diff has shape (4, 4)!
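The broadcasting rule can be seen at work in isolation (an illustrative sketch, not part of the original trace):

import torch

y = torch.zeros(4)        # shape (4,)
pred = torch.zeros(4, 1)  # shape (4, 1)
# (4,) broadcasts against (4, 1): y is treated as (1, 4),
# and both operands expand to (4, 4).
print((y - pred).shape)   # torch.Size([4, 4])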
This is not the expected behavior. Let's fix it by changing the line to pred = model(x).squeeze(). Now everything looks good:
New var:....... x = tensor<(4, 2), float32, cpu>
New var:....... y = tensor<(4,), float32, cpu>
New var:....... model = Model( (layer): Linear(in_features=2, out_features=1, bias=True))
New var:....... optimizer = SGD (Parameter Group 0 dampening: 0 lr: 0....omentum: 0 nesterov: False weight_decay: 0)
22:28:19.778089 line 21 for _ in range(100):
New var:....... _ = 0
22:28:19.778293 line 22 optimizer.zero_grad()
22:28:19.778436 line 23 pred = model(x).squeeze()
New var:....... pred = tensor<(4,), float32, cpu, grad>
22:28:19.780250 line 24 squared_diff = (y - pred) ** 2
New var:....... squared_diff = tensor<(4,), float32, cpu, grad>
22:28:19.781099 line 25 loss = squared_diff.mean()
New var:....... loss = tensor<(), float32, cpu, grad>
22:28:19.781361 line 26 print(loss.item())
22:28:19.781537 line 27 loss.backward()
22:28:19.798983 line 28 optimizer.step()
And the final model converges to the desired values.
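As a quick sanity check (hypothetical code, not part of the original trace), the learned parameters should approximate the target function y = x1 + 2 * x2 + 3:

print(model.layer.weight)  # roughly tensor([[1., 2.]])
print(model.layer.bias)    # roughly tensor([3.])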
Example 3: Monitoring nan and inf
Say we have a model that outputs the likelihood of something. For this example, we will use a mock:
class MockModel(torch.nn.Module):
    def __init__(self):
        super(MockModel, self).__init__()
        self.unused = torch.nn.Linear(6, 4)

    def forward(self, x):
        return torch.tensor([0.0, 0.25, 0.9, 0.75]) + self.unused(x) * 0.0

model = MockModel()
During training, we want to minimize the negative log likelihood, so we have the code:
# optimizer is assumed to have been defined earlier,
# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    batch_input = torch.randn(6, 6)
    likelihood = model(batch_input)
    log_likelihood = likelihood.log()
    target = -log_likelihood.mean()
    print(target.item())
    optimizer.zero_grad()
    target.backward()
    optimizer.step()
Unfortunately, during training the target first becomes inf and then nan. What's wrong? Let's snoop it:
with torchsnooper.snoop():
    for epoch in range(100):
        batch_input = torch.randn(6, 6)
        likelihood = model(batch_input)
        log_likelihood = likelihood.log()
        target = -log_likelihood.mean()
        print(target.item())
        optimizer.zero_grad()
        target.backward()
        optimizer.step()
Part of the snoop output looks like:
19:30:20.928316 line 18 for epoch in range(100):
New var:....... epoch = 0
19:30:20.928575 line 19 batch_input = torch.randn(6, 6)
New var:....... batch_input = tensor<(6, 6), float32, cpu>
19:30:20.929671 line 20 likelihood = model(batch_input)
New var:....... likelihood = tensor<(6, 4), float32, cpu, grad>
19:30:20.930284 line 21 log_likelihood = likelihood.log()
New var:....... log_likelihood = tensor<(6, 4), float32, cpu, grad, has_inf>
19:30:20.930672 line 22 target = -log_likelihood.mean()
New var:....... target = tensor<(), float32, cpu, grad, has_inf>
19:30:20.931136 line 23 print(target.item())
19:30:20.931508 line 25 optimizer.zero_grad()
19:30:20.931871 line 26 target.backward()
inf
19:30:20.960028 line 27 optimizer.step()
19:30:20.960673 line 18 for epoch in range(100):
Modified var:.. epoch = 1
19:30:20.961043 line 19 batch_input = torch.randn(6, 6)
19:30:20.961423 line 20 likelihood = model(batch_input)
Modified var:.. likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.961910 line 21 log_likelihood = likelihood.log()
Modified var:.. log_likelihood = tensor<(6, 4), float32, cpu, grad, has_nan>
19:30:20.962302 line 22 target = -log_likelihood.mean()
Modified var:.. target = tensor<(), float32, cpu, grad, has_nan>
19:30:20.962715 line 23 print(target.item())
19:30:20.963089 line 25 optimizer.zero_grad()
19:30:20.963464 line 26 target.backward()
19:30:20.964051 line 27 optimizer.step()
Reading the output, we find that at the first epoch (epoch = 0), log_likelihood carries the has_inf flag. The has_inf flag means the tensor contains inf among its values. The same flag appears for target. At the second epoch, starting from likelihood, the tensors all carry a has_nan flag.
From our deep learning experience, we can guess that this is because the inf at the first epoch caused the gradients to be nan; when the parameters were updated, these nan values propagated into the parameters and caused all future steps to produce nan results.
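A minimal sketch (not from the original example) of how an inf loss poisons gradients through a zero-valued path, analogous to the self.unused(x) * 0.0 term in MockModel:

import torch

w = torch.tensor([1.0], requires_grad=True)
loss = (w * 0.0).log().sum()  # log(0) = -inf
loss.backward()
# The upstream gradient is inf, the local gradient of w * 0.0 is 0,
# and inf * 0 = nan, so the parameter gradient becomes nan.
print(loss)    # tensor(-inf)
print(w.grad)  # tensor([nan])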
Digging deeper, we find that likelihood contains zeros, which lead to log(0) = -inf. Changing the line
log_likelihood = likelihood.log()
to
log_likelihood = likelihood.clamp(min=1e-8).log()
solves the problem.
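The effect of the clamp can be checked in isolation (an illustrative sketch):

import torch

print(torch.tensor([0.0]).log())                  # tensor([-inf])
# Clamping the input away from zero keeps the log finite:
print(torch.tensor([0.0]).clamp(min=1e-8).log())  # tensor([-18.4207])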