What is tape-based autograd in PyTorch?

Posted 2024-06-01 04:19:49


I understand that autograd stands for automatic differentiation. But what exactly is tape-based autograd in PyTorch, and why are there so many discussions that affirm or deny it?

For example:

this

In pytorch, there is no traditional sense of tape

this

We don’t really build gradient tapes per se. But graphs.

But not this:

Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation.

For further reference, please compare it with GradientTape in TensorFlow.


2 Answers

There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, and hybrids (more explanation). The tape-based autograd in PyTorch simply refers to the use of reverse-mode automatic differentiation (source). Reverse-mode auto-diff is just a technique used to compute gradients efficiently, and it happens to be what backpropagation uses (source).
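As a minimal illustration of reverse-mode automatic differentiation in PyTorch (a sketch with made-up toy values, not taken from the original answer): the forward pass builds the computation, and calling backward() sweeps from the scalar output back to the inputs to produce the gradient.

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x      # forward pass: y = x^2 + 3x
y.backward()            # reverse-mode sweep from the output y back to x
print(x.grad)           # dy/dx = 2x + 3 = 7.0 at x = 2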


Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system for automatic differentiation: in the forward phase, the autograd tape remembers all the operations it executed, and in the backward phase it replays those operations.
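A small sketch of what this recording looks like from Python (the tensors and values here are illustrative assumptions): every tensor produced while gradients are enabled carries a grad_fn node pointing at the operation that created it, and backward() walks this chain in reverse.

import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a * 3        # recorded: multiplication node
c = b.sum()      # recorded: sum node

print(c.grad_fn)                 # e.g. <SumBackward0 ...>
print(c.grad_fn.next_functions)  # links back to the multiplication node

c.backward()     # replay the recorded operations in reverse
print(a.grad)    # tensor([3., 3.])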

Similarly, to differentiate automatically, TensorFlow also needs to remember what operations happened, and in what order, during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients. TensorFlow provides the tf.GradientTape API for automatic differentiation, that is, for computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records relevant operations executed inside the context of a tf.GradientTape onto a tape, and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.
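A minimal sketch of that workflow (the variable and its values are illustrative assumptions, not from the original answer): operations on a tf.Variable inside the tape scope are recorded, and tape.gradient() replays them in reverse.

import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 3 * x          # recorded onto the tape

dy_dx = tape.gradient(y, x)     # reverse-mode replay of the tape
print(dy_dx)                    # 7.0, since dy/dx = 2x + 3 at x = 2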

So, as we can see from this high-level viewpoint, both frameworks are doing the same thing. However, in a custom training loop the forward pass and the loss calculation are more explicit in TensorFlow, because it uses the tf.GradientTape API scope, whereas in PyTorch these operations are implicit, but updating the training parameters (weights and biases) requires temporarily disabling gradient tracking, for which PyTorch explicitly uses the torch.no_grad API. In other words, TensorFlow's tf.GradientTape() is analogous to PyTorch's loss.backward(). Below is simplified code for the statements above.

# TensorFlow 
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
  with tf.GradientTape() as tape:
    # forward passing and loss calculations 
    # within explicit tape scope 
    predictions = tf_model(x)
    loss = squared_error(predictions, y)

  # compute gradients (grad)
  w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)

  # update training variables 
  w.assign(w - w_grad * learning_rate)
  b.assign(b - b_grad * learning_rate)


# PyTorch 
[w, b] = torch_model.parameters()
for epoch in range(epochs):
  # forward pass and loss calculation 
  # implicit tape-based AD 
  y_pred = torch_model(inputs)
  loss = squared_error(y_pred, labels)

  # compute gradients (grad)
  loss.backward()
  
  # update training variables / parameters  
  with torch.no_grad():
    w -= w.grad * learning_rate
    b -= b.grad * learning_rate
    w.grad.zero_()
    b.grad.zero_()

FYI, in both of the frameworks above the trainable variables (w, b) are updated manually, but in practice we generally use an optimizer (e.g. Adam) to do that job.

# TensorFlow 
# ....
# update training variables 
optimizer.apply_gradients(zip([w_grad, b_grad], tf_model.trainable_variables))

# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
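For completeness, a sketch of how those optimizer objects might be constructed in each framework (the names learning_rate and torch_model, and the choice of Adam, are assumptions carried over from the snippets above):

# TensorFlow
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

# PyTorch
optimizer = torch.optim.Adam(torch_model.parameters(), lr=learning_rate)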

I suspect this is because the word 'tape' is used in two different ways in the context of automatic differentiation.

When people say it is not tape-based, they mean that it uses operator overloading as opposed to [tape-based] source transformation for automatic differentiation; a toy sketch of the operator-overloading kind of tape follows the quoted passage below.

[Operator overloading] relies on a language’s ability to redefine the meaning of functions and operators. All primitives are overloaded so that they additionally perform a tracing operation: The primitive is logged onto a ‘tape’, along with its inputs to ensure that those intermediate variables are kept alive. At the end of the function’s execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. [...]
OO is the technique used by PyTorch, Autograd, and Chainer [37].

...

Tape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global stack also called a ‘tape’2 to ensure that intermediate variables are kept alive. The original (primal) function is augmented so that it writes intermediate variables to the tape during the forward pass, and the adjoint program will read intermediate variables from the tape during the backward pass. More recently, tape-based ST was implemented for Python in the ML framework Tangent [38].

...

2 The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.
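As a toy sketch of the operator-overloading style of 'tape' described in the quote (every name here is invented for illustration; this is not how any real framework is implemented internally): overloaded primitives log themselves onto a list together with their inputs, and the derivative is obtained by walking that list in reverse.

# toy operator-overloading reverse-mode AD
class Var:
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape   # shared list of recorded primitives

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        # log how to push the adjoint of `out` back to the inputs
        self.tape.append(lambda: (self._acc(out.grad), other._acc(out.grad)))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        self.tape.append(lambda: (self._acc(out.grad * other.value),
                                  other._acc(out.grad * self.value)))
        return out

    def _acc(self, g):
        self.grad += g

def backward(output):
    output.grad = 1.0
    for record in reversed(output.tape):   # walk the tape in reverse
        record()

tape = []
x = Var(2.0, tape)
y = x * x + x      # y = x^2 + x
backward(y)
print(x.grad)      # dy/dx = 2x + 1 = 5.0 at x = 2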
