What is tape-based autograd in PyTorch?

Posted 2024-06-01 04:19:49


I understand that autograd stands for automatic differentiation. But what exactly is tape-based autograd in PyTorch, and why are there so many discussions that affirm or deny it?

For example:

this

In pytorch, there is no traditional sense of tape

this

We don’t really build gradient tapes per se. But graphs.

But not this:

Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation.

For further reference, please compare it with GradientTape in TensorFlow.


2 Answers

There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, and hybrids (more explanation). The tape-based autograd in PyTorch simply refers to the use of reverse-mode automatic differentiation (source). Reverse-mode auto-diff is just a technique used to compute gradients efficiently, and it happens to be what backpropagation uses (source).
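As a minimal illustration of reverse-mode automatic differentiation in PyTorch (a sketch with made-up toy values, not taken from the original answer): the forward pass builds the computation, and calling backward() sweeps from the scalar output back to the inputs to produce the gradient.

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x      # forward pass: y = x^2 + 3x
y.backward()            # reverse-mode sweep from the output y back to x
print(x.grad)           # dy/dx = 2x + 3 = 7.0 at x = 2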


Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system for automatic differentiation: in the forward phase, the autograd tape remembers all the operations it executed, and in the backward phase it replays those operations.
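A small sketch of what this recording looks like from Python (the tensors and values here are illustrative assumptions): every tensor produced while gradients are enabled carries a grad_fn node pointing at the operation that created it, and backward() walks this chain in reverse.

import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a * 3        # recorded: multiplication node
c = b.sum()      # recorded: sum node

print(c.grad_fn)                 # e.g. <SumBackward0 ...>
print(c.grad_fn.next_functions)  # links back to the multiplication node

c.backward()     # replay the recorded operations in reverse
print(a.grad)    # tensor([3., 3.])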

Similarly, to differentiate automatically, TensorFlow also needs to remember what operations happened, and in what order, during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients. TensorFlow provides the tf.GradientTape API for automatic differentiation, that is, for computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records relevant operations executed inside the context of a tf.GradientTape onto a tape, and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.
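A minimal sketch of that workflow (the variable and its values are illustrative assumptions, not from the original answer): operations on a tf.Variable inside the tape scope are recorded, and tape.gradient() replays them in reverse.

import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 3 * x          # recorded onto the tape

dy_dx = tape.gradient(y, x)     # reverse-mode replay of the tape
print(dy_dx)                    # 7.0, since dy/dx = 2x + 3 at x = 2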

So, as we can see from this high-level viewpoint, both frameworks are doing the same thing. However, in a custom training loop the forward pass and the loss calculation are more explicit in TensorFlow, because it uses the tf.GradientTape API scope, whereas in PyTorch these operations are implicit, but updating the training parameters (weights and biases) requires temporarily disabling gradient tracking, for which PyTorch explicitly uses the torch.no_grad API. In other words, TensorFlow's tf.GradientTape() is analogous to PyTorch's loss.backward(). Below is simplified code for the statements above.

# TensorFlow 
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
  with tf.GradientTape() as tape:
    # forward passing and loss calculations 
    # within explicit tape scope 
    predictions = tf_model(x)
    loss = squared_error(predictions, y)

  # compute gradients (grad)
  w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)

  # update training variables 
  w.assign(w - w_grad * learning_rate)
  b.assign(b - b_grad * learning_rate)


# PyTorch 
[w, b] = torch_model.parameters()
for epoch in range(epochs):
  # forward pass and loss calculation 
  # implicit tape-based AD 
  y_pred = torch_model(inputs)
  loss = squared_error(y_pred, labels)

  # compute gradients (grad)
  loss.backward()
  
  # update training variables / parameters  
  with torch.no_grad():
    w -= w.grad * learning_rate
    b -= b.grad * learning_rate
    w.grad.zero_()
    b.grad.zero_()

FYI, in both of the frameworks above the trainable variables (w, b) are updated manually, but in practice we generally use an optimizer (e.g. Adam) to do that job.

# TensorFlow 
# ....
# update training variables 
optimizer.apply_gradients(zip([w_grad, b_grad], tf_model.trainable_variables))

# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
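For completeness, a sketch of how those optimizer objects might be constructed in each framework (the names learning_rate and torch_model, and the choice of Adam, are assumptions carried over from the snippets above):

# TensorFlow
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

# PyTorch
optimizer = torch.optim.Adam(torch_model.parameters(), lr=learning_rate)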

I suspect this is because the word 'tape' is used in two different ways in the context of automatic differentiation.

When people say it is not tape-based, they mean that it uses operator overloading as opposed to [tape-based] source transformation for automatic differentiation; a toy sketch of the operator-overloading kind of tape follows the quoted passage below.

[Operator overloading] relies on a language’s ability to redefine the meaning of functions and operators. All primitives are overloaded so that they additionally perform a tracing operation: The primitive is logged onto a ‘tape’, along with its inputs to ensure that those intermediate variables are kept alive. At the end of the function’s execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. [...]
OO is the technique used by PyTorch, Autograd, and Chainer [37].

...

Tape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global stack also called a ‘tape’2 to ensure that intermediate variables are kept alive. The original (primal) function is augmented so that it writes intermediate variables to the tape during the forward pass, and the adjoint program will read intermediate variables from the tape during the backward pass. More recently, tape-based ST was implemented for Python in the ML framework Tangent [38].

...

2 The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.
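As a toy sketch of the operator-overloading style of 'tape' described in the quote (every name here is invented for illustration; this is not how any real framework is implemented internally): overloaded primitives log themselves onto a list together with their inputs, and the derivative is obtained by walking that list in reverse.

# toy operator-overloading reverse-mode AD
class Var:
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape   # shared list of recorded primitives

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        # log how to push the adjoint of `out` back to the inputs
        self.tape.append(lambda: (self._acc(out.grad), other._acc(out.grad)))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        self.tape.append(lambda: (self._acc(out.grad * other.value),
                                  other._acc(out.grad * self.value)))
        return out

    def _acc(self, g):
        self.grad += g

def backward(output):
    output.grad = 1.0
    for record in reversed(output.tape):   # walk the tape in reverse
        record()

tape = []
x = Var(2.0, tape)
y = x * x + x      # y = x^2 + x
backward(y)
print(x.grad)      # dy/dx = 2x + 1 = 5.0 at x = 2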
