PyTorch gradient accumulation: the last step with a small dataset

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        # Also step on the very last batch, even if the group is incomplete.
        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == len(data):
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

2 Answers

User

#1 · Edited 2024-05-19 17:02:51

I'm pretty sure I have seen this done before. Have a look at {a1} from PyTorch Lightning (functions {}, {} and {}).

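User

#2 · Edited 2024-05-19 17:02:51

As Lucas Ramos already mentioned, when using ^{} with an underlying dataset whose size is not divisible by the batch size, the default behavior is to use a smaller last batch: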

drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

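Your plan is basically gradient accumulation combined with drop_last=False, that is, the last batch is smaller than all the others.
So, in principle, there is nothing wrong with training on batches of varying size.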
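To make the drop_last behaviour quoted above concrete, here is a small pure-Python sketch. The helper `batch_sizes` is a hypothetical name used only for illustration; it mimics the documented behaviour and is not how the loader is implemented.

```python
def batch_sizes(dataset_size, batch_size, drop_last=False):
    """Sizes of the batches produced for one pass over the dataset.

    Hypothetical helper: mirrors the documented drop_last semantics only.
    """
    full, remainder = divmod(dataset_size, batch_size)
    sizes = [batch_size] * full
    # The incomplete final batch is kept unless drop_last is set.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


print(batch_sizes(10, 4))                   # [4, 4, 2]
print(batch_sizes(10, 4, drop_last=True))   # [4, 4]
```

With 10 samples and a batch size of 4, the default keeps a final batch of 2 samples, which is the same situation as an incomplete last accumulation group.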

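However, there is something you need to fix in your code:
The loss is averaged over the mini-batch, so if you process mini-batches in the usual way you do not need to worry about it. When accumulating gradients, however, you do the averaging explicitly by dividing the loss by iters_to_accumulate: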

loss = loss / iters_to_accumulate

In the last mini-batch (the smaller one), you need to change the value of iters_to_accumulate to reflect the smaller mini-batch size.
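A toy illustration of why the divisor matters, using plain numbers in place of gradients. The values are made up for the example: assume iters_to_accumulate is 4 but only two batches remain at the end of the epoch, each contributing a gradient of 1.0.

```python
def accumulate(per_batch_grads, divisor):
    """Sum per-batch gradients, each pre-divided by `divisor` (as the loss is)."""
    return sum(g / divisor for g in per_batch_grads)


iters_to_accumulate = 4
last_group = [1.0, 1.0]  # illustrative: only two batches are left

fixed = accumulate(last_group, iters_to_accumulate)  # divides by 4
true_mean = accumulate(last_group, len(last_group))  # divides by 2

print(fixed, true_mean)  # 0.5 1.0
```

Dividing by the fixed value leaves the last update at half the magnitude of a proper average, while dividing by the number of batches actually accumulated gives the mean.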

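I propose this revised code, which breaks the training loop into two: an outer loop over mini-batches and an inner loop that accumulates gradients for each mini-batch. Note how using ^{} over the ^{} helps split the training loop into two: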

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs: 
    bi = 0  # index batches
    # outer loop over minibatches
    data_iter = iter(data)
    while bi < len(data):
        # determine the range for this batch
        nbi = min(len(data), bi + iters_to_accumulate)
        # inner loop over the items of the mini batch - accumulating gradients
        for i in range(bi, nbi):
            input, target = next(data_iter)  # iterators have no .next() method in Python 3
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)
                loss = loss / (nbi - bi)  # divide by the true batch size

            # Accumulates scaled gradients.
            scaler.scale(loss).backward()
        # done mini batch loop - gradients were accumulated, we can make an optimization step.
        
        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        bi = nbi
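If you would rather keep the single loop from the question, one possible alternative (not the nested-loop code above) is to compute the size of the current accumulation group from the batch index. The helper `group_size` below is a hypothetical name; it reproduces the same `nbi - bi` divisor.

```python
def group_size(i, num_batches, iters_to_accumulate):
    """Number of batches in the accumulation group that contains batch index i."""
    start = (i // iters_to_accumulate) * iters_to_accumulate  # same role as bi
    end = min(num_batches, start + iters_to_accumulate)       # same role as nbi
    return end - start


# Illustrative numbers: 10 batches, accumulating over 4 at a time.
print([group_size(i, 10, 4) for i in range(10)])
# [4, 4, 4, 4, 4, 4, 4, 4, 2, 2]
```

Inside the original loop this would be used as loss = loss / group_size(i, len(data), iters_to_accumulate), together with the extra (i + 1) == len(data) condition for the final step.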
