Reproducibility in PyTorch


I want to train a model with PyTorch 1.6.0 on multiple GPUs. I set all the seeds and the cuDNN determinism flags:

import random
import numpy as np
import torch

random.seed(seed)                          # Python built-in RNG
np.random.seed(seed)                       # NumPy RNG
torch.manual_seed(seed)                    # CPU RNG
torch.cuda.manual_seed(seed)               # current GPU
torch.cuda.manual_seed_all(seed)           # all GPUs (multi-GPU training)
torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
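
If the training loop uses a DataLoader with multiple workers, the workers' RNG state is another common source of run-to-run divergence that these flags do not cover. A minimal sketch of pinning it down, assuming a standard torch.utils.data.DataLoader (train_dataset and the batch size are placeholders):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive each worker's seed from the main process seed so that
    # random/numpy calls inside the workers are reproducible too.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(seed)

loader = DataLoader(
    train_dataset,               # placeholder: your dataset
    batch_size=16,               # placeholder
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,  # seeds random/numpy in every worker
    generator=g,                 # fixes the shuffling order
)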

However, the losses differ between the two runs:

[screenshots: loss curves of the first run and the second run]

l_sup is a loss in which a separate, fixed pre-trained model supervises the output of an intermediate feature extraction layer; l_pix is the original model's loss.

The code looks like this:

from copy import deepcopy

import torch
from torch.nn.parallel import DataParallel, DistributedDataParallel


class Model:
    def __init__(self):
        self.basic_model = BasicModel()
        load_path = ...
        self.pretrained_network = PretrainNetwork()
        self.load_network(self.pretrained_network, load_path)
        self.pretrained_network.eval()
        for p in self.pretrained_network.parameters():
            p.requires_grad = False
        self.basic_model = DataParallel(self.basic_model)
        self.pretrained_network = DataParallel(self.pretrained_network)
        
    def load_network(self, net, load_path, strict=True, param_key='params'):
        if isinstance(net, (DataParallel, DistributedDataParallel)):
            net = net.module
    
        load_net = torch.load(load_path, map_location=lambda storage, loc: storage)
        if param_key is not None:
            load_net = load_net[param_key]
        for k, v in deepcopy(load_net).items():
            if k.startswith('module.'):
                load_net[k[7:]] = v
                load_net.pop(k)
        net.load_state_dict(load_net, strict=strict)
        
    def optimize_parameters(self):
        self.optimizer.zero_grad()
        # left_feature and right_feature are the output of middle feature extraction layer, cost_volume is the final output of the model
        left_feature, right_feature, cost_volume = self.basic_model(self.left_seq, self.right_seq)
            
        # the output of middle feature extraction layer of pre-trained model
        left_cnn_feature, right_cnn_feature = self.pretrained_network(self.left_seq, self.right_seq)

        l_total = 0
            
        # loss of original model
        l_pix = self.basic_loss(cost_volume, self.gt)
        l_total += l_pix
            
        # loss of supervision of pre-trained model
        l_sup = 0.5 * self.feature_loss(left_feature, left_cnn_feature) + 0.5 * self.feature_loss(right_feature, right_cnn_feature)
        l_total += l_sup

        l_total.backward()
        self.optimizer.step()
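
Since pretrained_network is frozen, its forward pass never needs an autograd graph; running it under torch.no_grad() saves memory and removes one autograd path from consideration. A minimal sketch of that variant, assuming the same attribute names as above (this alone is not guaranteed to restore determinism):

    def optimize_parameters(self):
        self.optimizer.zero_grad()
        left_feature, right_feature, cost_volume = self.basic_model(self.left_seq, self.right_seq)

        # The frozen network only produces supervision targets, so no graph is needed.
        with torch.no_grad():
            left_cnn_feature, right_cnn_feature = self.pretrained_network(self.left_seq, self.right_seq)

        l_pix = self.basic_loss(cost_volume, self.gt)
        l_sup = 0.5 * self.feature_loss(left_feature, left_cnn_feature) \
              + 0.5 * self.feature_loss(right_feature, right_cnn_feature)

        l_total = l_pix + l_sup
        l_total.backward()
        self.optimizer.step()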

To make sure the pre-trained model never updates its parameters, I load its weights and then set:

self.pretrained_network.eval()
for p in self.pretrained_network.parameters():
    p.requires_grad = False
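
One way to confirm that the freeze actually holds is to snapshot the pre-trained weights before training and compare them after a few optimization steps. A minimal sketch, where snapshot_params is a hypothetical helper and model is an instance of the Model class above:

import torch

def snapshot_params(net):
    # Detached CPU copies of every parameter, keyed by name.
    return {name: p.detach().cpu().clone() for name, p in net.named_parameters()}

before = snapshot_params(model.pretrained_network)
# ... run a few calls to model.optimize_parameters() ...
after = snapshot_params(model.pretrained_network)

assert all(torch.equal(before[k], after[k]) for k in before), \
    "pretrained_network was updated during training"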

Strangely, if I use only l_pix, i.e. remove these two lines,

l_sup = 0.5 * self.feature_loss(left_feature, left_cnn_feature) + 0.5 * self.feature_loss(right_feature, right_cnn_feature)
l_total += l_sup

then reproducibility is guaranteed. I would like to know why adding l_sup makes the runs non-reproducible.
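
For diagnosis, one option is to log both loss terms at full precision on every iteration and diff the two runs; whichever term diverges first points at the culprit. Note that even with the cuDNN flags above, some CUDA kernels (typically ones whose backward relies on atomicAdd) remain non-deterministic, so the extra backward path through feature_loss could by itself break bit-wise reproducibility. A minimal sketch of the logging, assuming it is placed in optimize_parameters after the losses are computed (the file name is a placeholder):

# inside optimize_parameters(), after l_pix and l_sup are computed
with open("loss_log.txt", "a") as f:
    # repr() keeps full float precision so tiny divergences are visible
    f.write(f"{l_pix.item()!r}\t{l_sup.item()!r}\n")

Comparing loss_log.txt from the two runs line by line then shows the first iteration at which they disagree.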

