RuntimeError: CUDA error: invalid device ordinal

Posted 2024-06-09 04:48:46


When I try to run my program, I get the error RuntimeError: CUDA error: invalid device ordinal. The full output is shown at the end of this post.

I don't have much experience with this kind of thing. On top of that, it is neither my own program nor my own machine.

Based on this question on GitHub, I tested the following:

Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.8.1+cu102'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name()
'GeForce RTX 2080 Ti'
>>> 

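Unlike in that question, the machine I am using appears to have access to only one GPU. A colleague suggested the error might be caused by self.device ending up with a wrong value.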
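One way to test the colleague's suggestion is to compare the index the program requests with the number of devices PyTorch reports. The sketch below is a minimal, hypothetical helper (`check_cuda_idx` is not part of rlpyt or PyTorch). It takes the device count as a plain argument, so it can be tried without a GPU. With a single GPU, as in the session above, only index 0 is valid.

```python
def check_cuda_idx(requested_idx, device_count):
    """Return requested_idx if it names a visible device, else raise.

    Hypothetical helper for illustration; not part of rlpyt or PyTorch.
    None is passed through, meaning "stay on the CPU".
    """
    if requested_idx is None:
        return None
    if not 0 <= requested_idx < device_count:
        raise ValueError(
            f"cuda_idx={requested_idx} is out of range: "
            f"{device_count} device(s) visible, valid indices are "
            f"{list(range(device_count))}"
        )
    return requested_idx


# With one visible GPU (torch.cuda.device_count() == 1), index 0 passes.
print(check_cuda_idx(0, device_count=1))

# Any other index is what CUDA reports as "invalid device ordinal";
# here it fails early with a more descriptive message.
try:
    check_cuda_idx(1, device_count=1)
except ValueError as err:
    print(err)
```

In the real program, `device_count` would come from `torch.cuda.device_count()`, and `requested_idx` would be whatever value reaches `to_device` in the traceback.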

Any help is, of course, greatly appreciated.

(rlpyt) hbp@aklma-MS-7B24:~$ cd Documents/Bing/Mathieu/learning_to_be_taught/experiments/vmpo_replay_ratio/
(rlpyt) hbp@aklma-MS-7B24:~/Documents/Bing/Mathieu/learning_to_be_taught/experiments/vmpo_replay_ratio$ python vmpo_replay_ratio.py
/home/hbp/anaconda3/envs/rlpyt/lib/python3.8/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
training started with parameters: Namespace(epochs=None, log_dir=None, log_dir_positional=None, name='run', run_id=None, serial_mode=False, slot_affinity_code=None, snapshot_file=None)
exp_dir: /home/hbp/Documents/Bing/Mathieu/learning_to_be_taught/experiments/vmpo_replay_ratio/logs/run_6
using seed 5986
2021-05-27 14:11:40.546471  | run_6 Running 1520 sampler iterations.
2021-05-27 14:11:40.600944  | run_6 Optimizer master CPU affinity: [0].
2021-05-27 14:11:40.626970  | run_6 Initialized async CPU agent model.
2021-05-27 14:11:40.627073  | run_6 WARNING: unequal number of envs per process, from batch_B 6400 and n_worker 7 (possible suboptimal speed).
2021-05-27 14:11:40.627223  | run_6 Total parallel evaluation envs: 21.
2021-05-27 14:11:40.657946  | run_6 Optimizer master Torch threads: 1.
using seed 5987
using seed 5986
using seed 5988
using seed 5989
using seed 5990
using seed 5991
Traceback (most recent call last):
  File "vmpo_replay_ratio.py", line 213, in <module>
    build_and_train(slot_affinity_code=args.slot_affinity_code,
  File "vmpo_replay_ratio.py", line 135, in build_and_train
    runner.train()
  File "/home/hbp/Documents/Bing/Mathieu/rlpyt/rlpyt/runners/async_rl.py", line 87, in train
    throttle_itr, delta_throttle_itr = self.startup()
  File "/home/hbp/Documents/Bing/Mathieu/rlpyt/rlpyt/runners/async_rl.py", line 161, in startup
    throttle_itr, delta_throttle_itr = self.optim_startup()
  File "/home/hbp/Documents/Bing/Mathieu/rlpyt/rlpyt/runners/async_rl.py", line 177, in optim_startup
    self.agent.to_device(main_affinity.get("cuda_idx", None))
  File "/home/hbp/Documents/Bing/Mathieu/rlpyt/rlpyt/agents/base.py", line 115, in to_device
using seed 5992
    self.model.to(self.device)
  File "/home/hbp/anaconda3/envs/rlpyt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/hbp/anaconda3/envs/rlpyt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/hbp/anaconda3/envs/rlpyt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/hbp/anaconda3/envs/rlpyt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/hbp/anaconda3/envs/rlpyt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/home/hbp/anaconda3/envs/rlpyt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
2021-05-27 14:11:40.987723  | run_6 Sampler rank 1 initialized, CPU affinity [2], Torch threads 1, Seed 5987
2021-05-27 14:11:40.987714  | run_6 Sampler rank 0 initialized, CPU affinity [1], Torch threads 1, Seed 5986
2021-05-27 14:11:40.988088  | run_6 Sampler rank 2 initialized, CPU affinity [3], Torch threads 1, Seed 5988
2021-05-27 14:11:40.989922  | run_6 Sampler rank 3 initialized, CPU affinity [4], Torch threads 1, Seed 5989
2021-05-27 14:11:40.992058  | run_6 Sampler rank 4 initialized, CPU affinity [5], Torch threads 1, Seed 5990
2021-05-27 14:11:40.995587  | run_6 Sampler rank 5 initialized, CPU affinity [6], Torch threads 1, Seed 5991
2021-05-27 14:11:40.996119  | run_6 Sampler rank 6 initialized, CPU affinity [7], Torch threads 1, Seed 5992
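
The traceback shows the device index coming from `main_affinity.get("cuda_idx", None)`, so it may help to print every `cuda_idx` the affinity contains before training starts. The following is a minimal sketch that assumes the affinity is built from nested dicts and lists. `find_cuda_indices` and the example data are hypothetical, not rlpyt API, and the real structure and values may differ.

```python
def find_cuda_indices(obj, path="affinity"):
    """Yield (path, value) for every "cuda_idx" key in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}[{key!r}]"
            if key == "cuda_idx":
                yield child, value
            else:
                yield from find_cuda_indices(value, child)
    elif isinstance(obj, (list, tuple)):
        for i, item in enumerate(obj):
            yield from find_cuda_indices(item, f"{path}[{i}]")


# Made-up example data, for illustration only.
example_affinity = {
    "optimizer": [{"cpus": [0], "cuda_idx": 1}],
    "sampler": [{"cpus": [1, 2], "cuda_idx": None}],
}

for where, idx in find_cuda_indices(example_affinity):
    print(where, "->", idx)
```

Any reported index that is not smaller than `torch.cuda.device_count()` would be a candidate for the failing value.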
