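I was using Detectron2 for object detection in Google Colab, and it worked well, but I had to move to an HPC cluster running CentOS 7.4 with Conda. I have installed all the requirements and the script now starts without errors, but it blocks in a seemingly infinite sleep loop inside the resume_or_load function of the DefaultTrainer class. When I interrupt it, I get this traceback: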
Traceback (most recent call last):
  File "new_train.py", line 138, in <module>
    trainer.resume_or_load(resume=False)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 353, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 215, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 140, in load
    path = self.path_manager.get_local_path(path)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/iopath/common/file_io.py", line 1109, in get_local_path
    path, force=force, **kwargs
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/iopath/common/file_io.py", line 764, in _get_local_path
    with file_lock(cached):
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 160, in __enter__
    return self.acquire()
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 239, in acquire
    for _ in self._timeout_generator(timeout, check_interval):
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 152, in _timeout_generator
    time.sleep(max(0.001, (i * check_interval) - since_start_time))
KeyboardInterrupt
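The last frames of the traceback show a lock acquisition sleeping between attempts inside a timeout generator. A generic retry-with-sleep loop of that kind might look like the sketch below. This is not portalocker's actual code; `acquire_with_retry` and `try_lock` are hypothetical names used only for illustration. It shows why a lock call that fails on every attempt can look like a hang rather than an immediate error when the timeout is long.

```python
import time


def acquire_with_retry(try_lock, timeout, check_interval):
    """Call `try_lock` until it succeeds or `timeout` seconds elapse.

    `try_lock` is a hypothetical callable that raises OSError on failure.
    Returns the number of attempts made; raises TimeoutError on expiry.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            try_lock()
            return attempts
        except OSError:
            # The failure is swallowed here, so a permanent error is
            # indistinguishable from "someone else holds the lock".
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"lock not acquired after {attempts} attempts"
                )
            time.sleep(check_interval)
```

With a short timeout the underlying problem surfaces quickly as a TimeoutError; with a very long one, the loop just keeps sleeping.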
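Tracing the error was difficult, but I found that it occurs specifically in the fcntl.flock function. When I tried the same call in Google Colab with the same Detectron2 setup, it succeeded, but in my conda env I see this error: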
OSError [errno 9]: Bad file descriptor
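The error occurs when the script tries to download the pre-trained file from model_zoo and calls fcntl.flock() on a file on the local drive. The function receives an io.TextIOWrapper object that correctly refers to an existing file on the local drive, and it requests the lock with the non-blocking and exclusive flags. I have checked the file permissions, and I have read and write access.
I have searched, but I could not find out why this happens. Does anyone know how I can fix this error?
Thanks, everyone.
P.S.: I also tried installing Python 3.7.9, 3.7.10, and 3.9.4, and the same error occurs.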
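To narrow this down, the flock call described above can be reproduced outside Detectron2. The sketch below takes an exclusive, non-blocking lock on a scratch file opened in text mode (so the handle is an io.TextIOWrapper) and reports the errno name if the call fails. `probe_flock` is a hypothetical helper, not part of any library. Running it against different directories may show whether the failure depends on where the file lives; that the filesystem is the cause is only an assumption here, not something confirmed.

```python
import errno
import fcntl
import os
import sys
import tempfile


def probe_flock(directory):
    """Try an exclusive, non-blocking flock on a scratch file in `directory`.

    Returns (True, None) on success or (False, errno_name) on OSError.
    """
    fd, path = tempfile.mkstemp(dir=directory, suffix=".lock")
    os.close(fd)
    try:
        # Text mode gives an io.TextIOWrapper, as in the failing call.
        with open(path, "a") as handle:
            try:
                fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                return False, errno.errorcode.get(exc.errno, str(exc.errno))
            fcntl.flock(handle.fileno(), fcntl.LOCK_UN)
            return True, None
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Directories to test are taken from the command line;
    # the system temp directory is used when none are given.
    for directory in sys.argv[1:] or [tempfile.gettempdir()]:
        print(directory, probe_flock(directory))
```

Passing the directory where the downloaded weights are cached alongside a directory such as /tmp makes the comparison direct: if one succeeds and the other returns EBADF, the problem is tied to the location rather than to the Python or package versions.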