Lots of remote errors when following the TensorFlow Pets tutorial on Google Cloud

Posted 2024-06-07 11:25:49


I ran into several problems while following the "Distributed Training on the Oxford-IIIT Pets Dataset on Google Cloud" tutorial on the official TensorFlow Models repo. First, this:

Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 51, in <module>
    from object_detection.builders import model_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 29, in <module>
    from object_detection.meta_architectures import ssd_meta_arch
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 32, in <module>
    from object_detection.utils import visualization_utils
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/visualization_utils.py", line 25, in <module>
    import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
ImportError: No module named matplotlib

The last line says "No module named matplotlib". Following some advice I found online, I edited setup.py and added 'matplotlib' as a requirement:

REQUIRED_PACKAGES = ['Pillow>=1.0', 'matplotlib']

Running the job again, that problem was gone. It is odd, though: you would assume a tutorial would not have this issue. Then I hit a new problem:

Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 264, in train
    train_config.prefetch_queue_capacity, data_augmentation_options)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue
    tensor_dict = create_tensor_dict_fn()
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next
    dataset_builder.build(config)).get_next()
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 164, in build
    functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000),
AttributeError: 'module' object has no attribute 'data'
The replica worker 0 exited with a non-zero status of 1.

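Searching for this error turns up nothing relevant, so it is hard to tell what is wrong, although one answer hints at an outdated TensorFlow version. The tutorial specifies TensorFlow 1.2, while TensorFlow is now at 1.7, so that may be the crux of the problem. The runtime version list offers 1.2, 1.4, 1.5, and 1.6. When I tried 1.6, I got a different error: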

Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 746, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 402, in wait_for_session
    sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 486, in _try_run_local_init_op
    sess.run(self._local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
The replica worker 1 exited with a non-zero status of 1.

Again, there seems to be no known fix for this error, so I am stabbing in the dark. I tried once more with TensorFlow 1.4. New error:

Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 264, in train
    train_config.prefetch_queue_capacity, data_augmentation_options)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue
    tensor_dict = create_tensor_dict_fn()
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next
    dataset_builder.build(config)).get_next()
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 165, in build
    process_fn, config.input_path[:], input_reader_config)
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/dataset_util.py", line 133, in read_dataset
    tf.contrib.data.parallel_interleave(
AttributeError: 'module' object has no attribute 'parallel_interleave'
The replica worker 0 exited with a non-zero status of 1

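I now find myself deep in a world of errors with no idea what to do next. I am just following the tutorial step by step, running the commands it says to run, and getting these remote errors back after 5-10 minutes of execution.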

Any advice on how to get past these problems would be much appreciated.


2 Answers

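Some of these errors should only occur before the following commit. Using the repo as it is now and following the instructions here worked for me. It looks like you just need to use the --runtime-version 1.7 flag. If you keep having problems, make sure you follow the installation instructions using sudo.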

If that does not help, some people still report that they had to add Tensorflow and Jupyter to setup.py (though I don't think that is necessary).
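Both AttributeError tracebacks in the question name an attribute that the runtime's TensorFlow module does not expose (tf.data, then tf.contrib.data.parallel_interleave). One way to check a given install before submitting another job is to probe those attribute paths directly. The following is a minimal sketch, not part of the tutorial: the helper name has_attr_path is hypothetical, and the demonstration uses the standard-library os module so that it runs without TensorFlow installed.

```python
import importlib


def has_attr_path(module_name, dotted_path):
    """Return True if module_name imports and exposes every attribute in dotted_path.

    has_attr_path is a hypothetical helper name used only for this sketch.
    """
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing, so nothing below it can exist.
        return False
    for part in dotted_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


# Demonstration on a stdlib module, so no TensorFlow install is required.
print(has_attr_path("os", "path.join"))
print(has_attr_path("os", "path.no_such_attr"))
```

On a machine with the same TensorFlow version as the chosen runtime, calling has_attr_path("tensorflow", "data.TFRecordDataset") and has_attr_path("tensorflow", "contrib.data.parallel_interleave") would show whether that version has the attributes the object_detection code expects.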

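You have an installation problem. Uninstall everything, then confirm it is really gone by starting Python and importing each package you removed: every one of them should raise ImportError.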

Then carefully follow the steps on the installation page, which do call for a separate installation step for matplotlib, among other things.
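The "import everything and expect ImportError" check can be scripted rather than typed by hand. This is a rough sketch under assumptions: the function name find_still_installed is hypothetical, and the list of packages to check is yours to supply (the answer does not enumerate them). The demonstration uses a stdlib module and a made-up name so that it runs anywhere.

```python
import importlib


def find_still_installed(package_names):
    """Return the names from package_names that can still be imported.

    After a clean uninstall, this list should be empty.
    find_still_installed is a hypothetical name used only for this sketch.
    """
    still_installed = []
    for name in package_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Expected outcome for a package that was really uninstalled.
            continue
        still_installed.append(name)
    return still_installed


# Demonstration: json ships with Python, the second name is made up.
print(find_still_installed(["json", "surely_not_a_real_package_xyz"]))
```

In practice you would pass the top-level module names you just uninstalled, for example "matplotlib", and reinstall only once the returned list is empty.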
