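Handling CalledProcessError when running a Python Hadoop MapReduce job with mrjob
I am trying to run the basic example from the mrjob website on my own data. I have already run Hadoop MapReduce successfully with Hadoop streaming, and the script also works when run without Hadoop. Now I want to run it on Hadoop through mrjob, using the following command: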
./mapred.py -r hadoop --hadoop-bin /usr/bin/hadoop -o hdfs:///user/cloudera/wc_result_mrjob hdfs:///user/cloudera/books
The source of mapred.py is:
#!/usr/bin/env python
from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Emit one count per metric for every input line.
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # Sum the per-line counts for each metric.
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()
Unfortunately, I get the following error:
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/mapred.cloudera.20140824.195414.420162
writing wrapper script to /tmp/mapred.cloudera.20140824.195414.420162/setup-wrapper.sh
STDERR: mkdir: `hdfs:///user/cloudera/tmp/mrjob/mapred.cloudera.20140824.195414.420162/files/': No such file or directory
Traceback (most recent call last):
File "./mapred.py", line 18, in <module>
MRWordFrequencyCount.run()
File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/usr/lib/python2.6/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/usr/lib/python2.6/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/usr/lib/python2.6/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/usr/lib/python2.6/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 238, in _run
self._upload_local_files_to_hdfs()
File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 265, in _upload_local_files_to_hdfs
self._mkdir_on_hdfs(self._upload_mgr.prefix)
File "/usr/lib/python2.6/site-packages/mrjob/hadoop.py", line 273, in _mkdir_on_hdfs
self.invoke_hadoop(['fs', '-mkdir', path])
File "/usr/lib/python2.6/site-packages/mrjob/fs/hadoop.py", line 109, in invoke_hadoop
raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'fs', '-mkdir', 'hdfs:///user/cloudera/tmp/mrjob/mapred.cloudera.20140824.195414.420162/files/']' returned non-zero exit status 1
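It looks like mrjob cannot create a directory in HDFS, but I don't know how to fix that.
My Hadoop installation is the Cloudera CDH5.1 QuickStart VM.
Thanks in advance for any suggestions or comments.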
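To check whether the failure comes from Hadoop itself rather than from mrjob, the failing command from the traceback can be reproduced outside mrjob. The following is only a minimal sketch: `try_command` is a hypothetical helper, not part of mrjob, and it simply shows how `subprocess.CalledProcessError` exposes the exit status and the command that failed.

```python
import subprocess
import sys


def try_command(args):
    """Run args; return (exit_status, command) instead of raising.

    Hypothetical helper for illustration only, not part of mrjob.
    """
    try:
        subprocess.check_call(args)
    except subprocess.CalledProcessError as exc:
        # exc.returncode is the non-zero exit status,
        # exc.cmd is the argument list that was run.
        return exc.returncode, exc.cmd
    return 0, args


if __name__ == '__main__':
    # Portable stand-in for a failing command: a child Python that exits with 1.
    failing = [sys.executable, '-c', 'import sys; sys.exit(1)']
    print(try_command(failing))
```

On the cluster, the same helper could be called with the arguments shown in the traceback, for example `try_command(['/usr/bin/hadoop', 'fs', '-mkdir', path])` with `path` set to a directory whose parent does not exist yet. If that also returns a non-zero status, the problem is in how `hadoop fs -mkdir` is being invoked, not in the job code.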
Edit:
I ran the same code on the Cloudera CDH4.7 QuickStart VM and it worked fine. So my revised question is: does the mrjob framework support Cloudera CDH5.1? If it does, how do I run it?
1 Answer
I ran into the same error. My workaround was to change this line:
self.invoke_hadoop(['fs', '-mkdir', path])
to:
self.invoke_hadoop(['fs', '-mkdir', '-p', path])
The file I modified is:
/usr/lib/python2.6/site-packages/mrjob/hadoop.py
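My mrjob jobs have been running for several months with this change without any problems, so I believe it is safe.
I would also like to know whether there is another way to solve this.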
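One possible alternative to editing the installed file is to apply the same change at runtime from the job script. This is only a sketch under assumptions: `apply_mkdir_p_patch` and `FakeRunner` are hypothetical names introduced here, and the approach relies on the runner class defining `_mkdir_on_hdfs` and `invoke_hadoop` the way the traceback above shows for this mrjob version. Other mrjob versions may be structured differently.

```python
def apply_mkdir_p_patch(runner_cls):
    """Replace runner_cls._mkdir_on_hdfs with a version that passes -p.

    Hypothetical helper. runner_cls is expected to expose
    invoke_hadoop(args), as seen in the traceback.
    """
    def _mkdir_on_hdfs(self, path):
        # -p asks `hadoop fs -mkdir` to create missing parent directories.
        self.invoke_hadoop(['fs', '-mkdir', '-p', path])

    runner_cls._mkdir_on_hdfs = _mkdir_on_hdfs
    return runner_cls


class FakeRunner(object):
    """Stand-in for the real runner, used only to show the effect."""

    def __init__(self):
        self.calls = []

    def invoke_hadoop(self, args):
        # Record the arguments instead of running Hadoop.
        self.calls.append(args)

    def _mkdir_on_hdfs(self, path):
        # Mirrors the unpatched behaviour from the traceback.
        self.invoke_hadoop(['fs', '-mkdir', path])
```

In mapred.py this would presumably be used by importing the runner class from `mrjob.hadoop` (assumed here to be `HadoopJobRunner`) and calling `apply_mkdir_p_patch` on it before `MRWordFrequencyCount.run()`. The effect is the same as the edit above, but it stays in the job script and is not lost when the package is reinstalled.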