I am trying to run the simple wordcount example programmatically, but I cannot get the code to run on my Hadoop cluster. The job is:
```python
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)
```
I can run this code locally (with the inline runner), but on Hadoop I get:
```
Traceback (most recent call last):
  File "mr_job_tester.py", line 17, in <module>
    print test_runner(args, input_dir)
  File "mr_job_tester.py", line 8, in test_runner
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 239, in _run
    self._run_job_in_hadoop()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 295, in _run_job_in_hadoop
    for step_num in xrange(self._num_steps()):
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 742, in _num_steps
    return len(self._get_steps())
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 721, in _get_steps
    raise ValueError("Bad --steps response: \n%s" % stdout)
ValueError: Bad --steps response:
```
According to this, the way mrjob submits the job file and executes the mapper and reducer remotely requires the following lines to be present in the job definition file: