Hadoop流任务在Python中失败错误

22 投票

7 回答

37831 浏览

提问于 2025-04-16 08:37

根据这个指南，我成功运行了示例练习。但是在运行我的mapreduce任务时，出现了以下错误：
ERROR streaming.StreamJob: 任务未成功！ 10/12/16 17:13:38 信息 streaming.StreamJob: 正在终止任务... 流式任务失败！
错误来自日志文件

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Mapper.py

import sys

i=0

for line in sys.stdin:
    i+=1
    count={}
    for word in line.strip().split():
        count[word]=count.get(word,0)+1
    for word,weight in count.items():
        print '%s\t%s:%s' % (word,str(i),str(weight))

Reducer.py

import sys

keymap={}
o_tweet="2323"
id_list=[]
for line in sys.stdin:
    tweet,tw=line.strip().split()
    #print tweet,o_tweet,tweet_id,id_list
    tweet_id,w=tw.split(':')
    w=int(w)
    if tweet.__eq__(o_tweet):
        for i,wt in id_list:
            print '%s:%s\t%s' % (tweet_id,i,str(w+wt))
        id_list.append((tweet_id,w))
    else:
        id_list=[(tweet_id,w)]
        o_tweet=tweet

[编辑] 运行任务的命令：

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output

输入是任何随机的句子序列。

谢谢，

数据处理 hadoop mapreduce 分布式计算错误日志 mapper reducer 流任务

7 个回答

你需要明确告诉系统，mapper和reducer是用Python脚本来写的，因为我们有好几种流式处理的选择。你可以使用单引号或者双引号。

-mapper "python mapper.py" -reducer "python reducer.py"

或者

-mapper 'python mapper.py' -reducer 'python reducer.py'

完整的命令是这样的：

hadoop jar /path/to/hadoop-mapreduce/hadoop-streaming.jar \
-input /path/to/input \
-output /path/to/output \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file /path/to/mapper-script/mapper.py \
-file /path/to/reducer-script/reducer.py

回答于 2025-04-16 由 Python大师

分享举报

试着在你的脚本最上面加上

 #!/usr/bin/env python

这段代码。

或者，

-mapper 'python m.py' -reducer 'r.py'

回答于 2025-04-16 由 Python大师

分享举报

你的 -mapper 和 -reducer 只需要写脚本的名字就可以了。

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output

当你的脚本在 HDFS 中的一个文件夹里，而这个文件夹是相对于当前执行任务的文件夹（也就是用“.”表示的）时，记得如果你想添加另一个 -file，比如查找表，你可以在 Python 中像访问同一目录下的文件一样打开它，尽管你的脚本是在 M/R 任务中。

另外，确保你已经给 mapper.py 和 reducer.py 设置了执行权限，命令是 chmod a+x mapper.py 和 chmod a+x reducer.py。

回答于 2025-04-16 由 Python大师

分享举报

Hadoop流任务在Python中失败错误

7 个回答

撰写回答