为什么我的Hadoop输出是多个文件部分？

0 投票

3 回答

1453 浏览

提问于 2025-04-18 11:52

我尝试统计单词的出现频率，并写入文件：
mapper.py：

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

我使用了Hadoop的命令： hadoop streaming \ -input "/app/hadoop_learn_test/book.txt" \ -mapper "python mapper.py" \ -reducer "cat" \ -output "/app/hadoop_learn_test/book_out" \ -file "mapper.py" \ 这里的book.txt是：

foo foo quux labs foo bar quux

但是我得到了400个名为part-00000.gz的文件，当我用hadoop dfs -cat path来查看内容时，却什么都没有。

为什么我得不到结果呢？

我在本地终端使用了cat book.txt | python mapper.py | sort，得到了以下结果：

bar     1
foo     1
foo     1
foo     1
labs    1
quux    1
quux    1

数据处理文件输出 hadoop 分布式计算词频统计

3 个回答

试着把mapred.reduce.tasks这个设置改成1。

你可以在使用hadoop命令的时候加上-D mapred.reduce.tasks=1来实现。

补充一下：在一个Map Reduce的工作中，每个reducer都会生成一个输出文件。所以如果你得到了400个文件，基本上就是有400个reducer在工作。

回答于 2025-04-18 由 Python大师

分享举报

我觉得你需要使用计数器。

#!/usr/bin/env python

import sys
from collection import Counter

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    wordcount=Counter(words)

    for word,count in wordcount.items():

        print '%s\t%s' % (word, count)

回答于 2025-04-18 由 Python大师

分享举报

你有很多输出文件的原因是因为Hadoop是一个可以把任务分配到很多电脑（还有很多处理器）上的框架。如果你只想要最后得到一个文件，那你只能用一个进程（或者线程），这样就失去了使用Hadoop的意义。

如果想把所有输出合并在一起，可以简单地使用通配符：

hadoop dfs -text path/p*

注意，如果你在读取gz文件，应该使用-text选项。

另外，使用hadoop dfs这个命令已经不推荐了，你应该这样做：

hdfs dfs -text path/p*

补充说明：如果只用一个reduce任务，会影响到容错能力——也就是说，如果那个节点出问题了，reduce阶段就得从头开始。虽然可以开启猜测执行，但这样通常会比使用多个reduce任务更浪费资源。

回答于 2025-04-18 由 Python大师

分享举报

为什么我的Hadoop输出是多个文件部分？

3 个回答

撰写回答