How to handle multiple statistics in a single Hadoop streaming job

I need to compute several different statistics from the same input data. This is the mapper I came up with:

label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record

for line in sys.stdin:
    line = line.strip()
    line2 = line.split('|')
    lang = line2[-1]
    total_docs_counter +=1

    if lang=='-':
        no_lang_info +=1
        continue

    doc_url = line2[0]
    lang = lang .split('&')
    max_lang = ""
    max_percent = 0

    for num, ln in enumerate(lang):
        if num==0:
            continue
        tmp = ln.split('-')
        lang_id = tmp[0]
        # update_lang_statistics(total_lang_record, lang_id)
        print "%s\t%s\t1" %(label_1, lang_id)
    print "%s\t%s\t1" % (label_2, max_lang)

I prefix every output record with a label so that the reducer can tell which category it belongs to. In the reducer, a simple conditional on that label then routes each record to the right running total, which gives me the output I need. Obviously the inner loop emits far more records than the outer loop. I ran a local test:

cat input | ./mapper.py | sort | ./reducer.py

It works fine locally, but when I run the job on Hadoop over some real data, no reduce step seems to happen: the output files contain the raw mapper output, i.e. the mapper records are never aggregated. My map/reduce logic is very simple, just a separate running sum for each category. Where is the problem? And is there a better way to do this kind of job?
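For reference, a streaming job like this is normally submitted with an explicit -mapper and -reducer; a rough sketch of the command (the streaming-jar path and the HDFS input/output paths are placeholders, not my real values):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=1 \
    -input /path/to/input \
    -output /path/to/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

Here -file ships the two scripts to the worker nodes, the scripts need to be executable with a #!/usr/bin/env python shebang (or be invoked as "python mapper.py" / "python reducer.py"), and mapred.reduce.tasks controls how many reduce tasks actually run.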

Here is the reducer code:

import sys

def counting_reducer(line, temp, count, label):
    # Records arrive sorted by key, so equal names are adjacent:
    # keep summing while the name repeats, emit the total when it changes.
    label, name, freq = line.split("\t")
    freq = int(freq)
    if name == temp:
        count += freq
    else:
        if temp:
            print '%s\t%s\t%s' % (label, temp, count)
        count = freq
        temp = name

    return [temp, count]

label_1 = "tot_lan"
label_2 = "max_lan"
label_3 = "ur_bin"
label_4 = "ur_record"

# Running state: one (current name, running count) pair per label
temp_l1 = None
count_l1 = 0
temp_l2 = None
count_l2 = 0
temp_l3 = None
count_l3 = 0
temp_l4 = None
count_l4 = 0

urdu_bin_record = {}

skip = 0
for line in sys.stdin:
    line = line.strip()
    # dispatch on the label prefix added by the mapper
    if line.startswith(label_1):
        temp_l1, count_l1 = counting_reducer(line, temp_l1, count_l1, label_1)
    elif line.startswith(label_2):
        temp_l2, count_l2 = counting_reducer(line, temp_l2, count_l2, label_2)
    else:
        print line
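One more detail about this pattern: the last group seen for each label is only printed if it is flushed after the loop; a minimal sketch, reusing the same variables as above:

# flush the final group for each label
if temp_l1:
    print '%s\t%s\t%s' % (label_1, temp_l1, count_l1)
if temp_l2:
    print '%s\t%s\t%s' % (label_2, temp_l2, count_l2)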
