Reducer for spam/ham classification

import re
import sys

# initialize trackers
current_word = None
spam_count, ham_count = 0,0

# read from standard input
# Substitute read from a file
for line in data.splitlines():
#for line in sys.stdin:
    # parse input
    word, is_spam, count = line.split('\t')
    count = int(count)

    if word == current_word:
        if is_spam == '1':
            spam_count += count
        else:
            ham_count += count
    else:
        if current_word:
            # word to emit...
            if spam_count:
                print("%s\t%s\t%s" % (current_word, '1', spam_count))
            print("%s\t%s\t%s" % (current_word, '0', ham_count))

        if is_spam == '1':
            current_word, spam_count = word, count
        else:
            current_word, ham_count = word, count

if current_word == word:
    if is_spam == '1':
        print(f'{current_word}\t{is_spam}\t{spam_count}')
    else:
        print(f'{current_word}\t{is_spam}\t{spam_count}')

1 answer

User

#1 · Posted 2024-05-18 07:33:13

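The cause: when a new word starts, you should reset ham_count rather than only updating spam_count, and vice versa. Otherwise the count left over from the previous word leaks into the new one.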

Rewrite

if is_spam == '1':
    current_word, spam_count = word, count
else:
    current_word, ham_count = word, count

as

if is_spam == '1':
    current_word, spam_count = word, count
    ham_count = 0
else:
    current_word, ham_count = word, count
    spam_count = 0

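However, the output will still not match the expected output exactly:
1) because you always print spam_count first (but in the sample output, "cat ham" is emitted earlier)
2) the final output block emits only spam or only ham, depending on the current value of is_spam, but I guess you intended to emit both, right?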

The output: 
dog 1   2
dog 0   2
cat 1   1

- the "cat spam" count is correct, but "cat ham" is missing. I think you should at least print both, like this:

Rewrite this code

if current_word == word:
    if is_spam == '1':
        print(f'{current_word}\t{is_spam}\t{spam_count}')
    else:
        print(f'{current_word}\t{is_spam}\t{spam_count}')

as

print(f'{current_word}\t{1}\t{spam_count}')
print(f'{current_word}\t{0}\t{ham_count}')

The full output will then be

dog 1   2
dog 0   2
cat 1   1
cat 0   2
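
Putting the two rewrites above together, the whole reducer might look roughly like the sketch below. It is only an illustration, not the asker's final code: the function name reduce_counts is made up, `data` is assumed to be the seven-line tab-separated sample implied by the outputs in this answer, and the input is assumed to be sorted by word, which the original loop already relies on. Unlike the original, it resets both counters whenever the word changes and always emits both lines, even when a count is 0.

```python
def reduce_counts(lines):
    """Yield (word, is_spam, total) tuples; input must be sorted by word."""
    current_word = None
    spam_count, ham_count = 0, 0
    for line in lines:
        word, is_spam, count = line.rstrip('\n').split('\t')
        count = int(count)
        if word != current_word:
            # A new word starts: emit the finished word, then reset BOTH counters.
            if current_word is not None:
                yield current_word, '1', spam_count
                yield current_word, '0', ham_count
            current_word, spam_count, ham_count = word, 0, 0
        if is_spam == '1':
            spam_count += count
        else:
            ham_count += count
    # Emit the last word, spam line first and ham line second.
    if current_word is not None:
        yield current_word, '1', spam_count
        yield current_word, '0', ham_count


# Assumed sample input (hypothetical variable, reconstructed from the outputs above).
data = ("dog\t1\t1\ndog\t1\t1\ndog\t0\t1\ndog\t0\t1\n"
        "cat\t0\t1\ncat\t0\t1\ncat\t1\t1")

results = list(reduce_counts(data.splitlines()))
for word, is_spam, total in results:
    print(f"{word}\t{is_spam}\t{total}")
```

Because it is a generator over any iterable of lines, the same function should also accept sys.stdin in place of data.splitlines().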

Itertools
Also, the itertools module is handy for tasks like this:

import itertools    

splitted_lines = map(lambda x: x.split('\t'), data.splitlines())
grouped = itertools.groupby(splitted_lines, lambda x: x[0])

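grouped is an itertools.groupby object. It behaves like a generator, so note that it is lazy and yields its values only once (the output below is shown purely as an illustration, because listing it consumes those values).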

[(gr_name, list(gr)) for gr_name, gr in grouped] 
Out:
[('dog',
  [['dog', '1', '1'],
   ['dog', '1', '1'],
   ['dog', '0', '1'],
   ['dog', '0', '1']]),
 ('cat', [['cat', '0', '1'], ['cat', '0', '1'], ['cat', '1', '1']])]
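
To make the "only once" point concrete, here is a small sketch with hypothetical three-row data (not the asker's input). It shows that a second pass over the same groupby object finds nothing, and that groupby only merges adjacent items with equal keys, which is why the input has to arrive sorted by word.

```python
import itertools

# Hypothetical rows, already split on tabs.
rows = [['dog', '1', '1'], ['dog', '0', '1'], ['cat', '0', '1']]

grouped = itertools.groupby(rows, key=lambda r: r[0])
first_pass = [(name, list(group)) for name, group in grouped]
second_pass = [(name, list(group)) for name, group in grouped]  # already exhausted

print(first_pass)
print(second_pass)

# groupby only merges neighbours: a key that reappears later starts a new group.
keys = [key for key, _ in itertools.groupby(['a', 'b', 'a'])]
print(keys)
```

If the grouped values are needed more than once, one option is to keep the list built on the first pass instead of iterating the groupby object again.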

OK, now each group can be regrouped by its is_spam value:

import itertools    

def sum_group(group):
    """
    >>> sum_group([['dog', '1', '1'], ['dog', '1', '1']])
    2
    """
    return sum([int(i[-1]) for i in group])

splitted_lines = map(lambda x: x.split('\t'), data.splitlines())
grouped = itertools.groupby(splitted_lines, lambda x: x[0])

[(name, [(tag_name, sum_group(sub_group))
         for tag_name, sub_group 
         in itertools.groupby(group, lambda x: x[1])])
 for name, group in grouped]
Out:
[('dog', [('1', 2), ('0', 2)]), ('cat', [('0', 2), ('1', 1)])]

Complete example using itertools:

import itertools 


def emit_group(name, tag_name, group):
    tag_sum = sum([int(i[-1]) for i in group])
    print(f"{name}\t{tag_name}\t{tag_sum}")  # emit here
    return (name, tag_name, tag_sum)  # return the same data


splitted_lines = map(lambda x: x.split('\t'), data.splitlines())
grouped = itertools.groupby(splitted_lines, lambda x: x[0])


emitted = [[emit_group(name, tag_name, sub_group) 
            for tag_name, sub_group 
            in itertools.groupby(group, lambda x: x[1])]
            for name, group in  grouped]
Out:
dog 1   2
dog 0   2
cat 0   2
cat 1   1

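- emitted contains lists of tuples with the same data. Since groupby reads its input lazily, this approach works well with streams; here is a nice itertools tutorial, if you are interested.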
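
As a rough sketch of what working with a stream could look like, the variant below wraps the nested groupby in a generator that accepts any iterable of lines. The name reduce_stream is hypothetical, and io.StringIO merely stands in for sys.stdin so the example is self-contained; the sample lines are the same assumed seven-row input as before.

```python
import io
import itertools


def reduce_stream(lines):
    """Lazily yield (word, is_spam, total) from lines sorted by word."""
    rows = (line.rstrip('\n').split('\t') for line in lines)
    for word, group in itertools.groupby(rows, key=lambda r: r[0]):
        # Within one word, regroup adjacent rows by their is_spam flag.
        for tag, sub_group in itertools.groupby(group, key=lambda r: r[1]):
            yield word, tag, sum(int(r[2]) for r in sub_group)


# Stand-in for sys.stdin (assumed sample input).
stream = io.StringIO(
    "dog\t1\t1\ndog\t1\t1\ndog\t0\t1\ndog\t0\t1\n"
    "cat\t0\t1\ncat\t0\t1\ncat\t1\t1\n"
)

totals = list(reduce_stream(stream))
for word, tag, total in totals:
    print(f"{word}\t{tag}\t{total}")
```

Each group is summed before the next one is requested, so only one line of input needs to be held at a time. Note that rows for the same word and flag must be adjacent, otherwise that flag is emitted more than once for the word.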
