How do I count the total number of words spoken by each person in a conversation?

2 votes
3 answers
2310 views
Asked 2025-04-17 02:21


I'm just starting to learn Python, and I want to write a program that imports a text file, counts the total number of words, counts the number of words spoken by each participant (denoted by 'P1', 'P2', etc.) in a given paragraph while excluding those participant labels (such as 'P1') from the count, and finally prints each paragraph on its own.

Thanks to @James Hurford, I now have this code:

words = None
with open('data.txt') as f:
    words = f.read().split()
total_words = len(words)
print 'Total words:', total_words

in_para = False
para_type = None
paragraph = list()
for word in words:
    if ('P1' in word or
        'P2' in word or
        'P3' in word):
        if in_para == False:
            in_para = True
            para_type = word
        else:
            print 'Words in paragraph', para_type, ':', len(paragraph)
            print ' '.join(paragraph)
            del paragraph[:]
            para_type = word
    else:
        paragraph.append(word)
else:
    if in_para == True:
        print 'Words in last paragraph', para_type, ':', len(paragraph)
        print ' '.join(paragraph)
    else:
        print 'No words'

My text file looks like this:

P1: Bla bla bla.

P2: Bla bla bla bla.

P1: Bla bla.

P3: Bla.

The next thing I need to do is count the words for each participant. Right now I can only print them; I don't know how to return the data or reuse it.

I need a new variable for each participant's word count that I can manipulate later, in addition to counting all the words each participant spoke, something like:

P1all = sum of words in paragraph

Is there any way to count contractions such as "you're" or "it's" as two words?

Any ideas on how to solve this?

3 Answers

1

You can do this with two variables: one to keep track of who is speaking, and one to store that speaker's paragraphs. To store the paragraphs and keep them associated with their speakers, use a dictionary whose keys are the speaker names and whose values are lists of the paragraphs that person has spoken.

para_dict = dict()
para_type = None

for word in words:
    if ('P1' in word or
        'P2' in word or
        'P3' in word ):
        #extract the part we want leaving off the ':'
        para_type = word[:2]
        #create a dict with a list of lists 
        #to contain each paragraph the person uses
        if para_type not in para_dict:
            para_dict[para_type] = list()
        para_dict[para_type].append(list())
    else:
        #Append the word to the last list in the list of lists
        para_dict[para_type][-1].append(word)

You can then count the total number of words each person spoke like this:

for person, para_list in para_dict.items():
    counts_list = list()
    for para in para_list:
        counts_list.append(len(para))
    print person, 'spoke', sum(counts_list), 'words'
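
If you also want to keep those totals in a variable you can work with later (the P1all idea from the question), one option, sketched here only as a small extension of the code above, is to collect them into a second dictionary keyed by speaker:

total_per_person = dict()
for person, para_list in para_dict.items():
    # total words this person spoke across all of their paragraphs
    total_per_person[person] = sum(len(para) for para in para_list)

print 'P1 said', total_per_person.get('P1', 0), 'words in total'
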
5

I need a new variable that keeps track of the number of words for each participant, so that I can manipulate it later.

No, what you need is a Counter (if you're on Python 2.7 or later); otherwise use a defaultdict(int). Either one maps each person to the number of words they spoke.

from collections import Counter
#from collections import defaultdict

words_per_person = Counter()
#words_per_person = defaultdict(int)

inputfile = open('data.txt')  # the conversation file from the question
for ln in inputfile:
    if ':' not in ln:
        continue              # skip blank lines between utterances
    person, text = ln.split(':', 1)
    words_per_person[person] += len(text.split())

Now words_per_person['P1'] holds the number of words spoken by participant P1, provided that text.split() does a reasonable job of splitting the text into words. (Linguists disagree about what exactly counts as a 'word', so whatever you get is always an approximation.)
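
If you do want contractions such as "you're" or "it's" to count as two words (as asked in the question), one rough option, shown here only as a sketch and not a proper tokenizer, is to count runs of letters and digits, so that the apostrophe acts as a word boundary:

import re
from collections import Counter

def count_words(text):
    # "you're" -> ['you', 're'] and "it's" -> ['it', 's'], so each
    # contraction counts as two words; punctuation is dropped entirely
    return len(re.findall(r"[A-Za-z0-9]+", text))

words_per_person = Counter()
for ln in open('data.txt'):
    if ':' not in ln:
        continue  # skip blank lines between utterances
    person, text = ln.split(':', 1)
    words_per_person[person] += count_words(text)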

1

Congratulations on starting your Python adventure! Some of what follows may not make much sense right now, but bookmark it and come back to it later if it turns out to be useful. Eventually you should try to move from simple scripting to software engineering, and here are a few ideas for you!

With great power comes great responsibility. As a Python developer you have to be more disciplined than developers in other languages, because Python does relatively little to force you into a 'good' design.

I find it helps to start with the overall design.

def main():
    text = get_text()
    p_text = process_text(text)
    catalogue = process_catalogue(p_text)

Wow! You've just written the whole program; now you only have to go back and fill in the blanks! Doing it this way makes things far less intimidating. Honestly, I don't consider myself smart enough to solve really big problems, but I'm fine at solving small ones. So let's take them one at a time. I'll start with 'process_text'.

def process_text(text):
    b_text = bundle_dialogue_items(text)   
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)

I don't fully know what those pieces are yet, but I do know that text-processing problems usually follow a 'map/reduce' kind of pattern: operate on things, then clean up and combine, so I've put in some placeholder functions. I may come back and add more if needed.

Now let's write 'process_catalogue'. I could have called it 'process_dict', but that sounded boring.

def process_catalogue(p_text):
    speakers = make_catalogue(p_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)

Nice, that wasn't too bad. You might do it differently, but to me it makes sense to aggregate the items first, then count the words per paragraph, and then count all the words.

At this point I would probably make one or two small 'lib' modules to hold the remaining functions. To let you run the program without worrying about imports, I'll put everything in one .py file, but eventually you'll learn how to split it up more nicely. So let's get to it.

# ------------------ #
# == process_text == #
# ------------------ #

def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            cur_speaker, dialogue = line.split(':')
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res

def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialogue = [clean_word(w) for w in s_dialogue]
        res = name, c_dialogue, paragraph
        yield res

And one small helper function:

# ------------------- #
# == aux functions == #
# ------------------- #

to_clean = string.whitespace + string.punctuation
def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res

It may not be obvious, but this little library is designed as a data-processing pipeline. There are a couple of ways to process data: one is pipeline processing, the other is batch processing. Let's look at the batch part.
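
Before getting to the batch part, here is a tiny self-contained illustration of the difference (not part of the code above): a pipeline chains lazy generators that hand items along one at a time, while a batch pulls everything into memory before working on it:

def shout(lines):
    # one pipeline stage: transforms items lazily, one at a time
    for line in lines:
        yield line.upper()

stream = shout(['P1: bla', 'P2: bla bla'])  # a generator; nothing has run yet
batch = list(stream)                        # batch: force everything at once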

# ----------------------- #
# == process_catalogue == #
# ----------------------- #

speaker_stats = 'stats'
def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers



word_count = 'word_count'
def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'
def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers

These nested dictionaries get a little convoluted. In real production code I would replace them with some more readable classes (and add tests and docstrings!!), but I don't want to make this any more confusing than it already is. Anyway, for your convenience, here is the whole thing pulled together.

import pprint
import re
import string
from collections import Counter

p = re.compile(r'(\w+?):')


def get_text_line_items(text):
    for line in text.split('\n'):
        yield line


def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            cur_speaker, dialogue = line.split(':')
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res


to_clean = string.whitespace + string.punctuation


def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res


def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialogue = [clean_word(w) for w in s_dialogue]
        res = name, c_dialogue, paragraph
        yield res


speaker_stats = 'stats'


def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers


def clean_dict(speakers):
    for speaker in speakers:
        stats = speakers[speaker][speaker_stats]
        for paragraph in stats:
            stats[paragraph] = [''.join(c for c in word if c not in to_clean)
                                for word in stats[paragraph]]
    return speakers


word_count = 'word_count'


def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'


def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers


def get_text():
    text = '''BOB: blah blah blah blah
blah hello goodbye etc.

JERRY:.............................................
...............

BOB:blah blah blah
blah blah blah
blah.
BOB: boopy doopy doop
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.'''
    text = get_text_line_items(text)
    return text


def process_catalogue(c_text):
    speakers = make_catalogue(c_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)
    return t_speakers


def process_text(text):
    b_text = bundle_dialogue_items(text)
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)
    return c_text


def main():

    text = get_text()
    c_text = process_text(text)
    t_speakers = process_catalogue(c_text)

    # take a look at your hard work!
    pprint.pprint(t_speakers)


if __name__ == '__main__':
    main()

So this script is almost over-engineered for this application, but the point is to see what (possibly) readable, maintainable, modular Python code can look like.

I'm fairly sure the output is roughly this:

{'BOB': {'stats': {1: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'hello',
                       'goodbye',
                       'etc'],
                   2: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah'],
                   3: ['boopy', 'doopy', 'doop']},
         'total': 18,
         'word_count': Counter({1: 8, 2: 7, 3: 3})},
 'JERRY': {'stats': {1: ['', '']}, 'total': 2, 'word_count': Counter({1: 2})},
 'P1': {'stats': {1: ['Bla', 'bla', 'bla'], 2: ['Bla', 'bla']},
        'total': 5,
        'word_count': Counter({1: 3, 2: 2})},
 'P2': {'stats': {1: ['Bla', 'bla', 'bla', 'bla']},
        'total': 4,
        'word_count': Counter({1: 4})},
 'P3': {'stats': {1: ['Bla']}, 'total': 1, 'word_count': Counter({1: 1})}}
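
As mentioned above, in production code the nested dictionaries would probably be swapped for small, readable classes. Purely as an illustrative sketch (the Speaker class and its fields are hypothetical, not part of the code above), that might look something like this:

from collections import Counter

class Speaker(object):
    # hypothetical sketch; not part of the answer's code above
    def __init__(self, name):
        self.name = name
        self.stats = {}              # paragraph number -> list of words
        self.word_count = Counter()  # paragraph number -> word count
        self.total = 0

    def add_paragraph(self, paragraph, words):
        self.stats.setdefault(paragraph, []).extend(words)
        self.word_count[paragraph] += len(words)
        self.total += len(words)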
