Finding combinations of items that occur on the same line of a text file, and their frequencies

Posted 2024-04-25 03:52:30


I have the text of a file and two lists of terms.

file = "the workers have human rights, the women have rights, the people have to work."

list1 = ['workers', 'rights']
list2 = ['have', 'the']

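What I need is to check whether an item from list1 and an item from list2 occur on the same line of the file, and to count how often each such pair occurs across the whole text. I tried the code below, but it does not give the correct frequencies.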

freq = 0
result = []
for line in file.splitlines():
    for i in list1:
        for x in list2:
            if i in line and x in line:
                freq += 1
                result.append((i, x, freq))

1 answer
User
Answer #1 · Posted 2024-04-25 03:52:30

Do the following:

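import itertools
import string

frequencies = {}
# open_file is assumed to be an already opened file object. Iterating over it
# yields lines, so .splitlines() is not needed, and you shouldn't use file as a name.
for line in open_file:
    # Split into words and strip punctuation so that "rights," matches "rights"
    line = [word.strip(string.punctuation) for word in line.split()]
    list1_used = [x for x in list1 if x in line]
    list2_used = [x for x in list2 if x in line]
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] = frequencies.get(combination, 0) + 1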

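This builds a dictionary that maps each pair to its frequency. For example, if the line you gave were the only line in the file object, you would get something like {('rights', 'have'): 1, ('workers', 'have'): 1, ('rights', 'the'): 1, ('workers', 'the'): 1}. If you want to take into account how many times a given word occurs on the line, list1_used and list2_used become a little more complicated: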

list1_used = sum((((x,) * line.count(x)) for x in list1), ())
list2_used = sum((((y,) * line.count(y)) for y in list2), ())
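
As a rough illustration of this occurrence-weighted variant, the sketch below applies it to the sample sentence from the question. The names `text` and `pair_counts` are made up for the example, and stripping punctuation from each word is an assumption added so that "rights," is counted as "rights".

```python
import itertools
import string
from collections import Counter

list1 = ['workers', 'rights']
list2 = ['have', 'the']
# Sample sentence from the question (hypothetical variable name)
text = "the workers have human rights, the women have rights, the people have to work."

# Assumption: surrounding punctuation is not part of a word
words = [w.strip(string.punctuation) for w in text.split()]

# Repeat each term once per occurrence on the line
list1_used = sum(((x,) * words.count(x) for x in list1), ())
list2_used = sum(((y,) * words.count(y) for y in list2), ())

# Each pairing of one occurrence from list1 with one from list2 counts once
pair_counts = Counter(itertools.product(list1_used, list2_used))
print(pair_counts[('rights', 'have')])   # 2 occurrences of "rights" x 3 of "have" -> 6
print(pair_counts[('workers', 'the')])   # 1 occurrence of "workers" x 3 of "the" -> 3
```

Note that this counts every pairing of occurrences, so a line with two "rights" and three "have" contributes 6 to that pair rather than 1.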

Using a defaultdict here may be easier:

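from collections import defaultdict
import itertools
import string

frequencies = defaultdict(int)  # missing keys start at 0
for line in open_file:  # open_file is assumed to be an already opened file object
    # Split into words and strip punctuation so that "rights," matches "rights"
    line = [word.strip(string.punctuation) for word in line.split()]
    list1_used = ...  # either definition of list1_used shown earlier
    list2_used = ...  # either definition of list2_used shown earlier
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] += 1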
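
For completeness, here is a minimal runnable sketch of the per-line version using defaultdict. The function name `count_pairs`, the three-line sample text, and the use of `io.StringIO` in place of a real file are all assumptions made for illustration, as is the punctuation stripping.

```python
import io
import itertools
import string
from collections import defaultdict


def count_pairs(lines, terms1, terms2):
    """Count, per line, each (term1, term2) pair whose words both appear on that line."""
    frequencies = defaultdict(int)
    for line in lines:
        # Assumption: surrounding punctuation is not part of a word
        words = {w.strip(string.punctuation) for w in line.split()}
        used1 = [t for t in terms1 if t in words]
        used2 = [t for t in terms2 if t in words]
        for combination in itertools.product(used1, used2):
            frequencies[combination] += 1
    return dict(frequencies)


# io.StringIO stands in for an opened text file; it yields lines when iterated
sample = io.StringIO(
    "the workers have human rights\n"
    "the women have rights\n"
    "people work\n"
)
result = count_pairs(sample, ['workers', 'rights'], ['have', 'the'])
print(result)
```

Here each pair is counted at most once per line, so ('rights', 'have') ends up at 2 because both words appear together on the first two lines, while the third line contributes nothing.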
