Given a list of lists of strings and a list of words, how do I return the word counts?

Posted 2024-06-02 07:57:48


Suppose I have a long list of lists of strings containing punctuation, spaces, and so on, for example:

list_1 = [[the guy was plaguy but unable to play football, but he was able to play tennis],[That was absolute cool],...,[This is an implicit living.]]

I also have another long list of words:

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

How can I extract, for each sublist of list_1, the count or frequency of all the words that appear in list_2? For example, given the lists above:

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

[the guy was unable to play football, but he was able to play tennis]

Since unable, a word from list_2, appears in the sublist above, the count for this sublist is 1.

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

[That was absolute cool]

Since no word from list_2 appears in the sublist above, the count is 0.

list_2 =['unable', 'unquestioning', 'implicit',...,'living', 'relative', 'comparative']

[This is an implicit living.]

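Since implicit and living, both words from list_2, appear in the sublist above, the count for this sublist is 2.

The desired output is [1, 0, 2].

Any idea how to approach this task so that it returns a list of counts? Thanks in advance.

For example: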

>>> [sum(1 for word in list_2 if word in sentence) for sublist in list_1 for sentence in sublist]

is wrong, because the in operator tests for substrings and so confuses the two words guy and playguy. Any idea how to fix this?


3 answers

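I would rather use a regular expression. First, because you need to match whole words, which is awkward with other string-search methods. Second, even if it looks like a bazooka, it is usually very efficient.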

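First build a regular expression from list_2, then use it to search the sentences of list_1. The regular expression is constructed like this: "(\bword1\b|\bword2\b|...)", meaning "whole word 1 or whole word 2 or ...". \b matches at the start or end of a word.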
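A minimal sketch of how such a pattern is assembled and what it matches, assuming a short hypothetical word list (the names words, pattern, and sentence are illustrative and not part of the question):

```python
import re

# Hypothetical word list, for illustration only.
words = ["unable", "able"]

# Join the words with r"\b|\b" and wrap the result in r"(\b ... \b)".
# re.escape guards against regex metacharacters inside a word.
pattern = re.compile(r"(\b{}\b)".format(r"\b|\b".join(re.escape(w) for w in words)))
print(pattern.pattern)  # (\bunable\b|\bable\b)

sentence = "he was unable to play, but able to watch"
matches = pattern.findall(sentence)
print(matches)  # ['unable', 'able']
```

Each word is matched only where it stands as a whole word, so the "able" inside "unable" is not reported a second time.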

I assume you want one result per sublist of list_1, not one result per sentence of each sublist.

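import re

# One alternation of whole-word patterns; re.escape protects special characters.
_regex = re.compile(r"(\b{}\b)".format(r"\b|\b".join(map(re.escape, list_2))))
word_counts = [
    # Total number of whole-word matches over all sentences of the sublist.
    sum(len(_regex.findall(sentence)) for sentence in sublist)
    for sublist in list_1
]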

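Here you can find a whole sample code, with a performance comparison against plain string searching, bearing in mind that matching whole words takes more work and is therefore less efficient.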
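As a rough sketch of how such a comparison could be set up (this is not the linked sample code; the data and the function names count_regex and count_substring are assumptions made for illustration):

```python
import re
import timeit

# Hypothetical sample data modelled on the question.
list_1 = [["the guy was unable to play football, but he was able to play tennis"],
          ["That was absolute cool"],
          ["This is an implicit living."]]
list_2 = ["unable", "unquestioning", "implicit", "living", "relative", "comparative"]

_regex = re.compile(r"(\b{}\b)".format(r"\b|\b".join(map(re.escape, list_2))))

def count_regex():
    # Whole-word matches, summed per sublist.
    return [sum(len(_regex.findall(s)) for s in sub) for sub in list_1]

def count_substring():
    # Plain substring search; it would also match inside longer words.
    return [sum(1 for w in list_2 for s in sub if w in s) for sub in list_1]

print(count_regex())      # [1, 0, 2]
print(count_substring())  # [1, 0, 2]

# Timings depend on the machine and the data, so no figures are claimed here.
print(timeit.timeit(count_regex, number=1000))
print(timeit.timeit(count_substring, number=1000))
```

On this particular data both functions agree, because no list_2 word happens to occur inside a longer word.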

The trick is to use the split() method together with a list comprehension. If you only split on whitespace:

list_1 = ["the guy was unable to play football but he was able to play tennis", "That was absolute cool", "This is implicit living"]

list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']

print([sum(1 for j in list_2 if j in i.split()) for i in list_1])  # [1, 0, 2]

However, if you want to tokenize on all non-alphanumeric characters, you should use re:

import re

list_1 = ["the guy was unable to play football,but he was able to play tennis", "That was absolute cool", "This is implicit living"]
list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']

print([sum(1 for j in list_2 if j in re.split(r"\W+", i)) for i in list_1])  # [1, 0, 2]

The \W character class matches every character that is not a letter, a digit, or an underscore.
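To illustrate the difference between the two tokenizations, here is a small sketch using one sentence from the question. The helper count_tokens is hypothetical and uses re.findall(r"\w+", ...) rather than re.split, an assumption made here to avoid empty tokens:

```python
import re

sentence = "This is an implicit living."
list_2 = ["unable", "unquestioning", "implicit", "living", "relative", "comparative"]

# Whitespace splitting keeps the trailing period attached to the last word.
print(sentence.split())            # ['This', 'is', 'an', 'implicit', 'living.']
# Splitting on \W+ removes it, but leaves an empty string at the end.
print(re.split(r"\W+", sentence))  # ['This', 'is', 'an', 'implicit', 'living', '']

# Hypothetical helper: count how many vocabulary words occur as whole tokens.
def count_tokens(text, vocabulary):
    tokens = set(re.findall(r"\w+", text))  # findall yields no empty tokens
    return sum(1 for word in vocabulary if word in tokens)

print(count_tokens(sentence, list_2))  # 2
```

Like the one-liners above, this counts each list_2 word at most once per sentence, however often it occurs.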

Use the built-in function sum with a list comprehension:

>>> list_1 = [['the guy was unable to play football, but he was able to play tennis'],['That was absolute cool'],['This is implicit living.']]
>>> list_2 =['unable', 'unquestioning', 'implicit','living', 'relative', 'comparative']   
>>> [sum(1 for word in list_2 if word in sentence) for sublist in list_1 for sentence in sublist]

[1, 0, 2]
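One caveat: in applied to a string tests for substrings, which is the problem raised in the question. A small sketch, with a hypothetical sentence and word list chosen so that the two ways of counting disagree:

```python
import re

# Hypothetical data chosen so that substring and whole-word counts differ.
sentences = ["the playguy was unable to play"]
words = ["guy", "able"]

# Substring test: "guy" is found inside "playguy", "able" inside "unable".
substring_counts = [sum(1 for w in words if w in s) for s in sentences]
print(substring_counts)  # [2]

# Whole-word test using \b word boundaries.
whole_word_counts = [
    sum(1 for w in words if re.search(r"\b{}\b".format(re.escape(w)), s))
    for s in sentences
]
print(whole_word_counts)  # [0]
```

The expression above returns [1, 0, 2] on the sample data only because no word of list_2 occurs inside a longer word there.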
