Counting the frequency of a sub-phrase with NLTK

Posted 2024-05-16 09:52:44


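For this sentence: "I see a tall tree outside. A man is under the tall tree"

How can I count the frequency of "tall tree"? I can use bigrams, as in collocation finding, for example: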

bgs= nltk.bigrams(tokens)
fdist1= nltk.FreqDist(bgs)
pairs = fdist1.most_common(500)

But I only need the count for one specific phrase.


2 answers

@uday1889's answer has some flaws, because str.count() also matches inside longer words:

>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> string.count("tall tree")
2
>>> string = "The see a stall tree outside. A man is under the tall trees"
>>> string.count("tall tree")
2
>>> string = "I would like to install treehouses at my yard"
>>> string.count("tall tree")
1

A cheap hack is to pad the phrase with spaces in the call to str.count():

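>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> string.count(" tall tree ")
1
>>> string = "The see a stall tree outside. A man is under the tall trees"
>>> string.count(" tall tree ")
0
>>> string = "I would like to install treehouses at my yard"
>>> string.count(" tall tree ")
0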

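But as you can see, this breaks when the substring is at the start or end of a sentence, or next to punctuation. Tokenizing the text and comparing bigrams avoids those cases: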

>>> from nltk.util import ngrams
>>> from nltk import word_tokenize
>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> len([i for i in ngrams(word_tokenize(string),n=2) if i==('tall', 'tree')])
2
>>> string = "I would like to install treehouses at my yard"
>>> len([i for i in ngrams(word_tokenize(string),n=2) if i==('tall', 'tree')])
0

The count() method should do it:

string = "I see a tall tree outside. A man is under the tall tree"
string.count("tall tree")
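
As the answer above points out, plain `str.count()` also matches inside longer words such as "stall" or "trees". One possible workaround that stays within the standard library is a word-boundary regex. This is only a sketch under that assumption, and `count_whole_phrase` is a hypothetical name:

```python
import re


def count_whole_phrase(text, phrase):
    # \b anchors the match at word boundaries, so "stall tree" and
    # "tall trees" are not counted; re.escape guards special characters.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text))


print(count_whole_phrase("I see a tall tree outside. A man is under the tall tree", "tall tree"))
print(count_whole_phrase("The see a stall tree outside. A man is under the tall trees", "tall tree"))
```

This assumes the phrase starts and ends with a word character and that the words are separated by exactly the whitespace written in the phrase. For anything looser, counting over tokens is probably the safer route.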
