Bigrams without repeated words

Posted 2024-06-16 11:34:55


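I want to analyze a text by counting bigrams. Unfortunately, my text contains many repeated words (for example: hello hello), and I don't want those pairs counted as bigrams.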

My code is as follows:

import nltk

b = nltk.collocations.BigramCollocationFinder.from_words('this this is is a a test test'.split())
b.ngram_fd.keys()

This returns:

>> dict_keys([('this', 'this'), ('this', 'is'), ('is', 'is'), ('is', 'a'), ('a', 'a'), ('a', 'test'), ('test', 'test')])

But I would like the output to be:

>> [('a', 'test'), ('is', 'a'), ('this', 'is')]

Do you have any suggestions? Using a different library would also be fine. Thanks in advance for your help. Francesca


2 answers

You can remove the consecutive repeated words before passing the list to nltk.collocations.BigramCollocationFinder.from_words:

words = 'this this is is a a test test'.split()
# compare each word with the one before it and keep it only if it differs
removed_duplicates = [first for first, second in zip(words, [''] + words) if first != second]

Output:

['this', 'is', 'a', 'test']

Then run:

b = nltk.collocations.BigramCollocationFinder.from_words(removed_duplicates)
b.ngram_fd.keys()
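
For illustration, the same idea can be sketched with the standard library only. The helper names `collapse_repeats` and `adjacent_pairs` are hypothetical, and a plain `zip` stands in for the NLTK bigram finder, so this is a rough equivalent under those assumptions rather than the NLTK call itself:

```python
from itertools import groupby

def collapse_repeats(words):
    """Keep one word from each run of identical consecutive words."""
    return [word for word, _run in groupby(words)]

def adjacent_pairs(words):
    """Pair each word with its successor (plain-Python bigrams)."""
    return list(zip(words, words[1:]))

words = 'this this is is a a test test'.split()
collapsed = collapse_repeats(words)
pairs = adjacent_pairs(collapsed)
print(collapsed)  # ['this', 'is', 'a', 'test']
print(pairs)      # [('this', 'is'), ('is', 'a'), ('a', 'test')]
```

Only consecutive repeats are collapsed, so a word that appears again later in the text is kept.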

Try:

result_cleared = [x for x in b.ngram_fd.keys() if x[0] != x[1]]
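
If the frequencies are needed as well as the keys, the same filter can be applied while counting. The following is a minimal sketch using only the standard library: `count_distinct_bigrams` is a hypothetical helper, and `collections.Counter` stands in for the `ngram_fd` frequency distribution.

```python
from collections import Counter

def count_distinct_bigrams(words):
    """Count adjacent word pairs, skipping pairs whose two words are equal."""
    return Counter(
        (first, second)
        for first, second in zip(words, words[1:])
        if first != second
    )

words = 'this this is is a a test test'.split()
counts = count_distinct_bigrams(words)
print(sorted(counts))  # [('a', 'test'), ('is', 'a'), ('this', 'is')]
```

Sorting the keys gives the order shown in the question.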

Edit: if the text is stored in a DataFrame, you can do the following:

import nltk
import pandas as pd

# the dummy data from your comment
df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense',
                            'this song says na na na',
                            'this is very very very very annoying']})

def create_bigrams(text):
    # build the bigrams for one text and drop pairs of identical words
    b = nltk.collocations.BigramCollocationFinder.from_words(text.split())
    return [x for x in b.ngram_fd.keys() if x[0] != x[1]]

df["bigrams"] = df["Text"].apply(create_bigrams)
df["bigrams"].apply(print)

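This first adds a column containing the bigrams to the DataFrame and then prints the column values. If you only want the output without modifying df, replace the last two lines with: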

df["Text"].apply(create_bigrams).apply(print)
