Bigrams without repeated words

2 answers

User

#1 · edited 2024-06-16 11:34:55

You can remove consecutive repeated words before passing the list to nltk.collocations.BigramCollocationFinder.from_words:

words = 'this this is is a a test test'.split()
# Pair each word with its predecessor and keep it only when the two differ
removed_duplicates = [first for first, second in zip(words, ['']+words) if first != second]

Output:

['this', 'is', 'a', 'test']

Then run:

import nltk
b = nltk.collocations.BigramCollocationFinder.from_words(removed_duplicates)
b.ngram_fd.keys()
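The same consecutive-duplicate removal can also be sketched with itertools.groupby from the standard library, which groups runs of equal adjacent items. The helper name collapse_consecutive is made up for this illustration and is not part of NLTK; like the zip version, it only collapses words that repeat back to back.

```python
from itertools import groupby

def collapse_consecutive(words):
    """Keep one word from each run of identical adjacent words (hypothetical helper)."""
    # groupby yields one (key, run) pair per run of equal neighbours; only the key is needed.
    return [word for word, _run in groupby(words)]

words = 'this this is is a a test test'.split()
removed_duplicates = collapse_consecutive(words)
print(removed_duplicates)  # ['this', 'is', 'a', 'test']
```

Unlike the zip version, this does not rely on an empty-string placeholder, so a leading '' element in the list would be kept rather than dropped.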

User

#2 · edited 2024-06-16 11:34:55

Try:

result_cleared = [x for x in b.ngram_fd.keys() if x[0] != x[1]]

Edit: if the text is stored in a DataFrame, you can do the following:

import nltk
import pandas as pd

# the dummy data from your comment
df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense','this song says na na na','this is very very very very annoying']})

def create_bigrams(text):
    b = nltk.collocations.BigramCollocationFinder.from_words(text.split())
    return [x for x in b.ngram_fd.keys() if x[0] != x[1]]

df["bigrams"] = df["Text"].apply(create_bigrams)
df["bigrams"].apply(print)

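This first adds a column containing the bigrams to the DataFrame and then prints the column values. If you only want the output without modifying df, replace the last two lines with: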

df["Text"].apply(create_bigrams).apply(print)
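The filter in this answer returns only the distinct bigrams, because it iterates over the keys of ngram_fd and drops the counts. If the frequencies matter too, a rough sketch of the same filter using only the standard library could look like this. The function name bigram_counts_without_repeats is hypothetical, and collections.Counter is a stand-in here, not NLTK's own frequency object.

```python
from collections import Counter

def bigram_counts_without_repeats(text):
    """Count adjacent word pairs, skipping pairs whose two words are identical."""
    words = text.split()
    # zip(words, words[1:]) yields every pair of neighbouring words.
    pairs = zip(words, words[1:])
    return Counter(pair for pair in pairs if pair[0] != pair[1])

counts = bigram_counts_without_repeats('this song says na na na')
for bigram, freq in counts.items():
    print(bigram, freq)
```

For the sample sentence this keeps ('this', 'song'), ('song', 'says') and ('says', 'na'), each with a count of 1, and discards both ('na', 'na') pairs.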
