How to generate the highest-probability bigrams from a list of individual word strings as input

Posted 2024-06-07 16:11:11


I am learning about bigrams in natural language processing. At this stage I am struggling with the Python computation, but I have given it a try.

I will use the untokenised corpus below as my main raw dataset. I can generate bigram results with the nltk module. However, my problem is how to write the Python computation that produces only the bigrams containing certain specific words. More specifically, I want to find all bigrams in the corpus that contain a word from word_of_interest.

corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]

word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']

Each of these words is taken from the word_of_interest list above. Next, I want to get the frequency of each bigram based on its occurrences in the corpus, then sort the bigrams by probability from highest to lowest and print them.
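For reference, this is roughly how I already get plain, unfiltered bigram counts with nltk (just a minimal sketch; the variable names are mine):

import nltk
from nltk import FreqDist

tokens = corpus[0].split()              # the corpus is a single untokenised string
bigrams = list(nltk.bigrams(tokens))    # all adjacent word pairs
fdist = FreqDist(bigrams)               # bigram -> count
print(fdist.most_common(5))             # most frequent bigrams, not yet filtered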

I have tried code that I found by searching online, but it does not give me any output. The code is as follows:

for i in corpus:
    bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
    bigram_j = lambda i[x] not in i
    x += 1
print(bigram_j)

Unfortunately, the output does not return what I planned to achieve.

Please advise. The output I want would contain the bigrams that include a word from word_of_interest, sorted by their probability from highest to lowest.


1 Answer
User
#1 · Posted 2024-06-07 16:11:11

You can try the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Count bigrams only; with use_idf=False the matrix holds (L2-normalised) term frequencies
vec = TfidfVectorizer(ngram_range=(2, 2), use_idf=False)

corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
matrix = vec.fit_transform(corpus).toarray()    # shape: (n_documents, n_bigrams)
vocabulary = list(vec.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:                       # substring match, so 'and' also matches 'hand to'
            index = vocabulary.index(bigram)     # column of this bigram in the matrix
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()   # total frequency over all documents
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)

df = pd.DataFrame({'bigram': all_bigrams, 'frequency': all_frequencies})
df.sort_values('frequency', ascending=False, inplace=True)  # highest frequency first
df.head()

The output is a pandas DataFrame showing the bigrams sorted by frequency.


The rationale here is that TfidfVectorizer counts how many times each token appears in every document of the corpus, computes the term-specific frequency, and stores that information in the column associated with that token. The index of that column is the same as the index of the associated term in the vocabulary retrieved with get_feature_names_out() (formerly get_feature_names()) on the fitted vectorizer. You then simply take the matrix holding the relative token frequencies and sum along the column of interest.
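As a tiny illustration of that correspondence, reusing the vocabulary and matrix variables from the code above with the bigram 'santa clauss', which does occur in this corpus:

# Look up the column that belongs to one particular bigram and sum it over
# all documents (the corpus above has a single document).
idx = vocabulary.index('santa clauss')
print('santa clauss', matrix[:, idx].sum())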

The double nested for loop is not ideal, though, and there may be a more efficient implementation. The issue is that the vectorizer's feature names are not tuples but a list of strings of the form ['first_token second_token', ...]. I would be glad to see a better implementation of the second half of the code above.
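One possible vectorised rewrite of that second half, offered only as a sketch (it assumes the vocabulary, matrix and word_of_interest variables defined above, keeps the same substring matching as the loops, and lists each bigram once even if it matches several words of interest):

import pandas as pd

# Build the full bigram/frequency table once, then filter it.
freq = pd.DataFrame({'bigram': vocabulary, 'frequency': matrix.sum(axis=0)})

# Keep rows whose bigram string contains any word of interest
# (substring match, exactly like the nested loops above).
pattern = '|'.join(word_of_interest)
df = freq[freq['bigram'].str.contains(pattern)].copy()

# Turn 'first second' strings into ('first', 'second') tuples and sort.
df['bigram'] = df['bigram'].str.split(' ').apply(tuple)
df = df.sort_values('frequency', ascending=False)
print(df.head())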
