How to generate the highest-probability bigrams from a list of individual word strings as input

Posted 2024-06-07 16:11:11


I am learning about bigrams in natural language processing. At this stage I am struggling with the Python computation, but I have given it a try.

I will use the untokenised corpus below as my main raw dataset. I can generate bigram results with the nltk module. However, my problem is how to write the Python computation that produces only the bigrams containing certain specific words. More specifically, I want to find all bigrams in the corpus that contain a word from word_of_interest.

corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]

word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']

Each of these words is taken from the word_of_interest list above. Next, I want to get the frequency of each bigram based on its occurrences in the corpus, then sort the bigrams by probability from highest to lowest and print them.
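For reference, this is roughly how I already get plain, unfiltered bigram counts with nltk (just a minimal sketch; the variable names are mine):

import nltk
from nltk import FreqDist

tokens = corpus[0].split()              # the corpus is a single untokenised string
bigrams = list(nltk.bigrams(tokens))    # all adjacent word pairs
fdist = FreqDist(bigrams)               # bigram -> count
print(fdist.most_common(5))             # most frequent bigrams, not yet filtered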

I have tried code that I found by searching online, but it does not give me any output. The code is as follows:

for i in corpus:
    bigrams_i = BigramCollocationFinder.from_words(corpus, window_size=5)
    bigram_j = lambda i[x] not in i
    x += 1
print(bigram_j)

Unfortunately, the output does not return what I planned to achieve.

Please advise. The output I want would contain the bigrams that include a word from word_of_interest, sorted by their probability from highest to lowest.


1 Answer
User
#1 · Posted 2024-06-07 16:11:11

You can try the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Count bigrams only; with use_idf=False the matrix holds (L2-normalised) term frequencies
vec = TfidfVectorizer(ngram_range=(2, 2), use_idf=False)

corpus = ["he is not giving up so easily but he feels lonely all the time his mental is strong and he always meet new friends to get motivation and inspiration to success he stands firm for academic integrity when he was young he hope that santa would give him more friends after he is a grown up man he stops wishing for santa clauss to arrival he and his friend always eat out but they clean their hand to remove sand first before eating"]
word_of_interest = ['santa', 'and', 'hand', 'stands', 'handy', 'sand']
matrix = vec.fit_transform(corpus).toarray()    # shape: (n_documents, n_bigrams)
vocabulary = list(vec.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

all_bigrams = []
all_frequencies = []
for word in word_of_interest:
    for bigram in vocabulary:
        if word in bigram:                       # substring match, so 'and' also matches 'hand to'
            index = vocabulary.index(bigram)     # column of this bigram in the matrix
            tuple_bigram = tuple(bigram.split(' '))
            frequency = matrix[:, index].sum()   # total frequency over all documents
            all_bigrams.append(tuple_bigram)
            all_frequencies.append(frequency)

df = pd.DataFrame({'bigram': all_bigrams, 'frequency': all_frequencies})
df.sort_values('frequency', ascending=False, inplace=True)  # highest frequency first
df.head()

The output is a pandas DataFrame showing the bigrams sorted by frequency.


The rationale here is that TfidfVectorizer counts how many times each token appears in every document of the corpus, computes the term-specific frequency, and stores that information in the column associated with that token. The index of that column is the same as the index of the associated term in the vocabulary retrieved with get_feature_names_out() (formerly get_feature_names()) on the fitted vectorizer. You then simply take the matrix holding the relative token frequencies and sum along the column of interest.
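As a tiny illustration of that correspondence, reusing the vocabulary and matrix variables from the code above with the bigram 'santa clauss', which does occur in this corpus:

# Look up the column that belongs to one particular bigram and sum it over
# all documents (the corpus above has a single document).
idx = vocabulary.index('santa clauss')
print('santa clauss', matrix[:, idx].sum())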

The double nested for loop is not ideal, though, and there may be a more efficient implementation. The issue is that the vectorizer's feature names are not tuples but a list of strings of the form ['first_token second_token', ...]. I would be glad to see a better implementation of the second half of the code above.
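One possible vectorised rewrite of that second half, offered only as a sketch (it assumes the vocabulary, matrix and word_of_interest variables defined above, keeps the same substring matching as the loops, and lists each bigram once even if it matches several words of interest):

import pandas as pd

# Build the full bigram/frequency table once, then filter it.
freq = pd.DataFrame({'bigram': vocabulary, 'frequency': matrix.sum(axis=0)})

# Keep rows whose bigram string contains any word of interest
# (substring match, exactly like the nested loops above).
pattern = '|'.join(word_of_interest)
df = freq[freq['bigram'].str.contains(pattern)].copy()

# Turn 'first second' strings into ('first', 'second') tuples and sort.
df['bigram'] = df['bigram'].str.split(' ').apply(tuple)
df = df.sort_values('frequency', ascending=False)
print(df.head())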
