Converting to a DataFrame

Posted 2024-05-08 19:33:52

When I run the program below, it prints how often each word occurs. How can I save the tokens and their counts in a DataFrame?

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 is the default and behaves the same as 0 here;
# newer scikit-learn versions may reject an integer min_df of 0
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (available since 1.0)
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

Output

2 am
1 going
2 hello
1 to
1 with

1 answer
Answer #1 · posted 2024-05-08 19:33:52

Just combine vocab and dist, then use pandas to turn them into a DataFrame.

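import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# min_df=1 is the default and behaves the same as 0 here;
# newer scikit-learn versions may reject an integer min_df of 0
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (available since 1.0)
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# zip yields (word, count) pairs, so the column names must follow that order
l = list(zip(vocab, dist))
df = pd.DataFrame(l, columns=['tag', 'count'])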
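As a variation on the same idea, here is a minimal sketch that reads the terms from the fitted `vocabulary_` attribute and sums the sparse matrix directly, so it does not depend on which `get_feature_names*` method your scikit-learn version provides. The column names `tag` and `count` follow the question; the variable names `X`, `terms`, and `counts` are just illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text = ["Hello I am going to I with hello am"]

vectorizer = CountVectorizer(max_features=50)
X = vectorizer.fit_transform(text)  # sparse matrix, shape (n_documents, n_terms)

# vocabulary_ maps each term to its column index; order the terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Sum every column without densifying the matrix, then flatten to 1-D
counts = np.asarray(X.sum(axis=0)).ravel()

# One row per term: the word in "tag", its total frequency in "count"
df = pd.DataFrame({"tag": terms, "count": counts})
print(df)
```

Note that "I" does not show up in the result: the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more characters, and everything is lowercased by default, which is why "Hello" and "hello" are counted together.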