How do I one-hot encode specific words in Pandas?

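import pandas as pd

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible', 'You are good', 'you are bad and disguisting']})

# Add one zero-filled column per toxic word
main = pd.concat([df, pd.DataFrame(columns=toxic)]).fillna(0)

# For each row, collect the toxic words that appear as whole tokens
samp = main['text'].str.split().apply(lambda x: [i for i in toxic if i in x])

# Set the matching columns to 1
for i, j in enumerate(samp):
    for k in j:
        main.loc[i, k] = 1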

pd.concat([df, main['text'].str.get_dummies(' ')[toxic]], axis=1)

                          text  bad  horrible  disguisting
0            You look horrible    0         1            0
1                 You are good    0         0            0
2  you are bad and disguisting    1         0            1

1 answer

User

#1 · Posted 2024-05-15 05:55:20

Use sklearn.feature_extraction.text.CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=toxic)

r = pd.SparseDataFrame(cv.fit_transform(df['text']), 
                       df.index,
                       cv.get_feature_names(), 
                       default_fill_value=0)

Result:

   bad  horrible  disguisting
0    0         1            0
1    0         0            0
2    1         0            1

Join the SparseDataFrame with the original DataFrame:

In [137]: r2 = df.join(r)

In [138]: r2
Out[138]:
                          text  bad  horrible  disguisting
0            You look horrible    0         1            0
1                 You are good    0         0            0
2  you are bad and disguisting    1         0            1

In [139]: r2.memory_usage()
Out[139]:
Index          80
text           24
bad             8
horrible        8
disguisting     8
dtype: int64

In [140]: type(r2)
Out[140]: pandas.core.frame.DataFrame

In [141]: type(r2['horrible'])
Out[141]: pandas.core.sparse.series.SparseSeries

In [142]: type(r2['text'])
Out[142]: pandas.core.series.Series

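PS: in older versions of Pandas, sparse columns lost their sparsity (became dense) after a SparseDataFrame was joined with a regular DataFrame. Now we can mix regular Series (columns) and SparseSeries in one DataFrame, which is a really nice feature!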
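
In pandas 1.0 and later the SparseSeries class no longer exists; a sparse column is an ordinary Series with a Sparse dtype. The following is a small sketch, under that assumption, of how dense and sparse columns can be mixed and inspected. The indicator values are copied by hand from the output above, and building them with pd.arrays.SparseArray is an illustrative choice, not the original answer's method.

```python
import pandas as pd

df = pd.DataFrame({'text': ['You look horrible',
                            'You are good',
                            'you are bad and disguisting']})

# Sparse indicator columns built by hand (values taken from the output above)
r = pd.DataFrame({
    'bad': pd.arrays.SparseArray([0, 0, 1], fill_value=0),
    'horrible': pd.arrays.SparseArray([1, 0, 0], fill_value=0),
    'disguisting': pd.arrays.SparseArray([0, 0, 1], fill_value=0),
})

# Joining on the shared index keeps each column's own dtype
r2 = df.join(r)

# The text column stays dense; the indicator columns stay sparse
print(r2.dtypes)

# Fraction of explicitly stored (non-fill) values in the sparse columns
print(r2[['bad', 'horrible', 'disguisting']].sparse.density)
```

Checking r2.dtypes is a quick way to confirm that a join or concat has not silently densified the indicator columns.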
