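scikit-learn: updating a CountVectorizer after selecting the k best features
I have a CountVectorizer with a large number of features. I would like to select the k best features from the transformed matrix and then update the CountVectorizer so that it keeps only those features. Is that possible?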
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats as ss
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
merge = re.compile(r'\*\|.+?\|\*')
def stripmerge(sub):
    # Rewrite each *|...|* merge tag as a single alphanumeric token.
    for i in merge.findall(sub):
        j = i
        j = j.replace('*|', 'mcopen')
        j = j.replace('|*', 'mcclose')
        j = re.sub(r'[^0-9a-zA-Z]', '', j)
        sub = sub.replace(i, j)
    return sub
input=pd.read_csv('subject_tool_test_23.csv')
input['subject'] = input['subject'].fillna(' ')
subjects=np.asarray([stripmerge(i) for i in input.subject])
count_vectorizer = CountVectorizer(strip_accents='unicode', ngram_range=(1,1), binary=True, stop_words='english', max_features=500)
counts=count_vectorizer.fit_transform(subjects)
#see the first output example here
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
good=np.asarray(input.unique_open_performance>0)
count_new = SelectKBest(chi2, k=250).fit_transform(counts, good)
First output example: the features make sense.
>>> counts[1]
<1x500 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> subjects[1]
"Lake Group Media's Thursday Target"
>>> count_vectorizer.inverse_transform(counts[1])
[array([u'group', u'media', u'thursday'],
dtype='<U18')]
Second output example: the features no longer match.
>>> count_new = SelectKBest(chi2, k=250).fit_transform(counts, good)
>>> count_new.shape
(992979, 250)
>>> count_new[1]
<1x250 sparse matrix of type '<type 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
>>> count_vectorizer.inverse_transform(count_new[1])
[array([u'independence', u'easy'],
dtype='<U18')]
>>> subjects[1]
"Lake Group Media's Thursday Target"
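Is there a way to apply the result of the feature selection to my CountVectorizer, so that I can generate new vectors containing only the important features?
3 Answers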
1
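I think this is what you are looking for. It is a modified SelectKBest object that can transform either a vocabulary (a dict mapping terms to column indices) or a CountVectorizer object, updating its vocabulary. That way you do not need to re-extract all the features.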
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

class CustomSelectKBest(SelectKBest):
    """
    Extending SelectKBest with the ability to update a vocabulary that is given
    from a CountVectorizer object.
    """
    def __init__(self, score_func=f_classif, k=10):
        super().__init__(score_func=score_func, k=k)

    def transform_vocabulary(self, vocabulary):
        # Indices of the selected columns, in ascending order.
        mask = self.get_support(True)
        # Map each old column index to its position in the reduced matrix.
        i_map = {j: i for i, j in enumerate(mask)}
        return {k: i_map[i] for k, i in vocabulary.items() if i in i_map}

    def transform_vectorizer(self, cv):
        cv.vocabulary_ = self.transform_vocabulary(cv.vocabulary_)

if __name__ == '__main__':
    def score_func(X, y):
        # Fake scores and p-values
        return (np.arange(X.shape[1]), np.zeros(X.shape[1]))

    # Create test data.
    size = (4, 10)
    X = np.random.randint(0, 5, size=size)
    y = np.random.randint(2, size=size[0])
    vocabulary = {chr(i + ord('a')): i for i in range(size[1])}

    skb = CustomSelectKBest(score_func=score_func, k=5)
    X_s = skb.fit_transform(X, y)
    vocab_s = skb.transform_vocabulary(vocabulary)

    # Confirm they have the right values.
    for k, i_s in vocab_s.items():
        i = vocabulary[k]
        assert (X_s[:, i_s] == X[:, i]).all()
    print('Test passed')
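As a rough check of the same idea on a real CountVectorizer, the sketch below remaps `vocabulary_` after a plain SelectKBest fit and compares the result with the selector's own output. The documents and labels are made up for illustration, and overwriting `vocabulary_` on a fitted vectorizer relies on how `transform` happens to read that attribute rather than on a documented contract, so treat it as an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data, invented for this sketch.
docs = [
    "cheap pills online",
    "cheap watches online",
    "meeting agenda attached",
    "project meeting notes",
]
labels = np.array([1, 1, 0, 0])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

selector = SelectKBest(chi2, k=3).fit(counts, labels)
reduced = selector.transform(counts)

# Map each kept column's old index to its index in the reduced matrix.
kept = selector.get_support(indices=True)
old_to_new = {int(old): new for new, old in enumerate(kept)}

# Keep only the selected terms and renumber their columns.
vectorizer.vocabulary_ = {
    term: old_to_new[idx]
    for term, idx in vectorizer.vocabulary_.items()
    if idx in old_to_new
}

# The updated vectorizer should now reproduce the reduced matrix directly.
same = (vectorizer.transform(docs) != reduced).nnz == 0
print(sorted(vectorizer.vocabulary_), same)
```

Because `get_support(indices=True)` returns the kept indices in ascending order, the renumbered columns line up with the columns that `selector.transform` produces.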
2
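Using a Pipeline can make this simpler. The Pipeline applies the same steps to the test data automatically, so you do not need to rebuild the vectorizer by hand.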
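from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# X_train and X_test hold raw text documents; y_train and y_test hold their labels.
text_clf_red = Pipeline([('vect', CountVectorizer()),
                         ('reducer', SelectKBest(chi2, k=3000)),
                         ('clf', MultinomialNB())
                         ])
text_clf_red.fit(X_train, y_train)
y_test_pred = text_clf_red.predict(X_test)
metrics.accuracy_score(y_test, y_test_pred)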
4
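I solved this by running the feature selection first, working out which columns of the original matrix were selected, and building a vocabulary from those columns. I then ran a new CountVectorizer restricted to that vocabulary. It takes more time on large data sets, but it works.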
ch2 = SelectKBest(chi2, k=3000)
count_new = ch2.fit_transform(counts, good)
# Terms whose columns were selected (on scikit-learn older than 1.0, call get_feature_names() instead).
vocab = np.asarray(count_vectorizer.get_feature_names_out())[ch2.get_support()]
count_vectorizer = CountVectorizer(strip_accents='unicode', ngram_range=(1,1), binary=True, vocabulary=vocab)