scikit-learn: updating a CountVectorizer after selecting the k best features

1 vote
3 answers
5001 views
Asked 2025-04-18 14:38

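I have a CountVectorizer with a large number of features. I want to select the k best features from the transformed output and then update the CountVectorizer so that it keeps only those features. Is this possible?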

import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats as ss
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Matches merge tags of the form *|...|*
merge = re.compile(r'\*\|.+?\|\*')

def stripmerge(sub):
    # Collapse each merge tag into a single alphanumeric token
    for i in merge.findall(sub):
        j = i
        j = j.replace('*|', 'mcopen')
        j = j.replace('|*', 'mcclose')
        j = re.sub(r'[^0-9a-zA-Z]', '', j)
        sub = sub.replace(i, j)
    return sub

# Named `data` rather than `input` to avoid shadowing the builtin
data = pd.read_csv('subject_tool_test_23.csv')
data['subject'] = data['subject'].fillna(' ')


subjects = np.asarray([stripmerge(i) for i in data.subject])
count_vectorizer = CountVectorizer(strip_accents='unicode', ngram_range=(1,1), binary=True, stop_words='english', max_features=500)
counts = count_vectorizer.fit_transform(subjects)

#see the first output example here

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Binary target: True where the subject line had any unique opens
good = np.asarray(data.unique_open_performance > 0)

count_new = SelectKBest(chi2, k=250).fit_transform(counts, good)

In the first output example, the features make sense.

>>> counts[1]
<1x500 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> subjects[1]
"Lake Group Media's Thursday Target"
>>> count_vectorizer.inverse_transform(counts[1])
[array([u'group', u'media', u'thursday'], 
      dtype='<U18')]

In the second output example, the features no longer match.

>>> count_new = SelectKBest(chi2, k=250).fit_transform(counts, good)
>>> count_new.shape
(992979, 250)
>>> count_new[1]
<1x250 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
>>> count_vectorizer.inverse_transform(count_new[1])
[array([u'independence', u'easy'], 
      dtype='<U18')]
>>> subjects[1]
"Lake Group Media's Thursday Target"

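Is there a way to apply the feature selection results to my CountVectorizer so that I can generate new vectors containing only the important features?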

3 Answers

1

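I think this is what you are looking for. It is a modified SelectKBest object that can transform either a vocabulary object (a dict mapping terms to column indices) or a CountVectorizer object, updating its vocabulary. That way you do not need to re-extract all the features.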

from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

class CustomSelectKBest(SelectKBest):
  """
    Extending SelectKBest with the ability to update a vocabulary that is given
    from a CountVectorizer object.
  """
  def __init__(self, score_func=f_classif, k=10):
    # Pass by keyword: recent scikit-learn versions make k keyword-only.
    super().__init__(score_func=score_func, k=k)

  def transform_vocabulary(self, vocabulary):
    # Indices of the selected columns, in their original order.
    selected = self.get_support(indices=True)
    # Map each old column index to its position in the reduced matrix.
    i_map = {j: i for i, j in enumerate(selected)}
    return {k: i_map[i] for k, i in vocabulary.items() if i in i_map}

  def transform_vectorizer(self, cv):
    # Overwrite the fitted vocabulary in place.
    cv.vocabulary_ = self.transform_vocabulary(cv.vocabulary_)

if __name__ == '__main__':
  def score_func(X, y):
    # Fake scores and p-values
    return (np.arange(X.shape[1]), np.zeros(X.shape[1]))

  # Create test data.
  size = (4, 10)
  X = (np.random.randint(0, 5, size=size))
  y = np.random.randint(2, size=size[0])
  vocabulary = {chr(i + ord('a')): i for i in range(size[1])}

  skb = CustomSelectKBest(score_func=score_func, k=5)
  X_s = skb.fit_transform(X, y)
  vocab_s = skb.transform_vocabulary(vocabulary)

  # Confirm they have the right values.
  for k, i_s in vocab_s.items():
    i = vocabulary[k]
    assert (X_s[:, i_s].T == X[:, i].T).all()

  print('Test passed')
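The same index remapping can also be tried directly on a fitted CountVectorizer, without subclassing. The following is only a minimal sketch: the toy corpus, the labels, and the variable names are made up for illustration, and it relies on `transform()` reading the `vocabulary_` attribute, which is how current scikit-learn releases behave.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels, invented purely for illustration.
docs = [
    "cheap pills offer",
    "cheap offer today",
    "meeting agenda today",
    "project meeting notes",
]
labels = np.array([1, 1, 0, 0])

vectorizer = CountVectorizer(binary=True)
counts = vectorizer.fit_transform(docs)

selector = SelectKBest(chi2, k=3)
reduced = selector.fit_transform(counts, labels)

# Old column index -> new column index, for the kept columns only.
kept = selector.get_support(indices=True)
old_to_new = {int(old): new for new, old in enumerate(kept)}

# Rewrite the fitted vocabulary so it only contains the kept terms.
vectorizer.vocabulary_ = {
    term: old_to_new[int(idx)]
    for term, idx in vectorizer.vocabulary_.items()
    if int(idx) in old_to_new
}

# transform() now produces the reduced columns directly from raw text.
direct = vectorizer.transform(docs)
print(sorted(vectorizer.vocabulary_))
print((direct.toarray() == reduced.toarray()).all())
```

After the remap, `inverse_transform` on a row of the reduced matrix should return terms that actually occur in the corresponding document, which is the mismatch described in the question.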
2

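Using a Pipeline makes this simpler. The Pipeline automatically applies the same processing to the test data, so you do not need to recreate the vectorizer by hand.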

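from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# X_train, y_train, X_test, y_test are assumed to be defined already
# (raw text documents and their labels).
text_clf_red = Pipeline([('vect', CountVectorizer()),
                       ('reducer', SelectKBest(chi2, k=3000)),
                       ('clf', MultinomialNB())
                       ])

text_clf_red.fit(X_train, y_train)
y_test_pred = text_clf_red.predict(X_test)
metrics.accuracy_score(y_test, y_test_pred)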
4

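I solved this by running feature selection first, working out which columns of the original data were selected, and collecting the corresponding terms into a vocabulary. I then used that vocabulary to build a new CountVectorizer. It takes more time on large datasets, but it works.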

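ch2 = SelectKBest(chi2, k=3000)

count_new = ch2.fit_transform(counts, good)
# Terms whose columns the selector kept (use get_feature_names() on older scikit-learn)
selected_terms = np.asarray(count_vectorizer.get_feature_names_out())[ch2.get_support()]
# A fixed vocabulary makes the new vectorizer emit only the selected columns
count_vectorizer = CountVectorizer(strip_accents='unicode', ngram_range=(1,1), binary=True, vocabulary=selected_terms)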
