Fast information gain computation


I need to compute information gain scores for 100k features in 10k documents. The code below works, but it is very slow on the full dataset: it takes more than an hour on a laptop. The dataset is 20 newsgroups and I am using scikit-learn. The <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html" rel="noreferrer">chi2</a> function provided in scikit runs extremely fast. Do you know how to compute information gain faster for a dataset like this?

<pre><code>import numpy as np


def information_gain(x, y):

    def _entropy(values):
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        return - np.sum(probs * np.log(probs))

    def _information_gain(feature, y):
        # Documents in which the feature is present / absent
        feature_set_indices = np.nonzero(feature)[1]
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
        entropy_x_set = _entropy(y[feature_set_indices])
        entropy_x_not_set = _entropy(y[feature_not_set_indices])

        return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                 + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

    feature_size = x.shape[0]
    feature_range = range(0, feature_size)
    entropy_before = _entropy(y)
    information_gain_scores = []

    # One row of x.T per feature
    for feature in x.T:
        information_gain_scores.append(_information_gain(feature, y))
    return information_gain_scores, []
</code></pre>

Edit:

I merged the inner functions and ran <code>cProfile</code> as follows (on a dataset limited to ~15k features and ~1k documents):

<pre><code>cProfile.runctx(
    """for feature in x.T:
    feature_set_indices = np.nonzero(feature)[1]
    feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]

    values = y[feature_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_set = - np.sum(probs * np.log(probs))

    values = y[feature_not_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_not_set = - np.sum(probs * np.log(probs))

    result = entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                               + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))
    information_gain_scores.append(result)""",
    globals(), locals())
</code></pre>

Top 20 results by <code>tottime</code>:

<pre><code>ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    60.27    60.27    65.48    65.48  <string>:1(<module>)
 16171    1.362        0    2.801        0  csr.py:313(_get_row_slice)
 16171    0.523        0    0.892        0  coo.py:201(_check)
 16173    0.394        0     0.89        0  compressed.py:101(check_format)
210235    0.297        0    0.297        0  {numpy.core.multiarray.array}
 16173    0.287        0    0.331        0  compressed.py:631(prune)
 16171    0.197        0    1.529        0  compressed.py:534(tocoo)
 16173    0.165        0    1.263        0  compressed.py:20(__init__)
 16171    0.139        0    1.669        0  base.py:415(nonzero)
 16171    0.124        0    1.201        0  coo.py:111(__init__)
 32342    0.123        0    0.123        0  {method 'max' of 'numpy.ndarray' objects}
 48513    0.117        0    0.218        0  sputils.py:93(isintlike)
 32342    0.114        0    0.114        0  {method 'sum' of 'numpy.ndarray' objects}
 16171    0.106        0    3.081        0  csr.py:186(__getitem__)
 32342    0.105        0    0.105        0  {numpy.lib._compiled_base.bincount}
 32344     0.09        0    0.094        0  base.py:59(set_shape)
210227    0.088        0    0.088        0  {isinstance}
 48513    0.081        0    1.777        0  fromnumeric.py:1129(nonzero)
 32342    0.078        0    0.078        0  {method 'min' of 'numpy.ndarray' objects}
 97032    0.066        0    0.153        0  numeric.py:167(asarray)
</code></pre>

It looks like most of the time is spent in <code>_get_row_slice</code>. I am not entirely sure about the first row: it seems to cover the whole block I passed to <code>cProfile.runctx</code>, but I don't know why there is such a large gap between the first row and the second. Where is that difference spent? Is it possible to check this in <code>cProfile</code>?

Basically, it looks like the problem lies in the sparse matrix operations (slicing, getting elements). The solution would probably be to compute information gain using matrix algebra (like <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/univariate_selection.py#L154" rel="noreferrer">chi2 is implemented in scikit</a>). But I have no idea how to express this computation in terms of matrix operations. Does anyone have an idea?
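One plausible reading of the gap between the first two rows: the list comprehension that builds feature_not_set_indices runs inline in the profiled string, and each "i not in feature_set_indices" test scans a NumPy array, so that cost is charged to the module row rather than to any named function. A minimal sketch of building the complement with a boolean mask instead, assuming the set indices are a 1-D integer array (complement_indices is a hypothetical helper, not part of the original code):

```python
import numpy as np


def complement_indices(n_docs, set_indices):
    """Return the indices in range(n_docs) that are NOT in set_indices.

    Hypothetical helper: uses a boolean mask, so the cost is linear in
    n_docs instead of one array scan per candidate index.
    """
    mask = np.ones(n_docs, dtype=bool)
    mask[set_indices] = False      # knock out documents containing the feature
    return np.flatnonzero(mask)    # remaining positions, in ascending order
```

Inside the loop this could replace the comprehension, for example feature_not_set_indices = complement_indices(feature_size, feature_set_indices).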

1 answer


I don't know whether this still helps now that a year has passed, but I happen to be facing the same text classification task. I rewrote your code using the <a href="http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.sparse.csr_matrix.nonzero.html" rel="noreferrer">nonzero()</a> function provided for sparse matrices. I then scan nz, count the corresponding y values, and compute the entropy.

The following code needs only a few seconds to run on the news20 dataset (loaded using the libsvm sparse matrix format).

<pre><code>import numpy as np


def information_gain(X, y):

    def _calIg():
        # Entropy of the classes among documents with / without the current feature
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    # Class counts and entropy over the whole dataset
    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    # nz[0] holds feature indices, nz[1] the matching document indices
    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            # Features that never appear get a score of 0
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)

    return np.asarray(information_gain)
</code></pre>
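The question also asks how the computation could be expressed with matrix operations. The following is a rough sketch of one way to do that, not the method used in the answer above. It assumes features are treated as present/absent (as in the question's code), that y is a 1-D array of class labels, and that a dense array of shape (n_features, n_classes) fits in memory. The names information_gain_vectorized and _row_entropy are hypothetical.

```python
import numpy as np
from scipy import sparse


def _row_entropy(counts):
    """Natural-log entropy of each row of a non-negative count matrix."""
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no documents get all-zero probabilities (entropy 0).
    probs = np.divide(counts, totals,
                      out=np.zeros_like(counts, dtype=float),
                      where=totals > 0)
    # Treat 0 * log(0) as 0 by only taking logs of positive entries.
    logs = np.log(probs, out=np.zeros_like(probs), where=probs > 0)
    return -(probs * logs).sum(axis=1)


def information_gain_vectorized(X, y):
    """Information gain of feature presence for every column of sparse X."""
    X = sparse.csr_matrix(X)
    y = np.asarray(y).ravel()
    n_docs = X.shape[0]

    # One-hot class indicator matrix, shape (n_docs, n_classes).
    classes, y_idx = np.unique(y, return_inverse=True)
    Y = sparse.csr_matrix(
        (np.ones(n_docs), (np.arange(n_docs), y_idx)),
        shape=(n_docs, len(classes)))

    # Binary presence matrix; one sparse product then gives, per feature,
    # the number of documents of each class that contain it.
    present = (X != 0).astype(np.float64)
    counts_set = (present.T @ Y).toarray()        # (n_features, n_classes)

    class_tot = np.bincount(y_idx, minlength=len(classes)).astype(float)
    counts_not = class_tot - counts_set           # documents without the feature

    n_set = counts_set.sum(axis=1)
    n_not = n_docs - n_set

    entropy_before = _row_entropy(class_tot[np.newaxis, :])[0]
    conditional = (n_set / n_docs) * _row_entropy(counts_set) \
        + (n_not / n_docs) * _row_entropy(counts_not)
    return entropy_before - conditional
```

In this sketch, features that occur in every document or in none get a score of 0, and there is no per-feature Python loop. Whether it is faster than the loop above on a given dataset would need to be measured.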
