A toolbox for tackling imbalanced datasets in machine learning.
imbalanced-learn
imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of scikit-learn-contrib projects.
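To illustrate what such a re-sampling technique does, here is a toy pure-Python sketch of a random under-sampler exposing the same `fit_resample(X, y)` method that imbalanced-learn's samplers provide. The class name and internals below are illustrative only, not the library's actual code:

```python
import random

class ToyRandomUnderSampler:
    """Mimics the fit_resample(X, y) interface of imbalanced-learn samplers."""

    def __init__(self, seed=0):
        self.seed = seed

    def fit_resample(self, X, y):
        rng = random.Random(self.seed)
        # Group sample indices by class label.
        by_class = {}
        for i, label in enumerate(y):
            by_class.setdefault(label, []).append(i)
        # Keep as many samples per class as the smallest class has.
        n_min = min(len(idx) for idx in by_class.values())
        keep = []
        for idx in by_class.values():
            keep.extend(rng.sample(idx, n_min))
        keep.sort()
        return [X[i] for i in keep], [y[i] for i in keep]

X = [[i] for i in range(12)]
y = [0] * 9 + [1] * 3          # 9 majority vs. 3 minority samples
X_res, y_res = ToyRandomUnderSampler().fit_resample(X, y)
print(sorted(y_res))           # [0, 0, 0, 1, 1, 1] -- now balanced
```

The real samplers follow the same pattern: construct a sampler object, then call `fit_resample` to obtain a re-balanced copy of the data.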
Documentation
Installation documentation, API documentation, and examples can be found in the documentation.
Installation
Dependencies
imbalanced-learn is tested to work under Python 3.6+. The dependency requirements are based on the latest scikit-learn release:
- scipy (>=0.17)
- numpy (>=1.11)
- scikit-learn (>=0.21)
- joblib (>=0.11)
- keras 2 (optional)
- tensorflow (optional)
Additionally, to run the examples, matplotlib (>=2.0.0) and pandas (>=0.22) are required.
Installation
imbalanced-learn is currently available on PyPI and can be installed via pip:
pip install -U imbalanced-learn
The package is also released on the Anaconda Cloud platform:
conda install -c conda-forge imbalanced-learn
If you prefer, you can clone the repository and run the setup.py file. Use the following commands to get a copy from GitHub and install all dependencies:

git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git
cd imbalanced-learn
pip install .
Or install directly from GitHub with pip:
pip install -U git+https://github.com/scikit-learn-contrib/imbalanced-learn.git
Testing
After installation, you can run the test suite with pytest:
make coverage
发展
这套科学仪器的研制与 在scikit学习社区。因此,您可以参考 Development Guide。
About
If you use imbalanced-learn in a scientific publication, we would appreciate a citation of the following paper:
@article{JMLR:v18:16-365,
  author  = {Guillaume Lema{{\^i}}tre and Fernando Nogueira and Christos K. Aridas},
  title   = {Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},
  journal = {Journal of Machine Learning Research},
  year    = {2017},
  volume  = {18},
  number  = {17},
  pages   = {1-5},
  url     = {http://jmlr.org/papers/v18/16-365}
}
Most classification algorithms will only perform optimally when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more other classes, have proven to be a challenge while at the same time becoming more and more common.
One way of addressing this issue is to re-sample the dataset so as to offset the imbalance, in the hope of arriving at a more robust and fair decision boundary than would otherwise be reached.
Re-sampling techniques fall into the following categories:
- Under-sampling the majority class(es).
- Over-sampling the minority class.
- Combining over- and under-sampling.
- Creating ensemble balanced sets.
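The second category can likewise be sketched in a few lines of plain Python. This toy over-sampler (an illustration under assumed names, not imbalanced-learn's implementation) duplicates minority samples at random until every class matches the majority count:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority samples (with replacement) until classes balance."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_max = max(len(idx) for idx in by_class.values())
    X_res, y_res = list(X), list(y)
    for label, idx in by_class.items():
        # Draw with replacement to make up each class's deficit.
        for i in rng.choices(idx, k=n_max - len(idx)):
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2          # 8 majority vs. 2 minority samples
X_res, y_res = random_oversample(X, y)
print(y_res.count(0), y_res.count(1))  # 8 8
```

Plain duplication is the simplest over-sampling strategy; methods such as SMOTE instead synthesize new minority samples by interpolating between neighbors.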
The methods currently implemented in this module, together with examples of the different algorithms, are presented in the sphinx-gallery of the documentation.