A toolbox for imbalanced datasets in machine learning.

Detailed description of the imbalanced-learn Python project



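imbalanced-learn

imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used on datasets that show strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.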

Documentation

Installation instructions, API documentation, and examples can be found in the project documentation.

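Installation

Dependencies

imbalanced-learn is tested to work under Python 3.6+. The dependency requirements are based on the latest scikit-learn release:

  • scipy(>=0.17)
  • numpy(>=1.11)
  • scikit-learn(>=0.21)
  • joblib(>=0.11)
  • keras 2 (optional)
  • tensorflow (optional)

Additionally, to run the examples, you need matplotlib(>=2.0.0) and pandas(>=0.22).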
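As a quick sanity check before installing, the required minimum versions listed above can be compared against what is already present in the environment. The following is only a sketch: it assumes Python 3.8+ (for `importlib.metadata`), and `MINIMUMS`, `parse_version`, and `check_requirements` are hypothetical helper names, not part of imbalanced-learn. The version parsing is deliberately naive and may misjudge unusual version strings.

```python
import re
from importlib import metadata  # in the standard library from Python 3.8

# Minimum versions copied from the dependency list above.
MINIMUMS = {
    "scipy": (0, 17),
    "numpy": (1, 11),
    "scikit-learn": (0, 21),
    "joblib": (0, 11),
}


def parse_version(text):
    """Turn '1.11.0rc1' into (1, 11, 0); a deliberately simplistic parser."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def check_requirements(minimums=MINIMUMS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```

pip resolves these requirements on its own, so a check like this is mainly useful when diagnosing an existing environment.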

Installation

imbalanced-learn is currently available from the PyPI repository, and you can install it via pip:

pip install -U imbalanced-learn

The package is also released on the Anaconda Cloud platform:

conda install -c conda-forge imbalanced-learn

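If you prefer, you can clone the repository and install it from source. Use the following commands to get a copy from GitHub and install all dependencies: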

git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git
cd imbalanced-learn
pip install .

Or install directly from GitHub using pip:

pip install -U git+https://github.com/scikit-learn-contrib/imbalanced-learn.git

Testing

After installation, you can use pytest to run the test suite:

make coverage

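Development

The development of this scikit-learn-contrib project is in line with that of the scikit-learn community. Therefore, you can refer to their Development Guide.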

About

If you use imbalanced-learn in a scientific publication, we would appreciate citations to the following paper:

@article{JMLR:v18:16-365,
author  = {Guillaume  Lema{{\^i}}tre and Fernando Nogueira and Christos K. Aridas},
title   = {Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},
journal = {Journal of Machine Learning Research},
year    = {2017},
volume  = {18},
number  = {17},
pages   = {1-5},
url     = {http://jmlr.org/papers/v18/16-365}
}

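Most classification algorithms only perform optimally when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more other classes, have proven to be a challenge while at the same time becoming more and more common.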
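How skewed a dataset is can be summarised by its class counts and the ratio between the largest and smallest class. The snippet below is a minimal standard-library sketch; `imbalance_ratio` is a hypothetical helper name and the labels are made-up toy data.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (class counts, majority/minority size ratio) for a label sequence."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Hypothetical toy labels: 95 negatives and 5 positives.
y = [0] * 95 + [1] * 5
counts, ratio = imbalance_ratio(y)
print(dict(counts), ratio)
```

A ratio close to 1 indicates a balanced problem; the larger the ratio, the more the minority class is outnumbered.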

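One way of addressing this issue is to re-sample the dataset so as to offset the imbalance, in the hope of arriving at a more robust and fair decision boundary than you would otherwise obtain.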

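Re-sampling techniques are divided into four categories:
  1. Under-sampling the majority class(es).
  2. Over-sampling the minority class.
  3. Combining over- and under-sampling.
  4. Creating ensemble balanced sets.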
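The first two categories can be illustrated with their simplest random variants. This is a standard-library sketch of the idea only, not the package's implementation: the function names are hypothetical, the data is a toy example, and the under-sampler here draws without replacement.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(X, y):
    """Collect the feature rows of each class label."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return groups


def random_under_sample(X, y, seed=0):
    """Shrink every class to the size of the smallest one (without replacement)."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_res, y_res = [], []
    for label, rows in groups.items():
        for features in rng.sample(rows, target):
            X_res.append(features)
            y_res.append(label)
    return X_res, y_res


def random_over_sample(X, y, seed=0):
    """Grow every class to the size of the largest one by drawing with replacement."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_res, y_res = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_res.append(features)
            y_res.append(label)
    return X_res, y_res


# Toy data: 10 majority samples (label 0) and 2 minority samples (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2
_, y_under = random_under_sample(X, y)
_, y_over = random_over_sample(X, y)
print(Counter(y_under), Counter(y_over))
```

Under-sampling discards majority information, while over-sampling duplicates minority points; the combined and ensemble categories try to soften both drawbacks.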

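Below is a list of the methods currently implemented in this module.

  • Under-sampling
    1. Random majority under-sampling with replacement
    2. Extraction of majority-minority Tomek links [1]
    3. Under-sampling with Cluster Centroids
    4. NearMiss-(1 & 2 & 3) [2]
    5. Condensed Nearest Neighbour [3]
    6. One-Sided Selection [4]
    7. Neighbourhood Cleaning Rule [5]
    8. Edited Nearest Neighbours [6]
    9. Instance Hardness Threshold [7]
    10. Repeated Edited Nearest Neighbours [14]
    11. AllKNN [14]
  • Over-sampling
    1. Random minority over-sampling with replacement
    2. SMOTE - Synthetic Minority Over-sampling Technique [8]
    3. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2 [9]
    4. SVM SMOTE - Support Vectors SMOTE [10]
    5. ADASYN - Adaptive synthetic sampling approach for imbalanced learning [15]
  • Over-sampling followed by under-sampling
    1. SMOTE + Tomek links [12]
    2. SMOTE + ENN [11]
  • Ensemble classifier using samplers internally
    1. EasyEnsemble [13]
    2. BalanceCascade [13]
    3. Balanced Random Forest [16]
    4. Balanced Bagging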
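To give a feel for how the SMOTE family in the list above creates new points, here is a simplified standard-library sketch of the core interpolation step described in [8]: pick a minority sample, pick one of its nearest minority neighbours, and place a synthetic point somewhere on the segment between them. The name `smote_like_samples` and its parameters are hypothetical, it assumes Python 3.8+ for `math.dist`, and it leaves out much of what a real implementation handles.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.5, 1.2], [1.2, 1.8], [2.0, 2.0]]
new_points = smote_like_samples(minority, n_new=5)
print(len(new_points))
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region the minority class already occupies instead of merely duplicating points.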

The different algorithms are presented in the sphinx-gallery.

References:
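As a small illustration of one of the under-sampling methods listed earlier, a Tomek link [1] is a pair of samples from different classes that are each other's nearest neighbour. The sketch below finds such pairs by brute force and drops the majority-class member of each link. It uses only the standard library (Python 3.8+ for `math.dist`); the function names and toy data are hypothetical, and the majority label is assumed to be known.

```python
import math


def nearest_neighbour(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min(
        (j for j in range(len(X)) if j != i),
        key=lambda j: math.dist(X[i], X[j]),
    )


def tomek_links(X, y):
    """Return pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    nn = [nearest_neighbour(i, X) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Toy one-dimensional data: class 0 is the majority.
X = [[0.0], [0.1], [1.0], [1.1], [5.0], [5.2], [1.3]]
y = [0, 0, 0, 1, 0, 0, 1]
links = tomek_links(X, y)

majority_label = 0  # assumed known for this toy data
drop = {i if y[i] == majority_label else j for i, j in links}
X_clean = [x for idx, x in enumerate(X) if idx not in drop]
print(links, len(X_clean))
```

Removing the majority sample of each link cleans up the class boundary without touching the scarce minority points.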

[1]: I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976.
[2]: I. Mani, J. Zhang, “kNN approach to unbalanced data distributions: A case study involving information extraction,” In Proceedings of the Workshop on Learning from Imbalanced Data Sets, pp. 1-7, 2003.
[3]: P. E. Hart, “The condensed nearest neighbor rule,” IEEE Transactions on Information Theory, vol. 14(3), pp. 515-516, 1968.
[4]: M. Kubat, S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” In Proceedings of the 14th International Conference on Machine Learning, vol. 97, pp. 179-186, 1997.
[5]: J. Laurikkala, “Improving identification of difficult small classes by balancing class distribution,” Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe, pp. 63-66, 2001.
[6]: D. Wilson, “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2(3), pp. 408-421, 1972.
[7]: M. R. Smith, T. Martinez, C. Giraud-Carrier, “An instance level analysis of data complexity,” Machine learning, vol. 95(2), pp. 225-256, 2014.
[8]: N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[9]: H. Han, W.-Y. Wang, B.-H. Mao, “Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,” In Proceedings of the 1st International Conference on Intelligent Computing, pp. 878-887, 2005.
[10]: H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline over-sampling for imbalanced data classification,” In Proceedings of the 5th International Workshop on Computational Intelligence and Applications, pp. 24-29, 2009.
[11]: G. E. A. P. A. Batista, R. C. Prati, M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM Sigkdd Explorations Newsletter, vol. 6(1), pp. 20-29, 2004.
[12]: G. E. A. P. A. Batista, A. L. C. Bazzan, M. C. Monard, “Balancing training data for automated annotation of keywords: A case study,” In Proceedings of the 2nd Brazilian Workshop on Bioinformatics, pp. 10-18, 2003.
[13]: X.-Y. Liu, J. Wu, Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 39(2), pp. 539-550, 2009.
[14]: I. Tomek, “An experiment with the edited nearest-neighbor rule,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), pp. 448-452, 1976.
[15]: H. He, Y. Bai, E. A. Garcia, S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” In Proceedings of the 5th IEEE International Joint Conference on Neural Networks, pp. 1322-1328, 2008.
[16]: C. Chen, A. Liaw, L. Breiman, “Using random forest to learn imbalanced data,” University of California, Berkeley, vol. 110, pp. 1-12, 2004.
