Python autoimpute package - PyPI

Imputation methods in Python

Detailed description of the autoimpute Python project

Autoimpute is a Python package for analyzing and implementing imputation methods!

View our website to explore Autoimpute in more detail.

Check out our docs for the Autoimpute developer guide.

Installation

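Autoimpute is now registered with PyPI! Download it with pip install autoimpute
The latest version of Autoimpute is 0.11.4
If pip cached an older version, try pip install --no-cache-dir --upgrade autoimpute
If you want to work with the development branch, use the script below: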

Development

git clone -b dev --single-branch https://github.com/kearnz/autoimpute.git
cd autoimpute
python setup.py install

Motivation

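Most machine learning algorithms expect clean and complete datasets, but real-world data is messy and missing. Unfortunately, handling missing data is quite complex, so programming languages generally punt this responsibility to the end user. By default, R drops all records with missing data, a method that is easy to implement but often problematic in practice. For richer imputation strategies, R has multiple packages to deal with missing data (MICE, Amelia, TSImpute, etc.). Python users are not as fortunate. Python's scikit-learn throws a runtime error when an end user deploys models on datasets with missing records, and few third-party packages exist to handle imputation end-to-end.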

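Therefore, this package aids the Python user by providing a clearer imputation process, making imputation methods more accessible, and measuring the impact imputation methods have in supervised regression and classification. In doing so, this package brings missing-data imputation methods to the Python world and makes them work nicely in Python machine learning projects (specifically ones that utilize scikit-learn). Lastly, this package provides its own implementation of supervised machine learning methods that extend both scikit-learn and statsmodels to multiply imputed datasets.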

Main Features

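Utility functions to examine patterns in missing data and decide on relevant features for imputation
Missingness classifier and automatic missing data test set generator
Native handling of categorical variables (as predictors and targets of imputation)
Single and multiple imputation classes for pandas DataFrames
Custom visualization support for utility functions and imputation methods
Analysis methods and pooled parameter inference using multiply imputed datasets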
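The first feature above, examining patterns in missing data, can be illustrated with a minimal standard-library sketch. It is not Autoimpute's utility API: the helper names `percent_missing` and `missing_patterns` and the toy `rows` data are assumptions made for illustration, and missing values are represented as `None`.

```python
from collections import Counter

def percent_missing(rows):
    """Percentage of missing (None) values per column."""
    columns = rows[0].keys()
    return {c: 100 * sum(r[c] is None for r in rows) / len(rows) for c in columns}

def missing_patterns(rows):
    """Count rows by the set of columns that are missing in them."""
    return Counter(tuple(c for c, v in r.items() if v is None) for r in rows)

# Toy dataset (hypothetical): four records, two columns
rows = [
    {"age": 20, "salary": None},
    {"age": None, "salary": None},
    {"age": 30, "salary": 50},
    {"age": 40, "salary": None},
]

print(percent_missing(rows))
print(missing_patterns(rows))
```

A column that is missing far more often than the others, or that is only missing together with another column, is a hint about which features are worth using as predictors during imputation.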
The table below lists the supported imputation methods:

Imputation Methods Supported

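| Univariate | Multivariate | Time Series / Interpolation |
| Mean | Linear Regression | Linear |
| Median | Binomial Logistic Regression | Quadratic |
| Mode | Multinomial Logistic Regression | Cubic |
| Random | Stochastic Regression | Polynomial |
| Norm | Bayesian Linear Regression | Spline |
| Categorical | Bayesian Binary Logistic Regression | Time-weighted |
| | Predictive Mean Matching | Next Obs Carried Backward |
| | Local Residual Draws | Last Obs Carried Forward |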
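To make a few of these method names concrete, here is a small standard-library sketch of three univariate strategies (mean, median, mode) and two time-series strategies (last observation carried forward, next observation carried backward). It only illustrates the ideas and is not how Autoimpute implements them; the function names and the `ages` data are assumptions, and missing values are represented as `None`.

```python
from statistics import mean, median, mode

def impute_univariate(values, strategy="mean"):
    """Fill None entries with one summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def locf(values):
    """Last observation carried forward; leading gaps stay missing."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def nocb(values):
    """Next observation carried backward; trailing gaps stay missing."""
    return locf(values[::-1])[::-1]

ages = [20, None, 30, 30, None, 40]
print(impute_univariate(ages, "mean"))
print(locf(ages))
print(nocb(ages))
```

The univariate strategies ignore ordering, while the carried-forward and carried-backward strategies depend on it, which is why they belong in the time-series column.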
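Todo
Additional cross-sectional methods, including random forest, KNN, EM, and maximum likelihood
Additional time-series methods, including EWMA, ARIMA, Kalman filters, and state-space models
Extended support for visualization of missing data patterns, imputation methods, and analysis models
Additional support for analysis metrics and analysis models after multiple imputation
Multiprocessing and GPU support for larger datasets, as well as integration with dask DataFrames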
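Example Usage
Autoimpute is designed to be user friendly and flexible. When performing imputation, Autoimpute fits directly into scikit-learn machine learning projects. Imputers inherit from sklearn's BaseEstimator and TransformerMixin and implement fit and transform methods, making them valid Transformers in an sklearn pipeline.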
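The fit/transform contract mentioned above can be sketched without any dependencies. The class below is a hypothetical mean imputer written for illustration only; it is neither Autoimpute's code nor a real scikit-learn transformer (a real one would inherit from BaseEstimator and TransformerMixin, which supplies fit_transform). Rows are plain dicts and missing values are `None`.

```python
from statistics import mean

class MeanImputerSketch:
    """Hypothetical transformer: learn column means in fit, fill gaps in transform."""

    def fit(self, rows):
        # Learn one statistic per column from the observed (non-None) values
        self.statistics_ = {
            col: mean(r[col] for r in rows if r[col] is not None)
            for col in rows[0].keys()
        }
        return self  # returning self allows method chaining

    def transform(self, rows):
        # Apply the learned statistics to any dataset with the same columns
        return [
            {col: self.statistics_[col] if val is None else val for col, val in r.items()}
            for r in rows
        ]

    def fit_transform(self, rows):
        # scikit-learn's TransformerMixin provides this; written out here
        return self.fit(rows).transform(rows)

train = [
    {"age": 20, "salary": 100.0},
    {"age": None, "salary": 300.0},
    {"age": 40, "salary": None},
]
new = [{"age": None, "salary": None}]

imp = MeanImputerSketch()
print(imp.fit_transform(train))
print(imp.transform(new))
```

Separating fit from transform is what lets statistics learned on training data be reused on new data, which is the property a pipeline relies on.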
Right now, there are two Imputer classes we'll work with:
from autoimpute.imputations import SingleImputer, MultipleImputer
si = SingleImputer()  # imputation methods, passing through the data once
mi = MultipleImputer()  # imputation methods, passing through the data multiple times
Imputations can be as simple as:
# simple example using default instance of MultipleImputer
imp = MultipleImputer()
# fit transform returns a generator by default, calculating each imputation method lazily
imp.fit_transform(data)
Or quite complex, such as:
# create a complex instance of the MultipleImputer
# Here, we specify strategies by column and predictors for each column
# We also specify what additional arguments any `pmm` strategies should take
imp = MultipleImputer(
    n=10,
    strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"},
    predictors={"salary": "all", "gender": ["salary", "education", "weight"]},
    imp_kwgs={"pmm": {"fill_value": "random"}},
    visit="left-to-right",
    return_list=True
)
# Because we set return_list=True, imputations are done all at once, not evaluated lazily.
# This will return M*N, where M is the number of imputations and N is the size of original dataframe.
imp.fit_transform(data)
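The difference between the lazy generator and return_list=True can be sketched in plain Python. The function below is a hypothetical stand-in, not MultipleImputer itself: it fills gaps by drawing randomly from observed values, and both its name and the (imputation number, completed data) tuple shape are assumptions of this sketch.

```python
import random

def multiple_imputations(values, n=3, seed=0, return_list=False):
    """Produce n completed copies of `values`, lazily unless return_list is True."""
    observed = [v for v in values if v is not None]
    rng = random.Random(seed)  # seeded so repeated calls give the same draws

    def generate():
        for i in range(1, n + 1):
            # Each imputation is only computed when the consumer asks for it
            yield i, [rng.choice(observed) if v is None else v for v in values]

    return list(generate()) if return_list else generate()

salary = [50, None, 70, None]

lazy = multiple_imputations(salary, n=3)
first = next(lazy)  # only the first imputation has been computed so far

eager = multiple_imputations(salary, n=3, return_list=True)
print(first[0], len(eager))  # prints "1 3"
```

Lazy evaluation keeps only one completed dataset in memory at a time, whereas the list form holds all n copies at once.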
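Autoimpute also extends supervised machine learning methods from scikit-learn and statsmodels to apply them to multiply imputed datasets (using the MultipleImputer under the hood). Right now, Autoimpute supports linear regression and binary logistic regression. Additional supervised methods are currently under development.
As with Imputers, Autoimpute's analysis methods can be simple or complex: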
from autoimpute.analysis import MiLinearRegression
# By default, use statsmodels OLS and MultipleImputer()
simple_lm = MiLinearRegression()
# fit the model on each multiply imputed dataset and pool parameters
simple_lm.fit(X_train, y_train)
# get summary of fit, which includes pooled parameters under Rubin's rules
# also provides diagnostics related to analysis after multiple imputation
simple_lm.summary()
# make predictions on a new dataset using pooled parameters
predictions = simple_lm.predict(X_test)
# Control both the regression used and the MultipleImputer itself
multiple_imputer_arguments = dict(
    n=3,
    strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"},
    predictors={"salary": "all", "gender": ["salary", "education", "weight"]},
    imp_kwgs={"pmm": {"fill_value": "random"}},
    visit="left-to-right"
)
complex_lm = MiLinearRegression(
    model_lib="sklearn",  # use sklearn linear regression
    mi_kwgs=multiple_imputer_arguments  # control the multiple imputer
)
# fit the model on each multiply imputed dataset
complex_lm.fit(X_train, y_train)
# get summary of fit, which includes pooled parameters under Rubin's rules
# also provides diagnostics related to analysis after multiple imputation
complex_lm.summary()
# make predictions on new dataset using pooled parameters
predictions = complex_lm.predict(X_test)
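The comments above mention pooling parameters under Rubin's rules. As a rough illustration, the textbook version for a single parameter averages the per-dataset estimates and combines within-imputation and between-imputation variance. The sketch below uses only the standard library; the function name and the example numbers are made up, and it does not claim to mirror Autoimpute's internals.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical coefficient estimates and squared standard errors from m = 3 fits
coefs = [2.0, 2.5, 3.0]
squared_ses = [0.25, 0.25, 0.25]

estimate, total_var = pool_rubin(coefs, squared_ses)
print(estimate, total_var)
```

The between-imputation term is what makes the pooled standard error reflect uncertainty about the missing values, which a single imputed dataset would understate.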
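Note that we can also pass a pre-specified MultipleImputer to either analysis model instead of using mi_kwgs. The option is ours, and it is a matter of preference. If we pass a pre-specified MultipleImputer, anything in mi_kwgs is ignored, although the mi_kwgs argument is still validated.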
from autoimpute.imputations import MultipleImputer
from autoimpute.analysis import MiLinearRegression
# create a multiple imputer first
custom_imputer = MultipleImputer(n=3, strategy="pmm", return_list=True)
# pass the imputer to a linear regression model
complex_lm = MiLinearRegression(mi=custom_imputer, model_lib="statsmodels")
# proceed the same as the previous examples
complex_lm.fit(X_train, y_train).predict(X_test)
complex_lm.summary()
For a deeper understanding of how the package works and its available features, see our tutorials website.
Versions and Dependencies
Python 3.6+
Dependencies: numpy >= 1.15.4, scipy >= 1.2.1, pandas >= 0.20.3, statsmodels >= 0.9.0, scikit-learn >= 0.20.2, xgboost >= 0.83, pymc3 >= 3.5, seaborn >= 0.9.0, missingno >= 0.4.1
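One way to check an environment against these minimum versions is sketched below. It is a convenience sketch, not part of Autoimpute: the helper names are hypothetical, the version parser is deliberately simple (leading digits of each dotted component only), and it relies on importlib.metadata, which needs Python 3.8 or later even though the package itself lists 3.6+.

```python
import re
from importlib import metadata

# Minimum versions as listed in the dependency line above
MINIMUMS = {
    "numpy": "1.15.4", "scipy": "1.2.1", "pandas": "0.20.3",
    "statsmodels": "0.9.0", "scikit-learn": "0.20.2", "xgboost": "0.83",
    "pymc3": "3.5", "seaborn": "0.9.0", "missingno": "0.4.1",
}

def as_tuple(version):
    """Turn '1.15.4' into (1, 15, 4); pad short versions with zeros."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts + [0] * (3 - len(parts)))

def check(minimums):
    """Report, per package, the installed version and whether it meets the minimum."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = as_tuple(installed) >= as_tuple(minimum)
        report[name] = f"{installed} ({'ok' if ok else 'below ' + minimum})"
    return report

for package, status in check(MINIMUMS).items():
    print(package, status)
```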
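A note for Windows users:
Autoimpute works on Windows, but users may have trouble with pymc3 for Bayesian methods. (See the discourse thread.)
Users may receive the runtime error 'can't pickle fortran objects' when sampling using multiple chains.
There are a couple of things to try to overcome this error:
First, reinstall theano and pymc3. Make sure to delete the .theano cache in your home folder. Upgrade joblib in the process, since it is responsible for generating the error (pymc3 uses joblib under the hood).
Second, set cores=1 in pm.sample. This should be a last resort, as it means posterior sampling will use only 1 core. Not using multiprocessing will slow down Bayesian imputation methods significantly.
If you have solved this issue on Windows and have a better solution, please contact us!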
Creators and Maintainers
Joseph Kearney - @kearnz
Shahid Barkat - @shabarka
See the Authors page to get in touch!
License
Distributed under the MIT license. See LICENSE for more information.
Contributing
Guidelines for contributing to our project. See CONTRIBUTING for more information.
Contributor Code of Conduct
Adapted from Contributor Covenant, version 1.0.0. See Code of Conduct for more information.