How to handle missing NaN values for machine learning in Python

Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x            7039 non-null float64
left_eye_center_y            7039 non-null float64
right_eye_center_x           7036 non-null float64
right_eye_center_y           7036 non-null float64
left_eye_inner_corner_x      2271 non-null float64
left_eye_inner_corner_y      2271 non-null float64
left_eye_outer_corner_x      2267 non-null float64
left_eye_outer_corner_y      2267 non-null float64
right_eye_inner_corner_x     2268 non-null float64
right_eye_inner_corner_y     2268 non-null float64
right_eye_outer_corner_x     2268 non-null float64
right_eye_outer_corner_y     2268 non-null float64
left_eyebrow_inner_end_x     2270 non-null float64
left_eyebrow_inner_end_y     2270 non-null float64
left_eyebrow_outer_end_x     2225 non-null float64
left_eyebrow_outer_end_y     2225 non-null float64
right_eyebrow_inner_end_x    2270 non-null float64
right_eyebrow_inner_end_y    2270 non-null float64
right_eyebrow_outer_end_x    2236 non-null float64
right_eyebrow_outer_end_y    2236 non-null float64
nose_tip_x                   7049 non-null float64
nose_tip_y                   7049 non-null float64
mouth_left_corner_x          2269 non-null float64
mouth_left_corner_y          2269 non-null float64
mouth_right_corner_x         2270 non-null float64
mouth_right_corner_y         2270 non-null float64
mouth_center_top_lip_x       2275 non-null float64
mouth_center_top_lip_y       2275 non-null float64
mouth_center_bottom_lip_x    7016 non-null float64
mouth_center_bottom_lip_y    7016 non-null float64
Image                        7049 non-null object

2 Answers

User

Answer 1 · edited 2024-05-16 09:28:00

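There is no single best way to handle missing data. The most rigorous approach is to model the missing values as additional parameters in a probabilistic framework such as PyMC. That way you get a distribution over plausible values rather than a single answer. Here is an example of handling missing data with PyMC: http://stronginference.com/missing-data-imputation.html

If you really want to plug the holes with point estimates, what you are looking for is "imputation". I would steer clear of simple imputation methods such as mean filling, because they badly distort the joint distribution of your features. Instead, try something like softImpute, which tries to infer the missing values via a low-rank approximation. The original softImpute was written for R, but I have made a Python version (along with other methods such as kNN imputation) here: https://github.com/hammerlab/fancyimpute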
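To make the low-rank idea concrete, here is a minimal NumPy sketch of the soft-thresholded SVD iteration that softImpute is built on. It is not the fancyimpute API: the function name `soft_impute` and the parameters `shrinkage`, `n_iter`, and `tol` are hypothetical names chosen for this illustration, and the defaults are arbitrary.

```python
import numpy as np

def soft_impute(X, shrinkage=0.1, n_iter=200, tol=1e-6):
    """Illustrative soft-thresholded SVD imputation (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start with zeros in the missing cells.
    Z = np.where(missing, 0.0, X)
    for _ in range(n_iter):
        # Low-rank estimate: shrink every singular value toward zero.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)
        low_rank = (U * s) @ Vt
        # Keep observed cells as they are, refresh only the missing ones.
        Z_new = np.where(missing, low_rank, X)
        converged = np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0)
        Z = Z_new
        if converged:
            break
    return Z

# A rank-1 matrix with one entry removed.
X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
X[3, 2] = np.nan
completed = soft_impute(X)
print(completed)
```

The shrinkage value controls how strongly the estimate is pushed toward low rank; in practice it would be tuned, for example by hiding some observed entries and checking how well they are recovered.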

User

Answer 2 · edited 2024-05-16 09:28:00

What is the best way to handle missing values in data set?

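There is no best way. Every solution or algorithm has its own pros and cons. You can even combine several of them into your own strategy and tune the relevant parameters until you get the result that best fits your data; there is a lot of research on this topic.

For example, mean imputation is quick and simple, but it underestimates the variance, and replacing NaN with the mean distorts the shape of the distribution. KNN imputation, on the other hand, may not be ideal for large datasets in terms of time complexity, because it iterates over all data points and performs a computation for every NaN value. It also assumes that the attribute containing the NaN is correlated with the other attributes.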
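The variance claim is easy to check on synthetic data. The following sketch uses made-up numbers (a normal sample with 30% of its values removed at random) and compares the variance of the observed values with the variance after mean filling.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)  # synthetic feature

# Remove roughly 30% of the values at random.
mask = rng.random(x.size) < 0.3
x_missing = x.copy()
x_missing[mask] = np.nan

# Mean imputation: every NaN becomes the mean of the observed values.
observed_mean = np.nanmean(x_missing)
x_filled = np.where(np.isnan(x_missing), observed_mean, x_missing)

var_observed = np.nanvar(x_missing)  # variance of observed values only
var_filled = np.var(x_filled)        # variance after mean filling

print(var_observed, var_filled)
```

Every filled cell sits exactly on the mean and contributes zero squared deviation, so the population variance shrinks by the fraction of observed values.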

How to handle missing values in datasets before applying machine learning algorithm??

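Besides the mean imputation you mentioned, you could also look at K-nearest-neighbor imputation and regression imputation, and refer to the Imputer class in scikit-learn to see which existing APIs you can use. (In recent scikit-learn releases that class has been replaced by SimpleImputer in sklearn.impute.)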

KNN imputation

Compute the mean of the k nearest neighbors of the point that contains the NaN.
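As a rough illustration of that description, here is a small NumPy sketch. The helper `knn_impute` is a hypothetical name, not a library function, and measuring distance only over features that both rows have observed is one assumption among several reasonable choices.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest rows
    (illustrative helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    for i in np.where(missing.any(axis=1))[0]:
        # Compare row i with every row on the features both have observed.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.sqrt((diff ** 2).sum(axis=1) / np.maximum(n_shared, 1))
        dist[n_shared == 0] = np.inf  # nothing in common: not comparable
        dist[i] = np.inf              # a row is not its own neighbor
        for j in np.where(missing[i])[0]:
            # Only rows that actually have feature j can donate a value.
            candidates = np.where(~missing[:, j] & np.isfinite(dist))[0]
            if candidates.size == 0:
                continue  # leave the NaN if no donor exists
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

X = np.array([
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [1.05, np.nan],
])
result = knn_impute(X, k=2)
print(result)
```

In this toy input the last row is closest to the first two rows, so its missing value becomes the mean of 2.0 and 2.2. The double loop over rows also shows why the cost grows quickly on large datasets.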

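Regression imputation

A regression model is fitted to predict the observed values of one variable from the other variables, and that model is then used to impute values wherever that variable is missing.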
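A minimal sketch of this idea, assuming an ordinary least-squares linear model and assuming the predictor columns are fully observed for the rows being filled. The helper name `regression_impute` is hypothetical.

```python
import numpy as np

def regression_impute(X, target):
    """Impute column `target` from the other columns with a linear model
    (illustrative helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    predictors = [c for c in range(X.shape[1]) if c != target]
    predictors_ok = ~np.isnan(X[:, predictors]).any(axis=1)
    target_missing = np.isnan(X[:, target])
    train = predictors_ok & ~target_missing  # rows used to fit the model
    fill = predictors_ok & target_missing    # rows whose target we predict

    # Fit target = intercept + predictors @ weights by least squares.
    A_train = np.column_stack([np.ones(train.sum()), X[train][:, predictors]])
    coef, *_ = np.linalg.lstsq(A_train, X[train, target], rcond=None)

    # Predict the missing target values from their predictor values.
    A_fill = np.column_stack([np.ones(fill.sum()), X[fill][:, predictors]])
    filled = X.copy()
    filled[fill, target] = A_fill @ coef
    return filled

# Toy data where the second column is exactly 2 * first + 1.
X = np.array([
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, 7.0],
    [4.0, np.nan],
])
result = regression_impute(X, target=1)
print(result)
```

Because the prediction is deterministic, the imputed values fall exactly on the fitted line and carry no residual noise, which is worth keeping in mind when the filled column is later used to estimate variability.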

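See the "Imputation of missing values" section of the scikit-learn documentation. I have also heard of the Orange library, but I have not had a chance to use it yet.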
