Can you use the Isolation Forest algorithm on large samples?

Published 2024-06-16 08:55:51


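I have been using scikit-learn's Isolation Forest implementation, sklearn.ensemble.IsolationForest, to detect anomalies in my data sets, which range from 100 rows to millions of rows. It seems to work well, and I have overridden max_samples with a very large integer to handle some of the larger data sets (essentially no sub-sampling). I noticed that the original paper states that larger sample sizes create a risk of swamping and masking.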

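If Isolation Forest appears to be working fine, is it OK to use it with large sample sizes? I tried training with a smaller max_samples, and testing produced too many anomalies. My data is really starting to grow, and I am wondering whether a different anomaly detection algorithm would be better for such large sample sizes.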


1 Answer
User
#1 · Posted 2024-06-16 08:55:51

Quoting the original paper:

The isolation characteristic of iTrees enables them to build partial models and exploit sub-sampling to an extent that is not feasible in existing methods. Since a large part of an iTree that isolates normal points is not needed for anomaly detection; it does not need to be constructed. A small sample size produces better iTrees because the swamping and masking effects are reduced.

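From your question, I get the feeling that you are confusing the size of the data set with the size of the sample drawn from it to build each iTree. Isolation Forest can handle very large data sets, and it works better when it sub-samples them.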

The original paper discusses this in section 3:

The data set has two anomaly clusters located close to one large cluster of normal points at the centre. There are interfering normal points surrounding the anomaly clusters, and the anomaly clusters are denser than normal points in this sample of 4096 instances. Figure 4(b) shows a sub-sample of 128 instances of the original data. The anomalies clusters are clearly identifiable in the sub-sample. Those normal instances surrounding the two anomaly clusters have been cleared out, and the size of anomaly clusters becomes smaller which makes them easier to identify. When using the entire sample, iForest reports an AUC of 0.67. When using a sub-sampling size of 128, iForest achieves an AUC of 0.91.
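As a rough illustration of the distinction between data set size and per-tree sample size, here is a minimal sketch using scikit-learn. The synthetic data, the cluster positions, and the choice of `max_samples=256` are assumptions made for the example, not values from the question or the paper. The point is only that the forest is fitted on, and scores, every row, while each tree is built from a small sub-sample.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption for this sketch):
# 20,000 "normal" points plus a small cluster of 200 shifted points.
normal = rng.normal(loc=0.0, scale=1.0, size=(20_000, 2))
outliers = rng.normal(loc=6.0, scale=0.5, size=(200, 2))
X = np.vstack([normal, outliers])

# max_samples is the number of rows drawn to build EACH tree,
# not a cap on the size of the data set passed to fit().
# The default "auto" means min(256, n_samples).
forest = IsolationForest(
    n_estimators=100,
    max_samples=256,
    contamination="auto",
    random_state=0,
)
forest.fit(X)

scores = forest.score_samples(X)  # lower score = more anomalous
labels = forest.predict(X)        # -1 = anomaly, 1 = normal

print("rows scored:", scores.shape[0])
print("samples per tree:", forest.max_samples_)
print("flagged as anomalies:", int((labels == -1).sum()))
```

If a small `max_samples` flags too many points, the threshold is a separate knob: `contamination` (or a custom cut-off on `score_samples`) controls how many rows are labelled anomalous, independently of how many rows each tree sees.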


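Isolation Forest is not a perfect algorithm, and its parameters need to be tuned for your specific data. It may even perform poorly on some data sets. If you want to consider other methods, Local Outlier Factor is also included in scikit-learn. You can also combine several methods (an ensemble).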
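For reference, a minimal sketch of trying Local Outlier Factor and combining it with Isolation Forest might look like the following. The synthetic data, `n_neighbors=20`, and the rank-averaging step are all assumptions for illustration. Rank averaging is just one simple way to combine detectors, not something the answer prescribes.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption for this sketch).
X = np.vstack([
    rng.normal(0.0, 1.0, size=(2_000, 2)),    # bulk of the data
    rng.uniform(-8.0, 8.0, size=(20, 2)),     # scattered points
])

# Local Outlier Factor: density-based, scores the rows it was fitted on.
lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
lof_labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
lof_scores = lof.negative_outlier_factor_     # lower = more anomalous

# Isolation Forest on the same data, with per-tree sub-sampling.
iso = IsolationForest(max_samples=256, random_state=0).fit(X)
iso_scores = iso.score_samples(X)             # lower = more anomalous


def to_rank(scores):
    """Rank rows so that 0 is the most anomalous (lowest score)."""
    return np.argsort(np.argsort(scores))


# Hypothetical combination rule: average the two rankings.
combined_rank = (to_rank(lof_scores) + to_rank(iso_scores)) / 2.0
top_20 = np.argsort(combined_rank)[:20]       # 20 most anomalous rows

print("LOF outliers:", int((lof_labels == -1).sum()))
print("top-20 combined indices:", top_20)
```

Ranks are used instead of raw scores because the two detectors produce scores on different scales.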

Here you can find a good comparison of different methods.
