Finding an appropriate cutoff value

1 vote

3 answers

2676 views

Asked 2025-04-16 13:09

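I am trying to implement the Hampel tanh estimator in order to normalize highly asymmetric data. To do that, I need to perform the following calculation:

Given x, a sorted list of numbers, and m, the median of x, I need to find a such that roughly 70% of the values in x fall within the range (m-a; m+a). We know nothing about the distribution of the values in x. I am writing in Python with numpy, and the best idea I have come up with is a kind of stochastic iterative search (for example, the one described by Solis and Wets), but I suspect there is a better way, either a better algorithm or a ready-made function. I searched the numpy and scipy documentation but found no useful hints.

Edit

Seth suggested using scipy.stats.mstats.trimboth, but in my test on a skewed distribution this suggestion did not work: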

from scipy.stats.mstats import trimboth
import numpy as np

# skewed test data: 100 values on a log scale
theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

# drop 15% from each end, then take half of the remaining range as a
trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5

# check how many elements fall into the range (theMedian - a, theMedian + a)
sel = (theList > (theMedian - a)) & (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

The output is 0.79 (roughly 80% instead of 70%).


3 Answers

What you need is scipy.stats.mstats.trimboth. Set proportiontocut=0.15, then compute (max-min)/2 on the data that remains after trimming.

Answered 2025-04-16 by Python大师


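Restate the problem slightly. You know the length of the list and the fraction of the numbers you want to cover. From these you can determine the difference between the first and last index of the range you are after (70 for a list of 100 values and a 70% fraction). The goal is then to find the pair of indices that minimizes a cost function measuring how far the two end values are from being symmetric about the median.

Let the smaller index be n1 and the larger one n2; the two are tied together by the fixed index difference. The values of the list at these indices are x[n1] = m-b and x[n2] = m+c. You now want to choose n1 (and therefore n2) so that b and c are as close as possible. This happens when (b - c)**2 is at its minimum, which is easy to find with numpy.argmin. To parallel the example in the question, here is an interactive session that illustrates the approach: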

$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> theList = np.log10(1+np.arange(.1, 100))
>>> theMedian = np.median(theList)
>>> listHead = theList[0:30]
>>> listTail = theList[-30:]
>>> b = np.abs(listHead - theMedian)
>>> c = np.abs(listTail - theMedian)
>>> squaredDiff = (b - c) ** 2
>>> np.argmin(squaredDiff)
25
>>> listHead[25] - theMedian, listTail[25] - theMedian
(-0.2874888056626983, 0.27859407466756614)
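The session above can be wrapped in a small function. This is only a sketch under stated assumptions: the name `symmetric_window` and the `fraction` parameter are hypothetical, and the input is sorted inside the function for safety. It returns the chosen start index together with the two distances b and c; how to turn them into a single a (for example by averaging) is left open, since the answer does not specify it.

```python
import numpy as np

def symmetric_window(x, fraction=0.7):
    """Hypothetical helper: find the start index n1 of the window covering
    `fraction` of the sorted data whose two ends are most nearly
    equidistant from the median. Returns (n1, b, c)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = int(round(fraction * n))        # index offset between n1 and n2
    if not 0 < k < n:
        raise ValueError("fraction must leave at least one candidate window")
    m = np.median(x)
    lower = x[:n - k]                   # candidate values x[n1]
    upper = x[k:]                       # matching values x[n2] = x[n1 + k]
    b = np.abs(lower - m)               # distance of the lower end from m
    c = np.abs(upper - m)               # distance of the upper end from m
    n1 = int(np.argmin((b - c) ** 2))   # most symmetric window
    return n1, b[n1], c[n1]

the_list = np.log10(1 + np.arange(.1, 100))
n1, b, c = symmetric_window(the_list, 0.7)
print(n1, b, c)
```

For the 100-value example, `lower` and `upper` are the same slices as `listHead` and `listTail` in the session, so the function performs the same computation.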

Answered 2025-04-16 by Python大师


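You first need to make your distribution symmetric, that is, "fold" all values below the median over to the right-hand side. After that you can use the standard scipy.stats functions on the resulting one-sided distribution: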

from scipy.stats import scoreatpercentile
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

# real copy of the original array (slicing with [:] would only give a view)
oneSidedList = theList.copy()
# fold over to the right all values left of the median
below = theList < theMedian
oneSidedList[below] = 2*theMedian - theList[below]

# find the 70th centile of the one-sided distribution
a = scoreatpercentile(oneSidedList, 70) - theMedian

# check how many elements fall into the range (theMedian - a, theMedian + a)
sel = (theList > (theMedian - a)) & (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

This gives the result you need, 0.7.
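Folding the values around the median and subtracting the median amounts to taking absolute deviations from it, so the same a can be sketched with numpy alone. This is an illustrative variant rather than the answerer's code: the function name `half_width_around_median` and the `coverage` parameter are hypothetical, and `np.percentile` is used in place of `scoreatpercentile`.

```python
import numpy as np

def half_width_around_median(x, coverage=70.0):
    """Hypothetical helper: return (m, a) such that about `coverage` percent
    of x lies within (m - a, m + a), where m is the median of x."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    # |x - m| is the folded, one-sided distribution measured from the median
    a = np.percentile(np.abs(x - m), coverage)
    return m, a

the_list = np.log10(1 + np.arange(.1, 100))
m, a = half_width_around_median(the_list)

# fraction of values strictly inside (m - a, m + a)
inside = np.abs(the_list - m) < a
print(inside.mean())
```

Because no copy of the data is modified in place, this version avoids the view-versus-copy pitfall entirely.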

Answered 2025-04-16 by Python大师

