Slow "statistics" functions

Published 2024-04-26 10:17:06


Why is statistics.mean() so slow compared with the NumPy version, or even with a naive implementation such as:

def mean(items):
    return sum(items) / len(items)

On my system, I get the following timings:

import numpy as np
import statistics

ll_int = [x for x in range(100_000)]

%timeit statistics.mean(ll_int)
# 42 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_int) / len(ll_int)
# 460 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_int)
# 4.62 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


ll_float = [x / 10 for x in range(100_000)]

%timeit statistics.mean(ll_float)
# 56.7 ms ± 879 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sum(ll_float) / len(ll_float)
# 459 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.mean(ll_float)
# 2.7 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I get similar timings for other functions such as variance and stdev.

Edit: even an iterative implementation like this one:

def next_mean(value, mean_, num):
    return (num * mean_ + value) / (num + 1)

def imean(items, mean_=0.0):
    for i, item in enumerate(items):
        mean_ = next_mean(item, mean_, i)
    return mean_

seems to be faster:

%timeit imean(ll_int)
# 16.6 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit imean(ll_float)
# 16.2 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2 Answers

The statistics module uses interpreted Python code, whereas numpy does all of its heavy lifting in optimized compiled code, so it would be surprising if numpy didn't blow statistics out of the water.

Furthermore, statistics is designed to play nicely with modules like decimal and fractions, and its code values numerical accuracy and type safety over speed. Your naive implementation uses sum; the statistics module uses its own function _sum internally. Looking at its source shows that it does quite a bit more than just add things together:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)
    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.
    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.
    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)
    Some sources of round-off error will be avoided:
    # Built-in sum returns zero.
    >>> _sum([1e50, 1, -1e50] * 1000)
    (<class 'float'>, Fraction(1000, 1), 3000)
    Fractions and Decimals are also supported:
    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)
    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)
    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)

The most surprising thing about this code is that it converts the data to fractions so as to minimize round-off error. There is no reason to expect code like this to be as fast as a simple sum(nums)/len(nums) approach.
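The round-off case from the _sum docstring can be reproduced directly; a small sketch contrasting the built-in sum with statistics.mean on that same pathological list:

```python
import statistics

# Exact sum is 1000, but float partial sums cancel catastrophically:
# 1e50 + 1 rounds straight back to 1e50, so the 1s are lost.
data = [1e50, 1, -1e50] * 1000

naive = sum(data) / len(data)       # float arithmetic drops the 1s entirely
exact = statistics.mean(data)       # Fraction bookkeeping preserves them

print(naive, exact)
```

Here the naive mean comes out as 0.0, while statistics.mean recovers 1000/3000, i.e. one third; that accuracy is exactly what the extra work pays for.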

The developers of the statistics module made an explicit decision to value correctness over speed:

Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one.

and also stated that there was no intention

to replace, or even compete directly with, numpy

However, an enhancement request was made to add an additional, faster, simpler implementation, statistics.fmean, and this function was released in Python 3.8. According to the enhancement's developer, this function is up to 500 times faster than the existing statistics.mean.

The fmean implementation is essentially sum/len.
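A minimal comparison of the two, assuming Python 3.8+ so that fmean is available:

```python
import statistics

data = [x / 10 for x in range(100_000)]

# fmean() skips the exact Fraction bookkeeping that mean() does and works
# in plain float arithmetic, so its cost is close to sum(data) / len(data).
print(statistics.fmean(data))
print(sum(data) / len(data))
```

Note that fmean() always returns a float, whereas mean() preserves exact types such as Fraction and Decimal; that is the trade-off that buys the speed.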
