Can someone explain the precision of pandas bins?

import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(1)
n_samples = 37000
n_bins = 91000

data = pd.Series(np.random.gamma(1, 1, n_samples))

t1 = time.time()
binned_df = pd.cut(data, bins=n_bins, precision=100).value_counts()
t2 = time.time()
print("pd.cut speed: {}".format(t2 - t1))

summed = np.sum(binned_df)
print("sum: {:.4f}".format(summed))
print("len: {}".format(len(binned_df)))
print(binned_df.head())

plt.hist(data, bins=100)
plt.show()

1 Answer

User

#1 · Posted 2024-04-19 22:54:53

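Looking at the source code, it seems that giving pandas a precision higher than 19 lets you skip a loop that would otherwise run (provided your dtype is not one of the two types special-cased there; see Line 326). The relevant code starts on Line 393 and goes to Line 415. My own comments are marked with a double hash (##):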

## This function figures out how far to round the bins after decimal place
def _round_frac(x, precision):
    """
    Round the fractional part of the given number
    """
    if not np.isfinite(x) or x == 0:
        return x
    else:
        frac, whole = np.modf(x)
        if whole == 0:
            digits = -int(np.floor(np.log10(abs(frac)))) - 1 + precision
        else:
            digits = precision
        return np.around(x, digits)

## This function loops through and makes the cuts more and more precise
## sequentially and only stops if either the number of unique levels created
## by the precision are equal to the number of bins or, if that doesn't
## work, just returns the precision you gave it. 

## However, range(100, 20) cannot loop so you jump to the end
def _infer_precision(base_precision, bins):
    """Infer an appropriate precision for _round_frac
    """
    for precision in range(base_precision, 20):
        levels = [_round_frac(b, precision) for b in bins]
        if algos.unique(levels).size == bins.size:
            return precision
    return base_precision # default

Edit: a contrived example

Suppose you have a list my_list with six elements, ranging from 1.121 to 1.143, that you want to split into three bins.

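Clearly you would want the splits to fall after 1.123 and 1.133, but suppose you did not give pandas the bin edges directly and instead supplied the number of bins (n_bins = 3). Suppose pandas starts its guess with cuts that divide the data range evenly into three (note: I don't know how pandas actually chooses its initial cuts; this is only for the sake of the example):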

# To calculate where the bin cuts start
x = (1.143 - 1.121)/3
cut1 = 1.121 + x  # 1.1283
cut2 = 1.121 + (2*x) # 1.1357
bins = [cut1, cut2]
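
The same arithmetic can be written with `np.linspace`. This is only a sketch of the equal-width guess described above; whether pandas builds its edges exactly this way is an assumption, and the names `lo`, `hi` and `edges` are made up here.

```python
import numpy as np

lo, hi = 1.121, 1.143   # smallest and largest values in the example
n_bins = 3

# n_bins equal-width bins need n_bins + 1 edges, counting both end points
edges = np.linspace(lo, hi, n_bins + 1)

# The two interior edges are the cuts computed by hand above
cut1, cut2 = edges[1:-1]
print(round(float(cut1), 4), round(float(cut2), 4))
```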

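On top of that, suppose you tell pandas to use a precision of 1. Applying that precision to the cuts above gives 1.1 for both, which is useless for separating the values because every entry looks like 1.1. So the package has to loop and keep more and more decimal places on the estimated cut values until the number of resulting levels matches n_bins: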

# Adapted from _infer_precision
for precision in range(1, 4):
    levels = [_round_frac(b, precision) for b in bins]
    print(levels)
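
To see where such a loop would stop, it helps to count the distinct rounded cuts at each precision. The sketch below calls `np.around` directly instead of `_round_frac`; the two agree here because both cuts have a nonzero whole part. The cut values are the ones from the calculation above, rounded to four places.

```python
import numpy as np

bins = [1.1283, 1.1357]  # the two cuts from above, rounded to 4 places

counts = {}
for precision in range(1, 4):
    levels = [float(np.around(b, precision)) for b in bins]
    counts[precision] = len(set(levels))
    print(precision, levels, counts[precision])
# 1 [1.1, 1.1] 1
# 2 [1.13, 1.14] 2
# 3 [1.128, 1.136] 2
```

At precision 1 the two cuts collapse into a single level; from precision 2 onward they are distinct, so a stopping rule based on the number of unique levels would be satisfied there.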

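This process only stops when the number of unique levels matches the number of bin edges (bins.size in the code), or when the loop runs out after trying a precision of 19. Supplying a precision of 100 skips the loop entirely and lets the package keep up to 100 places after the decimal point when choosing cut values between ever more finely spaced values in the data.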
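
The effect can be tried on a small series with `pd.cut` itself. This is a hedged sketch: the six values are hypothetical (only the end points 1.121 and 1.143 and the intended split points come from the example), and the exact labels printed may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data chosen to match the description of the example
my_list = [1.121, 1.123, 1.131, 1.133, 1.141, 1.143]
values = pd.Series(my_list)

labels = {}
counts = {}
for precision in (1, 100):
    binned = pd.cut(values, bins=3, precision=precision)
    labels[precision] = binned.cat.categories
    counts[precision] = binned.value_counts(sort=False).tolist()
    # Print the interval labels and how many values landed in each bin
    print(precision, list(labels[precision]), counts[precision])
```

In both runs the values should fall two per bin; what changes with the precision is how many decimal places the interval labels carry.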
