When I pass a string as the func argument of DataFrame.agg(), how do I know which function gets called?

Posted 2024-05-23 22:07:46


For example, suppose I have a DataFrame like

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4]})

and I call

df.agg(func='sum')

Which of the following does the string 'sum' refer to?

  1. ^{}
  2. ^{}
  3. ^{}

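I realize that under the hood these functions do the same thing, but I would still like to know which one gets dispatched. Is this documented anywhere?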


2 answers

@Ch3steR, thanks for helping me see the light. I would like to elaborate on your answer, though.

The pandas source includes these relevant lines:

def aggregate(
    obj: AggObjType,
    arg: AggFuncType,
    *args,
    **kwargs,
):

    ...

    if isinstance(arg, str):
        return obj._try_aggregate_string_function(arg, *args, **kwargs), None

Then we follow the call into _try_aggregate_string_function:

def _try_aggregate_string_function(self, arg: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, str)

        f = getattr(self, arg, None)
        if f is not None:
            if callable(f):
                return f(*args, **kwargs)

            # people may try to aggregate on a non-callable attribute
            # but don't let them think they can pass args to it
            assert len(args) == 0
            assert len([kwarg for kwarg in kwargs if kwarg not in ["axis"]]) == 0
            return f

        f = getattr(np, arg, None)
        if f is not None:
            if hasattr(self, "__array__"):
                # in particular exclude Window
                return f(self, *args, **kwargs)

        raise AttributeError(
            f"'{arg}' is not a valid function for '{type(self).__name__}' object"
        )

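So when you make a call like df.agg('foo'), pandas first looks for a DataFrame attribute named foo. Only if no such attribute exists does it look for a NumPy function named foo, and if neither is found it raises an AttributeError.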
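The lookup order described above can be sketched with public APIs only. This is a rough illustration, not pandas code: `resolve_string_agg` is a hypothetical helper that only reports where a name would be found, following the order in the quoted source, and the actual internals may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def resolve_string_agg(obj, name):
    """Hypothetical helper: report where `name` would be found,
    following the lookup order in the source quoted above."""
    # 1. An attribute or method on the object itself wins.
    attr = getattr(obj, name, None)
    if attr is not None:
        return "method on object" if callable(attr) else "non-callable attribute"
    # 2. Otherwise fall back to a NumPy function of the same name,
    #    provided the object can be converted to an array.
    if getattr(np, name, None) is not None and hasattr(obj, "__array__"):
        return "numpy function"
    # 3. Nothing matched; the quoted source raises AttributeError here.
    return "not found"


df = pd.DataFrame({"x": [1, 2, 3, 4]})

where_sum = resolve_string_agg(df, "sum")      # DataFrame.sum exists
where_size = resolve_string_agg(df, "size")    # DataFrame.size is a plain int
where_sqrt = resolve_string_agg(df, "sqrt")    # no DataFrame.sqrt, but np.sqrt exists
where_bogus = resolve_string_agg(df, "no_such_function")
```

Under this reading, 'sum' resolves to DataFrame.sum rather than np.sum or the builtin sum, because the attribute lookup on the object comes first.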

This is an internal detail, and I don't think it is documented.

The pandas developers handle strings such as 'sum' and 'mean' this way: they keep a mapping from functions to the names of their internal cythonized implementations.

Excerpt from the pandas source:

_cython_table = {
        builtins.sum: "sum",
        builtins.max: "max",
        builtins.min: "min",
        np.all: "all",
        np.any: "any",
        np.sum: "sum",
        np.nansum: "sum",
        np.mean: "mean",
        np.nanmean: "mean",
        np.prod: "prod",
        np.nanprod: "prod",
        np.std: "std",
        np.nanstd: "std",
        np.var: "var",
        np.nanvar: "var",
        np.median: "median",
        np.nanmedian: "median",
        np.max: "max",
        np.nanmax: "max",
        np.min: "min",
        np.nanmin: "min",
        np.cumprod: "cumprod",
        np.nancumprod: "cumprod",
        np.cumsum: "cumsum",
        np.nancumsum: "cumsum",
    }

So Series.agg(sum), Series.agg('sum'), Series.agg(np.sum), and Series.agg(np.nansum) all call the same internal cythonized function.
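As a quick check of that claim, the four spellings can be compared on a small Series. This is only a sketch: equal results on NaN-free data show that the spellings agree here, not which function ran, and the mapping is an internal detail that may change between pandas versions. Warnings are silenced on the assumption that some versions warn when given these callables.

```python
import warnings

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])

with warnings.catch_warnings():
    # Assumption: some pandas versions emit a warning for these callables.
    warnings.simplefilter("ignore")
    results = {
        "builtin sum": s.agg(sum),
        "'sum'": s.agg("sum"),
        "np.sum": s.agg(np.sum),
        "np.nansum": s.agg(np.nansum),
    }

for spelling, value in results.items():
    print(f"{spelling:12} -> {value}")
```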

Excerpt from the pandas source:

    def _get_cython_func(self, arg: Callable) -> Optional[str]:
        """
        if we define an internal function for this argument, return it
        """
        return self._cython_table.get(arg)

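You can see how they are handled in the aggregate function shown below: it uses getattr, so the cythonized functions appear to be defined as attributes of the class. I couldn't find a good starting point, but aggregate is probably the best place to look.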

def aggregate(
    obj: AggObjType,
    arg: AggFuncType,
    *args,
    **kwargs,
):
    ...
    ...
    if callable(arg):
        f = obj._get_cython_func(arg)
        if f and not args and not kwargs:
            return getattr(obj, f)(), None
   ...
   ...
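The callable branch above amounts to a table lookup followed by getattr. A minimal sketch of that pattern, assuming a trimmed-down table: `CALLABLE_TO_NAME` and `dispatch` are hypothetical names used for illustration, not pandas internals.

```python
import builtins

import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for the table quoted earlier.
CALLABLE_TO_NAME = {
    builtins.sum: "sum",
    np.sum: "sum",
    np.nansum: "sum",
    np.mean: "mean",
    np.nanmean: "mean",
}


def dispatch(obj, func):
    """Hypothetical helper: swap a known callable for the object's own method."""
    name = CALLABLE_TO_NAME.get(func)
    if name is not None:
        # Known callable: call the method of that name on the object instead.
        return getattr(obj, name)()
    # Unknown callable: just call it on the object.
    return func(obj)


s = pd.Series([1.0, 2.0, None])

via_sum = dispatch(s, np.sum)    # routed to s.sum(), which skips NaN by default
via_mean = dispatch(s, np.mean)  # routed to s.mean()
via_len = dispatch(s, len)       # not in the table, so len(s) is called directly
```

One consequence of this design is that the object's own method decides details such as NaN handling, regardless of which equivalent callable was passed in.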
