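Applying np.histogram in pandas to reshape a DataFrame
I want to get a normalized histogram of each column of a pandas DataFrame. I would like to use np.histogram, but it returns a tuple and I only want its first element. pandas doesn't seem to like that, though. For example, the following code works fine: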
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(size=20).reshape(5, 4))
bins = (0, 0.5, 1)
df.apply(np.histogram, bins=bins, normed=True)
and returns:
0 ([0.8, 1.2], [0.0, 0.5, 1.0])
1 ([0.8, 1.2], [0.0, 0.5, 1.0])
2 ([0.8, 1.2], [0.0, 0.5, 1.0])
3 ([0.8, 1.2], [0.0, 0.5, 1.0])
dtype: object
But I only want the first element of each tuple, so I tried this:
df.apply(lambda x : np.histogram(x, bins=bins, normed=True)[0])
but it raises an error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-3191795e120c> in <module>()
----> 1 df.apply(lambda x : np.histogram(x, bins=bins, normed=True)[0])
/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
3310 if reduce is None:
3311 reduce = True
-> 3312 return self._apply_standard(f, axis, reduce=reduce)
3313 else:
3314 return self._apply_broadcast(f, axis)
/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
3415 index = None
3416
-> 3417 result = self._constructor(data=results, index=index)
3418 result.columns = res_index
3419
/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
199 dtype=dtype, copy=copy)
200 elif isinstance(data, dict):
--> 201 mgr = self._init_dict(data, index, columns, dtype=dtype)
202 elif isinstance(data, ma.MaskedArray):
203 import numpy.ma.mrecords as mrecords
/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in _init_dict(self, data, index, columns, dtype)
321
322 return _arrays_to_mgr(arrays, data_names, index, columns,
--> 323 dtype=dtype)
324
325 def _init_ndarray(self, values, index, columns, dtype=None,
/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
4471 axes = [_ensure_index(columns), _ensure_index(index)]
4472
-> 4473 return create_block_manager_from_arrays(arrays, arr_names, axes)
4474
4475
/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in create_block_manager_from_arrays(arrays, names, axes)
3757 return mgr
3758 except (ValueError) as e:
-> 3759 construction_error(len(arrays), arrays[0].shape[1:], axes, e)
3760
3761
/usr/local/lib/python2.7/site-packages/pandas/core/internals.pyc in construction_error(tot_items, block_shape, axes, e)
3729 raise e
3730 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 3731 passed,implied))
3732
3733 def create_block_manager_from_blocks(blocks, axes):
ValueError: Shape of passed values is (4,), indices imply (4, 5)
> /usr/local/lib/python2.7/site-packages/pandas/core/internals.py(3731)construction_error()
3730 raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 3731 passed,implied))
3732
Any suggestions?
1 Answer
If that is what you want, try this:
In [26]: df.apply(lambda x : pd.Series(np.histogram(x, bins=bins, normed=True)[0]))
Out[26]:
0 1 2 3
0 0.4 1.6 0.8 1.6
1 1.6 0.4 1.2 0.4
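np.histogram is neither a reducer (which returns a single value) nor a transformer (which returns as many values as it was given), so apply does not know how to handle what it returns. Wrapping the result in a Series, as above, lets apply assemble the per-column results into a DataFrame.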
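To make the reducer/transformer distinction concrete, here is a minimal sketch of the three return shapes side by side. It assumes a recent NumPy, where `density=True` stands in for the `normed=True` used in the question; the seeded generator and the variable names are only there for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame(rng.uniform(size=(5, 4)))
bins = (0, 0.5, 1)

# Reducer: one scalar per column, so apply returns a Series with one entry per column.
reduced = df.apply(lambda col: col.mean())

# Transformer: as many values out as in, so apply returns a same-shaped DataFrame.
transformed = df.apply(lambda col: col - col.mean())

# Histogram: two values per column, which is neither of the above.
# Wrapping the counts in a Series gives apply labelled rows to assemble.
# Assumption: density=True replaces normed=True on recent NumPy.
hist = df.apply(
    lambda col: pd.Series(np.histogram(col, bins=bins, density=True)[0])
)

print(reduced.shape)      # (4,)
print(transformed.shape)  # (5, 4)
print(hist.shape)         # (2, 4)
```

Each column of `hist` holds the density in the two bins, so multiplying by the bin width (0.5) and summing gives 1 for every column.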
Here is another way to do it (which is also one way of thinking about what apply does):
In [28]: f = lambda x : pd.Series(np.histogram(x, bins=bins, normed=True)[0])
In [31]: pd.concat([ f(col) for c, col in df.iteritems() ],axis=1)
Out[31]:
0 1 2 3
0 0.4 1.6 0.8 1.6
1 1.6 0.4 1.2 0.4
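On current library versions the snippet above may need two small changes: `DataFrame.iteritems` was removed in pandas 2.0 in favour of `DataFrame.items`, and recent NumPy releases no longer accept `normed`, with `density` taking its place. A hedged sketch of the same idea under those assumptions follows; `column_hist` is a hypothetical helper name, not something from pandas or NumPy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable
df = pd.DataFrame(rng.uniform(size=(5, 4)))
bins = (0, 0.5, 1)

def column_hist(col):
    """Return the normalized bin heights of one column as a Series."""
    # Assumption: density=True replaces normed=True on recent NumPy.
    heights, _edges = np.histogram(col, bins=bins, density=True)
    return pd.Series(heights, name=col.name)

# Same idea as the concat version, with items() instead of iteritems().
via_concat = pd.concat([column_hist(col) for _, col in df.items()], axis=1)

# Alternative: build the frame directly from a dict of arrays, one per column.
via_dict = pd.DataFrame(
    {name: np.histogram(col, bins=bins, density=True)[0]
     for name, col in df.items()}
)

print(via_concat)
```

Both routes should give a frame with one row per bin and one column per original column; the dict version skips the intermediate Series objects.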