Why does a datetime lookup fail on a pandas DataFrame but work on a Series?
Given:
from datetime import datetime
import numpy as np
import pandas as pd
x = pd.DataFrame(np.random.randn(6), index=pd.date_range('2015-01-15', '2015-01-20'))
In [37]: x[datetime(2015,1,15)]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-37-0ce45ca5a858> in <module>()
----> 1 x[datetime(2015,1,15)]
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1656 return self._getitem_multilevel(key)
1657 else:
-> 1658 return self._getitem_column(key)
1659
1660 def _getitem_column(self, key):
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1663 # get column
1664 if self.columns.is_unique:
-> 1665 return self._get_item_cache(key)
1666
1667 # duplicate columns & possible reduce dimensionaility
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1003 res = cache.get(item)
1004 if res is None:
-> 1005 values = self._data.get(item)
1006 res = self._box_item_values(item, values)
1007 cache[item] = res
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item)
2871 return self.get_for_nan_indexer(indexer)
2872
-> 2873 _, block = self._find_block(item)
2874 return block.get(item)
2875 else:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item)
3183
3184 def _find_block(self, item):
-> 3185 self._check_have(item)
3186 for i, block in enumerate(self.blocks):
3187 if item in block:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item)
3190 def _check_have(self, item):
3191 if item not in self.items:
-> 3192 raise KeyError('no item named %s' % com.pprint_thing(item))
3193
3194 def reindex_axis(self, new_axis, indexer=None, method=None, axis=0,
KeyError: u'no item named 2015-01-15 00:00:00'
However, with a Series,
In [39]: x = pd.Series(np.random.randn(6),index=pd.date_range('2015-01-15','2015-01-20'))
the lookup works:
In [40]: x[datetime(2015,1,15)]
Out[40]: -2.0727569075280319
Can someone explain why the lookup works on a Series but not on a DataFrame?
Here is x:
In [41]: x
Out[41]:
2015-01-15 -2.072757
2015-01-16 -0.682232
2015-01-17 1.681293
2015-01-18 2.151027
2015-01-19 0.493222
2015-01-20 0.538554
Freq: D, dtype: float64
1 Answer
In short, you are selecting from different axes. The pandas indexing documentation covers this in more detail.
In [1]: df = pd.DataFrame(np.random.randn(6),index=pd.date_range('2015-01-15','2015-01-20'))
In [2]: s = pd.Series(np.random.randn(6),index=pd.date_range('2015-01-15','2015-01-20'))
In [3]: key = datetime.datetime(2015,1,15)
This selects from the index axis:
In [4]: df.loc[key]
Out[4]:
0 0.562973
Name: 2015-01-15 00:00:00, dtype: float64
So does this:
In [5]: s.loc[key]
Out[5]: 1.1151835852265839
And so does this (because a Series has only one axis!):
In [6]: s[key]
Out[6]: 1.1151835852265839
Here are the DataFrame's columns:
In [8]: df.columns
Out[8]: Int64Index([0], dtype='int64')
__getitem__ (that is, the [] operator) on a DataFrame selects columns by default!
In [9]: df[0]
Out[9]:
2015-01-15 0.562973
2015-01-16 -1.112382
2015-01-17 0.279265
2015-01-18 -0.919848
2015-01-19 -1.156900
2015-01-20 -0.887971
Freq: D, Name: 0, dtype: float64
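To make the axis difference concrete, here is a minimal sketch. It assumes a reasonably recent pandas, uses np.arange instead of random data so the values are predictable, and the helper names (in_columns, in_index, column_lookup_failed) are purely illustrative.

```python
from datetime import datetime

import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-15", "2015-01-20")
df = pd.DataFrame(np.arange(6.0), index=idx)  # a single column labelled 0
key = datetime(2015, 1, 15)

# The key is a row label, not a column label.
in_columns = key in df.columns
in_index = key in df.index

# df[key] is a column lookup, so a row label is not found there.
try:
    df[key]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# .loc states explicitly that the lookup is on the index (row) axis.
row = df.loc[key]
```

Using .loc for row lookups avoids the ambiguity entirely, since it never falls back to column selection for a single label.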
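Don't be confused: a DataFrame does allow this as a convenience when you select with a partial slice (the slice could also start from datetime(2015,1,15)), but it must be a slice. Slicing rows by date is such a common operation on time series that it is supported (I find it slightly confusing, but it has been there since the early days of pandas).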
See partial string indexing:
In [13]: df['20150115':]
Out[13]:
0
2015-01-15 0.562973
2015-01-16 -1.112382
2015-01-17 0.279265
2015-01-18 -0.919848
2015-01-19 -1.156900
2015-01-20 -0.887971
[6 rows x 1 columns]
The same works on a Series:
In [15]: s['20150115':]
Out[15]:
2015-01-15 1.115184
2015-01-16 0.604819
2015-01-17 -0.112881
2015-01-18 -1.234023
2015-01-19 1.264301
2015-01-20 -0.873921
Freq: D, dtype: float64
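The slice form also accepts datetime objects rather than strings. The sketch below is an illustration under the assumption of a recent pandas release; the variable names (start, df_tail, s_tail, df_tail_loc) are made up for this example, and np.arange replaces the random data so the result is predictable.

```python
from datetime import datetime

import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-15", "2015-01-20")
df = pd.DataFrame(np.arange(6.0), index=idx)
s = pd.Series(np.arange(6.0), index=idx)

start = datetime(2015, 1, 17)

# A slice inside [] selects rows on both a DataFrame and a Series.
df_tail = df[start:]
s_tail = s[start:]

# The explicit .loc spelling selects the same rows.
df_tail_loc = df.loc[start:]
```

If the intent is always row selection, df.loc[start:] is arguably the clearer spelling, since it reads the same for a single label and for a slice.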