Optimizing Pandas MultiIndex lookups

1 vote

1 answer

1487 views

Asked 2025-04-17 22:38

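I'm using Pandas 0.12.0. Suppose multi_df is a Pandas DataFrame with a MultiIndex, and I have a long list of tuples (index keys) called look_up_list. For each tuple in look_up_list that is present in multi_df, I want to perform some operation.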

My code is below. Is there a faster way to do this? In practice, both len(multi_df) and len(look_up_list) are large, so I need to optimize this line: [multi_df.ix[idx]**2 for idx in look_up_list if idx in multi_df.index].

In particular, line_profiler tells me that the conditional check, if idx in multi_df.index, takes a long time.

import pandas as pd
df = pd.DataFrame({'id' : range(1,9),
                    'code' : ['one', 'one', 'two', 'three',
                                'two', 'three', 'one', 'two'],
                    'colour': ['black', 'white','white','white',
                            'black', 'black', 'white', 'white'],
                    'texture': ['soft', 'soft', 'hard','soft','hard',
                                        'hard','hard','hard'],
                    'shape': ['round', 'triangular', 'triangular','triangular','square',
                                        'triangular','round','triangular']
                    },  columns= ['id','code','colour', 'texture', 'shape'])
multi_df = df.set_index(['code','colour','texture','shape']).sort_index()['id']

# define the list of index tuples that I want to look up in multi_df
look_up_list = [('two', 'white', 'hard', 'triangular'),('five', 'black', 'hard', 'square'),('four', 'black', 'hard', 'round') ] 
# run a list comprehension
[multi_df.ix[idx]**2 for idx in look_up_list if idx in multi_df.index]

P.S. The actual operation in the list comprehension is not multi_df.ix[idx]**2 but something like a_slow_function(multi_df.ix[idx]).


1 Answer

You could perhaps use multi_df.loc[look_up_list].dropna().

import pandas as pd
df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']
     }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

# define the list of index tuples that I want to look up in multi_df
look_up_list = [('two', 'white', 'hard', 'triangular'), (
    'five', 'black', 'hard', 'square'), ('four', 'black', 'hard', 'round')]

subdf = multi_df.loc[look_up_list].dropna()
print(subdf ** 2)

This yields

(two, white, hard, triangular)     9
(two, white, hard, triangular)    64
Name: id, dtype: float64
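
The float64 dtype in that output comes from the NaN rows that .loc introduces for missing keys before dropna() removes them. Newer pandas releases may instead raise a KeyError when the list passed to .loc contains labels that are not in the index. A minimal sketch of an alternative, assuming a pandas version where MultiIndex.isin accepts a list of tuples, is to build a boolean mask and select with it; this is a swapped-in technique, not the method used in the answer above:

```python
import pandas as pd

df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']
     }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

look_up_list = [('two', 'white', 'hard', 'triangular'),
                ('five', 'black', 'hard', 'square'),
                ('four', 'black', 'hard', 'round')]

# Boolean array: True for each row whose full index tuple is in look_up_list
mask = multi_df.index.isin(look_up_list)

# Selecting with the mask never touches the missing keys, so no NaN rows
# are created and the integer dtype of 'id' is preserved
subdf = multi_df[mask]
print(subdf ** 2)
```

Because the missing tuples are filtered out by the mask rather than looked up, no dropna() step is needed.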

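Notes:

multi_df as defined above is a Series, not a DataFrame. I don't think that affects the solution.
The code you posted raises IndexingError: Too many indexers, so I had to guess somewhat at what it was meant to do.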
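
If the real per-key operation is something like a_slow_function and cannot be vectorized, one possible approach is to keep the loop but move the membership test onto a plain Python set built once from the index. The sketch below is only an illustration under assumptions: a_slow_function here is a hypothetical stand-in (the question does not show the real one), and whether this beats idx in multi_df.index depends on the pandas version and the data.

```python
import pandas as pd


def a_slow_function(values):
    # Hypothetical stand-in for the expensive operation from the question;
    # it squares the matched values and returns them as plain Python data
    return (values ** 2).tolist()


df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']
     }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

look_up_list = [('two', 'white', 'hard', 'triangular'),
                ('five', 'black', 'hard', 'square'),
                ('four', 'black', 'hard', 'round')]

# Iterating a MultiIndex yields tuples, so the set holds the distinct keys;
# it is built once, and each later membership test is a hash lookup
index_keys = set(multi_df.index)
present = [idx for idx in look_up_list if idx in index_keys]

# Only keys known to exist reach .loc, so no lookup can fail
results = {idx: a_slow_function(multi_df.loc[idx]) for idx in present}
print(results)
```

Building the set costs one pass over the index, so this is mainly worthwhile when look_up_list is long relative to that one-off cost.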

Answered 2025-04-17 by Python大师
