<p>Setting an index on the merge column does indeed speed things up. Below is a more realistic version of @julien marrec's answer.</p>
<pre><code>import pandas as pd
import numpy as np

# One million unique integer keys; df2 holds the same keys in shuffled order.
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))
# Each %%timeit / %%time block below is a separate IPython cell.

%%timeit
# Baseline: merge on plain columns.
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
#1 loop, best of 3: 664 ms per loop

%%timeit
# Build both indexes inside the timed statement, then join on them.
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
#1 loop, best of 3: 354 ms per loop

%%time
# Set the indexes once, in place, so the next join can reuse them.
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
#Wall time: 16 ms

%%timeit
# Join on the indexes that already exist.
x = df1.join(df2, how='left')
#10 loops, best of 3: 80.4 ms per loop
</code></pre>
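<p>Even when the integers in the join column are not in the same order in the two tables, you can still expect a large speedup of roughly 8x (664 ms for the column merge versus 80.4 ms for the join on indexes that are already set).</p>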
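<p>To reproduce the comparison outside IPython, here is a minimal sketch that times both approaches with <code>time.perf_counter</code> and checks that they pair each row with the same <code>B2</code> value. The helper names (<code>build_frames</code>, <code>merge_on_columns</code>, <code>join_on_index</code>), the smaller table size and the fixed seed are my own assumptions for illustration, not part of the original benchmark, and a single timed run is much noisier than <code>%%timeit</code>, so treat the printed timings as indicative only.</p>

```python
import time

import numpy as np
import pandas as pd


def build_frames(n=100_000, seed=0):
    """Build two frames sharing the same unique integer keys in different order.

    n and seed are illustrative choices, smaller than the original benchmark.
    """
    rng = np.random.default_rng(seed)
    ids = rng.choice(np.arange(10 * n), size=n, replace=False)
    df1 = pd.DataFrame({"A": ids, "B": rng.integers(0, 1000, n)})
    df2 = pd.DataFrame({"A2": rng.permutation(ids), "B2": rng.integers(0, 1000, n)})
    return df1, df2


def merge_on_columns(df1, df2):
    # Left merge on plain (non-index) columns.
    return df1.merge(df2, how="left", left_on="A", right_on="A2")


def join_on_index(df1_indexed, df2_indexed):
    # Left join on indexes that were set beforehand.
    return df1_indexed.join(df2_indexed, how="left")


df1, df2 = build_frames()

# Set the indexes once, outside the timed region, as in the last benchmark.
df1_idx = df1.set_index("A")
df2_idx = df2.set_index("A2")

t0 = time.perf_counter()
merged = merge_on_columns(df1, df2)
t1 = time.perf_counter()
joined = join_on_index(df1_idx, df2_idx)
t2 = time.perf_counter()

# Both left joins keep the row order of df1, so B2 can be compared positionally.
same = np.array_equal(merged["B2"].to_numpy(), joined["B2"].to_numpy())
print(f"merge: {t1 - t0:.4f}s, indexed join: {t2 - t1:.4f}s, same B2: {same}")
```

<p>Note that the two results differ in shape: the merge keeps <code>A</code> and <code>A2</code> as ordinary columns, while the join carries the key in the index, so call <code>reset_index()</code> on the joined frame if you need the key back as a column.</p>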