Merging pandas DataFrames based on relations between multiple columns

Question

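Suppose you have one DataFrame of regions (start and end coordinates) and another DataFrame of positions, each of which may or may not fall within a region. For example: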

import pandas as pd

region = pd.DataFrame({'chromosome': [1, 1, 1, 1, 2, 2, 2, 2], 'start': [1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000], 'end': [2000, 3000, 4000, 5000, 2000, 3000, 4000, 5000]})
position = pd.DataFrame({'chromosome': [1, 2, 1, 3, 2, 1, 1], 'BP': [1500, 1100, 10000, 2200, 3300, 400, 5000]})
print(region)
print(position)


   chromosome   end  start
0           1  2000   1000
1           1  3000   2000
2           1  4000   3000
3           1  5000   4000
4           2  2000   1000
5           2  3000   2000
6           2  4000   3000
7           2  5000   4000

      BP  chromosome
0   1500           1
1   1100           2
2  10000           1
3   2200           3
4   3300           2
5    400           1
6   5000           1

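A position is within a region if all of the following hold:

(position['BP'] >= region['start']) &
(position['BP'] <= region['end']) &
(position['chromosome'] == region['chromosome'])

Each position falls within at most one region, but it may not fall within any.

What is the best way to merge these two DataFrames so that, whenever a position falls within a region, the corresponding region columns are added to the position DataFrame? The result would look something like this: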

      BP  chromosome  start  end
0   1500           1  1000   2000
1   1100           2  1000   2000
2  10000           1  NA     NA
3   2200           3  NA     NA
4   3300           2  3000   4000
5    400           1  NA     NA
6   5000           1  4000   5000

One approach is to write a function that computes the relation I want, and then use the DataFrame's .apply method, like this:

def within(pos, regs):
    # Boolean mask over all regions: same chromosome and start <= BP <= end
    istrue = ((pos['chromosome'] == regs['chromosome'])
              & (pos['BP'] >= regs['start'])
              & (pos['BP'] <= regs['end']))
    if istrue.any():
        # Use the first matching region
        ind = regs.index[istrue][0]
        return regs.loc[ind, ['start', 'end']]
    # No region contains this position
    return pd.Series([None, None], index=['start', 'end'])

position[['start', 'end']] = position.apply(lambda x: within(x, region), axis=1)
print(position)

      BP  chromosome  start   end
0   1500           1   1000  2000
1   1100           2   1000  2000
2  10000           1    NaN   NaN
3   2200           3    NaN   NaN
4   3300           2   3000  4000
5    400           1    NaN   NaN
6   5000           1   4000  5000

However, I am hoping there is a more efficient way than scanning all N regions, at O(N) cost, for every position. Thanks!
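
One direction that might avoid the per-position scan is `pandas.merge_asof`, which does a sorted nearest-key join. The following is only a minimal sketch, not a benchmarked solution. It assumes that regions on the same chromosome do not overlap except at shared endpoints, so that the region with the largest `start` not exceeding `BP` is the only candidate. The names `pos_sorted`, `reg_sorted`, `merged`, `inside` and `result` are illustrative.

```python
import pandas as pd

region = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 2, 2, 2, 2],
    'start': [1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000],
    'end': [2000, 3000, 4000, 5000, 2000, 3000, 4000, 5000],
})
position = pd.DataFrame({
    'chromosome': [1, 2, 1, 3, 2, 1, 1],
    'BP': [1500, 1100, 10000, 2200, 3300, 400, 5000],
})

# merge_asof requires both frames to be sorted by their join keys.
# reset_index() keeps the original row order in an 'index' column.
pos_sorted = position.reset_index().sort_values('BP')
reg_sorted = region.sort_values('start')

# For each position, pick the region on the same chromosome with the
# largest start <= BP (direction='backward').
merged = pd.merge_asof(
    pos_sorted, reg_sorted,
    left_on='BP', right_on='start',
    by='chromosome', direction='backward',
)

# The candidate only counts if BP also lies at or before its end;
# otherwise blank out the region columns.
inside = merged['BP'] <= merged['end']
merged['start'] = merged['start'].where(inside)
merged['end'] = merged['end'].where(inside)

# Restore the original row order of `position`.
result = merged.sort_values('index').set_index('index').rename_axis(None)
print(result)
```

With this approach the cost is dominated by sorting the two frames rather than by comparing every position against every region. If a position sits exactly on a boundary shared by two regions (for example BP = 2000), this sketch assigns it to the region that starts there.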
