Vectorizing a statistical odds-ratio test on a large pandas dataset

Posted 2024-04-20 11:10:54


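I am looking for a faster way to run odds-ratio tests on a large dataset. I have about 1,200 variables (see var_col) that I want to test against each other for mutual exclusivity/co-occurrence. The odds ratio is defined as (a*d)/(b*c), where a, b, c and d are the numbers of samples (a) altered at neither site x nor site y, (b) altered at site x but not at y, (c) altered at y but not at x, and (d) altered at both. I would also like to compute Fisher's exact test to determine statistical significance. The scipy function fisher_exact can compute both of these (see below).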
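As a minimal illustration of the definition above, the sketch below builds one 2x2 table from made-up counts (the values of a, b, c and d are hypothetical, not taken from the dataset) and compares the hand-computed ratio with what fisher_exact returns.

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical counts, for illustration only
a, b, c, d = 8, 2, 1, 5

# 2x2 contingency table laid out as [[a, b], [c, d]]
table = np.array([[a, b], [c, d]])

# Odds ratio computed directly from the definition
manual_odds_ratio = (a * d) / (b * c)

# fisher_exact returns the sample odds ratio and the two-sided p-value
oddsratio, pvalue = fisher_exact(table)

print(manual_odds_ratio, oddsratio, pvalue)
```

For a table with no zero cells, the odds ratio returned by fisher_exact should match the hand-computed value, so only the p-value needs the exact test.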

#here's a sample of my original dataframe
sample_id_no  var_col
       0    258.0
       1    -24.0
       2   -150.0
       3    149.0
       4    108.0
       5   -126.0
       6    -83.0
       7      2.0
       8   -177.0
       9   -171.0
      10     -7.0
      11   -377.0
      12   -272.0
      13     66.0
      14    -13.0
      15     -7.0
      16      0.0
      17    189.0
      18      7.0
      13    -21.0
      19     80.0
      20    -14.0
      21    -76.0
       3     83.0
      22   -182.0

import itertools

import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

# df is the original dataframe sampled above (columns: sample_id_no, var_col)

# getOddsRatio must be defined before it is passed to apply()
def getOddsRatio(pairs, sample_table):
    # build the 2x2 presence/absence table for one pair of variables and test it
    alpha_site, beta_site = pairs
    oddsratio, pvalue = fisher_exact(pd.crosstab(sample_table[alpha_site] > 0, sample_table[beta_site] > 0))
    return [oddsratio, pvalue]

# create a dataframe with each possible pair of variables
var_pairs = pd.DataFrame(list(itertools.combinations(df.var_col.unique(), 2))).rename(columns={0: 'alpha_site', 1: 'beta_site'})

# create a cross-tab with samples and vars
sample_table = pd.crosstab(df.sample_id_no, df.var_col)

# apply the test to every pair, one row at a time
odds_ratio_results = var_pairs.apply(getOddsRatio, axis=1, args=(sample_table,))

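This code runs very slowly, especially on large datasets. In my actual dataset I have about 700k variable pairs. Since getOddsRatio() is applied to each pair separately, it is almost certainly the main cause of the slowdown. Is there a more efficient solution?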
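One possible direction, sketched here under assumptions rather than offered as a definitive solution: the four cell counts for every pair can be derived from a single matrix product on the boolean sample-by-variable table, so pd.crosstab is no longer called once per pair. The tiny dataframe and the names present, both, only_alpha, only_beta and neither are hypothetical and used only for illustration. The p-value still calls fisher_exact once per pair, but on precomputed counts.

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

# Tiny made-up dataset in the same shape as the sample above (hypothetical values)
df = pd.DataFrame({
    "sample_id_no": [0, 0, 1, 1, 2, 3, 3, 4, 5],
    "var_col":      [1.0, 2.0, 1.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0],
})

# Boolean sample-by-variable table: True where the sample carries the variable
present = pd.crosstab(df.sample_id_no, df.var_col) > 0
X = present.to_numpy().astype(np.int64)
n_samples = X.shape[0]

# One matrix product gives the co-occurrence count for every pair at once
both = X.T @ X                              # altered at both sites
totals = np.diag(both)                      # samples altered at each single site
only_alpha = totals[:, None] - both         # altered at the row site only
only_beta = totals[None, :] - both          # altered at the column site only
neither = n_samples - totals[:, None] - totals[None, :] + both

# Keep each unordered pair once (upper triangle, diagonal excluded)
i, j = np.triu_indices(X.shape[1], k=1)

# Zero denominators give inf or nan here; fisher_exact may use a
# different convention for those edge cases (assumption: acceptable)
with np.errstate(divide="ignore", invalid="ignore"):
    odds = (neither[i, j] * both[i, j]) / (only_alpha[i, j] * only_beta[i, j])

# p-values still need one fisher_exact call per pair, but on ready-made counts
pvalues = [
    fisher_exact([[n, ob], [oa, bt]])[1]
    for n, ob, oa, bt in zip(neither[i, j], only_beta[i, j], only_alpha[i, j], both[i, j])
]

sites = present.columns.to_numpy()
pairs = pd.DataFrame({
    "alpha_site": sites[i],
    "beta_site": sites[j],
    "oddsratio": odds,
    "pvalue": pvalues,
})
print(pairs)
```

With about 1,200 variables the intermediate matrices are only 1200 x 1200, so memory should not be a concern. If many pairs share the same four counts, caching the p-value per distinct count tuple might reduce the number of fisher_exact calls further, though that is untested here.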
