Read PLINK files into pandas data frames
pandas-plink
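pandas-plink is a Python package for reading the PLINK binary file format and, as of version 2.0.0, realized relationship matrices produced by PLINK and GCTA. Files are read via lazy loading: only the genotypes the user actually accesses are read from disk, which saves memory.
Notable changes can be found in CHANGELOG.md.
Install
Using pip:
pip install pandas-plink
Or it can be installed via conda: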
conda install -c conda-forge pandas-plink
Usage
It is as simple as
>>> from pandas_plink import read_plink1_bin
>>> G = read_plink1_bin("chr11.bed", "chr11.bim", "chr11.fam", verbose=False)
>>> print(G)
<xarray.DataArray 'genotype' (sample: 14, variant: 779)>
dask.array<shape=(14, 779), dtype=float64, chunksize=(14, 779)>
Coordinates:
  * sample   (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
  * variant  (variant) object '11_316849996' '11_316874359' ... '11_345698259'
    father   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    fid      (sample) <U4 'B001' 'B002' 'B003' 'B004' ... 'B012' 'B013' 'B014'
    gender   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    i        (sample) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13
    iid      (sample) <U4 'B001' 'B002' 'B003' 'B004' ... 'B012' 'B013' 'B014'
    mother   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    trait    (sample) <U2 '-9' '-9' '-9' '-9' '-9' ... '-9' '-9' '-9' '-9' '-9'
    a0       (variant) <U1 'C' 'G' 'G' 'C' 'C' 'T' ... 'T' 'A' 'C' 'A' 'A' 'T'
    a1       (variant) <U1 'T' 'C' 'C' 'T' 'T' 'A' ... 'C' 'G' 'T' 'G' 'C' 'C'
    chrom    (variant) <U2 '11' '11' '11' '11' '11' ... '11' '11' '11' '11' '11'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int64 157439 181802 248969 ... 28937375 28961091 29005702
    snp      (variant) <U9 '316849996' '316874359' ... '345653648' '345698259'
>>> print(G.sel(sample="B003", variant="11_316874359").values)
0.0
>>> print(G.a0.sel(variant="11_316874359").values)
G
>>> print(G.sel(sample="B003", variant="11_316941526").values)
2.0
>>> print(G.a1.sel(variant="11_316941526").values)
C
Portions of the genotype are read only when the user accesses them.
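To make the lazy-loading idea concrete, here is a minimal standard-library sketch. It is not how pandas-plink is implemented; it only illustrates reading the bytes of a single variant instead of the whole file. It assumes the variant-major PLINK 1 `.bed` layout (three header bytes, then two-bit genotype codes packed four samples per byte). The names `read_variant` and `write_demo_bed`, the demo file, and the choice to report counts of the second allele listed in the `.bim` file are all assumptions made for illustration.

```python
import os
import tempfile

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # header bytes of a variant-major .bed file

# Two-bit code -> copies of the second allele listed in the .bim file.
# Code 0b01 marks a missing call. Reporting counts this way is this sketch's choice.
CODE_TO_COUNT = {0b00: 0, 0b01: None, 0b10: 1, 0b11: 2}


def read_variant(bed_path, n_samples, variant_index):
    """Read one variant's genotypes, touching only that variant's bytes."""
    bytes_per_variant = (n_samples + 3) // 4  # four samples per byte, padded
    with open(bed_path, "rb") as f:
        if f.read(3) != BED_MAGIC:
            raise ValueError("not a variant-major PLINK 1 .bed file")
        # Jump straight to the requested variant; earlier variants are skipped.
        f.seek(3 + variant_index * bytes_per_variant)
        block = f.read(bytes_per_variant)
    if len(block) != bytes_per_variant:
        raise IndexError("variant index out of range")
    counts = []
    for s in range(n_samples):
        # The lowest-order bit pair of each byte holds the earliest sample.
        code = (block[s // 4] >> (2 * (s % 4))) & 0b11
        counts.append(CODE_TO_COUNT[code])
    return counts


def write_demo_bed(path):
    """Write a tiny hypothetical .bed file: 5 samples, 2 variants."""
    with open(path, "wb") as f:
        f.write(BED_MAGIC)
        f.write(bytes([0x78, 0x03]))  # variant 0: codes 00 10 11 01, then 11
        f.write(bytes([0x8F, 0x00]))  # variant 1: codes 11 11 00 10, then 00


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bed")
    write_demo_bed(demo_path)
    # After the header check, only the two bytes of variant 1 are read.
    second_variant = read_variant(demo_path, n_samples=5, variant_index=1)

print(second_variant)  # [2, 2, 0, 1, 0]
```

With pandas-plink itself none of this is needed: selecting from `G` with `.sel(...)` and then reading `.values`, as in the session above, is what triggers the read.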
Covariance matrices can also be read just as easily. Example:
>>> from pandas_plink import read_rel
>>> K = read_rel("plink2.rel.bin")
>>> print(K)
<xarray.DataArray (sample_0: 10, sample_1: 10)>
array([[ 0.885782,  0.233846, -0.186339, -0.009789, -0.138897,  0.287779,
         0.269977, -0.231279, -0.095472, -0.213979],
       [ 0.233846,  1.077493, -0.452858,  0.192877, -0.186027,  0.171027,
         0.406056, -0.013149, -0.131477, -0.134314],
       [-0.186339, -0.452858,  1.183312, -0.040948, -0.146034, -0.204510,
        -0.314808, -0.042503,  0.296828, -0.011661],
       [-0.009789,  0.192877, -0.040948,  0.895360, -0.068605,  0.012023,
         0.057827, -0.192152, -0.089094,  0.174269],
       [-0.138897, -0.186027, -0.146034, -0.068605,  1.183237,  0.085104,
        -0.032974,  0.103608,  0.215769,  0.166648],
       [ 0.287779,  0.171027, -0.204510,  0.012023,  0.085104,  0.956921,
         0.065427, -0.043752, -0.091492, -0.227673],
       [ 0.269977,  0.406056, -0.314808,  0.057827, -0.032974,  0.065427,
         0.714746, -0.101254, -0.088171, -0.063964],
       [-0.231279, -0.013149, -0.042503, -0.192152,  0.103608, -0.043752,
        -0.101254,  1.423033, -0.298255, -0.074334],
       [-0.095472, -0.131477,  0.296828, -0.089094,  0.215769, -0.091492,
        -0.088171, -0.298255,  0.910274, -0.024663],
       [-0.213979, -0.134314, -0.011661,  0.174269,  0.166648, -0.227673,
        -0.063964, -0.074334, -0.024663,  0.914586]])
Coordinates:
  * sample_0  (sample_0) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
  * sample_1  (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    fid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    iid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
>>> print(K.values)
[[ 0.89  0.23 -0.19 -0.01 -0.14  0.29  0.27 -0.23 -0.10 -0.21]
 [ 0.23  1.08 -0.45  0.19 -0.19  0.17  0.41 -0.01 -0.13 -0.13]
 [-0.19 -0.45  1.18 -0.04 -0.15 -0.20 -0.31 -0.04  0.30 -0.01]
 [-0.01  0.19 -0.04  0.90 -0.07  0.01  0.06 -0.19 -0.09  0.17]
 [-0.14 -0.19 -0.15 -0.07  1.18  0.09 -0.03  0.10  0.22  0.17]
 [ 0.29  0.17 -0.20  0.01  0.09  0.96  0.07 -0.04 -0.09 -0.23]
 [ 0.27  0.41 -0.31  0.06 -0.03  0.07  0.71 -0.10 -0.09 -0.06]
 [-0.23 -0.01 -0.04 -0.19  0.10 -0.04 -0.10  1.42 -0.30 -0.07]
 [-0.10 -0.13  0.30 -0.09  0.22 -0.09 -0.09 -0.30  0.91 -0.02]
 [-0.21 -0.13 -0.01  0.17  0.17 -0.23 -0.06 -0.07 -0.02  0.91]]
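For intuition about what a binary relationship matrix file can contain, here is a small standard-library sketch. It rests on an explicit assumption: the file is a headerless, square, row-major matrix of little-endian float64 values, with sample IDs stored elsewhere. Real files written by PLINK or GCTA may use other layouts (triangular or single-precision, for instance), and `read_rel` is the supported way to load them. The function `read_square_matrix` and the demo file name are hypothetical; the demo values are the top-left 2x2 corner of the matrix printed above.

```python
import os
import struct
import tempfile


def read_square_matrix(bin_path, n_samples):
    """Read an n-by-n matrix of little-endian float64 values stored row by row."""
    expected = n_samples * n_samples * 8  # 8 bytes per float64 (assumed layout)
    with open(bin_path, "rb") as f:
        raw = f.read()
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, found {len(raw)}")
    flat = struct.unpack(f"<{n_samples * n_samples}d", raw)
    return [list(flat[r * n_samples:(r + 1) * n_samples]) for r in range(n_samples)]


# Demo values: the top-left 2x2 corner of the matrix shown above.
sample_ids = ["HG00419", "HG00650"]
corner = [[0.885782, 0.233846], [0.233846, 1.077493]]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.rel.bin")  # hypothetical file name
    with open(demo_path, "wb") as f:
        for row in corner:
            f.write(struct.pack("<2d", *row))
    K_demo = read_square_matrix(demo_path, n_samples=len(sample_ids))

# Label-based lookup, similar in spirit to selecting by sample ID.
index = {sid: i for i, sid in enumerate(sample_ids)}
relatedness = K_demo[index["HG00419"]][index["HG00650"]]
print(relatedness)  # 0.233846
```

The matrix is symmetric, so looking up the pair in either order gives the same value.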
Please refer to the pandas-plink documentation for more information.
Authors
License
This project is licensed under the MIT License.