Remove SNPs with wrong alleles

Posted 2024-06-08 20:02:47


I have two files like this:

  1. Reference panel (ReferencePanel.csv)

    "id","position","allele0","allele1","allele1_frequency"
    "seq-rs1010355",55102179,"T","C",0.098
    "seq-rs272408",55103603,"C","T",0.787
    "seq-rs11669899",55104559,"A","T",0.029
    "imm_19_59798585",55106773,"A","G",0.499
    
  2. BIM file (myfile.bim)

    19    19:55102179    0    55102179    C    T
    19    19:55103603    0    55103603    C    T
    19    19:55104559    0    55104559    G    C
    19    19:55106773    0    55106773    A    T
    

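I want to remove every line of the BIM file whose two alleles differ from those in the reference panel. In other words, I only want to keep the lines that have exactly the same alleles as the reference panel; the order of the alleles does not matter.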

Example

Reference alleles:

"seq-rs1010355",55102179,"T","C",0.098
"seq-rs272408",55103603,"C","T",0.787
"seq-rs11669899",55104559,"A","T",0.029
"imm_19_59798585",55106773,"A","G",0.499

BIM file (myfile.bim):

19    19:55102179 0   55102179    C   T
19    19:55103603 0   55103603    C   T
19    19:55104559 0   55104559    G   C
19    19:55106773 0   55106773    A   T

Only the following lines should be kept:

19    19:55102179 0   55102179    C   T
19    19:55103603 0   55103603    C   T

I used these lines to extract all the positions from the reference panel:

# Collect the positions (second column) of the reference panel
positions = []

with open("ReferencePanel.csv") as panel:
    next(panel, None)  # skip the header row
    for line in panel:
        columns = line.split(",")
        positions.append(columns[1])

But I am stuck here. I hope someone can help me. Thanks in advance!


1 answer
User
#1 · Posted 2024-06-08 20:02:47

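If you don't mind using awk, you can use the following command (the nested-array syntax ref[$4][$5] requires GNU awk 4.0 or later):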

awk -F'[",]*' 'NR==FNR && $4 && $5 {ref[$4][$5]=1} NR>FNR {FS=" *"} NR>FNR && ref[$6][$7]' reference.csv myfile.bim

Result:

19    19:55102179    0    55102179    C    T
19    19:55103603    0    55103603    C    T
19    19:55106773    0    55106773    A    T

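Note: the last line is kept because its allele pair (A, T) matches line 4 of the reference file, which lists A, T at a different position (55104559). The command compares allele pairs only, not positions.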
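
Because the awk command compares allele pairs without looking at positions, its output has three lines rather than the two the question asks for. The following is a minimal Python sketch of a position-aware filter, not a definitive implementation. It assumes that the fourth BIM column corresponds to the "position" column of the panel, that both files cover a single chromosome, and that no strand flipping is needed. The function names load_reference and filter_bim are hypothetical, and the sample data from the question is inlined so the sketch runs on its own.

```python
import csv
import io


def load_reference(lines):
    """Map each position to the unordered pair of reference alleles."""
    reader = csv.DictReader(lines)
    return {
        row["position"]: frozenset((row["allele0"], row["allele1"]))
        for row in reader
    }


def filter_bim(bim_lines, reference):
    """Yield BIM lines whose position is in the panel with the same two alleles."""
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Assumption: column 4 is the position, columns 5 and 6 are the alleles.
        position, alleles = fields[3], frozenset(fields[4:6])
        if reference.get(position) == alleles:
            yield line.rstrip("\n")


# Sample data copied from the question.
REFERENCE_CSV = '''"id","position","allele0","allele1","allele1_frequency"
"seq-rs1010355",55102179,"T","C",0.098
"seq-rs272408",55103603,"C","T",0.787
"seq-rs11669899",55104559,"A","T",0.029
"imm_19_59798585",55106773,"A","G",0.499
'''

BIM = """19    19:55102179    0    55102179    C    T
19    19:55103603    0    55103603    C    T
19    19:55104559    0    55104559    G    C
19    19:55106773    0    55106773    A    T
"""

reference = load_reference(io.StringIO(REFERENCE_CSV))
kept = list(filter_bim(io.StringIO(BIM), reference))
for line in kept:
    print(line)
```

With real files, the two io.StringIO objects can be replaced by open("ReferencePanel.csv", newline="") and open("myfile.bim"). Using frozenset makes the comparison independent of allele order, so T/C in the panel matches C/T in the BIM file.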

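Explanation:

-F'[",]*' sets the field separator to the CSV delimiters (double quotes and commas) so that the reference file can be parsed

NR==FNR && $4 && $5 {ref[$4][$5]=1} runs on the reference file and stores the allele pair of every row (the C, T, G, A values) in the ref array; the $4 && $5 test skips lines where either field is empty

NR>FNR {FS=" *"} switches the awk field separator to spaces so that the second file can be parsed

NR>FNR && ref[$6][$7] prints a line of the second file when its columns 6 and 7 match a pair stored in the array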
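
The steps above can be mirrored in Python to make the logic easier to follow. This is a rough sketch rather than an exact port: the function name awk_like_filter is hypothetical, the separator pattern uses + instead of * because Python's re.split handles empty matches differently from awk, and the alleles are assumed to be the fifth and sixth whitespace-separated fields of each BIM line.

```python
import re


def awk_like_filter(reference_lines, bim_lines):
    """Keep BIM lines whose ordered allele pair occurs anywhere in the reference."""
    ref_pairs = set()
    for line in reference_lines:
        # Split on runs of quotes and commas, like -F'[",]*' in the command.
        fields = re.split(r'[",]+', line.strip())
        # fields[0] is empty because each line starts with a quote, so
        # fields[3] and fields[4] play the role of awk's $4 and $5.
        if len(fields) >= 5 and fields[3] and fields[4]:
            ref_pairs.add((fields[3], fields[4]))

    kept = []
    for line in bim_lines:
        fields = line.split()  # whitespace-separated BIM columns
        # Assumption: the two alleles are the fifth and sixth columns.
        if len(fields) >= 6 and (fields[4], fields[5]) in ref_pairs:
            kept.append(line.rstrip("\n"))
    return kept


# Sample data copied from the question.
reference_lines = [
    '"id","position","allele0","allele1","allele1_frequency"',
    '"seq-rs1010355",55102179,"T","C",0.098',
    '"seq-rs272408",55103603,"C","T",0.787',
    '"seq-rs11669899",55104559,"A","T",0.029',
    '"imm_19_59798585",55106773,"A","G",0.499',
]
bim_lines = [
    "19    19:55102179    0    55102179    C    T",
    "19    19:55103603    0    55103603    C    T",
    "19    19:55104559    0    55104559    G    C",
    "19    19:55106773    0    55106773    A    T",
]

result = awk_like_filter(reference_lines, bim_lines)
for line in result:
    print(line)
```

As in the awk command, the lookup is by ordered allele pair only. The header row also contributes a harmless ("allele0", "allele1") entry, and a BIM line is kept whenever its pair occurs on any reference row, whatever the position.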
