I have two files like these:
Reference panel (ReferencePanel.csv)
"id","position","allele0","allele1","allele1_frequency"
"seq-rs1010355",55102179,"T","C",0.098
"seq-rs272408",55103603,"C","T",0.787
"seq-rs11669899",55104559,"A","T",0.029
"imm_19_59798585",55106773,"A","G",0.499
BIM file (MyFile.bim)
19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
19 19:55104559 0 55104559 G C
19 19:55106773 0 55106773 A T
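I want to delete every line in the BIM file whose two alleles differ from those in the reference panel. In other words, I only want to keep the lines that have exactly the same alleles as the reference panel; the order of the alleles does not matter.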
Example:
Reference alleles:
"seq-rs1010355",55102179,"T","C",0.098
"seq-rs272408",55103603,"C","T",0.787
"seq-rs11669899",55104559,"A","T",0.029
"imm_19_59798585",55106773,"A","G",0.499
BIM file (MyFile.bim)
19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
19 19:55104559 0 55104559 G C
19 19:55106773 0 55106773 A T
Only the following lines should be kept:
19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
I used these lines to extract all the positions from the reference panel:
# Create an empty list
positions = []
# Populate the list with positions
for line in open("ReferencePanel.csv"):
    columns = line.split(",")
    positions.append(columns[1])
# Remove the first element, which corresponds to the header
positions.pop(0)
But I am stuck here. I hope someone can help me. Thanks in advance!
If you are not opposed to using awk, you can use the following command:
This results in:
Note: the last line matches line 4 of the reference file (with A,T).
Explanation:
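-F'[",]*'
matches the CSV delimiters (quotes and commas) used to parse the reference file.
NR==FNR && $4 && $5 {ref[$4][$5]=1}
collects every C, T, G, A allele pair from the reference file (nested arrays such as ref[$4][$5] are a GNU awk extension).
NR>FNR {FS=" *"}
changes the awk field separator to spaces in order to parse the second file.
NR>FNR && ref[$6][$7]
prints a line of the second file if its columns 6 and 7 match what is stored in the array.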
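For readers who would rather stay in Python, as in the question, the following is a minimal sketch rather than a definitive solution. Unlike allele-only matching, it assumes rows should be paired by position (column 2 of the reference panel, column 4 of the BIM file) and compares the two alleles as an unordered set. The function names load_reference_alleles and filter_bim are hypothetical, and the sample data is copied inline from the question so the block runs on its own.

```python
import csv
import io

# Sample data copied from the question; real use would open the files instead.
REFERENCE_CSV = '''"id","position","allele0","allele1","allele1_frequency"
"seq-rs1010355",55102179,"T","C",0.098
"seq-rs272408",55103603,"C","T",0.787
"seq-rs11669899",55104559,"A","T",0.029
"imm_19_59798585",55106773,"A","G",0.499
'''

BIM_TEXT = '''19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
19 19:55104559 0 55104559 G C
19 19:55106773 0 55106773 A T
'''


def load_reference_alleles(handle):
    """Map each position to the unordered pair of alleles from the panel."""
    alleles = {}
    for row in csv.DictReader(handle):
        alleles[row["position"]] = frozenset((row["allele0"], row["allele1"]))
    return alleles


def filter_bim(lines, reference):
    """Yield BIM lines whose allele pair equals the reference pair at that position."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        position, pair = fields[3], frozenset(fields[4:6])
        # Assumption: a BIM row is kept only if its position exists in the panel
        # and both alleles agree, regardless of their order.
        if reference.get(position) == pair:
            yield line.rstrip("\n")


reference = load_reference_alleles(io.StringIO(REFERENCE_CSV))
kept = list(filter_bim(io.StringIO(BIM_TEXT), reference))
print("\n".join(kept))
```

To run this on real files, the two io.StringIO objects could be replaced with open("ReferencePanel.csv", newline="") and open("MyFile.bim"), assuming those are the actual file names. Using frozenset makes the comparison order-independent, so C T in the BIM file matches T,C in the panel.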