Using Python, I want to extract and print the blocks whose first line contains 10917, 11396, or 1116920

Posted 2024-06-09 00:05:42

Each block starts with hg19 and ends with a blank line. Can I use a regular expression to extract the required blocks?

hg19.chr1 10917 479
panTro2.chr15 13606 455

hg19.chr1 11396 93
panTro2.chr15 14061 42
bosTau4.chr5 113864279 105

hg19.chr1 11489 81
panTro2.chr15 14103 81
bosTau4.chr5 113864398 80
equCab2.chr6 54105327 83
canFam2.chr27 45128907 82
calJac1.Contig8673 78513 67

hg19.chr1 1116920 38
panTro2.chr1 1103202 38
gorGor1.Supercontig_0004540 23214 38
ponAbe2.chr1 534356 38
papHam1.scaffold19767 38455 38
calJac1.Contig4288 217257 29
micMur1.scaffold_101519 296 37
dipOrd1.scaffold_7421 49811 22
cavPor3.scaffold_186 248497 22
bosTau4.chr16 29320296 47
equCab2.chr2 72413055 53
felCat3.scaffold_124042 293309 9

hg19.chr1 1116863 57
papHam1.scaffold19767 38399 56
ponAbe2.chr1 534300 56


and so on...

I have tried various regular expressions, but none of them worked.


1 answer
User
#1 · Posted 2024-06-09 00:05:42

The following reads the data from a file named input.txt, builds a list of all blocks, filters that list down to the required entries, and then prints them:

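import re

# Read the whole file into memory.
with open('input.txt') as f_input:
    data = f_input.read()

# A block runs from a line starting with "hg19." up to the next such line
# (or the end of the input); trailing newlines are left out of the match.
blocks = re.findall(r'(^hg19\..*?)\n*?(?=^hg19\.|\Z)', data, re.S | re.M)

# Keep only blocks whose first line has an allowed value in its second column.
allowed = {"10917", "11396", "1116920"}
blocks = [block for block in blocks if block.split('\n', 1)[0].split()[1] in allowed]

# print(...) with a single argument behaves the same on Python 2.7 and Python 3.
for block in blocks:
    print(block)
    print('  ')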

This displays the following:

hg19.chr1 10917 479
panTro2.chr15 13606 455
  
hg19.chr1 11396 93
panTro2.chr15 14061 42
bosTau4.chr5 113864279 105
  
hg19.chr1 1116920 38
panTro2.chr1 1103202 38
gorGor1.Supercontig_0004540 23214 38
ponAbe2.chr1 534356 38
papHam1.scaffold19767 38455 38
calJac1.Contig4288 217257 29
micMur1.scaffold_101519 296 37
dipOrd1.scaffold_7421 49811 22
cavPor3.scaffold_186 248497 22
bosTau4.chr16 29320296 47
equCab2.chr2 72413055 53
felCat3.scaffold_124042 293309 9
  

This assumes your file is small enough to fit comfortably in memory in one go. Tested with Python 2.7.6.
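If the file is too large to read in one go, one possible alternative is to walk through it line by line and assemble blocks as you go. The following is only a sketch, not part of the tested answer: the helper names `iter_blocks` and `select_blocks` are made up for illustration, and it assumes a blank line (or the next `hg19.` line) closes a block. Unlike the regex version, any stray text after a blank line is ignored rather than attached to the last block.

```python
def iter_blocks(lines):
    """Yield each block as a list of lines; a block starts at a line beginning with 'hg19.'."""
    block = []
    for line in lines:
        line = line.rstrip('\r\n')
        if not line.strip():
            # A blank line closes the current block, if there is one.
            if block:
                yield block
                block = []
        elif line.startswith('hg19.'):
            # A new hg19 line also closes any block that is still open.
            if block:
                yield block
            block = [line]
        elif block:
            # Non-blank line inside an open block.
            block.append(line)
    if block:
        yield block


def select_blocks(lines, allowed):
    """Yield only the blocks whose first line has an allowed value in its second column."""
    for block in iter_blocks(lines):
        fields = block[0].split()
        if len(fields) > 1 and fields[1] in allowed:
            yield block


# Small in-memory sample taken from the question's data; an open file object
# can be passed in place of this list, since it also iterates line by line.
sample = """hg19.chr1 10917 479
panTro2.chr15 13606 455

hg19.chr1 11489 81
panTro2.chr15 14103 81

hg19.chr1 11396 93
panTro2.chr15 14061 42
bosTau4.chr5 113864279 105
""".splitlines()

allowed = {"10917", "11396", "1116920"}
selected = list(select_blocks(sample, allowed))
for block in selected:
    print('\n'.join(block))
    print('')
```

Because both helpers are generators, only one block is held in memory at a time when a file object is passed in, for example `select_blocks(f_input, allowed)` inside a `with open('input.txt') as f_input:` block.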
