Regular expression for parsing sequence IDs

with open("composition.in","rb") as yeast_all:
    yeast_all=yeast_all.read() # convert file to string

## Regular expression to clean up rogue ">" characters
## i.e. "<i>", "<sub>", etc which screw up
## the structure of the eventual list
import re
id_delimeter = r'^>{1}+\w{7,10}+\s'
match=re.search(id_delimeter, yeast_all)
if match:
    print 'found', match.group()
else:
    print 'did not find'
yeast_all=yeast_all.split(id_delimeter)[1:]

1 answer

User

#1 · Posted 2024-06-07 10:09:02

Try this:

>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)

You will find the ID in the group id and the sequence in the group sequence.
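As a minimal sketch of reading those two groups, the match object exposes them through `group()` (or `groupdict()`). The sample string `record` is made up for illustration and is not taken from the question's file:

```python
import re

# Pattern from the answer; the raw string keeps the backslashes intact.
pattern = re.compile(r'>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)')

# Hypothetical record: a header line with metadata, then two sequence lines.
record = ">YKL068W-A some metadata\nABCD\nEFGH\n"

match = pattern.search(record)
if match:
    seq_id = match.group("id")            # text captured by (?P<id>...)
    sequence = match.group("sequence")    # text captured by (?P<sequence>...)
    print(seq_id)
    print(repr(sequence))
```

The sequence group still contains the line breaks between sequence lines; `sequence.replace("\n", "")` removes them if a single continuous string is wanted.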


Explanation:

> # start with a ">" character
(?P<id> # capture the ID in group "id"
    [\w-]+ # matches one or more word characters (A to Z, a to z, digits, and _) or dashes "-"
)
\s # after the ID, there must be a whitespace character
.* # consume the metadata part, which is of no interest here
\n # up to a newline
(?P<sequence> # finally, capture the sequence data in group "sequence"
    [\w\n]+ # matches one or more word characters and newlines
)
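The annotated form above is close to what Python's `re.VERBOSE` flag accepts directly: with that flag, unescaped whitespace and `#` comments inside the pattern are ignored. A sketch under that assumption follows; the variable names and the sample text are illustrative only:

```python
import re

# Commented pattern, compiled with re.VERBOSE so layout and comments are ignored.
verbose_pattern = re.compile(r"""
    >                      # literal > that starts the header line
    (?P<id>[\w-]+)         # the ID: word characters or dashes
    \s                     # one whitespace character after the ID
    .*                     # rest of the header line (metadata), ignored
    \n                     # end of the header line
    (?P<sequence>[\w\n]+)  # the sequence: word characters and newlines
""", re.VERBOSE)

# The same pattern written on one line, for comparison.
compact_pattern = re.compile(r'>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)')

sample = ">XYZ1234 description\nLMNOP\nQRST\n"

verbose_groups = verbose_pattern.search(sample).groupdict()
compact_groups = compact_pattern.search(sample).groupdict()
print(verbose_groups)
print(verbose_groups == compact_groups)
```

Keeping the comments inside the compiled pattern means the explanation cannot drift away from the regex that is actually run.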

As Python code:

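import re

text = '''>YKL068W-A
foo
ABCD

>XYZ1234
<><><><>><<<>
LMNOP'''

# Raw string so that \w and \n reach the regex engine unchanged
pattern = r'>(?P<id>[\w-]+)\n.*\n(?P<sequence>\w+)'

# findall returns one (id, sequence) tuple per record
for seq_id, sequence in re.findall(pattern, text):
    print((seq_id, sequence))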
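The loop above captures a single line of sequence per record. For records whose sequence spans several lines, one possible approach is to use the `[\w\n]+` form of the pattern and strip the line breaks afterwards. This is a sketch only: `parse_records` and the sample data are hypothetical, and it assumes every header line has the ID followed by a space and some metadata on the same line.

```python
import re

# Pattern with a multi-line sequence group, as in the first form of the answer.
RECORD = re.compile(r'>(?P<id>[\w-]+)\s.*\n(?P<sequence>[\w\n]+)')

def parse_records(text):
    """Return a dict mapping each ID to its sequence with line breaks removed."""
    records = {}
    for match in RECORD.finditer(text):
        records[match.group("id")] = match.group("sequence").replace("\n", "")
    return records

# Hypothetical input; the first header contains rogue ">" characters in its metadata.
sample = (
    ">YKL068W-A hypothetical metadata <i>x</i>\n"
    "ABCD\n"
    "EFGH\n"
    ">XYZ1234 more metadata\n"
    "LMNOP\n"
)

records = parse_records(sample)
for seq_id, sequence in records.items():
    print(seq_id, sequence)
```

Because `.*` consumes the whole header line, a stray ">" inside the metadata is not treated as the start of a new record. A header with no metadata at all would behave differently, since `\s` would then consume the line break and `.*` would swallow the first sequence line.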
