Regular expression help

Posted 2024-05-21 06:20:51


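I'm trying to create a regex in Python 3 that matches 7 characters (e.g. >AB0012), then an unknown number of characters, and then another 6 characters (e.g. aaabbb or bbbaaa). My input string might look like this: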

>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

This is the regex I came up with:

matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)  
print(matches)

The output I'm trying to produce looks like this:

[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]

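I read through the Python documentation but couldn't find how to match an unknown distance between two parts of a regex. Is there some kind of wildcard that would let me complete my regex? Thanks in advance for your help!

Edit:
If I use *? in my code, like this: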

mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)

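My output is:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]

The second and third items in the list are missing >CD00192 and >ZP01990, respectively. How can I get the regex to include these characters in the list?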


3 answers

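Use * to match zero or more repetitions of the preceding element, so a* will match "", "a", "aa", and so on. + matches one or more repetitions.

You may also want to make the quantifiers (+ or *) lazy by writing +? or *?.

See regular-expressions.info for details.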
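As a minimal sketch of how these quantifiers behave (the sample string `s` is made up for illustration and is not the asker's data): a `.*` or `.*?` placed between the two groups bridges the unknown gap, whereas quantifying the header group itself makes the header optional, so it can come back empty.

```python
import re

# Hypothetical sample: one '>' plus 7-character header, followed by two motifs.
s = '>CD00192xxaaabbbyybbbaaa'

# .* is greedy: the gap stretches as far as it can, so the LAST motif is captured.
greedy = re.findall(r'(>.{7}).*(aaabbb|bbbaaa)', s)

# .*? is lazy: the gap stays as short as possible, so the FIRST motif is captured.
lazy = re.findall(r'(>.{7}).*?(aaabbb|bbbaaa)', s)

# Quantifying the header group lets it repeat zero times, so a motif that does
# not directly follow a header is reported with an empty first group.
optional_header = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', s)

print(greedy)
print(lazy)
print(optional_header)
```

Note that neither the greedy nor the lazy form returns every motif under a header in a single `findall` call, because each match consumes the header once.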

I have code that also gives the positions.

Here is a simple version of it:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

# A record: a 7-character header after '>', then text holding at least one motif
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile(r'aaabbb|bbbaaa')

dic = OrderedDict()


# Finding the result
for mat in regx.finditer(ch):
    chunk, head = mat.groups()
    headstart = mat.start()
    # Each entry: (position in ch, position within the record, motif)
    dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                              for six in rag.finditer(chunk)]


# Displaying the result
for (headstart, head), li in dic.items():
    print('{:>10} {}'.format(headstart, head))
    for x in li:
        print('{0[0]:>10} {0[1]:>6} {0[2]}'.format(x))

Result

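>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

        24 CD00192
        31      7 aaabbb
        41     17 bbbaaa
        52     28 bbbaaa
        62     38 aaabbb
        69 ZP01990
        95     26 aaabbb
       136 SE45789
       148     12 aaabbb
       172     36 bbbaaa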

The same code, in a functional style, using a generator:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile(r'aaabbb|bbbaaa')

# Lazily yields ((chunk, head), headstart) for each record that holds a motif
gen = ((mat.groups(), mat.start()) for mat in regx.finditer(ch))


dic = OrderedDict(((headstart, head),
                   [(headstart + six.start(), six.start(), six.group())
                    for six in rag.finditer(chunk)])
                  for (chunk, head), headstart in gen)


print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]:>6} {0[2]}'.format, li))
                for (headstart, head), li in dic.items()))


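Edit

I measured the execution times.

For each code, I timed the creation of the dictionary and the display separately.

The code using a generator (the second one) displays the result 7.4 times faster (0.020 s) than the other one (0.148 s).

But, to my surprise, the code using a generator takes 47% more time to compute the dictionary than the other code (0.000718 s).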
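The script used for these measurements is not shown. One possible way to take comparable timings of the dictionary-building step is `timeit`; the sketch below is an assumption about method only. `build_loop` and `build_gen` are hypothetical names, the input is shortened, and a plain `dict` stands in for `OrderedDict`. Absolute numbers will depend on the machine and the Python version.

```python
import re
import timeit

# Shortened, hypothetical input in the same format as ch above.
ch = ('>AB0012xxxxaaaaaaaaaaaa'
      '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'
      '>QD1547zzzzzzzzjjjiii')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile(r'aaabbb|bbbaaa')

def build_loop(text):
    """Build the dictionary with an explicit for loop."""
    dic = {}
    for mat in regx.finditer(text):
        chunk, head = mat.groups()
        start = mat.start()
        dic[(start, head)] = [(start + six.start(), six.group())
                              for six in rag.finditer(chunk)]
    return dic

def build_gen(text):
    """Build the same dictionary from a generator expression."""
    gen = ((mat.groups(), mat.start()) for mat in regx.finditer(text))
    return dict(((start, head),
                 [(start + six.start(), six.group())
                  for six in rag.finditer(chunk)])
                for (chunk, head), start in gen)

n = 2000  # arbitrary repetition count; the average per call is reported
t_loop = timeit.timeit(lambda: build_loop(ch), number=n) / n
t_gen = timeit.timeit(lambda: build_gen(ch), number=n) / n
print('loop: {:.7f} s   generator: {:.7f} s'.format(t_loop, t_gen))
```

Averaging over many repetitions matters here because a single run of a sub-millisecond function is dominated by noise.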


Edit 2

Another approach:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')


# Group 1: a 7-character header after '>'; group 2: one of the two motifs
regx = re.compile(r'((?<=>).{7})|(aaabbb|bbbaaa)')

def collect(ch):
    li = []
    dic = OrderedDict()

    gen = ((x.start(), x.group(1), x.group(2)) for x in regx.finditer(ch))
    for st, g1, g2 in gen:
        if g1:
            # New header: store the motifs gathered for the previous one
            if li:
                dic[(stprec, g1prec)] = li
            li, stprec, g1prec = [], st, g1
        elif g2:
            li.append((st, g2))
    # Store the motifs of the last header
    if li:
        dic[(stprec, g1prec)] = li
    return dic


dic = collect(ch)

print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10}   {0[1]}'.format, li))
                for (headstart, head), li in dic.items()))

Result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

        24 CD00192
        31   aaabbb
        41   bbbaaa
        52   bbbaaa
        62   aaabbb
        69 ZP01990
        95   aaabbb
       136 SE45789
       148   aaabbb
       172   bbbaaa

This code computes dic in 0.00040 s and displays it in 0.0321 s.


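Edit 3

To answer your question: you have no choice but to keep each current value among 'CD00192', 'ZP01990', 'SE45789', etc. under a name. (I prefer not to say "in a variable" in Python, because strictly speaking Python has no variables. But you can read "under a name" as if I had written "in a variable".)

To do that, you must use finditer().

Here is the code for this solution: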

import re

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

# Group 1: '>' plus a 7-character header; group 2: one of the two motifs
regx = re.compile(r'(>.{7})|(aaabbb|bbbaaa)')

matches = []
for mat in regx.finditer(ch):
    g1, g2 = mat.groups()
    if g1:
        head = g1    # keep the current header under a name
    else:
        matches.append((head, g2))

print(matches)

Result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]

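My earlier codes are more complex because they capture the positions and collect the 'aaabbb' and 'bbbaaa' values belonging to each header ('CD00192', 'ZP01990', 'SE45789', etc.) into a list.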
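If only the per-header grouping is wanted, without positions, the flat list of pairs can be folded into one list per header afterwards. This is a small sketch, not part of the original solution; `grouped` is a hypothetical name and `matches` is copied from the result printed above.

```python
from collections import OrderedDict

# Flat (header, motif) pairs, copied from the printed result.
matches = [('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'),
           ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'),
           ('>ZP01990', 'aaabbb'),
           ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]

# Fold the pairs into one list of motifs per header, keeping first-seen order.
grouped = OrderedDict()
for head, motif in matches:
    grouped.setdefault(head, []).append(motif)

for head, motifs in grouped.items():
    print(head, motifs)
```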

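Here is a non-regex approach. Split on ">" (your data will start from the second element), and then, since you don't care what the 7 characters are, check from the 8th character up to (but not including) the 14th.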

>>> string = """>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"""
>>> for i in string.split(">")[1:]:
...   if i[7:13] in ["aaabbb","bbbaaa"]:
...     print(">" + i[:13])
...
>CD00192aaabbb
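The check above only looks at the 6 characters that sit directly after the header, which is why a single line is printed. As a hedged variation on the same split-based idea, the sketch below scans the whole record instead. `find_motifs`, its parameters, and the shortened `sample` string are hypothetical, and it assumes all motifs have the same length.

```python
def find_motifs(text, motifs=("aaabbb", "bbbaaa"), head_len=7):
    """Return (header, motif) pairs for every motif found after each header."""
    width = len(motifs[0])  # assumption: every motif has this length
    pairs = []
    for record in text.split(">")[1:]:
        head, body = record[:head_len], record[head_len:]
        i = 0
        while i <= len(body) - width:
            window = body[i:i + width]
            if window in motifs:
                pairs.append((">" + head, window))
                i += width  # skip past the motif: matches do not overlap
            else:
                i += 1
    return pairs

# Shortened, made-up sample in the same format as the question's input.
sample = ">AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyy>ZP0199000012xxxxaaabbbaaaa"
print(find_motifs(sample))
```

Records without any motif, such as the first one in `sample`, simply contribute no pairs.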
