Using a regular expression to replace bracketed items in a text file

Asked 2025-04-16 23:33

I have an open text file, f. I need to find every piece of text enclosed in square brackets, including the brackets themselves. For example, given this:

1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last

It would match and print:

1 - [first]
3 - [Finally]
3 - [B]

Once the matches have been printed, I want to delete them and tidy up the extra whitespace, so that the final text becomes:

1 - This is the line
2 - (And) another line
3 - the last

Conceptually the function looks roughly like this, but I'm having trouble with the regex part:

def find_and_replace(file):
    f=open(file)
    regex = re.compile("[.+]")
    find regex.all
    for item in regex.all:
        print item, line-number
        replace(item, '')
        normalize white space

Thanks.

4 Answers

1

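As far as the regex goes, "[.+]" creates a character class that matches a single . or a single +. You need to escape the [ and ] characters, because they have a special meaning in regular expressions. Also, even once escaped, the expression would match a string like [a] foo [b] as one single match, because quantifiers are greedy by default. Put a ? after the + to make it match the shortest possible sequence of characters.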

So try "\\[.+?\\]" and see whether that works.

If you also want to find and remove empty brackets, [], change the + quantifier to *.
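To make the difference between these variants concrete, here is a small Python 3 sketch. The sample strings and variable names are made up for illustration; only the patterns come from the discussion above.

```python
import re

text = "[a] foo [b] bar []"

# Unescaped: a character class that matches a single '.' or a single '+'.
char_class = re.findall(r"[.+]", "1.5 + 2")

# Escaped but greedy: one match from the first '[' to the last ']'.
greedy = re.findall(r"\[.+\]", text)

# Escaped and lazy: each bracketed item separately; '+' needs at least
# one character inside, so the empty '[]' is not matched.
lazy = re.findall(r"\[.+?\]", text)

# Lazy with '*': also matches the empty '[]'.
lazy_star = re.findall(r"\[.*?\]", text)

print(char_class)  # ['.', '+']
print(greedy)      # ['[a] foo [b] bar []']
print(lazy)        # ['[a]', '[b]']
print(lazy_star)   # ['[a]', '[b]', '[]']
```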

2

You need to escape the [ and ] characters and use a non-greedy operator.

r'\[.+?\]'

Note that a regular expression like this can't handle nested brackets, such as [foo [bar]].

Also, to remove the extra whitespace, you can add \s? to the end of the regex.

For example:

>>> import re
>>> a = '''1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
'''
>>> a = re.sub(r'\[.+?\]\s?','',a)
>>> print(a)
1 - This is the line
2 - (And) another line
3 - the last
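
The question also asks for each match to be printed with its line number before it is removed. A minimal Python 3 sketch of that follows, assuming the input is any iterable of lines (an open file would do). The names find_and_remove and BRACKETED are hypothetical, and collapsing whitespace with split/join is just one way to normalize it; note that it also strips leading and trailing whitespace.

```python
import re

# Non-greedy pattern from the answer above (no support for nested brackets).
BRACKETED = re.compile(r"\[.+?\]")

def find_and_remove(lines):
    """Return (found, cleaned).

    found   -- list of (line_number, matched_text) pairs
    cleaned -- the lines with matches removed and whitespace runs collapsed
    """
    found = []
    cleaned = []
    for number, line in enumerate(lines, 1):
        for match in BRACKETED.findall(line):
            found.append((number, match))
        without = BRACKETED.sub("", line)
        # split() + join collapses runs of whitespace into single spaces.
        cleaned.append(" ".join(without.split()))
    return found, cleaned

sample = [
    "1 - This is the [first] line",
    "2 - (And) another line",
    "3 - [Finally][B] the last",
]
found, cleaned = find_and_remove(sample)
for number, match in found:
    print(number, "-", match)
print("\n".join(cleaned))
```

With a real file, something like `with open(path) as f: found, cleaned = find_and_remove(f)` should work, where path is whatever file you are processing.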
1

Building on JBernardo's regex, we can display the line and its line number each time a bracketed chunk is removed:

import re

ss = '''When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'''

print ss,'\n'

dico_lines = dict( (n,repr(line)) for n,line in enumerate(ss.splitlines(True),1))

def repl(mat, countline =[1]):
    if mat.group(1):
        print "line %s: detecting \\n , the counter of lines is incremented -> %s" % (countline[0],countline[0]+1)
        countline[0] += 1
        return mat.group(1)
    else:
        print "line %s: removing %10s  in  %s" % (countline[0],repr(mat.group()),dico_lines[countline[0]])
        return ''

print '\n'+re.sub(r'(\n)|\[.*?\] ?',repl,ss)

The result is

When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:— 

line 1: removing  '[xxxx] '  in  'When colour goes [xxxx] home into the eyes,\n'
line 1: detecting \n , the counter of lines is incremented -> 2
line 2: detecting \n , the counter of lines is incremented -> 3
line 3: removing    '[yyy]'  in  "With danc[yyy]ing girls and sweet birds' cries\n"
line 3: detecting \n , the counter of lines is incremented -> 4
line 4: removing '[ZZZZ  ] '  in  'Behind the gateways[ZZZZ  ] of the brain;\n'
line 4: detecting \n , the counter of lines is incremented -> 5
line 5: detecting \n , the counter of lines is incremented -> 6
line 6: removing    '[AAA]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'
line 6: removing '[UUUUU] '  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'
line 6: removing   '[BBBB]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:\x97'

When colour goes home into the eyes,
And lights that shine are shut again,
With dancing girls and sweet birds' cries
Behind the gatewaysof the brain;
And that no-place which gave them birth, shall close
The rainbow and the rose:—

But as JBernardo pointed out, this regex runs into trouble if the string contains nested brackets:

ss = 'one [two [three] ] end of line'
print re.sub(r'\[.+?\]\s?','',ss)

which produces

one ] end of line

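If the regex pattern is modified to exclude brackets inside the match, only the most deeply nested bracketed chunk ends up being removed: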

ss = 'one [two [three] ] end of line'
print re.sub(r'\[[^\][]*\]\s?','',ss)

The result is

one [two ] end of line

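So I looked for solutions you can use if you want to process all the nested bracketed chunks.
Since a regular expression is not a parser, we can't remove chunks containing nested brackets without iterating several times.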
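
The idea of iterating can be sketched in a few lines of Python 3: remove the innermost chunks, then repeat until a pass removes nothing. This is a simplified illustration, not the code of the sub-cases below. The names INNERMOST and remove_nested are hypothetical, and whitespace is deliberately left alone.

```python
import re

# Matches only bracket pairs that contain no other bracket, i.e. the innermost ones.
INNERMOST = re.compile(r"\[[^\[\]]*\]")

def remove_nested(text):
    """Strip innermost bracketed chunks repeatedly.

    Returns (cleaned_text, number_of_passes_that_removed_something).
    """
    passes = 0
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            break
        passes += 1
    return text, passes

result = remove_nested('one [two [three] ] end of line')
print(result)  # ('one  end of line', 2)
```

Each pass peels off one nesting level, so the number of passes equals the maximum nesting depth. Unbalanced brackets such as `][` are never matched and stay in the text.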

Sub-case 1

Simply removing the nested bracketed chunks:

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

def clean(x, regx = re.compile('( |(?<! ))+((?<!])\[[^[\]]*\])( *)')):
    while regx.search(x):
        print '------------\n',x,'\n','\n'.join(map(str,regx.findall(x)))
        x = regx.sub('\\1',x)
    return x


print '\n==========================\n'+clean(ss)

I only give the result here. If you want to follow the execution step by step, run it yourself.

This is the line   
(And) another line
 initially shifted
the last
    Additional ending lines (this one without brackets):    
cases
tuvulugusti perenabdiiii
 shifted content
    fgjezhr][fgh

Note that a leading blank remains on the two lines that started with whitespace followed by a bracketed chunk:

   [Inter][A] initially shifted
    [Away [is this] [][4] ] shifted content

which are transformed into

 initially shifted
 shifted content

Sub-case 2:

So I improved the regex and the algorithm to remove all the leading blanks of such lines.

def clean(x, regx = re.compile('(?=^( ))?( |(?<! ))+((?<!])\[[^[\]]*\])( )*',re.MULTILINE)):
    def repl(mat):
        return '' if mat.group(1) else mat.group(2)
    while regx.search(x):
        print '------------\n',x,'\n','\n'.join(map(str,regx.findall(x)))
        x = regx.sub(repl,x)
    return x


print '\n==========================\n'+clean(ss)

The result is

This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
cases
tuvulugusti perenabdiiii
shifted content
    fgjezhr][fgh

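Lines that begin with blanks but had no bracketed chunk removed are left unchanged. If you also want to remove the leading blanks on those lines, it is better to call strip() on every line, in which case you don't need this solution and the previous one is enough.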
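
As a rough Python 3 sketch of that alternative, assuming only leading spaces and tabs should go (hence lstrip rather than strip, which would also touch the end of each line); the helper name strip_leading_blanks is made up:

```python
def strip_leading_blanks(text):
    """Remove leading spaces and tabs from every line, keeping line endings."""
    # splitlines(True) keeps the '\n' on each line, so joining restores the text.
    return "".join(line.lstrip(" \t") for line in text.splitlines(True))

sample = " initially shifted\nthe last\n    Additional ending lines\n"
stripped = strip_leading_blanks(sample)
print(stripped)
```

This would be applied to the output of the sub-case 1 clean() function.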

Sub-case 3:

To display the lines in which a removal is performed, the code now has to be modified to take into account that we are iterating:

  • the lines change from one iteration to the next, so we can't use a fixed dico_lines

  • and the line counter must be reset to 1 at each iteration

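To make these two adjustments, I use a small trick: modifying the func_defaults of the replacement function (func_defaults is the Python 2 attribute name; Python 3 calls it __defaults__)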

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

def clean(x, rag = re.compile('\[.*\]',re.MULTILINE),
          regx = re.compile('(\n)|(?=^( ))?( |(?<! ))+((?<!])\[[^[\]\n]*\])( *)',re.MULTILINE)):

    def repl(mat, cnt = None, dico_lignes = None):
        if mat.group(1):
            print "line %s: detecting %s  ==> count incremented to %s" % (cnt[0],str(mat.groups('')),cnt[0]+1)
            cnt[0] += 1
            return mat.group(1)
        if mat.group(4):
            print "line %s: removing %s   IN   %s" % (cnt[0],repr(mat.group(4)),dico_lignes[cnt[0]])
            return '' if mat.group(2) else mat.group(3)

    while rag.search(x):
        print '\n--------------------------\n'+x
        repl.func_defaults = ([1],dict( (n,repr(line)) for n,line in enumerate(x.splitlines(True),1)))
        x = regx.sub(repl,x)
    return x


print '\n==========================\n'+clean(ss)

The result is

--------------------------
This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh

line 1: removing '[first]'   IN   'This is the [first]       line   \n'
line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: removing '[Inter]'   IN   '   [Inter][A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: removing '[Finally]'   IN   '[Finally][B] the last\n'
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[ 1]'   IN   '[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases\n'
line 6: removing '[some]'   IN   '[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: removing '[]'   IN   'tuvulu[]gusti perena[3]              bdiiii\n'
line 7: removing '[3]'   IN   'tuvulu[]gusti perena[3]              bdiiii\n'
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[is this]'   IN   '    [Away [is this] [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

--------------------------
This is the line   
(And) another line
[A] initially shifted
[B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref ] there are] other ]cases
tuvulugusti perenabdiiii
    [Away [][4] ] shifted content
    fgjezhr][fgh

line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: removing '[A]'   IN   '[A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: removing '[B]'   IN   '[B] the last\n'
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[ref ]'   IN   '[Note that [ by the way [ref ] there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[]'   IN   '    [Away [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

--------------------------
This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way there are] other ]cases
tuvulugusti perenabdiiii
    [Away [4] ] shifted content
    fgjezhr][fgh

line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[ by the way there are]'   IN   '[Note that [ by the way there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[4]'   IN   '    [Away [4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

--------------------------
This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
[Note that other ]cases
tuvulugusti perenabdiiii
    [Away ] shifted content
    fgjezhr][fgh

line 1: detecting ('\n', '', '', '', '')  ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '')  ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '')  ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '')  ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '')  ==> count incremented to 6
line 6: removing '[Note that other ]'   IN   '[Note that other ]cases\n'
line 6: detecting ('\n', '', '', '', '')  ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '')  ==> count incremented to 8
line 8: removing '[Away ]'   IN   '    [Away ] shifted content\n'
line 8: detecting ('\n', '', '', '', '')  ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '')  ==> count incremented to 10

==========================
This is the line   
(And) another line
initially shifted
the last
    Additional ending lines (this one without brackets):    
cases
tuvulugusti perenabdiiii
shifted content
    fgjezhr][fgh
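
The code above is Python 2 (print statements, func_defaults). For Python 3, one possible sketch of the same reporting idea is shown below. It is a simplified variant, not a port: it derives the line number from the match position instead of carrying a counter, so nothing has to be reset between passes, and it does not reproduce the whitespace handling of the regex above. The names INNERMOST and clean_with_report are hypothetical.

```python
import re

# Innermost bracket pairs only; '\n' is excluded so a chunk never spans lines.
INNERMOST = re.compile(r"\[[^\[\]\n]*\]")

def clean_with_report(text):
    """Remove nested bracketed chunks pass by pass.

    Returns (cleaned_text, report), where report is a list of
    (pass_number, line_number, removed_chunk) tuples.
    """
    report = []
    pass_number = 0
    while INNERMOST.search(text):
        pass_number += 1
        current = text  # the text as it stands at the start of this pass

        def repl(match):
            # Line number = newlines before the match in this pass's text, plus 1.
            line_number = current.count("\n", 0, match.start()) + 1
            report.append((pass_number, line_number, match.group()))
            return ""

        text = INNERMOST.sub(repl, text)
    return text, report

cleaned, report = clean_with_report("a [x]\nb [y [z]] c\n")
for entry in report:
    print("pass %d, line %d: removing %r" % entry)
print(repr(cleaned))
```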
