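Using a regular expression to replace bracketed items in a text file
I have an open text file, f. I need to find every piece of text enclosed in square brackets, including the brackets themselves. For example, given this: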
1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
it would match and print:
1 - [first]
3 - [Finally]
3 - [B]
Once I have printed the matches, I want to delete them and tidy up the leftover whitespace, so that the final text becomes:
1 - This is the line
2 - (And) another line
3 - the last
Conceptually the function looks roughly like this (pseudocode), but I am struggling with the regular expression part:
def find_and_replace(file):
    f = open(file)
    regex = re.compile("[.+]")
    find regex.all
    for item in regex.all:
        print item, line-number
        replace(item, '')
    normalize white space
Thanks.
4 Answers
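As far as the regular expression goes, "[.+]" creates a character class that matches either a . or a +. You need to escape the [ and ] characters because they have a special meaning in regular expressions. Even once they are escaped, the expression would match a string such as [a] foo [b] as one single match, because quantifiers are greedy by default. Adding a ? after the + makes it match the shortest possible sequence of characters instead.
So try "\\[.+?\\]" and see whether that works.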
If you also want to find and remove an empty [], replace the + quantifier with *.
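To make the difference between these patterns concrete, here is a small sketch. The sample strings and variable names are made up for illustration; only re.findall from the standard library is used.

```python
import re

text = "[a] foo [b] and [] end"

# "[.+]" is a character class: it matches one '.' or one '+', nothing more.
char_class = re.findall(r"[.+]", "1.5 + 2")

# Escaped brackets with a greedy '+': a single match running from the
# first '[' to the last ']'.
greedy = re.findall(r"\[.+\]", text)

# Non-greedy '+?': each non-empty block on its own. The empty "[]" in this
# text is not matched, because '+' needs at least one character.
lazy_plus = re.findall(r"\[.+?\]", text)

# Non-greedy '*?': the empty brackets are matched as well.
lazy_star = re.findall(r"\[.*?\]", text)

for name, found in [("char_class", char_class), ("greedy", greedy),
                    ("lazy_plus", lazy_plus), ("lazy_star", lazy_star)]:
    print(name, found)
```

Running it should show the greedy pattern swallowing everything between the first and last bracket, while the two non-greedy patterns return the blocks one by one.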
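You need to escape the [ and ] characters and use a non-greedy quantifier:
r'\[.+?\]'
Note that a regular expression like this cannot cope with nested brackets such as [foo [bar]].
Also, to get rid of the extra space, add \s? to the end of the regex.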
For example:
>>> import re
>>> a = '''1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
'''
>>> a = re.sub(r'\[.+?\]\s?','',a)
>>> print(a)
1 - This is the line
2 - (And) another line
3 - the last
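As a rough sketch of how this could be tied back to the question (print each match with its line number, then remove it and tidy the whitespace), something like the following might work. The function name find_and_remove is hypothetical, and collapsing every run of whitespace (which also strips leading and trailing whitespace) is an assumption of this sketch rather than part of the answer above.

```python
import re

BRACKETED = re.compile(r"\[.*?\]")

def find_and_remove(lines):
    """Return (matches, cleaned): matches holds (line_number, text) pairs."""
    matches, cleaned = [], []
    for number, line in enumerate(lines, 1):
        # Record every bracketed block together with its 1-based line number.
        for found in BRACKETED.findall(line):
            matches.append((number, found))
        # Remove the blocks, then collapse runs of whitespace to one space.
        without_blocks = BRACKETED.sub("", line)
        cleaned.append(" ".join(without_blocks.split()))
    return matches, cleaned

sample = [
    "1 - This is the [first] line",
    "2 - (And) another line",
    "3 - [Finally][B] the last",
]
matches, cleaned = find_and_remove(sample)
for number, found in matches:
    print(number, "-", found)
print("\n".join(cleaned))
```

Because an open file object yields its lines when iterated, the same function could be called as find_and_remove(f) on the file from the question.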
Building on JBernardo's regex, we can display the line and its line number each time a bracketed block is removed:
import re

ss = '''When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'''
print(ss, '\n')

# Map each 1-based line number to the repr of the original line.
dico_lines = dict((n, repr(line)) for n, line in enumerate(ss.splitlines(True), 1))

def repl(mat, countline=[1]):
    # The mutable default is deliberate: it keeps the line counter between calls.
    if mat.group(1):
        print("line %s: detecting \\n , the counter of lines is incremented -> %s" % (countline[0], countline[0] + 1))
        countline[0] += 1
        return mat.group(1)
    else:
        print("line %s: removing %10s in %s" % (countline[0], repr(mat.group()), dico_lines[countline[0]]))
        return ''

print('\n' + re.sub(r'(\n)|\[.*?\] ?', repl, ss))
The result is
When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—
line 1: removing '[xxxx] ' in 'When colour goes [xxxx] home into the eyes,\n'
line 1: detecting \n , the counter of lines is incremented -> 2
line 2: detecting \n , the counter of lines is incremented -> 3
line 3: removing '[yyy]' in "With danc[yyy]ing girls and sweet birds' cries\n"
line 3: detecting \n , the counter of lines is incremented -> 4
line 4: removing '[ZZZZ ] ' in 'Behind the gateways[ZZZZ ] of the brain;\n'
line 4: detecting \n , the counter of lines is incremented -> 5
line 5: detecting \n , the counter of lines is incremented -> 6
line 6: removing '[AAA]' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing '[UUUUU] ' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing '[BBBB]' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
When colour goes home into the eyes,
And lights that shine are shut again,
With dancing girls and sweet birds' cries
Behind the gatewaysof the brain;
And that no-place which gave them birth, shall close
The rainbow and the rose:—
But as JBernardo pointed out, this regex runs into trouble when the string contains nested brackets:
ss = 'one [two [three] ] end of line'
print(re.sub(r'\[.+?\]\s?', '', ss))
which produces
one ] end of line
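If the pattern is changed so that a block may not itself contain brackets, only the innermost nested block ends up being removed:
ss = 'one [two [three] ] end of line'
print(re.sub(r'\[[^\][]*\]\s?', '', ss))
The result is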
one [two ] end of line
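So I looked for solutions you can refer to if you want to handle every nested bracketed block.
Because a regular expression is not a parser, blocks that contain nested brackets cannot be removed without making several passes.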
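The bare idea of "several passes" can be sketched as follows, before any whitespace handling is added: keep removing innermost blocks until a pass removes nothing. The name strip_nested is hypothetical, and this minimal version leaves the surrounding spaces untouched.

```python
import re

# An innermost block: '[' ... ']' with no bracket of either kind in between.
INNERMOST = re.compile(r"\[[^\][]*\]")

def strip_nested(text):
    """Remove bracketed blocks from the inside out, one pass per nesting level."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            # Nothing was removed in this pass, so no complete block is left.
            return text

print(repr(strip_nested("one [two [three] ] end of line")))
```

Unbalanced brackets, such as a ] that comes before its [, never form a complete block and are left in place.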
Sub-case 1
Simply removing the nested bracketed blocks:
import re

ss = '''This is the [first] line
(And) another line
 [Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
 [Away [is this] [][4] ] shifted content
fgjezhr][fgh
'''

def clean(x, regx=re.compile(r'( |(?<! ))+((?<!])\[[^[\]]*\])( *)')):
    # Each pass removes blocks that contain no brackets; loop until none is left.
    while regx.search(x):
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub(r'\1', x)
    return x

print('\n==========================\n' + clean(ss))
I only show the final result. Run it yourself if you want to follow the intermediate steps.
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
cases
tuvulugusti perenabdiiii
shifted content
fgjezhr][fgh
Notice that the two lines that started with whitespace keep it:
 [Inter][A] initially shifted
 [Away [is this] [][4] ] shifted content
are turned into
 initially shifted
 shifted content
Sub-case 2:
So I improved the regex and the algorithm to remove all the whitespace at the start of such lines.
def clean(x, regx=re.compile(r'(?=^( ))?( |(?<! ))+((?<!])\[[^[\]]*\])( )*', re.MULTILINE)):
    def repl(mat):
        # Group 1 is set only when the match starts at a line beginning with a space.
        return '' if mat.group(1) else mat.group(2)
    while regx.search(x):
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub(repl, x)
    return x

print('\n==========================\n' + clean(ss))
The result is
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
cases
tuvulugusti perenabdiiii
shifted content
fgjezhr][fgh
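Lines that start with whitespace but contain no bracketed block to remove are left unchanged. If you also want to get rid of the leading whitespace on those lines, it is simpler to call strip() on every line; in that case you do not need this solution and the previous one is enough.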
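A minimal sketch of that strip() step, assuming the cleaned text is held in one string; the helper name strip_lines and the sample string are made up for illustration.

```python
def strip_lines(text):
    """Strip leading and trailing whitespace from every line of text."""
    return "\n".join(line.strip() for line in text.splitlines())

# Hypothetical output of sub-case 1, with two lines still shifted.
cleaned = " initially shifted\n shifted content\nfgjezhr][fgh\n"
print(strip_lines(cleaned))
```

Note that joining the lines this way drops the final newline of the text, which may or may not matter for the file being written.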
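Sub-case 3:
To display the lines on which removals are performed, the code now has to be modified to take into account that we are iterating:
the lines change progressively from one iteration to the next, so a fixed dico_lines cannot be used;
and the line counter must be reset to 1 at each iteration.
To make these two adjustments I use a small trick: modifying the default arguments of the replacement function through its __defaults__ attribute (called func_defaults in Python 2).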
import re

ss = '''This is the [first] line
(And) another line
 [Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
 [Away [is this] [][4] ] shifted content
fgjezhr][fgh
'''

def clean(x, rag=re.compile(r'\[.*\]', re.MULTILINE),
          regx=re.compile(r'(\n)|(?=^( ))?( |(?<! ))+((?<!])\[[^[\]\n]*\])( *)', re.MULTILINE)):
    def repl(mat, cnt=None, dico_lignes=None):
        if mat.group(1):
            print("line %s: detecting %s ==> count incremented to %s" % (cnt[0], str(mat.groups('')), cnt[0] + 1))
            cnt[0] += 1
            return mat.group(1)
        if mat.group(4):
            print("line %s: removing %s IN %s" % (cnt[0], repr(mat.group(4)), dico_lignes[cnt[0]]))
            return '' if mat.group(2) else mat.group(3)
    while rag.search(x):
        print('\n--------------------------\n' + x)
        # Reset the counter to 1 and rebuild the line dictionary for this pass
        # (the attribute is called func_defaults in Python 2).
        repl.__defaults__ = ([1], dict((n, repr(line)) for n, line in enumerate(x.splitlines(True), 1)))
        x = regx.sub(repl, x)
    return x

print('\n==========================\n' + clean(ss))
The result is
--------------------------
This is the [first] line
(And) another line
[Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
[Away [is this] [][4] ] shifted content
fgjezhr][fgh
line 1: removing '[first]' IN 'This is the [first] line \n'
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: removing '[Inter]' IN ' [Inter][A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: removing '[Finally]' IN '[Finally][B] the last\n'
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[ 1]' IN '[Note that [ by the way [ref [ 1]] there are] [some] other ]cases\n'
line 6: removing '[some]' IN '[Note that [ by the way [ref [ 1]] there are] [some] other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: removing '[]' IN 'tuvulu[]gusti perena[3] bdiiii\n'
line 7: removing '[3]' IN 'tuvulu[]gusti perena[3] bdiiii\n'
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[is this]' IN ' [Away [is this] [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
--------------------------
This is the line
(And) another line
[A] initially shifted
[B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref ] there are] other ]cases
tuvulugusti perenabdiiii
[Away [][4] ] shifted content
fgjezhr][fgh
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: removing '[A]' IN '[A] initially shifted\n'
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: removing '[B]' IN '[B] the last\n'
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[ref ]' IN '[Note that [ by the way [ref ] there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[]' IN ' [Away [][4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
--------------------------
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
[Note that [ by the way there are] other ]cases
tuvulugusti perenabdiiii
[Away [4] ] shifted content
fgjezhr][fgh
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[ by the way there are]' IN '[Note that [ by the way there are] other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[4]' IN ' [Away [4] ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
--------------------------
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
[Note that other ]cases
tuvulugusti perenabdiiii
[Away ] shifted content
fgjezhr][fgh
line 1: detecting ('\n', '', '', '', '') ==> count incremented to 2
line 2: detecting ('\n', '', '', '', '') ==> count incremented to 3
line 3: detecting ('\n', '', '', '', '') ==> count incremented to 4
line 4: detecting ('\n', '', '', '', '') ==> count incremented to 5
line 5: detecting ('\n', '', '', '', '') ==> count incremented to 6
line 6: removing '[Note that other ]' IN '[Note that other ]cases\n'
line 6: detecting ('\n', '', '', '', '') ==> count incremented to 7
line 7: detecting ('\n', '', '', '', '') ==> count incremented to 8
line 8: removing '[Away ]' IN ' [Away ] shifted content\n'
line 8: detecting ('\n', '', '', '', '') ==> count incremented to 9
line 9: detecting ('\n', '', '', '', '') ==> count incremented to 10
==========================
This is the line
(And) another line
initially shifted
the last
Additional ending lines (this one without brackets):
cases
tuvulugusti perenabdiiii
shifted content
fgjezhr][fgh
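The bookkeeping in sub-case 3 (resetting the counter and rebuilding the line dictionary on every pass) can probably be avoided by working one line at a time, so that the line number comes from enumerate. The following is only a sketch of that alternative, not the method used above: the name clean_with_log is hypothetical, it does no whitespace tidying, and it reports removals line by line rather than pass by pass.

```python
import re

# An innermost block that does not span lines.
INNERMOST = re.compile(r"\[[^\][\n]*\]")

def clean_with_log(text):
    """Return (cleaned_text, log); log holds (line_number, removed_block) pairs."""
    log = []
    lines = text.split("\n")
    for number, line in enumerate(lines, 1):
        while True:
            match = INNERMOST.search(line)
            if match is None:
                break
            # Record the removal, then cut the block out of the line.
            log.append((number, match.group()))
            line = line[:match.start()] + line[match.end():]
        lines[number - 1] = line
    return "\n".join(lines), log

result, log = clean_with_log("a [b [c]] d\nno brackets\n[x]y")
for number, removed in log:
    print("line %s: removing %r" % (number, removed))
print(result)
```

Since each line is finished before the next one starts, no shared counter or per-pass dictionary is needed; the cost is that the log order differs from the pass-by-pass trace shown above.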