Multiline pattern matching in Python
A periodically generated computer message (simplified):
Hello user123,

- (604)7080900
- 152
- minutes

Regards
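Using Python, how can I extract "(604)7080900", "152" and "minutes" (that is, any text that follows a leading "- ") from between the two empty lines? Here an empty line means the \n\n after "Hello user123" and the \n\n before "Regards". It would be even better if the resulting list of strings were stored in an array. Thanks!
Edit: the number of lines between the two empty lines is not fixed.
Second edit:
For example: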
hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world
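x1, x2 and x3 are good because they are enclosed by two empty lines, and x4 is good for the same reason. x6 is not good because no empty line follows it, and x7 is not good because no empty line precedes it. x2 is good (unlike x6 and x7) because the line before it is a good line and the line after it is also a good line.
These conditions may not have been clear when I posted the question: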
a continuous run of good lines between 2 empty lines
good line must have leading "- "
good line must follow an empty line or follow another good line
good line must be followed by an empty line or followed by another good line
Thanks
4 Answers
>>> s = """Hello user123,
- (604)7080900
- 152
- minutes
Regards
"""
>>> import re
>>> re.findall(r'^- (.*)', s, re.M)
['(604)7080900', '152', 'minutes']
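Note that this one-liner picks up every line that starts with "- ", whether or not it sits between two empty lines. A quick check against the question's second example (the variable names here are made up for illustration):

```python
import re

# The second example from the question, with its empty lines written out.
example = "hello\n\n- x1\n- x2\n- x3\n\n- x4\n\n- x6\nmorning\n- x7\n\nworld"

# re.M lets ^ match at the start of every line, so x6 and x7 are
# returned too, even though the question counts them as bad lines.
items = re.findall(r'^- (.*)', example, re.M)
print(items)  # ['x1', 'x2', 'x3', 'x4', 'x6', 'x7']
```

So it is fine for the simplified message, but it does not enforce the empty-line conditions from the second edit.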
>>> import re
>>>
>>> x="""Hello user123,
...
... - (604)7080900
... - 152
... - minutes
...
... Regards
... """
>>>
>>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x)
[('(604)7080900', '152', 'minutes')]
>>>
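The pattern above is tied to exactly three lines ending in "minutes". Since the question says the number of lines is not fixed, here is a rough sketch of a regex that accepts any number of "- " lines enclosed by empty lines. BLOCK_RE and find_blocks are names invented for this sketch, and it assumes "\n" line endings and empty lines that contain no whitespace:

```python
import re

# A run of '- ...' lines that is preceded by an empty line (the lookbehind
# sees the newline ending the previous line plus the empty line itself)
# and followed by an empty line (the lookahead).
BLOCK_RE = re.compile(r'(?<=\n\n)(?:- .*\n)+(?=\n)')


def find_blocks(text):
    """Return one list of items per block of '- ' lines (sketch only)."""
    blocks = []
    for match in BLOCK_RE.finditer(text):
        # Every matched line looks like '- item'; drop the 2-character prefix.
        blocks.append([line[2:].rstrip() for line in match.group().splitlines()])
    return blocks


x = "Hello user123,\n\n- (604)7080900\n- 152\n- minutes\n\nRegards\n"
print(find_blocks(x))  # [['(604)7080900', '152', 'minutes']]
```

A block at the very start or very end of the text is not matched, because it is not enclosed by empty lines on both sides.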
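The simplest approach is to go over the lines (assuming you have a list of lines, or a file, or you split the string into a list of lines) until you see a line that is empty, i.e. just '\n'. Then check that each following line starts with '- ' (use the startswith string method), slice that prefix off and store the result, until you hit another empty line. For example: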
# if you have a single string, split it into lines.
L = s.splitlines()
# if you (now) have a list of lines, grab an iterator so we can continue
# iteration where it left off.
it = iter(L)
# Alternatively, if you have a file, just use that directly:
# it = open(....)
# Find the first empty line:
for line in it:
    # Treat lines of just whitespace as empty lines too. If you don't want
    # that, do 'if line == ""'.
    if not line.strip():
        break
# Now starts data.
data = []
for line in it:
    if not line.rstrip():
        # End of data.
        break
    if line.startswith('- '):
        # Drop the '- ' prefix and any trailing whitespace.
        data.append(line[2:].rstrip())
    else:
        # malformed data?
        raise ValueError("malformed line %r" % (line,))
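To try the loop on the message from the question, it can be wrapped in a function. This is only a sketch: first_block is a made-up name, and the message string is the simplified sample with its empty lines written out.

```python
def first_block(lines):
    """Return the '- ' items between the first two empty lines.

    `first_block` is a hypothetical name used only for this sketch.
    """
    it = iter(lines)
    # Skip everything up to and including the first empty line.
    for line in it:
        if not line.strip():
            break
    data = []
    # The same iterator continues right after that empty line.
    for line in it:
        if not line.strip():
            break  # second empty line: end of data
        if not line.startswith('- '):
            raise ValueError("malformed line %r" % (line,))
        data.append(line[2:].rstrip())
    return data


message = "Hello user123,\n\n- (604)7080900\n- 152\n- minutes\n\nRegards\n"
print(first_block(message.splitlines()))  # ['(604)7080900', '152', 'minutes']
```

Because both loops share one iterator, the second loop picks up exactly where the first one stopped.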
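Update: since you elaborated on what you want to do, here is an updated version of the loop. It no longer loops twice. Instead it collects lines until it runs into a 'bad' line, and when it reaches a block separator it either saves or discards what it collected. It doesn't need an explicit iterator because it never restarts iteration, so you can just pass it a list of lines (or any iterable):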
def getblocks(L):
    # The list of good blocks (as lists of lines.) You can also make this
    # a flat list if you prefer.
    data = []
    # The list of good lines encountered in the current block
    # (but the block may still become bad.)
    block = []
    # Whether the current block is bad.
    bad = True
    for line in L:
        # Not in a 'good' block, and encountering the block separator.
        if bad and not line.rstrip():
            bad = False
            block = []
            continue
        # In a 'good' block and encountering the block separator.
        if not bad and not line.rstrip():
            # Save 'good' data. Or, if you want a flat list of lines,
            # use 'extend' instead of 'append' (also below.)
            # Skip empty blocks, which happen on consecutive empty lines.
            if block:
                data.append(block)
            block = []
            continue
        if not bad and line.startswith('- '):
            # A good line in a 'good' (not 'bad' yet) block; save the line,
            # minus the '- ' prefix and trailing whitespace.
            block.append(line[2:].rstrip())
            continue
        else:
            # A 'bad' line, invalidating the current block.
            bad = True
    # Don't forget to handle the last block, if it's good
    # (and if you want to handle the last block.)
    if not bad and block:
        data.append(block)
    return data
Here it is in action:
>>> L = """hello
...
... - x1
... - x2
... - x3
...
... - x4
...
... - x6
... morning
... - x7
...
... world""".splitlines()
>>> print(getblocks(L))
[['x1', 'x2', 'x3'], ['x4']]
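If the input is a single string and blocks are always separated by exactly one empty line, the same conditions can also be checked by splitting on "\n\n" instead of tracking state line by line. This is an alternative sketch, not the loop above: good_blocks is a made-up name, and it assumes "\n" line endings and empty lines that contain no whitespace.

```python
def good_blocks(text):
    """Return lists of '- ' items for blocks enclosed by empty lines.

    `good_blocks` is a hypothetical helper for this sketch. Unlike the
    line-by-line loop, it never accepts the last chunk of the text.
    """
    chunks = text.split("\n\n")
    blocks = []
    # The first and last chunks are not enclosed by empty lines on both
    # sides, so only the interior chunks are candidates.
    for chunk in chunks[1:-1]:
        lines = chunk.split("\n")
        # A chunk is good only if every one of its lines starts with '- '.
        if all(line.startswith("- ") for line in lines):
            blocks.append([line[2:].rstrip() for line in lines])
    return blocks


example = "hello\n\n- x1\n- x2\n- x3\n\n- x4\n\n- x6\nmorning\n- x7\n\nworld"
print(good_blocks(example))  # [['x1', 'x2', 'x3'], ['x4']]
```

The trade-off is that whitespace-only lines and runs of several empty lines are not treated as separators here, which the loop version handles.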