Python regex to extract addresses and travel time from a sentence?

Posted 2024-04-20 14:04:01


I am using a regular expression to parse the addresses and the time out of a sentence. The sentence variants are:

  1. I want to go from Cosmos Station to 525 Greenlane highway.
  2. I want to go from Cosmos Station to 525 Greenlane highway tomorrow at 8am.
  3. I want to go from Cosmos Station to 525 Greenlane highway at 8am.

I am hoping for a simple approach: take the text between "from" and "to" and assume it is the origin, and so on.

from(.*)to(.*)

Is this the right way to go? I want to extract the origin, the destination and the time. The expected result is:

Origin = cosmos station
Destination = 525 Greenlane Highway
remaining_string = none if sentences ends at destination
remaining_string = text after destination 

2 answers
from\s(?P<Origin>[\d\w\s]*?)\sto\s(?P<Dest>[\d\w\s]*?)(?:$|(?P<Time>\b(?:tomorrow|at)\b.*))

You can see my solution in a live online demo at regex101.com.

There are three named capture groups, one for each target variable.
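A minimal sketch of reading those named groups from Python. The pattern is the one above; the helper name `parse_trip`, the stripping of the trailing period and the trimming of whitespace are my own assumptions. Without them the pattern does not match a sentence ending in ".", and the lazy Dest group keeps the space in front of the time word.

```python
import re

# Pattern copied from the answer; everything else here is illustrative.
PATTERN = re.compile(
    r"from\s(?P<Origin>[\d\w\s]*?)\sto\s(?P<Dest>[\d\w\s]*?)"
    r"(?:$|(?P<Time>\b(?:tomorrow|at)\b.*))"
)

def parse_trip(sentence):
    """Hypothetical helper: return Origin/Dest/Time as a dict, or None."""
    # Assumption: a trailing period is noise, so drop it before matching.
    match = PATTERN.search(sentence.strip().rstrip("."))
    if match is None:
        return None
    # Dest keeps a trailing space when a time follows, so trim every value.
    return {name: value.strip() if value else None
            for name, value in match.groupdict().items()}

sentences = [
    "I want to go from Cosmos Station to 525 Greenlane highway.",
    "I want to go from Cosmos Station to 525 Greenlane highway tomorrow at 8am.",
    "I want to go from Cosmos Station to 525 Greenlane highway at 8am.",
]
for sentence in sentences:
    print(parse_trip(sentence))
```

Time comes back as None when the sentence ends at the destination, which lines up with the remaining_string = none case in the question.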

You will notice that the Time capture group contains (tomorrow|at), which matches the word that starts the time substring.

Although this works for your specific problem, it has to be extended for every other time value you may need to handle.
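One way to make that extension less painful is to build the alternation from a list. This is only a sketch: `TIME_STARTERS` and `build_pattern` are hypothetical names, and "today" and "tonight" are made-up additions; only "tomorrow" and "at" come from the original pattern.

```python
import re

# Hypothetical keyword list; extend it to whatever your real input contains.
TIME_STARTERS = ["tomorrow", "today", "tonight", "at"]

def build_pattern(starters):
    """Compile the answer's pattern with a configurable set of time words."""
    # re.escape keeps any keyword containing regex metacharacters literal.
    alternation = "|".join(re.escape(word) for word in starters)
    return re.compile(
        r"from\s(?P<Origin>[\d\w\s]*?)\sto\s(?P<Dest>[\d\w\s]*?)"
        r"(?:$|(?P<Time>\b(?:" + alternation + r")\b.*))"
    )

pattern = build_pattern(TIME_STARTERS)
match = pattern.search(
    "I want to go from Cosmos Station to 525 Greenlane highway tonight at 9pm"
)
print(match.group("Time"))  # tonight at 9pm
```

With only "tomorrow" and "at" in the list, the same sentence would put "tonight" into Dest and leave just "at 9pm" in Time, which is why the list needs to cover every word that can start a time.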

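Without knowing which assumptions we can and cannot make, it is hard to write a regex that catches every edge case, so feel free to post the full set of expected inputs.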

This works for the given samples:

import re

string = """
I want to go from Cosmos Station to 525 Greenlane highway.
I want to go from Cosmos Station to 525 Greenlane highway tomorrow at 8am.
I want to go from Cosmos Station to 525 Greenlane highway at 8am
"""
# Keep the time separators in one place so the pattern stays readable.
# In the examples the separator is either "at" or "tomorrow at"; add more as needed.
at_separators = {'at': '(?:(?:tomorrow at)|(?:at))'}
# After "to", group 2 captures the rest of the line when no time separator follows;
# otherwise it captures only the text between "to" and the separator.
# A raw string keeps \s from being treated as an invalid string escape.
pattern = r'from\s(.+?)\sto\s(.+?(?=\s{at})|.+(?!{at}\s))(?:\s{at}(.+))?'.format(**at_separators)
pattern = re.compile(pattern, flags=re.MULTILINE)
# The results still need cleaning (trailing '.', stray whitespace); doing that
# inside the pattern would make it unreadable.
print(re.findall(pattern, string))

Output:

[('Cosmos Station', '525 Greenlane highway.', ''), ('Cosmos Station', '525 Greenlane highway', ' 8am.'), ('Cosmos Station', '525 Greenlane highway', ' 8am')]

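As you can see in the first tuple, the third element is an empty string because there is no time. The key is the positive lookahead .+?(?=\s{at}): it does not consume the time part but stops right before it, leaving it for (?:\s{at}(.+))? to match.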
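The cleanup mentioned in the code comments could look something like the sketch below. The `clean` helper is hypothetical, and treating a trailing period as noise is an assumption about your data. Note that this pattern consumes "tomorrow" as part of the separator, so only the part after "at" ends up in the third field.

```python
import re

AT = r"(?:(?:tomorrow at)|(?:at))"
# Same pattern as in the answer, written as a raw string.
pattern = re.compile(
    r"from\s(.+?)\sto\s(.+?(?=\s{at})|.+(?!{at}\s))(?:\s{at}(.+))?".format(at=AT)
)

text = """
I want to go from Cosmos Station to 525 Greenlane highway.
I want to go from Cosmos Station to 525 Greenlane highway tomorrow at 8am.
I want to go from Cosmos Station to 525 Greenlane highway at 8am
"""

def clean(field):
    """Hypothetical helper: trim whitespace and a trailing period."""
    field = field.strip().rstrip(".")
    return field or None  # an empty capture means "no time given"

trips = [tuple(clean(field) for field in row) for row in pattern.findall(text)]
for origin, destination, time in trips:
    print(origin, "|", destination, "|", time)
```

Keeping the cleanup in plain Python rather than in the regex follows the same reasoning as the answer: the pattern stays short, and the noise handling is easy to change later.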
