Splitting a string on timestamps

1 vote
1 answer
1361 views
Asked 2025-04-18 04:28

Hi, I have scraped some information from the web, stripped out all the HTML tags, and ended up with a single string that looks like this:

foo XX:XX +XX:XX bar XX:XX +X:XX bar2 XX:XX +X:XX bar3 XX:XX bar4 XX:XX bar5

In this string, foo has no timestamp in front of it. It can be kept or dropped, because it always appears as the first bar as well.

I want to split at each XX:XX, but not at +XX:XX. Each bar may be preceded by either XX:XX +XX:XX or just XX:XX.

I also want to keep the timestamps when splitting, so that I end up with a list of strings like this:

XX:XX +XX:XX bar
XX:XX +XX:XX bar2
.....
XX:XX bar5

For context, the text comes from the HTML commentary for a football match on the BBC website, for example: http://www.bbc.co.uk/sport/0/football/27092972

The regular expression I am trying as a starting point is:

(?(name)\d+:\d\d|\+\d+:\d\d)

This expression is wrong because it does not compile. It should follow this format:

(?(id/name)yes-pattern|no-pattern)

where the yes-pattern is:

\d+:\d\d (1 or more digits, colon, 2 digits)

and the no-pattern is:

\+\d+:\d\d (same as the yes-pattern, but with an escaped + sign preceding it)
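For reference, here is a minimal sketch of how the conditional syntax behaves in Python's `re` module. The conditional tests whether an earlier capturing group took part in the match, so a pattern that names a group it never defines is rejected at compile time. The `bracketed` pattern and its angle brackets are made up purely for illustration and are not part of the BBC data.

```python
import re

# The pattern from the question refers to a group called "name" that is
# never defined, so compiling it raises re.error.
try:
    re.compile(r'(?(name)\d+:\d\d|\+\d+:\d\d)')
    compiled = True
except re.error:
    compiled = False

# A conditional needs an earlier group to test. Here group 1 is an optional
# '<'; the closing '>' is required only when that group matched, otherwise
# the timestamp must run to the end of the string.
bracketed = re.compile(r'(<)?(\d+:\d\d)(?(1)>|$)')

print(compiled)                            # the question's pattern did not compile
print(bool(bracketed.match('<90:00>')))    # balanced brackets are accepted
print(bool(bracketed.match('<90:00')))     # unbalanced bracket is rejected
```

A conditional is therefore a tool for making one part of a pattern depend on another, not for choosing between two alternatives at a split point.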

I will use re.split(expression) to do the splitting.

Also, I plan to convert the timestamps to seconds later, so I will add XX:XX and +XX:XX together to get YY:YY.
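The conversion to seconds could be sketched roughly as follows. The names `STAMP` and `to_seconds` are hypothetical, and the sketch assumes every entry starts with `MM:SS`, optionally followed by `+M:SS` of added time that should simply be summed with the base time.

```python
import re

# Hypothetical helper pattern: "MM:SS" optionally followed by " +M:SS"
# at the start of an entry.
STAMP = re.compile(r'(\d+):(\d\d)(?:\s+\+(\d+):(\d\d))?')

def to_seconds(entry):
    """Return the total seconds for the leading timestamp of `entry`,
    or None if the entry does not start with a timestamp."""
    m = STAMP.match(entry)
    if m is None:
        return None
    minutes, seconds, extra_min, extra_sec = m.groups()
    total = int(minutes) * 60 + int(seconds)
    # Add the "+M:SS" part only when it is present.
    if extra_min is not None:
        total += int(extra_min) * 60 + int(extra_sec)
    return total

print(to_seconds('90:00 +4:09 Full time'))   # 5400 + 249 = 5649
print(to_seconds('89:31 Corner'))            # 5371
```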

Here is a sample string that my program currently processes:

Full Time Match ends, Everton 3, Swansea City 1. 90:00 +4:09 Full time Full Time Second Half ends, Everton 3, Swansea City 1. 90:00 +2:47 Attempt blocked. Nathan Dyer (Swansea City) right footed shot from the centre of the box is blocked. Assisted by Pablo Hernández. 90:00 +0:18 Offside, Swansea City. Leroy Lita tries a through ball, but Ashley Williams is caught offside. 89:31 Corner, Swansea City. Conceded by Leighton Baines. 88:42 Foul by James McCarthy (Everton). 

So I would like to get a list like this:

Full Time Match ends, Everton 3, Swansea City 1.
90:00 +4:09 Full time Full Time Second Half ends, Everton 3, Swansea City 1. 
90:00 +2:47 Attempt blocked. Nathan Dyer (Swansea City) right footed shot from the centre of the box is blocked. Assisted by Pablo Hernández. 
90:00 +0:18 Offside, Swansea City. Leroy Lita tries a through ball, but Ashley Williams is caught offside. 
89:31 Corner, Swansea City. Conceded by Leighton Baines. 
88:42 Foul by James McCarthy (Everton).

1 Answer

1

You can use a positive lookahead here:

import re
results = re.split(r'\s+(?=\d+:\d{2})', s)

Regular expression:

\s+           # whitespace (\n, \r, \t, \f, and " ") (1 or more times)
(?=           # look ahead to see if there is:
 \d+          # digits (0-9) (1 or more times)
 :            # ':'
 \d{2}        # digits (0-9) (2 times)
)             # end of look-ahead
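The annotated layout above is close to what Python accepts directly in verbose mode. As a sketch, the same pattern can be compiled with `re.VERBOSE` so the comments live inside the pattern itself. The name `SPLIT_RE` and the short test string are illustrative only.

```python
import re

# Same pattern as above, written with re.VERBOSE: whitespace inside the
# pattern is ignored and everything after '#' is a comment.
SPLIT_RE = re.compile(r'''
    \s+          # one or more whitespace characters
    (?=          # followed by (without consuming):
      \d+        #   one or more digits
      :          #   a colon
      \d{2}      #   exactly two digits
    )
''', re.VERBOSE)

parts = SPLIT_RE.split('foo 12:34 +1:05 bar 56:07 bar2')
print(parts)   # ['foo', '12:34 +1:05 bar', '56:07 bar2']
```

The space before `+1:05` is not a split point, because the lookahead requires a digit immediately after the whitespace and finds a `+` instead.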

Output

[
 'Full Time Match ends, Everton 3, Swansea City 1.', 
 '90:00 +4:09 Full time Full Time Second Half ends, Everton 3, Swansea City 1.', 
 '90:00 +2:47 Attempt blocked. Nathan Dyer (Swansea City) right footed shot from the centre of the box is blocked. Assisted by Pablo Hernández.', 
 '90:00 +0:18 Offside, Swansea City. Leroy Lita tries a through ball, but Ashley Williams is caught offside.',
 '89:31 Corner, Swansea City. Conceded by Leighton Baines.', 
 '88:42 Foul by James McCarthy (Everton). '
]
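As a possible follow-up, here is a small self-contained sketch that runs the split on a shortened version of the sample string and tidies the result. Stripping the input first and filtering out the untimed leading text are optional extras, not part of the answer above; the variable names `entries` and `timed` are made up for this example.

```python
import re

# Shortened version of the sample commentary from the question.
s = ('Full Time Match ends, Everton 3, Swansea City 1. '
     '90:00 +4:09 Full time Full Time Second Half ends, Everton 3, Swansea City 1. '
     '90:00 +0:18 Offside, Swansea City. '
     '88:42 Foul by James McCarthy (Everton). ')

# strip() first so the final entry does not keep its trailing space.
entries = re.split(r'\s+(?=\d+:\d{2})', s.strip())

# Keep only the entries that begin with a timestamp. This drops the leading
# untimed text, which the question says may be kept or discarded.
timed = [e for e in entries if re.match(r'\d+:\d{2}', e)]

for entry in timed:
    print(entry)
```

Numbers such as the scores in "Everton 3, Swansea City 1." are left alone, because the lookahead only matches digits that are followed by a colon and two more digits.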
