Using re.sub to remove everything up to and including a given word

Asked 2025-04-18 15:23

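I have a string that looks like "Blah blah blah, Updated: Aug. 23, 2012", and I want to use a regex to extract just the date, Aug. 23, 2012. I found a similar post, "Regex to remove all text before a character", but the approach there did not work when I tried it: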

date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^[^Updated]*',"", date_div)

How can I remove everything before "Updated", along with "Updated" itself, so that only Aug. 23, 2012 is left?

Thanks!

3 Answers

You can use a lookahead:

import re
date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^(.*)(?=Updated)',"", date_div)
print(extracted_date)

Output:

Updated: Aug. 23, 2012

Edit:
If MattDMo's comment below is correct and you also want to remove "Updated: ", you can do this:

extracted_date = re.sub('^(.*Updated: )',"", date_div)
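
As a side note on why the pattern in the question fails: [^Updated] is a negated character class, so it matches any single character other than U, p, d, a, t, or e. It does not mean "anything that is not the word Updated". A minimal sketch (the names attempt and fixed are only for illustration):

```python
import re

date_div = "Blah blah blah, Updated: Aug. 23, 2012"

# [^Updated]* stops at the first character that IS in the set
# {U, p, d, a, t, e}. Here that is the "a" in "Blah", so only "Bl" is removed.
attempt = re.sub(r'^[^Updated]*', '', date_div)
print(attempt)

# Matching the literal marker removes everything through "Updated: ".
fixed = re.sub(r'^.*Updated: ', '', date_div)
print(fixed)
```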

Answered 2025-04-18 by Python大师

With a regex, there are two patterns you can use, depending on which occurrence of the word you want to cut at.

# Remove all up to the first occurrence of the word including it (non-greedy):
^.*?word
# Remove all up to the last occurrence of the word including it (greedy):
^.*word

See this non-greedy regex demo and this greedy regex demo.

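Here, ^ matches the start of the string, and .*? matches zero or more characters (note the re.DOTALL flag, which lets . match line breaks as well), as few as possible (whereas .* matches as many as possible). Then word matches and consumes the word, i.e. it is added to the match value and the regex index advances past it.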
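
To make the re.DOTALL remark concrete, here is a small sketch under the assumption that the input contains a line break before the marker (the variable names are illustrative only):

```python
import re

text = "Blah blah\nblah, Updated: Aug. 23, 2012"

# Without re.DOTALL, "." cannot cross the newline, so the anchored pattern
# finds no match and the string comes back unchanged.
without_dotall = re.sub(r'^.*?Updated:', '', text)

# With re.DOTALL, "." also matches "\n" and the prefix is removed.
with_dotall = re.sub(r'^.*?Updated:', '', text, flags=re.DOTALL)

print(repr(without_dotall))
print(repr(with_dotall))
```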

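Note the use of re.escape(up_to_word): if up_to_word may contain anything other than letters, digits, and underscores, it is safer to pass it through re.escape so that special characters such as (, [, or ? are treated literally and do not prevent the regex from finding a valid match.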

See this Python demo:

import re

date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"

up_to_word = "Updated:"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
rx_to_last = r'^.*{}'.format(re.escape(up_to_word))

print("Remove all up to the first occurrence of the word including it:")
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
print("Remove all up to the last occurrence of the word including it:")
print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())

Output:

Remove all up to the first occurrence of the word including it:
Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019
Remove all up to the last occurrence of the word including it:
Feb. 13, 2019
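
The re.escape point can be sketched the same way. The marker "(Updated)" below is a hypothetical example chosen because it contains regex metacharacters; it does not come from the question:

```python
import re

text = "Note (Updated) Aug. 23, 2012"
up_to_word = "(Updated)"  # hypothetical marker with regex metacharacters

# Unescaped, the parentheses form a capturing group, so the pattern only
# consumes "Updated" and leaves the closing ")" behind.
unescaped = re.sub(r'^.*?' + up_to_word, '', text, flags=re.DOTALL)

# Escaped, the parentheses are matched literally.
escaped = re.sub(r'^.*?' + re.escape(up_to_word), '', text, flags=re.DOTALL)

print(repr(unescaped))
print(repr(escaped))
```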

Answered 2025-04-18 by Python大师

In this case you can do it without a regex at all, for example:

>>> date_div = "Blah blah blah, Updated: Aug. 23, 2012"
>>> date_div.split('Updated: ')
['Blah blah blah, ', 'Aug. 23, 2012']
>>> date_div.split('Updated: ')[-1]
'Aug. 23, 2012'
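
One thing to watch with split(...)[-1]: if the marker is missing, it returns the whole string unchanged. A possible variant, sketched here with str.partition and a hypothetical helper named text_after, makes the missing-marker case explicit:

```python
def text_after(marker, text):
    """Return the text after the first occurrence of marker, or None if absent."""
    _before, sep, after = text.partition(marker)
    # partition returns an empty separator when the marker is not found
    return after if sep else None

print(text_after('Updated: ', "Blah blah blah, Updated: Aug. 23, 2012"))
print(text_after('Updated: ', "Blah blah blah"))
```

str.partition cuts at the first occurrence; str.rpartition would cut at the last one instead.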

Answered 2025-04-18 by Python大师
