Why isn't this a fixed-width pattern?

5 votes
3 answers
6484 views
Asked 2025-04-16 13:52

I'm trying to properly split English sentences, so I wrote the following god-awful regular expression:

(?<!\d|([A-Z]\.)|(\.[a-z]\.)|(\.\.\.)|etc\.|[Pp]rof\.|[Dd]r\.|[Mm]rs\.|[Mm]s\.|[Mm]z\.|[Mm]me\.)(?<=([\.!?])|(?<=([\.!?][\'\"])))[\s]+?(?=[\S])

The problem is that Python keeps throwing the following error:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sp.py", line 55, in analyze
    self.sentences = re.split(god_awful_regex, self.inputstr.strip())
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 165, in split
    return _compile(pattern, 0).split(string, maxsplit)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
    raise error, v # invalid expression
sre_constants.error: look-behind requires fixed-width pattern

Why isn't this a valid fixed-width regular expression? I'm not using any repetition characters (* or +), just alternation with |.
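
Stripped down to nothing but alternation, a much smaller pattern reproduces the same error even with no repetition characters at all (a minimal check against CPython's `re`):

```python
import re

# Branches that are all the same width compile fine.
re.compile(r"(?<=\.|!|\?)\s")

# Branches of different widths trigger the error, with no * or + in sight.
try:
    re.compile(r"(?<=a|ab)x")
except re.error as exc:
    print(exc)  # complains that the look-behind requires a fixed-width pattern
```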


EDIT: @Anomie solved this one for me, thanks! Unfortunately, I can't get the new expression to balance:

(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])

Here's what I have now. The number of opening parentheses does match the number of closing parentheses:

>>> god_awful_regex = r'''(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])'''
>>> god_awful_regex.count('(')
17
>>> god_awful_regex.count(')')
17
>>> god_awful_regex.count('[')
13
>>> god_awful_regex.count(']')
13

Any other ideas?

3 Answers

-1

It looks like you may be using repeated characters in the final part:

[\s]+?

Unless I'm misunderstanding.

UPDATE

Or the pipe character, as nightcracker mentioned; the first answer here seems to confirm that view: Determine whether a regular expression only matches fixed-length strings
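
For what it's worth, the lazy repetition by itself does compile: the fixed-width restriction applies only to what is inside the lookbehind, and `[\s]+?` sits outside it (a quick sketch against CPython's `re`):

```python
import re

# The lazy repetition outside the lookbehind compiles without complaint...
re.compile(r"(?<=[.!?])[\s]+?(?=\S)")

# ...whereas a variable-width construct *inside* a lookbehind is rejected.
try:
    re.compile(r"(?<=\s+?)x")
except re.error as exc:
    print(exc)
```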

1

This doesn't directly answer your question. However, if you want to split a block of text into sentences, take a look at nltk. Among many other things, it includes a sentence tokenizer called PunktSentenceTokenizer. Here is some example code:

""" PunktSentenceTokenizer

A sentence tokenizer which uses an unsupervised algorithm to build a model
for abbreviation words, collocations, and words that start sentences; and then
uses that model to find sentence boundaries. This approach has been shown to
work well for many European languages. """

from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
print tokenizer.tokenize(__doc__)

# [' PunktSentenceTokenizer\n\nA sentence tokenizer which uses an unsupervised
# algorithm to build a model\nfor abbreviation words, collocations, and words
# that start sentences; and then\nuses that model to find sentence boundaries.',
# 'This approach has been shown to\nwork well for many European languages. ']

13

Consider this sub-expression:

(?<=([\.!?])|(?<=([\.!?][\'\"])))

In this expression, the left-hand side of the | is one character wide, while the right-hand side is zero characters wide. You have the same problem in your larger negative lookbehind: it could be 1, 2, 3, 4, or 5 characters.

Logically, a negative lookbehind (?<!A|B|C) should be equivalent to a series of lookbehinds (?<!A)(?<!B)(?<!C). A positive lookbehind (?<=A|B|C) should be equivalent to (?:(?<=A)|(?<=B)|(?<=C)).
