Regex to split words in Python

Posted 2024-04-25 01:56:26


I am working on a regex that splits a given text into all of its actual words:


Example input:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"


Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]



I came up with this regex:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

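After splitting with it in Python, the result contains None items and spaces.

How do I get rid of the None items? And why are the spaces not matched?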


Edit:
Splitting on whitespace gives items like: ["there."]
Splitting on non-letters gives items like: ["John","s"]
Splitting on non-letters other than ' gives items like: ["'Where","you'"]


3 answers

You can use string functions instead of a regex:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

However, in your example you want to keep the apostrophe in John's but drop it in you!!'. String operations fall short at this point, and you need a finely tuned regex.
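To make the shortcoming concrete, here is a small sketch that wraps the replace-based approach in a function and runs it on the sample sentence. The helper name `remove_and_split` is made up for this illustration.

```python
def remove_and_split(text, to_be_removed=".,:!"):
    """Delete each listed character, then split on whitespace."""
    for c in to_be_removed:
        text = text.replace(c, "")
    return text.split()

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
words = remove_and_split(s)

# The apostrophe in John's survives, but so do the quote marks around the speech.
print(words[0])    # John's
print(words[-3:])  # ["'Where", 'are', "you'"]
```

Adding ' to the characters to remove would clean up 'Where and you', but it would also turn John's into Johns.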

Edit: perhaps a simple regex solves your problem:

(\w[\w']*)

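It captures a run that starts with a word character and keeps going as long as the next character is an apostrophe or a word character. Note that \w matches digits and underscores as well as letters.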

(\w[\w']*\w)

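The second regex covers a very specific case: the first regex captures words like you' together with the trailing apostrophe. The second one avoids that and only captures an apostrophe when it is inside a word (not at the beginning or the end). The trade-off is that the second regex cannot capture the apostrophe in Moss' mom. You have to decide whether to capture the trailing apostrophe that marks possession in names ending in s.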
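A short side-by-side sketch of the two patterns on the two cases just mentioned. The variable names are illustrative only.

```python
import re

greedy = re.compile(r"\w[\w']*")    # a match may end with an apostrophe
inner = re.compile(r"\w[\w']*\w")   # a match must end with a word character

quoted = "'Where are you'"
possessive = "Moss' mom"

print(greedy.findall(quoted))       # ['Where', 'are', "you'"]
print(inner.findall(quoted))        # ['Where', 'are', 'you']
print(greedy.findall(possessive))   # ["Moss'", 'mom']
print(inner.findall(possessive))    # ['Moss', 'mom']
```

Neither pattern can tell a closing quote mark from a possessive apostrophe, so the choice depends on which of the two cases matters more for your text.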

Example:

import re

# Raw string, so that \w reaches the regex engine unchanged
rgx = re.compile(r"(\w[\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

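Update 2: I found a bug in my regex! Because it requires at least two word characters, it cannot capture a single-letter word, such as the A in A'. Here is the fixed regex: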

(\w[\w']*\w|\w)

import re

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
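Since \w also matches digits and underscores, a letters-only variant may be closer to the [a-zA-Z] classes used in the question. The following is a sketch under that assumption (ASCII letters only); `LETTER_WORD` and `split_words` are hypothetical names, not part of the answer.

```python
import re

# A letter, optionally followed by letters/apostrophes that end in a letter.
# Assumes ASCII-only text, like the [a-zA-Z] classes in the question.
LETTER_WORD = re.compile(r"[A-Za-z](?:[A-Za-z']*[A-Za-z])?")

def split_words(text):
    """Return the words in text, keeping apostrophes only inside a word."""
    return LETTER_WORD.findall(text)

print(split_words("It's 2024 and foo_bar"))  # ["It's", 'and', 'foo', 'bar']
print(split_words("'A a'"))                  # ['A', 'a']
```

The optional group plays the same role as the |\w alternative: it lets a single letter match on its own.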

Your regex has too many capturing groups; make them non-capturing:

(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

Only a single empty element is returned, at the end of the list.
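As a sketch of why the groups matter: when the pattern contains capturing groups, re.split also returns the text of every group for each separator, and a group that did not take part in the match comes back as None. The short input "a b" is chosen here only to keep the output readable.

```python
import re

capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
non_capturing = r"(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)"

# Four groups, so four extra items per separator; only group 4 matched the space.
print(re.split(capturing, "a b"))      # ['a', None, None, None, ' ', 'b']
print(re.split(non_capturing, "a b"))  # ['a', 'b']

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
# Drop the empty string left behind when the text ends with a separator.
words = [w for w in re.split(non_capturing, s) if w]
print(words[-2:])                      # ['are', 'you']
```

This also answers the second part of the question: the spaces were matched, but they were handed back as the captured text of the last group.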

This regex allows only a single trailing apostrophe, which may be followed by one more character:

([\w][\w]*'?\w?)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
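A brief sketch of what the "one apostrophe, then at most one character" rule means in practice. The test words are picked for illustration and do not come from the answer.

```python
import re

rgx = re.compile(r"\w\w*'?\w?")

print(rgx.findall("wasn't"))   # ["wasn't"]
print(rgx.findall("they're"))  # ["they'r", 'e']
print(rgx.findall("you'"))     # ["you'"]
```

Contractions with two letters after the apostrophe are cut in two, and a closing quote directly after a word is kept, which is where the a' in the demo output comes from.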
