Regex to split words in Python

User

#1 · edited 2024-05-14 09:16:59

You can use string functions instead of a regular expression:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

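However, in your example you want to keep the apostrophe in John's but remove it in you!!'. Plain character replacement cannot tell those two cases apart, so you need a more finely tuned regular expression.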
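To make the limitation concrete, here is a small sketch that reruns the string-only approach and inspects the result. The name `tokens` is introduced here for illustration only; everything else comes from the snippet above.

```python
to_be_removed = ".,:!"  # characters to strip, as in the snippet above
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, "")

tokens = s.split()  # `tokens` is a name introduced for this sketch
print(tokens)

# Tokens that still carry a quote character on one of their edges.
leftover = [t for t in tokens if t.startswith("'") or t.endswith("'")]
print(leftover)
```

The second print shows `'Where` and `you'` still carrying their quote marks. Adding `'` to `to_be_removed` would fix those two tokens but would also turn John's into Johns.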

Edit: maybe a simple regex solves your problem:

(\w[\w']*)

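It starts a match at a word character (\w, which covers letters, digits, and the underscore) and keeps going as long as the next character is a word character or an apostrophe.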

(\w[\w']*\w)

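The second regex is for a very specific case. The first regex can capture words like you' (with a trailing apostrophe). This one avoids that and captures an apostrophe only when it sits inside a word, not at the start or the end. The trade-off is that the second regex cannot capture the apostrophe in Moss' mom. You have to decide whether you want to capture the trailing apostrophe that marks possession in names ending in s.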
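The difference between the two patterns is easiest to see side by side. The following is a minimal sketch; the sample sentence and the names `greedy` and `inner` are made up for illustration.

```python
import re

# The two patterns from the answer, written as raw strings.
greedy = re.compile(r"\w[\w']*")    # a match may end in an apostrophe
inner = re.compile(r"\w[\w']*\w")   # apostrophes are kept only inside a word

# Sample text invented for this sketch.
text = "Where are you' asked Moss' mom"

greedy_words = greedy.findall(text)
inner_words = inner.findall(text)

print(greedy_words)  # ['Where', 'are', "you'", 'asked', "Moss'", 'mom']
print(inner_words)   # ['Where', 'are', 'you', 'asked', 'Moss', 'mom']
```

Neither pattern can keep the possessive apostrophe in Moss' while dropping the closing quote in you', because the two look identical at the character level.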

Example:

import re

# Raw string so that \w reaches the regex engine as a regex escape
rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

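Update 2: I found a bug in my regex! It fails to capture a single letter followed by an apostrophe, such as A' (in fact it misses every one-character word, because the pattern requires at least two word characters). Here is the fixed regex: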

(\w[\w']*\w|\w)

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
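As a quick check of what the fix changes, this sketch runs the earlier pattern and the fixed one on the same input. The sample text and the names `old` and `fixed` are assumptions made for this example.

```python
import re

old = re.compile(r"\w[\w']*\w")       # requires at least two word characters
fixed = re.compile(r"\w[\w']*\w|\w")  # falls back to a single word character

# Sample text invented for this sketch; it contains two one-letter words.
text = "I'm a fan of 'A'"

old_words = old.findall(text)
fixed_words = fixed.findall(text)

print(old_words)    # ["I'm", 'fan', 'of']
print(fixed_words)  # ["I'm", 'a', 'fan', 'of', 'A']
```

The order of the alternatives matters: the longer branch comes first, so the single-character fallback is tried only when the longer branch cannot match at that position.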

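User

#2 · edited 2024-05-14 09:16:59

Your regex has too many capturing groups; make them non-capturing (re.split also returns the text matched by every capturing group, which clutters the result):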

(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

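Only one empty element is returned: the trailing empty string, which appears because the input ends with a separator.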
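If the trailing empty string is unwanted, one option is to filter it out after splitting. This is a small sketch on top of the answer's pattern, not part of the original answer; the names `pattern` and `words` are chosen here.

```python
import re

# Separator pattern from the answer above, compiled once.
pattern = re.compile(r"(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Drop empty strings produced when the text starts or ends with a separator.
words = [w for w in pattern.split(s) if w]
print(words)
```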

User

#3 · edited 2024-05-14 09:16:59

This regex allows only one trailing apostrophe, optionally followed by a single character:

([\w][\w]*'?\w?)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
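Because at most one character may follow the apostrophe, contractions with a longer tail get cut in two. The sketch below illustrates this with a made-up sentence; the names `rgx`, `text`, and `words` are chosen for this example.

```python
import re

# Pattern from the answer above, written as a raw string.
rgx = re.compile(r"([\w][\w]*'?\w?)")

# Sample text invented for this sketch: both contractions have more than
# one character after the apostrophe.
text = "they're here at five o'clock"

words = rgx.findall(text)
print(words)  # ["they'r", 'e', 'here', 'at', 'five', "o'c", 'lock']
```

This pattern therefore suits text whose contractions end in a single letter (John's, wasn't); for anything longer, a pattern that allows apostrophes anywhere inside a word may be a better fit.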
