python-re: How do I match an alpha character

Question

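I'd like to know how to match an alphabetic character with a regular expression. I want a character that is in \w but is not in \d. It needs to be Unicode compatible, which is why I cannot use [a-zA-Z].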

Answer 1

You can use one of the following expressions to match a single letter:

(?![\d_])\w

or

\w(?<![\d_])

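Here \w matches a word character, and the negative assertion rejects the match if that same character is in [\d_] (a digit or an underscore). The lookahead form tests the character before \w consumes it; the lookbehind form tests it right after.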

From the docs:

(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
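As a rough illustration of the two patterns above, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default; the sample text and the names LOOKAHEAD and LOOKBEHIND are made up for this example.

```python
import re

# Lookahead form: refuse to start a match on a digit or an underscore.
LOOKAHEAD = re.compile(r"(?![\d_])\w")
# Lookbehind form: match a word character, then reject it if it was a digit or an underscore.
LOOKBEHIND = re.compile(r"\w(?<![\d_])")

# Hypothetical sample: e-acute, underscore, digit, sharp s.
text = "\u00e9_9\u00df"

print(LOOKAHEAD.findall(text))
print(LOOKBEHIND.findall(text))
```

Both patterns should keep the two letters and drop the underscore and the digit.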

Answer 2

What about:

\p{L}

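You can use this document as a reference: Unicode Regular Expressions

Edit: It seems Python's re module doesn't support Unicode property expressions such as \p{L}. Have a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (the link is no longer active; see the Internet Archive copy)

Other references:

For posterity, here are the examples from the blog: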

import re
string = 'riché'
print string
riché

richre = re.compile('([A-z]+)')
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('(\w+)',re.LOCALE)
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('([é\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9-\xf8\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

string = 'richéñ'
match = richre.match(string)
print match.groups()
('rich\xe9\xf1',)

richre = re.compile('([\u00E9-\u00F8\w]+)')
print match.groups()
('rich\xe9\xf1',)

matched = match.group(1)
print matched
richéñ
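As a supplement to this answer: since the standard re module rejects \p{L}, one possible workaround is to test each character's Unicode general category with the standard unicodedata module. The sketch below assumes Python 3; is_letter and supports_property_escapes are names invented for this example. (The third-party regex package does accept \p{L}, but it is not used here.)

```python
import re
import unicodedata

def is_letter(ch):
    """Return True if ch is in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

# Probe whether this interpreter's re module accepts \p{L}.
try:
    re.compile(r"\p{L}")
    supports_property_escapes = True
except re.error:
    supports_property_escapes = False

print(supports_property_escapes)
# Keep only the letters from a sample containing an accent, an underscore and a digit.
print([ch for ch in "rich\u00e9_9" if is_letter(ch)])
```

This is slower than a compiled pattern on long strings, but it follows the Unicode letter categories directly instead of approximating them with \w.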

Answer 3

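Your first two sentences contradict each other. "In \w but not in \d" includes the underscore. I'm assuming from your third sentence that you don't want the underscore.

Sketching a simple Venn diagram on the back of an envelope helps. Let's first look at what we DON'T want:

(1) characters that are not matched by \w (i.e. we don't want anything other than letters, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _

So what we don't want is anything in the character class [\W\d_], and consequently what we do want is anything in the character class [^\W\d_].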

Here's a simple example (Python 2.6).

>>> import re
>>> rx = re.compile("[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
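For readers on Python 3, the same character class can be sketched as follows. This assumes Python 3 semantics, where str patterns are Unicode-aware without re.UNICODE; the variable names and the sample strings are illustrative only.

```python
import re

# Word characters minus digits and underscore, Unicode-aware by default in Python 3.
letters = re.compile(r"[^\W\d_]+")
# The same class restricted to ASCII, for comparison.
ascii_letters = re.compile(r"[^\W\d_]+", re.ASCII)

print(letters.findall("abc_def,k9"))
print(letters.findall("na\u00efve_caf\u00e99"))
print(ascii_letters.findall("caf\u00e9"))
```

With re.ASCII the accented character falls outside \w, so the last call should stop at "caf".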

Further exploration reveals a few quirks of this approach:

>>> import unicodedata as ucd
>>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
...     print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']

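U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but it appears that Python interprets "digit" to mean "decimal digit" (category Nd), so it doesn't match \d. It therefore slips through [^\W\d_].

U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w.

All CJK ideographs are classed as "letters" and thus match \w.

Whether or not any of the three points above matter to you, this approach is the best you will get out of the currently released re module. Syntax like \p{letter} is in the future.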
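To re-check these quirks on a current interpreter, a small probe along the following lines may help. It assumes Python 3 and is only a sketch; classify, WORD and DIGIT are names invented here, and the sample characters are the ones from the session above.

```python
import re
import unicodedata

WORD = re.compile(r"\w")
DIGIT = re.compile(r"\d")

def classify(ch):
    """Return (general category, matches \\w, matches \\d) for one character."""
    return (
        unicodedata.category(ch),
        bool(WORD.fullmatch(ch)),
        bool(DIGIT.fullmatch(ch)),
    )

for ch in "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021":
    category, is_word, is_digit = classify(ch)
    print(f"U+{ord(ch):04X} {category} word={is_word} digit={is_digit}")
```

A character that reports word=True, digit=False with a category outside L* (such as Nl) is one that [^\W\d_] accepts even though it is not a letter.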
