Matching Unicode word boundaries in Python

Posted 2024-05-18 19:58:44


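To match Unicode word boundaries in Python [as defined in Annex #29], I have been using the regex package with the flags regex.WORD | regex.V1 (regex.UNICODE should be the default, since the pattern is a Unicode string), as follows:

>>> import regex
>>> s = "here are some words"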
>>> regex.findall(r'\w(?:\B\S)*', s, flags = regex.V1 | regex.WORD)
['here', 'are', 'some', 'words']

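This works well in such fairly simple cases. However, I was wondering what the expected behavior is when the input string contains certain punctuation. As I read it, WB7 says that, for example, the apostrophe in x'z does not qualify as a word boundary, and that seems to be the case: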

>>> regex.findall(r'\w(?:\B\S)*', "x'z", flags = regex.V1 | regex.WORD)
["x'z"]

However, things change when a vowel follows the apostrophe:

>>> regex.findall(r'\w(?:\B\S)*', "l'avion", flags = regex.V1 | regex.WORD)
["l'", 'avion']

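This suggests that the regex module implements rule WB5a, which is mentioned in the Notes section of the standard. However, that rule also says the behavior should be the same for \u2019 (right single quotation mark), and I cannot reproduce that: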

>>> regex.findall(r'\w(?:\B\S)*', "l\u2019avion", flags = regex.V1 | regex.WORD)
['l’avion']

Moreover, even with a "normal" apostrophe, a ligature (or y) seems to behave as a "non-vowel":

>>> regex.findall(r'\w(?:\B\S)*', "l'œil", flags = regex.V1 | regex.WORD)
["l'œil"]
>>> regex.findall(r'\w(?:\B\S)*', "J'y suis", flags = regex.V1 | regex.WORD)
["J'y", 'suis']

Is this the expected behavior? (All the examples above were run with regex 2.4.106 and Python 3.5.2.)


1 answer

User

#1 · Posted 2024-05-18 19:58:44

1 - The right single quotation mark ’ seems to have simply been left out of the source file:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
  is_unicode_vowel(char_at(state->text, text_pos)))
    return TRUE;
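
To make the condition above easier to experiment with, here is a minimal Python sketch of the same check. It is not the regex module's code: `is_wb5a_break`, `VOWELS`, and `APOSTROPHES` are hypothetical names, and the vowel set is a deliberately tiny stand-in for the module's internal table. The only point is to show how accepting U+2019 alongside the ASCII apostrophe changes the result.

```python
# Illustrative Python rendering of the WB5a check quoted above.
# VOWELS is a hypothetical stand-in for the module's internal vowel table.
VOWELS = frozenset("aeiouAEIOU")

# Assumption: accept the ASCII apostrophe and U+2019 RIGHT SINGLE QUOTATION MARK.
APOSTROPHES = frozenset("'\u2019")


def is_wb5a_break(text, pos, apostrophes=APOSTROPHES, vowels=VOWELS):
    """Return True if a WB5a-style break falls just before text[pos]."""
    # Mirrors the `pos_m1 >= 0` guard and also rejects out-of-range positions.
    if pos <= 0 or pos >= len(text):
        return False
    return text[pos - 1] in apostrophes and text[pos] in vowels


print(is_wb5a_break("l'avion", 2))       # True: apostrophe followed by a vowel
print(is_wb5a_break("l\u2019avion", 2))  # True once U+2019 is accepted
print(is_wb5a_break("x'z", 2))           # False: 'z' is not a vowel

# Restricting the check to the ASCII apostrophe mimics the quoted C condition.
print(is_wb5a_break("l\u2019avion", 2, apostrophes=frozenset("'")))  # False
```

With the ASCII-only set, the last call reports no break, which matches the behavior observed in the question for "l\u2019avion".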

2 - Unicode vowels are determined by the is_unicode_vowel function, which amounts to the following list:

[list not preserved in the source]

As a result, the LATIN SMALL LIGATURE OE character œ is not considered a Unicode vowel:

[example not preserved in the source]
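
As a rough illustration of why a ligature can slip past a vowel test, the sketch below classifies a character by its base letter after canonical decomposition. This is an assumption-laden example and not how the regex module decides: `BASE_VOWELS`, `EXTRA_VOWELS`, and both functions are hypothetical. Accented vowels such as é decompose to a plain vowel plus a combining mark, whereas œ has no Unicode decomposition, so it is only recognized if listed explicitly.

```python
import unicodedata

# Hypothetical base set; the real module uses its own internal table.
BASE_VOWELS = frozenset("aeiou")

# Assumption: characters one might choose to add for French text.
EXTRA_VOWELS = frozenset("œæy")


def is_vowel_by_base_letter(ch):
    """Test the first code point of the NFD form against BASE_VOWELS."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return bool(decomposed) and decomposed[0] in BASE_VOWELS


def is_vowel_extended(ch):
    """Same test, plus the explicitly listed extra characters."""
    return is_vowel_by_base_letter(ch) or ch.lower() in EXTRA_VOWELS


print(is_vowel_by_base_letter("é"))  # True: decomposes to 'e' + combining acute
print(is_vowel_by_base_letter("œ"))  # False: the ligature does not decompose
print(is_vowel_extended("œ"))        # True: listed explicitly
print(is_vowel_extended("y"))        # True: listed explicitly
```

Whether y should count as a vowel is a judgment call for the application; it is included here only because the question uses "J'y" as an example.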

This bug has now been fixed in regex 2016.08.27, following a bug report. [_regex.c:#1668]
