Should I use Python's casefold?

1 answer

User

#1 · Posted 2024-05-21 08:09:24

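1) In Python 3, casefold() should be used to implement caseless string matching.

Since Python 3.0, strings have been stored as Unicode. Section 3.13 of the Unicode Standard defines default caseless matching as follows: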

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

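Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. However, case folding alone does not cover some corner cases, and it does not pass the Turkey Test (see point 3).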
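As a small illustration of why casefold() rather than lower() is the tool for comparison, here is a minimal sketch. The German sample words are an assumption chosen for the example, not something taken from the Unicode Standard.

```python
# Illustrative sketch: lower() vs. casefold() for caseless comparison.
# The sample words are assumptions chosen for this example.
a = "Stra\u00dfe"   # "Straße", written with the sharp s (U+00DF)
b = "STRASSE"

# lower() keeps the sharp s, so the two strings still differ.
lower_equal = a.lower() == b.lower()

# casefold() maps the sharp s to "ss", so the folded forms are identical.
casefold_equal = a.casefold() == b.casefold()

print(lower_equal)     # False
print(casefold_equal)  # True
```

This only shows the default folding; it says nothing yet about canonically equivalent sequences or the Turkic mappings.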

2) As of Python 3.6, casefold() cannot pass the Turkey Test.

For two characters, capital I and capital I with dot above, the Unicode Standard defines two different case folding mappings.

Default (non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)

Alternative (Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)

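Python's casefold() can only apply the default mapping, so it fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.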
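To see where the mismatch comes from, the following sketch prints the code points of both folded strings. The helper name code_points is hypothetical and exists only for this illustration; the strings are written with escapes so the exact code points are unambiguous.

```python
# Illustrative sketch: inspect what casefold() does to the Turkish example.
def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

upper = "L\u0130MANI"   # "LİMANI": contains U+0130 and U+0049
lower = "liman\u0131"   # "limanı": contains U+0069 and U+0131

folded_upper = upper.casefold()
folded_lower = lower.casefold()

# Default folding turns U+0130 into U+0069 U+0307 and U+0049 into U+0069,
# while U+0131 is left unchanged, so the two results cannot be equal.
print(code_points(folded_upper))
print(code_points(folded_lower))
print(folded_upper == folded_lower)  # False
```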

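3) How to do caseless string matching in Python 3.

Section 3.13 of the Unicode Standard describes several caseless matching algorithms. Canonical caseless matching probably suits most use cases. This algorithm already takes the corner cases into account; we only need to add an option to switch between non-Turkic and Turkic case folding.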

import unicodedata

def normalize_NFD(string):
    # Canonical decomposition, so canonically equivalent sequences compare equal.
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    if include_special_i:
        # Turkic mapping: recompose first so that İ is the single code point
        # U+0130, then map I -> ı and İ -> i before the default case folding.
        string = unicodedata.normalize('NFC', string)
        string = string.replace('\u0049', '\u0131')
        string = string.replace('\u0130', '\u0069')
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    # NFD(toCasefold(NFD(string))), the key used by canonical caseless matching.
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)

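casefold_() is a wrapper around Python's casefold(). If its parameter include_special_i is set to True, it applies the Turkic mapping; if it is set to False, the default mapping is used.

caseless_match() performs canonical caseless matching on string1 and string2. If the strings are Turkic words, the include_special_i parameter must be set to True.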
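The two NFD passes matter for strings that look identical but are built from different code point sequences. Here is a minimal, self-contained sketch that compares a precomposed "É" with "e" followed by a combining acute accent. The helper name canonical_fold is hypothetical, and the sketch leaves out the Turkic option.

```python
import unicodedata

def canonical_fold(text):
    """Return NFD(casefold(NFD(text))), the key compared in canonical caseless matching."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

composed = "\u00c9"      # "É" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# casefold() alone keeps the two different encodings apart.
plain_equal = composed.casefold() == decomposed.casefold()

# With normalization around the folding, both reduce to the same sequence.
canonical_equal = canonical_fold(composed) == canonical_fold(decomposed)

print(plain_equal)      # False
print(canonical_equal)  # True
```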

Examples:

caseless_match('LİMANI', 'limanı', include_special_i=True)  # True

caseless_match('LİMANI', 'limanı')  # False

caseless_match('INTENSIVE', 'intensive', include_special_i=True)  # False

caseless_match('INTENSIVE', 'intensive')  # True
