Normalizing Unicode text to filenames, etc. in Python

数据工程师

Asked 2025-04-17 11:16

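Is there a standalone solution for Python that converts internationalized Unicode text into safe IDs and filenames?

For example, turning My International Text: åäö into my-international-text-aao.

plone.i18n does a really good job, but unfortunately it depends on zope.security, zope.publisher, and several other packages, which makes it a fragile dependency to take on.

These are some of the operations that plone.i18n applies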


5 Answers

I'll throw in my own (partial) solution as well:

import unicodedata

def deaccent(some_unicode_string):
    # Decompose to NFD, then drop every nonspacing mark (category 'Mn').
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                    if unicodedata.category(c) != 'Mn')

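This doesn't fully meet your needs, but it packs a few nice tricks into one convenient function: unicodedata.normalize('NFD', some_unicode_string) decomposes Unicode characters; for example, it splits 'ä' into two code points, U+0061 (LATIN SMALL LETTER A) and U+0308 (COMBINING DIAERESIS).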

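The other call, unicodedata.category(char), returns the Unicode general category of the character char. Category Mn (Mark, nonspacing) contains all the combining accents, so deaccent removes every accent from the text.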
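To make the decomposition concrete, here is a small sketch that prints each code point NFD produces for 'ä', along with its Unicode name and category. The variable names are only illustrative.

```python
import unicodedata

# '\u00e4' is the precomposed character 'ä' (LATIN SMALL LETTER A WITH DIAERESIS).
decomposed = unicodedata.normalize('NFD', '\u00e4')

# Show each resulting code point with its name and general category.
for ch in decomposed:
    print('U+{:04X}'.format(ord(ch)), unicodedata.name(ch), unicodedata.category(ch))
# U+0061 LATIN SMALL LETTER A Ll
# U+0308 COMBINING DIAERESIS Mn
```

The second code point is in category Mn, which is exactly what deaccent filters out.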

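Keep in mind that this is only a partial solution: it strips the accents, but you still need a whitelist of characters to decide what is allowed afterwards.

Answered 2025-04-17 by Python大师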
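One way the whitelist step might look, as a rough sketch: after removing accents, keep only ASCII letters and digits and collapse everything else into a separator. The helper name to_safe_id and the choice of whitelist (a-z, 0-9) are assumptions made for illustration, not part of the answer above.

```python
import re
import unicodedata

def deaccent(text):
    # Same idea as above: decompose, then drop nonspacing marks.
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

def to_safe_id(text, separator='-'):
    # Hypothetical helper: the whitelist here is ASCII letters and digits only.
    stripped = deaccent(text).lower()
    # Replace every run of non-whitelisted characters with one separator.
    collapsed = re.sub(r'[^a-z0-9]+', separator, stripped)
    # Avoid leading or trailing separators in the result.
    return collapsed.strip(separator)

print(to_safe_id('My International Text: \u00e5\u00e4\u00f6'))
# my-international-text-aao
```

Characters that have no decomposition into a base letter plus a mark (for example 'ø' or 'ß') are not handled by deaccent and end up replaced by the separator in this sketch.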

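The way to solve this problem is to first decide which characters are allowed (different systems have different rules for valid identifiers).

Once you have decided which characters are allowed, write an allowed() predicate and a dict subclass for use with str.translate: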

def makesafe(text, allowed, substitute=None):
    ''' Remove unallowed characters from text.
        If *substitute* is defined, then replace
        the character with the given substitute.
    '''
    class D(dict):
        def __getitem__(self, key):
            return key if allowed(chr(key)) else substitute
    return text.translate(D())

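This function is very flexible. It lets you easily specify rules for which text is kept and which is replaced or removed.

Here is a simple example using the rule "only allow characters in Unicode category L" (letters):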

import unicodedata

def allowed(character):
    return unicodedata.category(character).startswith('L')

print(makesafe('the*ides&of*march', allowed, '_'))
print(makesafe('the*ides&of*march', allowed))

This code produces safe output as follows:

the_ides_of_march
theidesofmarch
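The same makesafe() can be paired with a different predicate. As a sketch, the rule below keeps letters, digits, and a few punctuation characters that are commonly safe in filenames; the predicate name filename_char and the exact set '-_.' are assumptions chosen for this example.

```python
import unicodedata

def makesafe(text, allowed, substitute=None):
    # Same function as above: str.translate looks up each code point in D.
    class D(dict):
        def __getitem__(self, key):
            return key if allowed(chr(key)) else substitute
    return text.translate(D())

def filename_char(character):
    # Hypothetical rule: any letter (L*) or number (N*), plus '-', '_' and '.'.
    return unicodedata.category(character)[0] in ('L', 'N') or character in '-_.'

print(makesafe('report 2024/final*.txt', filename_char, '_'))
print(makesafe('report 2024/final*.txt', filename_char))
# report_2024_final_.txt
# report2024final.txt
```

Because category L covers letters from every script, this rule keeps characters such as 'å' unchanged; combine it with accent stripping if ASCII-only output is required.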

Answered 2025-04-17 by Python大师

What you want to do is also known as "slugifying": converting a string into a form suitable for use in a URL. Here is one possible solution:

import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim=u'-'):
    """Generates a slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        if word:
            result.append(word)
    return unicode(delim.join(result))

Usage (Python 2, since the function as written relies on the unicode builtin):

>>> slugify(u'My International Text: åäö')
u'my-international-text-aao'

You can also change the delimiter:

>>> slugify(u'My International Text: åäö', delim='_')
u'my_international_text_aao'

Source: Generating Slugs

For Python 3: pastebin.com/ft7Yb3KS (thanks @MrPoxipol).

Answered 2025-04-17 by Python大师
