Does Python re (regex) have an alternative to \u Unicode escape sequences?

import re

codepoint = 2014  # Say I got this dynamically from somewhere

test = u"This string ends with \u2014"
pattern = r"\u%s$" % codepoint
assert(pattern[-5:] == "2014$")  # Ends with an escape sequence for U+2014
assert(re.search(pattern, test) != None)  # Failure -- No match (bad)
assert(re.search(pattern, "u2014") != None)  # Success -- This matches (bad)

2 Answers

User

#1 · edited 2024-05-23 19:15:10

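One possibility is, rather than calling the re methods directly, to wrap them in something that understands \u escapes on their behalf. Something like this: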

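import re

def my_re_search(pattern, s):
    # Expand \uXXXX escapes first, then hand the pattern to re as usual
    return re.search(unicode_unescape(pattern), s)

def unicode_unescape(s):
    """
    Turn \uxxxx escapes into actual unicode characters
    """
    def unescape_one_match(matchObj):
        escape_seq = matchObj.group(0)
        # Python 2: decode the six-character escape into one unicode character
        return escape_seq.decode('unicode_escape')
    # Only backslash-u followed by exactly four hex digits is rewritten
    return re.sub(r"\\u[0-9a-fA-F]{4}", unescape_one_match, s)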

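Example of it working: with this wrapper, a raw-string pattern such as r"\u2014$" matches a string that ends with U+2014, which a direct re.search call on the same pattern fails to do in Python 2.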
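A minimal Python 3 sketch of the same wrapper, for readers who are not on Python 2. It assumes the pattern is a str and replaces `.decode('unicode_escape')` with `chr(int(..., 16))`, since Python 3 strings have no `decode` method. The sample `path` value is hypothetical, and the function names are simply reused from the answer above.

```python
import re

def unicode_unescape(pattern):
    """Replace each \\uXXXX escape in pattern with the character it names."""
    def unescape_one_match(match):
        # group(1) holds the four hex digits that follow the backslash-u
        return chr(int(match.group(1), 16))
    return re.sub(r"\\u([0-9a-fA-F]{4})", unescape_one_match, pattern)

def my_re_search(pattern, s):
    # Expand the escapes, then search as usual
    return re.search(unicode_unescape(pattern), s)

# Raw string: here \u20ac is six ordinary characters, not the euro sign
pat = r"C:\\.*\u20ac"
# Normal string (hypothetical sample path): \u20ac is the euro sign itself
path = "C:\\reports\\twenty\u20acplan.txt"

print(unicode_unescape(pat) == "C:\\\\.*\u20ac")  # True: only the \u escape changed
print(my_re_search(pat, path) is not None)        # True: the pattern now matches
```

Like the original, this sketch does not check whether the backslash before `u` is itself escaped, so a pattern that means "literal backslash, then u20ac" would be rewritten too. Note also that from Python 3.3 onward the re module accepts \uXXXX escapes in patterns directly, so the wrapper mainly matters on older interpreters.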

Thanks to "Process escape sequences in a string in Python" for pointing out the decode("unicode_escape") idea.

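Note, however, that you cannot simply run the whole pattern through decode("unicode_escape"). It will work some of the time (most regex escapes, such as \. or \d, pass through the codec unchanged), but not in general: the codec also collapses \\ into a single backslash, for example. Here decode("unicode_escape") changes the meaning of the regular expression: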

>>> pat = r"C:\\.*\u20ac"  # U+20ac is the euro sign
>>> print pat
C:\\.*\u20ac    # Asks for a literal backslash

>>> pat_revised = pat.decode("unicode_escape")
>>> print pat_revised
C:\.*€    # Asks for a literal period (without a backslash)
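The same pitfall can be reproduced in Python 3 with a short sketch. It assumes an ASCII-only pattern, because a Python 3 str must be encoded to bytes before the `unicode_escape` codec can decode it. The sample `path` is hypothetical.

```python
import re

# Raw string: an escaped backslash, then ".*", then a \u escape
pat = r"C:\\.*\u20ac"

# Decoding the whole pattern also collapses the doubled backslash
decoded = pat.encode("ascii").decode("unicode_escape")

# Hypothetical sample path containing a real backslash and a euro sign
path = "C:\\reports\\twenty\u20acplan.txt"

print(decoded == "C:\\.*\u20ac")                         # True: one backslash is left
print(re.search(decoded, path) is None)                  # True: the path no longer matches
print(re.search(decoded, "C:...\u20ac") is not None)     # True: it now wants literal periods
```

After decoding, the regex reads as "C:", then zero or more literal periods, then the euro sign, so it rejects the very path the original pattern was written for.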

User

#2 · edited 2024-05-23 19:15:10

Use the unichr() function to create a unicode character from the code point:

pattern = u"%s$" % unichr(codepoint)
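In Python 3, `unichr` no longer exists and `chr` does the same job. The sketch below is one way to write the idea there. It assumes the code point is held as an integer (`0x2014`), and it adds `re.escape`, which is not part of the answer above, so that the character stays literal even when the code point happens to be a regex metacharacter.

```python
import re

codepoint = 0x2014  # assumed to be an integer; U+2014 written in hex
test = "This string ends with \u2014"

# re.escape keeps the character literal even for code points such as
# 0x24 ("$") or 0x2E ("."), which are regex metacharacters.
pattern = re.escape(chr(codepoint)) + "$"

print(re.search(pattern, test) is not None)  # True: the real character matches
print(re.search(pattern, "u2014") is None)   # True: the text "u2014" does not

# If the value arrives as hex digits instead (for example the text "2014"),
# parse it as base 16 first; chr(2014) would be a different character.
hex_digits = "2014"
print(chr(int(hex_digits, 16)) == "\u2014")  # True
```

The base-16 step matters for the question's own `codepoint = 2014`: taken as a decimal integer it names U+07DE, not U+2014.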
