Python Unicode regular expressions

Asked 2025-04-15 13:05

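I'm using Python 2.4 and have run into some problems with Unicode regular expressions. I've tried to put together a clear, concise example that illustrates the problem. It looks as though Python is having trouble recognizing different character encodings, or else I've misunderstood the concept. Thanks a lot for taking a look!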

#!/usr/bin/python
#
# This is a simple python program designed to show my problems with regular expressions and character encoding in python
# Written by Brian J. Stinar
# Thanks for the help! 

import urllib # To get files off the Internet
import chardet # To identify character encodings
import re # Python Regular Expressions 
#import ponyguruma # Python Oniguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which RE's I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
print (chardet.detect(rawdata))
#print (rawdata)

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2') # Let's grab this as text
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8') # and encode the text as UTF-8
print(chardet.detect(UTF_8_encoded)) # Looks good

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data") # However, this works?!?
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

'''
# In addition, I tried this regular expression library, with much the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
'''

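I'm working on a substitution project and am having trouble with non-ASCII-encoded files. This problem is part of a larger project: eventually I want to replace the original text with other text (I already have this working for ASCII, but I can't yet identify occurrences in other encodings). Thanks again!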

http://brian-stinar.blogspot.com

-Brian J. Stinar-

Tags: regex, text-processing, text-replacement, unicode, character-encoding, encoding-issues, python-2.4, non-ascii

4 Answers

Your question is about regular expressions, but you may not need them at all; the standard string replace method can solve this directly.

import urllib
raw = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
decoded = raw.decode('iso-8859-2')
type(decoded)    # decoded is now <type 'unicode'>
substituted = decoded.replace(u'UNSUBSCRIBE', u'whatever you prefer')

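If nothing else, this shows how to deal with the encoding: decode to a unicode string, then work with that string. Note, however, that this approach is only suitable for one or a few substitutions (and only non-pattern-based ones), because each replace() call handles a single substitution.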

If you need to substitute several strings or patterns at once, you can do something like this:

import re
REPLACEMENTS = ((u'[aA]dobe', u'!twiddle!'),
                (u'UNS.*IBE', u'@wobble@'),
                (u'Dublin', u'Sydney'))

def replacer(m):
    return REPLACEMENTS[list(m.groups()).index(m.group(0))][1]

r = re.compile('|'.join('(%s)' % t[0] for t in REPLACEMENTS))
substituted = r.sub(replacer, decoded)
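
For anyone who wants to try the multi-pattern idea without fetching the page, here is a minimal self-contained sketch. It is written for Python 3 (an assumption; the question uses Python 2.4, where the same strings would carry a u prefix), the sample text is made up, and it picks the replacement via match.lastindex, which is a swapped-in alternative to the groups().index() lookup above rather than the answerer's own code.

```python
import re

# Hypothetical sample text; the real input would be the decoded page.
decoded = "Adobe Systems, Dublin\nClick UNSUBSCRIBE to opt out\nadobe again"

# (pattern, replacement) pairs, as in the answer above.
REPLACEMENTS = (
    ("[aA]dobe", "!twiddle!"),
    ("UNS.*IBE", "@wobble@"),
    ("Dublin", "Sydney"),
)

# One capturing group per pattern, joined into a single alternation.
combined = re.compile("|".join("(%s)" % pattern for pattern, _ in REPLACEMENTS))

def replacer(match):
    # lastindex is the 1-based number of the group that matched; this is
    # only valid because the patterns contain no groups of their own.
    return REPLACEMENTS[match.lastindex - 1][1]

substituted = combined.sub(replacer, decoded)
print(substituted)
```

Both lookups rely on each pattern being wrapped in exactly one top-level group, so a pattern with its own parentheses would need non-capturing groups, (?:...), to keep the numbering aligned.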

Answered 2025-04-15 by Python大师

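By default, . does not match a newline, so .* stops at the end of the line. Because match() anchors at the start of the string, .*UNSUBSCRIBE.* can only succeed if UNSUBSCRIBE is on the first line. UNSUBSCRIBE appears only once, after the first newline, whereas Adobe appears before the first newline. You can fix this with re.DOTALL.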
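
The newline behaviour can be reproduced on a two-line string. This is a small sketch for Python 3 (an assumption; the question targets Python 2.4), and the text variable is an invented stand-in for the downloaded page, not its real content.

```python
import re

# Hypothetical stand-in for the page: "Adobe" on the first line,
# "UNSUBSCRIBE" only after a newline.
text = "Copyright Adobe Systems\nTo stop receiving mail, UNSUBSCRIBE here\n"

plain = re.compile(".*UNSUBSCRIBE.*")
dotall = re.compile(".*UNSUBSCRIBE.*", re.DOTALL)

# match() anchors at position 0; without DOTALL, ".*" stops at the first "\n".
plain_result = plain.match(text)
dotall_result = dotall.match(text)
adobe_result = re.match(".*Adobe.*", text)

print(plain_result)                # no match: the word is on line two
print(dotall_result is not None)   # DOTALL lets "." cross the newline
print(adobe_result.group(0))       # only the first line is matched
```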

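However, you haven't checked what the Adobe pattern actually matched: the match is 1478 bytes wide! With re.DOTALL turned on, it (and the corresponding UNSUBSCRIBE pattern) would match the entire text!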

You should definitely drop the trailing .*; you don't need it, and it slows the match down. You should also drop the leading .* and use search() instead of match().
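
To make the difference in match width concrete, here is a rough comparison in Python 3 (an assumption, as is the made-up text): the greedy DOTALL pattern covers the whole string, while a bare pattern with search() reports only the word and its position.

```python
import re

# Hypothetical sample; the real page is much longer.
text = "Copyright Adobe Systems\nLine two\nTo stop receiving mail, UNSUBSCRIBE here\n"

# Greedy ".*" on both sides with DOTALL swallows the whole string.
greedy = re.compile(".*UNSUBSCRIBE.*", re.DOTALL).match(text)
greedy_width = greedy.end() - greedy.start()

# A bare pattern with search() matches just the word, wherever it is.
bare = re.compile("UNSUBSCRIBE").search(text)
bare_width = bare.end() - bare.start()

print(greedy_width == len(text))
print(bare.start(), bare_width)
```

For a later substitution step the bare pattern is usually the more useful one, since its span is exactly the text to be replaced.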

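The re.UNICODE flag doesn't help you here: it changes how classes such as \w and \b treat Unicode characters, not how . or literal text match. Check the manual to see what it does.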
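
A quick way to see what the flag does and does not affect, sketched for Python 3 (an assumption). There, str patterns are Unicode-aware by default, so re.ASCII is used to show the non-Unicode behaviour that Python 2 has when re.UNICODE is left off. The sample text is invented.

```python
import re

text = "caf\u00e9 UNSUBSCRIBE"

# \w is Unicode-aware by default for str patterns in Python 3 ...
unicode_words = re.findall(r"\w+", text)
# ... and limited to [a-zA-Z0-9_] under re.ASCII.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# A literal pattern finds the same span either way.
literal_unicode = re.search("UNSUBSCRIBE", text, re.UNICODE).span()
literal_ascii = re.search("UNSUBSCRIBE", text, re.ASCII).span()

print(unicode_words)
print(ascii_words)
print(literal_unicode == literal_ascii)
```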

Why convert the data to UTF-8 before searching? Just search the Unicode string directly.

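Someone else mentioned that, in general, before doing any serious work on your data you need to decode things like &#1234; (Ӓ) ... but didn't mention the &laquo; («) things that your data is peppered with :-)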
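
If the "things" here are HTML character references (an assumption on my part, since the page is HTML), they can be turned into real characters before matching. The sketch below uses html.unescape from the Python 3 standard library, which is not available in the asker's Python 2.4; the fragment is made up.

```python
import html

# Hypothetical fragment containing named entities and a numeric reference.
fragment = "&laquo;UNSUBSCRIBE&raquo; &#1234;"

# unescape() handles both named entities and numeric references.
unescaped = html.unescape(fragment)
print(unescaped)
```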

Answered 2025-04-15 by Python大师

You probably want to either enable the DOTALL flag or use the search method instead of match. For example:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

Or:

# search will find matches even if they aren't at the start of the string
... re_UNSUB_amsterdam.search(foo) ...

The two approaches give different results, but both should find a match. (See which one better fits your needs.)

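As an aside: you seem to be confusing encoded text (bytes) with decoded text (characters). That's not uncommon, especially in Python versions before 3.x. In particular, this looks suspicious: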

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

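You are decoding from ISO-8859-2 here, not encoding, so it would be appropriate to call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding, and a decoded string no longer has an encoding.)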

The rest of your code matches against the raw data and the UTF-8-encoded string (both of which are encoded strings), when it should be using the decoded Unicode string.
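
The bytes-versus-characters distinction is easier to see in Python 3, where the two are separate types (bytes and str). The sketch below assumes Python 3 and uses invented bytes in place of the urlopen(...).read() result.

```python
import re

# Hypothetical raw bytes standing in for the downloaded page;
# 0xB3 is the letter "ł" (U+0142) in ISO-8859-2.
rawdata = b"Z\xb3oty\nUNSUBSCRIBE\n"

decoded = rawdata.decode("iso-8859-2")   # bytes -> str (characters)
utf8_bytes = decoded.encode("utf-8")     # str -> bytes again, other encoding

# Search the decoded text, not either of the encoded forms.
match = re.search("UNSUBSCRIBE", decoded)

print(type(rawdata).__name__, type(decoded).__name__, type(utf8_bytes).__name__)
print(match.start())
```

In Python 3, applying a str pattern to rawdata or utf8_bytes raises a TypeError, which makes this kind of mix-up harder to miss than in Python 2.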

Answered 2025-04-15 by Python大师
