Python Unicode regular expressions

#!/usr/bin/python
#
# This is a simple python program designed to show my problems with regular expressions and character encoding in python
# Written by Brian J. Stinar
# Thanks for the help!

import urllib   # To get files off the Internet
import chardet  # To identify character encodings
import re       # Python Regular Expressions
#import ponyguruma  # Python Onyguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which RE's I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
print (chardet.detect(rawdata))
#print (rawdata)

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')   # Let's grab this as text
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8')  # and encode the text as UTF-8
print(chardet.detect(UTF_8_encoded))                # Looks good

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data")  # However, this works?!?
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

'''
# In addition, I tried this regular expression library much to the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
if new_re.match(UTF_8_encoded) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
else:
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

if new_re.match(rawdata) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
else:
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(UTF_8_encoded) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
else:
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(rawdata) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
else:
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
'''

3 answers

User

Answer 1 · edited 2024-05-23 20:17:32

This may help: http://www.daa.com.au/pipermail/pygtk/2009-July/017299.html

User

Answer 2 · edited 2024-05-23 20:17:32

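With the default flag settings, .* does not match newlines. UNSUBSCRIBE appears only once, after the first newline. Adobe appears before the first newline. You could fix that by using re.DOTALL.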
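The newline behaviour described above can be reproduced with a small sketch. It is written for Python 3 and uses a made-up two-line string in place of the real page, so the variable names and sample text are assumptions for illustration only.

```python
import re

# Hypothetical sample: "Adobe" is on the first line, "UNSUBSCRIBE" on the second.
sample = "Adobe Systems\nClick here to UNSUBSCRIBE\n"

# Without DOTALL, "." stops at the first newline, and match() only tries position 0.
adobe_default = re.match(".*Adobe.*", sample)
unsub_default = re.match(".*UNSUBSCRIBE.*", sample)

# With DOTALL, "." also matches newlines, so the second line becomes reachable.
unsub_dotall = re.match(".*UNSUBSCRIBE.*", sample, re.DOTALL)

print(adobe_default is not None)
print(unsub_default is not None)
print(unsub_dotall is not None)
```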

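However, you haven't inspected what the Adobe match actually returned: it is 1478 bytes wide! Turn on re.DOTALL and it (and the corresponding UNSUBSCRIBE pattern) will match the entire text!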

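You should definitely drop the trailing .* because you are not interested in what it matches and it slows the match down. You should also drop the leading .* and use search() instead of match().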
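To make the difference concrete, here is a minimal Python 3 sketch comparing the two approaches. The sample string is hypothetical and much shorter than the real page, so it only illustrates the spans, not the performance cost.

```python
import re

# Hypothetical sample text standing in for the downloaded page.
sample = "Adobe Systems\nClick here to UNSUBSCRIBE\nAmsterdam\n"

# match() with leading/trailing .* and DOTALL swallows the whole text.
greedy = re.match(".*UNSUBSCRIBE.*", sample, re.DOTALL)

# search() with the bare word finds just the word, wherever it occurs.
bare = re.search("UNSUBSCRIBE", sample)

print(greedy.span() == (0, len(sample)))
print(bare.group(), bare.span())
```

The bare pattern reports exactly where the word is, which is usually the more useful starting point for extending the expression to the surrounding address.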

The re.UNICODE flag is of no use to you in this case; read the manual to see what it does.
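As a rough illustration of what the flag governs, the sketch below uses Python 3, where Unicode-aware matching is already the default for text patterns and re.ASCII switches it off (in Python 2, re.UNICODE was the opt-in). The sample string is made up. The flag changes what classes such as \w match, and makes no difference to a literal pattern like UNSUBSCRIBE.

```python
import re

# Hypothetical text: a word with non-ASCII letters, then the literal we care about.
sample = "\u0141\u00f3d\u017a UNSUBSCRIBE"

words_unicode = re.findall(r"\w+", sample)           # Unicode-aware \w (Python 3 default)
words_ascii = re.findall(r"\w+", sample, re.ASCII)   # \w restricted to [a-zA-Z0-9_]

# A literal pattern matches the same span either way.
literal_unicode = re.search("UNSUBSCRIBE", sample, re.UNICODE)
literal_ascii = re.search("UNSUBSCRIBE", sample, re.ASCII)

print(words_unicode)
print(words_ascii)
print(literal_unicode.span() == literal_ascii.span())
```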

Why are you transcoding your data to UTF-8 and searching on that? Just leave it as Unicode.

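Someone else pointed out that, in general, you need to decode things like &#1234; before doing any serious work on your data ... but didn't mention the &laquo; and similar entities that appear in your data :-)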
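One possible way to follow both pieces of advice in Python 3 is sketched below: decode the bytes once, resolve character references with the standard library's html.unescape, and search the resulting text. The byte string here is a hypothetical stand-in for the downloaded page, and the variable names are not from the original script.

```python
import html
import re

# Hypothetical ISO-8859-2 bytes standing in for the downloaded page.
rawdata = "&laquo;Pr\u00e1vn\u00ed informace&raquo; &#1234; UNSUBSCRIBE".encode("iso-8859-2")

decoded = rawdata.decode("iso-8859-2")  # bytes -> text, done once
text = html.unescape(decoded)           # resolve &laquo;, &raquo;, &#1234;, ...

# Search the Unicode text directly; no re-encoding to UTF-8 is needed.
found = re.search("UNSUBSCRIBE", text)

print(text)
print(found is not None)
```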

User

Answer 3 · edited 2024-05-23 20:17:32

You may want to either enable the DOTALL flag or use the search method instead of the match method. That is:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

Or use search() instead of match() (the original snippet for this variant is missing from the source).

These will give you different results, but both should give you a match. (See which kind you want.)
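How the two results differ can be seen in a short Python 3 sketch. It is an assumption about what the two variants look like, not the answerer's original code, and the sample string is invented: the same pattern is used once with match() plus DOTALL and once with search() without it.

```python
import re

# Hypothetical sample text standing in for the downloaded page.
sample = "Adobe Systems\nClick here to UNSUBSCRIBE now\nAmsterdam\n"

pattern_dotall = re.compile(".*UNSUBSCRIBE.*", re.DOTALL)
pattern_plain = re.compile(".*UNSUBSCRIBE.*")

# match() + DOTALL: the match covers the entire text.
whole = pattern_dotall.match(sample).group()

# search() without DOTALL: the match covers only the line containing the word.
line = pattern_plain.search(sample).group()

print(whole == sample)
print(line)
```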

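By the way: you seem to be confusing encoded text (bytes) with decoded text (characters). This is not uncommon, especially in pre-3.x Python. In particular, this line is very suspicious: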

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You are de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding. A decoded string no longer has an encoding.)

The rest of the code tries to match on rawdata and on UTF_8_encoded (both encoded strings), when it should probably be using the decoded unicode string.
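The bytes/text distinction is easier to see in Python 3, where the two are separate types. The following sketch is an illustration under that assumption (the original script is Python 2), with a hypothetical byte string in place of the downloaded page and variable names chosen to say what each value is.

```python
import re

# Hypothetical bytes as they might arrive from the network (ISO-8859-2).
rawdata = "Pr\u00e1vn\u00ed informace\nUNSUBSCRIBE\n".encode("iso-8859-2")

decoded = rawdata.decode("iso-8859-2")  # text (str): no encoding attached any more
reencoded = decoded.encode("utf-8")     # bytes again, this time in UTF-8

print(type(rawdata).__name__, type(decoded).__name__, type(reencoded).__name__)

# Match on the decoded text, not on either byte string.
found = re.search("UNSUBSCRIBE", decoded)

# Python 3 refuses to apply a text pattern to bytes.
try:
    re.search("UNSUBSCRIBE", reencoded)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(found is not None, mixed_ok)
```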
