Answers to: Python Unicode regular expressions



I'm using Python 2.4 and have run into some problems with Unicode regular expressions. I've tried to boil my problem down to a very clear and concise example. Either there is something wrong with how Python recognizes different character encodings, or my understanding is off. Thanks very much for taking a look!

<pre><code>#!/usr/bin/python
#
# This is a simple python program designed to show my problems with regular expressions and character encoding in python
# Written by Brian J. Stinar
# Thanks for the help!

import urllib   # To get files off the Internet
import chardet  # To identify character encodings
import re       # Python Regular Expressions
#import ponyguruma # Python Onyguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which RE's I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
print (chardet.detect(rawdata))
#print (rawdata)

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')   # Let's grab this as text
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8')  # and encode the text as UTF-8
print(chardet.detect(UTF_8_encoded))                # Looks good

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data")  # However, this works?!?
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

'''
# In addition, I tried this regular expression library, with much the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
if new_re.match(UTF_8_encoded) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
else:
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

if new_re.match(rawdata) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
else:
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(UTF_8_encoded) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
else:
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(rawdata) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
else:
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
'''
</code></pre>

I'm working on a text-substitution project and am having trouble with files in non-ASCII encodings. This problem is part of a larger project: eventually I want to replace text with other text. I have this working for ASCII, but I can't yet work out what is happening with other encodings. Thanks again.

<a href="http://brian-stinar.blogspot.com" rel="nofollow noreferrer">http://brian-stinar.blogspot.com</a> -Brian J. Stinar-


1 answer

Anonymous, 1 day ago
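One likely culprit, offered as a guess rather than a confirmed diagnosis: `re.match` only matches at the very start of the string, and `.` does not match a newline by default, so `.*UNSUBSCRIBE.*` can only succeed if UNSUBSCRIBE happens to sit on the first line of the HTML. This may not explain everything in the question. The sketch below uses a made-up `html` string (an assumption standing in for the downloaded page) and shows `re.search` and `re.DOTALL` as two ways around that. It also matches against decoded unicode text rather than encoded bytes, which is the kind of input `re.UNICODE` is meant for.

```python
import re

# Hypothetical stand-in for the downloaded page, already decoded to unicode text.
# The keyword is deliberately NOT on the first line.
html = u"<html>\n<body>\nAdobe Systems\nUNSUBSCRIBE\n</body>\n</html>"

pattern = re.compile(u".*UNSUBSCRIBE.*", re.UNICODE)

# match() is anchored at position 0 and "." stops at "\n",
# so the pattern never gets past the first line.
anchored = pattern.match(html)

# search() scans the whole string, so the keyword is found wherever it is.
found = re.search(u"UNSUBSCRIBE", html)

# Alternatively, DOTALL lets "." cross newlines, so match() can succeed.
dotall = re.compile(u".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL).match(html)

print(anchored is None)
print(found is not None)
print(dotall is not None)
```

If this is the cause, `re.search` is probably the simpler fix when you only need to locate the keyword; `re.DOTALL` is more useful later, when the pattern has to span several lines such as a full physical address.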



