UnicodeDecodeError when stemming words in Python

Question

I'm at my wits' end.

I have a list containing a few thousand words.

x = ['company', 'arriving', 'wednesday', 'and', 'then', 'beach', 'how', 'are', 'you', 'any', 'warmer', 'there', 'enjoy', 'your', 'day', 'follow', 'back', 'please', 'everyone', 'go', 'watch', 's', 'new', 'video', 'you', 'know', 'the', 'deal', 'make', 'sure', 'to', 'subscribe', 'and', 'like', '<http>', 'you', 'said', 'next', 'week', 'you', 'will', 'be', 'the', 'one', 'picking', 'me', 'up', 'lol', 'hindi', 'na', 'tl', 'huehue', 'that', 'works', 'you', 'said', 'everyone', 'of', 'us', 'my', 'little', 'cousin', 'keeps', 'asking', 'if', 'i', 'wanna', 'play', 'and', "i'm", 'like', 'yes', 'but', 'with', 'my', 'pals', 'not', 'you', "you're", 'welcome', 'pas', 'quand', 'tu', 'es', 'vers', '<num>', 'i', 'never', 'get', 'good', 'mornng', 'texts', 'sad', 'sad', 'moment', 'i', 'think', 'ima', 'go', 'get', 'a', 'glass', 'of', 'milk', 'ahah', 'for', 'the', 'first', 'time', 'i', 'actually', 'know', 'what', 'their', 'doing', 'd', 'thank', 'you', 'happy', 'birthday', 'hope', "you're"...........]

I have verified that every element of this list is of type str.

types = []
for word in x:
    types.append(type(word))  # collect the type of every element
print set(types)

>>>set([<type 'str'>])
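
Note that in Python 2 a `str` is a byte string, so the type check above does not rule out non-ASCII bytes hiding inside some words. A minimal sketch of a stricter check follows; the helper name `has_non_ascii` and the sample words are hypothetical, and the words are assumed to be UTF-8 encoded byte strings.

```python
def has_non_ascii(raw_word):
    """Return True if the byte string contains any byte outside 0-127."""
    # bytearray yields integers on both Python 2 and Python 3.
    return any(byte > 127 for byte in bytearray(raw_word))


# Hypothetical sample: byte strings as a Python 2 list of str would hold them.
sample = [b"company", b"#sagittarius\xe2\x80\x9d", b"#fishing\xe2\x80\xa6"]

# Keep only the words that carry at least one non-ASCII byte.
suspects = [word for word in sample if has_non_ascii(word)]
print(suspects)
```

Running this over the full list would flag the candidate words before the stemmer ever sees them.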

Next, I try to stem each word with NLTK's Porter stemmer.

import nltk
porter = nltk.PorterStemmer()
stemmed_x = [porter.stem(word) for word in x]

This raises an error that clearly has to do with the stemming package and unicode:

File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 633, in stem
    stem = self.stem_word(word.lower(), 0, len(word) - 1)
  File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 591, in stem_word
    word = self._step1ab(word)
  File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 289, in _step1ab
    if word.endswith("ied"):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)

I have tried many things, such as reading the data with codecs.open and explicitly encoding each word as utf8, but I still get the same error.

Please advise.
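
One direction that may be worth checking: in Python 2, calling `.encode('utf8')` on a byte `str` first decodes it implicitly with the ASCII codec, which fails in the same way. Decoding the bytes to text is the opposite operation. The sketch below assumes the words are UTF-8 byte strings and that the stemmer works on text internally (not confirmed here); `to_text` and `stem_all` are hypothetical helpers, and `stem` stands for any callable such as `porter.stem`.

```python
def to_text(word, encoding="utf-8"):
    """Decode byte strings to text; leave text objects untouched."""
    # On Python 2, bytes is an alias of str, so this also catches plain str.
    if isinstance(word, bytes):
        return word.decode(encoding)
    return word


def stem_all(words, stem):
    """Apply the given stemming callable to each word after decoding it."""
    return [stem(to_text(word)) for word in words]


# Stand-in callable so the sketch runs without NLTK installed;
# with NLTK one would pass porter.stem instead (assumption).
print(stem_all([b"Beach", b"#fishing\xe2\x80\xa6"], lambda text: text.lower()))
```

If the input is not actually UTF-8, `decode` raises its own UnicodeDecodeError, which at least points at the offending word and encoding.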

Edit:

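I should mention that this code runs fine on my computer running Ubuntu. I recently switched to a MacBook Pro, and now I get this error. I have checked the terminal settings on the Mac, and the encoding is set to utf8.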
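
Since the behavior differs between the two machines, it may help to compare what the interpreter itself reports on each of them, because the terminal setting is only one of several encodings involved. A small sketch, using only standard-library calls; the function name `encoding_report` is hypothetical.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding settings that commonly differ between machines."""
    return {
        "default_encoding": sys.getdefaultencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


# Print one setting per line so the output is easy to diff across machines.
for name, value in sorted(encoding_report().items()):
    print("%s: %s" % (name, value))
```

Comparing the installed NLTK version on both machines (the traceback shows 3.0.0b2 on the Mac) might be informative as well.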

Edit 2:

Interestingly, with the following code I tracked down the problem words:

for w in x:
    try:
        porter.stem(w)
    except UnicodeDecodeError:
        print w 

#sagittarius”
#instadane…
#bleedblue”
#pr챕cieux
#على_شرفة_الماضي
#exploringsf…
#fishing…
#sindhubestfriend…
#الإستعداد_لإنهيار_ال_سعود
#jaredpreslar…
#femalepains”
#gobillings”
#juicing…
#instamood…

What they seem to have in common is extra punctuation at the end of the word, with the exception of the word #pr챕cieux.
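
The trailing ” and … are not ASCII: in UTF-8 both start with the byte 0xe2, which would match the byte named in the traceback. The sketch below locates the first non-ASCII byte in one of the words and strips trailing punctuation after decoding. It assumes the words are UTF-8 byte strings; `first_non_ascii` and `strip_trailing_punctuation` are hypothetical helpers.

```python
import unicodedata


def first_non_ascii(raw_word):
    """Return (position, byte value) of the first non-ASCII byte, or None."""
    for position, byte in enumerate(bytearray(raw_word)):
        if byte > 127:
            return position, byte
    return None


def strip_trailing_punctuation(text):
    """Drop trailing characters whose Unicode category is punctuation (P*)."""
    end = len(text)
    while end > 0 and unicodedata.category(text[end - 1]).startswith("P"):
        end -= 1
    return text[:end]


raw = b"#sagittarius\xe2\x80\x9d"  # "#sagittarius" followed by U+201D
position, byte = first_non_ascii(raw)
cleaned = strip_trailing_punctuation(raw.decode("utf-8"))
print(position, hex(byte), cleaned)
```

Only trailing punctuation is removed, so the leading # survives. Words in other scripts, such as the Arabic hashtags, would still need to be decoded rather than stripped.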

error-handling text-processing unicode nlp encoding nltk stemming punctuation
