<p>The <code>\uFEFF</code> character is most likely part of the content read from the file; I doubt it was inserted by the tokenizer. A <code>\uFEFF</code> at the start of a file is a deprecated form of the <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark" rel="noreferrer">Byte Order Mark</a>. If it appears anywhere else, it is treated as a <a href="http://en.wikipedia.org/wiki/Zero-width_non-breaking_space" rel="noreferrer">zero width non-break space</a>.</p>
<p>Was the file written by Microsoft Notepad? From <a href="http://docs.python.org/library/codecs.html#encodings-and-unicode" rel="noreferrer">the codecs module docs</a>:</p>
<blockquote>
<p>To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.</p>
</blockquote>
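<p>To make the quoted byte sequence concrete, here is a minimal Python 3 sketch (the sample word <code>müsli</code> is only an illustration). It builds the bytes a BOM-prefixed file would contain and decodes them both ways; it assumes nothing beyond the standard <code>codecs</code> module.</p>

```python
import codecs

# The UTF-8 encoded BOM is the three bytes 0xEF 0xBB 0xBF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Simulate the bytes of a file saved with a leading BOM.
data = codecs.BOM_UTF8 + "müsli".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
plain = data.decode("utf-8")

# "utf-8-sig" drops a leading BOM if one is present.
stripped = data.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

<p>The only difference between the two results should be the leading <code>\ufeff</code> character.</p>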
<p>Try reading the file with <a href="http://docs.python.org/library/codecs.html#codecs.open" rel="noreferrer"><code>codecs.open()</code></a>. Note the <code>"utf-8-sig"</code> encoding, which consumes the BOM.</p>
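<pre><code>import codecs
import nltk

# Raw string so the backslashes (e.g. "\t" in "\text.txt") are not treated as escapes.
f = codecs.open(r'C:\Python26\text.txt', 'r', 'utf-8-sig')
text = f.read()
f.close()
a = nltk.word_tokenize(text)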
</code></pre>
<p>Experiment:</p>
<pre><code>>>> open("x.txt", "r").read().decode("utf-8")
u'\ufeffm\xfcsli'
>>> import codecs
>>> codecs.open("x.txt", "r", "utf-8-sig").read()
u'm\xfcsli'
>>>
</code></pre>
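<p>The session above is Python 2. As a rough Python 3 equivalent, the built-in <code>open()</code> accepts the same <code>"utf-8-sig"</code> encoding name. The sketch below is self-contained: it writes a temporary file with a BOM first, and <code>read_text</code> is a hypothetical helper name introduced only for this illustration.</p>

```python
import os
import tempfile


def read_text(path, encoding="utf-8-sig"):
    """Hypothetical helper: read a text file, dropping a leading UTF-8 BOM by default."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Create a sample file with a BOM followed by UTF-8 text.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfm\xc3\xbcsli")

try:
    with_bom = read_text(path, encoding="utf-8")  # BOM kept as U+FEFF
    without_bom = read_text(path)                 # BOM consumed by utf-8-sig
finally:
    os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

<p>Reading with <code>"utf-8-sig"</code> should also be safe for files that have no BOM, since the codec only skips the mark when it is actually present.</p>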