Apostrophe turns into \x92

15 votes

1 answer

30780 views

Asked 2025-04-17 19:58

mycorpus.txt

Human where's machine interface for lab abc computer applications   
A where's survey of user opinion of computer system response time

stopwords.txt

let's
ain't
there's

The following code

corpus = set()
for line in open("path\\to\\mycorpus.txt"):
    corpus.update(set(line.lower().split()))
print corpus

stoplist = set()
for line in open("C:\\Users\\Pankaj\\Desktop\\BTP\\stopwords_new.txt"):
    stoplist.add(line.lower().strip())
print stoplist

produces the following output

set(['a', "where's", 'abc', 'for', 'of', 'system', 'lab', 'machine', 'applications', 'computer', 'survey', 'user', 'human', 'time', 'interface', 'opinion', 'response'])
set(['let\x92s', 'ain\x92t', 'there\x92s'])

Why does the apostrophe turn into \x92 in the second set?

Tags: text processing, character encoding, natural language processing, text preprocessing

1 answer

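In the windows-1252 encoding, the byte 92 (hex) maps to Unicode code point 2019 (hex), which is RIGHT SINGLE QUOTATION MARK. It looks very much like an apostrophe, and it is probably the character you actually have in your stopwords.txt file. Judging by how Python displays it, I am guessing the file was saved as windows-1252, or in some other encoding that agrees with windows-1252 on the ASCII range and on the byte value for ’. Since the code reads the file as raw bytes and never decodes it, the set shows the undecoded byte \x92 rather than the character.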
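To make the mapping concrete, here is a minimal sketch of the decoding step. It is written for Python 3 (the question's code is Python 2), and the sample bytes are an assumption based on the output shown in the question, not read from the asker's actual file.

```python
import unicodedata

# Bytes as they would appear in a windows-1252 encoded file (assumed sample).
raw = b"let\x92s"

# Decoding with cp1252 (Python's name for windows-1252) turns 0x92 into U+2019.
text = raw.decode("cp1252")
quote_char = text[3]

print(hex(ord(quote_char)))          # code point of the decoded character
print(unicodedata.name(quote_char))  # RIGHT SINGLE QUOTATION MARK

# The ASCII apostrophe is a different character with a different code point.
print(quote_char == "'")             # False
```

Decoding explicitly is what turns the opaque byte into a real character that can then be compared or replaced.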

Compare ' (ASCII apostrophe) with ’ (right single quotation mark).
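One possible way to make the stop words match the corpus tokens is to decode the file explicitly and replace the curly quote with a plain apostrophe while loading. The sketch below assumes the file really is windows-1252; `load_stoplist` is a hypothetical helper, and the temporary file only stands in for the real stopwords.txt.

```python
import io
import os
import tempfile


def load_stoplist(path, encoding="cp1252"):
    """Hypothetical helper: read one stop word per line, decoding explicitly."""
    stoplist = set()
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            # Map the right single quotation mark (U+2019) to an ASCII apostrophe.
            word = line.strip().lower().replace(u"\u2019", u"'")
            if word:
                stoplist.add(word)
    return stoplist


# Demo with a temporary file containing the bytes seen in the question's output.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"let\x92s\r\nain\x92t\r\nthere\x92s\r\n")

demo_stoplist = load_stoplist(demo_path)
os.remove(demo_path)
print(sorted(demo_stoplist))
```

If the encoding guess is wrong, the decoded text will look wrong or decoding will fail, so it is worth checking a few entries by eye before relying on the result.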

Answered 2025-04-17 by Python大师
