Finding a corrupted file in a corpus (Python)

import nltk

# I added the encoding when I saw some answer about that, but it doesn't seem to help
corpus = nltk.corpus.TaggedCorpusReader("filepath", '.*.txt', encoding='utf-8')
words = corpus.words()
for w in words:
    print(w)

1 Answer

User

#1 · Posted 2024-06-09 09:59:50

You can identify the file by reading the corpus one file at a time, like this:

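import nltk

corpus = nltk.corpus.TaggedCorpusReader("filepath", r'.*\.txt', encoding='utf-8')

for filename in corpus.fileids():
    try:
        # Build the full word list so the whole file is actually decoded;
        # the reader may otherwise load it lazily.
        words_ = list(corpus.words(filename))
    except UnicodeDecodeError:
        # Report the file and keep going, so every bad file gets listed.
        print("UnicodeDecodeError in", filename)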

(Or you could print each filename before reading it and not even bother catching the error.)
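If you would rather check the files without going through NLTK at all, here is a minimal sketch using only the standard library. It assumes the corpus is a flat directory of `.txt` files; the function name `find_undecodable_files` and its parameters are made up for illustration. It decodes the raw bytes of each file strictly and records the offset of the first byte that fails, which also tells you where in the file to look.

```python
from pathlib import Path


def find_undecodable_files(root, pattern="*.txt", encoding="utf-8"):
    """Return a list of (path, byte_offset) for files that fail to decode.

    Hypothetical helper: `root` is the corpus directory and `pattern`
    a glob for the corpus files.
    """
    failures = []
    for path in sorted(Path(root).glob(pattern)):
        data = path.read_bytes()
        try:
            # Strict decoding raises on the first invalid byte sequence.
            data.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the index of the first byte that could not be decoded.
            failures.append((path, err.start))
    return failures
```

Calling `find_undecodable_files("filepath")` should then give you every offending file in one pass, not just the first one.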

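Once you have found the file, you need to figure out the source of the problem. Is your corpus really UTF-8 encoded? Perhaps it uses a single-byte (8-bit) encoding instead, such as Latin-1 or something similar. Specifying an 8-bit encoding will not give you an error (those formats have no error checking), but you can ask Python to print some lines and see whether the encoding you chose is the right one.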
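As a rough sketch of that check, the hypothetical helper below (`preview_lines` is not part of NLTK or the standard library) returns the first few lines of a file decoded with whatever encoding you want to try. Latin-1 maps every byte to a character, so it never raises; you have to judge by eye whether the accented characters come out sensibly.

```python
from itertools import islice


def preview_lines(path, encoding, limit=5):
    """Return the first `limit` lines of `path`, decoded with `encoding`."""
    with open(path, encoding=encoding) as handle:
        # islice stops after `limit` lines, so large files are not fully read.
        return [line.rstrip("\n") for line in islice(handle, limit)]
```

For example, compare `preview_lines(path, "latin-1")` with `preview_lines(path, "cp1252")` on the file that failed, and see which one gives readable text.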

If your corpus is almost entirely English, you can search the file for lines that contain non-ASCII characters and print only those.
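A minimal sketch of that search, assuming Python 3.7 or later (for `str.isascii`). The helper name `non_ascii_lines` is made up for illustration, and Latin-1 is used only because it can decode any byte without raising; swap in whichever encoding you are testing.

```python
def non_ascii_lines(path, encoding="latin-1"):
    """Yield (line_number, text) for each line with a non-ASCII character."""
    with open(path, encoding=encoding) as handle:
        for number, line in enumerate(handle, start=1):
            # str.isascii() is False as soon as any character is above U+007F.
            if not line.isascii():
                yield number, line.rstrip("\n")
```

You can then loop over the result and print it, for example `for number, text in non_ascii_lines(path): print(number, text)`, and check whether the printed characters look right.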
