Unable to load english.pickle with nltk.data.load

189 votes
18 answers
228528 views
Asked 2025-04-16 11:02

When you try to load the punkt tokenizer...

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

...a LookupError is raised:

> LookupError: 
>     *********************************************************************   
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************
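
Each of the directories above comes from nltk.data.path. If the punkt data already exists somewhere else on disk, one workaround is to append that location before loading; a minimal sketch, where the path below is a placeholder for wherever your nltk_data actually lives:

import nltk.data

# Placeholder path: replace with the directory that actually contains nltk_data
nltk.data.path.append('D:/my_nltk_data')

tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')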

18 Answers

28

Here is what just worked for me:

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))
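
(The explicit loop above is equivalent to the list comprehension sentences_tokenized = [word_tokenize(s) for s in sentences].)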

sentences_tokenized is a list of token lists, one list per sentence:

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

These sentences come from the examples in Mining the Social Web, 2nd Edition; the relevant material is in the book's accompanying IPython Notebook.
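
Since the original error concerns english.pickle, the punkt sentence splitter, it is worth noting that the same nltk.download('punkt') also enables sent_tokenize, which replaces loading the pickle by hand. A minimal sketch:

from nltk.tokenize import sent_tokenize

text = ("Mr. Green killed Colonel Mustard in the study with the candlestick. "
        "Mr. Green is not a very nice fellow.")

# punkt knows abbreviations such as "Mr." and does not split sentences on them
print(sent_tokenize(text))
# ['Mr. Green killed Colonel Mustard in the study with the candlestick.',
#  'Mr. Green is not a very nice fellow.']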

126

The main reason you see this error is that nltk could not find the punkt package. The NLTK suite is large, and not every package is downloaded by default when you install it.

You can download the punkt package like this:

import nltk
nltk.download('punkt')

from nltk import word_tokenize, sent_tokenize
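
Once punkt is in place, both helpers work. A quick sanity check (the sample text is my own):

text = "NLTK ships many resources. Most are downloaded on demand."
print(sent_tokenize(text))
# ['NLTK ships many resources.', 'Most are downloaded on demand.']
print(word_tokenize(text))
# ['NLTK', 'ships', 'many', 'resources', '.', 'Most', 'are', 'downloaded', 'on', 'demand', '.']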

In recent versions, the error message itself recommends the same fix:

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/nltk_data'
    - '/usr/lib/nltk_data'
    - ''
**********************************************************************

If you call the download function with no arguments, it opens an interactive downloader instead. To fetch every package in one go (the chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, and tokenizers collections), pass 'all':

nltk.download('all')

The function above saves the packages to a specific directory. You can find that directory's location here: https://github.com/nltk/nltk/blob/67ad86524d42a3a86b1f5983868fd2990b59f1ba/nltk/downloader.py#L1051
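
If you need the data somewhere other than the default location, for example on a shared server or in a container image, download also accepts an explicit target directory. A sketch with a placeholder path; NLTK must also be told to search that path, either via nltk.data.path or the NLTK_DATA environment variable:

import nltk

# Placeholder directory: pick any writable path
nltk.download('punkt', download_dir='/opt/nltk_data')

# Make sure the same path is on NLTK's search list
nltk.data.path.append('/opt/nltk_data')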

308

I had the same problem. Open a Python shell and type:

>>> import nltk
>>> nltk.download()

An installation window will then appear. Go to the 'Models' tab and select 'punkt' in the 'Identifier' column. Then click Download, and it will install the necessary files. After that, it should work!
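
If you are on a headless machine where the Tkinter window cannot open, the same downloader can be driven from the command line instead:

python -m nltk.downloader punkt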
