LookupError: Resource 'corpora/stopwords' not found

4 votes

2 answers

9695 views

Asked 2025-04-18 09:02

I am trying to run a Flask web app on Heroku. The app is written in Python and uses NLTK (the Natural Language Toolkit).

One of the files starts like this:

import nltk, json, operator
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer

When a page that runs the stop-word code is requested, the following error is raised:

LookupError: 
**********************************************************************
  Resource 'corpora/stopwords' not found.  Please use the NLTK  
  Downloader to obtain the resource:  >>> nltk.download()  
  Searched in:  
    - '/app/nltk_data'  
    - '/usr/share/nltk_data'  
    - '/usr/local/share/nltk_data'  
    - '/usr/lib/nltk_data'  
    - '/usr/local/lib/nltk_data'  
**********************************************************************

The exact code I am using is:

#remove punctuation  
toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True) 
data = toker.tokenize(data)  

#remove stop words and digits 
stopword = stopwords.words('english')  
data = [w for w in data if w not in stopword and not w.isdigit()]

If I comment out the line stopword = stopwords.words('english') in the app on Heroku, the LookupError no longer occurs.
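
The "Searched in:" list in the traceback suggests that the corpus is simply absent from every directory NLTK checks on the dyno. A minimal sketch of how that could be confirmed, using only the standard library. The helper name `find_corpus_dirs` is hypothetical, the directory list is copied from the error message, and treating a `.zip` archive as a hit is an assumption about how the data may be stored:

```python
import os

# Directories copied from the "Searched in:" part of the traceback.
SEARCH_DIRS = [
    "/app/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]


def find_corpus_dirs(resource, search_dirs):
    """Return the search directories that contain `resource`.

    `resource` is a relative name such as "corpora/stopwords". Both an
    unpacked directory and a downloaded .zip archive count as a hit.
    """
    hits = []
    for base in search_dirs:
        unpacked = os.path.join(base, *resource.split("/"))
        if os.path.exists(unpacked) or os.path.exists(unpacked + ".zip"):
            hits.append(base)
    return hits


if __name__ == "__main__":
    found = find_corpus_dirs("corpora/stopwords", SEARCH_DIRS)
    print("stopwords found in:", found or "none of the searched directories")
```

Running this on the local machine and again on Heroku (for example in a one-off dyno) should show whether the data exists in one environment but not the other.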

On my local machine the code runs without any problems. I installed the required libraries on my computer with:

pip install -r requirements.txt

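The virtual environment provided by Heroku was active while I tested the code.

I also tried installing NLTK from two different sources, but the LookupError persists. The two sources were: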
http://pypi.python.org/packages/source/n/nltk/nltk-2.0.1rc4.zip
https://github.com/nltk/nltk.git

Tags: error-handling, debugging, natural-language-processing, virtualenv, nltk, flask, heroku, stopwords

2 Answers

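Update

As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku-python-buildpack. Just add an nltk.txt file to your project's root directory and list your corpora in it. See https://devcenter.heroku.com/articles/python-nltk for details.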
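
For the code in the question, that file would presumably need a single entry, `stopwords`. A minimal sketch that generates it, assuming the one-identifier-per-line format described in the linked Dev Center article. The helper `write_nltk_txt` is hypothetical and is not part of NLTK or the buildpack:

```python
from pathlib import Path


def write_nltk_txt(identifiers, project_root="."):
    """Write one NLTK resource identifier per line to <project_root>/nltk.txt."""
    target = Path(project_root) / "nltk.txt"
    target.write_text("".join(f"{name}\n" for name in identifiers), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Only the stopwords corpus is needed by the code in the question.
    print(write_nltk_txt(["stopwords"]).read_text(encoding="utf-8"), end="")
```

Creating the file by hand in a text editor works just as well; the script only makes the expected layout explicit.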

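Original answer

Here is a cleaner solution that installs the NLTK data directly on Heroku, without adding it to your git repository.

I used similar steps to install Textblob, which depends on NLTK, on Heroku. I made some minor adjustments to my original code in steps 3 and 4 that should work for an NLTK-only installation.

The default Heroku buildpack includes a post_compile step that runs after all of the default build steps have completed: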

# post_compile
#!/usr/bin/env bash

if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi

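As you can see, it looks in your project's bin directory for your own post_compile file and runs it if it exists. You can use this hook to install the NLTK data.

1. Create a bin directory in the root of your local project.

2. Add your own post_compile file to the bin directory.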

#!/usr/bin/env bash
# bin/post_compile

if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi

echo "-----> Post-compile done"

3. Add your own install_nltk_data file to the bin directory.

#!/usr/bin/env bash
# bin/install_nltk_data

source "$BIN_DIR/utils"

echo "-----> Starting nltk data installation"

# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'

# Install the nltk data
# NOTE: The following command installs only the stopwords corpus,
# so change it to suit your specific needs.
# See http://www.nltk.org/data.html
python -m nltk.downloader stopwords

# If using Textblob, use this instead:
# python -m textblob.download_corpora lite

# Change into the NLTK_DATA directory; stop if that fails so the
# cleanup below never runs in the wrong place
cd "${NLTK_DATA}" || exit 1

# Delete all of the zip files
find . -name "*.zip" -type f -delete

echo "-----> Finished nltk data installation"

4. Add nltk to your requirements.txt file (or textblob, if you are using Textblob).
5. Commit all of these changes to your repository.
6. Set the NLTK_DATA environment variable on your Heroku app.
```
$ heroku config:set NLTK_DATA='/app/nltk_data'
```
7. Deploy to Heroku. You will see the post_compile step run at the end of the deployment, followed by the NLTK download.

Hope you find this helpful. Good luck!

Answered 2025-04-18 by Python大师


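The problem is that the corpus ("stopwords" in this case) was never uploaded to Heroku. Your code works on your local machine because the NLTK corpus is already installed there. Follow these steps to fix it.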

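1. Create a new folder in your project (call it 'nltk_data').
2. Download the NLTK corpus into that folder. You will need to point the downloader at this folder when you download.
3. Tell nltk to look in that path. Just add nltk.data.path.append('path_to_nltk_data') to the Python file that actually uses nltk.
4. Push the application to Heroku.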
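
The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the answerer's exact code: the folder is taken to sit next to the module that uses NLTK, the helper names `local_nltk_data_dir` and `use_local_nltk_data` are hypothetical, and NLTK is imported lazily inside the second helper. The `download_missing=True` branch is meant to be run once on the local machine so the folder can be committed with the project.

```python
import os


def local_nltk_data_dir(module_file):
    """Return the absolute path of an 'nltk_data' folder beside `module_file`."""
    return os.path.join(os.path.dirname(os.path.abspath(module_file)), "nltk_data")


def use_local_nltk_data(module_file, download_missing=False):
    """Point NLTK at the project-local folder and return English stop words."""
    import nltk  # imported here so the path helper works without NLTK installed
    from nltk.corpus import stopwords

    data_dir = local_nltk_data_dir(module_file)
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)
    if download_missing:
        # Run once locally, then commit the folder to the repository.
        nltk.download("stopwords", download_dir=data_dir)
    return stopwords.words("english")


if __name__ == "__main__":
    words = use_local_nltk_data(__file__, download_missing=True)
    print(len(words), "stop words loaded")
```

Building the path from the module's own location, rather than hard-coding it, keeps the same code working both locally and under /app on Heroku.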

Hope this solves the problem. It worked for me!

Answered 2025-04-18 by Python大师
