LookupError: Resource 'corpora/stopwords' not found

LookupError:
**********************************************************************
  Resource 'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/app/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# remove punctuation
toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
data = toker.tokenize(data)

# remove stop words and digits
stopword = stopwords.words('english')
data = [w for w in data if w not in stopword and not w.isdigit()]

2 Answers

User

#1 · edited 2024-06-11 19:56:51

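The problem is that the corpus ('stopwords' in this case) was never uploaded to Heroku. Your code works on your local machine because the NLTK corpus is already installed there. Follow these steps to fix it.

Create a new directory in your project (let's call it 'nltk_data').
Download the NLTK corpus into that directory. Note that you have to configure the download location during the download.
Tell nltk to look in this specific path. Simply add nltk.data.path.append('path_to_nltk_data') to the Python file that actually uses nltk.
Now push the application to Heroku.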
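
The steps above could be sketched in Python roughly as follows. This is a minimal sketch, not the answerer's exact code: the helper names (project_nltk_data_dir, download_stopwords, register_nltk_data_dir) are hypothetical, and it assumes the 'nltk_data' directory sits next to the file that uses nltk.

```python
import os


def project_nltk_data_dir(anchor_file):
    """Return the 'nltk_data' directory located next to anchor_file."""
    return os.path.join(os.path.dirname(os.path.abspath(anchor_file)), "nltk_data")


def download_stopwords(target_dir):
    """Run once locally to fetch the corpus into the project directory."""
    import nltk  # imported lazily so this module loads even without nltk

    nltk.download("stopwords", download_dir=target_dir)


def register_nltk_data_dir(target_dir):
    """Call at startup, before any corpus is loaded."""
    import nltk

    if target_dir not in nltk.data.path:
        nltk.data.path.append(target_dir)
```

In the module that uses nltk, calling register_nltk_data_dir(project_nltk_data_dir(__file__)) before stopwords.words('english') should make the committed corpus visible on Heroku.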

Hope that solves the problem. It worked for me!

User

#2 · edited 2024-06-11 19:56:51

Update

As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku python buildpack. Add an nltk.txt file to your root directory and list your corpora in it. See https://devcenter.heroku.com/articles/python-nltk for details.
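
For the corpus in this question, the nltk.txt file would presumably hold a single line. This is a sketch that assumes the one-identifier-per-line format described in the linked Heroku article; check the article for the current details.

```
stopwords
```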

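Original answer

Here is a cleaner solution that lets you install the NLTK data directly on Heroku without adding it to your git repo.

I used similar steps to install Textblob on Heroku, which uses NLTK as a dependency. In steps 3 and 4 (the install_nltk_data script and the requirements.txt entry) I made some minor adjustments to my original code so that it should work for an NLTK-only installation.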

The default heroku buildpack includes a post_compile step that runs after all of the default build steps have completed:

# post_compile
#!/usr/bin/env bash

if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi

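As you can see, it looks in your project directory for your own post_compile file (in the bin directory) and runs it if it exists. You can use this hook to install the nltk data.

Add your own post_compile file to the bin directory.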

# bin/post_compile
#!/usr/bin/env bash

if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi

echo "-----> Post-compile done"

Add your own install_nltk_data file to the bin directory.

# bin/install_nltk_data
#!/usr/bin/env bash

source $BIN_DIR/utils

echo "-----> Starting nltk data installation"

# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'

# Install the nltk data
# NOTE: The following command installs the stopwords corpora, 
# so you may want to change for your specific needs.  
# See http://www.nltk.org/data.html
python -m nltk.downloader stopwords

# If using Textblob, use this instead:
# python -m textblob.download_corpora lite

# Open the NLTK_DATA directory
cd "${NLTK_DATA:?NLTK_DATA must be set}"

# Delete all of the zip files
find . -name "*.zip" -type f -delete

echo "-----> Finished nltk data installation"

Add nltk to your requirements.txt file (or textblob if you are using Textblob).
Commit all of these changes to your repo.
Set the NLTK_DATA environment variable on your heroku app.
```
$ heroku config:set NLTK_DATA='/app/nltk_data'
```
Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.

I hope you found this helpful! Enjoy!
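
To confirm that the download worked, a small check like the one below can be run on the dyno (for example from a one-off heroku run python session). It is a sketch under assumptions: missing_resources is a hypothetical helper, and it relies on nltk.data.find raising LookupError when a resource cannot be located, which is the same error shown in the question.

```python
def missing_resources(resource_names, find=None):
    """Return the subset of resource_names that NLTK cannot locate."""
    if find is None:
        import nltk  # imported lazily so the helper can be tested without nltk

        find = nltk.data.find
    missing = []
    for name in resource_names:
        try:
            find(name)
        except LookupError:
            missing.append(name)
    return missing
```

An empty list from missing_resources(["corpora/stopwords"]) suggests the corpus is visible on one of the paths NLTK searches.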
