Can NLTK's XMLCorpusReader be used on a multi-file corpus?

4 votes

3 answers

3847 views

Asked 2025-04-16 22:19

I'm trying to use NLTK to work with the New York Times Annotated Corpus, which contains one XML file per article (in the News Industry Text Format, NITF).

I can parse an individual document without any problem, like so:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus, though. I tried this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

But this doesn't create a usable reader object. For example,

len(reader.words())

fails with

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK, so any help is greatly appreciated.

Tags: text processing, data parsing, natural language processing, machine learning, nltk, corpus, xmlcorpusreader, NITF format

3 Answers

Yes, you can specify multiple files. (Source: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.xmldocs.XMLCorpusReader-class.html)

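The problem here, I suspect, is that all of your files sit in a directory structure along the lines of corpora/nytimes/year/month/date. XMLCorpusReader does not automatically descend into subfolders to look for files. That is, with your code above, XMLCorpusReader('corpora/nytimes', r'.*'), XMLCorpusReader only sees the xml files in corpora/nytimes/ itself (in practice none, because it contains only folders) and does not look in the subfolders beneath corpora/nytimes. Also, you probably meant to restrict the second argument to XML files. Since that argument is a regular expression rather than a shell glob, the pattern would be r'.*\.xml' rather than *.xml.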
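To illustrate the point about the second argument: the r'.*' in the question suggests it is treated as a regular expression, so a shell-style glob does not carry over directly. Below is a minimal check using only Python's re module. The file names are made up for the example, and the sketch says nothing about which folders NLTK actually scans.

```python
import re

# Hypothetical file ids, written relative to a corpus root.
candidates = ["1987/01/01/0000000.xml", "1987/01/01/notes.txt"]

# A regular expression that accepts any path ending in ".xml".
pattern = r".*\.xml"
matched = [name for name in candidates if re.fullmatch(pattern, name)]
print(matched)

# A shell-style glob such as "*.xml" is not a valid regular expression:
# the leading "*" has nothing to repeat, so compiling it fails.
try:
    re.compile("*.xml")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False
print(glob_is_valid_regex)
```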

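I suggest walking the folders yourself and building absolute paths (the documentation above says the fileids argument can take explicit paths), or, if you have a list of year/month/date combinations, making use of that.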
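Walking the folders could look roughly like the sketch below. It uses os.walk from the standard library, and the helper name collect_xml_fileids is invented for this example. As an assumption, it returns paths relative to the corpus root with forward slashes rather than absolute paths, on the expectation that the reader joins each file id onto its root.

```python
import os


def collect_xml_fileids(root):
    """Return every .xml file under root as a path relative to root."""
    fileids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".xml"):
                full_path = os.path.join(dirpath, name)
                rel_path = os.path.relpath(full_path, root)
                # Normalise separators so the ids look the same on every OS.
                fileids.append(rel_path.replace(os.sep, "/"))
    return sorted(fileids)
```

The resulting list can then be handed to the reader as its second argument, for example XMLCorpusReader(root, collect_xml_fileids(root)).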

Answered 2025-04-16 by Python大师

Here is the solution based on the comments from machine yearning and waffle paradox. First, use glob to build a list of articles, then pass that list to XMLCorpusReader:

import os
from glob import glob
from nltk.corpus.reader import XMLCorpusReader

root = 'nltk_data/corpora/nytimes_test'
years = glob(root + '/*')
year_months = []
for year in years:
    year_months += glob(year + '/*')
    print(year_months)
days = []
for year_month in year_months:
    days += glob(year_month + '/*')
articles = []
for day in days:
    articles += glob(day + '/*.xml')
# File ids are resolved relative to the corpus root, so strip the root prefix.
file_ids = []
for article in articles:
    file_ids.append(os.path.relpath(article, root).replace(os.sep, '/'))
# Pass the relative file ids (not the full paths) to the reader.
reader = XMLCorpusReader(root, file_ids)
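One thing to watch once the reader covers many files: the error message in the question ("Expected a single file identifier string") suggests that words() wants exactly one file id, so calling it with no argument on a multi-file reader is likely to fail the same way. A possible workaround is to loop over the file ids and query them one at a time. The helper names count_words and build_reader below are made up for this sketch, and it assumes the reader offers fileids() and words(fileid), as XMLCorpusReader does in the NLTK versions I am aware of.

```python
def count_words(reader):
    """Sum the word counts of every file the reader knows about.

    Assumes `reader` provides fileids() and words(fileid).
    """
    return sum(len(reader.words(fileid)) for fileid in reader.fileids())


def build_reader(root, file_ids):
    """Create an XMLCorpusReader for the given root and list of file ids."""
    # Imported here so count_words can be tried out without NLTK installed.
    from nltk.corpus.reader import XMLCorpusReader
    return XMLCorpusReader(root, file_ids)
```

With a reader built from a list of file ids, count_words(reader) would then stand in for len(reader.words()).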

Answered 2025-04-16 by Python大师

I'm not an NLTK expert, so there may be an easier way to do this, but I would suggest using Python's glob module. It supports Unix-style pathname pattern expansion.

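from glob import glob
# Articles sit under year/month/day, as in the question's example path.
texts = glob('nltk_data/corpora/nytimes/*/*/*/*.xml')

This gives you a list of the file names that match the given expression. Then, depending on how many files you want open at once, you can do something like this:

import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes'
for item_path in texts:
    # The file id is resolved relative to the corpus root.
    file_id = os.path.relpath(item_path, root).replace(os.sep, '/')
    reader = XMLCorpusReader(root, [file_id])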

As @waffle paradox mentioned, you can also trim the texts list down to suit your particular needs.
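As one example of trimming the list, the sketch below keeps only the articles from a given year. The function name filter_by_year is invented here, and it assumes forward-slash paths laid out as root/year/month/day/file.xml, as in the question's example path.

```python
from pathlib import PurePosixPath


def filter_by_year(paths, root, year):
    """Keep the paths whose first directory below root equals the given year."""
    root_path = PurePosixPath(root)
    kept = []
    for path in paths:
        # Path of the article relative to the corpus root, e.g. 1987/01/01/x.xml
        relative = PurePosixPath(path).relative_to(root_path)
        if relative.parts and relative.parts[0] == str(year):
            kept.append(path)
    return kept
```

For instance, filter_by_year(texts, 'nltk_data/corpora/nytimes', 1987) would leave only the 1987 articles in the list.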

Answered 2025-04-16 by Python大师
