What do the categories in NLTK's Reuters corpus mean?

Posted 2024-06-02 05:56:55


I ran into a problem while working on topic classification of text.

I found the data in NLTK's "reuters" corpus.

But when I call reuters.categories()

the result is

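['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']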

I have almost no idea what each of these terms means. Where can I find an explanation?


1 answer
User
#1 · Posted 2024-06-02 05:56:55

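Some information about the Reuters corpus as exposed by the NLTK corpus API:

  • The Reuters-21578 "ApteMod" corpus was built for text categorization.

  • ApteMod is a collection of 10,788 documents from the Reuters financial newswire service.

  • In the ApteMod corpus, each document belongs to one or more categories. There are 90 categories in the corpus.

The mapping from file IDs to categories can be found in ~/nltk_data/corpora/reuters/cats.txt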

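from os.path import expanduser, join
from collections import defaultdict
from nltk.corpus import reuters

# Path to the file-ID-to-category mapping in the default nltk_data location
cats_path = join(expanduser("~"), 'nltk_data', 'corpora', 'reuters', 'cats.txt')
id2cat = defaultdict(list)

# Each line has the form "<fileid> <cat1> <cat2> ..."
with open(cats_path, 'r') as f:
    for line in f:
        fid, _, cats = line.partition(' ')
        id2cat[fid] = cats.split()

# Print every sentence together with the categories of its document
for fileid in reuters.fileids():
    for sent in reuters.sents(fileid):
        print(id2cat[fileid], sent)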

[out]:

['trade'] ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
...
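To browse the corpus by category instead of by document, the same mapping can be inverted. Below is a minimal sketch, assuming each record has the form "fileid cat1 cat2 ..." as parsed above. The helper names parse_cats and invert_mapping are my own, and the sample lines are illustrative stand-ins for cats.txt rather than a copy of it.

```python
from collections import defaultdict


def parse_cats(lines):
    """Map each file ID to its list of categories.

    Expects one record per line: "<fileid> <cat1> <cat2> ...".
    Blank lines are skipped.
    """
    id2cat = {}
    for line in lines:
        fid, _, cats = line.partition(' ')
        fid = fid.strip()
        if fid:
            id2cat[fid] = cats.split()
    return id2cat


def invert_mapping(id2cat):
    """Map each category to the file IDs labelled with it, in input order."""
    cat2ids = defaultdict(list)
    for fid, cats in id2cat.items():
        for cat in cats:
            cat2ids[cat].append(fid)
    return dict(cat2ids)


# Illustrative records in the cats.txt format (assumed, not copied from the file)
sample = [
    "test/14826 trade\n",
    "test/14828 grain\n",
    "\n",
    "training/10 corn grain wheat\n",
]

id2cat = parse_cats(sample)
cat2ids = invert_mapping(id2cat)

for cat in sorted(cat2ids):
    print(cat, cat2ids[cat])
```

With the real file you would pass the open file object to parse_cats. NLTK's categorized corpus readers also accept reuters.categories(fileid) and reuters.fileids(category), which should give the same information without reading cats.txt yourself.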

You can find information about the categories in this file: ~/nltk_data/corpora/reuters/README

  The Reuters-21578 benchmark corpus, ApteMod version

This is a publicly available version of the well-known Reuters-21578 "ApteMod" corpus for text categorization. It has been used in publications like these:

ApteMod is a collection of 10,788 documents from the Reuters financial newswire service, partitioned into a training set with 7769 documents and a test set with 3019 documents. The total size of the corpus is about 43 MB. It is also available for download from http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html , which includes a more extensive history of the data revisions.

The distribution of categories in the ApteMod corpus is highly skewed, with 36.7% of the documents in the most common category, and only 0.0185% (2 documents) in each of the five least common categories. In fact, the original data source is even more skewed---in creating the corpus, any categories that did not contain at least one document in the training set and one document in the test set were removed from the corpus by its original creator.

In the ApteMod corpus, each document belongs to one or more categories. There are 90 categories in the corpus. The average number of categories per document is 1.235, and the average number of documents per category is about 148, or 1.37% of the corpus.

-Ken Williams ken@mathforum.org

     Copyright & Notification 

(extracted from the README at the UCI address above)

The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data for research purposes only.

If you publish results based on this data set, please acknowledge its use, refer to the data set by the name "Reuters-21578, Distribution 1.0", and inform your readers of the current location of the data set (see "Availability & Questions").
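The figures quoted in the README (split sizes, category shares, averages) can be recomputed from an id-to-categories mapping. The following is a rough sketch only: category_stats is a name I made up, the toy mapping stands in for the real one, and it assumes file IDs are prefixed with "training/" or "test/" as in NLTK's copy of the corpus. The last lines just check that the README's own numbers are consistent with each other.

```python
from collections import Counter


def category_stats(id2cat):
    """Summarise a {fileid: [categories]} mapping."""
    if not id2cat:
        raise ValueError("empty mapping")
    n_docs = len(id2cat)
    # Number of documents carrying each category label
    counts = Counter(cat for cats in id2cat.values() for cat in cats)
    n_labels = sum(counts.values())
    return {
        "n_docs": n_docs,
        "n_categories": len(counts),
        "avg_cats_per_doc": n_labels / n_docs,
        "avg_docs_per_cat": n_labels / len(counts),
        # Fraction of documents in each category, most common first
        "share": {cat: c / n_docs for cat, c in counts.most_common()},
        # Assumes IDs look like "training/123" or "test/456"
        "split_sizes": dict(Counter(fid.split('/')[0] for fid in id2cat)),
    }


# Toy mapping for illustration only (not real corpus data)
toy = {
    "training/1": ["earn"],
    "training/2": ["earn"],
    "training/3": ["acq"],
    "test/4": ["trade", "earn"],
}
stats = category_stats(toy)
print(stats)

# Consistency check on the numbers stated in the README
readme_total = 7769 + 3019                         # training + test documents
least_common_pct = round(2 / 10788 * 100, 4)       # share of a 2-document category
docs_per_cat = 1.235 * 10788 / 90                  # avg labels per doc * docs / categories
docs_per_cat_pct = round(148 / 10788 * 100, 2)
print(readme_total, least_common_pct, round(docs_per_cat), docs_per_cat_pct)
```

Running category_stats on the id2cat mapping built from cats.txt should reproduce the README's skew: a couple of very large categories and a long tail of tiny ones.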
