Does anyone have a categorized XML corpus reader for NLTK?

2 votes
2 answers
2203 views
Asked 2025-04-16 22:23

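Has anyone written a categorized XML corpus reader for NLTK?

I'm working with the New York Times Annotated Corpus, which is in XML format. I can read the files with XMLCorpusReader, but I'd like to use some of NLTK's category functionality. There is a nice tutorial on how to subclass the NLTK readers. I can go ahead and write this myself, but I was hoping to save some time if someone has already done it.

If not, I'll post what I write.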

2 Answers

0

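Sorry, NAD, but posting a new question was the only way I found to discuss this code. I'm using it too, and I found a small bug when trying to use categories with the words() method. It is described here: https://github.com/nltk/nltk/issues/250#issuecomment-5273102

Did you run into this problem before? Also, have you made any other changes to it that get categories working? My email is on my profile page if you'd like to talk about it privately :-)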

1

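Here is a categorized XML corpus reader for NLTK (the Natural Language Toolkit). It is based on this tutorial.

It lets you use NLTK's category-based features on XML corpora such as the New York Times Annotated Corpus.

Name this file CategorizedXMLCorpusReader.py and import it like this: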

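# The imp module was removed in Python 3.12; importlib.util does the same job.
import importlib.util
spec = importlib.util.spec_from_file_location('CategorizedXMLCorpusReader', 'PATH_TO_THIS_FILE/CategorizedXMLCorpusReader.py')
CatXMLReader = importlib.util.module_from_spec(spec)
spec.loader.exec_module(CatXMLReader)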

After that you can use it like any other NLTK reader. For example:

# Bind the instance to a new name so the imported module is not shadowed.
reader = CatXMLReader.CategorizedXMLCorpusReader('.../nltk_data/corpora/nytimes', file_ids, cat_file='PATH_TO_CATEGORIES_FILE')
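
For the cat_file argument, here is a minimal sketch of the layout as I understand it: one line per file, with the file ID first and its categories after it, separated by whitespace. The helper names, file IDs, and category labels below are hypothetical and only illustrate the format; the parsing mimics what the reader does with the default delimiter, and file IDs are assumed to contain no whitespace.

```python
import os
import tempfile
from collections import defaultdict


def write_cat_file(path, mapping):
    """Write one line per file: the file ID, then its categories, space-separated."""
    with open(path, "w", encoding="utf-8") as f:
        for fileid, cats in mapping.items():
            f.write(" ".join([fileid, *cats]) + "\n")


def read_cat_file(path, delimiter=None):
    """Parse the file back into fileid -> categories and category -> fileids maps."""
    file_to_cats = defaultdict(set)
    cat_to_files = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # The first field is the file ID; everything after it is categories.
            fileid, cats = line.split(delimiter, 1)
            for cat in cats.split(delimiter):
                file_to_cats[fileid].add(cat)
                cat_to_files[cat].add(fileid)
    return file_to_cats, cat_to_files


# Hypothetical file IDs and category labels, for illustration only.
example = {
    "2007/01/01/0001.xml": ["politics"],
    "2007/01/01/0002.xml": ["sports", "business"],
}
cat_path = os.path.join(tempfile.mkdtemp(), "cats.txt")
write_cat_file(cat_path, example)
file_to_cats, cat_to_files = read_cat_file(cat_path)
print(sorted(cat_to_files["sports"]))
```

The path written this way is what would be passed as cat_file, relative to the corpus root.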

I'm still learning NLTK, so any corrections or suggestions are welcome.

# Categorized XML Corpus Reader

from nltk.corpus.reader import CategorizedCorpusReader, XMLCorpusReader


class CategorizedXMLCorpusReader(CategorizedCorpusReader, XMLCorpusReader):
    def __init__(self, *args, **kwargs):
        # CategorizedCorpusReader removes its own options (e.g. cat_file) from
        # kwargs, so the remaining keyword arguments are safe to pass on.
        CategorizedCorpusReader.__init__(self, kwargs)
        XMLCorpusReader.__init__(self, *args, **kwargs)
    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids

    # Each of the following methods resolves fileids/categories with _resolve()
    # before reading. We'll start with the plain text methods.
    def raw(self, fileids=None, categories=None):
        return XMLCorpusReader.raw(self, self._resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        # XMLCorpusReader.words works on one file at a time, so concatenate
        # the words of every resolved file here.
        fileids = self._resolve(fileids, categories)
        if fileids is None:
            fileids = self.fileids()
        elif isinstance(fileids, str):
            fileids = [fileids]
        words = []
        for fileid in fileids:
            words += XMLCorpusReader.words(self, fileid)
        return words

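    # This returns the text of the XML docs as one string, without any markup
    def text(self, fileids=None, categories=None):
        fileids = self._resolve(fileids, categories)
        if fileids is None:
            fileids = self.fileids()
        elif isinstance(fileids, str):
            fileids = [fileids]
        text = ""
        for fileid in fileids:
            # Element.getiterator() was removed in Python 3.9; iter() replaces it.
            for elem in self.xml(fileid).iter():
                if elem.text:
                    text += elem.text
        return text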

    # This returns all text for a specified XML field
    def fieldtext(self, fileids=None, categories=None):
        # NEEDS TO BE WRITTEN
        return

    def sents(self, fileids=None, categories=None):
        # Punkt expects a string, so tokenize the markup-free text rather than
        # the word list returned by words().
        from nltk.tokenize import PunktSentenceTokenizer
        text = self.text(fileids, categories)
        return PunktSentenceTokenizer().tokenize(text)

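    def paras(self, fileids=None, categories=None):
        # Neither XMLCorpusReader nor its base CorpusReader defines paras(), so
        # this call is expected to fail until paragraph extraction is implemented.
        return CategorizedCorpusReader.paras(self, self._resolve(fileids, categories))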
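
The fieldtext() stub is still unwritten. One possible direction, as a rough sketch only: a standalone helper that collects the text of every element with a given tag, using the standard library's ElementTree. The function name field_text, its tag parameter, and the sample document are my own assumptions; the real NYT schema may use different tag names.

```python
import xml.etree.ElementTree as ET


def field_text(root, tag):
    """Return the text of every element named `tag` under `root`, one string per element."""
    # itertext() also walks nested children, so inline markup inside the
    # field is flattened into plain text.
    return ["".join(elem.itertext()).strip() for elem in root.iter(tag)]


# Hypothetical document, for illustration only.
sample = ET.fromstring(
    "<doc><head><title>Example headline</title></head>"
    "<body><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></body></doc>"
)
print(field_text(sample, "p"))      # ['First bold paragraph.', 'Second paragraph.']
print(field_text(sample, "title"))  # ['Example headline']
```

Inside the reader, the same helper could presumably be applied to self.xml(fileid) for each resolved file ID, with the tag name passed in as an extra argument to fieldtext().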
