NLTK: How do I get specific contents of an array inside a loop in Python?

Published 2024-05-29 04:37:46


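I am trying to use NLTK to get statistics on a corpus, and I would like to know how to get the list of tags that appear next to a specific tag. For example, I want the list of tags that follow the tag DTDEF.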

I tried following the tutorial at https://www.nltk.org/book/ch05.html and adapting it to my needs.

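Here, the code stores in the array "tags" all the tags that follow the word 'ny', whereas I want to store the tags that follow the tag DTDEF (DTDEF is the tag of the word 'ny').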

import nltk
from nltk.corpus.reader import TaggedCorpusReader
reader = TaggedCorpusReader('cookbook', r'.*\.pos')
train_sents=reader.tagged_sents()
for sent in train_sents:
    tags = [tag[1] for (word, tag) in nltk.bigrams(sent) if word[0]=='ny']

    # Each bigram is a pair of (word, tag) tuples: `word` is the first token and
    # `tag` is the second, so word[0]/tag[0] are words and word[1]/tag[1] are tags.


fd = nltk.FreqDist(tags)
fd.tabulate()

To get the result I wanted, I changed the code to:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
reader = TaggedCorpusReader('cookbook', r'.*\.pos')
train_sents=reader.tagged_sents()
for sent in train_sents:
    #i change the line here
    tags = [tag[1] for (word, tag) in nltk.bigrams(sent) if tag[1]=='DTDEF']

fd = nltk.FreqDist(tags)
fd.tabulate()

I expected to get the list of tags that follow the tag DTDEF, but instead I got all the occurrences of the tag DTDEF itself: DTDEF 150
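
For reference, a minimal sketch of why this happens. Each bigram is a pair of (word, tag) tuples, so inside the comprehension `word` is the whole first token and `tag` is the whole second token; testing `tag[1]` and collecting `tag[1]` therefore looks at the same token. The sample sentence, the placeholder words and every tag other than DTDEF are made up for illustration, and `zip(sent, sent[1:])` is used as a stand-in for `nltk.bigrams(sent)` so the snippet runs without NLTK or the cookbook corpus.

```python
# Made-up tagged sentence; only the word 'ny' and the tag DTDEF come from the question.
sent = [("ny", "DTDEF"), ("w1", "NN"), ("w2", "CC"), ("ny", "DTDEF"), ("w3", "ADJ")]

# Stand-in for nltk.bigrams(sent): consecutive pairs of (word, tag) tuples.
pairs = list(zip(sent, sent[1:]))

# Same condition as in the question: test and collect the SECOND token's tag.
same_token = [second[1] for (first, second) in pairs if second[1] == "DTDEF"]

# Test the FIRST token's tag, collect the SECOND token's tag.
following = [second[1] for (first, second) in pairs if first[1] == "DTDEF"]

print(same_token)
print(following)
```

With the original variable names, the second comprehension corresponds to filtering on `word[1]` instead of `tag[1]`.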

So I tried the following, but the problem is that Python does not let me do something like this:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
reader = TaggedCorpusReader('cookbook', r'.*\.pos')
train_sents=reader.tagged_sents()
tags=[]
count=0
for sent in train_sents:
    for (word,tag) in sent:
        #if tag is DTDEF i want to get the tag after it
        if tag=="DTDEF":
            tags[count]=tag[actualIndex+1]
            count+=1


fd = nltk.FreqDist(tags)
fd.tabulate()
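
The index-based idea can be made to work with `enumerate`, which yields each position together with the token, and with `append` instead of assigning to an index of an empty list. This is only a sketch under assumptions: the sentences are made up (only 'ny' and DTDEF come from the question), and `collections.Counter` stands in for `nltk.FreqDist` so the snippet runs without NLTK.

```python
from collections import Counter

# Made-up tagged sentences standing in for reader.tagged_sents().
train_sents = [
    [("ny", "DTDEF"), ("w1", "NN"), ("w2", "VB")],
    [("w3", "VB"), ("ny", "DTDEF"), ("w4", "ADJ"), ("ny", "DTDEF")],
]

tags = []
for sent in train_sents:
    for i, (word, tag) in enumerate(sent):
        # Only look ahead when a next token exists in this sentence.
        if tag == "DTDEF" and i + 1 < len(sent):
            tags.append(sent[i + 1][1])  # tag of the following token

counts = Counter(tags)  # stand-in for nltk.FreqDist(tags)
print(counts.most_common())
```

The bounds check means a DTDEF at the very end of a sentence contributes nothing, rather than raising an IndexError or picking up a tag from the next sentence.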

That is why I am asking this question.

Thanks in advance for your answers and advice.


2 Answers

Thanks to #CrazySqueak for the help. I used his code and edited some parts to get:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
reader = TaggedCorpusReader('cookbook', r'.*\.pos')
train_sents=reader.tagged_sents()
tags = []
foundit=False
for sent in train_sents:
    #i change the line here
    for (word,tag) in nltk.bigrams(sent):
        if foundit: #The previous entry was tagged 'DTDEF'
            tags.append(tag[1]) #Store tag[1], the tag only; appending tag would
                                #store the (word, tag) pair instead
            foundit=False #Reset the flag, otherwise later tags would be stored
                          #even when the previous tag is not DTDEF
        if tag[1]=='DTDEF': #If the entry is 'DTDEF'
            foundit=True #Set the 'After DTDEF' flag.

fd = nltk.FreqDist(tags)
fd.tabulate()

Thanks again for your advice and answers.

I'm not 100% sure I understand, but if you want to get all the entries in a list that come after a specific entry, the simplest way is:

foundthing=False
result = []
for i in items:  # items is the list to search (avoids shadowing the built-in list)
    if foundthing:
        result.append(i)
    if i == "Thing I'm Looking For":
        foundthing = True

Adding this to your code gives:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
reader = TaggedCorpusReader('cookbook', r'.*\.pos')
train_sents=reader.tagged_sents()
tags = []
foundit=False
for sent in train_sents:
    #i change the line here
    for (word,tag) in nltk.bigrams(sent):
        if foundit: #If the entry is after 'DTDEF'
            tags.append(tag[1]) #Add its tag (not the flag itself) to the resulting list of tags.
        if tag[1]=='DTDEF': #If the entry is 'DTDEF'
            foundit=True #Set the 'After DTDEF' flag.

fd = nltk.FreqDist(tags)
fd.tabulate()
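
One thing to watch with the flag approach: the flag is never reset once the first DTDEF is seen, so the loop keeps appending on every later iteration, not only right after each DTDEF. A small sketch of the difference on a plain list of tags; the list and the helper names `all_after` and `next_after` are made up for illustration.

```python
def all_after(items, target):
    """Collect every item that comes after the first occurrence of target."""
    found = False
    result = []
    for item in items:
        if found:
            result.append(item)
        if item == target:
            found = True  # stays True for the rest of the loop
    return result


def next_after(items, target):
    """Collect only the item immediately following each occurrence of target."""
    found = False
    result = []
    for item in items:
        if found:
            result.append(item)
        # Re-evaluate the flag on every item so it only covers the next one.
        found = (item == target)
    return result


tag_seq = ["DTDEF", "NN", "VB", "DTDEF", "ADJ"]
print(all_after(tag_seq, "DTDEF"))
print(next_after(tag_seq, "DTDEF"))
```

For counting which tags follow DTDEF, the second behaviour is probably the one wanted.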

Hope this helps.
