How to find array elements in Python and add more information to them

Posted 2024-04-27 00:09:16


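I am using the nltk module to POS-tag a sentence. However, I need help adding more information to the tokens, namely:

  • Rewrite NNP as noun and NN as non-noun (ignore the others: VBD, IN, ...)
  • Add "CAPITALIZED" to words tagged NNP
  • Add "LOWERCASE" to words tagged NN

Here is an example: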

import nltk

sentences = "John wrote His name as Ishmael"

def findPOS(text):
    # Split into sentences, tokenize each sentence, then POS-tag the tokens
    tagged = nltk.sent_tokenize(text.strip())
    tagged = [nltk.word_tokenize(sent) for sent in tagged]
    tagged = [nltk.pos_tag(sent) for sent in tagged]
    print(tagged)

findPOS(sentences)

>> [[('John', 'NNP'), ('wrote', 'VBD'), ('His', 'NNP'), ('name', 'NN'), ('as', 'IN'), ('Ishmael', 'NNP')]]

# Desired output, with the extra information added and printed:

(John CAPITALIZED noun)
(wrote non-noun)
(His CAPITALIZED noun)
(name LOWERCASE non-noun)
(as non-noun)
(Ishmael CAPITALIZED noun)

1 answer
User
#1 · Posted 2024-04-27 00:09:16

Compact version (not recommended):

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sent = "John write His name as Ishmael"
>>> [pos_tag(word_tokenize(i)) for i in sent_tokenize(sent)]
[[('John', 'NNP'), ('write', 'VBD'), ('His', 'NNP'), ('name', 'NN'), ('as', 'IN'), ('Ishmael', 'NNP')]]
>>> tagged_sent = [pos_tag(word_tokenize(i)) for i in sent_tokenize(sent)]
>>> [[(word, "CAPITALIZED" if word[0].isupper() else None, "noun" if pos[0] == "N" else "non-noun") for word, pos in sentence] for sentence in tagged_sent]
[[('John', 'CAPITALIZED', 'noun'), ('write', None, 'non-noun'), ('His', 'CAPITALIZED', 'noun'), ('name', None, 'noun'), ('as', None, 'non-noun'), ('Ishmael', 'CAPITALIZED', 'noun')]]

More readable code:

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sent = "John write His name as Ishmael"
>>> tagged_sents = [pos_tag(word_tokenize(i)) for i in sent_tokenize(sent)]
>>> added_annotation_sents = []
>>> for sentence in tagged_sents:
...     each_sent = []
...     for word, pos in sentence:
...         # Check the first letter of the word for capitalization
...         caps = "CAPITALIZED" if word[0].isupper() else None
...         # Check the first letter of the POS tag (NN, NNP, ...), not of the word
...         isnoun = "noun" if pos[0] == "N" else "non-noun"
...         each_sent.append((word, caps, isnoun))
...     added_annotation_sents.append(each_sent)
... 
>>> added_annotation_sents
[[('John', 'CAPITALIZED', 'noun'), ('write', None, 'non-noun'), ('His', 'CAPITALIZED', 'noun'), ('name', None, 'noun'), ('as', None, 'non-noun'), ('Ishmael', 'CAPITALIZED', 'noun')]]

If you insist on dropping the None element (present when the word is not capitalized):

>>> [[tuple([ann for ann in word if ann is not None]) for word in sent] for sent in added_annotation_sents]
[[('John', 'CAPITALIZED', 'noun'), ('write', 'non-noun'), ('His', 'CAPITALIZED', 'noun'), ('name', 'noun'), ('as', 'non-noun'), ('Ishmael', 'CAPITALIZED', 'noun')]]
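
To print the exact strings listed in the question (NNP as a CAPITALIZED noun, NN as a LOWERCASE non-noun, everything else as a plain non-noun), one option is a small formatter that looks at the tag itself. This is only a minimal sketch: `annotate` is a hypothetical helper name, and the tagged pairs are copied from the question's output so the snippet runs without NLTK installed.

```python
def annotate(tagged_sents):
    """Format (word, tag) pairs using the mapping requested in the question."""
    lines = []
    for sentence in tagged_sents:
        for word, pos in sentence:
            if pos == "NNP":
                # Proper noun: labelled CAPITALIZED noun
                lines.append("({} CAPITALIZED noun)".format(word))
            elif pos == "NN":
                # Common noun: the question asks for LOWERCASE non-noun
                lines.append("({} LOWERCASE non-noun)".format(word))
            else:
                # Every other tag (VBD, IN, ...) is just a non-noun
                lines.append("({} non-noun)".format(word))
    return lines


# Tagged output copied from the question (assumed, not recomputed here)
tagged = [[('John', 'NNP'), ('wrote', 'VBD'), ('His', 'NNP'),
           ('name', 'NN'), ('as', 'IN'), ('Ishmael', 'NNP')]]

for line in annotate(tagged):
    print(line)
```

In real use, the list returned by `pos_tag` for each sentence would be passed in place of the hard-coded `tagged` value. Matching on the full tag rather than its first letter keeps NN and NNP distinct, which is what the question's rules require.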
