Python: how to assign encoded one-hot vectors to string values

Posted 2024-05-16 23:23:52


I'm not sure I'm phrasing this correctly. I encoded the POS tags as follows:

from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
transfomed_label = encoder.fit_transform(["CC","CD","DT","EX","FW","IN","JJ","JJR","JJS","LS","MD","NN","NNS","NNP","NNPS","PDT","POS","PRP","PRP$","RB","RBR","RBS","RP","SYM","TO","UH","VB","VBD","VBG","VBN","VBP","VBZ","WDT","WP","WP$","WRB"])
# print(transfomed_label)

# Build the mapping from each label to its index in encoder.classes_
labels = encoder.classes_
mappings = {}
for index, label in enumerate(labels):
    mappings[label] = index
# print(mappings)

# Print the one-hot row for every label
for item in transfomed_label:
    print(item)

Now I have a sentence, and I have obtained the POS tags for it:

[code block not preserved in the source]

This gives me [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'), ('learning', 'VBG'), ('python', 'NN')]

I want this sentence to be encoded as

[[000001000],[100000000],[010000000],[000000001],[000100000]]

*The vectors above are only illustrative.

Can anyone help me build an array of vectors that corresponds to the input sentence?


2 Answers

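First, let's collect all the POS tags used by the nltk package. (Careful!! This depends on the Penn Treebank tag set for the language you are working with.)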

pos_tags_list = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

Now build two mapping dictionaries:

[code block not preserved in the source]

Now you can extract all the tags from the sentence and encode them with sklearn's OneHotEncoder, pandas' get_dummies, or Keras' to_categorical.
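The answer's dictionary code did not survive, so the following is only a sketch of what the two steps might look like in plain Python. It assumes the two dictionaries are a tag-to-index map and its inverse; the names `tag_to_index`, `index_to_tag` and `one_hot` are hypothetical, and the tag list and tagged sentence are copied from the question.

```python
# Penn Treebank tags, copied from the list in the question.
pos_tags_list = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS',
                 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$',
                 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG',
                 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

# Assumed shape of the "two mapping dictionaries" (hypothetical names).
tag_to_index = {tag: i for i, tag in enumerate(pos_tags_list)}
index_to_tag = {i: tag for tag, i in tag_to_index.items()}


def one_hot(tag):
    """Return a list of zeros with a single 1 at the tag's index."""
    vec = [0] * len(pos_tags_list)
    vec[tag_to_index[tag]] = 1
    return vec


# Tagged sentence as shown in the question.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# One vector per word, in sentence order.
encoded = [one_hot(tag) for _, tag in tagged]

# The inverse dictionary turns each vector back into its tag.
decoded = [index_to_tag[vec.index(1)] for vec in encoded]
```

The inverse dictionary is only needed if you later want to map predictions back to tag names; for encoding alone, `tag_to_index` is enough.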

If I understood you correctly, you want something like this:

# tagged is the list of (word, tag) pairs from the question
res = [transfomed_label[mappings[tag]] for _, tag in tagged]
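As a possible alternative to indexing through `mappings`, the fitted `LabelBinarizer` can encode the sentence's tags directly with `transform`. This is a minimal sketch, not the answerer's code: it assumes scikit-learn is installed and hardcodes the `tagged` list from the question rather than calling a tagger.

```python
from sklearn.preprocessing import LabelBinarizer

# Tag list copied from the question.
pos_tags = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
            "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
            "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
            "VBZ", "WDT", "WP", "WP$", "WRB"]

encoder = LabelBinarizer()
encoder.fit(pos_tags)  # learns the sorted set of classes

# Tagged sentence as shown in the question (hardcoded here, no tagger call).
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# Keep only the tags, then encode them in one call.
sentence_tags = [tag for _, tag in tagged]
encoded = encoder.transform(sentence_tags)  # one row per word, one column per tag

# inverse_transform recovers the tag names from the one-hot rows.
decoded = encoder.inverse_transform(encoded)
```

Column order follows `encoder.classes_`, which is sorted alphabetically, so it need not match the order in which the tags were passed to `fit`.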
