I'm not sure whether I'm asking this the right way. I encoded the POS tags as follows:
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
transformed_label = encoder.fit_transform(["CC","CD","DT","EX","FW","IN","JJ","JJR","JJS","LS","MD","NN","NNS","NNP","NNPS","PDT","POS","PRP","PRP$","RB","RBR","RBS","RP","SYM","TO","UH","VB","VBD","VBG","VBN","VBP","VBZ","WDT","WP","WP$","WRB"])
# print(transformed_label)

# START: build the mapping between each label and its index
# print(encoder.classes_)
labels = encoder.classes_
mappings = {}
for index, label in enumerate(labels):
    mappings[label] = index
# print(mappings)
# END: build the mapping between each label and its index

# Print the one-hot row produced for each tag
for item in transformed_label:
    print(item)
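Now I have a sentence and I have obtained its POS tags, which gives me [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'), ('learning', 'VBG'), ('python', 'NN')]
I want this sentence to be encoded as
[[000001000],[100000000],[010000000],[000000001],[000100000]]
*The vectors above are only illustrative.
Can anyone help me build an array of vectors corresponding to the input sentence?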
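First, let's get all the POS tags from the nltk package. (Careful! This depends on the Penn Treebank tagset for the language you are using.)
Now build two mapping dictionaries.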
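As a rough sketch of the two mapping dictionaries, the snippet below uses only plain Python. The tag list is copied from the question rather than pulled from nltk, and the names `tag2idx`, `idx2tag`, and `one_hot` are illustrative assumptions, not something from the original answer.

```python
# Assumption: the 36 Penn Treebank tags listed in the question.
TAGS = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
        "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
        "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
        "VBZ", "WDT", "WP", "WP$", "WRB"]

# Two mapping dictionaries: tag -> index and index -> tag.
# Sorting makes the index assignment deterministic.
tag2idx = {tag: i for i, tag in enumerate(sorted(TAGS))}
idx2tag = {i: tag for tag, i in tag2idx.items()}


def one_hot(tag):
    """Return a list of 0s with a single 1 at the tag's index."""
    vec = [0] * len(tag2idx)
    vec[tag2idx[tag]] = 1
    return vec


# The tagged sentence from the question.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# One vector per word, in sentence order.
encoded = [one_hot(tag) for _, tag in tagged]
```

Each row of `encoded` has length 36 with exactly one 1, and `idx2tag` lets you map a row back to its tag.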
Now you can extract all the tags from the sentence and encode them with sklearn's OneHotEncoder, pandas' get_dummies, or Keras's to_categorical.

If I understood correctly, you want something like this:
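One possible way to do this, sketched under the assumption that the `LabelBinarizer` from the question is fitted on the full tag list, is to call `transform` on just the tags of the sentence. The variable names `tagged`, `tags`, `encoded`, and `decoded` are illustrative.

```python
from sklearn.preprocessing import LabelBinarizer

# Assumption: the 36 Penn Treebank tags listed in the question.
TAGS = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
        "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
        "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
        "VBZ", "WDT", "WP", "WP$", "WRB"]

# Fit once on the whole tagset so every sentence uses the same columns.
encoder = LabelBinarizer()
encoder.fit(TAGS)

# The tagged sentence from the question.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('Timothy', 'NNP'),
          ('learning', 'VBG'), ('python', 'NN')]

# Keep only the tags, in sentence order, and encode them.
tags = [tag for _, tag in tagged]
encoded = encoder.transform(tags)  # one row per word, one column per tag

# inverse_transform recovers the original tags from the one-hot rows.
decoded = encoder.inverse_transform(encoded)
```

Column order follows `encoder.classes_`, so the same fitted encoder should be reused for every sentence to keep the vectors comparable.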