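I have some sentences from research, along with manually extracted phrases that serve as the keywords I want for each sentence. To build training data for a support vector machine (SVM) classifier, I want to vectorize each sentence together with its keyword. See the code below.
I was thinking of using a dictionary and applying scikit-learn's DictVectorizer.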
Code:
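from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()
# One dict per training example: the full sentence and its hand-picked keyword
D = [{"sentence": "the laboratory information system was evaluated",
      "keyword": "laboratory information system"},
     {"sentence": "the electronic health record system was evaluated",
      "keyword": "electronic health record system"}]
X = v.fit_transform(D)   # sparse matrix; each string value is one-hot encoded
print(X)
content = X.toarray()    # dense view of the same matrix
print(content)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(v.get_feature_names())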
Results:
(0, 1) 1.0
(0, 3) 1.0
(1, 0) 1.0
(1, 2) 1.0
[[0. 1. 0. 1.]
[1. 0. 1. 0.]]
['keyword=electronic health record system', 'keyword=laboratory information system', 'sentence=the electronic health record system was evaluated', 'sentence=the laboratory information system was evaluated']
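Is this approach correct? If not, how can I combine each sentence with its manually extracted keyword to vectorize the training data? Thanks very much.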
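I don't think this is ideal, because you are treating each whole sentence as a single feature. For a large dataset, this will become a problem.
For example, in X above every distinct sentence string becomes its own column, so the number of features grows with the number of sentences.
Instead, you can apply scikit-learn's TfidfVectorizer directly to learn the important words in the sentences.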
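A minimal sketch of the TfidfVectorizer suggestion, assuming the two sentences from the question and the vectorizer's default settings. The variable names are illustrative, and this is not the answer's original code, which did not survive on this page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The two example sentences from the question
sentences = [
    "the laboratory information system was evaluated",
    "the electronic health record system was evaluated",
]

# Default settings: lowercase the text, split it into word tokens,
# and weight each word by TF-IDF
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(sentences)

# Recover the words in column order from the fitted vocabulary
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(X_tfidf.toarray().round(2))
```

Each column is now a word rather than a whole sentence, so sentences that share words share features. Words that appear in only one sentence (such as "laboratory") get a higher weight than words common to both (such as "system"). The resulting matrix can be passed to an SVM together with whatever labels you have.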