How do I build training data for a support vector machine classifier in scikit-learn?

Posted 2024-06-10 05:50:13


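I have some sentences taken from research studies, along with manually extracted words and phrases that are the keywords I want for each sentence. To build training data for an SVM classifier, I would like to vectorize each sentence together with its keyword. See the code below.

I was thinking of using a dictionary and applying DictVectorizer from the sklearn library.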

Code:

from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()

D = [{"sentence":"the laboratory information system was evaluated", 
       "keyword":"laboratory information system"},
     {"sentence":"the electronic health record system was evaluated", 
      "keyword":"electronic health record system"}]

X = v.fit_transform(D)

print(X)

content = X.toarray()

print(content)

print(list(v.get_feature_names_out()))  # on scikit-learn < 1.0, use v.get_feature_names()

Results:

  (0, 1)    1.0
  (0, 3)    1.0
  (1, 0)    1.0
  (1, 2)    1.0

[[0. 1. 0. 1.]
 [1. 0. 1. 0.]]

['keyword=electronic health record system', 'keyword=laboratory information system', 'sentence=the electronic health record system was evaluated', 'sentence=the laboratory information system was evaluated']

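Is this approach correct? If not, how can I combine each sentence with its manually extracted keyword so that the training data is represented in vectorized form? Many thanks.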


1 answer
User
#1 · Posted 2024-06-10 05:50:13

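I don't think this approach is ideal, because you are treating each whole sentence as a single feature. For a large dataset this becomes a problem, since every distinct sentence gets its own column.

For example: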

D = [{"sentence": "This is sentence one",
      "keyword": "key 1"},
     {"sentence": "This is sentence two",
      "keyword": "key 2"},
     {"sentence": "This is sentence three",
      "keyword": "key 3"},
     {"sentence": "This is sentence four",
      "keyword": "key 2"},
     {"sentence": "This is sentence five",
      "keyword": "key 1"}]

X

[[1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0. 0.]]
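A related consequence can be sketched as follows. This is only an illustration with made-up data (the `train` and `unseen` lists are assumptions, not from the question): because each sentence string is one categorical value, a sentence that was not present during fitting has no column at all, and `DictVectorizer.transform` silently ignores it.

```python
from sklearn.feature_extraction import DictVectorizer

# Illustrative training records (hypothetical data).
train = [
    {"sentence": "This is sentence one", "keyword": "key 1"},
    {"sentence": "This is sentence two", "keyword": "key 2"},
    {"sentence": "This is sentence three", "keyword": "key 3"},
]

v = DictVectorizer()
X_train = v.fit_transform(train)

# One column per distinct keyword plus one per distinct sentence.
n_features = X_train.shape[1]
print(n_features)

# A sentence that was not seen during fit has no matching column,
# so only the keyword column is set for this record.
unseen = [{"sentence": "This is sentence six", "keyword": "key 1"}]
X_new = v.transform(unseen).toarray()
print(X_new)
```

In other words, the sentence columns only memorize the training sentences; they carry no information about the words inside them.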

You can apply scikit-learn's TfidfVectorizer directly to learn the important words in the sentences.

Code:

from sklearn.feature_extraction.text import TfidfVectorizer


sentences = [d['sentence'] for d in D]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
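To connect this back to the SVM in the question, here is a minimal sketch of feeding TF-IDF features into a classifier. The question does not say what the target labels are, so the `labels` list, the extra sentences, and the class names `"lis"` and `"ehr"` are purely hypothetical; `SVC(kernel="linear")` is one reasonable choice, not necessarily what the asker intended.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical training data: sentences plus one made-up class label each.
sentences = [
    "the laboratory information system was evaluated",
    "the laboratory system was tested",
    "the electronic health record system was evaluated",
    "the electronic health record was reviewed",
]
labels = ["lis", "lis", "ehr", "ehr"]

# The pipeline turns raw text into TF-IDF vectors, then fits a linear SVM.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(sentences, labels)

# Each distinct word (not each sentence) becomes one feature.
vocab_size = len(model.named_steps["tfidfvectorizer"].vocabulary_)
print(vocab_size)

# New sentences are vectorized with the same vocabulary before prediction.
pred = model.predict(["the laboratory information system was reviewed"])
print(pred)
```

Wrapping both steps in one pipeline keeps the vocabulary learned at fit time consistent with what is used at prediction time.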
