Can't figure out the input format needed to predict on a dataset trained with doc2vec and a random forest classifier

Posted 2024-04-23 14:55:30

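I am trying to make predictions on a dataset based on some predefined data (tweets and the category each tweet belongs to, labeled 1-16). I built a model with doc2vec and trained a random forest classifier on it. I don't know what format the data needs to be in before I call clf.predict(tweet).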

import csv
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from gensim import utils
from gensim.models import Doc2Vec
import gensim
import nltk  #needed for word_tokenize below
import numpy as np

#just making the object to put into gensim's doc2vec
class LabeledLineSentence(object):

    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        #built-in zip replaces the Python 2-only itertools.izip
        for t, l in zip(self.doc_list, self.labels_list):
            #tokenize the raw tweet into a list of words
            t = nltk.word_tokenize(t)
            #TaggedDocument replaces LabeledSentence, which newer gensim releases removed
            yield gensim.models.doc2vec.TaggedDocument(t, [l])

#predefined
tweets = ["a tweet", "another tweet", ... , "a thousandth tweet"]
labels = [1, 1, ... , 16] #what category the tweet belongs to

training_data = LabeledLineSentence(tweets, labels)

#build the doc2vec model
model = Doc2Vec(vector_size=100, min_count=1, dm=1)
model.build_vocab(training_data)
model.train(training_data, total_examples=model.corpus_count, epochs=20)

#put tweets into classifier
train_tweets = []

for i in range(len(tweets)):
    label = labels[i]
    #look up the trained document vector stored under this tweet's tag
    train_tweets.append(model.dv[label])

#have to convert to numpy array because that is what clf takes
train_tweets = np.array(train_tweets)
train_labels = np.array(labels)

#fit classifier
clf = RandomForestClassifier().fit(train_tweets, train_labels)


#this is the data i am trying to classify into labels
test_data = ["an unseen tweet", "another unseen tweet", ... , "a thousandth unseen tweet"]

#infer a vector for each unseen tweet and classify it
for t in test_data:
    split = nltk.word_tokenize(t)
    vect = model.infer_vector(split)
    #predict expects a 2-D array, so turn the 1-D vector into a single row
    vect = vect.reshape(1, -1)
    print(clf.predict(vect))

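The end of this code block is where I get confused. I'm fairly sure I built the doc2vec model and trained the classifier correctly, but I'm not sure what I need to do to each tweet in the test data before calling clf.predict. I have tried tokenizing the strings and using a count vectorizer, but I keep getting an error saying the values cannot be converted to float. Is there another way to process the test data before using it for prediction?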
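
For reference, here is a minimal sketch of the input shape that clf.predict accepts, as far as I understand the scikit-learn convention: a 2-D float array of shape (n_samples, n_features), where n_features matches what the classifier was fitted on. The random vectors are only stand-ins for what model.infer_vector would return, and the names VECTOR_SIZE, inferred, X_train and X_test are made up for this illustration. It is not the real pipeline from the code above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

VECTOR_SIZE = 100  # assumed to match vector_size=100 in the Doc2Vec model above

rng = np.random.default_rng(0)

# Stand-in training data: 32 random "document vectors" with labels 1..16.
X_train = rng.normal(size=(32, VECTOR_SIZE))
y_train = np.tile(np.arange(1, 17), 2)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Stand-in for model.infer_vector(tokens): one 1-D float vector per unseen tweet.
inferred = [rng.normal(size=VECTOR_SIZE) for _ in range(3)]

# Option 1: predict one tweet at a time by reshaping the 1-D vector to (1, n_features).
single = clf.predict(inferred[0].reshape(1, -1))

# Option 2: stack all vectors into one (n_samples, n_features) array and predict in one call.
X_test = np.vstack(inferred)
batch = clf.predict(X_test)

print(X_test.shape, single.shape, batch.shape)
```

Raw strings or token lists cannot be passed to clf.predict directly, which would explain the "cannot convert to float" error. The test tweets have to go through the same vectorization as the training data, and the result needs the same number of columns.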

