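Now I want to use the model trained on the first dataset to predict labels for the second dataset. How can I use a model pre-trained on the first dataset to predict on a second dataset (with unseen labels) in scikit-learn?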

The code snippet I tried (updated in response to the comments below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle

# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]

# books = 1 : y label
# food = 0 : y label

df = pd.DataFrame({'text': some_text,
                   'y_variable': y_variable})

# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable

# ------------- Build Model -------------#
model = LinearSVC()
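# NOTE: the original call was cut off after `features,`. The remaining arguments
# are an assumption, chosen to match the (2, 26) training shape printed below.
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                    test_size=0.5, stratify=labels)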
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))

# ----------- Dataset 2: UNSEEN DATASET ----------- #

some_text2 = ['Harry potter books are amazing',
             'Gluten free diet is getting popular']

# Notice this doesn't have y_variable. This is the dataset whose y_variable labels (1 or 0) I am trying to predict.
unseen_df = pd.DataFrame({'text': some_text2})

# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here: 
# ValueError: X has 11 features per sample; expecting 26

print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)

# Looking for an output like this after predicting on the UNSEEN, unlabeled data:
# text                                   y_variable
# Harry potter books are amazing         1
# Gluten free diet is getting popular    0


### Do this in the first program
# Dump model and tfidf
pickle.dump(model, open('model.pkl', 'wb'))
pickle.dump(tfidf, open('tfidf.pkl', 'wb'))

### Do this in the second program
model = pickle.load(open('model.pkl', 'rb'))
tfidf = pickle.load(open('tfidf.pkl', 'rb'))

# Use `transform` instead of `fit_transform`
X_unseen = tfidf.transform(unseen_df['text']).toarray()

# Predict on `X_unseen`
y_pred_unseen = model.predict(X_unseen)
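
As a possible alternative to keeping two pickle files in sync, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline, so the fitted vocabulary always travels with the model. The following is only a minimal sketch using the sample data from the question; the names `pipeline` and `restored` are made up for illustration, and `pickle.dumps`/`pickle.loads` stand in for writing to and reading from a file.

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Sample training data copied from the question (books = 1, food = 0)
train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
train_labels = [1, 1, 0, 1, 0]

# One object holds both steps: TF-IDF is fitted once, then the classifier
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
pipeline.fit(train_text, train_labels)

# Round-trip through pickle (in memory here; a file works the same way)
restored = pickle.loads(pickle.dumps(pipeline))

# Unseen data without labels
unseen_df = pd.DataFrame({'text': ['Harry potter books are amazing',
                                   'Gluten free diet is getting popular']})

# predict() applies transform (not fit_transform) with the training vocabulary
unseen_df['y_variable'] = restored.predict(unseen_df['text'])
print(unseen_df)
```

With only five training sentences the predicted labels are not reliable; the point is just that the prediction step no longer raises the feature-count error.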


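Imagine you trained an AI to recognize airplanes using pictures of engines, wheels, wings, and the pilot's bow tie. Now you call the same AI and ask it to make a prediction for a model airplane that consists of nothing but a bow tie. That is what scikit-learn is telling you: X_unseen has far fewer features (= columns) than X_train and X_test.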
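
The column mismatch can be checked directly. This small sketch, reusing the sample sentences from the question, compares the number of columns produced by `transform` on the fitted vectorizer with the number produced by fitting a fresh vectorizer on the unseen text; the variable names `X_good` and `X_bad` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

# Fit the vocabulary on the training text only
tfidf = TfidfVectorizer()
X_train_all = tfidf.fit_transform(train_text)

# transform() reuses the training vocabulary, so the column count matches
X_good = tfidf.transform(unseen_text)

# A freshly fitted vectorizer learns a new, smaller vocabulary from the unseen text
X_bad = TfidfVectorizer().fit_transform(unseen_text)

print(X_train_all.shape, X_good.shape, X_bad.shape)
```

Words in the unseen text that never appeared during training (for example "gluten") are simply ignored by `transform`, which is why the matrix keeps the training-time width.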

