Using a pre-trained model to predict unseen data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle

# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]
# books = 1 : y label
# food = 0 : y label

df = pd.DataFrame({'text':some_text, 'y_variable': y_variable })

# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable
features.shape

# ------------- Build Model -------------#
model = LinearSVC()
X_train, X_test, y_train, y_test= train_test_split(features, labels, train_size=0.5, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))

# ----------- Dataset 2: UNSEEN DATASET ----------- #
some_text2 = ['Harry potter books are amazing',
              'Gluten free diet is getting popular']
unseen_df = pd.DataFrame({'text':some_text2})
# Notice this doesn't have y_variable. This is the data set for which I am trying to predict y_variable labels 1 or 0.

# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here:
# ValueError: X has 11 features per sample; expecting 26

print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)

# Looking for an output like this for UNSEEN data
# Looking for results after predicting unseen and no label data.

text                                   y_variable
Harry potter books are amazing         1
Gluten free diet is getting popular    0

3 Answers

User

#1 · Edited 2024-06-02 05:59:59

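As you can see, the first tfidf fit converts your input into 26 features, while the second converts it into 11. The error occurs because X_train and X_unseen have different shapes. The message is telling you that each observation in X_unseen has fewer features than the number model was trained on.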
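The mismatch can be reproduced without the classifier at all. The following is a minimal sketch that reuses the texts from the question; the variable names `refit`, `X_refit`, and `X_reused` are illustrative and do not appear in the original code. It compares refitting a vectorizer on the unseen text with reusing the vectorizer that was already fitted on the training text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

# Fit once on the training text: the vocabulary (and so the column count) is fixed here.
tfidf = TfidfVectorizer()
X_all = tfidf.fit_transform(train_text)

# Refitting on the unseen text learns a new, smaller vocabulary.
refit = TfidfVectorizer()
X_refit = refit.fit_transform(unseen_text)

# Reusing the fitted vectorizer keeps the training vocabulary;
# words it has never seen are simply ignored.
X_reused = tfidf.transform(unseen_text)

print(X_all.shape)     # (5, 26)
print(X_refit.shape)   # (2, 11)
print(X_reused.shape)  # (2, 26)
```

Only `X_reused` has the column count the model expects, which is why `transform` rather than `fit_transform` is needed on new data.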

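In the second script, after loading model, you fit another vectorizer on the text. In other words, the tfidf from the first script and the tfidf from the second script are different objects. To predict with model, you need to transform X_unseen with the original tfidf. To do that, export the original vectorizer, load it in the new script, and use it to transform the new data before passing it to model.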

### Do this in the first program
# Dump model and tfidf
pickle.dump(model, open('model.pkl', 'wb'))
pickle.dump(tfidf, open('tfidf.pkl', 'wb'))

### Do this in the second program
model = pickle.load(open('model.pkl', 'rb'))
tfidf = pickle.load(open('tfidf.pkl', 'rb'))

# Use `transform` instead of `fit_transform`
X_unseen = tfidf.transform(unseen_df['text']).toarray()

# Predict on `X_unseen`
y_pred_unseen = model.predict(X_unseen)  # `model` is the name the loaded model was given above
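As an alternative to pickling two objects, the vectorizer and the classifier can be bundled into one scikit-learn pipeline so that they cannot get out of sync. This is a sketch of a different packaging, not what the question's code does; `pipe`, `blob`, and `loaded` are illustrative names, and `pickle.dumps`/`pickle.loads` stand in for writing to and reading from a `.pkl` file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
y_variable = [1, 1, 0, 1, 0]  # books = 1, food = 0

# The pipeline fits the vectorizer and the classifier together on raw text.
pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
pipe.fit(train_text, y_variable)

# Serialize and restore the whole pipeline as a single object.
# (In a two-script setup this would be pickle.dump / pickle.load on a file.)
blob = pickle.dumps(pipe)
loaded = pickle.loads(blob)

unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']

# predict() applies the stored vectorizer's transform, then the classifier.
y_pred_unseen = loaded.predict(unseen_text)
for text, label in zip(unseen_text, y_pred_unseen):
    print(text, label)
```

With only five training sentences the predicted labels are not meaningful; the point is that raw text goes in and the feature count always matches.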

User

#2 · Edited 2024-06-02 05:59:59

Ignore the second dataset and use train_test_split to create the test set.
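One way to read this suggestion is sketched below, under some assumptions: the raw text is split before vectorizing, the vectorizer is fitted on the training part only, and the held-out part is passed through `transform`. The `stratify` argument is an addition not present in the question; it is used here so that both classes end up in the tiny training split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1, 1, 0, 1, 0]  # books = 1, food = 0

# Split the raw text first, so the test set plays the role of unseen data.
text_train, text_test, y_train, y_test = train_test_split(
    some_text, y_variable, train_size=0.5, random_state=0, stratify=y_variable)

# Fit the vocabulary on the training text only, then reuse it for the test text.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(text_train)
X_test = tfidf.transform(text_test)

model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(X_train.shape, X_test.shape)  # same number of columns in both
```

This gives labeled held-out data for evaluation, but it does not remove the need to reuse the fitted vectorizer when genuinely unlabeled data arrives later.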

User

#3 · Edited 2024-06-02 05:59:59

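Imagine you trained an AI to recognize airplanes from pictures of engines, wheels, wings, and the pilot's bow tie. Now you call the same AI and ask it to make a prediction for an airplane of which only the bow tie is shown. That is what scikit-learn is telling you: X_unseen has far fewer features (= columns) than X_train or X_test.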
