scikit-learn: predicting new input with a trained model

| "Consignor Code" | "Consignee Code" | "Origin" | "Destination" | "Carrier Code" |
|------------------|------------------|----------|---------------|----------------|
| "6402106844"     | "66903717"       | "DKCPH"  | "CNPVG"       | "6402746387"   |
| "6402106844"     | "66903717"       | "DKCPH"  | "CNPVG"       | "6402746387"   |
| "6402106844"     | "6404814143"     | "DKCPH"  | "CNPVG"       | "6402746387"   |
| "6402107662"     | "66974631"       | "DKCPH"  | "VNSGN"       | "6402746393"   |
| "6402107662"     | "6404518090"     | "DKCPH"  | "THBKK"       | "6402746393"   |
| "6402107662"     | "6404518090"     | "DKBLL"  | "THBKK"       | "6402746393"   |
| "6408507648"     | "6403601344"     | "DKCPH"  | "USTPA"       | "66565231"     |

#Import the dependencies
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn import preprocessing
import pandas as pd

#Import the dataset (A CSV file)
dataset = pd.read_csv('shipments.csv', header=0, skip_blank_lines=True)

#Drop any rows containing NaN values
dataset.dropna(subset=['Consignor Code', 'Consignee Code',
                       'Origin', 'Destination', 'Carrier Code'], inplace=True)

#Cast the numeric-only code columns to int64
dataset['Consignor Code'] = dataset['Consignor Code'].astype('int64')
dataset['Consignee Code'] = dataset['Consignee Code'].astype('int64')
dataset['Carrier Code'] = dataset['Carrier Code'].astype('int64')

#Define our target (What we want to be able to predict)
target = dataset.pop('Destination')

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le = preprocessing.LabelEncoder()
target = le.fit_transform(list(target))
dataset['Origin'] = le.fit_transform(list(dataset['Origin']))
dataset['Consignor Code'] = le.fit_transform(list(dataset['Consignor Code']))
dataset['Consignee Code'] = le.fit_transform(list(dataset['Consignee Code']))
dataset['Carrier Code'] = le.fit_transform(list(dataset['Carrier Code']))

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=0)

#Prepare the model and .fit it.
model = RandomForestClassifier()
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

#Print the accuracy score.
print("Accuracy score: {}".format(accuracy_score(y_test, predictions)))

2 Answers

User

#1 · Edited 2024-05-21 04:44:26

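Here is a complete working example that includes the prediction. The most important part is to define a separate label encoder for each feature, so that you can apply the same encoding to new data. Otherwise you will run into errors (they may not surface right away, but you will notice them when computing the accuracy):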

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#Build the sample dataset inline (instead of reading the CSV file)
dataset = pd.DataFrame({'Consignor Code':["6402106844","6402106844","6402106844","6402107662","6402107662","6402107662","6408507648"],
                   'Consignee Code': ["66903717","66903717","6404814143","66974631","6404518090","6404518090","6403601344"],
                   'Origin':["DKCPH","DKCPH","DKCPH","DKCPH","DKCPH","DKBLL","DKCPH"],
                   'Destination':["CNPVG","CNPVG","CNPVG","VNSGN","THBKK","THBKK","USTPA"],
                   'Carrier Code':["6402746387","6402746387","6402746387","6402746393","6402746393","6402746393","66565231"]})

#Drop any rows containing NaN values
dataset.dropna(subset=['Consignor Code', 'Consignee Code',
                       'Origin', 'Destination', 'Carrier Code'], inplace=True)


#Define our target (What we want to be able to predict)
target = dataset.pop('Destination')

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le_origin = preprocessing.LabelEncoder()
le_consignor = preprocessing.LabelEncoder()
le_consignee = preprocessing.LabelEncoder()
le_carrier = preprocessing.LabelEncoder()
le_target = preprocessing.LabelEncoder()
target = le_target.fit_transform(list(target))
dataset['Origin'] = le_origin.fit_transform(list(dataset['Origin']))
dataset['Consignor Code'] = le_consignor.fit_transform(list(dataset['Consignor Code']))
dataset['Consignee Code'] = le_consignee.fit_transform(list(dataset['Consignee Code']))
dataset['Carrier Code'] = le_carrier.fit_transform(list(dataset['Carrier Code']))

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=42)


#Prepare the model and .fit it.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

#Print the accuracy score.
print("Accuracy score: {}".format(accuracy_score(y_test, predictions)))

new_input = ["6408507648","6403601344","DKCPH","66565231"]
fitted_new_input = np.array([le_consignor.transform([new_input[0]])[0],
                                le_consignee.transform([new_input[1]])[0],
                                le_origin.transform([new_input[2]])[0],
                                le_carrier.transform([new_input[3]])[0]])
new_predictions = model.predict(fitted_new_input.reshape(1,-1))

print(le_target.inverse_transform(new_predictions))

Finally, your tree predicts:

['THBKK']
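
The failure mode this answer warns about can be reproduced in a few lines. The following is a minimal sketch, assuming a handful of values taken from the question's table; the variable names (`shared`, `origins`, `carriers`) are illustrative and not part of the original code. A single `LabelEncoder` that is re-fitted on another column only remembers the last column it saw, so transforming a value from an earlier column raises a `ValueError`.

```python
from sklearn.preprocessing import LabelEncoder

# A few values from the question's table (illustrative subset).
origins = ["DKCPH", "DKCPH", "DKBLL"]
carriers = ["6402746387", "6402746393", "66565231"]

# One shared encoder: the second fit overwrites what the first one learned.
shared = LabelEncoder()
shared.fit(origins)
shared.fit(carriers)

try:
    shared.transform(["DKCPH"])
    shared_failed = False
except ValueError:
    # "DKCPH" is not among the carrier codes the encoder now remembers.
    shared_failed = True

# One encoder per column: each keeps its own mapping.
le_origin = LabelEncoder().fit(origins)
le_carrier = LabelEncoder().fit(carriers)

# classes_ is sorted, so "DKBLL" maps to 0 and "DKCPH" maps to 1.
origin_code = le_origin.transform(["DKCPH"])[0]
print(shared_failed, origin_code, list(le_origin.classes_))
```

With one encoder per column, each mapping stays available at prediction time, which is what the new-input step in this answer relies on.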

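User

#2 · Edited 2024-05-21 04:44:26

Here is a quick example to illustrate the idea. I wouldn't do it this way in practice, and it may contain mistakes. For example, I think it will fail if the test set contains classes that were not seen during training.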

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=0)

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le_target = preprocessing.LabelEncoder()
y_train = le_target.fit_transform(y_train)
y_test = le_target.transform(y_test)

# Now create a separate encoder for each of your features:
encoders = {}
for feature in ["Origin", "Consignor Code", "Consignee Code", "Carrier Code"]:
    # NOTE: The LabelEncoder docs state clearly at the start that you
    # shouldn't be using it on your inputs. I'm not going to get into that
    # here, but be aware that it's not a good encoding.
    encoders[feature] = preprocessing.LabelEncoder()
    X_train[feature] = encoders[feature].fit_transform(X_train[feature])
    X_test[feature] = encoders[feature].transform(X_test[feature])

#Prepare the model and .fit it.
model = RandomForestClassifier()
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

le_target.inverse_transform(predictions)
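
One possible way around the unseen-class failure mentioned in this answer, sketched here as an assumption rather than as what the answer itself does, is to swap `LabelEncoder` for `OrdinalEncoder` on the input features. `OrdinalEncoder` accepts `handle_unknown="use_encoded_value"` together with `unknown_value` (available from scikit-learn 0.24 onward), which maps categories not seen during `fit` to a sentinel instead of raising. The rows below are a small illustrative subset, not the full dataset.

```python
from sklearn.preprocessing import OrdinalEncoder

# Training rows: [Origin, Carrier Code] (illustrative subset).
X_train = [["DKCPH", "6402746387"],
           ["DKBLL", "6402746393"]]

# New row whose carrier code never appeared in the training rows.
X_new = [["DKCPH", "66565231"]]

# Unknown categories are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(X_train)

encoded = enc.transform(X_new)
print(encoded)
```

Whether a sentinel such as -1 is a sensible input for the downstream model depends on the model and the data; this only avoids the exception.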

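The key concept here is to use a separate encoder for each feature, because each encoder object remembers how it encoded that feature. That mapping is learned during the fit phase. You then call transform on any new data so that it is encoded the same way.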
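
To make the fit/transform split concrete, here is a simplified stand-in written in plain Python. `TinyLabelEncoder` is a hypothetical class for illustration only; it is not scikit-learn's implementation, it just mimics the behaviour described above: `fit` stores a mapping, and `transform` and `inverse_transform` reuse it.

```python
class TinyLabelEncoder:
    """Simplified stand-in for illustration; not scikit-learn's implementation."""

    def fit(self, values):
        # Remember the sorted distinct values; the position is the integer code.
        self.classes_ = sorted(set(values))
        self._index = {v: i for i, v in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Reuse the mapping learned in fit; unseen values raise an error.
        try:
            return [self._index[v] for v in values]
        except KeyError as exc:
            raise ValueError(f"unseen label: {exc.args[0]!r}") from None

    def inverse_transform(self, codes):
        # Map integer codes back to the original labels.
        return [self.classes_[c] for c in codes]


# Fit once on the training destinations, then reuse the mapping on new data.
enc = TinyLabelEncoder().fit(["CNPVG", "VNSGN", "THBKK", "USTPA", "CNPVG"])
codes = enc.transform(["THBKK", "CNPVG"])
labels = enc.inverse_transform(codes)
print(codes, labels)
```

Because the mapping lives on the object, calling `fit` again on a different column would replace it, which is why each feature needs its own encoder.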

