Difference in random forest regression accuracy between the training set and the test set

import tkinter as tk                                   # Required for enabling GUI options
from tkinter import messagebox                         # Required for pop-up window
from tkinter import filedialog                         # Required for getting full path of file
import pandas as pd                                    # Required for data handling
from sklearn.model_selection import train_test_split   # Required for splitting data into training and test set
from sklearn.ensemble import RandomForestRegressor     # Required to build random forest

#------------------------------------------------------------------------#
# Create an instance of tkinter and hide the window
root = tk.Tk()                      # Create an instance of tkinter
root.withdraw()                     # Hides root window
#root.lift()                        # Required for pop-up window management
root.attributes("-topmost", True)   # To make pop-up window stay on top of all other windows

#------------------------------------------------------------------------#
# This block of code reads input file using tkinter GUI options
print("Reading input file...")

# Pop-up window to ask user for the input file
File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                      "At The Prompt, Enter 'Abalone_Data.csv' File.")

# Kill the execution if user selects "Cancel" in the above pop-up window
if (File_Checker == False):
    quit()
else:
    del(File_Checker)

file_loop = 0
while (file_loop == 0):
    # Get path of base file
    file_path = filedialog.askopenfilename(initialdir = "/",
                                           title = "File Selection Prompt",
                                           filetypes = (("All Files", "*.*"), ))

    # Condition to check if user selected a file or not
    if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "No file has been selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()

    # Get file name
    file_name = file_path.split("/")   # Splits the path with "/" as the delimiter and returns a list
    file_name = file_name[-1]          # Extracts the last element of the list

    # Condition to check if correct file was selected or not
    if (file_name != "Abalone_Data.csv"):
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "Incorrect file selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()

    # Read the base file
    input_file = pd.read_csv(file_path, sep = ',', encoding = 'utf-8', low_memory = True)
    break

# Delete unwanted variables
del(file_loop, file_name)

#------------------------------------------------------------------------#
print("Preparing dependent and independent variables...")

# Create separate dataframe consisting of only the dependent variable
y = pd.DataFrame(input_file['Rings'])

# Create separate dataframe consisting of only the independent variables
X = input_file.drop(columns = ['Rings'], inplace = False)

#------------------------------------------------------------------------#
print("Handling Dummy Variable Trap...")

# Create a new dataframe to handle categorical data
# This method splits the categorical data column into separate columns
# drop_first = True drops one level to get rid of the dummy variable trap
# dtype = int keeps the dummy columns as 0/1 integers
dummy_Sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True, dtype = int)

# Remove the specific columns from the dataframe
# These are the categorical data columns which were split into separate columns in the previous step
X.drop(columns = ['Sex'], inplace = True)

# Merge the new columns into the original dataframe
X = pd.concat([X, dummy_Sex], axis = 1)

#------------------------------------------------------------------------#
y = y.values
X = X.values

#------------------------------------------------------------------------#
print("Splitting datasets to training and test sets...")

# Splitting the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#------------------------------------------------------------------------#
print("Fitting Random Forest regression on training set")

# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, y_train.ravel())   # Using ravel() to avoid getting 'DataConversionWarning' warning message

#------------------------------------------------------------------------#
print("Predicting Values")

# Predicting a new result with regression
y_pred = regressor.predict(X_test)

# Enter values for new prediction as a dictionary
test_values = {'Sex_I' : 0, 'Sex_M' : 0, 'Length' : 0.5, 'Diameter' : 0.35, 'Height' : 0.8,
               'Whole_Weight' : 0.223, 'Shucked_Weight' : 0.09, 'Viscera_Weight' : 0.05,
               'Shell_Weight' : 0.07}

# Convert dictionary into dataframe
test_values = pd.DataFrame(test_values, index = [0])

# Rearranging columns so they match the column order of X used for training
test_values = test_values[['Length', 'Diameter', 'Height', 'Whole_Weight', 'Shucked_Weight',
                           'Viscera_Weight', 'Shell_Weight', 'Sex_I', 'Sex_M']]

# Applying feature scaling
#test_values = sc_X.transform(test_values)

# Predicting values of new data (as a plain array, like the training data)
new_pred = regressor.predict(test_values.values)

#------------------------------------------------------------------------#
"""
print("Building Confusion Matrix...")

# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
"""

#------------------------------------------------------------------------#
print("\n")
print("Getting Model Accuracy...")

# Get regression details
#print("Estimated Coefficient = ", regressor.coef_)
#print("Estimated Intercept = ", regressor.intercept_)
print("Training Accuracy = ", regressor.score(X_train, y_train))
print("Test Accuracy = ", regressor.score(X_test, y_test))

print("\n")
print("Printing predicted result...")
print("Result_of_Treatment = ", new_pred)

2 Answers

User

#1 · Edited 2024-05-14 10:43:29

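Before trying to answer your points, one comment: I see that you are using accuracy as the metric for a regressor. Accuracy is a metric for classification problems; for regression models you would normally use other metrics, such as mean squared error (MSE).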

If you simply switch to a more suitable metric, you may find that your model is not that bad after all.
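As a rough illustration of what switching metrics looks like, the sketch below fits a random forest and reports MSE, RMSE, MAE and R². It runs on synthetic data from `make_regression`, which is only a stand-in and not the abalone file. One detail worth knowing: `score` on a scikit-learn regressor returns R², the coefficient of determination, not a classification accuracy.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption for this sketch)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = RandomForestRegressor(n_estimators=50, random_state=50)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_test, y_pred)  # average absolute error
r2 = r2_score(y_test, y_pred)              # identical to regressor.score(X_test, y_test)

print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
print(f"MAE  = {mae:.3f}")
print(f"R^2  = {r2:.3f}")
```

RMSE and MAE are often the easiest to interpret because they are expressed in the units of the target (rings, in the question).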

I will answer your questions anyway.

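Why are the training accuracy and the test accuracy so far apart? It means you have overfitted your training sample: your model is very strong at predicting data from the training set but fails to generalize. It is like a model trained on a set of cat pictures that believes only those exact pictures are cats and that every other picture of a cat is not. Your score on the test set is in fact 0.5. Note that `score` on a scikit-learn regressor returns R² rather than classification accuracy, so 0.5 means the model explains roughly half of the variance in the test targets; it is not the same thing as random guessing.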

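How do I know whether this model is overfitted or underfitted? Precisely from the difference in score between the two sets. The closer they are, the better the model generalizes. You already know what overfitting looks like. Underfitting usually shows up as a low score on both sets.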
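The rule of thumb above can be written down as a small helper. This is only a sketch: the function name `diagnose_fit` and both thresholds are hypothetical choices made for illustration, and sensible cut-offs depend on the problem and the metric.

```python
def diagnose_fit(train_score, test_score, low_score=0.6, max_gap=0.15):
    """Classify a train/test score pair (higher is better, e.g. R^2).

    low_score and max_gap are arbitrary illustrative thresholds.
    """
    gap = train_score - test_score
    if train_score < low_score and test_score < low_score:
        # Low on both sets: the model has not captured the signal
        return "underfitting"
    if gap > max_gap:
        # Good on training data but much worse on unseen data
        return "overfitting"
    return "reasonable fit"


# Made-up score pairs, not results from the question
print(diagnose_fit(0.93, 0.50))  # overfitting
print(diagnose_fit(0.40, 0.38))  # underfitting
print(diagnose_fit(0.85, 0.80))  # reasonable fit
```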

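Is random forest regression the right model? If not, how do I determine the right model for this use case? There is no single right model. For structured data, Random Forest and tree-based models in general (LightGBM, XGBoost) are the Swiss army knife of machine learning, because they are simple and reliable. Deep learning models can perform better in theory, but they are much more complicated to set up.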
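One common way to choose between candidates is to compare them with cross-validation on the same data and the same metric. The sketch below does that for three scikit-learn regressors; the candidate list and the synthetic data are assumptions for illustration, not a recommendation for the abalone problem.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (an assumption for this sketch)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

cv_mse = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MSE is reported as a negative number
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: cross-validated MSE = {cv_mse[name]:.2f}")

best_name = min(cv_mse, key=cv_mse.get)
print("Lowest cross-validated MSE:", best_name)
```

Because `make_regression` produces a linear target, the ranking on this toy data says nothing about which model suits real measurements; the point is only the comparison procedure.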

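How do I build a confusion matrix from the variables I have created? A confusion matrix is built from the output of a classification model. You are using a regressor, so it does not make much sense here.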

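How do I validate the model's performance? In general, to validate performance reliably you split the data into three parts: you train on the first (the "training set"), tune the model on the second (the "validation set"), and finally, once you are happy with the model and its hyperparameters, test on the third (the "test set", not to be confused with what you call the test set). The last one tells you whether your model generalizes well. The reason is that while you choose and tune the model you can also overfit the validation set (the one you call the test set), for example by picking hyperparameters that only work well on that set. You also have to choose a reliable metric, which depends on your data and your model. For regression, MSE is a reasonable choice.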
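A minimal sketch of the three-way split, assuming a 60/20/20 ratio and synthetic data in place of the real file: two calls to `train_test_split` produce the three sets, one hyperparameter (`max_depth`, with an arbitrary candidate list) is tuned on the validation set, and the test set is used exactly once at the end.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption for this sketch)
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

# First split: hold out 20% as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Second split: 25% of the remaining 80% is 20% of the whole, used for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Tune one hyperparameter on the validation set only
candidate_depths = [3, 6, 12]
best_depth, best_val_mse = None, float("inf")
for depth in candidate_depths:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_depth, best_val_mse = depth, val_mse

# Touch the test set once, after all choices have been made
final_model = RandomForestRegressor(n_estimators=50, max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))

print("Chosen max_depth:", best_depth)
print(f"Validation MSE = {best_val_mse:.2f}, test MSE = {test_mse:.2f}")
```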

User

#2 · Edited 2024-05-14 10:43:29

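With trees and ensembles, you have to be careful with some settings. In your case, the difference comes from overfitting. That means your model has learned your training data "too well" and cannot generalize to other data.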

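One important thing to do is to limit the depth of the trees. Every tree has a branching factor of 2, which means that at depth d you can have up to 2^d branches.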

Let's imagine you have 1000 training values. If you don't limit the depth (and/or min_samples_leaf), the tree can memorize your complete dataset with a depth of 10 (because 2^10 = 1024 > N_training).
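The arithmetic can be checked with a few lines. The helper name `depth_to_isolate` is made up for this sketch, and it assumes a perfectly balanced binary tree; real trees are usually unbalanced, so they may grow deeper than this bound.

```python
def depth_to_isolate(n_samples: int) -> int:
    """Smallest depth d such that 2**d >= n_samples (balanced binary tree)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only
    return (n_samples - 1).bit_length()


d = depth_to_isolate(1000)
print(f"n = 1000: depth {d} gives {2 ** d} leaves")  # n = 1000: depth 10 gives 1024 leaves
```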

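What you can do is compare the training and test accuracy over a range of depths (say from 3 to log2(n)). If the depth is too low, both scores will be low, because you need more branches to learn the data properly. As the depth grows, both scores rise to a peak; after that the training score keeps rising while the test score drops. The curve resembles the usual plot of error against model complexity, where the model complexity is your depth.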
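A sketch of that depth sweep, again on synthetic data rather than the abalone file. It uses `score`, which returns R² for a scikit-learn regressor; the number of trees and the noise level are arbitrary choices.

```python
import math

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption for this sketch)
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Sweep depth from 3 up to floor(log2(number of training samples))
max_useful_depth = int(math.log2(len(X_train)))
depths = list(range(3, max_useful_depth + 1))

train_scores, test_scores = [], []
for depth in depths:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))  # R^2 on seen data
    test_scores.append(model.score(X_test, y_test))     # R^2 on unseen data
    print(f"max_depth={depth}: train R^2={train_scores[-1]:.3f}, "
          f"test R^2={test_scores[-1]:.3f}")
```

Plotting `train_scores` and `test_scores` against `depths` shows where the two curves start to separate; a depth near that point is a reasonable starting value.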

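You can also play with min_samples_split and/or min_samples_leaf, which make the tree split only when a branch holds enough samples. This also limits the depth, and it allows trees whose branches have different depths. As explained above, you can search for the best values (with a grid search).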
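A grid search over these settings could look like the sketch below, which uses scikit-learn's `GridSearchCV`. The grid values, the number of trees, and the synthetic data are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (an assumption for this sketch)
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Small illustrative grid; real searches usually cover more values
param_grid = {
    "max_depth": [4, 8],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated MSE = {-search.best_score_:.2f}")

# The held-out test set is only used after the search has finished
test_r2 = search.best_estimator_.score(X_test, y_test)
print(f"Test R^2 of the best model = {test_r2:.3f}")
```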

I hope this helps.
