Nonlinear feature transformations in Python

import pandas as pd
import numpy as np
from sklearn import linear_model

# Import the training data and extract the features and labels from it
DATAPATH = 'train.csv'
data = pd.read_csv(DATAPATH)
features = data.drop(['Id', 'y'], axis=1)
labels = data[['y']]

features['x6'] = features['x1']**2
features['x7'] = features['x2']**2
features['x8'] = features['x3']**2
features['x9'] = np.exp(features['x1'])
features['x10'] = np.exp(features['x2'])
features['x11'] = np.exp(features['x3'])
features['x12'] = np.cos(features['x1'])
features['x13'] = np.cos(features['x2'])
features['x14'] = np.cos(features['x3'])

regr = linear_model.LinearRegression()
regr.fit(features, labels)

1 answer

Community user

#1 · Posted 2024-05-16 04:59:10

First, I think there is a cleaner way to transform all the columns at once. One option is:

# Define the list of transformations (the identity keeps the original columns)
trans = [lambda a: a, np.square, np.exp, np.cos]

# Apply each transformation to the whole DataFrame and concatenate column-wise
features = pd.concat([t(features) for t in trans], axis=1)

# Rename the columns to x1, x2, ..., xN
features.columns = [f'x{i}' for i in range(1, features.shape[1] + 1)]
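Note that the snippet above transforms every column, while the question only transforms x1 to x3. A minimal sketch of restricting the same approach to a subset follows; the toy DataFrame and the `cols` list are made up for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's features (values are made up for illustration).
toy = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0],
    "x2": [1.0, 2.0, 3.0],
    "x3": [2.0, 3.0, 4.0],
    "x4": [3.0, 4.0, 5.0],
    "x5": [4.0, 5.0, 6.0],
})

# Hypothetical choice: transform only these columns, as the question does.
cols = ["x1", "x2", "x3"]
trans = [np.square, np.exp, np.cos]

# Keep the original columns, then append each transformation of the subset.
expanded = pd.concat([toy] + [t(toy[cols]) for t in trans], axis=1)

# Rename so the new columns continue the numbering: x6..x8 squares,
# x9..x11 exponentials, x12..x14 cosines.
expanded.columns = [f"x{i}" for i in range(1, expanded.shape[1] + 1)]
```

With this layout the column names line up with the ones built by hand in the question.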

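As for the model's performance, as @warped said in the comments, scaling all the data is common practice. Depending on how your data is distributed, you can use different kinds of scalers (see the discussion on standard vs minmax scaler).

Because you are applying nonlinear transformations, even if your original data were normally distributed, the transformed columns will no longer be. For that reason, MinMaxScaler is probably the better choice here: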

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(features.to_numpy())
scaled_features = scaler.transform(features.to_numpy())

Every column of scaled_features now ranges from 0 to 1.
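To make the difference between the two scalers concrete, here is a small sketch on made-up data (a roughly normal column and its exponential). The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: one roughly normal column and its exponential, which is skewed.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, np.exp(x)])

minmax = MinMaxScaler().fit_transform(X)
standard = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is rescaled to span exactly [0, 1].
print(minmax.min(axis=0), minmax.max(axis=0))

# StandardScaler: each column ends up with mean ~0 and standard deviation ~1,
# but the skewed exp column is still skewed after scaling.
print(standard.mean(axis=0), standard.std(axis=0))
```

Neither scaler changes the shape of a column's distribution; they only shift and rescale it.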

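Note that if you fit the scaler before splitting the data with something like train_test_split, you get data leakage: the scaler's minimum and maximum are computed from rows that later end up in the test set, which also hurts the model. Split first, then fit the scaler on the training portion only.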
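One way to avoid the leakage is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. This is a sketch on made-up data, not the asker's dataset; the coefficients and split size are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up data standing in for the transformed features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits MinMaxScaler on the training rows only, then the regression.
model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(X_train, y_train)

# At prediction time the already-fitted scaler is applied to the test rows.
print(model.score(X_test, y_test))
```

With this setup the scaled test rows can fall slightly outside 0 to 1, which is expected, since the scaler only saw the training range.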
