python - sklearn logistic regression predicts all 0s

Posted 2024-04-27 00:58:11


I built a logistic regression model for car loans, with "loan default: yes or no" as the binary dependent variable. I used about 20 independent variables, and the dataset contains 3,327 records.

I split the underlying data into a training set and a test set. However, after fitting the model on the training data and asking it to predict the test data, I get an output of all "0"s, when there should be some "1"s, given that the binary default variable is "1" (defaulted) about 12% of the time in the training set.
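
With only about 12% positives, I understand an unweighted logistic regression can score well by predicting the majority class everywhere. As a minimal sketch of one option I have not tried yet, scikit-learn's class_weight parameter reweights each class inversely to its frequency (model_weighted is just an illustrative name):

from sklearn.linear_model import LogisticRegression

## 'balanced' weights each class by n_samples / (n_classes * class_count),
## so the rare '1' (default) class is not drowned out during fitting
model_weighted = LogisticRegression(class_weight='balanced')
model_weighted.fit(X_train, y_train)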

I looked at the test and training sets and both look fine before and after the split (no missing values, the categorical variables are dummy-coded, and the train/test subsets pick records randomly and correctly), so as far as I can see nothing is broken there.

Interestingly, predict_proba shows that for every output element the predicted probability of "0" is always high (0.7-0.9). I'm not sure how best to correct this, since I would rather keep the default threshold at 0.5, but I don't know how to clean up this mess.
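
For reference, if I did decide to move off the default 0.5 cut-off, my understanding is that thresholding predict_proba by hand would look roughly like this, using the model and X_test from the code below (0.2 is an arbitrary illustrative threshold, not a recommendation):

import numpy as np

probs = model.predict_proba(X_test)  ## column 1 is P(Distressed == 1)
y_pred_custom = (probs[:, 1] >= 0.2).astype(int)  ## flag a default whenever P(1) >= 0.2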

Is it that I need more independent variables, or am I missing something / doing something wrong?

Thanks!

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split ## sklearn.cross_validation was removed; model_selection is the current module
import statsmodels.api as sm
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)


#open the file
data = pd.read_csv(r"log reg test Lending club 2007-2011 and 2014 car only no dummy trap.csv")
print(data.shape)
##print(list(data.columns))

print(data['Distressed'].value_counts()) ## check the counts of the binary default variable

sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary (plt.show() returns None, so printing it is pointless)


##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())

##testing for nulls in dataset
print('Table showing cumulative number of missing data points', data.isnull().sum()) 
scrub_data = data.drop(['mths_since_last_delinq'], axis=1) ## this variable is not statistically significant; axis must be passed by keyword in modern pandas

print('Here is the sample showing no missing data')
print(scrub_data.isnull().sum()) ## the sparse column is dropped, so no missing data remains; sample still sufficiently large
#scrub_data['intercept']=0 
print(list(scrub_data.columns))
print(scrub_data.head())

##convert categorical variables to dummies completed in csv file
## Agrade and Own dummies removed to avoid dummy variable trap and are treated as the base case here
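## For reference, had the encoding been done in pandas instead of the CSV,
## something like this (hypothetical column names) would give the same
## drop-one-level-per-category behaviour:
##   scrub_data = pd.get_dummies(scrub_data, columns=['grade', 'home_ownership'], drop_first=True)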

X = scrub_data.iloc[:, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,22]].values ## .ix was removed from pandas; iloc takes a list of positions
y = scrub_data.iloc[:, 0].values




X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=0) 
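## note: with a ~12% positive class, passing stratify=y above would keep the default rate similar in both splits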


print('Here are the X components', X) 
print('Here are the y components', y) 
print('Here are the X values of the training', X_train) 
print('Here are the y train values', y_train)
print('Here are the y test values', y_test) 

model=LogisticRegression()
model.fit(X_train,y_train) ##Model is learning the relationship between X_train and y_train
y1_pred=model.predict(X_train)
print('y predict of train data', y1_pred)

print('Here is the Model Score', model.score(X_train,y_train)) ##check accuracy of training set
print('What percentage defaulted', y_train.mean()) ##what percentage defaulted
print('What percentage of test set defaulted', y_test.mean()) ##what percentage defaulted

print('X test values', X_test) ## check test subset values
y_pred=model.predict(X_test) 
probs=model.predict_proba(X_test)
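
To quantify the all-zeros behaviour beyond eyeballing y_pred, I believe the standard sklearn checks would be something like:

from sklearn.metrics import confusion_matrix, classification_report

print('Confusion matrix on the test set')
print(confusion_matrix(y_test, y_pred))  ## all-zero predictions leave the predicted-'1' column empty
print(classification_report(y_test, y_pred))  ## per-class precision/recall exposes the ignored '1' class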
