I have generated a DataFrame of probabilities from a scikit-learn classifier, like this:
def preprocess_category_series(series, key):
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s<0] = mode
        return s
    else:
        return pd.get_dummies(series, drop_first=True, prefix=key)

data = df[df.year == 2012]
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1)
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)])
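I now want to append these probabilities back to my original DataFrame. However, the predictions DataFrame generated above keeps the order of the rows in data but loses data's index. I thought I would be able to do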
pd.concat([data, predictions], axis=1, ignore_index=True)
but this raises an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
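I have seen that this can sometimes happen when column names are duplicated, but in this case there are no duplicate column names. So what is this error about, and what is the best way to stitch these DataFrames back together?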
data.head():
                year  serial  hwtfinl                       region  statefip  \
cpsid
20121000000100  2012       1  3796.85  East South Central Division   Alabama
20121000000100  2012       1  3796.85  East South Central Division   Alabama
20121000000100  2012       1  3796.85  East South Central Division   Alabama
20120800000500  2012       6  2814.24  East South Central Division   Alabama
20120800000600  2012       7  2828.42  East South Central Division   Alabama

                county  month  pernum          cpsidp     wtsupp  ...  \
cpsid                                                             ...
20121000000100       0     11       1  20121000000101  3208.1213  ...
20121000000100       0     11       2  20121000000102  3796.8506  ...
20121000000100       0     11       3  20121000000103  3386.4305  ...
20120800000500       0     11       1  20120800000501  2814.2417  ...
20120800000600    1097     11       1  20120800000601  2828.4193  ...

                 race        hispan  educ           votereg  \
cpsid
20121000000100  White  Not Hispanic   111             Voted
20121000000100  White  Not Hispanic   111  Did not register
20121000000100  White  Not Hispanic   111             Voted
20120800000500  White  Not Hispanic    92             Voted
20120800000600  White  Not Hispanic    73  Did not register

                                         educ_parsed      age4         educ4  \
cpsid
20121000000100                     Bachelor's degree       65+  College grad
20121000000100                     Bachelor's degree       65+  College grad
20121000000100                     Bachelor's degree  Under 30  College grad
20120800000500  Associate's degree, academic program     45-64  College grad
20120800000600     High school diploma or equivalent       65+    HS or less

                race4  region4  gender
cpsid
20121000000100  White    South    Male
20121000000100  White    South  Female
20121000000100  White    South  Female
20120800000500  White    South  Female
20120800000600  White    South  Female
predictions.head():
          a         b         c         d         e         f
0  0.119534  0.336761  0.188023  0.136651  0.095342  0.123689
1  0.148409  0.346429  0.134852  0.169661  0.087556  0.113093
2  0.389586  0.195802  0.101738  0.085705  0.114612  0.112557
3  0.277783  0.262079  0.180037  0.102030  0.071171  0.106900
4  0.158404  0.396487  0.088064  0.079058  0.171540  0.106447
Just for fun, I tried this with only the head rows:
pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)
The same error came up.
As it turns out, there is a relatively simple solution:
Now it works perfectly. I have no idea why it did not work the way I originally tried.
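A likely explanation, offered as a sketch rather than a confirmed diagnosis: with axis=1, ignore_index=True only discards the column labels, and the rows are still aligned on the index. The cpsid index of data contains repeated values (20121000000100 appears three times in the head shown above), while predictions has a default 0..n-1 index, so pandas has to reindex against a non-unique index and refuses. One way around this, assuming predictions has exactly one row per row of data and in the same order, is to give predictions the same index as data before concatenating. The small frames below are hypothetical stand-ins for data and predictions, not the original CPS data.

```python
import pandas as pd

# Hypothetical stand-in for `data`: the index label 100 is repeated, like cpsid.
data = pd.DataFrame(
    {"year": [2012, 2012, 2012], "gender": ["Male", "Female", "Female"]},
    index=pd.Index([100, 100, 500], name="cpsid"),
)

# Hypothetical stand-in for `predictions`: same row order, default integer index.
predictions = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [0.9, 0.8, 0.7]})

# Reuse data's index so both frames carry identical row labels.
# This relies on the rows being in the same order in both frames.
predictions.index = data.index

# Concatenate column-wise; without ignore_index the column names are kept.
combined = pd.concat([data, predictions], axis=1)
print(combined)
```

Because the two indexes are now identical, pandas does not need to reindex either frame, so the duplicate labels no longer get in the way.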
I am on 0.18.0 as well. This is what I tried, and it worked. Is this what you are doing?
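Another option, sketched here under the same assumption that the classifier returns one probability row per input row in the original order, is to attach the index when predictions is first built, so the two frames never disagree. In the sketch below, proba and classes are hypothetical stand-ins for clf.predict_proba(factors) and clf.classes_, and data is a small made-up frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`, with a repeated index label like cpsid.
data = pd.DataFrame(
    {"year": [2012, 2012, 2012]},
    index=pd.Index([100, 100, 500], name="cpsid"),
)

# Stand-ins for clf.predict_proba(factors) and clf.classes_ (assumed values).
proba = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]])
classes = ["a", "b"]

# Build the probability frame with data's index from the start.
predictions = pd.DataFrame(proba, columns=classes, index=data.index)

# Identical indexes on both sides, so column-wise concat needs no reindexing.
combined = pd.concat([data, predictions], axis=1)
print(combined)
```

Note that DataFrame.join or merge on the index would not be a drop-in replacement here: with repeated index labels they match every left row against every right row sharing that label, which multiplies rows.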