Combining DataFrames with different indexes in pandas

Posted 2024-03-28 22:23:11

I have generated a DataFrame of probabilities from a scikit-learn classifier, like this:

def preprocess_category_series(series, key):
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s<0] = mode
        return s
    else:
        return pd.get_dummies(series, drop_first=True, prefix=key)

data = df[df.year == 2012]
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1)
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)])

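I now want to attach these probabilities back to my original DataFrame. However, the predictions DataFrame generated above keeps the row order of data but loses data's index. I thought I could do: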

pd.concat([data, predictions], axis=1, ignore_index=True)

But this raises an error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

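I have seen that this can sometimes happen when column names are duplicated, but in this case no column names are duplicated. What does the error mean, and what is the best way to stitch these DataFrames back together?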

data.head()

                year serial  hwtfinl                       region statefip  \
cpsid                                                                        
20121000000100  2012      1  3796.85  East South Central Division  Alabama   
20121000000100  2012      1  3796.85  East South Central Division  Alabama   
20121000000100  2012      1  3796.85  East South Central Division  Alabama   
20120800000500  2012      6  2814.24  East South Central Division  Alabama   
20120800000600  2012      7  2828.42  East South Central Division  Alabama   

                county  month  pernum          cpsidp     wtsupp   ...    \
cpsid                                                              ...     
20121000000100       0     11       1  20121000000101  3208.1213   ...     
20121000000100       0     11       2  20121000000102  3796.8506   ...     
20121000000100       0     11       3  20121000000103  3386.4305   ...     
20120800000500       0     11       1  20120800000501  2814.2417   ...     
20120800000600    1097     11       1  20120800000601  2828.4193   ...     

                 race        hispan educ           votereg  \
cpsid                                                        
20121000000100  White  Not Hispanic  111             Voted   
20121000000100  White  Not Hispanic  111  Did not register   
20121000000100  White  Not Hispanic  111             Voted   
20120800000500  White  Not Hispanic   92             Voted   
20120800000600  White  Not Hispanic   73  Did not register   

                                         educ_parsed      age4         educ4  \
cpsid                                                                          
20121000000100                     Bachelor's degree       65+  College grad   
20121000000100                     Bachelor's degree       65+  College grad   
20121000000100                     Bachelor's degree  Under 30  College grad   
20120800000500  Associate's degree, academic program     45-64  College grad   
20120800000600     High school diploma or equivalent       65+    HS or less   

                race4 region4  gender  
cpsid                                  
20121000000100  White   South    Male  
20121000000100  White   South  Female  
20121000000100  White   South  Female  
20120800000500  White   South  Female  
20120800000600  White   South  Female  

predictions.head()

          a         b         c         d         e         f
0  0.119534  0.336761  0.188023  0.136651  0.095342  0.123689
1  0.148409  0.346429  0.134852  0.169661  0.087556  0.113093
2  0.389586  0.195802  0.101738  0.085705  0.114612  0.112557
3  0.277783  0.262079  0.180037  0.102030  0.071171  0.106900
4  0.158404  0.396487  0.088064  0.079058  0.171540  0.106447

Just for fun, I tried this with only the first few rows:

pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)

The same error occurs.


2 answers

It turns out there is a relatively simple solution:

predictions.index = data.index
pd.concat([data, predictions], axis=1)

Now it works fine. I'm not sure why it didn't work the way I originally tried.
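A likely explanation, judging from data.head(): the cpsid index of data contains repeated labels (20121000000100 appears three times), while predictions has a fresh 0..n-1 index. pd.concat with axis=1 aligns rows by index label, and ignore_index=True only renumbers the labels along the concatenation axis (the columns here), so pandas still has to reindex a non-unique index and fails. The sketch below reproduces this on small made-up frames; the values and the short index labels are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical toy frames standing in for `data` and `predictions`.
data = pd.DataFrame(
    {"year": [2012, 2012, 2012]},
    index=pd.Index([100, 100, 200], name="cpsid"),  # label 100 is duplicated
)
predictions = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [0.9, 0.8, 0.7]})

# ignore_index=True renumbers the column labels only; rows are still aligned
# by index label, which requires reindexing the non-unique cpsid index.
try:
    pd.concat([data, predictions], axis=1, ignore_index=True)
except Exception as exc:  # expected to be an InvalidIndexError
    print(type(exc).__name__, exc)

# Positional join: copy the original index onto a copy of predictions first.
aligned = predictions.copy()
aligned.index = data.index
combined = pd.concat([data, aligned], axis=1)
print(combined)
```

Once both frames carry identical indexes, no reindexing is needed, which is presumably why assigning predictions.index = data.index makes the concat succeed even though the labels are still duplicated. This relies on predictions being in the same row order as data.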

I'm on 0.18.0 as well. This is what I tried, and it worked. Is this what you are doing?

import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X,Y)
import pandas as pd
data = pd.DataFrame(X)
data['y']=Y
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(X)])
pd.concat([data, predictions], axis=1, ignore_index=True)
   0  1  2             3             4
0 -1 -1  1  1.000000e+00  1.522998e-08
1 -2 -1  1  1.000000e+00  3.775135e-11
2 -3 -2  1  1.000000e+00  5.749523e-19
3  1  1  2  1.522998e-08  1.000000e+00
4  2  1  2  3.775135e-11  1.000000e+00
5  3  2  2  5.749523e-19  1.000000e+00
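This reproduction most likely succeeds because data here has the default 0..5 index, which is unique and identical to the index of predictions, so no reindexing takes place. When the original frame has a non-default index with repeated labels, two options are sketched below: build the probability frame on the original index from the start, or move the index into a column so that both frames share a default index. The names proba and classes are hypothetical stand-ins for clf.predict_proba(factors) and clf.classes_, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: `proba` plays the role of clf.predict_proba(factors),
# `classes` the role of clf.classes_.
proba = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]])
classes = ["a", "b"]

# Toy frame with a duplicated index label, like the cpsid index in the question.
data = pd.DataFrame(
    {"year": [2012, 2012, 2012]},
    index=pd.Index([100, 100, 200], name="cpsid"),
)

# Option 1: build predictions on the original index from the start.
predictions = pd.DataFrame(proba, columns=classes, index=data.index)
by_label = pd.concat([data, predictions], axis=1)

# Option 2: turn cpsid into a regular column so both frames use 0..n-1.
flat = pd.concat(
    [data.reset_index(), pd.DataFrame(proba, columns=classes)],
    axis=1,
)

print(by_label)
print(flat)
```

Both options assume the rows of the probability array are in the same order as the rows of data. Option 1 keeps cpsid as the index; option 2 keeps it as an ordinary column.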
