pandas: condensing multiple columns that hold the same value

Posted 2024-05-16 21:26:03


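I'm trying to integrate several data sources, and I'm finding it a bit tricky to combine my columns concisely. I have a DataFrame that looks like this: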

import numpy as np
import pandas as pd

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
     [ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
                  [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
                 columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

print(df)

       Address1      City1      State1      Zip1      Address2      City2  \
0  Address Data  City Data  State Data  Zip Data           NaN        NaN   
1           NaN        NaN         NaN       NaN  Address Data  City Data   
2           NaN        NaN         NaN       NaN           NaN        NaN   

       State2      Zip2      Address3      City3      State3      Zip3  
0         NaN       NaN           NaN        NaN         NaN       NaN  
1  State Data  Zip Data           NaN        NaN         NaN       NaN  
2         NaN       NaN  Address Data  City Data  State Data  Zip Data 

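I'm trying to combine these into a single set of Address, City, State and Zip columns, keeping one row per original row.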

I was hoping I could set several columns at once with a line like this:

df.loc[df['State1'].notnull(),['Address','City','State','Zip']] = df.loc[df['State1'].notnull(),['Address1','City1','State1','Zip1']].values

However, .loc does not allow creating several columns at once. Any ideas on how to combine the data would be greatly appreciated!

EDIT, based on the solution below

The problem with that solution is that my data may look more like this:

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', 'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
 [ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
              [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
             columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

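In that case, the solution below gives me 4 rows, when I only want the original 3. What I'm trying to express is: "if part 1 has data, use part 1; otherwise, if part 2 has data, use part 2; otherwise, if part 3 has data, use part 3". Each row needs to stay distinct, because the rows carry many other attributes that I am not merging. Thanks!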


3 answers

I think this solution is completely different, so I decided to post it as a new answer:

import pandas as pd
import numpy as np

#random dataframe
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(3,9)))
df1.columns = ['a1','b1','c1','a2','b2','c2','a3','b3','c3']

df1.loc[[1,2],['a1','b1','c1']] = np.nan
print (df1)
    a1   b1   c1  a2  b2  c2  a3  b3  c3
0  5.0  8.0  9.0   5   0   0   1   7   6
1  NaN  NaN  NaN   5   2   4   2   4   7
2  NaN  NaN  NaN   7   0   6   9   9   7

#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#add 1 to d (row index is 0-based, column suffixes start at 1) and keep rows where it equals h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
#remove index and columns names (pandas 0.18.0+)
df = df.rename_axis(None).rename_axis(None, axis=1)

print (df)
     a    b    c
0  5.0  8.0  9.0
1  5.0  2.0  4.0
2  9.0  9.0  7.0

EDIT: I tried modifying your sample:

^{pr2}$
#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#add 1 to d (row index is 0-based, column suffixes start at 1) and keep rows where it equals h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')

df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
         Address        City        State        Zip
0  Address Data1  City data1  State Data1  Zip Data1
1  Address Data3  City data3  State Data3  Zip Data3
2  Address Data4  City data4  State Data4  Zip Data4
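
The filter `df.d + 1 == df.h.astype(int)` above pairs row 0 with suffix 1, row 1 with suffix 2, and so on, so it relies on the data sitting in that diagonal layout. A minimal sketch of a variant that instead keeps, for each row and field, the lowest-numbered suffix that has data is shown below. It swaps `stack` for `melt`, and the sample values and the names `long_df`, `first` and `out` are hypothetical, not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 0 has data under both suffix 1 and suffix 2.
df1 = pd.DataFrame(
    [["A1", "C1", "A2", "C2", np.nan, np.nan],
     [np.nan, np.nan, "A3", "C3", np.nan, np.nan],
     [np.nan, np.nan, np.nan, np.nan, "A4", "C4"]],
    columns=["Address1", "City1", "Address2", "City2", "Address3", "City3"],
)

# Long format: one line per (row, column) pair, with empty cells removed.
long_df = (df1.reset_index()
              .melt(id_vars="index", var_name="col", value_name="value")
              .dropna(subset=["value"]))

# Split e.g. 'Address2' into the field name 'Address' and the suffix 2.
long_df[["field", "suffix"]] = long_df["col"].str.extract(r"([A-Za-z]+)(\d+)")
long_df["suffix"] = long_df["suffix"].astype(int)

# For every (row, field) keep only the entry with the smallest suffix.
first = long_df.sort_values("suffix").drop_duplicates(["index", "field"])

# Back to wide format: one row per original row, one column per field.
out = first.pivot(index="index", columns="field", values="value")
print(out)
```

Because the choice is made per field, a row whose first group is only partly filled would take its missing fields from a later group.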

IIUC, you can first select the column names with list comprehensions and then use pd.lreshape:

a = ([col for col in df.columns if col.startswith('Address')])
c = ([col for col in df.columns if col.startswith('City')])
s = ([col for col in df.columns if col.startswith('State')])
z = ([col for col in df.columns if col.startswith('Zip')])

print (a)
print (c)
print (s)
print (z)
['Address1', 'Address2', 'Address3']
['City1', 'City2', 'City3']
['State1', 'State2', 'State3']
['Zip1', 'Zip2', 'Zip3']

df1 = pd.lreshape(df, {'Address':a,  'City' :c, 'State':s,  'Zip' :z}) 
print (df1)

        Address       State       City       Zip
0  Address Data  State Data  City data  Zip Data
1  Address Data  State Data  City data  Zip Data
2  Address Data  State Data  City data  Zip Data

EDIT:

If you need to keep the rows that contain NaN values instead of dropping them, pass dropna=False to pd.lreshape.
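
As a rough illustration of what `dropna=False` changes, the sketch below runs `pd.lreshape` both ways on a frame laid out like the question's first sample. The shortened cell values and the names `fields`, `groups`, `dropped` and `kept` are made up for this example, and it assumes your pandas version still provides `pd.lreshape`.

```python
import numpy as np
import pandas as pd

# Same diagonal layout as the question's first sample, with shortened values.
fields = ("Address", "City", "State", "Zip")
columns = [f"{f}{i}" for i in (1, 2, 3) for f in fields]
df = pd.DataFrame(
    [
        ["a1", "c1", "s1", "z1"] + [np.nan] * 8,
        [np.nan] * 4 + ["a2", "c2", "s2", "z2"] + [np.nan] * 4,
        [np.nan] * 8 + ["a3", "c3", "s3", "z3"],
    ],
    columns=columns,
)

# Map each target column to the source columns that share its prefix.
groups = {f: [c for c in df.columns if c.startswith(f)] for f in fields}

dropped = pd.lreshape(df, groups)             # default: rows with NaN are dropped
kept = pd.lreshape(df, groups, dropna=False)  # every row/group combination is kept

print(len(dropped), len(kept))
```

With three rows and three column groups, the unfiltered result has nine rows, six of which are entirely NaN here.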

Another solution uses numpy, but it requires the data to be laid out in a regular order (it is in the sample, but real data may not be):

print (pd.DataFrame((df.values.reshape(9,4)), columns=['Address','City','State','Zip'])
         .dropna())

        Address       City       State       Zip
0  Address Data  City data  State Data  Zip Data
4  Address Data  City data  State Data  Zip Data
8  Address Data  City data  State Data  Zip Data

How about this?

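# strip the trailing digits; regex=True is required on pandas 2.0+, where the default is a literal match
uniq_cols = df.columns.str.replace(r'\d+$', '', regex=True).unique()

new = pd.DataFrame(columns=uniq_cols)

for col in uniq_cols:
    # per row, take the first non-null value among the columns whose name contains col
    new[col] = df.filter(like=col).apply(lambda x: x.dropna().iloc[0], axis=1)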

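Output:

        Address       City       State       Zip
0  Address Data  City data  State Data  Zip Data
1  Address Data  City data  State Data  Zip Data
2  Address Data  City data  State Data  Zip Data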

Explanation:

In [283]: uniq_cols
Out[283]: array(['Address', 'City', 'State', 'Zip'], dtype=object)

In [284]: df.filter(like='Address').apply(lambda x: x.dropna().iloc[0], axis=1)
Out[284]:
0    Address Data
1    Address Data
2    Address Data
dtype: object
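
The same "first non-null value per row" idea can probably be written without a row-wise `apply`, by back-filling along the columns and taking the leftmost column of each group. The following is only a sketch under that assumption: the sample values and the name `result` are invented for illustration, and `bfill(axis=1)` is a swapped-in technique rather than what the answer above uses.

```python
import numpy as np
import pandas as pd

fields = ("Address", "City", "State", "Zip")
cols = [f"{name}{i}" for i in (1, 2, 3) for name in fields]

# Hypothetical data shaped like the question's second sample:
# row 0 has both group 1 and group 2 filled in.
df = pd.DataFrame(
    [
        ["A1", "C1", "S1", "Z1", "A2", "C2", "S2", "Z2"] + [np.nan] * 4,
        [np.nan] * 4 + ["A3", "C3", "S3", "Z3"] + [np.nan] * 4,
        [np.nan] * 8 + ["A4", "C4", "S4", "Z4"],
    ],
    columns=cols,
)

result = pd.DataFrame(index=df.index)
for field in fields:
    # Columns such as Address1, Address2, Address3, in their original order.
    group = df.filter(regex=rf"^{field}\d+$")
    # Back-filling along the columns moves each row's first non-null
    # value into the leftmost column, which is then selected.
    result[field] = group.bfill(axis=1).iloc[:, 0]

print(result)
```

This keeps exactly one output row per input row, which matches the "use part 1, else part 2, else part 3" rule in the question. Note that the rule is applied per field, so a row with a partly filled first group would mix values from different groups.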
