How do I combine columns after a groupby and pick the first valid value of each of the other columns in a DataFrame?

ID  col_1  col_2  col_3  Date
1          20     40     1/1/2018
1   10                   1/2/2018
1   50            60     1/3/2018
3   40     10     90     1/1/2018
4          80     80     1/1/2018

ID  first_col_1  Date_col_1  first_col_2  Date_col_2  first_col_3  Date_col_3
1   10           1/2/2018    20           1/1/2018    40           1/1/2018
3   40           1/1/2018    10           1/1/2018    90           1/1/2018
4                1/1/2018    80           1/1/2018    80           1/1/2018

3 answers

User

Answer 1 · edited 2024-04-27 05:01:12

You don't need a loop, but you do need to "melt" your DataFrame before the groupby operation.

So, first:

from io import StringIO
import pandas
f = StringIO("""\
ID,col_1,col_2,col_3,Date
1,,20,40,1/1/2018
1,10,,,1/2/2018
1,50,,60,1/3/2018
3,40,10,90,1/1/2018
4,,80,80,1/1/2018
""")

df = pandas.read_csv(f)

Then you can:

print(
    df.melt(id_vars=['ID', 'Date'], value_vars=['col_1', 'col_2', 'col_3'], value_name='first')
      .groupby(by=['ID', 'variable'])
      .first()
      .unstack(level='variable')
)

This gives you:

              Date                     first            
variable     col_1     col_2     col_3 col_1 col_2 col_3
ID                                                      
1         1/1/2018  1/1/2018  1/1/2018  10.0  20.0  40.0
3         1/1/2018  1/1/2018  1/1/2018  40.0  10.0  90.0
4         1/1/2018  1/1/2018  1/1/2018   NaN  80.0  80.0
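To see why melting first helps, it can be useful to look at the intermediate long-format frame. The sketch below rebuilds the same sample data and prints the melted rows before grouping; the names `csv`, `long_df` and `picked` are only illustrative.

```python
from io import StringIO

import pandas as pd

# Same sample data as above
csv = StringIO("""\
ID,col_1,col_2,col_3,Date
1,,20,40,1/1/2018
1,10,,,1/2/2018
1,50,,60,1/3/2018
3,40,10,90,1/1/2018
4,,80,80,1/1/2018
""")
df = pd.read_csv(csv)

# Long format: one row per (ID, Date, variable), with the cell value in "first"
long_df = df.melt(
    id_vars=['ID', 'Date'],
    value_vars=['col_1', 'col_2', 'col_3'],
    value_name='first',
)
print(long_df.head(6))

# GroupBy.first() takes the first non-null entry of each column separately,
# so "Date" and "first" may come from different rows of the same group
picked = long_df.groupby(['ID', 'variable']).first()
print(picked.loc[(1, 'col_1')])
```

Each original column becomes a block of rows tagged by `variable`, which is what lets a single groupby on `['ID', 'variable']` handle all three columns at once.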

The columns are multi-level, so if you like, we can flatten them a bit:

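def flatten_columns(df, sep='_'):
    # Join each (level_0, level_1) column tuple into a single string
    newcols = [sep.join(col) for col in df.columns]
    # set_axis returns a new object; the inplace argument was removed in pandas 2.0
    return df.set_axis(newcols, axis='columns')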

print(
    df.melt(id_vars=['ID', 'Date'], value_vars=['col_1', 'col_2', 'col_3'], value_name='first')
      .groupby(by=['ID', 'variable'])
      .first()
      .unstack(level='variable')
      .sort_index(level='variable', axis='columns')
      .pipe(flatten_columns)
)

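This gives you something whose column order differs slightly from your example, but I think it's close. Note that each Date here is the first Date in the group, not the date of the row that supplied the first valid value: for ID 1, Date_col_1 shows 1/1/2018 where your example has 1/2/2018.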

   Date_col_1  first_col_1 Date_col_2  first_col_2 Date_col_3  first_col_3
ID                                                                        
1    1/1/2018         10.0   1/1/2018         20.0   1/1/2018         40.0
3    1/1/2018         40.0   1/1/2018         10.0   1/1/2018         90.0
4    1/1/2018          NaN   1/1/2018         80.0   1/1/2018         80.0
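If each Date should belong to the row that supplied the first valid value, one possible tweak is to drop the empty rows right after melting. This is only a sketch under that assumption about the intended result; the `dropna` step is not part of the original answer, and `result` is an illustrative name.

```python
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO("""\
ID,col_1,col_2,col_3,Date
1,,20,40,1/1/2018
1,10,,,1/2/2018
1,50,,60,1/3/2018
3,40,10,90,1/1/2018
4,,80,80,1/1/2018
"""))

result = (
    df.melt(id_vars=['ID', 'Date'], value_name='first')
      .dropna(subset=['first'])      # keep only rows that actually hold a value
      .groupby(['ID', 'variable'])
      .first()                       # first remaining row per (ID, variable)
      .unstack(level='variable')
      .sort_index(level='variable', axis='columns')
)

# Flatten the two-level columns into names such as "Date_col_1"
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```

With this variant, ID 1 gets 1/2/2018 for Date_col_1. ID 4 has no valid col_1 at all, so both its value and its date come out as NaN.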

User

Answer 2 · edited 2024-04-27 05:01:12

I think you have to loop over the columns and extract each column's first valid value before concatenating. I couldn't find a simpler way.

import pandas as pd

# Create a list to store the DataFrames you want for each column.
# IDs are sorted to match the key order that groupby produces.
sub_df = [pd.DataFrame(sorted(df['ID'].unique()), columns=['ID'])]

for col in df.columns[1:-1]:  # loop over the columns (except ID and Date)

    # Index label of the first valid row for this column within each ID group
    # (missing when a group has no valid value at all)
    valid_rows = df.groupby('ID')[col].apply(lambda s: s.first_valid_index())

    # Extract the values and dates for these rows; .ix was removed from pandas,
    # and reindex (unlike .loc) tolerates the missing labels
    new_sub_df = df[[col, 'Date']].reindex(valid_rows).reset_index(drop=True)

    # Give each value/date pair its own column names
    new_sub_df.columns = ['first_' + col, 'Date_' + col]

    # Append to the list of sub DataFrames
    sub_df.append(new_sub_df)

# Concatenate all these DataFrames.
new_df = pd.concat(sub_df, axis=1)
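The loop relies on `first_valid_index`, so it may help to see what that method returns per group. The toy frame below is a hypothetical stand-in that mirrors only `ID` and `col_1` from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring ID and col_1 from the question
df = pd.DataFrame({
    'ID': [1, 1, 1, 3, 4],
    'col_1': [np.nan, 10, 50, 40, np.nan],
})

# Row label of the first non-NaN value within each ID group;
# iterating over a SeriesGroupBy yields (key, sub-Series) pairs
first_idx = {
    key: group.first_valid_index()
    for key, group in df.groupby('ID')['col_1']
}
print(first_idx)
```

ID 4 has no valid `col_1`, so its entry is `None`. Any lookup built on these labels has to tolerate that missing case.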

User

Answer 3 · edited 2024-04-27 05:01:12

IIUC, use melt before the groupby:

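# Drop empty cells (NaN or '') so that first() picks the row holding the first valid value
newdf = df.melt(['ID', 'Date']).loc[lambda x: x['value'].notna() & (x['value'] != '')]

newdf = newdf.groupby(['ID', 'variable']).first().unstack().sort_index(level=1, axis=1)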

newdf.columns=newdf.columns.map('_'.join)
newdf
   Date_col_1  value_col_1 Date_col_2  value_col_2 Date_col_3  value_col_3
ID                                                                        
1    1/2/2018         10.0   1/1/2018         20.0   1/1/2018         40.0
3    1/1/2018         40.0   1/1/2018         10.0   1/1/2018         90.0
4        None          NaN   1/1/2018         80.0   1/1/2018         80.0
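The filter on `''` suggests the blanks may arrive as empty strings rather than NaN. Assuming that is the case, a rough sketch is to coerce the melted values to numbers first and then put the columns in the order the question asks for. The frame `raw` and the list `ordered` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical frame where blank cells are empty strings
raw = pd.DataFrame({
    'ID': [1, 1, 1, 3, 4],
    'col_1': ['', 10, 50, 40, ''],
    'col_2': [20, '', '', 10, 80],
    'col_3': [40, '', 60, 90, 80],
    'Date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/1/2018', '1/1/2018'],
})

long_df = raw.melt(['ID', 'Date'], value_name='first')
# Empty strings cannot be parsed as numbers, so they are coerced to NaN
long_df['first'] = pd.to_numeric(long_df['first'], errors='coerce')
long_df = long_df.dropna(subset=['first'])

newdf = long_df.groupby(['ID', 'variable']).first().unstack()

# Order the columns as in the question: first_col_1, Date_col_1, ...
ordered = [(kind, var)
           for var in ['col_1', 'col_2', 'col_3']
           for kind in ['first', 'Date']]
newdf = newdf[ordered]
newdf.columns = ['_'.join(col) for col in newdf.columns]
print(newdf)
```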
