Python: How do I find a time difference based on the maximum of another column?

Posted 2024-05-15 06:31:13


I've been trying to find, for each person in this dataset, the time spent on the activity they were most involved in:

              name  activity           timestamp  money_spent
0    Chandler Bing     party 2017-08-04 08:00:00           51
1    Chandler Bing     party 2017-08-04 13:00:00           60
2    Chandler Bing     party 2017-08-04 15:00:00           59
5       Harry Kane     party 2017-08-04 07:00:00           68
4       Harry Kane     party 2017-08-04 11:00:00           90
3       Harry Kane  football 2017-08-04 13:00:00           80
11  Joey Tribbiani  football 2017-08-04 08:00:00           84
9   Joey Tribbiani     party 2017-08-04 09:00:00           54
10  Joey Tribbiani     party 2017-08-04 10:00:00           67
6         John Doe     beach 2017-08-04 07:00:00           63
7         John Doe     beach 2017-08-04 12:00:00           61
8         John Doe     beach 2017-08-04 14:00:00           65
12   Monica Geller    travel 2017-08-04 07:00:00           90
13   Monica Geller    travel 2017-08-04 08:00:00           96
14   Monica Geller    travel 2017-08-04 09:00:00           74
15   Phoebe Buffey    travel 2017-08-04 10:00:00           52
16   Phoebe Buffey    travel 2017-08-04 12:00:00           84
17   Phoebe Buffey  football 2017-08-04 15:00:00           58
18     Ross Geller     party 2017-08-04 09:00:00           96
19     Ross Geller     party 2017-08-04 11:00:00           81
20     Ross Geller    travel 2017-08-04 14:00:00           60

df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')

df # party day 2017-08-04 for some guys.
# find most involved activity and time spent on that activity per person.

Desired output:

                activity_num activity time_diff
name                                           
Chandler Bing            1.0    party  07:00:00
Harry Kane               2.0    party  04:00:00
Joey Tribbiani           2.0    party  02:00:00
John Doe                 1.0    beach  07:00:00
Monica Geller            1.0   travel  02:00:00
Phoebe Buffey            2.0   travel  03:00:00
Ross Geller              2.0   travel  03:00:00

Note: Harry Kane was at the party from 7 am to 11 am, so his result is 4 hours.
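
To make the arithmetic in the note concrete, here is a minimal sketch using only Harry Kane's two party timestamps from the table above. The variable names `start`, `end`, and `time_diff` are illustrative and not part of the original question.

```python
import pandas as pd

# Harry Kane's earliest and latest "party" timestamps from the question's data
start = pd.Timestamp("2017-08-04 07:00:00")
end = pd.Timestamp("2017-08-04 11:00:00")

# Subtracting one Timestamp from another yields a Timedelta
time_diff = end - start
print(time_diff)  # 0 days 04:00:00
```

The same subtraction, applied per (name, activity) group to the latest and earliest timestamps, is what the question is asking for.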

df.head()
              name  activity           timestamp  money_spent
0    Chandler Bing     party 2017-08-04 08:00:00           51
1    Chandler Bing     party 2017-08-04 13:00:00           60
2    Chandler Bing     party 2017-08-04 15:00:00           59
3       Harry Kane  football 2017-08-04 13:00:00           80
4       Harry Kane     party 2017-08-04 11:00:00           90
5       Harry Kane     party 2017-08-04 07:00:00           68

My attempt:

df.groupby(['name','activity'])['timestamp'].max() # no idea

3 answers

This is definitely (probably) not the way to do it, but here goes:

### Get the rows with the max and min timestamp into separate dataframes
df_max = df.loc[df.groupby(['name','activity',])['timestamp'].idxmax()].reset_index(drop=True)
df_min = df.loc[df.groupby(['name','activity',])['timestamp'].idxmin()].reset_index(drop=True)

### Merge those puppies on the index values
df_tot = df_max.merge(df_min, how='outer', left_index=True, right_index=True, suffixes= ('_max', '_min'))

### Subtract the minimum timestamp from the max timestamp
df_tot['net time'] = df_tot['timestamp_max'] - df_tot['timestamp_min']

### Drop unnecessary columns
df_tot.drop(['name_min','activity_min','timestamp_min','money_spent_min', 'money_spent_max','timestamp_max'], axis=1, inplace=True)

### Rename our columns
df_tot = df_tot.rename(columns={i:i.replace('_max', '') for i in df_tot.columns.values.tolist()})

### Set activity_number as the cumulative count of name
df_tot['activity_number'] = df_tot.groupby('name').cumcount() + 1

### Get the max of that result
df_tot = df_tot.loc[df_tot.groupby(['name',])['net time'].idxmax()].reset_index(drop=True)

### Rearrange our results
df_tot = df_tot.reindex(columns=['name','activity_number', 'net time']).copy()

Output:

             name  activity_number net time
0   Chandler Bing                1 07:00:00
1      Harry Kane                2 04:00:00
2  Joey Tribbiani                2 01:00:00
3        John Doe                1 07:00:00
4   Monica Geller                1 02:00:00
5   Phoebe Buffey                2 02:00:00
6     Ross Geller                1 02:00:00

Check the following:

# Series.ptp was removed in pandas 1.0; max() - min() gives the same per-group range
s=df.groupby(['name','activity']).timestamp.apply(lambda x: x.max() - x.min()).reset_index()
#s=df.groupby(['name','activity']).timestamp.apply(np.ptp).reset_index()
uni=s.groupby('name').activity.nunique()
s=s.sort_values('timestamp').drop_duplicates('name',keep='last')
s['numberofact']=s.name.map(uni)
s
             name activity timestamp  numberofact
4  Joey Tribbiani    party  01:00:00            2
6   Monica Geller   travel  02:00:00            1
8   Phoebe Buffey   travel  02:00:00            2
9     Ross Geller    party  02:00:00            2
2      Harry Kane    party  04:00:00            2
0   Chandler Bing    party  07:00:00            1
5        John Doe    beach  07:00:00            1

Try this:

gb = df.groupby(['name', 'activity'])['timestamp']

print((gb.max() - gb.min()).sort_values(ascending=False).reset_index().drop_duplicates(subset='name'))

Output:

             name activity timestamp
0        John Doe    beach  07:00:00
1   Chandler Bing    party  07:00:00
2      Harry Kane    party  04:00:00
3     Ross Geller    party  02:00:00
4   Phoebe Buffey   travel  02:00:00
5   Monica Geller   travel  02:00:00
6  Joey Tribbiani    party  01:00:00
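
The output above lacks the `activity_num` column from the desired output. Below is a minimal sketch of one way to add it, assuming `activity_num` means the number of distinct activities per person (an interpretation, not something the question states). To stay short it uses only the Harry Kane and John Doe rows. The names `spans` and `result` are illustrative.

```python
import pandas as pd

# A subset of the question's rows (two people) keeps the sketch short
df = pd.DataFrame({
    "name": ["Harry Kane"] * 3 + ["John Doe"] * 3,
    "activity": ["party", "party", "football", "beach", "beach", "beach"],
    "timestamp": pd.to_datetime([
        "2017-08-04 07:00:00", "2017-08-04 11:00:00", "2017-08-04 13:00:00",
        "2017-08-04 07:00:00", "2017-08-04 12:00:00", "2017-08-04 14:00:00",
    ]),
})

# Time span (latest minus earliest timestamp) per person and activity
gb = df.groupby(["name", "activity"])["timestamp"]
spans = (gb.max() - gb.min()).rename("time_diff").reset_index()

# Keep each person's longest-running activity
result = (
    spans.sort_values("time_diff", ascending=False)
         .drop_duplicates(subset="name")
         .set_index("name")
         .sort_index()
)

# Assumption: activity_num = number of distinct activities per person.
# insert() aligns the Series on the "name" index.
result.insert(0, "activity_num", df.groupby("name")["activity"].nunique())
print(result)
```

If a person has two activities with the same span, which one survives `drop_duplicates` depends on the sort order, so ties may need an explicit rule.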
