按重现日期分组

2024-06-12 01:24:36 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在分析熊猫的数据集,并试图根据日期条件将百老汇的节目分组。下面是这个问题的简化数据集

+-------------------+-------------+---------+-------+
|       Show        | Week_ending | Theatre | gross |
+-------------------+-------------+---------+-------+
| Mamma Mia         | 05/01/2020  | T1      |   100 |
| Mamma Mia         | 12/01/2020  | T1      |   200 |
| Mamma Mia         | 19/01/2020  | T1      |   150 |
| Shrek The Musical | 19/01/2020  | T2      |   100 |
| Mamma Mia         | 11/08/2019  | T3      |   100 |
| Mamma Mia         | 18/08/2019  | T3      |   100 |
| Mamma Mia         | 27/12/2009  | T1      |   100 |
| Mamma Mia         | 03/10/2010  | T1      |   100 |
| Mamma Mia         | 10/01/2010  | T1      |   100 |
+-------------------+-------------+---------+-------+

我想做的是将这些节目分组,以便进一步比较。在不同剧院上演的同名节目应该分开,我是用

df2 = df.groupby(['show', 'theatre']).mean()

但是,这会按此方式对项目进行分组

+-------------------+---------+------------+
|       Show        | Theatre | mean_gross |
+-------------------+---------+------------+
| Mamma Mia         | T1      |        100 |
| Shrek The Musical | T2      |        100 |
| Mamma Mia         | T3      |        100 |
+-------------------+---------+------------+

这是一个问题,因为它将发生在同一剧院(T1)的两个《妈妈咪呀》节目分组在一起,尽管它们相隔10年。期望的结果将是这样的(只要它们是快速的,(2009)不一定是必需的)

+-------------------+---------+------------+
|       Show        | Theatre | mean_gross |
+-------------------+---------+------------+
| Mamma Mia         | T1      |        100 |
| Shrek The Musical | T2      |        100 |
| Mamma Mia         | T3      |        100 |
| Mamma Mia (2009)  | T1      |        100 |
+-------------------+---------+------------+

如果一个节目在同一个剧院,但在没有记录信息的情况下,两个节目之间有一周的间隔,则可以将其视为单独的节目。仅仅按年份分开是行不通的,因为一个节目可能会从11月持续到3月,而把它分成两个独立的节目是没有意义的


Tags: the数据showmean节目t1t3t2
1条回答
网友
1楼 · 发布于 2024-06-12 01:24:36

基本上,您需要创建另一列来标记日期更改超过7天的情况

import pandas as pd

df = pd.DataFrame({'Show':['Mamma Mia', 'Mamma Mia', 'Mamma Mia', 'Shrek The Musical',
                   'Mamma Mia', 'Mamma Mia', 'Mamma Mia', 'Mamma Mia', 'Mamma Mia'] , 
                   'Week_ending':['05/01/2020', '12/01/2020', '19/01/2020', '19/01/2020', '11/08/2019', '18/08/2019', '27/12/2009', '03/10/2010', '10/01/2010'],
                   'Theatre':['T1', 'T1', 'T1', 'T2', 'T3', 'T3', 'T1', 'T1','T1'],
                  'gross':[100, 200, 150, 100, 100, 100, 100, 100, 100]})

# Change `Week_ending` to datetime
df['Week_ending'] = pd.to_datetime(df['Week_ending'], format="%d/%m/%Y")

# sort
df = df.sort_values(['Show', 'Theatre', 'Week_ending'])

# find the difference in days between two consequitive dates within the same group
df['days_diff'] = df.groupby(['Show', 'Theatre'])['Week_ending'].diff().dt.days.fillna(0)
# Check if the days difference is more than 7 (days)
df['days_diff'] = df['days_diff'] > 7 

# create a key column that increment by 1 everytime the days are more than 7
df['key'] = df.groupby(['Show', 'Theatre'])['days_diff'].cumsum()

# resulting dataframe
df_final = df.groupby(['Show', 'Theatre', 'key'])[['gross']].mean().reset_index().drop(columns=['key'])

df_final

+         -+    -+      +
|       Show        | Theatre | mean_gross |
+         -+    -+      +
| Mamma Mia         | T1      |        100 |
| Mamma Mia         | T1      |        100 |
| Mamma Mia         | T1      |        100 |
| Mamma Mia         | T1      |        150 |
| Mamma Mia         | T3      |        100 |
| Shrek The Musical | T1      |        100 |
+         -+    -+      +

相关问题 更多 >