Pandas: select the latest day of each week, excluding weekends unless it is the only record

import numpy as np
import pandas as pd

dates = pd.Series(data=['2018-11-05', '2018-11-06', '2018-11-07', '2018-11-08', '2018-11-09',
                        '2018-11-12', '2018-11-13', '2018-11-14', '2018-11-15', '2018-11-17',
                        '2018-11-19', '2018-12-01', ])
nums = np.random.randint(50, 100, 12)
# nums
# array([95, 80, 81, 51, 98, 62, 50, 55, 59, 77, 69])
df = pd.DataFrame(data={'dates': dates, 'nums': nums})
df['dates'] = pd.to_datetime(df['dates'])

2 Answers

User

Answer 1 · edited 2024-06-16 09:42:58

Create a new weekday ranking in which Saturday and Sunday have the lowest priority. Then sort_values on this new rank + groupby + .tail(1).

import numpy as np

wd_map = dict(zip(np.arange(0,7,1), np.roll(np.arange(0,7,1),-2)))
# {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 0, 6: 1}
df = df.assign(day_mapped = df.dates.dt.weekday.map(wd_map)).sort_values('day_mapped')

# Series.dt.week was removed in pandas 2.0; dt.isocalendar().week is its replacement
df.groupby(df.dates.dt.isocalendar().week).tail(1).sort_index()

Output

        dates  nums  day_mapped
4  2018-11-09    57           6
8  2018-11-15    83           5
10 2018-11-19    96           2
11 2018-12-01    66           0

If your data spans multiple years, you will need to group on year + week.
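A minimal sketch of the year + week grouping, assuming pandas 1.1 or later (for `dt.isocalendar()`). The sample dates and the `iso_year` / `iso_week` column names are made up for illustration. Both years have an ISO week 52 here, so grouping on the week number alone would merge them.

```python
import pandas as pd

# Hypothetical sample: ISO week 52 of 2018 and ISO week 52 of 2019
df = pd.DataFrame({
    'dates': pd.to_datetime(['2018-12-27', '2018-12-28', '2018-12-29',
                             '2019-12-26', '2019-12-27']),
    'nums': [1, 2, 3, 4, 5],
})

# Same ranking as in the answer: Mon..Fri -> 2..6, Sat -> 0, Sun -> 1
wd_map = {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 0, 6: 1}

iso = df['dates'].dt.isocalendar()  # DataFrame with columns year, week, day
df = df.assign(
    iso_year=iso['year'],
    iso_week=iso['week'],
    day_mapped=df['dates'].dt.weekday.map(wd_map),
)

# Highest-ranked day per (ISO year, ISO week)
result = (
    df.sort_values('day_mapped')
      .groupby(['iso_year', 'iso_week'])
      .tail(1)
      .sort_index()
)
print(result[['dates', 'nums']])
```

Using the ISO year rather than `dt.year` keeps days around New Year in the same group as the rest of their ISO week.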

User

Answer 2 · edited 2024-06-16 09:42:58

I wrote a function that selects the latest valid record of a week; it needs to be applied on a per-week groupby:

def last_valid_report(recs):
    if len(recs) == 1:
        return recs
    recs = recs.copy()
    # recs = recs[recs['dates'].dt.weekday <= 4].nlargest(1, recs['dates'].dt.weekday)  # doesn't work
    recs['weekday'] = recs['dates'].dt.weekday  # because nlargest() needs a column name
    recs = recs[recs['weekday'] <= 4].nlargest(1, 'weekday')
    del recs['weekday']
    return recs
    # could have also done:
    # return recs[recs['weekday'] <= 4].nlargest(1, 'weekday').drop('weekday', axis=1)

Calling it with the right grouping, I get:

In [155]: df2 = df.groupby(df['dates'].dt.week).apply(last_valid_report)

In [156]: df2
Out[156]:
              dates  nums
dates
45    4  2018-11-09    63
46    8  2018-11-15    90
47    10 2018-11-19    80
48    11 2018-12-01    94

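A few issues:

- If I don't put recs.copy(), I get ValueError: Shape of passed values is (3, 12), indices imply (3, 4)
- pandas' nlargest() only accepts column names, not expressions
- so I need to create an extra column inside the function and delete it before returning; I could also create it in the original df and drop it after .apply()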

I get an extra index level 'dates' from groupby+apply, which needs to be explicitly dropped:

In [157]: df2.index = df2.index.droplevel(); df2
Out[157]:
        dates  nums
4  2018-11-09    63
8  2018-11-15    90
10 2018-11-19    80
11 2018-12-01    94
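As an alternative to dropping the level afterwards, passing `group_keys=False` to `groupby` should stop `apply` from adding the group key to the index at all. This is a small sketch with made-up data and a hypothetical `latest` helper, not the function from this answer.

```python
import pandas as pd

# Hypothetical sample: two records in ISO week 45, one in week 46
df = pd.DataFrame({
    'dates': pd.to_datetime(['2018-11-08', '2018-11-09', '2018-11-15']),
    'nums': [1, 2, 3],
})

def latest(recs):
    # Stand-in for the per-week selection: keep the row with the latest date
    return recs.sort_values('dates').tail(1)

week = df['dates'].dt.isocalendar().week

# group_keys=False keeps the original row index instead of a (week, row) MultiIndex
out = df.groupby(week, group_keys=False).apply(latest)
print(out)
```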

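If a group contains only Saturday and Sunday records (2 days), I would need to add a check for whether recs[recs['weekday'] <= 4] is empty, and in that case just use .nlargest(1, 'weekday') without filtering on weekday <= 4; but that is beyond the point of the question.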
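One way that check could look is sketched below. This is a variant of the function above, not the version that produced the output shown. The sample data is made up, and the groups are processed with a plain loop plus `pd.concat` so the result keeps the original row index.

```python
import pandas as pd

def last_valid_report(recs):
    """Latest Mon-Fri record of a group; falls back to the latest
    weekend record when the group has no Mon-Fri rows."""
    recs = recs.assign(weekday=recs['dates'].dt.weekday)  # assign() returns a copy
    weekdays = recs[recs['weekday'] <= 4]
    candidates = recs if weekdays.empty else weekdays
    return candidates.nlargest(1, 'weekday').drop(columns='weekday')

# Hypothetical sample: week 46 has Thu + Sat, week 47 has only Sat + Sun
df = pd.DataFrame({
    'dates': pd.to_datetime(['2018-11-15', '2018-11-17',
                             '2018-11-24', '2018-11-25']),
    'nums': [10, 20, 30, 40],
})

week = df['dates'].dt.isocalendar().week
picked = pd.concat([last_valid_report(g) for _, g in df.groupby(week)])
print(picked)
```

The single-record shortcut from the original function is no longer needed here, since a one-row group passes through either branch unchanged.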
