How to accumulate the count of unique row values for each ID over time

Published 2024-05-16 00:48:44


I have a dataset containing a date, a car ID, and a destination.

For each row, I want the cumulative count of unique destinations for each car ID. It is important that the counter starts from the earliest date.

The desired output is the "unique_destinations" column:

          date  car_id   destination  unique_destinations
0   01/01/2019       1        Boston                    1
1   01/01/2019       2         Miami                    1
2   02/01/2019       1        Boston                    1
3   02/01/2019       2       Orlando                    2
4   03/01/2019       1      New York                    2
5   03/01/2019       2         Tampa                    3
6   04/01/2019       1        Boston                    2
7   04/01/2019       2         Miami                    3
8   05/01/2019       1    Washington                    3
9   05/01/2019       2  Jacksonville                    4
10  06/01/2019       1      New York                    3
11  06/02/2019       2       Atlanta                    5

3 Answers

OK, this may not be efficient, but here is one way :)

def check(data):
    seen = []   # destinations already seen for this car
    flag = 0
    for index, row in data.iterrows():
        if row['destination'] not in seen:
            flag += 1
            seen.append(row['destination'])
        # .loc avoids the chained-assignment pitfall of data['col'][index] = ...
        data.loc[index, 'unique_destinations'] = flag
    return data

df['unique_destinations'] = 0
df = df.groupby('car_id', group_keys=False).apply(check).sort_index()
print(df['unique_destinations'])

Output

0     1
1     1
2     1
3     2
4     2
5     3
6     2
7     3
8     3
9     4
10    3
11    5
Name: unique_destinations, dtype: int64
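
For comparison, the same running count can be computed without a Python-level loop (a sketch, not from the original answers, assuming the frame is already sorted by date): mark the first occurrence of each (car_id, destination) pair with `duplicated`, then take a per-car cumulative sum of those marks.

```python
import pandas as pd

# sample data from the question, already in date order
df = pd.DataFrame({
    'car_id':      [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    'destination': ['Boston', 'Miami', 'Boston', 'Orlando', 'New York', 'Tampa',
                    'Boston', 'Miami', 'Washington', 'Jacksonville',
                    'New York', 'Atlanta'],
})

# True the first time a (car_id, destination) pair appears
first_seen = ~df.duplicated(['car_id', 'destination'])

# running count of first occurrences within each car_id
df['unique_destinations'] = first_seen.groupby(df['car_id']).cumsum()
```

Every row either introduces a new destination (adds 1) or repeats one (adds 0), so the cumulative sum reproduces the column in the question.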

We can also split the data by car ID and then run a custom function like this:

def create_uniques(df):
    dests = []    # destinations seen so far
    uniques = []  # running unique count, one entry per row
    counter = 0
    for ix, r in df.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
        uniques.append(counter)

    df['unique_destinations'] = uniques

    return df

df1 = df[df['car_id'] == 1].reset_index(drop=True)
df2 = df[df['car_id'] == 2].reset_index(drop=True)

df_final = pd.concat([create_uniques(df1), create_uniques(df2)], ignore_index=True).sort_values('date')

Output:

print(df_final)
         date  car_id   destination  unique_destinations
0  2019-01-01       1        Boston                    1
6  2019-01-01       2         Miami                    1
1  2019-02-01       1        Boston                    1
7  2019-02-01       2       Orlando                    2
2  2019-03-01       1      New York                    2
8  2019-03-01       2         Tampa                    3
3  2019-04-01       1        Boston                    2
9  2019-04-01       2         Miami                    3
4  2019-05-01       1    Washington                    3
10 2019-05-01       2  Jacksonville                    4
5  2019-06-01       1      New York                    3
11 2019-06-02       2       Atlanta                    5
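
Hand-splitting into df1 and df2 only scales to two car IDs; groupby can apply the same function to every car. A sketch (not from the original answer) assuming the frame is already sorted by date; sort_index restores the original interleaved row order:

```python
import pandas as pd

# abbreviated sample from the question, already in date order
df = pd.DataFrame({
    'car_id':      [1, 2, 1, 2, 1, 2],
    'destination': ['Boston', 'Miami', 'Boston', 'Orlando', 'New York', 'Tampa'],
})

def create_uniques(group):
    dests, uniques, counter = [], [], 0
    for ix, r in group.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
        uniques.append(counter)
    group['unique_destinations'] = uniques
    return group

# one call per car_id instead of a manual df1/df2 split
df = df.groupby('car_id', group_keys=False).apply(create_uniques).sort_index()
```

This generalizes to any number of car IDs without changing the function.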

Timing the answers.

This answer:

%%timeit

def create_uniques(df):
    dests = []
    uniques = []
    counter = 0
    for ix, r in df.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
            uniques.append(counter)
        else:
            uniques.append(counter)

    df['unique_destinations'] = uniques

    return df

df1 = df[df['car_id'] == 1].reset_index(drop=True)
df2 = df[df['car_id'] == 2].reset_index(drop=True)

df_final = pd.concat([create_uniques(df1), create_uniques(df2)], ignore_index=True).sort_values('date')

11 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

iamklaus's answer:

%%timeit

def check(data):
    seen = []
    flag = 0
    for index, row in data.iterrows():
        if row['destination'] not in seen:
            flag += 1
            seen.append(row['destination'])
        data.loc[index, 'unique_destinations'] = flag
    return data

df['unique_destinations'] = 0
df.groupby('car_id').apply(check)

15.3 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nikhilbalwani's answer:

%%timeit
for index, row in df.iterrows():
    unique_before_date = df[df['date'] <= row['date']].groupby(['car_id'])['destination'].nunique()
    df.loc[index, 'unique_destinations'] = int(unique_before_date[row['car_id']])

839 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Please try the following short and sweet code:

for index, row in df.iterrows():
    # unique destinations per car over all rows up to this row's date
    unique_before_date = df[df['date'] <= row['date']].groupby(['car_id'])['destination'].nunique()
    df.loc[index, 'unique_destinations'] = int(unique_before_date[row['car_id']])

print(df)

It produces the following output:

         date  car_id   destination unique_destinations
0  2019-01-01       1        Boston                   1
1  2019-01-01       2         Miami                   1
2  2019-01-02       1        Boston                   1
3  2019-01-02       2       Orlando                   2
4  2019-01-03       1      New York                   2
5  2019-01-03       2         Tampa                   3
6  2019-01-04       1        Boston                   2
7  2019-01-04       2         Miami                   3
8  2019-01-05       1    Washington                   3
9  2019-01-05       2  Jacksonville                   4
10 2019-01-06       1      New York                   3
11 2019-02-06       2       Atlanta                   5
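
One caveat the thread glosses over: the `df['date'] <= row['date']` filter only orders correctly when `date` is a real datetime. The slash-separated strings in the question would compare lexicographically, and the answers even render them differently (2019-06-01 vs. 2019-01-06), so the parsing convention should be stated explicitly. A small sketch:

```python
import pandas as pd

s = pd.Series(['01/01/2019', '06/01/2019'])

# ambiguous slash dates: pick the convention explicitly instead of
# relying on pandas' default month-first guess
day_first = pd.to_datetime(s, dayfirst=True)        # '06/01' -> 6 January
month_first = pd.to_datetime(s, format='%m/%d/%Y')  # '06/01' -> 1 June
```

Either convention works for this dataset as long as it is applied consistently before sorting and filtering.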
