Pandas: finding unique entries in a daily census

Posted 2024-05-14 10:16:37

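I have a month of jail census data that looks like this, and I want to find out how many unique inmates there were during the month. The census is taken every day, so the same inmate appears many times.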

_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33

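My current approach is to group the rows by day and then add to a new DataFrame any rows that are not already in it. My problem is how to account for two people who have identical information: would the second one be left out of the new DataFrame because a matching row already exists? I want to know how many people in total were in the jail during this period.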

_id is incremental; for example, here is some data from the second day:

2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39

Link to the dataset: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census


2 Answers

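I think the trick here is to group by as many columns as possible and then check how the counts of those (small) groups change over the month: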

import pandas as pd

inmates = pd.read_csv('inmates.csv')

# group by everything except _id and count number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()

# pivot the dates out and transpose - this gives us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)

# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()

# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]

# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)

# sum total column
diffed['total'].sum()  # 3393
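
To make the counting rule above easier to follow, here is a minimal sketch of the same positive-difference idea on a tiny made-up census. The `toy` frame and its six rows are hypothetical and are not taken from the real dataset; the sketch also uses `.size()` and `clip` rather than the exact calls in the answer, so treat it as an illustration, not as the answer's code.

```python
import pandas as pd

# Hypothetical toy census covering three days (not the real dataset).
cols = ["_id", "Date", "Gender", "Race", "Age at Booking", "Current Age"]
toy = pd.DataFrame(
    [
        (1, "2016-06-01", "M", "W", 32, 33),
        (2, "2016-06-01", "M", "B", 25, 27),
        (3, "2016-06-02", "M", "W", 32, 33),
        (4, "2016-06-02", "M", "W", 32, 33),  # a second, identical-looking inmate
        (5, "2016-06-02", "M", "B", 25, 27),
        (6, "2016-06-03", "M", "W", 32, 33),
    ],
    columns=cols,
)

keys = ["Gender", "Race", "Age at Booking", "Current Age"]

# Rows = days, columns = attribute combinations, values = head count.
counts = toy.groupby(["Date"] + keys).size().unstack(keys, fill_value=0)

# Day-over-day change; the first day keeps its full head count.
change = counts.diff()
change.iloc[0] = counts.iloc[0].to_numpy()

# Only increases are treated as arrivals; decreases are ignored.
estimated_total = int(change.clip(lower=0).to_numpy().sum())
print(estimated_total)  # 3
```

On this toy data the estimate is 3: two inmates on the first day plus one increase in the M/W/32/33 group on the second day. The caveat from the comments still applies: if one person leaves and a look-alike arrives on the same day, the difference is 0 and the arrival is missed.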

You can use df.drop_duplicates() to return a DataFrame containing only the unique rows, and then count the entries.

Something like this should work:

import pandas as pd
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)

uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)

Result:

>> 11845

Pandas drop_duplicates Documentation

Inmates June 2016 CSV

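The problem with this approach (and with this data) is that inmates who share the same age, gender, and race may be filtered out as duplicates.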
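
A small sketch of that caveat, using a hypothetical toy frame (two made-up inmates with identical attributes, each present on three days). It suggests that the count can be off in both directions, depending on whether `Date` is part of the rows being compared.

```python
import pandas as pd

# Hypothetical toy data: inmates A and B share every attribute.
cols = ["_id", "Date", "Gender", "Race", "Age at Booking", "Current Age"]
toy = pd.DataFrame(
    [
        (1, "2016-06-01", "M", "B", 25, 27),  # inmate A
        (2, "2016-06-01", "M", "B", 25, 27),  # inmate B
        (3, "2016-06-02", "M", "B", 25, 27),  # inmate A, next day
        (4, "2016-06-02", "M", "B", 25, 27),  # inmate B, next day
        (5, "2016-06-03", "M", "B", 25, 27),  # inmate A, third day
        (6, "2016-06-03", "M", "B", 25, 27),  # inmate B, third day
    ],
    columns=cols,
).set_index("_id")

# With Date kept, every day produces its own "unique" row, while the two
# look-alike inmates within a day collapse into one.
with_date = len(toy.drop_duplicates())

# With Date dropped, everyone who shares the same attributes collapses.
without_date = len(toy.drop(columns="Date").drop_duplicates())

print(with_date, without_date)  # 3 1
```

The toy data describes 2 people, yet the two variants report 3 and 1. Without a stable per-person identifier in the data, any count built from these columns is an estimate.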
