Identifying observations in a given quantile (data science)

 id  year  rap  cohort  jobs  year_of_life  rap_new
  1  2009    0     NaN    10           NaN        0
  2  2012    0    2012    12             0        0
  3  2013    0    2012    12             1        1
  4  2014    0    2012    13             2        1
  5  2015    1    2012    15             3        1
  6  2016    0    2012    17             4        0
  7  2017    0    2012    19             5        0
  8  2009    0    2009    15             0        1
  9  2010    0    2009     2             1        1
 10  2011    0    2009     3             2        1
 11  2012    1    2009     3             3        0
 12  2013    0    2009    15             4        0
 13  2014    0    2009    12             5        0
 14  2015    0    2009    13             6        0
 15  2016    0    2009    13             7        0
 16  2011    0    2009     3             2        1
 17  2012    1    2009     3             3        0
 18  2013    0    2009    18             4        0
 19  2014    0    2009    12             5        0
 20  2015    0    2009    13             6        0
.....
100  2009    0    2007     5             6        1

 id  year  rap  cohort  jobs  year_of_life  rap_new  new_var
  1  2009    0     NaN    10           NaN        0        0
  2  2012    0    2012    12             0        0        0
  3  2013    0    2012    12             1        1        0
  4  2014    0    2012    13             2        1        0
  5  2015    1    2012    15             3        1        0
  6  2016    0    2012    17             4        0        0
  7  2017    0    2012    18             5        0        0
  8  2009    0    2009    15             0        1        0
  9  2010    0    2009     2             1        1        0
 10  2011    0    2009     3             2        1        0
 11  2012    1    2009     3             3        0        0
 12  2013    0    2009    15             4        0        0
 13  2014    0    2009    12             5        0        0
 14  2015    0    2009    13             6        0        0
 15  2016    0    2009    13             7        0        0
 16  2011    0    2009     3             2        1        0
 17  2012    1    2009     3             3        0        0
 18  2013    0    2009    19             4        0        1
 19  2014    0    2009    12             5        0        0
 20  2015    0    2009    13             6        0        0
.....
100  2009    0    2007     5             6        1        0

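2 answers

User

Answer 1 · edited 2024-04-19 01:04:25

pandas ships with a rank method that returns either the rank or, with pct=True, the percentile rank of each value. You probably want: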

In [8]: df['percentile'] = df.jobs.rank(pct=True)

In [9]: df
Out[9]:
    id  year  rap  cohort  jobs  year_of_life  rap_new  percentile
0    1  2009    0     NaN    10           NaN        0       0.300
1    2  2012    0  2012.0    12           0.0        0       0.425
2    3  2013    0  2012.0    12           1.0        1       0.425
3    4  2014    0  2012.0    13           2.0        1       0.625
4    5  2015    1  2012.0    15           3.0        1       0.800
5    6  2016    0  2012.0    17           4.0        0       0.900
6    7  2017    0  2012.0    19           5.0        0       1.000
7    8  2009    0  2009.0    15           0.0        1       0.800
8    9  2010    0  2009.0     2           1.0        1       0.050
9   10  2011    0  2009.0     3           2.0        1       0.175
10  11  2012    1  2009.0     3           3.0        0       0.175
11  12  2013    0  2009.0    15           4.0        0       0.800
12  13  2014    0  2009.0    12           5.0        0       0.425
13  14  2015    0  2009.0    13           6.0        0       0.625
14  15  2016    0  2009.0    13           7.0        0       0.625
15  16  2011    0  2009.0     3           2.0        1       0.175
16  17  2012    1  2009.0     3           3.0        0       0.175
17  18  2013    0  2009.0    18           4.0        0       0.950
18  19  2014    0  2009.0    12           5.0        0       0.425
19  20  2015    0  2009.0    13           6.0        0       0.625

So, to select the rows in the top 1%:

In [10]: df[df.percentile > 0.99]
Out[10]:
   id  year  rap  cohort  jobs  year_of_life  rap_new  percentile
6   7  2017    0  2012.0    19           5.0        0         1.0

Or the top 50%:

In [12]: df[df.percentile > 0.50]
Out[12]:
    id  year  rap  cohort  jobs  year_of_life  rap_new  percentile
3    4  2014    0  2012.0    13           2.0        1       0.625
4    5  2015    1  2012.0    15           3.0        1       0.800
5    6  2016    0  2012.0    17           4.0        0       0.900
6    7  2017    0  2012.0    19           5.0        0       1.000
7    8  2009    0  2009.0    15           0.0        1       0.800
11  12  2013    0  2009.0    15           4.0        0       0.800
13  14  2015    0  2009.0    13           6.0        0       0.625
14  15  2016    0  2009.0    13           7.0        0       0.625
17  18  2013    0  2009.0    18           4.0        0       0.950
19  20  2015    0  2009.0    13           6.0        0       0.625
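The question's desired output has a 0/1 column named new_var rather than a filtered frame. A minimal sketch of turning the percentile rank into such a flag follows. It assumes new_var is meant to mark the top 1% of jobs, which the tables suggest but do not state, and it rebuilds only the id and jobs columns from the 20 rows shown above.

```python
import pandas as pd

# jobs values for the 20 rows shown in the answer's output
jobs = [10, 12, 12, 13, 15, 17, 19, 15, 2, 3,
        3, 15, 12, 13, 13, 3, 3, 18, 12, 13]
df = pd.DataFrame({"id": range(1, len(jobs) + 1), "jobs": jobs})

# rank(pct=True) gives tied values their average rank by default,
# then divides each rank by the number of non-missing rows
df["percentile"] = df["jobs"].rank(pct=True)

# Assumption: new_var is 1 for rows in the top 1% of jobs, 0 otherwise
df["new_var"] = (df["percentile"] > 0.99).astype(int)

flagged_ids = df.loc[df["new_var"] == 1, "id"].tolist()
print(flagged_ids)  # [7]
```

With only 20 rows, the top 1% contains just the single largest value, so a lower threshold may be more useful on small samples.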

User

Answer 2 · edited 2024-04-19 01:04:25

You can use pd.Series.quantile to find the cutoff value.

Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    id=range(1, 201),
    jobs=np.random.randint(100, 10000, size=200)
))

Solution

df[df.jobs >= df.jobs.quantile(.99)]

      id  jobs
23    24  9768
182  183  9965
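The same cutoff can produce an indicator column instead of a filtered frame. The sketch below wraps that in a small helper. The name flag_top_quantile and the ten-row sample are illustrative assumptions, not part of the original answer.

```python
import pandas as pd


def flag_top_quantile(values: pd.Series, q: float) -> pd.Series:
    """Return 1 where a value is at or above the q-th quantile, else 0.

    Hypothetical helper for illustration only.
    """
    # quantile() interpolates linearly between neighbouring values by default
    cutoff = values.quantile(q)
    return (values >= cutoff).astype(int)


# Small illustrative sample (first ten jobs values from the question)
df = pd.DataFrame({
    "id": range(1, 11),
    "jobs": [10, 12, 12, 13, 15, 17, 19, 15, 2, 3],
})
df["new_var"] = flag_top_quantile(df["jobs"], 0.9)

print(df[df["new_var"] == 1])
```

Because the cutoff is interpolated, it usually falls between two observed values. Here the 0.9 quantile lies between 17 and 19, so only the row with 19 jobs is flagged.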
