pandas DataFrame: insert a value based on the range another column's value falls in

Posted 2024-04-28 06:11:35


I have a DataFrame like the one below, and I want to insert a string based on the value in the sic2 column.

        conm                      sic2
115466  ALLEGION PLC              34.0
115471  AGILITY HEALTH INC        80.0
115473  NORDIC AMERICAN OFFSHORE  44.0
115474  AAD                       54.0
115477  DORIAN LPG LTD            44.0
115484  NOMAD FOODS LTD           20.0
115486  ATHENE HOLDING LTD        63.0
115490  MIDATECH PHARMA PLC       28.0
115495  MOTIF BIO PLC             28.0

The sic2 number ranges for each string are as follows.

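1-9 Agriculture, Forestry and Fishing
10-14 Mining
15-17 Construction
18-19 not used
20-39 Manufacturing
40-49 Transportation, Communications, Electric, Gas and Sanitary service
50-51 Wholesale Trade
52-59 Retail Trade
60-67 Finance, Insurance and Real Estate
70-89 Services
91-97 Public Administration
99-99 Nonclassifiable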

How can I make the pandas.DataFrame look like this, applied across the whole large dataset?

I have tried several conditional statements, but they always fail.

        conm                      sic2  industry
115466  ALLEGION PLC              34.0  Manufacturing
115471  AGILITY HEALTH INC        80.0  Services
115473  NORDIC AMERICAN OFFSHORE  44.0  Transportation, Communications, Electric, Gas and Sanitary service
115474  AAD                       54.0  Retail Trade

2 Answers
import pandas as pd

# Save your mapping table to a data frame

df2 = pd.DataFrame({'id_end': {0: 9,  1: 14,  2: 17,  3: 19,  4: 39,  5: 49,  6: 51,  7: 59,  8: 67,  9: 89,  10: 97,  11: 99,  12: 1},
 'id_start': {0: 1,  1: 10,  2: 15,  3: 18,  4: 20,  5: 40,  6: 50,  7: 52,  8: 60,  9: 70,  10: 91,  11: 99,  12: 0},
 'industry': {0: 'Agriculture, Forestry and Fishing',  1: 'Mining',  2: 'Construction',  3: 'not used',  4: 'Manufacturing',
  5: 'Transportation, Communications, Electric, Gas and Sanitary service',
  6: 'Wholesale Trade',  7: 'Retail Trade',  8: 'Finance, Insurance and Real Estate',  9: 'Services',  
  10: 'Public Administration',  11: 'Nonclassifiable',  12: 'Agricultural Production Crops'}})

df2 = df2.sort_values(by='id_end')

Out[354]: 
    id_end  id_start                                           industry
12       1         0                      Agricultural Production Crops
0        9         1                  Agriculture, Forestry and Fishing
1       14        10                                             Mining
2       17        15                                       Construction
3       19        18                                           not used
4       39        20                                      Manufacturing
5       49        40  Transportation, Communications, Electric, Gas ...
6       51        50                                    Wholesale Trade
7       59        52                                       Retail Trade
8       67        60                 Finance, Insurance and Real Estate
9       89        70                                           Services
10      97        91                              Public Administration
11      99        99                                    Nonclassifiable

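# Map each sic2 number to the industry of the first range whose upper bound is >= that number
# (np.int was removed in NumPy 1.24, so the built-in int is used for the cast)
df['industry'] = df['sic2'].astype(int).apply(lambda x: df2.loc[df2.id_end >= x, 'industry'].iloc[0])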


Out[352]: 
                            conm  sic2                                             industry
115466              ALLEGION PLC  34.0                                        Manufacturing 
115471        AGILITY HEALTH INC  80.0                                             Services 
115473  NORDIC AMERICAN OFFSHORE  44.0    Transportation, Communications, Electric, Gas ... 
115474                       AAD  54.0                                         Retail Trade 
115477            DORIAN LPG LTD  44.0    Transportation, Communications, Electric, Gas ... 
115484           NOMAD FOODS LTD  20.0                                        Manufacturing 
115486        ATHENE HOLDING LTD  63.0                   Finance, Insurance and Real Estate 
115490       MIDATECH PHARMA PLC  28.0                                        Manufacturing 
115495             MOTIF BIO PLC  28.0                                        Manufacturing 
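On a large dataset the row-by-row `apply` can be slow. One vectorized alternative is `pd.cut`, sketched below under a few assumptions: the `edges` and `labels` lists are hypothetical names built from the ranges in the question, only a handful of sample rows are used, and the extra 0-1 "Agricultural Production Crops" row from the mapping table is left out. Like the lookup above, this version assigns codes that fall in a gap (68, 69, 90, 98) to the next range up, so it may not be what you want if such codes occur.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "NORDIC AMERICAN OFFSHORE",
             "AAD", "ATHENE HOLDING LTD"],
    "sic2": [34.0, 80.0, 44.0, 54.0, 63.0],
})

# Upper bound of every range, preceded by 0.
# With right=True, bin i covers the half-open interval (edges[i], edges[i + 1]].
edges = [0, 9, 14, 17, 19, 39, 49, 51, 59, 67, 89, 97, 99]
labels = [
    "Agriculture, Forestry and Fishing",
    "Mining",
    "Construction",
    "not used",
    "Manufacturing",
    "Transportation, Communications, Electric, Gas and Sanitary service",
    "Wholesale Trade",
    "Retail Trade",
    "Finance, Insurance and Real Estate",
    "Services",
    "Public Administration",
    "Nonclassifiable",
]

# One vectorized pass over the column; values outside (0, 99] become NaN.
df["industry"] = pd.cut(df["sic2"], bins=edges, labels=labels, right=True)
print(df)
```

The result is a categorical column, which also keeps memory use low when the same few industry names repeat across many rows.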

If you convert the SIC numbers into a dictionary, looking up the industry as needed becomes fairly simple:

Code:

sic = [x.strip().split(' ', 1) for x in """
    1-9 Agriculture, Forestry and Fishing
    10-14 Mining
    15-17 Construction
    18-19 not used
    20-39 Manufacturing
    40-49 Transportation, Communications, ...
    50-51 Wholesale Trade
    52-59 Retail Trade
    60-67 Finance, Insurance and Real Estate
    70-89 Services
    91-97 Public Administration
    99-99 Nonclassifiable
""".split('\n')[1:-1]]

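# Expand every inclusive "start-end" range into one dictionary entry per code.
# range() excludes its stop value, so add 1 to keep the upper bound
# (otherwise 9, 14, ... are dropped and 99-99 yields no entry at all).
sic_dict = {}
for bounds, name in sic:
    start, end = (int(y) for y in bounds.split('-'))
    for code in range(start, end + 1):
        sic_dict[code] = name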

Test code:

^{pr2}$

Results:

   number                      conm  sic2                             industry
0  115466              ALLEGION PLC  34.0                        Manufacturing
1  115471        AGILITY HEALTH INC  80.0                             Services
2  115473  NORDIC AMERICAN OFFSHORE  44.0  Transportation, Communications, ...
3  115474                       AAD  54.0                         Retail Trade
4  115477            DORIAN LPG LTD  44.0  Transportation, Communications, ...
5  115484           NOMAD FOODS LTD  20.0                        Manufacturing
6  115486        ATHENE HOLDING LTD  63.0   Finance, Insurance and Real Estate
7  115490       MIDATECH PHARMA PLC  28.0                        Manufacturing
8  115495             MOTIF BIO PLC  28.0                        Manufacturing
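A dictionary lookup along the following lines would produce an industry column like the one in the table above. This is a minimal sketch and not necessarily the answerer's own test code: `sic_ranges` and the small sample frame are assumptions made for illustration, and the cast with `astype(int)` assumes the sic2 column has no missing values.

```python
import pandas as pd

# Inclusive (start, end, name) ranges, with names as listed in this answer.
sic_ranges = [
    (1, 9, "Agriculture, Forestry and Fishing"),
    (10, 14, "Mining"),
    (15, 17, "Construction"),
    (18, 19, "not used"),
    (20, 39, "Manufacturing"),
    (40, 49, "Transportation, Communications, ..."),
    (50, 51, "Wholesale Trade"),
    (52, 59, "Retail Trade"),
    (60, 67, "Finance, Insurance and Real Estate"),
    (70, 89, "Services"),
    (91, 97, "Public Administration"),
    (99, 99, "Nonclassifiable"),
]

# One entry per integer code; end + 1 keeps the upper bound inclusive.
sic_dict = {
    code: name
    for start, end, name in sic_ranges
    for code in range(start, end + 1)
}

# Sample rows copied from the question.
df = pd.DataFrame({
    "number": [115466, 115471, 115473, 115474],
    "conm": ["ALLEGION PLC", "AGILITY HEALTH INC",
             "NORDIC AMERICAN OFFSHORE", "AAD"],
    "sic2": [34.0, 80.0, 44.0, 54.0],
})

# Cast the float codes to int, then look each one up in the dictionary.
# Codes that fall in a gap between ranges (68, 69, 90, 98) map to NaN.
df["industry"] = df["sic2"].astype(int).map(sic_dict)
print(df)
```

Because `Series.map` leaves unmatched codes as NaN, gaps in the range table show up as missing values instead of being silently assigned to a neighbouring industry.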
