Looking for a way to convert continuous variables into categories

Published 2024-06-17 09:15:49


Sample data:

id       val1  val2  val3  val4  val5  val6  val7
///+8yr  NaN   0.0   2.0   NaN   1     3     23
///1vjh  NaN   NaN   NaN   NaN   NaN   7     62
///4wu   3     NaN   6     NaN   7     8     180

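Essentially, I want to take every value in these rows that is greater than 5 and replace it with a categorical label (e.g. "greaterthan5"). For val7, I want to group the values in intervals of 30, so that 0-30 go together, 31-60 go together, and so on.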

I could write a for loop, but I am wondering whether there is a more efficient way.


1 answer
Anonymous user
Reply #1 · posted 2024-06-17 09:15:49

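You are asking two things. The first, replacing values greater than 5 with 'Larger than 5', can be done with boolean indexing. The second, grouping val7 into intervals, can be done with pd.cut().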

Demo:

import numpy as np
import pandas as pd

d = pd.read_clipboard()
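pd.read_clipboard() only works if the sample table has just been copied, so here is a sketch that builds the same frame by hand. The values are typed in from the sample data in the question; nothing else is assumed.

```python
import numpy as np
import pandas as pd

# Sample data from the question, typed in by hand so no clipboard is needed.
d = pd.DataFrame({
    "id":   ["///+8yr", "///1vjh", "///4wu"],
    "val1": [np.nan, np.nan, 3],
    "val2": [0.0, np.nan, np.nan],
    "val3": [2.0, np.nan, 6],
    "val4": [np.nan, np.nan, np.nan],
    "val5": [1, np.nan, 7],
    "val6": [3, 7, 8],
    "val7": [23, 62, 180],
})
print(d)
```

Columns that contain NaN come out as float64, which is why values such as 3 display as 3.0 in the outputs further down.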

Part 1

Get the values that do not meet the greater-than-5 criterion

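# presumably the complement of the next step: keep values <= 5, NaN elsewhere
rest = d.loc[:,'val1':'val6'][d.loc[:,'val1':'val6'] <= 5]
print(rest)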

Get the values greater than 5

larger_than_5=d.loc[:,'val1':'val6'][d.loc[:,'val1':'val6'] >5]
print(larger_than_5)

   val1  val2  val3  val4  val5  val6
0   NaN   NaN   NaN   NaN   NaN   NaN
1   NaN   NaN   NaN   NaN   NaN   7.0
2   NaN   NaN   6.0   NaN   7.0   8.0

Replace those values with your label

larger_than_5[larger_than_5.notnull()] ='Larger than 5'
print(larger_than_5)

   val1  val2           val3  val4           val5           val6
0   NaN   NaN            NaN   NaN            NaN            NaN
1   NaN   NaN            NaN   NaN            NaN  Larger than 5
2   NaN   NaN  Larger than 5   NaN  Larger than 5  Larger than 5

Update rest with the labeled values

rest.update(larger_than_5)
print(rest)

   val1  val2           val3  val4           val5           val6
0   NaN   0.0              2   NaN              1              3
1   NaN   NaN            NaN   NaN            NaN  Larger than 5
2   3.0   NaN  Larger than 5   NaN  Larger than 5  Larger than 5

Write the updated values from Part 1 back into the original DataFrame

d.loc[:,'val1':'val6'] = rest
print(d)

        id  val1  val2           val3  val4           val5           val6  \
0  ///+8yr   NaN   0.0              2   NaN              1              3   
1  ///1vjh   NaN   NaN            NaN   NaN            NaN  Larger than 5   
2   ///4wu   3.0   NaN  Larger than 5   NaN  Larger than 5  Larger than 5   

   val7  
0    23  
1    62  
2   180  
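The steps above can also be collapsed into a single call. This is only a sketch of an alternative, not the approach used in this answer: DataFrame.mask replaces cells where a condition is True, and casting to object first lets numbers and strings share a column. The frame is a hand-typed copy of the sample, and the names cols, vals, labeled and out are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the sample data.
d = pd.DataFrame({
    "id":   ["///+8yr", "///1vjh", "///4wu"],
    "val1": [np.nan, np.nan, 3],
    "val2": [0.0, np.nan, np.nan],
    "val3": [2.0, np.nan, 6],
    "val4": [np.nan, np.nan, np.nan],
    "val5": [1, np.nan, 7],
    "val6": [3, 7, 8],
    "val7": [23, 62, 180],
})

cols = ["val1", "val2", "val3", "val4", "val5", "val6"]
vals = d[cols]

# NaN > 5 evaluates to False, so missing values are left untouched.
labeled = vals.astype(object).mask(vals > 5, "Larger than 5")

# Work on a copy so the original numeric frame stays available.
out = d.copy()
out[cols] = labeled
print(out)
```

Keeping the result in a separate frame means the numeric columns are still there if you need them for further calculations.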

Part 2

Get the bins

bins = np.arange(0, d['val7'].max()+1, 30)
bins

array([  0,  30,  60,  90, 120, 150, 180], dtype=int64)

Create a new Series

val7_groups = pd.cut(d['val7'], bins)
val7_groups

0       (0, 30]
1      (60, 90]
2    (150, 180]
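Two edge cases are worth checking on real data. By default pd.cut builds intervals that are open on the left, so a value of exactly 0 falls outside (0, 30]. Also, when the maximum is not a multiple of 30, np.arange(0, max + 1, 30) stops short of it. Both cases come out as NaN. A minimal sketch of one way around this follows; the values in val7 are hypothetical and chosen only to trigger both cases.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not the question's data: 0 and a max that is not a multiple of 30.
val7 = pd.Series([0, 23, 62, 170], name="val7")

# Extending the stop by one full step guarantees the last edge is >= the maximum.
bins = np.arange(0, val7.max() + 30, 30)

# include_lowest=True closes the first interval on the left so that 0 is kept.
groups = pd.cut(val7, bins, include_lowest=True)
print(groups)
```

With include_lowest=True pandas shows the first interval with a lower edge slightly below 0. That affects only the display, not the grouping.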

Add it to the DataFrame

d['val7_groups'] = val7_groups
print(d)

        id  val1  val2           val3  val4           val5           val6  \
0  ///+8yr   NaN   0.0              2   NaN              1              3   
1  ///1vjh   NaN   NaN            NaN   NaN            NaN  Larger than 5   
2   ///4wu   3.0   NaN  Larger than 5   NaN  Larger than 5  Larger than 5   

   val7 val7_groups  
0    23     (0, 30]  
1    62    (60, 90]  
2   180  (150, 180] 

You can also set the group labels by passing values to the labels parameter of pd.cut()
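A short sketch of the labels parameter, assuming val7 holds integers so that the interval (0, 30] can be described as "1-30". The label format is just an illustration. pd.cut requires one label per interval, that is, len(bins) - 1 labels.

```python
import numpy as np
import pandas as pd

val7 = pd.Series([23, 62, 180], name="val7")
bins = np.arange(0, val7.max() + 1, 30)  # 0, 30, ..., 180

# One label per interval: len(labels) must equal len(bins) - 1.
labels = [f"{int(lo) + 1}-{int(hi)}" for lo, hi in zip(bins[:-1], bins[1:])]

val7_groups = pd.cut(val7, bins, labels=labels)
print(val7_groups)
```

The result is still a categorical Series, so it can be assigned to d['val7_groups'] exactly as before.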
