How do I group items into buckets numbered 1 to 10?

Posted 2024-05-01 21:56:50

I'm testing a very basic line of code:

modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 10)

This gives me the score ranges for 10 levels. How can I see 1, 2, 3, and so on up to 10, instead of the ranges?

So, instead of this:

      Score RatingQuantile  
0     (26.3, 29.0]  
6     (23.6, 26.3]  
7     (23.6, 26.3]  
8     (26.3, 29.0]  
10    (18.2, 20.9]  
       ...       ...  
9763  (23.6, 26.3]  
9769  (20.9, 23.6]  
9829  (20.9, 23.6]  
9889  (23.6, 26.3]  
9949  (20.9, 23.6] 

how can I get something like this?

      Score RatingQuantile  
0     10  
6     8 
7     8 
8     10  
10    6  
       ...      ...  
9763  8  
9769  5  
9829  5 
9889  5  
9949  5 

I tried this:

modDF['DecileRank'] = pd.qcut(modDF['RatingScore'],10,labels=False)

and got this error:

ValueError: Bin edges must be unique: array([ 2., 20., 25., 27., 27., 27., 27., 27., 27., 27., 29.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

The error makes sense to me. I just don't know how to work around it. Any ideas?


2 Answers

I think you're looking for:

modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 10, labels=range(1,11))
# or, for zero-based integer codes (0 to 9; add 1 to get 1 to 10):
modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 10, labels=False)

From the docs:

labels : array or bool, optional
Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex.
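
To make the difference between the two `labels` options concrete, here is a small sketch on made-up scores. The `scores` values and the choice of 3 bins are assumptions for illustration only, not the asker's data. Since `labels=False` gives zero-based codes, adding 1 yields 1-based bucket numbers:

```python
import pandas as pd

# Hypothetical scores, chosen only for illustration
scores = pd.Series([2.0, 5.0, 15.0, 29.0])

# Default: 3 equal-width intervals over the observed range [2, 29]
intervals = pd.cut(scores, 3)

# labels=False returns zero-based integer codes instead of intervals
codes = pd.cut(scores, 3, labels=False)

# Explicit labels return a categorical that carries those labels
ranks = pd.cut(scores, 3, labels=list(range(1, 4)))

print(intervals.tolist())
print(codes.tolist())        # [0, 0, 1, 2]
print((codes + 1).tolist())  # [1, 1, 2, 3]
print(ranks.tolist())        # [1, 1, 2, 3]
```

Note that the `labels=range(...)` form produces a categorical column, while `labels=False` produces plain integers, which may matter if you plan to do arithmetic on the result.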

Also, if you want to cover the whole interval [0, 30], specify the bin edges:

import numpy as np

modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 
                               bins=np.linspace(0, 30, 11), labels=False)

Warning: note that ^{} is not the same as ^{}
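
One detail worth checking when you pass explicit edges: by default the bins are open on the left, so a score exactly equal to the lowest edge falls outside every bin. A minimal sketch, assuming made-up scores on the [0, 30] scale (the `scores` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including both ends of the [0, 30] interval
scores = pd.Series([0.0, 3.0, 15.0, 30.0])
edges = np.linspace(0, 30, 11)  # 11 edges: 0, 3, 6, ..., 30 -> 10 bins

# Default bins are (0, 3], (3, 6], ...; a score of exactly 0 gets NaN
default = pd.cut(scores, bins=edges, labels=False)

# include_lowest=True makes the first bin closed on the left as well
closed = pd.cut(scores, bins=edges, labels=False, include_lowest=True)

print(default.tolist())  # first entry is NaN
print(closed.tolist())   # [0, 0, 4, 9]
```

If a score of 0 can occur in your data, `include_lowest=True` is probably what you want here.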

I don't run into any problem using qcut() when I pass a Series. I'm assuming your data looks like the data I'm using:

import pandas as pd
import numpy as np
data = {'values':np.random.randint(1,30,size=1000)}
df = pd.DataFrame(data)
df['ranks'] = pd.qcut(df['values'],10,labels=False)
print(df)

Output:

     values  ranks
0        18      5
1        22      7
2         5      1
3        12      3
4        14      4
..      ...    ...
995      22      7
996      13      4
997      26      8
998       3      0
999      22      7

After that, you can use groupby() or a similar set of functions to check simple things (for example, the limits of each bin):

df_info = df.groupby('ranks').agg(
        min_score=pd.NamedAgg(column='values',aggfunc='min'),
        max_score=pd.NamedAgg(column='values',aggfunc='max'),
        count_cases=pd.NamedAgg(column='values',aggfunc='count'))
print(df_info)

Output:

       min_score  max_score  count_cases
ranks                                   
0              1          3          137
1              4          5           72
2              6          8          105
3              9         11           96
4             12         14           98
5             15         17          107
6             18         20           91
7             21         23           99
8             24         27          121
9             28         29           74
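
The random data above has few ties, so qcut() works there. With heavily tied scores like those in the question, several decile edges coincide and qcut() raises the "Bin edges must be unique" error. Below is a rough sketch of two possible workarounds. The `scores` values are made up to mimic the tied edges in the error message and are not the asker's actual data:

```python
import pandas as pd

# Hypothetical, heavily tied scores (not the asker's data)
scores = pd.Series([2, 20, 25, 27, 27, 27, 27, 27, 27, 29])

# Plain qcut fails because several decile edges coincide at 27
try:
    pd.qcut(scores, 10, labels=False)
    qcut_failed = False
except ValueError as err:
    qcut_failed = True
    print(err)

# Option 1: drop the duplicate edges; this yields fewer than 10 buckets
dropped = pd.qcut(scores, 10, labels=False, duplicates="drop")

# Option 2: break ties by order of appearance, then cut the ranks
order = scores.rank(method="first")
deciles = pd.qcut(order, 10, labels=False) + 1  # shift 0-9 to 1-10

print(dropped.nunique())
print(deciles.tolist())
```

The trade-off: `duplicates="drop"` keeps equal scores in the same bucket but gives up on having exactly 10 buckets, while ranking with `method="first"` always gives 10 buckets but can place identical scores in different deciles.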
