Python:为groupby值生成值范围

-1 投票
1 回答
76 浏览
提问于 2025-04-13 17:49

我正在尝试用Python生成测试数据。请告诉我是否可以实现这样的功能。

我会输入一个数字,比如20,我需要生成20行数据。其中一列包含每组的值范围。这些值应该在这个范围内生成。(有些没有值范围,只有字符串或者用逗号分隔的值,这些应该被分开)下面是示例输入和预期输出。

示例输入:

  VarName           Label  Score
0    LtoV  0.00<=to>=0.68     20
1    LtoV     0.69<=to>=2     33
2    LtoV         2toHigh     40
3     Age           0to20     36
4     Age          21to40     15
5     Age          41to60     50
6     Indicator     A         30
7     Indicator     B         40
8     Indicator     C         40
9     Oc Code       100,20,30 10
10    Oc Code       5,10,16   20

输出:

当我输入20时,我需要为LtoV生成20行数据,第二列的值应该在0.00到某个高值之间,并根据这个值给出相应的分数。对于字符串列,值应该在20行中重复。对于用逗号分隔的值的列,应该把这些值分开,每个值放在一行。同样,年龄也需要生成20行。

SNo    VarName    column2     Score
    
  1       LtoV       0.00        20     
  2       LtoV       0.42        20
  3       LtoV       1           33
  4       LtoV       2.5         40
...
 20       LtoV      50           40
`1`    Indicator     A         30
 2    Indicator     B         40
 3    Indicator     C         40
 4    Indicator     A         30
 5    Indicator     B         40
 6     Indicator     C         40
 7     Indicator     A         30
 8     Indicator     B         40
....
 20     Indicator     C         40`
 1    Oc Code       100        10
 2    Oc Code       20         10
 3    Oc Code       30         10
 4    Oc Code       5          20
5    Oc Code       10          20
6    Oc Code       15         20
....
20    Oc Code       5          20


generate 20 such values for LtoV and 20 such values for Age and based on the range corresponding value for score.
Is this feasible?

********************

1 个回答

0

你可以先找出范围的下限和上限,如果没有的话,可以选择设置一个默认的低值和高值(否则就用dropna来处理缺失值)。接着,你可以用sample从每个组中随机抽取20行,设置参数replace=True,然后用numpy.random.uniform来生成随机值:

df = pd.DataFrame({'VarName': ['LtoV', 'LtoV', 'LtoV', 'Age', 'Age', 'Age'],
                   'Label': ['0.00<=to>=0.68', '0.69<=to>=2', '2toHigh', '0to20', '21to40', '41to60'],
                   'Score': [20, 33, 40, 36, 15, 50]})

ranges = (df['Label']
          .str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
          .astype(float)
          .fillna({0: 0, 1: 100}) # default low/high
          .groupby(df['VarName']).sample(n=20, replace=True)
         )

low, high = ranges.to_numpy().T

out = (df.reindex(ranges.index)
         .assign(column2=np.random.uniform(low, high))
         .sort_index()
      )

示例输出:

  VarName           Label  Score    column2
0    LtoV  0.00<=to>=0.68     20   0.670814
0    LtoV  0.00<=to>=0.68     20   0.222490
0    LtoV  0.00<=to>=0.68     20   0.364282
0    LtoV  0.00<=to>=0.68     20   0.607478
0    LtoV  0.00<=to>=0.68     20   0.633055
0    LtoV  0.00<=to>=0.68     20   0.675987
0    LtoV  0.00<=to>=0.68     20   0.531158
1    LtoV     0.69<=to>=2     33   1.843399
1    LtoV     0.69<=to>=2     33   1.940886
1    LtoV     0.69<=to>=2     33   1.760578
1    LtoV     0.69<=to>=2     33   1.110946
1    LtoV     0.69<=to>=2     33   1.724802
2    LtoV         2toHigh     40  19.508126
2    LtoV         2toHigh     40  43.618950
2    LtoV         2toHigh     40  62.808270
2    LtoV         2toHigh     40  95.246445
2    LtoV         2toHigh     40   2.536909
2    LtoV         2toHigh     40  29.841093
2    LtoV         2toHigh     40  73.958545
2    LtoV         2toHigh     40  89.782031
3     Age           0to20     36  12.593195
3     Age           0to20     36   0.110518
3     Age           0to20     36   5.308487
3     Age           0to20     36  12.914926
3     Age           0to20     36  17.198194
3     Age           0to20     36  17.529561
3     Age           0to20     36   0.954200
4     Age          21to40     15  28.722856
4     Age          21to40     15  30.446264
4     Age          21to40     15  26.812708
4     Age          21to40     15  34.001827
4     Age          21to40     15  30.052331
4     Age          21to40     15  28.470239
4     Age          21to40     15  30.371726
5     Age          41to60     50  50.095150
5     Age          41to60     50  57.618241
5     Age          41to60     50  56.214911
5     Age          41to60     50  58.106158
5     Age          41to60     50  41.220661
5     Age          41to60     50  49.831084

中间结果:

(df['Label']
.str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
.astype(float)
.fillna({0: 0, 1: 100}) # default low/high
)

       0       1
0   0.00    0.68
1   0.69    2.00
2   2.00  100.00
3   0.00   20.00
4  21.00   40.00
5  41.00   60.00

撰写回答