Python：为groupby值生成值范围

-1 投票

1 回答

76 浏览

提问于 2025-04-13 17:49

我正在尝试用Python生成测试数据。请告诉我是否可以实现这样的功能。

我会输入一个数字，比如20，我需要生成20行数据。其中一列包含每组的值范围。这些值应该在这个范围内生成。（有些没有值范围，只有字符串或者用逗号分隔的值，这些应该被分开）下面是示例输入和预期输出。

示例输入：

  VarName           Label  Score
0    LtoV  0.00<=to>=0.68     20
1    LtoV     0.69<=to>=2     33
2    LtoV         2toHigh     40
3     Age           0to20     36
4     Age          21to40     15
5     Age          41to60     50
6     Indicator     A         30
7     Indicator     B         40
8     Indicator     C         40
9     Oc Code       100,20,30 10
10    Oc Code       5,10,16   20

输出：

当我输入20时，我需要为LtoV生成20行数据，第二列的值应该在0.00到某个高值之间，并根据这个值给出相应的分数。对于字符串列，值应该在20行中重复。对于用逗号分隔的值的列，应该把这些值分开，每个值放在一行。同样，年龄也需要生成20行。

SNo    VarName    column2     Score
    
  1       LtoV       0.00        20     
  2       LtoV       0.42        20
  3       LtoV       1           33
  4       LtoV       2.5         40
...
 20       LtoV      50           40
`1`    Indicator     A         30
 2    Indicator     B         40
 3    Indicator     C         40
 4    Indicator     A         30
 5    Indicator     B         40
 6     Indicator     C         40
 7     Indicator     A         30
 8     Indicator     B         40
....
 20     Indicator     C         40`
 1    Oc Code       100        10
 2    Oc Code       20         10
 3    Oc Code       30         10
 4    Oc Code       5          20
5    Oc Code       10          20
6    Oc Code       15         20
....
20    Oc Code       5          20


generate 20 such values for LtoV and 20 such values for Age and based on the range corresponding value for score.
Is this feasible?

********************

字符串处理数据处理随机数生成数据格式化分组数据生成值范围数据重复

1 个回答

你可以先找出范围的下限和上限，如果没有的话，可以选择设置一个默认的低值和高值（否则就用dropna来处理缺失值）。接着，你可以用sample从每个组中随机抽取20行，设置参数replace=True，然后用numpy.random.uniform来生成随机值：

df = pd.DataFrame({'VarName': ['LtoV', 'LtoV', 'LtoV', 'Age', 'Age', 'Age'],
                   'Label': ['0.00<=to>=0.68', '0.69<=to>=2', '2toHigh', '0to20', '21to40', '41to60'],
                   'Score': [20, 33, 40, 36, 15, 50]})

ranges = (df['Label']
          .str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
          .astype(float)
          .fillna({0: 0, 1: 100}) # default low/high
          .groupby(df['VarName']).sample(n=20, replace=True)
         )

low, high = ranges.to_numpy().T

out = (df.reindex(ranges.index)
         .assign(column2=np.random.uniform(low, high))
         .sort_index()
      )

示例输出：

  VarName           Label  Score    column2
0    LtoV  0.00<=to>=0.68     20   0.670814
0    LtoV  0.00<=to>=0.68     20   0.222490
0    LtoV  0.00<=to>=0.68     20   0.364282
0    LtoV  0.00<=to>=0.68     20   0.607478
0    LtoV  0.00<=to>=0.68     20   0.633055
0    LtoV  0.00<=to>=0.68     20   0.675987
0    LtoV  0.00<=to>=0.68     20   0.531158
1    LtoV     0.69<=to>=2     33   1.843399
1    LtoV     0.69<=to>=2     33   1.940886
1    LtoV     0.69<=to>=2     33   1.760578
1    LtoV     0.69<=to>=2     33   1.110946
1    LtoV     0.69<=to>=2     33   1.724802
2    LtoV         2toHigh     40  19.508126
2    LtoV         2toHigh     40  43.618950
2    LtoV         2toHigh     40  62.808270
2    LtoV         2toHigh     40  95.246445
2    LtoV         2toHigh     40   2.536909
2    LtoV         2toHigh     40  29.841093
2    LtoV         2toHigh     40  73.958545
2    LtoV         2toHigh     40  89.782031
3     Age           0to20     36  12.593195
3     Age           0to20     36   0.110518
3     Age           0to20     36   5.308487
3     Age           0to20     36  12.914926
3     Age           0to20     36  17.198194
3     Age           0to20     36  17.529561
3     Age           0to20     36   0.954200
4     Age          21to40     15  28.722856
4     Age          21to40     15  30.446264
4     Age          21to40     15  26.812708
4     Age          21to40     15  34.001827
4     Age          21to40     15  30.052331
4     Age          21to40     15  28.470239
4     Age          21to40     15  30.371726
5     Age          41to60     50  50.095150
5     Age          41to60     50  57.618241
5     Age          41to60     50  56.214911
5     Age          41to60     50  58.106158
5     Age          41to60     50  41.220661
5     Age          41to60     50  49.831084

中间结果：

(df['Label']
.str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
.astype(float)
.fillna({0: 0, 1: 100}) # default low/high
)

       0       1
0   0.00    0.68
1   0.69    2.00
2   2.00  100.00
3   0.00   20.00
4  21.00   40.00
5  41.00   60.00

回答于 2025-04-13 由 Python大师

分享举报

Python：为groupby值生成值范围

1 个回答

撰写回答