Python:为groupby值生成值范围
我正在尝试用Python生成测试数据。请告诉我是否可以实现这样的功能。
我会输入一个数字,比如20,我需要生成20行数据。其中一列包含每组的值范围。这些值应该在这个范围内生成。(有些没有值范围,只有字符串或者用逗号分隔的值,这些应该被分开)下面是示例输入和预期输出。
示例输入:
VarName Label Score
0 LtoV 0.00<=to>=0.68 20
1 LtoV 0.69<=to>=2 33
2 LtoV 2toHigh 40
3 Age 0to20 36
4 Age 21to40 15
5 Age 41to60 50
6 Indicator A 30
7 Indicator B 40
8 Indicator C 40
9 Oc Code 100,20,30 10
10 Oc Code 5,10,16 20
输出:
当我输入20时,我需要为LtoV生成20行数据,第二列的值应该在0.00到某个高值之间,并根据这个值给出相应的分数。对于字符串列,值应该在20行中重复。对于用逗号分隔的值的列,应该把这些值分开,每个值放在一行。同样,年龄也需要生成20行。
SNo VarName column2 Score
1 LtoV 0.00 20
2 LtoV 0.42 20
3 LtoV 1 33
4 LtoV 2.5 40
...
20 LtoV 50 40
`1` Indicator A 30
2 Indicator B 40
3 Indicator C 40
4 Indicator A 30
5 Indicator B 40
6 Indicator C 40
7 Indicator A 30
8 Indicator B 40
....
20 Indicator C 40`
1 Oc Code 100 10
2 Oc Code 20 10
3 Oc Code 30 10
4 Oc Code 5 20
5 Oc Code 10 20
6 Oc Code 15 20
....
20 Oc Code 5 20
generate 20 such values for LtoV and 20 such values for Age and based on the range corresponding value for score.
Is this feasible?
********************
1 个回答
0
你可以先找出范围的下限和上限,如果没有的话,可以选择设置一个默认的低值和高值(否则就用dropna
来处理缺失值)。接着,你可以用sample
从每个组中随机抽取20行,设置参数replace=True
,然后用numpy.random.uniform
来生成随机值:
df = pd.DataFrame({'VarName': ['LtoV', 'LtoV', 'LtoV', 'Age', 'Age', 'Age'],
'Label': ['0.00<=to>=0.68', '0.69<=to>=2', '2toHigh', '0to20', '21to40', '41to60'],
'Score': [20, 33, 40, 36, 15, 50]})
ranges = (df['Label']
.str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
.astype(float)
.fillna({0: 0, 1: 100}) # default low/high
.groupby(df['VarName']).sample(n=20, replace=True)
)
low, high = ranges.to_numpy().T
out = (df.reindex(ranges.index)
.assign(column2=np.random.uniform(low, high))
.sort_index()
)
示例输出:
VarName Label Score column2
0 LtoV 0.00<=to>=0.68 20 0.670814
0 LtoV 0.00<=to>=0.68 20 0.222490
0 LtoV 0.00<=to>=0.68 20 0.364282
0 LtoV 0.00<=to>=0.68 20 0.607478
0 LtoV 0.00<=to>=0.68 20 0.633055
0 LtoV 0.00<=to>=0.68 20 0.675987
0 LtoV 0.00<=to>=0.68 20 0.531158
1 LtoV 0.69<=to>=2 33 1.843399
1 LtoV 0.69<=to>=2 33 1.940886
1 LtoV 0.69<=to>=2 33 1.760578
1 LtoV 0.69<=to>=2 33 1.110946
1 LtoV 0.69<=to>=2 33 1.724802
2 LtoV 2toHigh 40 19.508126
2 LtoV 2toHigh 40 43.618950
2 LtoV 2toHigh 40 62.808270
2 LtoV 2toHigh 40 95.246445
2 LtoV 2toHigh 40 2.536909
2 LtoV 2toHigh 40 29.841093
2 LtoV 2toHigh 40 73.958545
2 LtoV 2toHigh 40 89.782031
3 Age 0to20 36 12.593195
3 Age 0to20 36 0.110518
3 Age 0to20 36 5.308487
3 Age 0to20 36 12.914926
3 Age 0to20 36 17.198194
3 Age 0to20 36 17.529561
3 Age 0to20 36 0.954200
4 Age 21to40 15 28.722856
4 Age 21to40 15 30.446264
4 Age 21to40 15 26.812708
4 Age 21to40 15 34.001827
4 Age 21to40 15 30.052331
4 Age 21to40 15 28.470239
4 Age 21to40 15 30.371726
5 Age 41to60 50 50.095150
5 Age 41to60 50 57.618241
5 Age 41to60 50 56.214911
5 Age 41to60 50 58.106158
5 Age 41to60 50 41.220661
5 Age 41to60 50 49.831084
中间结果:
(df['Label']
.str.extract(r'^(\d+\.?\d*)?\D+(\d+\.?\d*)?$')
.astype(float)
.fillna({0: 0, 1: 100}) # default low/high
)
0 1
0 0.00 0.68
1 0.69 2.00
2 2.00 100.00
3 0.00 20.00
4 21.00 40.00
5 41.00 60.00