如何提高Pandas DataFrame中随机列选择和赋值的效率?
我正在做一个项目,需要根据一个已有的数据表(DataFrame)创建一个新的数据表。在新数据表的每一行中,我会随机选择某些列,并根据这些列中的数字的比例来分配值。
不过,我现在的做法似乎效率不高,特别是在处理大数据集的时候。我想寻求一些建议,看看如何能优化这个过程,提高性能。
以下是我目前做的一个简化版本:
import pandas as pd
import numpy as np
# Sample DataFrame
data = {
'dog': [1, 2, 3, 4],
'cat': [5, 6, 7, 8],
'parrot': [9, 10, 11, 12],
'owner': ['fred', 'bob', 'jim', 'jannet']
}
df = pd.DataFrame(data)
# List of relevant columns
relevant_col_list = ['dog', 'cat', 'parrot']
# New DataFrame with the same number of rows
new_df = df.copy()
# Create 'iteration_1' column in new_df
new_df['iteration_1'] = ""
# Iterate over rows
for index, row in new_df.iterrows():
# Copy columns not in relevant_col_list
for column in new_df.columns:
if column not in relevant_col_list and column != 'iteration_1':
new_df.at[index, column] = row[column]
# Randomly select a column from relevant_col_list with probability proportional to the number in the column
probabilities = df[relevant_col_list ].iloc[index] / df[relevant_col_list ].iloc[index].sum()
chosen_column = np.random.choice(relevant_col_list , p=probabilities)
# Write the name of the chosen column in the 'iteration_1' column
new_df.at[index, 'iteration_1'] = chosen_column
print(new_df)
我该如何加快这个过程呢?
1 个回答
3
你可以先调整一下你的数据表,选择你感兴趣的列,然后把权重进行标准化,接着创建一个cumsum
。
tmp = (df[relevant_col_list]
.pipe(lambda x: x.div(x.sum(axis=1), axis=0))
.cumsum(axis=1).to_numpy()
)
# cumulated probabilities
array([[0.06666667, 0.4 , 1. ],
[0.11111111, 0.44444444, 1. ],
[0.14285714, 0.47619048, 1. ],
[0.16666667, 0.5 , 1. ]])
之后生成n个在0到1之间的随机数,并用argmax
找出第一个大于这个随机数的列:
r = np.random.random(len(df))
df['iteration_1'] = np.array(relevant_col_list)[(tmp > r[:,None]).argmax(axis=1)]
输出结果:
col1 col2 col3 col4 iteration_1
0 1 5 9 13 col3
1 2 6 10 14 col3
2 3 7 11 15 col1
3 4 8 12 16 col2
中间结果:
# for reproducibility
np.random.seed(1)
tmp
# array([[0.06666667, 0.4 , 1. ],
# [0.11111111, 0.44444444, 1. ],
# [0.14285714, 0.47619048, 1. ],
# [0.16666667, 0.5 , 1. ]])
r = np.random.random(len(df))
# array([0.417022005, 0.720324493, 0.114374817, 0.302332573])
(tmp > r[:,None])
# array([[False, False, True],
# [False, False, True],
# [ True, True, True],
# [False, True, True]])
(tmp > r[:,None]).argmax(axis=1)
# array([2, 2, 0, 1])
np.array(relevant_col_list)[(tmp > r[:,None]).argmax(axis=1)]
# array(['col3', 'col3', 'col1', 'col2'], dtype='<U4')