如何提高Pandas DataFrame中随机列选择和赋值的效率?

3 投票
1 回答
48 浏览
提问于 2025-04-14 15:48

我正在做一个项目,需要根据一个已有的数据表(DataFrame)创建一个新的数据表。在新数据表的每一行中,我会随机选择某些列,并根据这些列中的数字的比例来分配值。

不过,我现在的做法似乎效率不高,特别是在处理大数据集的时候。我想寻求一些建议,看看如何能优化这个过程,提高性能。

以下是我目前做的一个简化版本:

import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet']
}
df = pd.DataFrame(data)

# List of relevant columns
relevant_col_list = ['dog', 'cat', 'parrot']

# New DataFrame with the same number of rows
new_df = df.copy()

# Create 'iteration_1' column in new_df
new_df['iteration_1'] = ""

# Iterate over rows
for index, row in new_df.iterrows():
    # Copy columns not in relevant_col_list 
    for column in new_df.columns:
        if column not in relevant_col_list and column != 'iteration_1':
            new_df.at[index, column] = row[column]
    
    # Randomly select a column from relevant_col_list with probability proportional to the number in the column
    probabilities = df[relevant_col_list ].iloc[index] / df[relevant_col_list ].iloc[index].sum()
    chosen_column = np.random.choice(relevant_col_list , p=probabilities)
    
    # Write the name of the chosen column in the 'iteration_1' column
    new_df.at[index, 'iteration_1'] = chosen_column

print(new_df)

我该如何加快这个过程呢?

1 个回答

3

你可以先调整一下你的数据表,选择你感兴趣的列,然后把权重进行标准化,接着创建一个cumsum

tmp = (df[relevant_col_list]
       .pipe(lambda x: x.div(x.sum(axis=1), axis=0))
       .cumsum(axis=1).to_numpy()
      )

# cumulated probabilities
array([[0.06666667, 0.4       , 1.        ],
       [0.11111111, 0.44444444, 1.        ],
       [0.14285714, 0.47619048, 1.        ],
       [0.16666667, 0.5       , 1.        ]])

之后生成n个在0到1之间的随机数,并用argmax找出第一个大于这个随机数的列:

r = np.random.random(len(df))

df['iteration_1'] = np.array(relevant_col_list)[(tmp > r[:,None]).argmax(axis=1)]

输出结果:

   col1  col2  col3  col4 iteration_1
0     1     5     9    13        col3
1     2     6    10    14        col3
2     3     7    11    15        col1
3     4     8    12    16        col2

中间结果:

# for reproducibility
np.random.seed(1)

tmp
# array([[0.06666667, 0.4       , 1.        ],
#        [0.11111111, 0.44444444, 1.        ],
#        [0.14285714, 0.47619048, 1.        ],
#        [0.16666667, 0.5       , 1.        ]])

r = np.random.random(len(df))
# array([0.417022005, 0.720324493, 0.114374817, 0.302332573])

(tmp > r[:,None])
# array([[False, False,  True],
#        [False, False,  True],
#        [ True,  True,  True],
#        [False,  True,  True]])

(tmp > r[:,None]).argmax(axis=1)
# array([2, 2, 0, 1])

np.array(relevant_col_list)[(tmp > r[:,None]).argmax(axis=1)]
# array(['col3', 'col3', 'col1', 'col2'], dtype='<U4')

撰写回答