Representative sampling across multiple columns

Posted 2024-04-19 03:00:58


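I have a dataframe that represents a population, where each column describes a different quality or characteristic of a person. How can I draw a sample from that dataframe/population that is representative of the population across all of those characteristics?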

Suppose I have a dataframe representing a workforce of 650 people, as follows:

import pandas as pd
import numpy as np
c = np.random.choice

colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']

df = pd.DataFrame({'name_id':c(range(3000), 650, replace=False),
              'favourite_colour':c(colours, 650),
              'favourite_knight':c(knights, 650),
              'favourite_quality':c(qualities, 650)})

I can get a sample of the above that reflects the distribution of a single column, like this:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)

# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])

# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

This returns a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)
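
A caveat on the weighted draw above, offered as a quick check rather than a definitive analysis: `df.sample` treats `weights` as per-row selection probabilities, so a plain `df.sample(140)` with no weights already reproduces each category's share in expectation, while weights equal to the category share tilt the draw toward the most common categories. With the roughly uniform knights above, the tilt is barely visible. The sketch below uses a deliberately skewed, made-up population (800 'Arthur', 200 'Robin') and a hypothetical helper `mean_arthur_share` to compare the two draws.

```python
import numpy as np
import pandas as pd

# Made-up, deliberately skewed population (an assumption for illustration only).
population = pd.DataFrame({"favourite_knight": ["Arthur"] * 800 + ["Robin"] * 200})

# Same weighting scheme as above: each row's weight is its category's share.
share = population["favourite_knight"].value_counts(normalize=True)
population["knight_weight"] = population["favourite_knight"].map(share)

def mean_arthur_share(weights, repeats=200, n=100):
    """Average share of 'Arthur' over repeated samples of n rows."""
    shares = []
    for seed in range(repeats):
        s = population.sample(n, weights=weights, random_state=seed)
        shares.append((s["favourite_knight"] == "Arthur").mean())
    return float(np.mean(shares))

uniform_mean = mean_arthur_share(None)                          # no weights
weighted_mean = mean_arthur_share(population["knight_weight"])  # share weights

print(f"population share: 0.80, "
      f"uniform draw: {uniform_mean:.2f}, "
      f"share-weighted draw: {weighted_mean:.2f}")
```

If the share-weighted figure lands well above the population share, that suggests the weights are over-representing the majority category rather than matching it.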

My question is: how can I generate a sample dataframe (df_sample) such that the above, i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

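is true for all columns (except 'name_id'), rather than for just one of them? A population of 650 with a sample size of 140 is about the scale I am working at, so performance is not much of a concern. I would happily accept a solution that takes a few minutes to run, since that would still be far quicker than producing the sample by hand. Thanks for your help.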
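
The "approximately equal for every column" requirement can be turned into a number to check. A minimal sketch, assuming a hypothetical helper `max_share_gap` (not part of pandas) and a small made-up `demo` frame standing in for the workforce data: for each column it reports the largest absolute difference between a category's share in the population and in the sample.

```python
import numpy as np
import pandas as pd

def max_share_gap(population, sample, columns):
    """Per column, the largest absolute gap between population and sample shares."""
    gaps = {}
    for col in columns:
        pop_share = population[col].value_counts(normalize=True)
        # Categories that never made it into the sample count as a share of 0.
        smp_share = (sample[col].value_counts(normalize=True)
                     .reindex(pop_share.index, fill_value=0.0))
        gaps[col] = float((pop_share - smp_share).abs().max())
    return pd.Series(gaps)

# Made-up stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "favourite_colour": rng.choice(["blue", "yellow", "green"], 650),
    "favourite_knight": rng.choice(["Bedevere", "Galahad", "Arthur"], 650),
})
cols = ["favourite_colour", "favourite_knight"]

demo_sample = demo.sample(140, random_state=0)
gaps = max_share_gap(demo, demo_sample, cols)
print(gaps)
```

A gap of 0 means the sample matches the population exactly for that column; what counts as "close enough" is a judgment call for the task at hand.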


1 answer

Answer 1 · posted 2024-04-19 03:00:58

Create a combined feature column, compute a weight for each value of that column, and pass those weights when drawing the sample:

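# Combine the three feature columns into one tuple per row:
df["combined"] = list(zip(df["favourite_colour"],
                          df["favourite_knight"],
                          df["favourite_quality"]))

# Share of the population that has each combination:
combined_weight = df['combined'].value_counts(normalize=True)

# Attach each row's combination share as its sampling weight:
df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])

# Draw 140 rows using the combined weights:
df_sample = df.sample(140, weights=df['combined_weight'])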
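
As an alternative to weighting, and not the method given in this answer, the same combination of columns can be used for proportional stratified sampling: each combination receives a share of the 140 rows proportional to its size, and rows are then drawn at random within each combination. A minimal sketch, assuming a hypothetical helper `stratified_sample`, largest-remainder rounding so the total is exactly `n`, and no missing values in the grouping columns; the `demo` frame is made-up stand-in data.

```python
import numpy as np
import pandas as pd

def stratified_sample(df, columns, n, seed=None):
    """Draw n rows, giving each combination of `columns` a quota
    proportional to its size (largest-remainder rounding)."""
    # Integer id of the combination each row belongs to.
    gid = df.groupby(columns, sort=False).ngroup()
    sizes = gid.value_counts()
    exact = sizes * n / len(df)           # ideal, fractional quota
    quota = np.floor(exact).astype(int)   # round down first
    # Hand the rows lost to rounding to the largest fractional parts.
    shortfall = n - int(quota.sum())
    top = (exact - quota).sort_values(ascending=False).index[:shortfall]
    quota.loc[top] += 1
    rng = np.random.default_rng(seed)
    parts = [df[gid == g].sample(int(k), random_state=rng)
             for g, k in quota.items() if k > 0]
    return pd.concat(parts)

# Made-up stand-in for the workforce data (an assumption for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "name_id": rng.choice(3000, 650, replace=False),
    "favourite_colour": rng.choice(["blue", "yellow", "green"], 650),
    "favourite_knight": rng.choice(["Bedevere", "Galahad", "Arthur", "Robin"], 650),
    "favourite_quality": rng.choice(["wise", "brave", "pure"], 650),
})
features = ["favourite_colour", "favourite_knight", "favourite_quality"]

demo_sample = stratified_sample(demo, features, 140, seed=0)
print(len(demo_sample))
print(demo_sample["favourite_knight"].value_counts(normalize=True))
```

Because every combination is kept within one row of its proportional quota, the per-column shares tend to stay close to the population's, though with many small combinations the rounding still leaves some slack.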
