Sampling from a DataFrame based on another DataFrame (Python and pandas)

Posted 2024-05-14 00:54:13

I hope you are all doing well.

I have two different DataFrames, shown below.

MainTable:

^{tb1}$

SampleTable (I want to sample according to this table):

^{tb2}$

I have tried several approaches, but I would like to know how to randomly sample rows from MainTable based on SampleTable.


1 answer
User
#1 · Posted 2024-05-14 00:54:13

Take a look at this answer:

import pandas as pd

data = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
                     'cols2':[45, 66, 6, 6, 1, 432, 3],
                     'class':['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = pd.DataFrame({'class':['A', 'B', 'C'],
                     'nostoextract':[2, 2, 2], })

def bootstrap(data, freq):
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
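    def sampleClass(classgroup):
        # `classgroup.name` holds the group key, so this works whether or
        # not pandas passes the grouping column to the applied function.
        cls = classgroup.name
        nDesired = freq.loc[cls, 'nostoextract']
        nRows = len(classgroup)

        # Never ask for more rows than the class actually has.
        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)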

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples

print(bootstrap(data,freq))
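
For comparison, here is a minimal sketch of the same per-class sampling written as an explicit loop over the groups followed by `pd.concat`, which avoids `groupby(...).apply` entirely. The function name `sample_per_class` and the `random_state` parameter are illustrative additions, not part of the linked answer, and the toy data is modeled on the example above.

```python
import pandas as pd

# Toy data modeled on the example above.
data = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                     'cols2': [45, 66, 6, 6, 1, 432, 3],
                     'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})
freq = pd.DataFrame({'class': ['A', 'B', 'C'],
                     'nostoextract': [2, 2, 2]})


def sample_per_class(data, freq, random_state=None):
    """Draw up to `nostoextract` rows per class, without replacement.

    Hypothetical helper: name and signature are assumptions for this sketch.
    """
    wanted = freq.set_index('class')['nostoextract']
    parts = []
    for cls, group in data.groupby('class'):
        # Classes missing from `freq` contribute no rows.
        n_desired = int(wanted.get(cls, 0))
        # Cap the request at the number of rows available in the class.
        n_samples = min(len(group), n_desired)
        parts.append(group.sample(n=n_samples, random_state=random_state))
    # The original row labels from `data` are preserved in the result.
    return pd.concat(parts)


result = sample_per_class(data, freq, random_state=0)
print(result)
```

Passing a fixed `random_state` makes the draw repeatable; leave it as `None` for a different sample on each call.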

You can combine the "city", "type" and "year" columns into a single new column:

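Prepare MainTable:

# astype(str) keeps the concatenation working if "year" is stored as a number
MainTable["combination"] = MainTable["city"].astype(str) + MainTable["type"].astype(str) + MainTable["year"].astype(str)

Prepare SampleTable:

SampleTable["combination"] = SampleTable["city"].astype(str) + SampleTable["type"].astype(str) + SampleTable["year"].astype(str)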

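Then sample as in the linked answer, using the counts from SampleTable["combination"].value_counts() in place of freq["class"] to decide how many rows to draw for each combination.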
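
Putting these steps together, a rough sketch might look like the following. The original tables are not shown, so the contents of `MainTable` and `SampleTable` here are made up, and the sketch assumes that each row of `SampleTable` stands for one row to draw. The `add_combination` helper and the `"|"` separator are illustrative choices; a separator just makes it less likely that two different city/type/year triples produce the same string.

```python
import pandas as pd

# Hypothetical stand-ins for the question's tables (the real data is not shown).
MainTable = pd.DataFrame({
    "city": ["Paris", "Paris", "Paris", "Rome", "Rome", "Oslo"],
    "type": ["a", "a", "a", "b", "b", "a"],
    "year": [2020, 2020, 2020, 2021, 2021, 2020],
    "value": [1, 2, 3, 4, 5, 6],
})
SampleTable = pd.DataFrame({
    "city": ["Paris", "Paris", "Rome"],
    "type": ["a", "a", "b"],
    "year": [2020, 2020, 2021],
})


def add_combination(df):
    """Return a copy of `df` with a combined city/type/year key column."""
    out = df.copy()
    out["combination"] = (out["city"].astype(str) + "|"
                          + out["type"].astype(str) + "|"
                          + out["year"].astype(str))
    return out


main = add_combination(MainTable)
sample = add_combination(SampleTable)

# How many rows to draw for each combination.
counts = sample["combination"].value_counts()

parts = []
for combo, group in main.groupby("combination"):
    # Combinations absent from SampleTable get zero rows.
    n_desired = int(counts.get(combo, 0))
    n_samples = min(len(group), n_desired)
    parts.append(group.sample(n=n_samples, random_state=0))

sampled = pd.concat(parts)
print(sampled)
```

If a combination is requested more often than it occurs in `MainTable`, this sketch silently returns fewer rows; sampling with `replace=True` would be one alternative, depending on what the sample is meant to represent.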
