Python: sampling rows from a DataFrame without replacement

Published 2024-04-26 09:24:47


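I want to sample rows from a pandas DataFrame without replacement. Here is what I mean: in each iteration of a for loop, I draw a certain number of rows from COMBINED without replacement. I want to make sure that across all 50,000 iterations the same row is never sampled twice. My code below attempts to solve this sampling problem, but I get an error.

COMBINED, TEMP, MERGED, SAMPLE, SAMPLE_2 and PROBABILITY_GENERATED_POISSON are DataFrames; lst is a list.

Please see my code below: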

#FOR LOOP TO SAMPLE FROM COMBINED BASED ON NUMBER OF EVENTS PER YEAR
#AVOIDING REPEATED SAMPLING OF SAME EVENTS
for i in range(50000):
    #IF THERE ARE NO EVENTS FOR THAT PARTICULAR YEAR, THERE WILL BE NO EVENT NUMBER AND NO LOSS
    if PROBABILITY_GENERATED_POISSON.iloc[i,:].item == 0:
        lst.append(0)
    #IF THERE ARE MORE THAN 0 EVENTS FOR THAT YEAR, FOLLOW THE BELOW PROCESS 
    else:
        SAMPLE = COMBINED.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:], 
                                 replace = False,
                                 weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                 axis = 0)
        SAMPLE['Sample'] = i
        #CREATE TEMP DATA FRAME WHICH CONSISTS OF ALL ROWS SAMPLED IN PREVIOUS ITERATIONS
        #except FUNCTION IS FOR ERROR HANDLING - IT PREVENTS THE LOOP FROM STOPPING MIDWAY
        try:
            TEMP = pd.DataFrame(lst)
            #PERFORM AN INNER JOIN - SELECTING COMMON ROWS FROM TEMP AND SAMPLE
            MERGED = TEMP.merge(SAMPLE, how = "inner")
            #AVOIDING DUPLICATION WITHIN LIST
            #IF THERE ARE NO COMMON ROWS (nrow(MERGED) == 0), THEN INPUT SAMPLE INTO lst
            if MERGED.shape[0] == 0:
                lst.append(SAMPLE)
            else:
                #IF THERE ARE COMMON ROWS (nrow(MERGED) > 0), THEN SAMPLE AGAIN, BUT AFTER EXCLUDING THE COMMON ROWS FROM 
                #THE COMBINED DATA FRAME. BY EXCLUDING THE COMMON ROWS, WE ENSURE THAT WE ARE NOT SAMPLING ROWS WHICH
                #WERE SAMPLED IN PREVIOUS ITERATIONS.
                COMBINED_2 = COMBINED.subtract(SAMPLE)
                SAMPLE_2 = COMBINED_2.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:], 
                                 replace = False,
                                 weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                 axis = 0)
                SAMPLE_2['Sample'] = i
                lst.append(SAMPLE_2)
        except:
            continue
    
    print(i)

The error I get is shown in the attached image.

I would appreciate some feedback on my problem.

Thanks


2 answers

Here are two ways to solve this:

  1. A solution using the pandas.sample function
n = 50000
COMBINED.sample(n, replace=False)
  2. A solution using the same simple algorithm as .sample()
# use the diamonds dataset to illustrate and test the algorithm
import seaborn as sns
import pandas as pd

df_input = sns.load_dataset('diamonds')

df_temp = df_input.copy()  # sample from a copy so df_input is left untouched
n_samples = 1000
samples = []
for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp = df_temp.drop(index=sample.index)  # remove the drawn row from the pool
    samples.append(sample)

# DataFrame.append was removed in pandas 2.0; concatenate once at the end
df = pd.concat(samples)

assert df.index.is_unique  # no row was drawn twice
df
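
Applied to the setup in the question, the first approach could look roughly like this: draw every row needed across all iterations in a single call with replace=False, then label each drawn row with the iteration it belongs to. The small `combined` frame, its `weight` column, and the `counts` array are made-up stand-ins for COMBINED, LOSS_EVENT_SAMPLE_PROBABILITY, and PROBABILITY_GENERATED_POISSON, so treat this as a sketch under those assumptions rather than a drop-in fix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's data.
combined = pd.DataFrame({
    "event_id": np.arange(100),
    "loss": rng.gamma(2.0, 1000.0, size=100),
    "weight": rng.random(100),  # plays the role of the sampling weights
})
counts = np.array([0, 2, 1, 0, 3])  # events per iteration (e.g. Poisson draws)

total = int(counts.sum())
if total > len(combined):
    raise ValueError("not enough rows to sample without replacement")

# A single draw without replacement means no row can be picked twice overall.
drawn = combined.sample(n=total, replace=False, weights="weight", random_state=0)

# Label each drawn row with its iteration; iterations with 0 events get no rows.
drawn = drawn.assign(Sample=np.repeat(np.arange(len(counts)), counts))
```

Iterations with zero events simply do not appear in `drawn`, and `drawn.groupby("Sample")` gives the rows per iteration without any merge or retry logic.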

I fixed the error: PROBABILITY_GENERATED_POISSON needs to be a list.
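
The original traceback is not shown, so the following is only a guess at what that fix might look like. `DataFrame.sample` expects `n` to be a single integer, whereas `PROBABILITY_GENERATED_POISSON.iloc[i, :]` is a Series. Assuming a one-column frame of counts (the names `poisson_df`, `events`, and `pool` are invented for illustration), converting the column to a plain list yields one integer per iteration:

```python
import pandas as pd

# Hypothetical one-column frame of yearly event counts.
poisson_df = pd.DataFrame({"events": [0, 2, 1]})

# Flatten the single column into a list of Python ints.
event_counts = poisson_df["events"].tolist()

# Hypothetical pool of rows to sample from.
pool = pd.DataFrame({"event_id": range(10)})
for i, n_events in enumerate(event_counts):
    if n_events == 0:
        continue  # no events this year, nothing to sample
    picked = pool.sample(n=n_events, replace=False)
    pool = pool.drop(index=picked.index)  # exclude rows already used
```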
