Subsetting a DataFrame based on the non-missing values in a column

2 votes
3 answers
39 views
Asked 2025-04-12 19:23

I have a pandas DataFrame:

import pandas as pd
import numpy as np  # needed for np.column_stack below

column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2, column3]), columns=['column1', 'column2', 'column3'])

print(df)
   column1 column2 column3
0     None    None    None
1     None    None    None
2     None    None    None
3        4    None       3
4        8       5    None
5        9       1       7
6     None    None    None
7     None    None    None
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1
12    None    None    None

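I want to select the rows that lie between consecutive values in the third column (the rows holding those values included) and drop all the empty rows. My expected result is: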

print (df1)   
   column1 column2 column3
0        4    None       3
1        8       5    None
2        9       1       7

print(df2)
   column1 column2 column3
0     None       6       7
1        2       3    None
2        3       3    None
3        5    None       1

I don't care about the actual values in the third column. They only serve to mark "start" and "stop".

3 Answers

0

Try this:

import pandas as pd
import numpy as np

column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2,column3]), columns=['column1', 'column2', 'column3'])

# Finished subsets, plus the rows of the subset currently being collected
subsets = []
subset = []
collecting = False  # True between a start marker and its stop marker

# Iterate over the DataFrame rows
for index, row in df.iterrows():
    is_marker = pd.notna(row['column3'])  # Any non-missing value is a start or stop marker
    if is_marker and not collecting:  # Start marker: begin a new subset
        collecting = True
        subset = [row]
    elif collecting:
        subset.append(row)  # Keep every row up to and including the stop marker
        if is_marker:  # Stop marker: close the current subset
            subsets.append(pd.DataFrame(subset))
            subset = []
            collecting = False

for i, subset_df in enumerate(subsets, start=1):
    print(f"Subset {i}:")
    print(subset_df)
    print()

2

Try this:

x = df.column3.dropna().index
for from_, to_ in zip(x[::2], x[1::2]):
    print(df.loc[from_:to_])

Output:

  column1 column2 column3
3       4    None       3
4       8       5    None
5       9       1       7

   column1 column2 column3
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1
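
The question also asks for the empty rows to be dropped and shows the results re-indexed from 0. A minimal sketch of the same pairing idea wrapped in a function might look like the following. The helper name `split_between_markers` and its parameters are hypothetical (they do not appear in the answer), and the sketch assumes the markers alternate strictly as start, stop, start, stop.

```python
import numpy as np
import pandas as pd


def split_between_markers(df, marker_col):
    """Return one DataFrame per start/stop pair of non-missing values in marker_col.

    Hypothetical helper for illustration; assumes markers alternate start, stop, ...
    """
    marker_idx = df[marker_col].dropna().index
    parts = []
    # Pair markers as (1st, 2nd), (3rd, 4th), ...; zip ignores an unpaired trailing marker.
    for start, stop in zip(marker_idx[::2], marker_idx[1::2]):
        block = df.loc[start:stop]           # label-based slice, inclusive at both ends
        block = block.dropna(how="all")      # drop rows in which every column is missing
        parts.append(block.reset_index(drop=True))
    return parts


column1 = [None, None, None, 4, 8, 9, None, None, None, 2, 3, 5, None]
column2 = [None, None, None, None, 5, 1, None, None, 6, 3, 3, None, None]
column3 = [None, None, None, 3, None, 7, None, None, 7, None, None, 1, None]
df = pd.DataFrame(
    np.column_stack([column1, column2, column3]),
    columns=["column1", "column2", "column3"],
)

df1, df2 = split_between_markers(df, "column3")
print(df1)
print(df2)
```

For the sample data this should give two frames of 3 and 4 rows, each indexed from 0. Unpacking into `df1, df2` only works when exactly two pairs exist, so iterating over the returned list is the safer general usage.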

1

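You can flag the non-null values, take their cumulative sum, and then take the result modulo 2. This gives 1 from each start marker up to (but not including) its stop marker. Shifting that result down by one row, adding it to the original, and clipping to the range 0 to 1 extends each group so that it also covers its stop row.
To label the groups, take the row-to-row difference, clip it to the range 0 to 1 so that only the first row of each group is 1, take the cumulative sum, and multiply the result by the 0/1 mask from the previous step.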

g_small = (~df.column3.isna()).cumsum().mod(2)
g = (g_small + g_small.shift(1, fill_value=0)).clip(0, 1)

groups = g.diff(1).fillna(0).clip(0,1).cumsum().astype(int) * g

Next, you can run a groupby on the DataFrame:

dfs = {i: g for i, g in df.groupby(groups) if i > 0}

dfs
# returns:
{1:
   column1 column2 column3
 3       4    None       3
 4       8       5    None
 5       9       1       7,

 2:
    column1 column2 column3
 8     None       6       7
 9        2       3    None
 10       3       3    None
 11       5    None       1}
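
To make the steps above easier to follow, here is a sketch that lays the intermediate series side by side for the sample data. The names `is_marker`, `inside`, `in_block`, `block_id`, and `steps` are my own labels rather than part of the answer, and the final `reset_index` is an optional addition to match the index shown in the question.

```python
import numpy as np
import pandas as pd

column1 = [None, None, None, 4, 8, 9, None, None, None, 2, 3, 5, None]
column2 = [None, None, None, None, 5, 1, None, None, 6, 3, 3, None, None]
column3 = [None, None, None, 3, None, 7, None, None, 7, None, None, 1, None]
df = pd.DataFrame(
    np.column_stack([column1, column2, column3]),
    columns=["column1", "column2", "column3"],
)

# True on every row that carries a start or stop marker
is_marker = df["column3"].notna()

# 1 from each start marker up to (but not including) its stop marker
inside = is_marker.cumsum().mod(2)

# Adding the one-row-shifted copy also covers the stop marker row itself
in_block = (inside + inside.shift(1, fill_value=0)).clip(0, 1)

# A rise from 0 to 1 marks the first row of a block; counting rises numbers the blocks.
# Multiplying by in_block resets rows outside any block to 0.
block_id = in_block.diff().fillna(0).clip(0, 1).cumsum().astype(int) * in_block
# block_id: 0 0 0 1 1 1 0 0 2 2 2 2 0

steps = pd.DataFrame(
    {"is_marker": is_marker, "inside": inside, "in_block": in_block, "block_id": block_id}
)
print(steps)

# Group on the labels, skip label 0 (rows outside any block), and re-index each part
dfs = {i: part.reset_index(drop=True) for i, part in df.groupby(block_id) if i > 0}
```

One caveat with this labeling: if a stop marker is followed by a new start marker on the very next row, `in_block` never drops back to 0 between them, so the two blocks end up with the same label.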
