Subsetting a DataFrame based on the non-missing values of a column
I have a pandas DataFrame:
import numpy as np
import pandas as pd

column1 = [None, None, None, 4, 8, 9, None, None, None, 2, 3, 5, None]
column2 = [None, None, None, None, 5, 1, None, None, 6, 3, 3, None, None]
column3 = [None, None, None, 3, None, 7, None, None, 7, None, None, 1, None]
df = pd.DataFrame(np.column_stack([column1, column2, column3]), columns=['column1', 'column2', 'column3'])
print(df)
    column1  column2  column3
0      None     None     None
1      None     None     None
2      None     None     None
3         4     None        3
4         8        5     None
5         9        1        7
6      None     None     None
7      None     None     None
8      None        6        7
9         2        3     None
10        3        3     None
11        5     None        1
12     None     None     None
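I want to select the rows that lie between consecutive values in the third column, and drop all the empty rows. My expected result is: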
print(df1)
   column1  column2  column3
0        4     None        3
1        8        5     None
2        9        1        7
print(df2)
   column1  column2  column3
0     None        6        7
1        2        3     None
2        3        3     None
3        5     None        1
I don't care about the actual values in the third column. They only serve to mark "start" and "stop".
3 Answers
0
Try this. It treats the non-missing values in column3 alternately as start and stop markers:
import numpy as np
import pandas as pd

column1 = [None, None, None, 4, 8, 9, None, None, None, 2, 3, 5, None]
column2 = [None, None, None, None, 5, 1, None, None, 6, 3, 3, None, None]
column3 = [None, None, None, 3, None, 7, None, None, 7, None, None, 1, None]
df = pd.DataFrame(np.column_stack([column1, column2, column3]), columns=['column1', 'column2', 'column3'])

subsets = []        # completed subsets
subset = []         # rows of the subset currently being collected
collecting = False  # True between a start marker and its stop marker

# Iterate over the DataFrame rows
for index, row in df.iterrows():
    is_marker = pd.notna(row['column3'])  # a non-missing value marks start or stop
    if collecting:
        subset.append(row)  # keep every row in the range, including the stop row
        if is_marker:  # stop marker: close the current subset
            subsets.append(pd.DataFrame(subset))
            subset = []
            collecting = False
    elif is_marker:  # start marker: begin a new subset
        subset = [row]
        collecting = True

# Add the last subset if the data ended before a stop marker
if subset:
    subsets.append(pd.DataFrame(subset))

for i, subset_df in enumerate(subsets, start=1):
    print(f"Subset {i}:")
    print(subset_df)
    print()
2
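Try this:
# Index labels of the rows where column3 is not missing
x = df.column3.dropna().index
# Pair them up as (start, stop); .loc slicing includes both endpoints
for from_, to_ in zip(x[::2], x[1::2]):
    print(df.loc[from_:to_])
Output: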
   column1  column2  column3
3        4     None        3
4        8        5     None
5        9        1        7
    column1  column2  column3
8      None        6        7
9         2        3     None
10        3        3     None
11        5     None        1
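If the pieces should look like the `df1` and `df2` in the question (index restarting at 0), the same pairing idea can be extended slightly. This is only a sketch: the names `markers` and `parts` are made up for illustration, and `dropna(how='all')` is added on the assumption that fully empty rows could also appear between a start and a stop marker.

```python
import numpy as np
import pandas as pd

column1 = [None, None, None, 4, 8, 9, None, None, None, 2, 3, 5, None]
column2 = [None, None, None, None, 5, 1, None, None, 6, 3, 3, None, None]
column3 = [None, None, None, 3, None, 7, None, None, 7, None, None, 1, None]
df = pd.DataFrame(np.column_stack([column1, column2, column3]),
                  columns=['column1', 'column2', 'column3'])

# Index labels of the start/stop markers: every non-missing value in column3.
markers = df['column3'].dropna().index

# Pair the markers as (start, stop). Label slicing with .loc includes both ends.
# Rows that are empty in every column are dropped, then the index is renumbered.
parts = [
    df.loc[start:stop].dropna(how='all').reset_index(drop=True)
    for start, stop in zip(markers[::2], markers[1::2])
]

df1, df2 = parts  # this sample data contains exactly two start/stop pairs
print(df1)
print(df2)
```

Note that `zip` silently ignores a trailing start marker that has no matching stop, so an odd number of markers would leave the last range out.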
1
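You can flag the non-null values, take the cumulative sum of those flags, and then take the result modulo 2. That gives the "groups" running from each start position up to (but not including) the matching stop position. Shift that result by one row, add it to the unshifted version, and clip to the range 0 to 1: now each cluster includes both its start and its stop row.
To label the groups, take the difference between consecutive rows, clip it to the range 0 to 1 so that only the 1s marking each group's first row remain, take the cumulative sum, and finally multiply the two results together.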
g_small = (~df.column3.isna()).cumsum().mod(2)
g = (g_small + g_small.shift(1, fill_value=0)).clip(0, 1)
groups = g.diff(1).fillna(0).clip(0, 1).cumsum().astype(int) * g
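To make the arithmetic easier to follow, the intermediate results can be laid out side by side. The sketch below works on `column3` alone; the names `s`, `starts`, and `steps` are introduced here purely for illustration and are not part of the answer's code.

```python
import pandas as pd

column3 = [None, None, None, 3, None, 7, None, None, 7, None, None, 1, None]
s = pd.Series(column3, name='column3')

# 1 from each start marker up to, but not including, the matching stop marker.
g_small = s.notna().cumsum().mod(2)
# g_small.tolist() -> [0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]

# Adding the shifted copy extends each run by one row so it covers the stop marker.
g = (g_small + g_small.shift(1, fill_value=0)).clip(0, 1)
# g.tolist()       -> [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0]

# 1 only on the first row of each run; the cumulative sum numbers the runs.
starts = g.diff(1).fillna(0).clip(0, 1)
groups = starts.cumsum().astype(int) * g
# groups.tolist()  -> [0, 0, 0, 1, 1, 1, 0, 0, 2, 2, 2, 2, 0]

steps = pd.DataFrame({'column3': s, 'g_small': g_small, 'g': g, 'groups': groups})
print(steps)
```

Rows labelled 0 lie outside every start/stop range, which is why group 0 is filtered out afterwards.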
Next, you can run a groupby on the DataFrame, skipping group 0 (the rows outside any start/stop range):
dfs = {i: sub for i, sub in df.groupby(groups) if i > 0}
dfs
# returns:
{1:
   column1  column2  column3
3        4     None        3
4        8        5     None
5        9        1        7,
2:
    column1  column2  column3
8      None        6        7
9         2        3     None
10        3        3     None
11        5     None        1}