Finding contiguous regions in a pandas Series

Posted 2024-03-29 01:51:38


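I want to select all regions where the values are greater than 1, provided the region is connected to an element whose value is greater than 5. Two values are not connected if they are separated by a 0.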

Given the following data set

pd.Series(data = [0,2,0,2,3,6,3,0])

the output should be

^{pr2}$

2 answers

OK, it seems I found a one-liner using the pandas groupby function:

import pandas as pd

ts = pd.Series(data = [0,2,0,2,3,6,3,0])

# The flag column identifies the sequences: the counter increases at every 0,
# so each 0 opens a new group. The 0s themselves belong to a group, but as the
# next step shows, that does not matter.
df = pd.concat([ts, (ts==0).cumsum()], axis = 1, keys = ['val', 'flag'])

#   val  flag
#0    0     1
#1    2     1
#2    0     2
#3    2     2
#4    3     2
#5    6     2
#6    3     2
#7    0     3

# Within each group (rows sharing the same flag), combine two conditions with a boolean AND:
# the group contains a value above 5 AND the value itself is above 1 (which excludes the zeros)
df.groupby('flag').transform(lambda x: (x > 1) & (x > 5).any())

#Out[32]: 
#     val
#0  False
#1  False
#2  False
#3   True
#4   True
#5   True
#6   True
#7  False

In case you are wondering, everything can be collapsed into a single line:

^{pr2}$
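
A minimal sketch of such a collapsed version, reconstructed from the steps above and not necessarily the author's exact line. It groups the Series directly by the running count of zeros, so no intermediate DataFrame is needed:

```python
import pandas as pd

ts = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])

# Group by the running count of zeros, then keep values above 1
# inside groups that contain at least one value above 5.
result = ts.groupby((ts == 0).cumsum()).transform(lambda x: (x > 1) & (x > 5).any())

# Boolean indexing with the mask yields the selected values.
selected = ts[result.astype(bool)]
```

The `astype(bool)` call is only a precaution in case a pandas version returns a non-boolean dtype from `transform`.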

I am leaving my first approach here for reference:

import itertools
import pandas as pd

def flatten(l):
    # Util function to flatten a list of lists
    # e.g. [[1], [2,3]] -> [1,2,3]
    return list(itertools.chain(*l))

ts = pd.Series(data = [0,2,0,2,3,6,3,0])
#Get data as list
values = ts.values.tolist()

# As I understand it, the 0s delimit the subsequences (so numbers are not
# connected if they are separated by a 0).
# Note: the code below assumes the series starts and ends with a 0.

# Get location of zeros
gap_loc = [idx for (idx, el) in enumerate(values) if el==0]  
# Re-create pandas series
gap_series = pd.Series(False, index = gap_loc)

# Get values and locations of the subsequences (i.e. separated by zeros)
valid_loc = [range(prev_gap+1,gap) for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
list_seq = [values[prev_gap+1:gap] for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
# list_seq = [[2], [2, 3, 6, 3]]

# Verify your condition
check_condition = [[el>1 and any(map(lambda x: x>5, sublist)) for el in sublist] 
                     for sublist in list_seq]
# Put results back into a pandas Series
valid_series = pd.Series(flatten(check_condition), index = flatten(valid_loc))

# Put everything together:
result = pd.concat([gap_series, valid_series], axis = 0).sort_index()

#result
#Out[101]: 
#0    False
#1    False
#2    False
#3     True
#4     True
#5     True
#6     True
#7    False
#dtype: bool
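
The list-based approach above only covers values that lie between two zeros, so it relies on the series starting and ending with a 0. A sketch of a variant that drops this assumption, using `itertools.groupby` from the standard library to split the values into runs of zero and non-zero entries. The helper name `mask_runs` and its `low` and `high` parameters are made up for this illustration:

```python
from itertools import groupby

import pandas as pd


def mask_runs(values, low=1, high=5):
    # Hypothetical helper: returns one boolean per input value.
    mask = []
    # groupby yields consecutive runs that share the same key,
    # here "is this value zero?".
    for is_zero, run in groupby(values, key=lambda v: v == 0):
        run = list(run)
        # A non-zero run qualifies if it contains any value above `high`.
        keep = (not is_zero) and any(v > high for v in run)
        mask.extend(keep and v > low for v in run)
    return mask


ts = pd.Series(data=[2, 0, 2, 3, 6, 3])  # no zero at either end
result = pd.Series(mask_runs(ts.tolist()), index=ts.index)
```

On the data from the question this produces the same mask as the code above.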

I solved it myself in an ugly way, see below. Still, I would like to know whether there is a better way to do this.

import pandas as pd

test_series = pd.Series(data = [0,2,0,2,3,6,3,0])

bool_df = pd.DataFrame(data= [(test_series>1), (test_series>5)]).T 
bool_df.loc[:,0] = (bool_df.loc[:,0])&(~bool_df.loc[:,1])
# Build a boolean DataFrame.
# Column 0 marks values above 1 and at most 5; column 1 marks values above 5.
# The boolean Series we are looking for is column 1 once the loop below has modified it.



k=0 # k counts through the rows where column 0 is True (values above 1 and at most 5)
while k < len(bool_df.loc[bool_df.loc[:,0],0]):
    i = bool_df.loc[bool_df.loc[:,0],0].index[k] # the bool_df index corresponding to k
    if i > 0: # avoid negative indices
        if bool_df.loc[i-1,1]: # check whether the previous entry is already marked True in column 1
            bool_df.loc[i,1] = True
            k+=1
        else: 
            j=i
            while bool_df.loc[j,0]: # find the end of the streak of values above 1 and at most 5
                j+=1
            bool_df.loc[i:j,1] = bool_df.loc[j,1] # set the whole streak to the value found just past its end (True if that value is >5, False if it is <=1)
            k = sum(bool_df.loc[bool_df.loc[:,0],0].index<j) 
    else:
        k+=1
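
As for a better way, one option that avoids the index bookkeeping of the loop is to label each zero-separated region and keep the regions that contain a value above 5. This is a sketch under the assumption that only zeros separate regions; the variable names `region` and `hot_regions` are chosen for this illustration:

```python
import pandas as pd

test_series = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])

# Each zero increments the counter, so values separated by a zero
# end up with different region labels.
region = (test_series == 0).cumsum()

# Labels of the regions that contain at least one value above 5.
hot_regions = region[test_series > 5].unique()

# Keep values above 1 that sit in one of those regions.
result = region.isin(hot_regions) & (test_series > 1)
```

Since nothing here looks at neighbouring positions, a series that starts or ends inside a streak needs no special handling.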
