Checking whether parts of a DataFrame row are identical in Python

df_in

   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 ...
   1   3   4   6   0   2   0   3   0   2   0   3   4   5   6   2   4   5   6   2 ...
   .
   .

# pseudocode of the intended logic
for index in range(4, len(df_in.columns)):
    if (df_in.iloc[:, index] == df_in.iloc[:, index+4]).all():
        remove either df_in.iloc[:, index] or df_in.iloc[:, index+4] and keep one
    else:
        keep df_in.iloc[:, index]

1 Answer

User

#1 · Posted 2024-04-29 16:17:29

This may look like a crazy solution. The main idea is to use Python's hash function to detect duplicates:

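import pandas as pd

# original data frame: a single row with 20 columns
df = pd.DataFrame([1,3,4,6,0,2,0,3,0,2,0,3,4,5,6,2,4,5,6,2]).T

# create a hash of the tuple of every subsequence of length 4;
# shift(-3) aligns each hash with the first column of its window
sub4hash = df.iloc[0].rolling(4).apply(lambda s: hash(tuple(s))).shift(-3)

# start of duplication: a window whose hash was already seen earlier.
# The last 3 positions have no full window and are NaN; exclude them,
# because duplicated() flags repeated NaN values as duplicates.
dup_start = sub4hash.duplicated() & sub4hash.notna()

# we want all 4 columns of each duplicated window, so roll again:
markers = dup_start.rolling(4).sum().gt(0)

# finally, keep only the unmarked columns:
df.loc[:, ~markers]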

   0  1  2  3  4  5  6  7  12  13  14  15
0  1  3  4  6  0  2  0  3   4   5   6   2