How to sort a pandas DataFrame by value within a hierarchical category structure

Published 2024-04-20 11:42:17


I have a pandas DataFrame:

pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food", "Living : Something", "Living : Anitsomething"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100, 1000, -1000]
})

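Categories and subcategories are separated by colons.

I am trying to sort this DataFrame in descending order of amount (by absolute value) while respecting the hierarchical grouping. That is, the sorted result should look like this: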

Transport                           5000
Transport : Car                     4900
Transport : Train                   100
Household                           1100
Household : Utilities               600
Household : Utilities : Water       400
Household : Utilities : Electric    200
Household : Rent                    400
Household : Cleaning                100
Household : Cleaning : Bathroom     75
Household : Cleaning : Kitchen      25
Living                              250
Living : Something                  1000
Living : Anitsomething              -1000
Living : Other                      150
Living : Food                       100

I can do this recursively in a very inefficient way. It is super slow, but it works:

def sort_hierachical(self, full_df, name_column, sort_column, parent="", level=0):
    # Method on a class (hence self); assumes `import pandas as pd`.
    result_df = pd.DataFrame(columns=full_df.columns)
    # Rows at the current depth whose name starts with the parent category
    part_df = full_df.loc[(full_df[name_column].str.count(':') == level) & (full_df[name_column].str.startswith(parent)), :].copy()
    part_df['abs'] = part_df[sort_column].abs()
    part_df = part_df.sort_values('abs', ascending=False)
    for _, row in part_df.iterrows():
        category = row[name_column]
        row_df = row.to_frame().T  # DataFrame.append was removed in pandas 2.0
        # Recurse into the children of this category
        child_rows = self.sort_hierachical(full_df, name_column, sort_column, category, level+1)
        result_df = pd.concat([result_df, row_df], sort=False)
        if not child_rows.empty:
            result_df = pd.concat([result_df, child_rows], sort=False)
    return result_df

df = self.sort_hierachical(df, "category", "amount")

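My question is: is there a nice, performant way to do this in pandas? Some kind of groupby-then-sort or MultiIndex trick?

Good karma will come to whoever can solve this challenging problem :)

Edit:

This almost works... but the -1000 and 1000 mess up the sort order: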

def _sort_tree_df(self, df, tree_column, sort_column):
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    df.index = pd.MultiIndex.from_frame(df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_key].values)
    for x in range(df.index.nlevels, 0, -1):
        group_lvl = list(range(0, x))
        sort_columns.append(df.groupby(level=group_lvl)[sort_key].transform('max').values)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted = df_sorted.drop(sort_key, axis=1)
    return df_sorted

Edit2:

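OK, I think I have got it working. I am still confused about how lexsort works; I got here by educated trial and error. If you understand it, please feel free to explain. Also feel free to post a better approach.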

def _sort_tree_df(self, df, tree_column, sort_column, delimeter=':'):
    df.index = pd.MultiIndex.from_frame(df[tree_column].str.split(delimeter).apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_column].abs().values)
    for x in range(df.index.nlevels, 0, -1):
        group_lvl = list(range(0, x))
        sort_columns.append(df.groupby(level=group_lvl)[sort_column].transform('sum').abs().values)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    return df_sorted
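
On the lexsort confusion mentioned above: `np.lexsort` takes a sequence of keys and treats the last key as the primary one, with earlier keys only breaking ties. That is why the function appends the group sums last (the loop ends with the top-level group, so top-level totals dominate) and puts the row's own absolute value and name first as tie-breakers. A tiny sketch with made-up arrays (`primary` and `secondary` are illustrative names, not from the question):

```python
import numpy as np

# Two keys: `secondary` only matters where `primary` has ties.
primary = np.array([2, 1, 2, 1])
secondary = np.array([10, 40, 30, 20])

# np.lexsort treats the LAST key in the sequence as the primary sort key.
order = np.lexsort([secondary, primary])
print(order)        # ascending by primary, then by secondary

# Reversing the index array flips the order on every key at once,
# which is what `sort_indexes[::-1]` does in the function above.
print(order[::-1])
```

Here `order` comes out as `[3, 1, 0, 2]`: the two rows with `primary == 1` come first, ordered by `secondary` (20 before 40), then the two rows with `primary == 2`.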

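Edit3: Actually, this does not always sort correctly :(

Edit4: The problem is that I need a way to make transform('sum') apply only to the items where level = x-1.

For example: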

df['level'] = df[tree_column].str.count(':')

sorting_by = df.groupby(level=group_lvl)[sort_column].transform('sum' if 'level' = x-1).abs().values

sorting_by = df.groupby(level=group_lvl).loc['level' = x-1: sort_column].transform('sum').abs().values

Neither of these is valid.

Does anyone know how to do a conditional transform like this on a MultiIndex df?
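
One way to express this kind of conditional transform, as a rough sketch rather than a definitive answer: blank out the values that should not count with `Series.where`, then let `transform('sum')` skip the resulting NaN. The small frame, the `top` grouping key and the `group_key` column below are assumptions made for illustration; the sketch groups by a key Series instead of MultiIndex levels.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Living", "Living : Something", "Living : Anitsomething",
                 "Transport", "Transport : Car"],
    "amount": [250, 1000, -1000, 5000, 4900],
})
df["level"] = df["category"].str.count(":")

# Top-level category name of every row, used as the grouping key.
top = df["category"].str.split(":").str[0].str.strip()

# Keep only the amounts of rows at the wanted level (here level 0);
# every other row becomes NaN.
masked = df["amount"].where(df["level"] == 0)

# 'sum' ignores NaN, so each row receives the sum of the level-0 rows
# of its group only.
df["group_key"] = masked.groupby(top).transform("sum").abs()
print(df)
```

With this data every "Living" row gets 250 and every "Transport" row gets 5000, so the 1000 and -1000 children no longer cancel out or inflate the group key.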


2 answers

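I am not sure I fully understand the question, but I think you should split the column into subcategories and then sort the values according to the hierarchy you want. Something like the following might work.

Create the new columns with: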

for idx, row in df.iterrows():
    # One column per hierarchy level; strip the spaces around each name
    for item, col in zip(row.category.split(':'), ['cat', 'sub_cat', 'sub_sub_cat']):
        df.loc[idx, col] = item.strip()

Then sort by them:

df.sort_values(['cat', 'sub_cat', 'sub_sub_cat', 'amount'])

    category                          amount  cat        sub_cat        sub_sub_cat
3   Household                           1100  Household  NaN            NaN
7   Household : Cleaning                 100  Household  Cleaning       NaN
8   Household : Cleaning : Bathroom       75  Household  Cleaning       Bathroom
9   Household : Cleaning : Kitchen        25  Household  Cleaning       Kitchen
10  Household : Rent                     400  Household  Rent           NaN
4   Household : Utilities                600  Household  Utilities      NaN
6   Household : Utilities : Electric     200  Household  Utilities      Electric
5   Household : Utilities : Water        400  Household  Utilities      Water
11  Living                               250  Living     NaN            NaN
15  Living : Anitsomething             -1000  Living     Anitsomething  NaN
13  Living : Food                        100  Living     Food           NaN
12  Living : Other                       150  Living     Other          NaN
14  Living : Something                  1000  Living     Something      NaN
0   Transport                           5000  Transport  NaN            NaN
1   Transport : Car                     4900  Transport  Car            NaN
2   Transport : Train                    100  Transport  Train          NaN
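
As a side note, the column split in this answer can probably be done without a row loop. The sketch below assumes at most three levels and reuses the answer's column names; `parts` is just an illustrative variable name, and the three-row frame is a cut-down sample of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Household", "Household : Utilities",
                 "Household : Utilities : Water"],
    "amount": [1100, 600, 400],
})

# expand=True returns one column per path component; shorter paths are
# padded with missing values. strip() removes the spaces around each name.
parts = df["category"].str.split(":", expand=True).apply(lambda s: s.str.strip())
parts.columns = ["cat", "sub_cat", "sub_sub_cat"][: parts.shape[1]]

df = df.join(parts)
print(df)
```

The resulting frame can then be passed to `sort_values` in the same way as above.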

OK, it took a while to figure out, but now I am fairly sure this works. It is also much faster than the recursive approach.

def _sort_tree_df(self, df, tree_column, sort_column, delimeter=':'):
    # Requires: import numpy as np; import pandas as pd
    df = df.copy()
    # One helper column per hierarchy level (0, 1, 2, ...), whitespace stripped
    parts = df[tree_column].str.split(delimeter).apply(lambda x: [y.strip() for y in x]).apply(pd.Series)
    for column in parts.columns:
        df[column] = parts[column]
    # np.lexsort uses the LAST key as the primary one, so tie-breakers go first
    sort_columns = [df[tree_column].values]
    sort_columns.append(df[sort_column].abs().values)
    # Depth of each row: number of path components minus one
    df['level'] = parts.notna().sum(axis=1) - 1
    for x in range(len(parts.columns), 0, -1):
        group_columns = list(range(0, x))
        # Only rows at depth x-1 contribute to their group's sum; the rest become NaN
        sorting_by = df[sort_column].where(df['level'] == x-1)
        sorting_by = sorting_by.groupby([df[c] for c in group_columns]).transform('sum').abs().values
        sort_columns.append(sorting_by)
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    # Drop the helper columns from the frame that is actually returned
    df_sorted = df_sorted.drop(columns=list(parts.columns) + ['level'])
    df_sorted.reset_index(drop=True, inplace=True)
    return df_sorted
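
For comparison, here is a minimal sketch of a different technique from the lexsort function above: give every row a sort key made of the ranks of all its ancestors, then sort once with plain Python. It is not the answerer's method. The names `sort_tree`, `rank` and `keys` are made up for this illustration, and the sketch assumes that every parent category has its own row and that category paths are unique.

```python
import pandas as pd

def sort_tree(df, tree_column, sort_column, delimiter=":"):
    """Sort by |sort_column| descending within each level of the hierarchy."""
    # Normalised path of each row, e.g. ("Household", "Utilities", "Water")
    paths = [tuple(p.strip() for p in s.split(delimiter)) for s in df[tree_column]]
    # Rank of each node: (-|amount|, original position). The position keeps
    # ties (such as 1000 and -1000) in their original order.
    rank = {
        path: (-abs(amount), pos)
        for pos, (path, amount) in enumerate(zip(paths, df[sort_column]))
    }
    # A row's key is the chain of ranks from its root down to itself, so a
    # parent's key is a prefix of its children's keys and sorts just before them.
    keys = [tuple(rank[path[:i]] for i in range(1, len(path) + 1)) for path in paths]
    order = sorted(range(len(df)), key=keys.__getitem__)
    return df.iloc[order].reset_index(drop=True)

df = pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food", "Living : Something", "Living : Anitsomething"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100, 1000, -1000],
})
result = sort_tree(df, "category", "amount")
print(result)
```

On the question's data this puts Transport first, then Household (Utilities, Rent, Cleaning, each followed by its own children), then Living with Something and Anitsomething ahead of Other and Food. A missing parent row would raise a `KeyError` in the `rank` lookup, which is why the assumption above matters.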
