Using melt or wide_to_long on a large dataset with inconsistent column names

Asked 2025-04-14 16:28

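I have a very large dataset that I can't analyze by ordinary means, so I have to rely on analysis tools. Its basic structure is shown below, except that the real data has 16 "ItemType0" columns, plus 16 columns each for "ItemType1", "ItemType2", and so on.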

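The dataset records various properties of up to 16 different items at a single point in time, along with some properties of that time point itself.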

Time  ItemType0[0].property  ItemType0[1].property  Property
1     1                      0                      2
2     0                      1                      2
3     3                      3                      2

What I want to end up with is:

Time  ItemType0.property  Property
1     1                   2
2     0                   2
3     3                   2
1     0                   2
2     1                   2
3     3                   2

import pandas as pd

wide_df = pd.DataFrame({
    "Time": [1,2,3],
    "ItemType0[0].property": [1,0,3],
    "ItemType0[1].property": [0,1,3],
    "Property": [2,2,2]})

What I have tried:

Using melt:
```
ids = [col for col in wide_df.columns if "[" not in col]
inter_df = pd.melt(wide_df, id_vars=ids, var_name="Source")
```
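MemoryError: Unable to allocate 28.3 GiB for an array with shape (15, 506831712) and data type uint32

I don't even know where to start with pd.wide_to_long, because the column names don't all begin the same way.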

2 Answers

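One option is pivot_longer from pyjanitor, where you can pass multiple .value entries to the names_to parameter so that they pair up with the multiple capture groups in names_pattern: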

# pip install pyjanitor
import janitor
import pandas as pd

(wide_df
.pivot_longer(
    column_names="*[*", 
    names_to=('.value','.value'), 
    names_pattern=r"(.+)\[\d+\](.+)"
    )
)

Output:

   Time  Property  ItemType0.property
0     1         2                   1
1     2         2                   0
2     3         2                   3
3     1         2                   0
4     2         2                   1
5     3         2                   3
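
To see why the two `.value` entries end up as a single `ItemType0.property` column, it can help to look at what the two capture groups in `names_pattern` pick up. The following is a rough sketch using only Python's `re` module. It approximates the naming logic and is not pyjanitor's actual implementation; the `long_name` helper is hypothetical.

```python
import re

# Same pattern as passed to names_pattern above.
names_pattern = r"(.+)\[\d+\](.+)"

columns = ["Time", "ItemType0[0].property", "ItemType0[1].property", "Property"]


def long_name(column):
    """Join the captured groups; return None when the pattern does not match."""
    match = re.match(names_pattern, column)
    return "".join(match.groups()) if match else None


# Map every wide column name to the long column name it would feed into.
mapping = {column: long_name(column) for column in columns}
for column, target in mapping.items():
    print(column, "->", target)
```

Both bracketed columns map to the same target name, which is why their values are stacked into one column, while `Time` and `Property` do not match and stay as identifier columns.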

pivot_longer streamlines the reshaping. Another option, using pandas only, is to restructure the columns and then use stack:

ids = [col for col in wide_df if '[' not in col]
reshaped = wide_df.set_index(ids)
reshaped.columns = (reshaped
                    .columns
                    .str
                    .split(r'(\[\d+\])', expand=True)
                    .set_names([None,'drop',None])
                   )
reshaped = reshaped.stack(level='drop').droplevel('drop')
reshaped.columns = reshaped.columns.map(lambda x: ''.join(x))
reshaped.reset_index()

Output:

   Time  Property  ItemType0.property
0     1         2                   1
1     1         2                   0
2     2         2                   0
3     2         2                   1
4     3         2                   3
5     3         2                   3
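
Since the question also mentions `pd.wide_to_long`, here is a minimal sketch of how it could be applied once the bracketed index is moved to the end of each column name. It assumes that `Time` alone uniquely identifies each row (`wide_to_long` raises a `ValueError` otherwise) and that every item column follows the `Name[index].attribute` pattern. The `_` separator and the `item` column name are arbitrary choices made for illustration.

```python
import re

import pandas as pd

wide_df = pd.DataFrame({
    "Time": [1, 2, 3],
    "ItemType0[0].property": [1, 0, 3],
    "ItemType0[1].property": [0, 1, 3],
    "Property": [2, 2, 2],
})

# "ItemType0[0].property" -> groups ("ItemType0", "0", ".property")
pattern = re.compile(r"^(.+)\[(\d+)\](.+)$")

renames = {}  # wide name -> "<stub>_<index>"
stubs = []    # stub names in first-seen order
for col in wide_df.columns:
    match = pattern.match(col)
    if match:
        stub = match.group(1) + match.group(3)
        renames[col] = f"{stub}_{match.group(2)}"
        if stub not in stubs:
            stubs.append(stub)

long_df = pd.wide_to_long(
    wide_df.rename(columns=renames),
    stubnames=stubs,
    i="Time",        # assumed to be unique per row
    j="item",        # receives the former bracketed index
    sep="_",
    suffix=r"\d+",
).reset_index()

print(long_df.sort_values(["item", "Time"]))
```

The extra `item` column records which bracketed index each row came from; it can be dropped if it is not needed.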

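Of course, melting increases the number of rows, which consumes more memory. Depending on your dataset, if you still run into memory errors you may need to consider other approaches that keep memory usage down. It would also help if you shared more about what your end goal is.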
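
One possible way to keep memory usage down, sketched here under assumptions rather than tested on data of the question's size, is to reshape one item index at a time. The melt call in the question stacks every item column into a single value column, giving one row per original row per item column. Slicing by item index gives one row per original row per item index instead, and each slice can be handled on its own. The `iter_item_slices` helper is hypothetical.

```python
import re
from collections import defaultdict

import pandas as pd

wide_df = pd.DataFrame({
    "Time": [1, 2, 3],
    "ItemType0[0].property": [1, 0, 3],
    "ItemType0[1].property": [0, 1, 3],
    "Property": [2, 2, 2],
})

# "ItemType0[1].property" -> groups ("ItemType0", "1", ".property")
ITEM_COLUMN = re.compile(r"^(.+)\[(\d+)\](.+)$")


def iter_item_slices(df):
    """Yield one long-format slice per item index (hypothetical helper)."""
    id_cols = [c for c in df.columns if "[" not in c]
    by_index = defaultdict(dict)  # item index -> {wide name: long name}
    for col in df.columns:
        match = ITEM_COLUMN.match(col)
        if match:
            prefix, idx, suffix = match.groups()
            by_index[int(idx)][col] = prefix + suffix
    for idx in sorted(by_index):
        renames = by_index[idx]
        # Only the id columns plus this index's columns are selected here.
        yield df[id_cols + list(renames)].rename(columns=renames)


# For the small example the slices are simply concatenated; on real data
# each slice could be written to disk or aggregated instead.
long_df = pd.concat(iter_item_slices(wide_df), ignore_index=True)
print(long_df)
```

Note that the final `pd.concat` still needs room for the full long result, so on very large data it may be preferable to process or save each slice separately.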

Answered 2025-04-14 by Python大师

If I understand correctly, you can group the columns by ItemTypeX and then explode the result:

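df = wide_df.copy()  # start from the question's frame
# strip the bracketed index so columns of the same item type share one name
df.columns = df.columns.str.replace(r"\[\d+\]", "", regex=True)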

df = df.set_index(["Time", "Property"])
df = df.T.groupby(df.columns).agg(list).T

print(df.reset_index().explode(df.columns.to_list()))

Output:

   Time  Property ItemType0.property
0     1         2                  1
0     1         2                  0
1     2         2                  0
1     2         2                  1
2     3         2                  3
2     3         2                  3

Answered 2025-04-14 by Python大师
