How to operate on a DataFrame using a pandas Series of arrays that contain DataFrame indices

0 votes

1 answer

52 views

Asked 2025-04-14 16:26

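This comes from the 1990 California housing dataset used in Géron's Hands-On Machine Learning. That gives some context, but this is mainly a pandas and NumPy question. I have a solution, but I wonder whether there is a better way, because mine does not feel elegant.

There are three sets of data involved, each with 16,512 rows. The first is the longitude and latitude of the district each house is in: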

lat_longs = housing.iloc[:, :2]
lat_longs.head()

index	longitude	latitude
13096	-122.42	37.8
14973	-118.38	34.14
3785	-121.98	38.36
14689	-117.11	33.75
20507	-118.15	33.77

The second is the price associated with each of those houses:

housing_labels.head()

index	median_house_value
13096	458300.0
14973	483800.0
3785	101700.0
14689	96100.0
20507	361800.0

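The third is an array of shape 16,512 x 5. Each row holds the indices of the 5 houses nearest to the house that row refers to. (I put it in a DataFrame so it displays more easily in markdown, but it is actually a NumPy array.)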

idx[:5]

index	0	1	2	3	4
0	3059	1266	8382	8461	1138
1	5608	1	11080	9372	13446
2	5394	2	3101	14696	2497
3	3	14935	13839	11401	14826
4	5016	5510	4	708	11889

My goal is to get the median price of those five nearest houses. My solution is:

pd.Series(list(idx)).apply(lambda x: np.median(housing_labels.iloc[x]))

index	0
0	500001.0
1	386700.0
2	111500.0
3	96100.0
4	306300.0

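(The x in the lambda above is all 5 indices for that row.) As I said, this works, but is there a better, faster (I believe apply is slow) and/or more elegant solution?

This pattern, a Series of arrays where each array holds the indices that match some condition for that row, seems common in data science, so I am hoping there is a better approach. Any ideas?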

-Joe

Someone asked for a reproducible example. I tried the solution suggested below on these small DataFrames and arrays, and it gave the correct answer. For housing_labels:

index	median_house_value
13096	458300.0
14973	483800.0
3785	101700.0
14689	96100.0
20507	361800.0
1286	92600.0
18078	349300.0
4396	440900.0
18031	160100.0
6753	183900.0

For idx:

idx = np.array([[13096, 20507, 4396],
                [6753, 3785, 14973],
                [14689, 18078, 18031],
                [14973, 20507, 1286]])

Correct output:

[440900. 183900. 160100. 361800.]

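numpy pandas dataframe array indexing data-science median house-price-prediction data-manipulation

1 Answer

If needed, reindex housing_labels['median_house_value'] so that it covers every index label from 0 up to the largest one, filling the labels that were missing with 0. Then use NumPy advanced indexing, b[idx], together with np.median. This treats the entries of idx as index labels (as with .loc), not as row positions: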

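# dense lookup array: position i holds the value for index label i (0 if the label is missing)
maximal = housing_labels.index.max()
b = housing_labels['median_house_value'].reindex(range(maximal + 1), fill_value=0).to_numpy()

# b[idx] has the same shape as idx; take the median of each row
out = np.median(b[idx], axis=1)
print(out)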
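
Applied to the small reproducible example from the question, the approach might look like the sketch below. The data is copied from the question; representing housing_labels as a Series named median_house_value is an assumption made here to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

# Data copied from the reproducible example in the question.
# Assumption: housing_labels is held as a Series rather than a one-column DataFrame.
housing_labels = pd.Series(
    [458300.0, 483800.0, 101700.0, 96100.0, 361800.0,
     92600.0, 349300.0, 440900.0, 160100.0, 183900.0],
    index=[13096, 14973, 3785, 14689, 20507, 1286, 18078, 4396, 18031, 6753],
    name="median_house_value",
)
idx = np.array([[13096, 20507, 4396],
                [6753, 3785, 14973],
                [14689, 18078, 18031],
                [14973, 20507, 1286]])

# Dense lookup array: position i holds the value for index label i,
# and 0 wherever that label does not occur in housing_labels.
maximal = housing_labels.index.max()
b = housing_labels.reindex(range(maximal + 1), fill_value=0).to_numpy()

# b[idx] has the same shape as idx; take the median across each row.
out = np.median(b[idx], axis=1)
print(out)
```

Note that b has one slot per integer from 0 to the largest label (20,508 entries here for 10 houses), so this trades some memory for a single vectorized lookup.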

Test with small data, for example with index labels below 7:

#if necessary select column median_house_value
#housing_labels = housing_labels['median_house_value']
print (housing_labels)
index
2    45.0
1     4.0
0    10.0
3     9.0
6     3.0
5    10.0
Name: median_house_value, dtype: float64

#create ordered array with indices - here range(7), because maximal index is 6
maximal = housing_labels.index.max()
b = housing_labels.reindex(range(maximal+1), fill_value=0).to_numpy()
print (b)
[10.  4. 45.  9.  0. 10.  3.]

#idx for match indices
print (idx)
[[1 0 3 2 5]
 [1 0 3 6 5]]

#test your solution
out = pd.Series(list(idx)).apply(lambda x: np.median(housing_labels.loc[x]))
print (out)
0    10.0
1     9.0
dtype: float64

#test numpy solution
out = np.median(b[idx], axis=1)
print (out)
[10.  9.]

#test matching by indices
print (b[idx])
[[ 4. 10.  9. 45. 10.]
 [ 4. 10.  9.  3. 10.]]
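
If the index labels are large or sparse, one possible variation is to translate the labels into row positions with Index.get_indexer instead of building a dense array up to the largest label. This is only a sketch that reuses the small test data above. It assumes the index is unique, and it raises an error for labels that are not found instead of treating them as 0, which differs from the reindex approach.

```python
import numpy as np
import pandas as pd

# Small test data from the answer above.
housing_labels = pd.Series(
    [45.0, 4.0, 10.0, 9.0, 3.0, 10.0],
    index=[2, 1, 0, 3, 6, 5],
    name="median_house_value",
)
idx = np.array([[1, 0, 3, 2, 5],
                [1, 0, 3, 6, 5]])

# Translate index labels into row positions.
# get_indexer works on a 1-D target and returns -1 for labels it cannot find.
pos = housing_labels.index.get_indexer(idx.ravel()).reshape(idx.shape)
if (pos < 0).any():
    raise KeyError("idx contains labels that are not in housing_labels.index")

# Positional lookup on the underlying array, then a row-wise median.
values = housing_labels.to_numpy()
out = np.median(values[pos], axis=1)
print(out)
```

If idx already holds row positions rather than labels, as in the iloc version in the question, the translation step can be skipped and values[idx] used directly.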

Answered 2025-04-14 by Python大师
