Best way to split data into three classes

0 votes
1 answer
1785 views
Asked 2025-04-18 05:50

I have a numpy array with the following contents:

[['6.5' '3.2' '5.1' '2.0' 'Iris-virginica'] 
['6.1' '2.8' '4.0' '1.3' 'Iris-versicolor'] 
['4.6' '3.2' '1.4' '0.2' 'Iris-setosa']
['6.0' '2.2' '4.0' '1.0' 'Iris-versicolor']
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']
['6.7' '3.1' '5.6' '2.4' 'Iris-virginica']]

What is the fastest way to split this data into three separate numpy arrays, one per label ('Iris-virginica', 'Iris-setosa', 'Iris-versicolor'), so that:

the Iris-virginica array contains only [['6.5' '3.2' '5.1' '2.0'] ['6.7' '3.1' '5.6' '2.4']]

the Iris-setosa array contains only [['4.6' '3.2' '1.4' '0.2'] ['4.7' '3.2' '1.3' '0.2']]

the Iris-versicolor array contains only [['6.1' '2.8' '4.0' '1.3'] ['6.0' '2.2' '4.0' '1.0']]

1 Answer

1

Using numpy and a list comprehension

import numpy as np

data = [['6.5', '3.2', '5.1', '2.0', 'Iris-virginica'],
        ['6.1', '2.8', '4.0', '1.3', 'Iris-versicolor'],
        ['4.6', '3.2', '1.4', '0.2', 'Iris-setosa'],
        ['6.0', '2.2', '4.0', '1.0', 'Iris-versicolor'],
        ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
        ['6.7', '3.1', '5.6', '2.4', 'Iris-virginica']]

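# Keep the rows labelled 'Iris-virginica' and convert their first three columns to floats.
filtered = [list(map(float, item[:3])) for item in data if item[4] == 'Iris-virginica']
print('mean', np.mean(filtered, axis=0))
print('var ', np.var(filtered, axis=0))

Here, item[4] == 'Iris-virginica' selects the rows you want, map(float, item[:3]) converts the strings in the first three columns to floats (wrapped in list() because map is lazy in Python 3), and np.mean(..., axis=0) computes the per-column mean of the filtered rows. To keep all four measurements, as in the question, slice with item[:4] instead.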

The output is:

mean [ 6.6   3.15  5.35]
var  [ 0.01    0.0025  0.0625]
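The same idea can be extended to build all three groups the question asks for in a single pass. The following is a minimal sketch, assuming the label is always the last column of each row; the helper name split_by_label is hypothetical and not part of the original answer, and the conversion to floats is a choice made here rather than something the question requires.

```python
import numpy as np
from collections import defaultdict

data = [['6.5', '3.2', '5.1', '2.0', 'Iris-virginica'],
        ['6.1', '2.8', '4.0', '1.3', 'Iris-versicolor'],
        ['4.6', '3.2', '1.4', '0.2', 'Iris-setosa'],
        ['6.0', '2.2', '4.0', '1.0', 'Iris-versicolor'],
        ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
        ['6.7', '3.1', '5.6', '2.4', 'Iris-virginica']]


def split_by_label(rows):
    """Group the feature columns of each row by the label in its last column.

    Hypothetical helper: assumes every row ends with its class label.
    """
    groups = defaultdict(list)
    for row in rows:
        *features, label = row
        groups[label].append([float(x) for x in features])
    # Turn each group's list of rows into a 2-D float array.
    return {label: np.array(values) for label, values in groups.items()}


groups = split_by_label(data)
for label, arr in groups.items():
    print(label)
    print(arr)
```

Each value in groups is a separate array, so groups['Iris-setosa'] holds only the two setosa rows. Row order within a group follows the order of the input.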

Update

Here is a numpy-only version, although on this six-row sample it appears to be slower than the list comprehension above.

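data = np.array(data)
# Boolean mask on the label column, then the first three columns as floats.
# np.float was removed in NumPy 1.24; the builtin float is equivalent.
filtered = data[data[:, 4] == 'Iris-virginica'][:, :3].astype(float)
print('mean', np.mean(filtered, axis=0))
print('var ', np.var(filtered, axis=0))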
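The boolean-mask approach also covers the original request for one array per label. This is a sketch under the assumption that the label sits in the last column; the names labels, features and by_label are illustrative only.

```python
import numpy as np

data = np.array([['6.5', '3.2', '5.1', '2.0', 'Iris-virginica'],
                 ['6.1', '2.8', '4.0', '1.3', 'Iris-versicolor'],
                 ['4.6', '3.2', '1.4', '0.2', 'Iris-setosa'],
                 ['6.0', '2.2', '4.0', '1.0', 'Iris-versicolor'],
                 ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
                 ['6.7', '3.1', '5.6', '2.4', 'Iris-virginica']])

labels = data[:, -1]                    # last column holds the class name
features = data[:, :-1].astype(float)   # remaining columns as floats

# One boolean mask per distinct label selects that class's rows.
by_label = {str(name): features[labels == name] for name in np.unique(labels)}

for name, arr in by_label.items():
    print(name, arr.shape)
```

Converting the feature columns once, before masking, avoids repeating the string-to-float conversion for every label.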

The timeit results for the two filtering expressions are:

In [5]: %timeit filtered = [map(float, item[:4]) for item in data if item[4] == 'Iris-virginica']
100000 loops, best of 3: 1.93 µs per loop

In [6]: data = np.array(data)

In [7]: timeit data[data[:, 4] == 'Iris-virginica'][:, :4].astype(np.float)
100000 loops, best of 3: 15.5 µs per loop
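Timings on six rows mostly reflect fixed per-call overhead, so the comparison may look different on larger inputs. The sketch below shows one way to re-measure both expressions with the standard timeit module; the replication factor of 2,000 is an arbitrary assumption, the function names are hypothetical, and no particular result is claimed.

```python
import timeit
import numpy as np

rows = [['6.5', '3.2', '5.1', '2.0', 'Iris-virginica'],
        ['6.1', '2.8', '4.0', '1.3', 'Iris-versicolor'],
        ['4.6', '3.2', '1.4', '0.2', 'Iris-setosa'],
        ['6.0', '2.2', '4.0', '1.0', 'Iris-versicolor'],
        ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
        ['6.7', '3.1', '5.6', '2.4', 'Iris-virginica']]

# Arbitrary size for illustration: repeat the six sample rows 2,000 times.
big_list = rows * 2_000
big_array = np.array(big_list)


def filter_list():
    # List comprehension over plain Python lists.
    return [list(map(float, item[:4])) for item in big_list
            if item[4] == 'Iris-virginica']


def filter_numpy():
    # Boolean mask on the label column of the string array.
    return big_array[big_array[:, 4] == 'Iris-virginica'][:, :4].astype(float)


# Both approaches should select the same values before being compared on speed.
assert np.allclose(filter_list(), filter_numpy())

for fn in (filter_list, filter_numpy):
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Note that filter_numpy starts from an array that has already been built; if the data arrives as a list, the cost of np.array(...) is part of the numpy approach as well.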
