Searching a 2D array in Python

Asked on 2025-04-17 14:23

I want to use Python to extract specific rows from a large dataset (9 million rows, 1.4 GB), selecting them by two or more criteria.

For example, from this dataset:

ID1 2   10  2   2   1   2   2   2   2   2   1

ID2 10  12  2   2   2   2   2   2   2   1   2

ID3 2   22  0   1   0   0   0   0   0   1   2

ID4 14  45  0   0   0   0   1   0   0   1   1

ID5 2   8   1   1   1   1   1   1   1   1   2

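Suppose I have the following criteria:

The value in the second column must equal 2, and
the value in the third column must be between 4 and 15

I should get: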

ID1 2   10  2   2   1   2   2   2   2   2   1

ID5 2   8   1   1   1   1   1   1   1   1   2

The problem is that I don't know how to do these operations efficiently on a 2D array in Python.

This is what I have tried:

line_list = []

# Load the whole file into memory ('somefile' is a placeholder path)
with open('somefile') as f:
    for line in f:
        line_list.append(line)

# Set conditions
i = 2
start_range = 4
end_range = 15

# Iterate through the loaded list and split each line into columns
for line in line_list:
    data = line.strip().split()
    # split() returns strings, so convert to int before comparing
    if int(data[1]) == i and start_range <= int(data[2]) <= end_range:
        print(data)

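I want to run this process many times, but my current approach is really slow, even with the data file already loaded into memory.

I was thinking of using numpy arrays, but I don't know how to extract rows based on conditions.

Thanks for your help!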
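For reference, selecting rows by condition in numpy is usually done with a boolean mask. The following is a minimal sketch, assuming the numeric columns fit in memory as an integer array. It uses the sample rows from the question, and the variable names (`ids`, `values`, `mask`) are illustrative only.

```python
import numpy as np

# Sample rows from the question; a real run would read these from the data file.
sample_lines = [
    "ID1 2 10 2 2 1 2 2 2 2 2 1",
    "ID2 10 12 2 2 2 2 2 2 2 1 2",
    "ID3 2 22 0 1 0 0 0 0 0 1 2",
    "ID4 14 45 0 0 0 0 1 0 0 1 1",
    "ID5 2 8 1 1 1 1 1 1 1 1 2",
]

# Keep the text IDs and the numeric columns in two parallel arrays.
ids = np.array([line.split()[0] for line in sample_lines])
values = np.array(
    [[int(x) for x in line.split()[1:]] for line in sample_lines],
    dtype=np.int64,
)

# values[:, 0] is the file's second column, values[:, 1] its third column.
mask = (values[:, 0] == 2) & (values[:, 1] >= 4) & (values[:, 1] <= 15)

selected_ids = ids[mask]
selected_values = values[mask]

for row_id, row in zip(selected_ids, selected_values):
    print(row_id, *row)
```

Once the arrays are built, each new set of conditions only needs a new mask, so repeated queries avoid re-splitting and re-converting every line.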

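Update:

Following the suggestions, I switched to a relational database system. I chose SQLite3 because it is simple to use and quick to deploy.

Loading my file with sqlite3's import feature took about 4 minutes.

I created indexes on the second and third columns to speed up retrieval.

The queries are run from Python using the "sqlite3" module.

It is much faster this way!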
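The setup described in this update might look roughly like the sketch below. The table name `data`, the column names `id`, `col2`, `col3`, the helper `select_rows`, and the choice of a single composite index are all assumptions for illustration; the original post does not give the schema. An in-memory database with the five sample rows stands in for the imported file.

```python
import sqlite3

# In-memory database for illustration; the real data would live in a file
# populated beforehand (the post used sqlite3's import feature for that).
conn = sqlite3.connect(":memory:")

# Hypothetical schema: only the ID and the two filtered columns are modeled.
conn.execute("CREATE TABLE data (id TEXT, col2 INTEGER, col3 INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [("ID1", 2, 10), ("ID2", 10, 12), ("ID3", 2, 22),
     ("ID4", 14, 45), ("ID5", 2, 8)],
)

# Index covering the second and third columns (a composite index is assumed here).
conn.execute("CREATE INDEX idx_col2_col3 ON data (col2, col3)")


def select_rows(connection, col2_value, low, high):
    """Return rows where col2 equals col2_value and col3 lies in [low, high]."""
    cursor = connection.execute(
        "SELECT id, col2, col3 FROM data "
        "WHERE col2 = ? AND col3 BETWEEN ? AND ? ORDER BY id",
        (col2_value, low, high),
    )
    return cursor.fetchall()


matches = select_rows(conn, 2, 4, 15)
print(matches)
```

Passing the conditions as `?` parameters keeps the query reusable across many runs with different values.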


1 Answer

I'd go with something close to what you've already written (untested):

with open('somefile') as fin:
    rows = (line.split() for line in fin)
    # Convert each field to int before comparing; split() yields strings
    take = (row for row in rows if int(row[1]) == 2 and 4 <= int(row[2]) <= 15)
    # data = list(take)
    for row in take:
        pass # do something
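Since the question mentions running the filter many times, one possible variation is to split and convert each line only once and then reuse the parsed rows for every query. This is a sketch under that assumption, not part of the original answer; `parse_rows` and `filter_rows` are hypothetical helper names, and holding 9 million parsed rows as Python lists may need a lot of memory.

```python
def parse_rows(lines):
    """Split each line once and convert the numeric fields to int."""
    parsed = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            parsed.append((fields[0], [int(x) for x in fields[1:]]))
    return parsed


def filter_rows(parsed, second_value, low, high):
    """Yield rows whose 2nd column equals second_value and whose 3rd is in [low, high]."""
    for row_id, nums in parsed:
        # nums[0] is the file's second column, nums[1] its third column.
        if nums[0] == second_value and low <= nums[1] <= high:
            yield row_id, nums


# Sample rows from the question; a real run would pass an open file object instead.
sample_lines = [
    "ID1 2 10 2 2 1 2 2 2 2 2 1",
    "ID2 10 12 2 2 2 2 2 2 2 1 2",
    "ID3 2 22 0 1 0 0 0 0 0 1 2",
    "ID4 14 45 0 0 0 0 1 0 0 1 1",
    "ID5 2 8 1 1 1 1 1 1 1 1 2",
]

parsed = parse_rows(sample_lines)
result_ids = [row_id for row_id, _ in filter_rows(parsed, 2, 4, 15)]
print(result_ids)
```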

Answered on 2025-04-17 by Python大师
