Python: What is an efficient way to sort nested arrays into buckets by repeated values?

>>> data = [[4.0, 0.0, 15.0, 67.0], [3.0, 0.0, 15.0, 72.0], [4.0, 0.0, 15.0, 70.0], [1.0, -0.0, 15.0, 90.0], [3.0, -0.0, 15.0, 75.0], [2.0, -0.0, 15.0, 83.0], [3.0, 0.0, 15.0, 74.0], [4.0, 0.0, 15.0, 69.0], [4.0, 0.0, 14.0, 61.0], [3.0, 0.0, 15.0, 74.0], [3.0, 0.0, 15.0, 75.0], [4.0, 0.0, 15.0, 67.0], [5.0, 0.0, 14.0, 45.0], [6.0, 0.0, 13.0, 30.0], [3.0, 0.0, 15.0, 74.0], [4.0, 0.0, 15.0, 55.0], [7.0, 0.0, 13.0, 22.0], [6.0, 0.0, 13.0, 25.0], [1.0, -0.0, 15.0, 83.0], [7.0, 0.0, 13.0, 18.0]]
>>> sort(data,7)
[[],
 [[1.0, -0.0, 15.0, 90.0], [1.0, -0.0, 15.0, 83.0]],
 [[2.0, -0.0, 15.0, 83.0]],
 [[3.0, 0.0, 15.0, 72.0], [3.0, -0.0, 15.0, 75.0], [3.0, 0.0, 15.0, 74.0], [3.0, 0.0, 15.0, 74.0], [3.0, 0.0, 15.0, 75.0], [3.0, 0.0, 15.0, 74.0]],
 [[4.0, 0.0, 15.0, 67.0], [4.0, 0.0, 15.0, 70.0], [4.0, 0.0, 15.0, 69.0], [4.0, 0.0, 14.0, 61.0], [4.0, 0.0, 15.0, 67.0], [4.0, 0.0, 15.0, 55.0]],
 [[5.0, 0.0, 14.0, 45.0]],
 [[6.0, 0.0, 13.0, 30.0], [6.0, 0.0, 13.0, 25.0]]]

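2 answers

User

#1 · edited 2024-04-25 00:26:15

Optimization attempt

I assume you want to sort the sublists into buckets according to the first value of each sublist.

For simplicity, I use the following to generate random numbers for testing: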

import random

L = 10  # Number of buckets
data = [[round(random.random() * 10.0, 2) for _ in range(3)] for _ in range(10)]

First, about your code, just to make sure I understood your intent correctly.

c = []
d = []
for i in range(L): # Loop over all buckets
    for e in data: # Loop over entire data
        if int(e[0]) == i: # If first float of sublist falls into i-th bucket
            d.append(e) # Append entire sublist to current bucket
    c.append(d) # Append current bucket to list of buckets
    d = [] # Reset

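This is inefficient because you loop over the complete data set for every bucket. If, as you say, you have 8,000 buckets and 2,000,000 lists of floats, you will essentially perform 16,000,000,000 (16 billion) comparisons. Additionally, you fully populate the bucket lists on creation instead of reusing the existing lists in your data variable, so this makes as many reference copies as you have data points.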
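The cost argument above can be checked on a small scale. The following is a minimal sketch, not part of the original answer: the function name count_comparisons_nested and the toy data are assumptions made for illustration. It repeats the nested loop and counts how often the bucket test runs.

```python
def count_comparisons_nested(data, n_buckets):
    """Bucket rows with the nested loop and count the comparisons made.

    Hypothetical helper for illustration only.
    """
    comparisons = 0
    buckets = []
    for i in range(n_buckets):      # Loop over all buckets
        bucket = []
        for row in data:            # Loop over the entire data for each bucket
            comparisons += 1
            if int(row[0]) == i:
                bucket.append(row)
        buckets.append(bucket)
    return buckets, comparisons


# Toy data: 20 rows whose first value cycles through 0.0 .. 4.0
data = [[float(k % 5), 0.0] for k in range(20)]
buckets, comparisons = count_comparisons_nested(data, 5)

print(comparisons)        # 5 buckets * 20 rows = 100
print(8000 * 2_000_000)   # 16000000000, the figure quoted above
```

The count is always the number of buckets times the number of rows, regardless of how the rows are distributed.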

Hence, you should consider working with indices of your data, e.g.

bidx = [int(e[0]) for e in data] # Calculate bucket indices for all sublists
buck = []
for i in range(L): # Loop over all buckets
    lidx = [k for k, b in enumerate(bidx) if b == i] # Get sublist indices for this bucket
    buck.append([data[l] for l in lidx]) # Collect list references
print(buck)

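This results in one iteration over your data, in which the appropriate bucket index is computed for every sublist. Then a second iteration runs over all buckets, and for each bucket the matching entries are collected from bidx (you still have this double loop, but it may be somewhat faster). The resulting lidx holds the positions of the sublists in data that fall into the current bucket. Finally, the list references are collected and stored in the bucket list.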

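The last step may be costly, though, because it involves a lot of reference copying. You should consider storing only the indices in each bucket rather than the whole data: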

lidx = ...
buck.append(lidx)
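As a rough sketch of the indices-only idea, assuming that a single pass over the data is acceptable: the helper name bucket_indices is hypothetical, and unlike the snippet above it fills all buckets in one loop instead of scanning bidx once per bucket.

```python
def bucket_indices(data, n_buckets):
    """Return, for each bucket, the positions in data of its sublists.

    Hypothetical helper; values whose integer part is outside
    range(n_buckets) are skipped.
    """
    buck = [[] for _ in range(n_buckets)]
    for pos, row in enumerate(data):   # One pass over the data
        b = int(row[0])                # Bucket number from the first value
        if 0 <= b < n_buckets:
            buck[b].append(pos)        # Store only the position, not the row
    return buck


data = [
    [4.0, 0.0, 15.0, 67.0],
    [3.0, 0.0, 15.0, 72.0],
    [4.0, 0.0, 15.0, 70.0],
    [1.0, -0.0, 15.0, 90.0],
]
buck = bucket_indices(data, 5)
print(buck)  # [[], [3], [], [1], [0, 2]]

# The rows of a bucket can still be looked up on demand:
print([data[pos] for pos in buck[4]])
```

Keeping positions instead of rows leaves data as the only container that references the sublists; rows are fetched from it only when a bucket is actually read.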

However, optimizing the code alone has its limits with large data

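If your data is large, every linear iteration will be expensive. You can try to reduce the number of iterations as far as possible, but the data size itself defines a lower bound on the cost!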

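If you have to perform further operations on millions of records, you should consider switching to another data representation or format. For example, if you need to perform frequent operations within one script, you may want to think about trees (e.g. b-trees). If you want to store the data for further processing, you may want to consider a database with proper indices.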
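To illustrate the database suggestion, here is a minimal sketch using the sqlite3 module from the standard library. The table name rows, the column names, the index name and the helper fetch_bucket are all assumptions made for this example; the answer does not prescribe any particular database or schema.

```python
import sqlite3

data = [
    [4.0, 0.0, 15.0, 67.0],
    [3.0, 0.0, 15.0, 72.0],
    [4.0, 0.0, 15.0, 70.0],
]

conn = sqlite3.connect(":memory:")  # In-memory database, for the example only
conn.execute("CREATE TABLE rows (bucket INTEGER, a REAL, b REAL, c REAL, d REAL)")
# The index lets the database find one bucket without scanning every row
conn.execute("CREATE INDEX idx_rows_bucket ON rows (bucket)")
conn.executemany(
    "INSERT INTO rows VALUES (?, ?, ?, ?, ?)",
    [(int(r[0]), *r) for r in data],  # Bucket number is stored next to the values
)


def fetch_bucket(conn, bucket):
    """Return the sublists of one bucket in insertion order (hypothetical helper)."""
    cur = conn.execute(
        "SELECT a, b, c, d FROM rows WHERE bucket = ? ORDER BY rowid", (bucket,)
    )
    return [list(r) for r in cur]


print(fetch_bucket(conn, 4))  # [[4.0, 0.0, 15.0, 67.0], [4.0, 0.0, 15.0, 70.0]]
```

With this layout the bucket number is computed once at insert time, and later lookups go through the index instead of a Python loop over all records.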

User
#2 · edited 2024-04-25 00:26:15

Running in Python 3, I got performance 2 orders of magnitude better than jbndlr's with this algorithm:
rl = range(L)  # Generate the range list
buck = [[] for _ in rl]  # Create all the buckets
for seq in data:  # Loop over entire data
    try:
        idx = rl.index(int(seq[0]))  # Find the bucket index
        buck[idx].append(seq)  # Append current data in its bucket
    except ValueError:
        pass  # There is no bucket for that value
Comparing the algorithms with:
L = 1000
data = [[round(random.random() * 1200.0, 2) for _ in range(3)] for _ in range(100000)]
I get:
yours: 26.66 sec
jbndlr: 6.78 sec
mine: 0.07 sec
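For anyone who wants to repeat this kind of comparison, a small harness along the following lines could be used. It is only a sketch under assumptions: the function names are made up, the sizes are deliberately much smaller than in the benchmark above so that it finishes quickly, and the single-pass variant uses a plain bounds check in place of rl.index, which selects the same bucket for a range starting at 0. Timings depend on the machine, so none are quoted here.

```python
import random
import timeit


def buckets_nested(data, n_buckets):
    """Nested-loop version: scan the whole data once per bucket."""
    return [[e for e in data if int(e[0]) == i] for i in range(n_buckets)]


def buckets_single_pass(data, n_buckets):
    """Single-pass version: compute the bucket of each sublist directly."""
    buck = [[] for _ in range(n_buckets)]
    for seq in data:
        idx = int(seq[0])
        if 0 <= idx < n_buckets:   # There is no bucket for values outside the range
            buck[idx].append(seq)
    return buck


random.seed(0)  # Fixed seed so repeated runs use the same data
n_buckets = 100
data = [[round(random.random() * 120.0, 2) for _ in range(3)] for _ in range(2000)]

# Both versions must produce identical buckets before timing them
same = buckets_nested(data, n_buckets) == buckets_single_pass(data, n_buckets)
print("same result:", same)

for name, func in [("nested", buckets_nested), ("single pass", buckets_single_pass)]:
    seconds = timeit.timeit(lambda: func(data, n_buckets), number=3)
    print(f"{name}: {seconds:.4f} sec for 3 runs")
```

Checking that both versions return the same buckets before timing guards against comparing a fast but incorrect implementation.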
