Does quicksort sort larger numbers faster?

19 votes

3 answers

4497 views

数据工程师

Asked 2025-04-16 11:36

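While playing around in Python to practice my sorting algorithms, I noticed something interesting.

I have three different sets of data:

x = number of values to sort
y = range of the values (all randomly generated integers)
z = total time taken to sort

When:
x = 100000 and
y = (0,100000) then
z = 0.94182094911 sec

When:
x = 100000 and
y = (0,100) then
z = 12.4218382537 sec

When:
x = 100000 and
y = (0,10) then
z = 110.267447809 sec

Any ideas?

Code: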

import time
import random
import sys

#-----Function definitions

def quickSort(array): #random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    for items in array:
        if items <= pivotVal:
            smaller.append(items)
        else:
            greater.append(items)
    return concat(quickSort(smaller), pivotVal, quickSort(greater))

def concat(before, pivot, after):
    new = []
    for items in before:
        new.append(items)
    new.append(pivot)
    for things in after:
        new.append(things)
    return new

#-----Variable definitions
list = []
iter = 0
sys.setrecursionlimit(20000)
start = time.clock() #start the clock

#-----Generate the list of numbers to sort
while(iter < 100000):
    list.append(random.randint(0,10))  #modify this to change sorting speed
    iter = iter + 1
timetogenerate = time.clock() - start #current timer - last timer snapshot

#-----Sort the list of numbers
list = quickSort(list)
timetosort = time.clock() - timetogenerate #current timer - last timer snapshot

#-----Write the list of numbers
file = open("C:\output.txt", 'w')
for items in list:
    file.write(str(items))
    file.write("\n")
file.close()
timetowrite = time.clock() - timetosort #current timer - last timer snapshot

#-----Print info
print "time to start: " + str(start)
print "time to generate: " + str(timetogenerate)
print "time to sort: " + str(timetosort)
print "time to write: " + str(timetowrite)
totaltime = timetogenerate + timetosort + start
print "total time: " + str(totaltime)

-------------------Revised new code----------------------------

def quickSort(array): #random pivot location quicksort. uses extra memory.
    smaller = []
    greater = []
    equal = []
    if len(array) <= 1:
        return array
    pivotVal = array[random.randint(0, len(array)-1)]
    array.remove(pivotVal)
    equal.append(pivotVal)
    for items in array:
        if items < pivotVal:
            smaller.append(items)
        elif items > pivotVal:
            greater.append(items)
        else:
            equal.append(items)
    return concat(quickSort(smaller), equal, quickSort(greater))

def concat(before, equal, after):
    new = []
    for items in before:
        new.append(items)
    for items in equal:
        new.append(items)
    for items in after:
        new.append(items)
    return new

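Tags: data structures, time complexity, sorting algorithms, algorithm optimization, performance testing, random number generation, quicksort, sorting large numbers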

3 Answers

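Quicksort has a well-known weakness: it gets slower when the data is already mostly sorted. For example, 100,000 numbers in the range 0 to 10 are closer to a "mostly sorted" state than 100,000 numbers in the range 0 to 100000.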
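One way to make the difference between the data sets concrete is to count how many distinct values each one contains. The following is a minimal sketch; the helper name `duplicate_profile` is hypothetical and is not part of the question's code.

```python
import random
from collections import Counter

def duplicate_profile(n, upper):
    """Generate n random integers in [0, upper] and summarize the duplicates.

    Returns (number of distinct values, size of the largest group of equal values).
    """
    values = [random.randint(0, upper) for _ in range(n)]
    counts = Counter(values)
    return len(counts), max(counts.values())

# The three ranges used in the question, each with 100000 values.
for upper in (100000, 100, 10):
    distinct, largest_group = duplicate_profile(100000, upper)
    print(upper, distinct, largest_group)
```

With the range (0,10) there are only 11 possible values, so every value is repeated thousands of times, whereas with the range (0,100000) most groups of equal values are tiny.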

Answered 2025-04-16 by Python大师

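What we know:

For an unsorted array, quicksort runs in O(n*logn).
If the array is already sorted, the running time degrades to O(n^2).
The first two statements are not discrete: the closer the array is to sorted, the closer quicksort's running time gets to O(n^2), and conversely, the more we shuffle it, the closer it gets to O(n*logn).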
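Whether sorted input actually triggers the quadratic case depends on how the pivot is chosen. The sketch below, written under the assumption of a simple out-of-place quicksort, counts element-versus-pivot comparisons for a sorted list, once with the pivot always taken from the front and once with a random pivot. The names `count_comparisons`, `first_pivot` and `random_pivot` are hypothetical.

```python
import random

def count_comparisons(data, pick_pivot):
    """Count element-vs-pivot comparisons made by a simple out-of-place quicksort.

    An explicit stack replaces recursion so that very unbalanced splits
    do not hit Python's recursion limit.
    """
    comparisons = 0
    stack = [list(data)]
    while stack:
        part = stack.pop()
        if len(part) <= 1:
            continue
        pivot_index = pick_pivot(part)
        pivot = part[pivot_index]
        rest = part[:pivot_index] + part[pivot_index + 1:]
        comparisons += len(rest)  # every remaining element is compared to the pivot
        stack.append([v for v in rest if v <= pivot])
        stack.append([v for v in rest if v > pivot])
    return comparisons

def first_pivot(part):
    return 0

def random_pivot(part):
    return random.randrange(len(part))

n = 1000
sorted_input = list(range(n))
print(count_comparisons(sorted_input, first_pivot))   # n*(n-1)/2 = 499500
print(count_comparisons(sorted_input, random_pivot))  # far fewer, varies per run
```

With a first-element pivot every split removes only the pivot, which gives n*(n-1)/2 comparisons. With a random pivot, sorted input of distinct values behaves like any other input.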

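Now, let's look at your experiment:

In all three experiments you used the same number of elements, so our n, which you called x, is always 100000.
In your first experiment you used numbers between 0 and 100000, so ideally, with a perfect random number generator, you would get a relatively unsorted list in which most numbers are distinct, which matches the O(n*logn) case.
In your third experiment you used numbers between 0 and 10 in a list of 100000 elements. That means your list contains many duplicates, which makes it look much closer to a sorted list than the list in the first experiment. So in this case the running time is much closer to O(n^2).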

And for the same sufficiently large n, n^2 > n*logn, which is exactly what your experiments confirmed.
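To get a feel for the gap between the two growth rates at n = 100000, here is a small arithmetic sketch. It ignores constant factors and uses a base-2 logarithm, both of which are assumptions made only for illustration.

```python
import math

n = 100000
n_log_n = n * math.log2(n)  # roughly 1.66 million
n_squared = n ** 2          # 10 billion
ratio = n_squared / n_log_n

print(n_log_n, n_squared, ratio)
```

The ratio works out to about 6000, so the two cases sit at opposite ends of a very wide range. The measured slowdown in the question (about 0.94 s versus about 110 s) falls between them.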

Answered 2025-04-16 by Python大师

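I think this has to do with the choice of pivot. Depending on how your partition step works, if your data contains many duplicate values, the algorithm can degenerate when it encounters them, all the way down to quadratic behavior. For example, suppose you are trying to quicksort this data: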

 [0 0 0 0 0 0 0 0 0 0 0 0 0]

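If you aren't careful with the partition step, this can degenerate quickly. For example, suppose you pick the first 0 as your pivot, which leaves you with the array: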

 [0 0 0 0 0 0 0 0 0 0 0 0]

Your algorithm might then decide that the smaller values are this array:

 [0 0 0 0 0 0 0 0 0 0 0 0]

And that the larger values are this array:

[]

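This is the case that makes quicksort degenerate to O(n²), since each recursive call shrinks the input by only one element (namely, by pulling off the pivot).

I noticed that the partition step in your code does exactly this: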

for items in array:
    if items <= pivotVal:
        smaller.append(items)
    else:
        greater.append(items)

Given an input made entirely of identical elements, this partition puts all of them into a single array to be sorted recursively.
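The walk-through above can be reproduced with a short sketch. It copies the `<=` partition from the question into a hypothetical helper `two_way_partition` and records the sizes of the two halves at each level for the 13-zero example.

```python
def two_way_partition(array, pivot_val):
    """Split as in the question: values <= pivot go left, the rest go right."""
    smaller, greater = [], []
    for item in array:
        if item <= pivot_val:
            smaller.append(item)
        else:
            greater.append(item)
    return smaller, greater

data = [0] * 13
sizes = []
while len(data) > 1:
    pivot_val = data.pop(0)  # take the first element as the pivot
    smaller, greater = two_way_partition(data, pivot_val)
    sizes.append((len(smaller), len(greater)))
    data = smaller           # the only non-empty side to recurse into

print(sizes)  # [(12, 0), (11, 0), ..., (1, 0)]
```

Every level removes just the pivot, so 13 equal elements need 12 levels, and n equal elements need about n levels with about n²/2 comparisons in total.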

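Of course, this might look like a contrived case (what does it have to do with reducing the range of values in the array?), but it really does come up when you sort many elements that are not all distinct. In particular, after a few rounds of partitioning you are likely to have grouped all the equal elements together, which brings you to exactly this case.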

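As for how to prevent this, there is a really great talk by Bob Sedgewick and Jon Bentley about how to modify the partition step so that it works quickly in the presence of duplicate elements. The problem is related to Dijkstra's Dutch national flag problem, and their solutions are very clever.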
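For reference, here is a minimal sketch of an in-place three-way partition in the style of the Dutch national flag problem. It is the classic three-pointer scheme and not necessarily the exact variant presented in the talk. The function name and the choice of `a[lo]` as the pivot are assumptions made for illustration.

```python
def three_way_partition(a, lo, hi):
    """Rearrange a[lo..hi] in place around the pivot a[lo].

    Returns (lt, gt) such that, afterwards:
      a[lo:lt]     holds values <  pivot
      a[lt:gt+1]   holds values == pivot
      a[gt+1:hi+1] holds values >  pivot
    """
    pivot = a[lo]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1  # the element swapped in from the right is still unexamined
        else:
            i += 1
    return lt, gt

values = [3, 1, 3, 5, 2, 3, 4]
lt, gt = three_way_partition(values, 0, len(values) - 1)
print(values[:lt], values[lt:gt + 1], values[gt + 1:])
```

A quicksort built on this would recurse only into the first and last regions, so a run of equal elements is finished in a single pass.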

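One option that works is to partition the input into three groups: less than, equal to, and greater than the pivot. Once the input is split this way, you only need to sort the less-than and greater-than groups; the equal group is already sorted. The talk mentioned above shows how to do this more or less in place, but since you are already using an out-of-place quicksort, the fix should be easy. Here is my attempt at it: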

for items in array:
    if items < pivotVal:
        smaller.append(items)
    elif items == pivotVal:
        equal.append(items)
    else:
        greater.append(items)

By the way, I have never written a line of Python in my life, so this may be totally illegal syntax. But I hope the idea is clear! :-)
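For completeness, the fragment above can be filled out into a full function. This is only a sketch under the same out-of-place approach as the question; the name `quicksort3` is hypothetical, and list concatenation stands in for the question's `concat` helper.

```python
import random

def quicksort3(array):
    """Out-of-place quicksort with a three-way split around a random pivot."""
    if len(array) <= 1:
        return array
    pivot_val = random.choice(array)
    smaller, equal, greater = [], [], []
    for item in array:
        if item < pivot_val:
            smaller.append(item)
        elif item == pivot_val:
            equal.append(item)
        else:
            greater.append(item)
    # Elements equal to the pivot are already in their final position,
    # so only the two outer groups are sorted recursively.
    return quicksort3(smaller) + equal + quicksort3(greater)

data = [random.randint(0, 10) for _ in range(100000)]
print(quicksort3(data) == sorted(data))  # True
```

Because each call removes every copy of the pivot value, the recursion depth on the (0,10) data set can never exceed the number of distinct values, which is 11.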

Answered 2025-04-16 by Python大师
