An algorithm for implementing SQL-like grouping and aggregate functions?

1 vote
4 answers
1861 views
Asked 2025-04-16 08:15

Suppose you have an array like this:

[
  {'id' : 1, 'closed' : 1 },
  {'id' : 2, 'closed' : 1 },
  {'id' : 5, 'closed' : 1 },
  {'id' : 7, 'closed' : 0 },
  {'id' : 8, 'closed' : 0 },
  {'id' : 9, 'closed' : 1 }
]

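I want to summarise this data set (without using SQL!) and get the min and max id of each group, where a new group starts whenever the value of 'closed' changes from one row to the next. The final output should look like this: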

[
  {'id__min' : 1, 'id__max' : 5, 'closed' : 1},
  {'id__min' : 7, 'id__max' : 8, 'closed' : 0},
  {'id__min' : 9, 'id__max' : 9, 'closed' : 1}
]

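This is just one example of what I want to do. I'd like to implement something similar to what Python's itertools.groupby provides, but more general: I want to be able to define my own aggregate functions.

I'm looking for hints, pseudocode, or, if possible, actual code in PHP, Python or JavaScript.

Thanks!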

4 Answers

0

Maybe I'm misunderstanding the question, but isn't this just a standard map/reduce problem?
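To make the map/reduce idea concrete, here is a minimal Python sketch. It assumes the rows arrive already ordered by id, and the names `rows`, `fold_row` and `summary` are made up for illustration. Note that a textbook map/reduce groups by key regardless of position, whereas the question asks for runs of consecutive rows, so the reduce step below only ever merges a row into the most recent group.

```python
from functools import reduce

# Sample data from the question, assumed to be ordered by id already.
rows = [
    {'id': 1, 'closed': 1},
    {'id': 2, 'closed': 1},
    {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0},
    {'id': 8, 'closed': 0},
    {'id': 9, 'closed': 1},
]

def fold_row(groups, row):
    """Reduce step: merge `row` into the last group or start a new one."""
    if groups and groups[-1]['closed'] == row['closed']:
        last = groups[-1]
        merged = {
            'id__min': min(last['id__min'], row['id']),
            'id__max': max(last['id__max'], row['id']),
            'closed': row['closed'],
        }
        # Build a new list instead of mutating the accumulator in place.
        return groups[:-1] + [merged]
    # 'closed' changed (or this is the first row): open a new group.
    new_group = {'id__min': row['id'], 'id__max': row['id'], 'closed': row['closed']}
    return groups + [new_group]

summary = reduce(fold_row, rows, [])
```

Swapping the body of `fold_row` is where a different aggregate (sum, count, and so on) would go.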

1

Ruby code:

def summarise array_of_hashes
    #an empty input has no groups (arr[0] below would be nil)
    return [] if array_of_hashes.empty?
    #first sort the list by id
    arr = array_of_hashes.sort_by {|el| el['id'] }
    #create a hash with id_min and id_max set to the id of the first
    #array element and closed to the closed of the first array element
    hash = {}
    hash['id_min'] = hash['id_max'] = arr[0]['id']
    hash['closed'] = arr[0]['closed']
    #prepare an output array
    output = []
    #iterate over the array elements
    arr.each do |el|
        if el['closed'] == hash['closed']
            #update id_max while the closed value is the same
            hash['id_max'] = el['id']
        else #once it is different
            output.push hash #add the hash to the output array
            hash = {} #create a new hash in place of the old one
            #and initiate its keys to the appropriate values
            hash['id_min'] = hash['id_max'] = el['id']
            hash['closed'] = el['closed']
        end
    end
    output.push hash #make sure the final hash is added to the output array
    #return the output array
    output
end

Generic version:

def summarise data, condition, group_func
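    #an empty input has no groups, so there is nothing to aggregate
    return [] if data.empty?
    #store the first hash in a variable to compare the following elements to
    pivot = data[0]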
    to_group = []
    output = []
    #iterate through array
    data.each do |datum|
        #if the comparison of this datum to the pivot datum fits the condition
        if condition.call(pivot, datum)
            #add this datum to the to_group list
            to_group.push datum
        else #once the condition no longer matches
            #apply the aggregating function to the list to group and add it to the output array
            output.push group_func.call(to_group)
            #reset the to_group list and add this element to it
            to_group = [datum]
            #set the pivot to this element
            pivot = datum
        end
    end
    #make sure the final list to group are grouped and added to the output list
    output.push group_func.call(to_group)
    #return the output list
    output
end

The following code will work for your example:

my_condition = lambda do |a, b|
    b['closed'] == a['closed']
end

my_group_func = lambda do |to_group|
    {
        'id_min' => to_group[0]['id'],
        'id_max' => to_group[-1]['id'],
        'closed' => to_group[0]['closed']
    }
end

summarise(my_array.sort {|a, b| a['id'] <=> b['id']}, my_condition, my_group_func)

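This generic algorithm can be used in any programming language that supports passing functions as arguments to other functions. Given a suitable condition and aggregation function, it also works on arrays of any data type.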
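Since the question asked for Python, here is a rough port of the generic version, offered as a sketch rather than a tested library function. The helper names `same_closed` and `min_max_ids` are invented for this example, and the output keys follow the `id__min` / `id__max` spelling from the question. As in the Ruby code, each element is compared against the first element of the current group.

```python
def summarise(data, condition, group_func):
    """Split `data` into consecutive runs and aggregate each run.

    An element joins the current run while condition(pivot, element) is true,
    where pivot is the first element of that run.
    """
    output = []
    to_group = []
    for datum in data:
        # Close the current run once the condition stops holding.
        if to_group and not condition(to_group[0], datum):
            output.append(group_func(to_group))
            to_group = []
        to_group.append(datum)
    # Flush the final run; skipped entirely for empty input.
    if to_group:
        output.append(group_func(to_group))
    return output

def same_closed(pivot, datum):
    return pivot['closed'] == datum['closed']

def min_max_ids(group):
    # Relies on the group being sorted by id, as in the Ruby answer.
    return {
        'id__min': group[0]['id'],
        'id__max': group[-1]['id'],
        'closed': group[0]['closed'],
    }

rows = [
    {'id': 1, 'closed': 1}, {'id': 2, 'closed': 1}, {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0}, {'id': 8, 'closed': 0}, {'id': 9, 'closed': 1},
]
result = summarise(sorted(rows, key=lambda r: r['id']), same_closed, min_max_ids)
```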

2

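itertools.groupby() takes a key parameter that lets you pass in your own function. Note that this function computes the grouping key for each element rather than the aggregate itself; you then apply whatever aggregation you like to each group that groupby yields.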
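A minimal sketch of that approach, assuming the rows are already in the desired order (groupby only merges consecutive elements with equal keys, which is exactly the run-based grouping the question describes). `summarise_runs` and `min_max_id` are hypothetical names, not part of itertools.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {'id': 1, 'closed': 1}, {'id': 2, 'closed': 1}, {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0}, {'id': 8, 'closed': 0}, {'id': 9, 'closed': 1},
]

def summarise_runs(rows, key, aggregate):
    """Apply `aggregate(key_value, group)` to each consecutive run of equal keys."""
    # groupby yields lazy group iterators, so materialise each one as a list.
    return [aggregate(k, list(group)) for k, group in groupby(rows, key=key)]

def min_max_id(closed, group):
    """User-defined aggregate: smallest and largest id within one run."""
    ids = [row['id'] for row in group]
    return {'id__min': min(ids), 'id__max': max(ids), 'closed': closed}

summary = summarise_runs(rows, itemgetter('closed'), min_max_id)
```

Any other aggregate can be plugged in by passing a different function in place of `min_max_id`.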
