Algorithm for implementing SQL-like grouping and aggregate functions?
Say you have an array like this:
[
{'id' : 1, 'closed' : 1 },
{'id' : 2, 'closed' : 1 },
{'id' : 5, 'closed' : 1 },
{'id' : 7, 'closed' : 0 },
{'id' : 8, 'closed' : 0 },
{'id' : 9, 'closed' : 1 }
]
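I want to summarise this dataset (without using SQL!) and get the min and max id for each group, where a new group starts whenever the 'closed' value changes from one row to the next. The final output should look like this: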
[
{'id__min' : 1, 'id__max' : 5, 'closed' : 1},
{'id__min' : 7, 'id__max' : 8, 'closed' : 0},
{'id__min' : 9, 'id__max' : 9, 'closed' : 1}
]
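This is just an example of what I want to do. I'd like to implement something similar to what Python's itertools.groupby provides, but a bit more comprehensive (I want to define my own aggregate functions).
I'm looking for pointers, pseudocode, or even code in PHP, Python or JavaScript if possible.
Thanks!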
4 Answers
0
Maybe I'm misunderstanding the problem, but isn't this just a standard map/reduce problem?
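To make the suggestion concrete, here is a minimal sketch of the reduce half in Python, using the sample data from the question. The names `rows` and `fold_row` are made up for illustration. Note that because the groups depend on row order (a new group starts whenever 'closed' changes), this is a sequential fold rather than a key-partitioned map/reduce that could run in parallel.

```python
from functools import reduce

# Sample data copied from the question.
rows = [
    {'id': 1, 'closed': 1},
    {'id': 2, 'closed': 1},
    {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0},
    {'id': 8, 'closed': 0},
    {'id': 9, 'closed': 1},
]

def fold_row(groups, row):
    """Reduce step (hypothetical helper): extend the last group or start a new one."""
    if groups and groups[-1]['closed'] == row['closed']:
        # Same 'closed' value as the previous row: widen the current group.
        last = groups[-1]
        last['id__min'] = min(last['id__min'], row['id'])
        last['id__max'] = max(last['id__max'], row['id'])
    else:
        # 'closed' changed (or this is the first row): open a new group.
        groups.append({
            'id__min': row['id'],
            'id__max': row['id'],
            'closed': row['closed'],
        })
    return groups

summary = reduce(fold_row, rows, [])
print(summary)
```

Swapping in a different `fold_row` is how you would plug in another aggregate, such as a count or a sum.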
1
Ruby code:
def summarise array_of_hashes
  #nothing to summarise for an empty array
  return [] if array_of_hashes.empty?
  #first sort the list by id
  arr = array_of_hashes.sort {|a, b| a['id'] <=> b['id'] }
  #create a hash with id_min and id_max set to the id of the first
  #array element and closed to the closed of the first array element
  hash = {}
  hash['id_min'] = hash['id_max'] = arr[0]['id']
  hash['closed'] = arr[0]['closed']
  #prepare an output array
  output = []
  #iterate over the array elements
  arr.each do |el|
    if el['closed'] == hash['closed']
      #update id_max while the closed value is the same
      hash['id_max'] = el['id']
    else #once it is different
      output.push hash #add the hash to the output array
      hash = {} #create a new hash in place of the old one
      #and initiate its keys to the appropriate values
      hash['id_min'] = hash['id_max'] = el['id']
      hash['closed'] = el['closed']
    end
  end
  output.push hash #make sure the final hash is added to the output array
  #return the output array
  output
end
A generic version:
def summarise data, condition, group_func
  #nothing to group for an empty array
  return [] if data.empty?
  #store the first hash in a variable to compare against
  pivot = data[0]
  to_group = []
  output = []
  #iterate through array
  data.each do |datum|
    #if the comparison of this datum to the pivot datum fits the condition
    if condition.call(pivot, datum)
      #add this datum to the to_group list
      to_group.push datum
    else #once the condition no longer matches
      #apply the aggregating function to the list to group and add it to the output array
      output.push group_func.call(to_group)
      #reset the to_group list and add this element to it
      to_group = [datum]
      #set the pivot to this element
      pivot = datum
    end
  end
  #make sure the final list to group is aggregated and added to the output list
  output.push group_func.call(to_group)
  #return the output list
  output
end
The following code will then work for your example:
my_condition = lambda do |a, b|
  b['closed'] == a['closed']
end
my_group_func = lambda do |to_group|
  {
    'id_min' => to_group[0]['id'],
    'id_max' => to_group[-1]['id'],
    'closed' => to_group[0]['closed']
  }
end
summarise(my_array.sort {|a, b| a['id'] <=> b['id']}, my_condition, my_group_func)
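This generic algorithm can be used in any programming language that supports passing functions as arguments to other functions. Given the right condition and aggregating function, it will also work on arrays of any data type.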
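Since the question asked for Python, here is a rough port of the generic version as a sketch. The parameter names `same_group` and `aggregate` are my own, and unlike the Ruby call above it takes the rows in their given order instead of sorting by id first (the sample data is already sorted).

```python
def summarise(data, same_group, aggregate):
    """Split data into runs of consecutive items, then aggregate each run.

    same_group(pivot, datum) decides whether datum belongs to the run
    that started at pivot; aggregate(run) turns a run into one output row.
    """
    output = []
    run = []
    for datum in data:
        # run[0] plays the role of the pivot in the Ruby version.
        if run and not same_group(run[0], datum):
            output.append(aggregate(run))
            run = []
        run.append(datum)
    # Flush the final run; nothing is emitted for empty input.
    if run:
        output.append(aggregate(run))
    return output

# Sample data copied from the question.
rows = [
    {'id': 1, 'closed': 1},
    {'id': 2, 'closed': 1},
    {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0},
    {'id': 8, 'closed': 0},
    {'id': 9, 'closed': 1},
]

result = summarise(
    rows,
    same_group=lambda a, b: a['closed'] == b['closed'],
    aggregate=lambda run: {
        'id__min': min(r['id'] for r in run),
        'id__max': max(r['id'] for r in run),
        'closed': run[0]['closed'],
    },
)
print(result)
```

Using min and max inside the aggregate, rather than the first and last element, means the result does not depend on the ids being sorted within a run.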
2
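The key parameter of itertools.groupby() lets you pass in your own function for computing the grouping key. groupby() starts a new group every time that key changes between consecutive items, so you can then apply whatever aggregation you like to each group it yields.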
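As a sketch of how that could look for the data in the question: the helper `summarise_runs` and the `aggregates` dict below are hypothetical names, not part of itertools. Only `groupby` and `operator.itemgetter` come from the standard library.

```python
from itertools import groupby
from operator import itemgetter

# Sample data copied from the question.
rows = [
    {'id': 1, 'closed': 1},
    {'id': 2, 'closed': 1},
    {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0},
    {'id': 8, 'closed': 0},
    {'id': 9, 'closed': 1},
]

def summarise_runs(rows, key, aggregates):
    """Group consecutive rows sharing rows[i][key] and aggregate each group.

    aggregates maps an output field name to a function that receives the
    list of rows in one group and returns the aggregated value.
    """
    result = []
    for key_value, group in groupby(rows, key=itemgetter(key)):
        # groupby yields a one-shot iterator, so materialise it before
        # running several aggregate functions over it.
        members = list(group)
        summary = {name: func(members) for name, func in aggregates.items()}
        summary[key] = key_value
        result.append(summary)
    return result

aggregates = {
    'id__min': lambda members: min(m['id'] for m in members),
    'id__max': lambda members: max(m['id'] for m in members),
}

summary = summarise_runs(rows, 'closed', aggregates)
print(summary)
```

Adding another aggregate is then just another entry in the dict, for example `'count': len`.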