python：高效处理灵活深度嵌套数据的技巧有哪些？

Question

我的问题不是关于某段具体的代码，而是更一般性的，所以请耐心听我说：

我应该如何组织我正在分析的数据，应该使用哪些工具来管理这些数据呢？

我正在使用Python和Numpy来分析数据。因为Python的文档提到字典在Python中非常高效，而且由于我的数据本身结构很清晰，所以我把它存储在一个层层嵌套的字典里。

这里是字典的一个框架：在层级中的位置决定了元素的性质，每一行新内容代表前一层中一个键的内容：

[AS091209M02] [AS091209M01] [AS090901M06] ... 
[100113] [100211] [100128] [100121] 
[R16] [R17] [R03] [R15] [R05] [R04] [R07] ... 
[1263399103] ... 
[ImageSize] [FilePath] [Trials] [Depth] [Frames] [Responses] ... 
[N01] [N04] ... 
[Sequential] [Randomized] 
[Ch1] [Ch2]

补充说明一下我的数据集：

[individual] ex: [AS091209M02]
[imaging session (date string)] ex: [100113]
[Region imaged] ex: [R16]
[timestamp of file] ex [1263399103]  
[properties of file] ex: [Responses]
[regions of interest in image ] ex [N01]
[format of data] ex [Sequential]
[channel of acquisition: this key indexes an array of values] ex [Ch1]

我进行的操作类型，比如计算数组的属性（在Ch1、Ch2下列出），选择数组来创建一个新集合，比如分析某个个体在不同时间点来自第16区域（R16）的N01的反应等等。

这个结构对我来说效果很好，而且速度很快，正如所承诺的那样。我可以很快分析完整的数据集（而且字典的大小远远不够填满我电脑的内存：只有半个GB）。

我的问题在于，我需要编写字典操作的方式太繁琐了。我经常有一段段代码像这样：

for mk in dic.keys():
    for rgk in dic[mk].keys():
        for nk in dic[mk][rgk].keys():
            for ik in dic[mk][rgk][nk].keys():
                for ek in dic[mk][rgk][nk][ik].keys():
                    #do something

这段代码看起来很丑，繁琐，不可重用，而且不够稳定（每次字典有变化都需要重新编写）。

我尝试使用递归函数，但除了最简单的应用外，我遇到了一些非常棘手的错误和奇怪的行为，浪费了很多时间（在处理深层嵌套的递归函数时，我在ipython中用pdb调试也不太顺利）。最终，我唯一经常使用的递归函数是：

def dicExplorer(dic, depth = -1, stp = 0):
    '''prints the hierarchy of a dictionary.
    if depth not specified, will explore all the dictionary
    '''
    if depth - stp == 0: return
    try : list_keys = dic.keys()
    except AttributeError: return
    stp += 1
    for key in list_keys:
        else: print '+%s> [\'%s\']' %(stp * '---', key)
        dicExplorer(dic[key], depth, stp)

我知道这样做不对，因为我的代码又长又复杂，不能重复使用。我需要使用更好的技术来灵活操作字典，或者把数据放到某种数据库格式中（比如sqlite？）。我的问题是，由于我在编程方面是（不太好）自学的，所以缺乏实践经验和背景知识，无法充分理解可用的选项。我愿意学习新工具（SQL，面向对象编程），无论如何都想完成工作，但我不想把时间和精力投入到对我需求没有帮助的事情上。

那么，你有什么建议来解决这个问题，让我能够以更简洁、灵活和可重用的方式编写我的工具呢？

附加说明：除了对数据字典的某个特定子字典进行操作外，这里有一些我为数据集dic或其子字典实现的操作示例：

实际上，我有一些递归函数效果很好：

def normalizeSeqDic(dic, norm_dic = {}, legend = ()):
    '''returns a normalized dictionary from a seq_amp_dic. Normalization is performed using the first time point as reference
    '''
    try : 
        list_keys = dic.keys()
        for key in list_keys:
            next_legend = legend + (key,) 
            normalizeSeqDic(dic[key], norm_dic, next_legend)
    except AttributeError:
        # normalization
        # unpack list
        mk, ek, nk, tpk = legend
        #assign values to amplitude dict
        if mk not in norm_dic: norm_dic[mk] = {}
        if ek not in norm_dic[mk]: norm_dic[mk][ek] = {}
        if nk not in norm_dic[mk][ek]: norm_dic[mk][ek][nk] = {}
        if tpk not in norm_dic[mk][ek][nk]: norm_dic[mk][ek][nk][tpk] = {}
        new_array = []
        for x in range(dic.shape[0]):
            new_array.append(dic[x][1:]/dic[x][0])
        new_array = asarray(new_array)
        norm_dic[mk][ek][nk][tpk] = new_array
    return norm_dic

def poolDic(dic):
    '''returns a dic in which all the values are pooled, and root (mk) keys are fused
    these pooled dics can later be combined into another dic
    '''
    pooled_dic = {}
    for mk in dic.keys():
        for ek in dic[mk].keys():
            for nk in dic[mk][ek].keys():
                for tpk in dic[mk][ek][nk].keys():
                    #assign values to amplitude dict
                    if ek not in pooled_dic: pooled_dic[ek] = {}
                    if nk not in pooled_dic[ek]: pooled_dic[ek][nk] = {}
                    if tpk not in pooled_dic[ek][nk]:
                        pooled_dic[ek][nk][tpk] = dic[mk][ek][nk][tpk]
                    else: pooled_dic[ek][nk][tpk]= vstack((pooled_dic[ek][nk][tpk], dic[mk][ek][nk][tpk]))
    return pooled_dic

def timePointsDic(dic):
    '''Determines the timepoints for each individual key at root
    '''
    tp_dic = {}
    for mk in dic.keys():
        tp_list = []
        for rgk in dic[mk].keys():
            tp_list.extend(dic[mk][rgk]['Neuropil'].keys())
        tp_dic[mk]=tuple(sorted(list(set(tp_list))))
    return tp_dic

对于某些操作，我发现没有其他方法，只能将字典扁平化：

def flattenDic(dic, label):
    '''flattens a dic to produce a list of of tuples containing keys and 'label' values
    '''
    flat_list = []
    for mk in dic.keys():
        for rgk in dic[mk].keys():
            for nk in dic[mk][rgk].keys():
                for ik in dic[mk][rgk][nk].keys():
                    for ek in dic[mk][rgk][nk][ik].keys():
                        flat_list.append((mk, rgk, nk, ik, ek, dic[mk][rgk][nk][ik][ek][label])
    return flat_list

def extractDataSequencePoints(flat_list, mk, nk, tp_list):
        '''produces a list containing arrays of time point values
        time_points is a list of the time points wished (can have 2 or 3 elements)
        '''
        nb_tp = len(tp_list)
        # build tp_seq list
        tp_seq = []
        tp1, tp2, tp3 = [], [], []
        if nk == 'Neuropil':
            tp1.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil' and x[3] == tp_list[0])
            tp2.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil'and  x[3] == tp_list[1])
        else:
            tp1.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[0])
            tp2.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[1])
        if nb_tp == 3:
            if nk == 'Neuropil':
                tp3.extend(x for x in flat_list if x[0]==mk and x[2] == 'Neuropil'and x[3] == tp_list[2])
            else:
                tp3.extend(x for x in flat_list if x[0]==mk and x[2] != 'Neuropil'and x[3] == tp_list[2])
        for x in tp1:
            for y in tp2:
                if x[0:3] == y[0:3] :
                    if nb_tp == 3:
                        for z in tp3:
                            if x[0:3] == z[0:3] :
                                tp_seq.append(asarray([x[4],y[4],z[4]]))
                    else:
                        tp_seq.append(asarray([x[4],y[4]]))
        return tp_seq

sql 面向对象编程数据结构数据分析数据管理数组操作嵌套字典递归函数

python：高效处理灵活深度嵌套数据的技巧有哪些？

6 个回答

撰写回答