Dynamically parsing research data in Python

Question

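TL;DR: I'm using Python to collect research data. My initial code is ugly but it works: it gives me some basic information and converts the raw data into a format suitable for heavier statistical analysis in SPSS. However, every time I modify the experiment, I have to go back and rework the analysis code.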

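For a typical experiment I have 30 files, one per unique user. The number of fields is fixed within an experiment (but may differ between experiments by 10 to 20 fields). The files are typically 700 to 1000 records long and have a header row. Records are tab-delimited (for example, the sample data contains 4 integers, 3 strings, and 10 floats).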
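To make the file layout concrete, here is a minimal sketch of reading one such file with the standard `csv` module. The helper name `read_records` and the three-column stand-in data are hypothetical, made up for illustration; a real file would be opened with `open(path, newline="")` instead of the in-memory buffer.

```python
import csv
import io

def read_records(handle):
    """Return (fields, rows) from a tab-delimited file object with a header row."""
    reader = csv.reader(handle, delimiter="\t")
    fields = next(reader)                  # first line is the header
    rows = [row for row in reader if row]  # drop blank lines, e.g. a trailing one
    return fields, rows

# Hypothetical stand-in for one user's file (not real experiment data)
sample = io.StringIO("field0\tfield1\tfield2\n10\tRight\t5.76\n3\tLeft\t8.00\n\n")
fields, rows = read_records(sample)
print(len(fields), len(rows))
```

Because the header is read separately and blank lines are filtered out, the field count does not have to be known in advance, which is the part that changes between experiments.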

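I need to sort my lists of data into categories. In a 1000-line file I may have anywhere from 4 to 256 categories. Rather than determining in advance how many categories each file has, I use the code below to count them. The integers at the start of each line determine which category the float values in that line belong to. The integer combinations can be modified by the string values to produce very different results, and sometimes several combinations can be lumped together into one category.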
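One possible way to express "several combinations count as one category" is a lookup table keyed on tuples of the non-float values. This is only a sketch: the `MERGE` table, the label `"A"`, and the three example keys are hypothetical and not taken from my actual experiments.

```python
from collections import Counter

# Hypothetical merge table: combinations listed here are folded into one label.
MERGE = {
    ("3", "1", "Left"): "A",
    ("3", "1", "Right"): "A",
}

def category(key):
    """Map a combination of non-float values to its category (itself by default)."""
    return MERGE.get(key, key)

# Tuples are hashable, so they can be counted or put in a set directly.
keys = [("3", "1", "Left"), ("3", "1", "Right"), ("5", "19", "Right")]
counts = Counter(category(k) for k in keys)
print(len(counts))
```

Using tuples as keys avoids joining the values into a comma-separated string first, and the number of categories falls out of `len(counts)` without being fixed ahead of time.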

Once the data is categorized, the number crunching begins. I get statistics (mean, standard deviation, etc.) for each category in each file.
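As a rough sketch of that per-category step, the rows can be grouped by their category key and summarized with the standard `statistics` module. The function name `summarize`, its parameters, and the three example rows are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize(rows, key_idx, value_idx):
    """Group one float column by the combination of key columns and summarize it."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        groups[key].append(float(row[value_idx]))
    return {
        key: {
            "n": len(vals),
            "mean": mean(vals),
            # the sample standard deviation needs at least two values
            "sd": stdev(vals) if len(vals) > 1 else None,
        }
        for key, vals in groups.items()
    }

# Hypothetical rows: two key columns followed by one float column
rows = [
    ["3", "Left", "8.0"],
    ["3", "Left", "10.0"],
    ["5", "Right", "4.5"],
]
stats = summarize(rows, key_idx=[0, 1], value_idx=2)
print(stats[("3", "Left")]["mean"])
```

Running the same function once per file and per float column would give the per-file, per-category table that I currently rebuild by hand for each experiment.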

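Bottom line: I need to parse the sample data below into categories. A category is the combination of the non-float values in each record. ~~I would also like to find a dynamic (graphical) way to associate column combinations with categories.~~ I will open a separate post for that part.

I'm looking for suggestions on how to do both of these things.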

    def gettype(s):
        """Classify a string as "int", "float" or "string" without changing the data."""
        try:
            int(s)
            return "int"
        except ValueError:
            pass
        try:
            float(s)
            return "float"
        except ValueError:
            return "string"

    # data is a list of tab separated records
    # fields is a list of my field names

    # get a list of fieldtypes via gettype on our first row of data
    # (data[0] is the header, so we look at data[1])
    fieldtype = [gettype(n) for n in data[1].split('\t')]

    # get the indexes for fields that aren't floats
    mask = [i for i, field in enumerate(fieldtype) if field != "float"]

    # for each row of data (skipping the first and the last, empty, entries)
    # we split on tabs and take the ith element of that split, where i is
    # taken from the list mask, which tells us which fields are not floats
    records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]

    # we now get a unique set of combos
    # since set doesn't happily take a list of lists, we join each row of values
    # together in a comma separated string, so we end up with a set of strings
    uniquerecs = set(",".join(row) for row in records)

    print(len(uniquerecs))
    quit()

Sample data:

field0  field1  field2  field3  field4  field5  field6  field7  field8  field9  field10 field11 field12 field13 field14 field15
10  0   2   1   Right   Right   Right   5.76765674196   0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3   1   3   0   Left    Left    Right   8.00982745764   0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5   19  1   0   Right   Left    Left    4.69440026591   0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3   1   4   2   Left    Right   Left    9.58648184552   0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9   0   0   7   Left    Left    Left    7.65374257547   0.030318719717  0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397

