How do I convert dictionary entries into DataFrame columns?

Posted 2024-05-23 14:28:30


I am using xmltodict to pull data from an XML file into a DataFrame:

import pandas as pd
import xmltodict as xd
parsed = xd.parse(data.strip())
df = pd.DataFrame(parsed["SAMPLE_SET"]["SAMPLE"])

The XML converts to a DataFrame, but one column, df['SAMPLE_ATTRIBUTES'], contains data like this:

OrderedDict([('SAMPLE_ATTRIBUTE', [OrderedDict([('TAG', 'gender'), ('VALUE', 'male')]), OrderedDict([('TAG', 'phenotype'), ('VALUE', 'CML, fuo9')]), OrderedDict([('TAG', 'sample type'), ('VALUE', 'normal tissue')]), OrderedDict([('TAG', 'subject_id'), ('VALUE', '1')]), OrderedDict([('TAG', 'ENA-CHECKLIST'), ('VALUE', 'ERCXX1')])])])
OrderedDict([('SAMPLE_ATTRIBUTE', [OrderedDict([('TAG', 'gender'), ('VALUE', 'female')]), OrderedDict([('TAG', 'phenotype'), ('VALUE', 'CML, fuo4')]), OrderedDict([('TAG', 'sample type'), ('VALUE', 'blood')]), OrderedDict([('TAG', 'subject_id'), ('VALUE', '1')]), OrderedDict([('TAG', 'ENA-CHECKLIST'), ('VALUE', 'ERCXX2')])])])

I want to split these entries out and add them to the DataFrame as new columns, like this:

gender  phenotype   sample type     subject_id  ENA-CHECKLIST
male    CML, fuo9   normal tissue       1       ERCXX1
female  CML, fuo4   blood               1       ERCXX2

Tags: sample, id, dataframe, value, tag, type, gender, ordereddict
1 Answer
User
#1 · Posted 2024-05-23 14:28:30

Use a custom function to flatten each dictionary, then concat the result onto the existing DataFrame:

import pandas as pd
from collections import OrderedDict

df = pd.DataFrame({'SAMPLE_ATTRIBUTES': [
    OrderedDict([('SAMPLE_ATTRIBUTE', [OrderedDict([('TAG', 'gender'), ('VALUE', 'male')]),
                                       OrderedDict([('TAG', 'phenotype'), ('VALUE', 'CML, fuo9')]),
                                       OrderedDict([('TAG', 'sample type'), ('VALUE', 'normal tissue')]),
                                       OrderedDict([('TAG', 'subject_id'), ('VALUE', '1')]),
                                       OrderedDict([('TAG', 'ENA-CHECKLIST'), ('VALUE', 'ERCXX1')])])]),
    OrderedDict([('SAMPLE_ATTRIBUTE', [OrderedDict([('TAG', 'gender'), ('VALUE', 'female')]),
                                       OrderedDict([('TAG', 'phenotype'), ('VALUE', 'CML, fuo4')]),
                                       OrderedDict([('TAG', 'sample type'), ('VALUE', 'blood')]),
                                       OrderedDict([('TAG', 'subject_id'), ('VALUE', '1')]),
                                       OrderedDict([('TAG', 'ENA-CHECKLIST'), ('VALUE', 'ERCXX2')])])])
]})


def extract(di):
    # Turn the list of {'TAG': ..., 'VALUE': ...} entries into one {tag: value} dict
    return {m['TAG']: m['VALUE'] for m in di['SAMPLE_ATTRIBUTE']}


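# One row per sample, one column per distinct TAG
extracted = pd.DataFrame([extract(d) for d in df['SAMPLE_ATTRIBUTES'].tolist()])

# Replace the nested column with the flattened columns
# (the positional axis argument to drop() was removed in pandas 2.0, so use columns=)
res = pd.concat((df.drop(columns='SAMPLE_ATTRIBUTES'), extracted), axis=1)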
print(res)

Output

   gender  phenotype    sample type subject_id ENA-CHECKLIST
0    male  CML, fuo9  normal tissue          1        ERCXX1
1  female  CML, fuo4          blood          1        ERCXX2
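
One caveat with real XML: the extract function above assumes every sample has several SAMPLE_ATTRIBUTE children. By default xmltodict returns a single dict rather than a list when an element occurs only once, and None for an empty element. The following is a minimal sketch of a more defensive variant, not part of the original answer. The name extract_attributes, the alias column, and the sample rows are made up for illustration.

```python
import pandas as pd


def extract_attributes(cell):
    """Flatten one SAMPLE_ATTRIBUTES cell into a {TAG: VALUE} dict."""
    # Empty or missing elements show up as None (or NaN once inside pandas).
    if not isinstance(cell, dict):
        return {}
    attrs = cell.get('SAMPLE_ATTRIBUTE') or []
    # A lone child element is parsed as a single dict, not a one-item list.
    if isinstance(attrs, dict):
        attrs = [attrs]
    return {a['TAG']: a.get('VALUE') for a in attrs}


# Hypothetical data covering the three shapes: list, single dict, missing.
df = pd.DataFrame({
    'alias': ['s1', 's2', 's3'],
    'SAMPLE_ATTRIBUTES': [
        {'SAMPLE_ATTRIBUTE': [{'TAG': 'gender', 'VALUE': 'male'},
                              {'TAG': 'subject_id', 'VALUE': '1'}]},
        {'SAMPLE_ATTRIBUTE': {'TAG': 'gender', 'VALUE': 'female'}},
        None,
    ],
})

# Reuse df.index so the rows line up in concat even if df was filtered earlier.
extracted = pd.DataFrame(
    [extract_attributes(c) for c in df['SAMPLE_ATTRIBUTES']],
    index=df.index,
)
res = pd.concat([df.drop(columns='SAMPLE_ATTRIBUTES'), extracted], axis=1)
print(res)
```

Tags that a sample does not have end up as NaN in that row. Passing index=df.index matters when df no longer has a default 0..n-1 index, since concat with axis=1 aligns on index labels.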
