Reading an XML file in Python (close to 50 MB)

Posted 2024-03-28 20:11:48

I am parsing an XML string into a CSV string, but it is very slow:

INDEX_COLUMN = "{urn:schemas-microsoft-com:office:spreadsheet}Index"
CELL_ELEMENT = "Cell"
DATA_ELEMENT = "Data"

def parse_to_csv_string(xml):
    print('parse_to_csv_string')
    csv = []
    parsed_data = serialize_xml(xml)
    rows = list(parsed_data[1][0])
    header = get_cells_text(rows[0])
    rows.pop(0)
    csv.append(join(",", header))
    for row in rows:
        values = get_cells_text(row)
        csv.append(join(",", values))
    return join("\n", csv)

def serialize_xml(xml):
    return ET.fromstring(xml)

def get_cells_text(row):
    keys = []
    cells = normalize_row_cells(row)
    for elm in cells:
        keys.append(elm[0].text or "")
    while len(keys) < 92:
        keys.append("")
    return keys


def normalize_row_cells(row):
    cells = list(row)
    updated_cells = copy.deepcopy(cells)
    pos = 1
    for elm in cells:
        strIndexAttr = elm.get(INDEX_COLUMN)
        index = int(strIndexAttr) if strIndexAttr else pos
        while index > pos:
            empty_elm = ET.Element(CELL_ELEMENT)
            child = ET.SubElement(empty_elm, DATA_ELEMENT)
            child.text = ""
            updated_cells.insert(pos - 1, empty_elm)
            pos += 1
        pos += 1
    return updated_cells

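The XML string sometimes omits a few columns, and I need to iterate over it to fill in the missing ones, because every row must have 92 columns. That is why I have some helper functions that manipulate the XML.

Right now I am running my function as a Lambda with 4 GB of memory, but I still get timeouts :(

Any ideas on how to improve the performance?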


1 answer
User
#1 · Posted 2024-03-28 20:11:48
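normalize_row_cells constructs ElementTree element instances, but get_cells_text is only interested in the text attribute of each instance's child, so I would consider changing normalize_row_cells to return only the text. It also makes a copy and calls list.insert: inserting an element into the middle of a list can be expensive, because every element after the insertion point has to be moved.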

Something like this (untested code) avoids the copy and the inserts and returns only the required text, which makes get_cells_text redundant:

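def normalize_row_cells(row):
    updated_cells = []
    pos = 1  # 1-based column position of the next cell
    for elm in row:
        strIndexAttr = elm.get(INDEX_COLUMN)
        index = int(strIndexAttr) if strIndexAttr else pos
        # Fill the columns skipped before this cell with empty strings.
        while index > pos:
            updated_cells.append("")
            pos += 1
        updated_cells.append(elm[0].text or "")
        pos += 1
    # Pad short rows so that every row has 92 columns.
    while len(updated_cells) < 92:
        updated_cells.append("")
    return updated_cells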
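
As a quick sanity check of the idea above, here is a minimal, self-contained sketch that runs on a tiny hand-written row. It assumes ET is xml.etree.ElementTree and that gaps are marked with the SpreadsheetML ss:Index attribute, as in the question. The name row_texts, its width parameter, and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
INDEX_COLUMN = "{%s}Index" % SS


def row_texts(row, width=92):
    """Return the Data text of each Cell in row, padded with "" to width columns."""
    texts = []
    for cell in row:
        index_attr = cell.get(INDEX_COLUMN)
        if index_attr:
            # ss:Index is 1-based: fill the skipped columns with "".
            texts.extend([""] * (int(index_attr) - 1 - len(texts)))
        texts.append(cell[0].text or "")
    # Pad rows that end before the last column.
    texts.extend([""] * (width - len(texts)))
    return texts


# Hypothetical sample row: the second cell jumps straight to column 4.
SAMPLE = (
    '<Row xmlns="%s" xmlns:ss="%s">'
    '<Cell><Data>a</Data></Cell>'
    '<Cell ss:Index="4"><Data>d</Data></Cell>'
    '</Row>' % (SS, SS)
)

print(row_texts(ET.fromstring(SAMPLE), width=5))
# prints ['a', '', '', 'd', '']
```

Every value is appended exactly once, so no element is ever shifted the way list.insert shifts them, and nothing is deep-copied.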

If you can match cells to their header names, then using csv.DictWriter from the standard library might be better (you would need to profile to be sure):

import csv
import io


def parse_to_csv_string(xml):
    print('parse_to_csv_string')
    parsed_data = serialize_xml(xml)
    rows = list(parsed_data[1][0])
    # The first row holds the header names; the rest are data rows.
    header = [cell[0].text or "" for cell in rows[0]]
    with io.StringIO() as f:
        # restval="" fills any column that is missing from a row's dict.
        writer = csv.DictWriter(f, fieldnames=header, restval="")
        writer.writeheader()
        for row in rows[1:]:
            writer.writerow(get_cells_text(row))
        data = f.getvalue()
    return data

def get_cells_text(row):
    row_dict = {}
    for cell in row:
        column_name = get_column_name(cell)  # <- can this be done?
        row_dict[column_name] = cell[0].text or ""
    return row_dict
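
One possible answer to the get_column_name question, sketched under the assumption that the header row itself has no gaps, so a cell's 1-based column position can simply be looked up in the header list. The helpers row_to_dict and rows_to_csv and the sample table are hypothetical and not part of the original code. restval and lineterminator are standard csv.DictWriter options.

```python
import csv
import io
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
INDEX_COLUMN = "{%s}Index" % SS


def row_to_dict(row, header):
    """Map each cell's text to its header name, using ss:Index to track position."""
    row_dict = {}
    pos = 1  # 1-based column position of the next cell
    for cell in row:
        index_attr = cell.get(INDEX_COLUMN)
        if index_attr:
            pos = int(index_attr)  # the cell says where it belongs
        row_dict[header[pos - 1]] = cell[0].text or ""
        pos += 1
    return row_dict


def rows_to_csv(rows):
    """Write the header row and the data rows to a CSV string."""
    header = [cell[0].text or "" for cell in rows[0]]
    with io.StringIO() as f:
        # restval="" fills the columns a row does not mention.
        writer = csv.DictWriter(f, fieldnames=header, restval="", lineterminator="\n")
        writer.writeheader()
        for row in rows[1:]:
            writer.writerow(row_to_dict(row, header))
        return f.getvalue()


# Hypothetical sample: the data row skips column 2.
SAMPLE = (
    '<Table xmlns="%s" xmlns:ss="%s">'
    '<Row><Cell><Data>x</Data></Cell><Cell><Data>y</Data></Cell><Cell><Data>z</Data></Cell></Row>'
    '<Row><Cell><Data>1</Data></Cell><Cell ss:Index="3"><Data>3</Data></Cell></Row>'
    '</Table>' % (SS, SS)
)

print(rows_to_csv(list(ET.fromstring(SAMPLE))))
```

Note that this only works if the header names are unique: two columns with the same name would collapse into a single dictionary key. If that can happen, writing plain lists with csv.writer is the safer choice.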
