How to efficiently parse fixed-width files?

92 votes

11 answers

88337 views

数据工程师

Asked 2025-04-16 11:20

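I am looking for an efficient way to parse files in which every line has a fixed width. For example, the first 20 characters represent one column, characters 21 through 30 represent another, and so on.

Assuming each line holds 100 characters, what would be a good way to parse a line into its components?

I could slice the string on every line, but that looks a bit ugly when the lines are long. Is there another, faster method?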

data processing, efficient algorithms, file parsing, string slicing, fixed-width files

11 Answers

Here are two more options that are simpler and prettier than the solutions already mentioned:

The first uses the pandas library:

import pandas as pd

path = 'filename.txt'

# Inferred column boundaries, as suggested in the comments by James Paul Mason
data = pd.read_fwf(path, colspecs='infer')

# Or with an explicit column specification; each pair is a half-open
# interval [start, end), so adjacent columns share a boundary
col_specification = [(0, 20), (20, 30), (30, 50), (50, 100)]
data = pd.read_fwf(path, colspecs=col_specification)
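Since pandas treats each colspecs pair as a half-open interval [start, end), it can be less error-prone to derive the pairs from the field widths than to type them by hand. A minimal sketch using only the standard library; colspecs_from_widths is a hypothetical helper name, and the widths are an assumed layout for the 100-character lines in the question:

```python
from itertools import accumulate


def colspecs_from_widths(widths):
    """Turn field widths into half-open (start, end) pairs.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    ends = list(accumulate(widths))   # running totals: 20, 30, 50, 100
    starts = [0] + ends[:-1]          # each field starts where the last ended
    return list(zip(starts, ends))


widths = [20, 10, 20, 50]  # assumed layout: 100 characters per line
colspecs = colspecs_from_widths(widths)
print(colspecs)  # [(0, 20), (20, 30), (30, 50), (50, 100)]
```

The resulting list can be passed as colspecs=. When the fields are contiguous, read_fwf also accepts the widths directly through its widths= parameter.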

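The second option is numpy.loadtxt:

import numpy as np

# Let NumPy work out the columns; loadtxt splits on whitespace by default,
# so this only works when the fields are numeric and whitespace-separated.
# For true fixed-width fields, np.genfromtxt accepts a sequence of field
# widths as its delimiter argument.
data_also = np.loadtxt(path)

Which one fits really depends on how you want to use your data.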

Answered 2025-04-16 by Python大师


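I'm not sure how efficient this is, but it should be readable (as opposed to slicing the string by hand). I defined a function slices that takes a string and the column lengths and returns the substrings. I made it a generator, so for very long lines it does not build a temporary list of substrings.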

def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length

For example:

In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']

In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']

In [51]: d,c,h = slices('dogcathouse', 3, 3, 5)
In [52]: d,c,h
Out[52]: ('dog', 'cat', 'house')

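That said, I think the generator's advantage is lost if you need all the columns at once. Where it helps is when you want to process the columns one by one, say in a loop.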
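To illustrate the loop case, here is a rough sketch that runs the generator over several lines and strips the padding from each field as it is produced. The sample rows, the WIDTHS layout, and the in-memory io.StringIO "file" are all made up for the example:

```python
import io


def slices(s, *args):
    """Yield consecutive substrings of s with the given lengths."""
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length


# Made-up sample data: name (10 chars), city (8 chars), age (3 chars).
rows = [("Alice", "Paris", 30), ("Bob", "Berlin", 41)]
sample = io.StringIO("".join(f"{n:<10}{c:<8}{a:>3}\n" for n, c, a in rows))

WIDTHS = (10, 8, 3)  # assumed layout for this sample only
records = []
for line in sample:
    # Each field is stripped as the generator yields it; no list of raw
    # substrings is built first.
    name, city, age = (f.strip() for f in slices(line.rstrip("\n"), *WIDTHS))
    records.append((name, city, int(age)))

print(records)  # [('Alice', 'Paris', 30), ('Bob', 'Berlin', 41)]
```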

Answered 2025-04-16 by Python大师


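Using the struct module from the Python standard library is fairly simple and also fast, since it is written in C. The code below shows how to use it. It also lets you skip columns of characters by specifying negative field widths.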

import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)

# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from  # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
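Because struct works on bytes, the widths count bytes rather than characters, so one possible variation is to read the file in binary mode and unpack every record in a single pass with Struct.iter_unpack. The sketch below assumes ASCII data and that every record ends in exactly one '\n' (no '\r\n'); the data value is a stand-in for the contents of a real file:

```python
import struct

# Assumed layout: 2-byte field, 10 ignored bytes, 24-byte field, newline.
record = struct.Struct('2s 10x 24s x')  # the trailing 'x' skips the '\n'

# Stand-in for open(path, 'rb').read()
data = b'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n' * 3

# iter_unpack requires len(data) to be an exact multiple of record.size.
rows = [tuple(field.decode('ascii') for field in fields)
        for fields in record.iter_unpack(data)]

print(record.size)  # 37
print(rows[0])      # ('AB', 'MNOPQRSTUVWXYZ0123456789')
```

If the text contains multi-byte characters, the byte offsets no longer line up with character columns, and a slicing-based approach on decoded strings is the safer choice.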

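Here is a way to do it with string slices, as you were considering but were concerned might get too ugly. It is somewhat more complicated, and speed-wise it is about the same as the struct-based version, although I have an idea for making it faster (which might make the extra complexity worthwhile). See the update below.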

from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths) # bool values for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))

Output:

format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

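Update

As I suspected, there is a way to make the string-slicing version faster: in Python 2.7 it is about the same speed as the struct version, but in Python 3.x it is 233% faster (and its unoptimized version is itself about as fast as the struct version).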

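The version shown above defines a lambda function that is essentially a generator expression producing the boundaries of a series of slices at runtime.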

parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)

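Depending on the values of i and j in the for loop, this is equivalent to a statement that looks something like this:

parse = lambda line: (line[0:2], line[12:36], line[36:51], ...)

However, the latter executes more than twice as fast, since the slice boundaries are all constants.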
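The size of that gap varies with the Python version and the machine, so it is worth measuring locally. A rough timing sketch with timeit follows; the flds tuple is written out by hand for the field widths (2, -10, 24) used earlier, and the iteration count is an arbitrary choice:

```python
import timeit

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
# (pad, start, end) triples for fieldwidths (2, -10, 24), written by hand.
flds = ((False, 0, 2), (True, 2, 12), (False, 12, 36))

# Slice boundaries are looked up each time the function is called.
parse_dynamic = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
# Slice boundaries are written out as constants.
parse_constant = lambda line: (line[0:2], line[12:36])

# Both forms must produce the same fields before timing means anything.
assert parse_dynamic(line) == parse_constant(line)

t_dynamic = timeit.timeit(lambda: parse_dynamic(line), number=100_000)
t_constant = timeit.timeit(lambda: parse_constant(line), number=100_000)
print(f'dynamic: {t_dynamic:.3f}s  constant: {t_constant:.3f}s')
```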

Fortunately, it is relatively easy to convert and "compile" the former into the latter using the built-in eval() function:

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths) # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse
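If calling eval() is undesirable, one alternative (a different technique from the one above, not the same optimization) is to precompute slice objects and hand them to operator.itemgetter, which applies them in C. The sketch below is an assumption-laden variant: make_parser_itemgetter is a hypothetical name, and whether it matches the speed of the eval() version is something to measure rather than take for granted.

```python
from itertools import accumulate
from operator import itemgetter


def make_parser_itemgetter(fieldwidths):
    """Hypothetical eval-free variant; negative widths mark ignored padding."""
    ends = list(accumulate(abs(fw) for fw in fieldwidths))
    starts = [0] + ends[:-1]
    # Keep a slice object for every field that is not padding.
    slcs = [slice(i, j) for fw, i, j in zip(fieldwidths, starts, ends)
            if fw >= 0]
    if not slcs:
        return lambda line: ()          # itemgetter() needs at least one key
    getter = itemgetter(*slcs)
    if len(slcs) == 1:
        # With a single key, itemgetter returns the bare item, not a tuple.
        return lambda line: (getter(line),)
    return getter


parse = make_parser_itemgetter((2, -10, 24))
print(parse('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'))
# ('AB', 'MNOPQRSTUVWXYZ0123456789')
```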

Answered 2025-04-16 by Python大师
