How do I parse this text file?

2 answers

User

#1 · edited 2024-05-16 07:16:12

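Since your file format is inconvenient, the only solution I can offer is this trick: look at the headers (the feature names) and slice every line according to the index at which each header appears, as follows: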

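# Read every line of the fixed-width report.
with open('/Users/Copo1/Desktop/aaa.txt') as dataFile:
    lines = dataFile.readlines()

headers = ['ITEM NUMBER', 'WH ITEM DESCRIPTION', 'PRODUCT NUMBER', 'PRICE']
# The column titles sit on the second line; each title's offset marks where its column starts.
starts = [lines[1].find(h) for h in headers]
# The last column is cut off at the width of the first line.
starts.append(len(lines[0]))
# items[i] holds, for every line, the slice that falls under headers[i].
items = [[line[starts[i]:starts[i + 1]] for line in lines]
         for i in range(len(headers))]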

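This produces the following output for the items list (only the first element, the one corresponding to "ITEM NUMBER", is pasted here; the other elements are correct too, as you can check).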

[['                      ',
  'ITEM NUMBER           ',
  '       -   - ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/27K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/30K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/35K  ',
  '                      ',
  ' \n',
  '10.5PLC/TLED/26V/40K  ',
  '                      ',
  ' \n',
  '1000PAR64/FFR         ',
  '                      ',
  ' \n',
  '1000PAR64/WFL/S       ',
  '                      ',
  ' \n',
  '100A/99               ',
  '                      ',
  ' \n',
  '100A/CL               ',
  '                      ',
  ' '],

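There may still be a little simple polishing left to do after this (such as removing empty strings and '\n's), but I'm sure you can figure that out yourself.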
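As a rough illustration of that polishing step, here is a minimal sketch that strips each cell and drops the ones that end up empty. The helper name clean_column is hypothetical (it is not part of the answer's code), and the sample cells are a subset copied from the output above.

```python
def clean_column(cells):
    """Strip whitespace from each cell and drop cells that end up empty."""
    stripped = (cell.strip() for cell in cells)
    return [cell for cell in stripped if cell]


# A few cells copied from the "ITEM NUMBER" column shown above.
raw_item_numbers = [
    '                      ',
    'ITEM NUMBER           ',
    '       -   - ',
    ' \n',
    '10.5PLC/TLED/26V/27K  ',
    ' \n',
    '100A/CL               ',
    ' ',
]

cleaned = clean_column(raw_item_numbers)
print(cleaned)
```

Note that the header title and the dashed separator row survive this cleanup, so they would still need to be skipped separately (for example by slicing them off the front of the list).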

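User

#2 · edited 2024-05-16 07:16:12

I think my answer to the question "How to efficiently parse fixed width files?" can be adapted to do what you want.

The main modification to the code in that answer is to make it also strip any leading and trailing whitespace from each field. Here is Python 3.x code that illustrates this: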

from __future__ import print_function
import struct


HEADER_LINES = 5

# Indices       0       1        2      3      4      5       6      7
fieldwidths = (20, -5, 37, -10, 12, -1, 6, -1, 9, -1, 9, -1, 10, -1, 7)

# Convert fieldwidths into a format compatible with struct module.
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                    for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
#print('fmtstring: {!r}, recsize: {} chars\n'.format(fmtstring, fieldstruct.size))

unpack_from = fieldstruct.unpack_from  # To optimize calls.


def parse(line):
    """ Return unpacked fields in string line, stripped of any leading and
        trailing whitespace.
    """
    return list(s.decode().strip() for s in unpack_from(line.encode()))


def readInventoryFile(filename):
    with open(filename) as invfile:
        for _ in range(HEADER_LINES):
            next(invfile)  # Skip header lines.

        for line in invfile:
            if len(line) < fieldstruct.size:  # Pad line if it's too short.
                line = line + (' ' * (fieldstruct.size-len(line)))
            fields = parse(line)
            if fields[0]:  # First field non-blank?
                print(fields)

readInventoryFile('inventoryFiles_INV.txt')

Output:

['10.5PLC/TLED/26V/27K', '14.5W 4PIN CFL REPL 2700K VERT 458406', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/30K', '14.5W 4PIN CFL REPL 3000K VERT 458414', '20.00 EA', '0.00', '3', '0', 'PAYOFF I68', 'I68']
['10.5PLC/TLED/26V/35K', '14.5W 4PIN CFL REPL 3500K VERT 458422', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['10.5PLC/TLED/26V/40K', '14.5W 4PIN CFL REPL 4000K VERT 458430', '20.00 EA', '0.00', '0', '0', 'I68', 'I68']
['1000PAR64/FFR', '1000W PAR64 HALOGEN GX16D BASE 56217', '50.00 EA', '0.00', '0', '0', 'I10', '']
['1000PAR64/WFL/S', '1000W PAR64 HALOGEN GX16D BASE S4673', '0.00 EA', '0.00', '0', '0', '', 'I105']
['100A/99', '100W A19 EXTENDED SERVICE      229781', '2.62 EA', '0.00', '0', '0', 'W6-2   I70', 'I11']
['100A/CL', '100W A19 130V CLEAR            375279', '0.99 EA', '0.00', '0', '0', 'A2-2   I70', 'I11']

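How it works

In a nutshell, this code uses the functionality of Python's struct module to split, or "unpack", a "buffer" full of data into fixed "fields", each containing a set number of characters.

Although struct is more commonly used on binary data, it also works on strings that have been converted to byte arrays (a conversion that isn't needed in Python 2.x). Basically, you give it a format string that specifies the characteristics (type and size) of each field, along with the data to parse (here, one line of the file), and it unpacks the data accordingly and returns the results as a tuple of values.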
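To make the format-string idea concrete, here is a small self-contained sketch that uses the same convention as the code above (positive widths are kept as fields, negative widths are skipped as padding). The three-entry layout and the sample line are made up for illustration and do not come from the inventory file.

```python
import struct

# Hypothetical layout: a 5-char field, 2 ignored chars, then a 3-char field.
fieldwidths = (5, -2, 3)

# 's' means "bytes of this length", 'x' means "skip this many pad bytes".
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
record = struct.Struct(fmtstring)

line = 'AB   ..7  '  # exactly record.size characters long
# Encode to bytes, unpack into fixed-width fields, then decode and strip each one.
fields = [s.decode().strip() for s in record.unpack_from(line.encode())]

print(fmtstring)    # the generated format string
print(record.size)  # total number of bytes one record consumes
print(fields)
```

Here fmtstring comes out as '5s 2x 3s', the record size is 10, and the skipped characters never appear in the result, which is why the negative widths in fieldwidths can be used to ignore the gaps between columns.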
