Parsing columns in a space-aligned text file

2 votes

4 answers

2787 views

Asked 2025-04-18 06:31

I am trying to parse a text file whose contents are aligned into columns with runs of spaces. The text looks roughly like this:

Blah blah, blah               bao     123456     
hello, hello, hello           miao    299292929

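I have verified that the file is not tab-delimited; the columns really are aligned with multiple spaces.

Splitting the text into individual lines is easy, and I also noticed that there are trailing spaces after the numeric sequence. So what I have now is: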

["Blah blah, blah               bao     123456     ",   
 "hello, hello, hello           miao    299292929  "]

The output I want is:

[["Blah blah, blah", "bao", "123456"],
 ["hello, hello, hello", "miao", "299292929"]]

Tags: text parsing, data cleaning, text file processing, column formatting, space alignment

4 Answers

Use the re module:

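import re

# l is the list of lines from the question.
# Strip each line first so trailing spaces don't yield an empty last field,
# then split on runs of two or more spaces.
l1 = re.split(' {2,}', l[0].strip())
l2 = re.split(' {2,}', l[1].strip())
print([l1, l2])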

Answered 2025-04-18 by Python大师

You can simply split the data by index. You can either hard-code the indexes or detect them:

l = ["Blah blah, blah               bao     123456     ",
     "hello, hello, hello           miao    299292929  "]

def detect_column_indexes(list_of_lines):
    """Return the start index of each column, plus a final end index."""
    indexes = [0]
    # True for every character position that is a space in all lines.
    transitions = [col.count(' ') == len(list_of_lines) for col in zip(*list_of_lines)]
    last = False
    for i, x in enumerate(transitions):
        # A column starts where a non-blank position follows an all-space one.
        if not x and last:
            indexes.append(i)
        last = x
    indexes.append(len(list_of_lines[0]) + 1)  # end sentinel, past the end of the line
    return indexes

def split_line_by_indexes(indexes, line):
    tokens = []
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # consecutive pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

indexes = detect_column_indexes(l)
parsed = [split_line_by_indexes(indexes, line) for line in l]
print(indexes)
print(parsed)

Output:

[0, 30, 38, 50]
[['Blah blah, blah', 'bao', '123456'], ['hello, hello, hello', 'miao', '299292929']]

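Obviously, trailing whitespace at the end of a column cannot be told apart from the padding, but leading whitespace within a column is preserved because the code uses rstrip rather than strip.

This approach is not foolproof, but it is more robust than looking for two consecutive spaces.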
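To illustrate the robustness claim, here is a small sketch with made-up data (the sample rows and the helper name `column_starts` are assumptions, not part of the question). In the second row the first value fills its column, so only a single space separates it from the next field. Splitting on two or more spaces merges the two fields, while column detection across all lines still finds the boundary:

```python
import re

# Made-up rows laid out in columns of width 20 and 6. The first field of the
# second row is 19 characters long, so only one space precedes "miao".
data = [("Blah blah, blah", "bao", "123"),
        ("hello, hello, hello", "miao", "4567")]
rows = [f"{a:<20}{b:<6}{c}" for a, b, c in data]

def column_starts(lines):
    """Return indexes where a non-blank position follows an all-space one."""
    starts = [0]
    previous_blank = False
    for i, column in enumerate(zip(*lines)):
        blank = all(ch == ' ' for ch in column)
        if previous_blank and not blank:
            starts.append(i)
        previous_blank = blank
    return starts

starts = column_starts(rows)
ends = starts[1:] + [None]  # the last field runs to the end of the line
by_columns = [[row[a:b].rstrip() for a, b in zip(starts, ends)] for row in rows]
by_regex = [re.split(r' {2,}', row.strip()) for row in rows]

print(by_columns[1])  # ['hello, hello, hello', 'miao', '4567']
print(by_regex[1])    # ['hello, hello, hello miao', '4567']
```

The detection can still fail when spaces inside values happen to line up in every row, which is why it should be treated as a heuristic.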

Answered 2025-04-18 by Python大师

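If you know the width of each field, this is easy. The first field is 30 characters wide, the second is 8 characters wide, and the last is 11 characters wide. So you can do this: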

line = 'Blah blah, blah               bao     123456     '
# Slice boundaries follow from the widths: 0-30, 30-38, 38-end.
parts = [line[:30].strip(), line[30:38].strip(), line[38:].strip()]
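With more than a few fields, the slice boundaries can be derived from the widths instead of being typed by hand. A minimal sketch, assuming the widths 30, 8 and 11 stated above; the helper name `split_fixed_width` is hypothetical:

```python
from itertools import accumulate

def split_fixed_width(line, widths):
    """Cut `line` into fields of the given widths and strip the padding."""
    ends = list(accumulate(widths))  # [30, 38, 49] for widths 30, 8, 11
    starts = [0] + ends[:-1]         # [0, 30, 38]
    return [line[a:b].strip() for a, b in zip(starts, ends)]

# Rebuild the first sample line by padding each value to its column width.
line = 'Blah blah, blah'.ljust(30) + 'bao'.ljust(8) + '123456'.ljust(11)
print(split_fixed_width(line, [30, 8, 11]))
# ['Blah blah, blah', 'bao', '123456']
```

Computing the boundaries from a single list of widths avoids off-by-one mistakes between neighbouring slices.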

Answered 2025-04-18 by Python大师

You can use the re.split() function with \s{2,} as the delimiter pattern:

>>> import re
>>> l = ["Blah blah, blah               bao     123456     ",
...      "hello, hello, hello           miao    299292929  "]
>>> for item in l:
...     re.split(r'\s{2,}', item.strip())
... 
['Blah blah, blah', 'bao', '123456']
['hello, hello, hello', 'miao', '299292929']

The pattern \s{2,} matches two or more consecutive whitespace characters.
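Applied to a whole file, the same pattern might be wrapped as follows. This is only a sketch: the name `parse_aligned` is hypothetical, and `io.StringIO` stands in for an open text file so that the example is self-contained.

```python
import io
import re

SEPARATOR = re.compile(r'\s{2,}')  # two or more whitespace characters

def parse_aligned(lines):
    """Yield one list of fields for each non-blank line."""
    for line in lines:
        line = line.strip()  # drop the newline and any trailing padding
        if line:             # skip empty lines
            yield SEPARATOR.split(line)

# io.StringIO stands in for an open text file here.
sample = io.StringIO(
    "Blah blah, blah               bao     123456     \n"
    "\n"
    "hello, hello, hello           miao    299292929\n"
)
print(list(parse_aligned(sample)))
```

Note that this only works while every pair of neighbouring fields is separated by at least two spaces and no value contains two spaces in a row.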

Answered 2025-04-18 by Python大师
