I would appreciate any advice on parsing a complex tab-delimited text file with Python.

File Name
Date All_Cars_MPG All_Cars_Doors Group_MPG Car_MPG
Units Units Units Units Units Units Units Units Units Units
Soft Tops BMW
Line of tab separated spaces
01-NOV-2015 32.5 4 18.2 25
01-DEC-2015 30.5 4 15.8 22
01-JAN-2016 35.0 5 19.0 26
Line of spaces or tab separated spaces
File name (same as above)
Date Car_Doors Car_MPG Car_Doors Car_Speed
Units Units Units Units Units Units Units Units
BMW AUDI AUDI NISSAN
Line of tab separated spaces
01-NOV-2015 5 35 2 250
01-DEC-2015 5 12 8 220
01-JAN-2016 6 19 0 260

SUMMARY OF RUN
DATE            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WTHP
                STB             STB             STB             STB             STB             STB             STB             STB             PSIA
                                                                *10**3
                B1A             B2              B3              B4              B5              B6              B7              B9              B1A

 01-JAN-2046     403847.8              0        8613069.        18449.29               0               0               0               0               0
 01-FEB-2046     403847.8              0        8633593.        18471.77               0               0               0               0               0
 01-MAR-2046     403847.8              0        8652024.        18492.03               0               0               0               0               0
 01-APR-2046     403847.8              0        8671890.        18514.38               0               0               0               0               0
 01-MAY-2046     403847.8              0        8689601.        18535.93               0               0               0               0               0
 01-JUN-2046     403847.8              0        8707051.        18558.15               0               0               0               0               0
 01-JUL-2046     403847.8              0        8723709.        18579.61               0               0               0               0               0
 01-AUG-2046     403847.8              0        8740806.        18601.75               0               0               0               0               0
 01-SEP-2046     403847.8              0        8757767.        18623.84               0               0               0               0               0
 01-OCT-2046     403847.8              0        8774027.        18645.17               0               0               0               0               0
 01-NOV-2046     403847.8              0        8790653.        18667.15               0               0               0               0               0
 01-DEC-2046     403847.8              0        8806563.        18688.37               0               0               0               0               0
 01-JAN-2047     403847.8              0        8822815.        18710.24               0               0               0               0               0

SUMMARY OF RUN
DATE            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP
                PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA
                B2              B3              B4              B5              B6              B7              B9              B10             B16Z

 01-JAN-2046            0              0               0               0               0               0        180.0000               0               0
 01-FEB-2046            0              0               0               0               0               0        180.0000               0               0
 01-MAR-2046            0              0               0               0               0               0        180.0000               0               0
 01-APR-2046            0              0               0               0               0               0        180.0000               0               0
 01-MAY-2046            0              0               0               0               0               0        180.0000               0               0
 01-JUN-2046            0              0               0               0               0               0        180.0000               0               0
 01-JUL-2046            0              0               0               0               0               0        180.0000               0               0
 01-AUG-2046            0              0               0               0               0               0        180.0000               0               0
 01-SEP-2046            0              0               0               0               0               0        180.0000               0               0
 01-OCT-2046            0              0               0               0               0               0        180.0000               0               0
 01-NOV-2046            0              0               0               0               0               0        180.0000               0               0
 01-DEC-2046            0              0               0               0               0               0        180.0000               0               0
 01-JAN-2047            0              0               0               0               0               0        180.0000               0               0

DATE,WOPT_B1A,WOPT_B2,WTHP_B1A,WTHP_B2
01-JAN-2046,403847.8,0,0,0
01-FEB-2046,403847.8,0,0,0
01-MAR-2046,403847.8,0,0,0
01-APR-2046,403847.8,0,0,0
01-MAY-2046,403847.8,0,0,0
01-JUN-2046,403847.8,0,0,0
01-JUL-2046,403847.8,0,0,0
01-AUG-2046,403847.8,0,0,0
01-SEP-2046,403847.8,0,0,0
01-OCT-2046,403847.8,0,0,0
01-NOV-2046,403847.8,0,0,0
01-DEC-2046,403847.8,0,0,0
01-JAN-2047,403847.8,0,0,0

1 Answer

User

#1 · Posted 2024-05-29 04:22:20

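So... this code makes quite a few assumptions, but it works for the example you gave. It may not cover every case, and some parts could be made faster, but I don't think speed is the main concern here, and you should be able to make the necessary changes for anything that doesn't work.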

Step 1:
We need to convert the .txt file into a list of lists.

def get_tab_delimited_lines(file):
    lines = []
    with open(file, 'r') as f:
        for line in f:
            line = line.split('\t') # Split by \t (TAB)
            line = [x.strip() for x in line] # Remove surrounding white space
            lines.append(line)
    return lines
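
To see what step 1 produces without needing a file on disk, here is a minimal sketch that applies the same split-and-strip logic to an in-memory string. The helper name `split_tab_lines` and the sample string are made up for illustration; they are not part of the code above.

```python
def split_tab_lines(text):
    """Split each line of `text` on tabs and strip whitespace from every field."""
    return [[field.strip() for field in line.split('\t')]
            for line in text.splitlines()]

# Hypothetical two-line sample; each line ends with a trailing tab.
sample = "DATE\tWOPT\tWTHP\t\n 01-JAN-2046\t 403847.8\t 0\t\n"
rows = split_tab_lines(sample)
print(rows)
# [['DATE', 'WOPT', 'WTHP', ''], ['01-JAN-2046', '403847.8', '0', '']]
```

Note the empty string at the end of each row: a trailing tab leaves one empty field after splitting.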

Step 2:
Separate the body (the table rows) from the header (the column) information.

import re # This should go at the top of the file
def get_header_and_body(lines):
    # Let's separate the header info from the body
    header_info = [] # This is the list we will return for header info
    body = [] # This is the list we will return for body info
    temp_body = []
    temp_header_info = []
    header = True
    for line in lines:
        # If the first part of the line is a date
        # in the format [a few numbers]-[a few letters]-[a few numbers]
        # Example: 01-JAN-2046
        if re.match(r'[0-9]+-[A-Z]+-[0-9]+', line[0]): # If a date then it is the body
            header = False
            temp_body.append(line[:-1]) # The last element is always an empty '' so remove it
        else: # Else this is header info
            header = True
            if temp_body: # Append the body if we have any
                body.append(temp_body)
            temp_body = [] # Reset the temp
        if header: # If this is a header
            # This is a set of the lines we dont need. If the line
            # starts with any of these we will ignore it.
            unwanted_starts_to_a_line = {'SUMMARY OF RUN', 'STB', '', 'PSIA'}
            # We will also ignore lines that start with a scale factor such as *10**3.
            if line and line[0] not in unwanted_starts_to_a_line and not re.match(r'\*[0-9]+\*\*', line[0]):
                temp_header_info.append(line)
        else:
            if temp_header_info:
                header_info.append(temp_header_info)
            temp_header_info = []

    if temp_body:
        body.append(temp_body)
    if temp_header_info:
        header_info.append(temp_header_info)
    return header_info, body
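
The header/body split hinges on whether the first field of a line looks like a date. As a small sketch of that test in isolation (the names `DATE_PATTERN` and `is_body_row` are hypothetical, introduced only for this example):

```python
import re

# Same pattern as in get_header_and_body: digits, capital letters, digits.
DATE_PATTERN = re.compile(r'[0-9]+-[A-Z]+-[0-9]+')

def is_body_row(first_field):
    """Return True if the first field starts with something like 01-JAN-2046."""
    return DATE_PATTERN.match(first_field) is not None

for field in ['01-JAN-2046', 'DATE', 'SUMMARY OF RUN', '*10**3', '']:
    print(repr(field), is_body_row(field))
```

Only `'01-JAN-2046'` is classified as a body row. Keep in mind that `re.match` only anchors at the start of the string, so a field that merely begins with a date would also pass; `re.fullmatch` would be the stricter choice if that matters for your files.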

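Step 3:
Now build the new column headers we want. I reverse the rows in header_info because DATE has nothing attached to it, so pairing the two header rows from the left would misalign them. Instead I reverse both header rows, zip them together, and then reverse the result back into the order we want.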

def change_to_table_headers(header_info):
    for index in range(len(header_info)):
        # print(header_info[index]) # uncomment this to see why I did the `reversed`
        # and feel free to remove the `reversed` to see what breaks.
        t = list(zip(reversed(header_info[index][0]), reversed(header_info[index][1])))
        t.reverse()
        t = ['_'.join(x) for x in t]
        header_info[index] = ['DATE'] + t
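
Here is a toy illustration of why the rows are zipped from the right. It assumes, as in the sample data, that the second header row has no entry under DATE and is therefore one element shorter; the two short lists are invented for the example.

```python
top = ['DATE', 'WOPT', 'WOPT', 'WTHP']  # first header row
bottom = ['B1A', 'B2', 'B1A']           # second header row, nothing under DATE

# Zipping from the left pairs DATE with the first name, which is wrong.
naive = ['_'.join(pair) for pair in zip(top, bottom)]
print(naive)    # ['DATE_B1A', 'WOPT_B2', 'WOPT_B1A']

# Zipping the reversed rows aligns them from the right and drops DATE.
pairs = list(zip(reversed(top), reversed(bottom)))
pairs.reverse()
columns = ['DATE'] + ['_'.join(pair) for pair in pairs]
print(columns)  # ['DATE', 'WOPT_B1A', 'WOPT_B2', 'WTHP_B1A']
```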

Step 4:
Putting it all together:

import pandas as pd  # This should go at the top of the file

lines = get_tab_delimited_lines('test.txt')
header_info, body = get_header_and_body(lines)
change_to_table_headers(header_info)

for index in range(len(header_info)):
    headers = header_info[index]
    df = pd.DataFrame(body[index], columns=headers)
    print(df)

Now that it's in a DataFrame, you can write it straight to CSV or do whatever else you want with it.
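
The desired output in the question is a single CSV keyed on DATE, whereas the loop above yields one DataFrame per block. One possible way to combine them, sketched here with two small hand-made frames, is to merge on the DATE column. This assumes every block covers the same dates and that no column name repeats across blocks; neither is guaranteed for real files.

```python
import io
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the per-block DataFrames built in step 4.
frames = [
    pd.DataFrame([['01-JAN-2046', '403847.8', '0'],
                  ['01-FEB-2046', '403847.8', '0']],
                 columns=['DATE', 'WOPT_B1A', 'WTHP_B1A']),
    pd.DataFrame([['01-JAN-2046', '0'],
                  ['01-FEB-2046', '0']],
                 columns=['DATE', 'WTHP_B2']),
]

# An inner merge keeps the row order of the left frame.
merged = reduce(lambda left, right: pd.merge(left, right, on='DATE', how='inner'),
                frames)

# Write to an in-memory buffer here; pass a file path to write to disk instead.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

If some blocks cover different dates, `how='outer'` keeps every date, but it sorts the string keys lexicographically, so you would probably want to convert DATE with `pd.to_datetime` first.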

Appendix:

I used the following contents in my test.txt file to test this.

SUMMARY OF RUN                                                    
DATE            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WOPT            WTHP           
                STB             STB             STB             STB             STB             STB             STB             STB             PSIA           
                                                                *10**3                                                                                         
                B1A             B2              B3              B4              B5              B6              B7              B9              B1A            

 01-JAN-2046     403847.8              0        8613069.        18449.29               0               0               0               0               0             
 01-FEB-2046     403847.8              0        8633593.        18471.77               0               0               0               0               0         
 01-MAR-2046     403847.8              0        8652024.        18492.03               0               0               0               0               0         
 01-APR-2046     403847.8              0        8671890.        18514.38               0               0               0               0               0         
 01-MAY-2046     403847.8              0        8689601.        18535.93               0               0               0               0               0         
 01-JUN-2046     403847.8              0        8707051.        18558.15               0               0               0               0               0         
 01-JUL-2046     403847.8              0        8723709.        18579.61               0               0               0               0               0         
 01-AUG-2046     403847.8              0        8740806.        18601.75               0               0               0               0               0         
 01-SEP-2046     403847.8              0        8757767.        18623.84               0               0               0               0               0         
 01-OCT-2046     403847.8              0        8774027.        18645.17               0               0               0               0               0         
 01-NOV-2046     403847.8              0        8790653.        18667.15               0               0               0               0               0         
 01-DEC-2046     403847.8              0        8806563.        18688.37               0               0               0               0               0         
 01-JAN-2047     403847.8              0        8822815.        18710.24               0               0               0               0               0         

SUMMARY OF RUN                                                    
DATE            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP            WTHP           
                PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA            PSIA           
                B2              B3              B4              B5              B6              B7              B9              B10             B16Z           

 01-JAN-2046            0              0               0               0               0               0        180.0000               0               0         
 01-FEB-2046            0              0               0               0               0               0        180.0000               0               0         
 01-MAR-2046            0              0               0               0               0               0        180.0000               0               0         
 01-APR-2046            0              0               0               0               0               0        180.0000               0               0         
 01-MAY-2046            0              0               0               0               0               0        180.0000               0               0         
 01-JUN-2046            0              0               0               0               0               0        180.0000               0               0         
 01-JUL-2046            0              0               0               0               0               0        180.0000               0               0         
 01-AUG-2046            0              0               0               0               0               0        180.0000               0               0         
 01-SEP-2046            0              0               0               0               0               0        180.0000               0               0         
 01-OCT-2046            0              0               0               0               0               0        180.0000               0               0         
 01-NOV-2046            0              0               0               0               0               0        180.0000               0               0         
 01-DEC-2046            0              0               0               0               0               0        180.0000               0               0         
 01-JAN-2047            0              0               0               0               0               0        180.0000               0               0
