复杂多表文件的数据清理方法

0 投票
0 回答
28 浏览
提问于 2025-04-12 04:01

大家下午好,

我有一个.dat文件(附带了一个示例),最开始我发帖是因为我觉得只需要帮助调整一些错位的行。然而,经过进一步检查数据后,我发现问题比我想的要复杂一些,我很难找到处理这些数据和问题的方法。

我想清理这个文件,最好能把它放到一个数据框(DataFrame)里,这样我就可以用这个文件中的特定信息与我已经清理好的其他csv文件进行比较分析。

**简而言之** 我有一个包含多个不同表头的.dat文件,我想清理它,并希望能放入一个数据框。对我来说复杂的地方在于,这个文件包含两个主要的表头,它们的长度不同,不仅在文件的开头出现,还贯穿整个文件。此外,某些列也存在错位,保持“结构”很重要,因为文件中的每个表都与一个路线相关。

具体情况是这样的……

首先,这个文件本身由许多表连接组成,数据中也包含表头。

这些表头遵循两种主要类型和模式:

第一个表头包含时间 | 模拟时间等,用“ | ”分隔,紧接着的数据也有相应的“ | ”。这在数据中仅用两行表示。

第二个表头通过索引(idx)来识别,使用空格分隔。它的数据也相应跟随。

在整个文件中,你会注意到一组数据的结束和下一组的开始可以通过找到时间(日期/时间,例如年-月-日)来确定。这表示第一个表头与第二个表头的索引相关联,后面的数据紧接着出现。

此外,当查看数据并将其放入分隔格式时,我观察到第二个表头下的fix_nam列错位,显示时间时没有fix_nam,而fix_nam是用字母表示的。这里的一切都应该向右移动,如果没有字符串字符,fix_nam下应该留空。这让我回到了我最初发帖的问题这里.

我有点卡住了,因为我的所有脚本似乎都产生了不同的输出,但没有一个是正确的数据框,无法包含我所询问的内容。我附上了我最新的一个脚本,如果有人对如何处理这个问题有想法,我将非常感激。

如果我可以提供更多的上下文或信息来帮助,请告诉我。

谢谢。

import pandas as pd

# Initialize lists to store headers and data
headers_list = []
data_list = []
second_header_list = []

# Open the file and read line by line
with open('datashort.dat', 'r') as file:
    for line in file:
        # Strip whitespace characters from the line
        line = line.strip()
        
        if line:
            # Split the line into fields using '|' delimiter
            fields = line.split('|')
            
            if len(fields) > 1:
                # Check if the headers are from the first set or the second set
                if len(headers_list) == 0:
                    headers_list.append(fields)
                else:
                    second_header_list.append(fields)
            else:
         
                data_list.append(fields[0].split())


headers_df = pd.DataFrame(headers_list)
second_header_df = pd.DataFrame(second_header_list)
data_df = pd.DataFrame(data_list)


headers_combined = headers_df.apply(lambda x: '|'.join(x.dropna()), axis=1)


result_df = pd.concat([headers_combined, data_df], ignore_index=True, axis=1)


result_df.columns = result_df.iloc[0]
result_df = result_df[1:]


output_path = 'output.csv'
result_df.to_csv(output_path, index=False)

这里有一些示例数据(如果我添加的不正确请见谅)

Time|Sim_Time|Record_Code|Record_Type|Channel_Id|Channel_Mode|Rec_Id|App_Id|Process_Id|Host_Name|Acid|Cfpid|Tail_No|NAS_Cid|Dept|Arr|FixNbr|Profiles|Parse_Time|System_Mode|Release_Name$$$
2023-07-31T23:30:00.442Z|2023-07-31T23:30:00.442Z|32|28003|A|Active|420|420|6049|fdpsa101|XXX1234|1891|C857TJ|538|LALA|AHAH|75|
Idx fix_loc fix_nam ETA Spd LL UL clr_0 clr_1 Post Center fixOvr fixInd fixInd2 FixTypes
00 183100N0682100W LALA 22:16:22 M078 F320 F320 F320 F320 NY DOMES CERDA x00 x00 x00 RADAR_MON_IN/OUT_XFER/
01 183100N0682100W LALA 22:16:22 M078 F320 F320 F320 F320 CERDA x0100 x04 x00 DISPLAY/OUT_XFER/AIRPORT/DCT/
02 184531N0683420W 22:18:47 M078 F320 F320 F320 F320 CERDA x00 x04 x00 RADAR_MON_OUT/OUT_XFER/DCT/
03 190919N0685619W 22:22:45 M078 F320 F320 F320 F320 CERDA x00 x04 x00 RADAR_MON_IN/OUT_XFER/DCT/
04 193800N0692300W 22:27:35 M078 F320 F320 F320 F320 CERDA x00 x04 x00 DISPLAY/AS_FILED/OUT_XFER/DCT/
05 194748N0692411W 22:28:49 M078 F320 F320 F320 F320 CERDA x00 x04 x00 RADAR_IN/OUT_XFER/DCT/
06 194748N0692411W 22:28:49 M078 F320 F320 F320 F320 CERDA x00 x04 x00 RCADEV_IN/OUT_XFER/DCT/
07 230754N0694847W 22:54:09 M078 F320 F320 F320 F320 CERDA x00 x4001 x00 KEEP/OUT_XFER/REPORTED/ADS_REP/
08 241119N0695649W 23:01:52 M078 F320 F320 F320 F320 CERDA x00 x4001 x00 KEEP/OUT_XFER/REPORTED/ADS_REP/
09 242117N0695806W 23:03:05 M078 F320 F320 F320 F320 CERDA x00 x0200 x00 MOBPST/OUT_XFER/LVCHG/
10 242637N0695847W 23:03:44 M078 F320 F320 F320 F320 CERDA x00 xffffc001 x00 KEEP/OUT_XFER/REPORTED/ADS_REP/RADAR_REP/
11 244025N0700034W 23:05:25 M078 F320 F320 F320 F320 CERDA CERDA x00 x04 x00 XCENTER_IN/OUT_XFER/DCT/
12 244025N0700034W 23:05:25 M078 F320 F320 F320 F320 CERDA CERDA x00 x04 x00 DISPLAY/CENTER_IN/OUT_XFER/DCT/
13 245058N0700155W 23:06:42 M078 F320 F320 F320 F320 90 CERDA x00 x04 x00 POSTING_IN/OUT_XFER/DCT/
14 245213N0700205W 23:06:52 M078 F320 F320 F320 F320 CERDA x00 x4001 x00 KEEP/REPORTED/ADS_REP/
15 250000N0700306W 23:07:49 M078 F320 F320 F320 F320 CERDA CERDA x00 x04 x00 CENTER_OUT/DCT/
16 250000N0700306W 23:07:49 M078 F320 F320 F320 F320 90 x00 x04 x00 SECTOR_IN/IN_FIR/DCT/
17 250000N0700306W LAMER 23:07:49 M078 F320 F320 F320 F320 90 x00 x4000 x00 DISPLAY/AS_FILED/IN_FIR/ADS_REP/
18 250000N0700306W 23:07:49 M078 F320 F320 F320 F320 90 x00 x04 x00 RCADEV_OUT/IN_FIR/DCT/
19 250000N0700306W 23:07:49 M078 F320 F320 F320 F320 90 x00 x04 x00 RADAR_OUT/IN_FIR/DCT/
20 250829N0700503W 23:08:53 M078 F320 F320 F320 F320 90 x00 x4001 x00 KEEP/IN_FIR/REPORTED/ADS_REP/
21 251131N0700544W 23:09:19 M078 F320 F320 F320 F320 90 x00 xffff8001 x00 KEEP/IN_FIR/REPORTED/RADAR_REP/
22 251436N0700627W 23:09:42 M078 F320 F320 F320 F320 90 x00 xffff8001 x00 KEEP/IN_FIR/REPORTED/RADAR_REP/
23 251743N0700710W 23:10:05 M078 F320 F320 F320 F320 90 x400000 x00 x00 IN_FIR/MOBPST/DEVIAT_START/
24 251920N0700733W 23:10:17 M078 F320 F320 F320 F320 90 x800000 xffff8001 x00 KEEP/IN_FIR/DEVIAT/REPORTED/RADAR_REP/
25 252232N0700817W 23:10:40 M078 F320 F320 F320 F320 90 x800000 x4001 x00 KEEP/IN_FIR/DEVIAT/REPORTED/ADS_REP/
26 252515N0700855W 23:11:00 M078 F320 F320 F320 F320 CERDA 90 x800000 x04 x00 XCENTER_OUT/IN_FIR/DEVIAT/DCT/
27 252949N0700958W 23:11:34 M078 F320 F320 F320 F320 90 x800000 x04 x00 RADAR_MON_OUT/IN_FIR/DEVIAT/DCT/
28 253857N0701205W 23:12:45 M078 F320 F320 F320 F320 90 x800000 x4001 x00 KEEP/IN_FIR/DEVIAT/REPORTED/ADS_REP/
29 254222N0701253W 23:13:11 M078 F320 F320 F320 F320 90 x800000 x4001 x00 KEEP/IN_FIR/DEVIAT/REPORTED/ADS_REP/
30 254339N0701311W 23:13:21 M078 F320 F320 F320 F320 90 x800000 x4001 x00 KEEP/IN_FIR/DEVIAT/REPORTED/ADS_REP/
31 262034N0702148W 23:18:03 M078 F320 F320 F320 F320 90 x800000 x00 x00 IN_FIR/MOBPST/DEVIAT/
32 265903N0703054W 23:22:57 M078 F320 F320 F320 F320 90 x800000 x4001 x00 KEEP/IN_FIR/DEVIAT/REPORTED/ADS_REP/
33 270119N0703127W 23:23:15 M078 F320 F320 F320 F320 90 x800000 x00 x00 IN_FIR/MOBPST/DEVIAT/
34 275117N0704325W 23:29:50 M078 F320 F320 F320 F320 90 x800000 x4001 x00 KEEP/IN_FIR/DEVIAT/REPORTED/ADS_REP/
35 285101N0705759W 23:37:29 M078 F320 F320 F320 F320 90 x800000 x6000 x00 KEEP/IN_FIR/DEVIAT/PILOT_EST_REP/ADS_REP/
36 285400N0705844W BOREX 23:37:51 M078 F320 F320 F320 F320 90 x800000 x4000 x00 DISPLAY/AS_FILED/IN_FIR/DEVIAT/ADS_REP/
37 302440N0712125W 23:49:14 M078 F320 F320 F320 F320 87 90 x800000 x04 x00 POSTING_IN/IN_FIR/DEVIAT/DCT/
38 302440N0712125W 23:49:14 M078 F320 F320 F320 F320 90 x800000 x04 x00 SECTOR_OUT/IN_FIR/DEVIAT/DCT/
39 302440N0712125W 23:49:14 M078 F320 F320 F320 F320 87 x800000 x04 x00 SECTOR_IN/IN_FIR/DEVIAT/DCT/
40 302440N0712125W 23:49:14 M078 F320 F320 F320 F320 90 87 x800000 x04 x00 POSTING_OUT/IN_FIR/DEVIAT/DCT/
41 320244N0714648W ALOBI 00:01:34 M078 F320 F320 F320 F320 87 x800000 x00 x05 AS_FILED/IN_FIR/DEVIAT/NON_COMPULSORY/MIDAIRWAY_POINT/
42 330141N0720230W LSIER 00:09:05 M078 F320 F320 F320 F320 87 x800000 x00 x01 KEEP/IN_FIR/DEVIAT/NON_COMPULSORY/
43 335855N0721806W ONGOT 00:16:27 M078 F320 F320 F320 F320 87 x800000 x00 x00 DISPLAY/AS_FILED/IN_FIR/DEVIAT/
44 340000N0721813W 00:16:35 M078 F320 F320 F320 F320 87 x800000 x04 x00 RADAR_MON_IN/IN_FIR/DEVIAT/DCT/
45 341219N0721932W 00:18:07 M078 F320 F320 F320 F320 82 87 x800000 x04 x00 XCENTER_IN/IN_FIR/DEVIAT/DCT/
46 341915N0722017W 00:19:00 M078 F320 F320 F320 F320 87 x01000000 x00 x00 MOBILE/IN_FIR/DEVIAT_END/
47 344317N0722254W 00:22:00 M078 F320 F320 F320 F320 87 x00 x04 x00 SECTOR_OUT/IN_FIR/DCT/
48 344317N0722254W 00:22:00 M078 F320 F320 F320 F320 82 82 x00 x04 x00 CENTER_IN/DCT/
49 344317N0722254W 00:22:00 M078 F320 F320 F320 F320 82 x00 x04 x00 RADAR_IN/DCT/
50 344317N0722254W 00:22:00 M078 F320 F320 F320 F320 82 x00 x04 x00 RCADEV_IN/DCT/
51 344317N0722254W 00:22:00 M078 F320 F320 F320 F320 87 82 x00 x04 x00 POSTING_OUT/DCT/
52 344328N0722255W SAUCR 00:22:02 M078 F320 F320 F320 F320 82 x00 x00 x00 DISPLAY/AS_FILED/
53 345804N0722032W STERN 00:23:50 M078 F320 F320 F320 F320 82 x00 x00 x00 DISPLAY/AS_FILED/
54 360224N0721936W CHUBY 00:31:41 M078 F320 F320 F320 F320 82 x00 x00 x00 DISPLAY/AS_FILED/
55 364207N0723231W HOBOH 00:36:45 M078 F320 F320 F320 F320 82 x00 x00 x00 DISPLAY/AS_FILED/
56 364303N0723232W 00:36:51 M078 F320 F320 F320 F320 86 82 x00 x04 x00 XCENTER_IN/DCT/
57 364303N0723232W 00:36:51 M078 F320 F320 F320 F320 82 82 x00 x04 x00 CENTER_OUT/DCT/
58 364303N0723232W 00:36:51 M078 F320 F320 F320 F320 86 86 x00 x04 x00 CENTER_IN/DCT/
59 364303N0723232W 00:36:51 M078 F320 F320 F320 F320 82 86 x00 x04 x00 XCENTER_OUT/DCT/
60 371340N0723312W SILLY 00:40:31 M078 F320 F320 F320 F320 86 x00 x00 x04 DISPLAY/AS_FILED/MIDAIRWAY_POINT/
61 374940N0721716W STINK 00:44:53 M078 F320 F320 F320 F320 86 x00 x00 x04 DISPLAY/AS_FILED/MIDAIRWAY_POINT/
62 381653N0715713W YAALE 00:48:25 M078 F320 F320 F320 F320 86 x00 x00 x00 DISPLAY/AS_FILED/
63 383655N0715700W YETTI 00:50:46 M078 F320 F320 F320 F320 86 x00 x00 x00 DISPLAY/AS_FILED/
64 385917N0715643W MOUGH 00:53:23 M078 F320 F320 F320 F320 86 x00 x00 x00 DISPLAY/AS_FILED/
65 394805N0724825W 01:01:45 M078 F320 F320 F320 F320 86 86 x00 x04 x00 DISPLAY/CENTER_OUT/DCT/
66 394805N0724825W 01:01:45 M078 F320 F320 F320 F320 86 86 x00 x04 x00 XCENTER_OUT/DCT/
67 394924N0724950W OWENZ 01:01:59 M078 F320 F320 F320 F320 86 x00 x00 x00 DISPLAY/AS_FILED/
68 394841N0731541W PREPI 01:05:00 M078 F320 F320 F320 F320 86 x00 x00 x00 DISPLAY/AS_FILED/
69 395537N0733724W LEECY 01:07:37 M078 F320 F320 F320 F320 86 x00 x00 x00 DISPLAY/AS_FILED/
70 400102N0735140W CAMRN 01:09:23 M078 F320 F320 F320 F320 86 x00 x04 x00 DISPLAY/AS_FILED/DCT/
71 400331N0735121W 01:09:41 M078 F320 F320 F320 F320 86 x00 x04 x00 RADAR_MON_OUT/DCT/
72 403823N0734644W AHAH 01:13:57 M078 F320 F320 F320 F320 86 x0100 x04 x00 DISPLAY/AIRPORT/DCT/
73 403823N0734644W AHAH 01:13:57 M078 F320 F320 F320 F320 86 x00 x00 x00 RCADEV_OUT/
74 403823N0734644W AHAH 01:13:57 M078 F320 F320 F320 F320 NY DOMES 86 x00 x00 x00 RADAR_OUT/|2023-07-31T23:36:18.000Z|L|ocean21.t2802000.na01851b$$$
2023-07-31T23:30:00.690Z|2023-07-31T23:30:00.690Z|32|28003|A|Active|420|420|6049|fdpsa101|FET2102|1979|VLRRI|647|LALA|EGGG|94|
Idx fix_loc fix_nam ETA Spd LL UL clr_0 clr_1 Post Center fixOvr fixInd fixInd2 FixTypes
00 191320N0682332W CHUMA 21:40:00 M084 F340 F340 F340 F340 SANJUAN SANJUAN x00 x00 x00 XCENTER_IN/
01 191322N0682330W 21:40:00 M084 F340 F340 F340 F340 SANJUAN SANJUAN x00 x04 x00 CENTER_IN/DCT/
02 191322N0682330W 21:40:00 M084 F340 F340 F340 F340 SANJUAN x00 x04 x00 RADAR_IN/DCT/
03 191320N0682332W CHUMA 21:40:00 M084 F340 F340 F340 F340 NY DOMES SANJUAN x00 x00 x00 RADAR_MON_IN/
04 191320N0682332W CHUMA 21:40:00 M084 F340 F340 F340 F340 SANJUAN x00 x00 x00 RCADEV_IN/
05 191320N0682332W CHUMA 21:40:00 M084 F340 F340 F340 F340 SANJUAN x00 x00 x00 DISPLAY/AS_FILED/
06 200011N0672752W DOZGO 21:48:20 M084 F340 F340 F340 F340 SANJUAN x00 x04 x00 DISPLAY/AS_FILED/DCT/
07 203222N0671039W JANMA 21:52:36 M084 F340 F340 F340 F340 SANJUAN x00 x00 x00 DISPLAY/AS_FILED/
08 211556N0670734W LENNT 21:57:45 M084 F340 F340 F340 F340 SANJUAN x00 x00 x01 KEEP/NON_COMPULSORY/
09 211924N0670816W 21:58:10 M084 F340 F340 F340 F340 SANJUAN x00 x0200 x00 MOBPST/LVCHG/
10 213716N0671152W 22:00:19 M084 F340 F340 F340 F340 90 SANJUAN x00 x04 x00 POSTING_IN/DCT/
11 213716N0671152W 22:00:19 M084 F340 F340 F340 F340 SANJUAN SANJUAN x00 x04 x00 CENTER_OUT/DCT/
12 213716N0671152W 22:00:19 M084 F340 F340 F340 F340 90 x00 x04 x00 SECTOR_IN/IN_FIR/DCT/
13 213717N0671152W KINCH 22:00:19 M084 F340 F340 F340 F340 90 x00 x4001 x00 DISPLAY/AS_FILED/IN_FIR/REPORTED/ADS_REP/

0 个回答

暂无回答

撰写回答