Parsing data that looks like CSV but isn't CSV?

Published 2024-04-18 23:49:20


I'm trying to parse a file that looks like a CSV file but isn't. It is comma-separated, but every comma is followed by a space. There is no header, and the rows vary in length. Any ideas?

Here is an example; if I open the file as .txt I see something like this:

FUD, speed, time, heading, offsets
MUD, speed, time, heading, offsets, error
CLA, head, time, speed, offset, error, errorfix
MUD, speed, time, heading, offsets, error
MUD, speed, time, heading, offsets, error
FUD, speed, time, heading, offsets
CLA, head, time, speed, offset, error, errorfix
CLA, head, time, speed, offset, error, errorfix
(Note: head, time, offset and all the fields after the first column are actual values.)

Here is what I have tried so far:

import pandas as pd

df = pd.read_csv('data.csv', header=None)
MUD = df[df[0] == 'MUD'].values.tolist()

However, I get this error:

CParserError: Error tokenizing data. C error: Expected 10 fields in line 3, saw 18

When I googled the error, it was suggested that I use

error_bad_lines=False

But that just gives me another error:

expected 10 fields, saw 15.
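
For reference, here is roughly how that suggestion would be applied. Note that recent pandas versions (1.3+) replace error_bad_lines with on_bad_lines; either way the longer rows are simply skipped instead of parsed, so it is not a real fix for this data:

import pandas as pd

# Sketch only: skipping "bad" lines silently drops every row that has more
# fields than the first line of the file.
df = pd.read_csv('data.csv', header=None, on_bad_lines='skip')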

What I'm trying to do is collect every MUD row I see into a pandas list, so that later I can do things like:

newMUD = MUD[4]/100

Eventually I want to end up with something like this:

print (MUD)
MUD, 12, 1, 5, 1, 1
MUD, 13, 2, 3, 2, 0
MUD, 12, 3, 5, -2, 0
MUD, 4, 4, 3, -3, 1

A sample of my data:

NKF1, 447526092, -3.08, 0.01, 175.83, -0.02133949, 0.03264881, -0.06251871, 0, -28.93325, 26.49632, -0.1290034, 0.07, -0.02, 0.14
NKF2, 447526092, -26, 0.00, 0.00, 0.00, 0.00, 0.00, 255, 55, 341, 0, 0, 0, 0
NKF3, 447526092, -0.01, 0.06, 0.12, -0.04, -0.08, -0.03, 0, 0, 0, -0.73, 0.00
NKF4, 447526092, 0.03, 0.01, 0.00, 0.00, 0.00, 0.0002261061, 0, 0, 0, 16, 9023, 0, 1
NKF5, 447526092, 0, 0, 0, 0, 1.14, 0.88, 0.00, 0.00, 0.50, 0.003602755, 0.01431285, 0.02802294
NKF6, 447526092, -2.66, -0.98, 187.53, -0.06789517, -0.2714562, -0.1189714, 0, -28.96132, 26.25431, -0.2784806, 0.00, 0.36, -0.49
NKF7, 447526092, 21, 0.00, 0.00, 0.00, 0.00, 0.00, 258, 55, 338, 0, 0, 0, 0
NKF8, 447526092, -0.04, -0.20, 0.07, -0.04, -0.23, -0.17, 0, 0, 0, 10.83, 0.00
NKF9, 447526092, 0.04, 0.03, 0.01, 0.12, 0.00, 0.000866859, 0, 0, 0, 16, 9023, 0, 1
AHR2, 447526241, -3.12, -0.42, 176.43, 418.84, 34.3167522, -118.4068499
POS, 447526306, 34.3167515, -118.406853, 419.03, 0.2784806
IMU, 447545009, -0.09418038, 0.1740572, -0.05483108, 0.6083156, 0.2225795, -9.380787, 0, 0, 52.99446, 1, 1
IMU2, 447545009, -0.09127176, 0.1908958, -0.06220703, 0.524766, 0.3107446, -8.754621, 0, 0, 56.125, 1, 1
SONR, 447545584, 0, 0, 0, 0
RFND, 447545593, 0.00, 0.00
IMU, 447565482, -0.08753563, 0.1228692, -0.04508965, 0.6137247, -0.01505011, -9.579732, 0, 0, 53.0831, 1, 1
IMU2, 447565482, -0.08944235, 0.139776, -0.05096832, 0.4677677, 0.03778861, -9.214079, 0, 0, 55.875, 1, 1
GPS, 447565911, 4, 246769200, 1920, 14, 0.70, 34.3167523, -118.4068497, 418.91, 0.05656854, 135, -0.16, 1
GPA, 447565911, 1.11, 0.73, 1.04, 0.29, 1, 447565
SONR, 447566084, 0, 0, 0, 0
RFND, 447566093, 0.00, 0.00
ATT, 447566114, 0.00, -2.88, 0.00, -0.62, 0.00, 187.41, 0.02, 0.01
PIDR, 447566125, 0, 0, 0, 0, 0, 0
PIDP, 447566135, 0, 0, 0, 0, 0, 0
PIDY, 447566145, 0, 0, 0, 0, 0, 0
PIDS, 447566155, 0, 0, 0, 0, 0, 0
NKF1, 447566164, -3.30, 0.35, 175.70, -0.02778457, 0.03493549, -0.04115778, 0, -28.9337, 26.49665, -0.1338468, 0.07, -0.02, 0.14
NKF2, 447566164, -26, 0.00, 0.00, 0.00, 0.00, 0.00, 255, 55, 341, 0, 0, 0, 0
NKF3, 447566164, -0.01, 0.06, 0.12, -0.04, -0.08, -0.11, 0, 0, 0, -0.73, 0.00
NKF4, 447566164, 0.03, 0.01, 0.00, 0.00, 0.00, 0.0002256641, 0, 0, 0, 16, 9023, 0, 1
NKF5, 447566164, 0, 0, 0, 0, 1.14, 0.88, 0.00, 0.00, 0.50, 0.003267812, 0.01763795, 0.02970827
NKF6, 447566164, -2.88, -0.62, 187.40, -0.07544779, -0.2697962, -0.09678251, 0, -28.96231, 26.2515, -0.2831134, 0.00, 0.36, -0.49
NKF7, 447566164, 21, 0.00, 0.00, 0.00, 0.00, 0.00, 258, 55, 338, 0, 0, 0, 0
NKF8, 447566164, -0.04, -0.20, 0.07, -0.04, -0.23, -0.25, 0, 0, 0, 10.83, 0.00
NKF9, 447566164, 0.04, 0.03, 0.01, 0.12, 0.00, 0.00086712, 0, 0, 0, 16, 9023, 0, 1
AHR2, 447566373, -3.34, -0.07, 176.32, 418.84, 34.3167522, -118.4068497
POS, 447566396, 34.3167515, -118.406853, 419.04, 0.2831134
IMU, 447587271, -0.08603665, 0.071096, -0.03380377, 0.5931511, -0.07432687, -9.615693, 0, 0, 53.0831, 1, 1
IMU2, 447587271, -0.08848803, 0.09229023, -0.04071644, 0.4688947, 0.01987415, -9.166938, 0, 0, 56.125, 1, 1
MAG, 447587700, -265, -77, 332, -115, 0, 1, 0, 0, 0, 1, 447587691
MAG2, 447587700, -273, -29, 372, 77, -135, 38, 0, 0, 0, 1, 447587693
ARSP, 447587748, 2.969838, 4.424126, 38.22, -4.424126, 110.8502, 1
BARO, 447587789, -0.09136668, 97036.14, 55.03, -0.8952343, 447587, 0
CURR, 447587949, 16.91083, 0.6012492, 60.22538

Tags: file, csv, df, time, error, head, mud
2 Answers

Using pandas makes sense if you actually want to do calculations on the columns (which isn't obvious from the question). In that case it is enough to pass the expected column names, so the parser isn't surprised by the varying number of columns per row:

# Creates a list ["note", "head" ... ]
columns = "note head time speed offset error errorfix".split()

df = pd.read_csv(filename, names=columns)

MUD = df.query("note == 'MUD'")

MUD["speed"] / 4

You can filter the rows while creating the DataFrame with from_records. Here I use the csv module to produce the rows and discard the ones you don't want:

import pandas as pd
import csv

def data_reader(filename, rowname):
    with open(filename, newline='') as fp:
        # skipinitialspace strips the space that follows every comma; keep only
        # rows whose first field matches rowname and drop that first field.
        yield from (row[1:]
                    for row in csv.reader(fp, skipinitialspace=True)
                    if row[0] == rowname)

df = pd.DataFrame.from_records(data_reader('testfile', 'MUD'))
print(df)
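
If you want named columns for the arithmetic afterwards, from_records also takes a columns argument. A small sketch assuming the MUD layout from the question (the MUD label itself is already stripped by the reader):

mud = pd.DataFrame.from_records(
    data_reader('testfile', 'MUD'),
    columns=['speed', 'time', 'heading', 'offsets', 'error'])

# csv.reader yields strings, so convert before doing arithmetic.
print(mud['offsets'].astype(float) / 100)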

This is slightly risky: if the non-MUD lines don't follow standard CSV rules, the reader could trip over them. Here is a more elaborate version that restricts the csv parser to the MUD lines only:

import pandas as pd
import csv

def mud_reader(filename, rowname):
    prefix = rowname + ", "
    with open(filename, newline='') as fp:
        # Strip the "MUD, " prefix from matching lines before they reach the
        # csv parser, so only MUD value fields are ever tokenized.
        yield from csv.reader(
            (line[len(prefix):] for line in fp if line.startswith(prefix)),
            skipinitialspace=True)

df = pd.DataFrame.from_records(mud_reader('testfile', 'MUD'))
print(df)
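
The same reader works for any message type in the log; for example, to pull out the NKF1 rows (assuming the log is saved as 'data.csv'):

nkf1 = pd.DataFrame.from_records(mud_reader('data.csv', 'NKF1'))
print(nkf1.astype(float).head())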
