Python Pandas: reading a CSV file with a specific line terminator

Asked 2025-04-18 11:23

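I'm trying to create a DataFrame from the sample CSV file I was given (shown below), but I get the error "Error tokenizing data. C error: EOF inside string starting at line 0". I don't have much experience handling malformed lines like these, and I'd really like to learn how to deal with them properly. I tried many read_csv options, such as error_bad_lines=False, but none of them solved the problem.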

CParserError: Error tokenizing data. C error: EOF inside string starting at line 0

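My guess is that the ," that terminates each line is causing the problem. I think the best approach is to process the file line by line, so with some help from others I came up with the generator below, and I hope I'm close to a solution. I'd also like to learn how to use generators and yield along the way.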

Sample data:

"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","
"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","
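One quick way to check the guess about the trailing ," is to count the double quotes on a line: an odd count means a quoted field is opened but never closed, which is consistent with the "EOF inside string" message. This is only a minimal sketch. The helper name has_unbalanced_quotes is hypothetical, and the sample string is a shortened stand-in for the rows above.

```python
def has_unbalanced_quotes(line, quotechar='"'):
    """Return True if the line contains an odd number of quote characters."""
    return line.count(quotechar) % 2 == 1


# Shortened version of a sample row, keeping the trailing ," terminator.
sample = '"USNC3255","27","US","NC","LANDS END","\n'

print(has_unbalanced_quotes(sample))       # True: the last quote never closes
print(has_unbalanced_quotes(sample[:-3]))  # False once ," and the newline are cut
```

Doubled quotes used as escapes inside a field add two to the count, so they do not change the parity this check relies on.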

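I wrote the code below, which strips the trailing ," and the newline from each line, but I'm not sure how to build a DataFrame from the resulting lines: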

def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            yield line[:-3]

gen = big_table_generator('../data/test_sun_file.csv')
pd.DataFrame(gen)
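Passing a generator of raw strings to pd.DataFrame gives a single column holding the unsplit text, so each cleaned line still has to be parsed into fields. The following is a minimal sketch of one way to chain generators for that, assuming every line ends with the ," terminator seen in the sample data. The names strip_terminator, iter_rows and read_table are hypothetical.

```python
import csv

import pandas as pd


def strip_terminator(line, terminator=',"'):
    """Remove the newline and, if present, the trailing terminator."""
    line = line.rstrip('\r\n')
    if line.endswith(terminator):
        line = line[:-len(terminator)]
    return line


def iter_rows(filename):
    """Yield one list of fields per line, cleaning lines lazily."""
    with open(filename, newline='') as f:
        cleaned = (strip_terminator(line) for line in f)
        # csv.reader accepts any iterable of strings, not only file objects.
        yield from csv.reader(cleaned)


def read_table(filename):
    """Build a DataFrame of strings from the generated rows."""
    return pd.DataFrame(iter_rows(filename))
```

With the path from the question, this would be called as df = read_table('../data/test_sun_file.csv'). The file is read lazily, but pandas still has to materialize all rows to build the frame, so the generator mainly keeps the cleaning step tidy rather than saving memory.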

2 Answers

0

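This is the solution I came up with, although I would have preferred to avoid building a list with append and to use a generator instead. I'm not familiar enough with generators yet, so I haven't managed that.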

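import csv

import numpy as np
import pandas as pd


def parse_file(filename):
    rows = []
    # On Python 3 the csv module expects text mode with newline=''.
    with open(filename, 'r', newline='') as f:
        # QUOTE_NONE treats '"' as an ordinary character, so the dangling
        # quote at the end of each line ends up as a separate last field.
        reader = csv.reader(f, quoting=csv.QUOTE_NONE)
        for row in reader:
            # Drop that last field and strip the quotes from the others.
            rows.append([s.strip('"') for s in row[:-1]])
    df = pd.DataFrame(rows)
    # Turn empty strings into NaN.
    df = df.replace('', np.nan).astype(object)
    return df


df = parse_file('../data/test_sun_file.csv')

4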

I ran into a similar error. Passing quoting=csv.QUOTE_NONE to read_csv solved it.

For example:

df = pd.read_csv(csvfile, header = None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

Some background on why this works can be found in the second comment here: https://github.com/pydata/pandas/issues/5500
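Applied to the comma-separated data in the question, the same option might look like the sketch below. It assumes no field contains a comma, because with QUOTE_NONE the quotes no longer protect embedded delimiters. The in-memory sample is a shortened stand-in for the real file, and the clean-up steps after read_csv are an illustration, not part of the original answer.

```python
import csv
import io

import pandas as pd

# Shortened stand-in for the file; each line ends with the dangling ,"
sample = (
    '"USNC3255","27","","RDU","893727","\n'
    '"USNC3256","27","sc007","CLT","317845","\n'
)

# QUOTE_NONE makes the parser treat '"' as a normal character, so the
# unclosed quote at the end of a line no longer swallows the rest of the file.
df = pd.read_csv(io.StringIO(sample), header=None,
                 quoting=csv.QUOTE_NONE, dtype=str)

# The last column holds only the dangling quote, so drop it.
df = df.iloc[:, :-1]

# The quotes are now part of each value; strip them column by column.
df = df.apply(lambda col: col.str.strip('"'))

print(df)
```

For a real file, io.StringIO(sample) would be replaced by the file path.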
