Error when reading a CSV that contains double quotes

Posted 2024-05-23 16:15:30


I have read all the related threads, such as this, this, and this, but could not find a working solution.

I have an input CSV file that looks like this:

ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I have tried several different approaches, but none of them worked. I want to read this CSV file into a DataFrame like the following:

ItemId    Content
--------  -------------------------------------------------------------------------------
i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I am using the following code (Python 3.9):

df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')

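As far as I can tell, both the commas between the dictionary entries and the commas inside the quoted strings are treated as regular delimiters, which produces the following error: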

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
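To illustrate where the counts 4 and 6 in that message may come from, here is a minimal sketch that tokenizes the same sample with Python's standard csv module. This is an assumption-laden stand-in: pandas uses its own C parser, so the module is only a rough analogue, and the name field_counts is a hypothetical helper. Because each Content value starts with { rather than a quote character, the field is not treated as quoted and every comma inside it splits the line.

```python
import csv
import io

# The sample file from the question, inlined so the sketch is self-contained.
SAMPLE = (
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)


def field_counts(text):
    """Return how many comma-separated fields the csv module sees on each line."""
    reader = csv.reader(io.StringIO(text), delimiter=',', quotechar='"')
    return [len(row) for row in reader]


counts = field_counts(SAMPLE)
print(counts)
```

The second data line yields more fields than the first because its title contains two extra commas, which matches the mismatch reported in the error.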

Is it possible to produce the expected result? Thanks.


2 Answers

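I don't think pandas can read this file directly, because the delimiter also occurs several times inside a single value. If you read the file with plain Python and split each line yourself, however, you should be able to turn it into a DataFrame: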

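import pandas as pd

def splitValues(x):
    # Split only at the first comma; everything after it is the Content value.
    item_id, _, content = x.partition(',')
    return item_id, content.strip()

# The context manager closes the file once the DataFrame has been built.
with open('file.csv') as data:
    # The header line supplies the column names.
    columns = next(data).strip().split(',')
    df = pd.DataFrame([splitValues(row) for row in data], columns=columns)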

Output:

     ItemId                                                                          Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
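Each Content value in the sample looks like a JSON object, so one way to check that splitting at the first comma kept it intact is to parse it with json.loads. The sketch below assumes every row holds valid JSON, which the question only shows for two rows; parse_line is a hypothetical helper and the lines are inlined rather than read from a file.

```python
import json

# The two data lines from the question, inlined for a self-contained sketch.
LINES = [
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n',
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n',
]


def parse_line(line):
    """Split at the first comma only, then decode the remainder as JSON."""
    item_id, _, content = line.partition(',')
    return item_id, json.loads(content.strip())


records = [parse_line(line) for line in LINES]
titles = [content["Title"] for _, content in records]
print(titles)
```

The commas inside the second title survive, which suggests the split did not cut through the value.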

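The problem is that the commas in the Content column are interpreted as delimiters. You can work around this with pd.read_fwf by setting the character positions to split at manually. Note that this relies on every ItemId being exactly 8 characters wide and on each line being no longer than 100 characters: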

df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])  

This yields a DataFrame with the two columns ItemId and Content.
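To make the effect of those column positions concrete, here is a rough sketch of the same slicing in plain Python. It is not how read_fwf is implemented, only an approximation of what the positions (0, 8) and (9, 100) select; slice_fixed_width and the 9-character ID are hypothetical, added to show how the approach misaligns when the ID width changes.

```python
# The two data lines from the question, inlined for a self-contained sketch.
LINES = [
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}',
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}',
]

# Same half-open character ranges as in the read_fwf call; index 8 (the comma) is skipped.
COLSPECS = [(0, 8), (9, 100)]


def slice_fixed_width(line, colspecs):
    """Cut one line at fixed character positions and trim surrounding whitespace."""
    return tuple(line[start:stop].strip() for start, stop in colspecs)


rows = [slice_fixed_width(line, COLSPECS) for line in LINES]

# Hypothetical row with a 9-character ID: the fixed positions no longer line up.
shifted = slice_fixed_width('i00000100,{"a":1}', COLSPECS)
print(rows[0][0], shifted)
```

For the sample rows the slices line up with the two columns, but the hypothetical longer ID loses its last character and the leftover comma ends up in Content.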
