Pandas cannot read double-quoted elements in a CSV

32 votes
2 answers
61088 views
Asked 2025-04-30 00:07

I have an input file in which every value is stored as a string. It is a CSV file, and every entry is wrapped in double quotes.

Sample file:

"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

The file has only six columns. Which options do I need to pass to pandas read_csv to read it correctly?

Here is what I have tried so far:

import pandas as pd
df = pd.read_csv(file, quotechar='"')

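But this gives me an error: CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14

This apparently means it is ignoring the double quotes and treating every comma as a field separator. In line 3, however, columns 3 through 6 should be strings that contain commas ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD").

How can I get pandas.read_csv to parse this file correctly?

Thanks.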


2 Answers

3

This worked for me (I am using Python 3.9):

dataset = pd.read_csv('test.csv', sep=',', skipinitialspace=True)
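To check this against the sample from the question, here is a minimal sketch that reads the same four lines from an in-memory string instead of a file. The `io.StringIO` wrapper and `dtype=str` are assumptions added for illustration; `dtype=str` is there only because the question says every value is meant to be a string, so that "07" is not converted to the number 7.

```python
import io

import pandas as pd

# The sample file from the question, held in memory for the demonstration.
SAMPLE = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# skipinitialspace=True drops the blank after each comma, so the opening
# quote is the first character of the field and is treated as a quote.
# dtype=str (an assumption here) keeps values such as "07" as strings.
df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True, dtype=str)

print(df.shape)
print(df.loc[1, "column3"])
```

With the default comma separator, no regex is involved, so the default C parser should be able to handle the file.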

28

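This will work. Because the separators are irregular (sometimes a bare comma, sometimes a comma followed by spaces), the call below uses a regex separator, which makes pandas fall back to the Python parser. If the file used only commas, the C parser would be used, which is much faster.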

In [1]: import csv

In [2]: !cat test.csv
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

In [3]: pd.read_csv('test.csv',sep=r',\s+',quoting=csv.QUOTE_ALL)
pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[3]: 
     "column1","column2" "column3"   "column4"   "column5"   "column6"
"AM"                "07"       "1"        "SD"        "SD"        "CR"
"AM"                "08"   "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
"AM"                "01"       "2"        "SD"        "SD"        "SD"
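As a rough illustration of where the "saw 14" in the question's error message comes from, the sketch below splits the third line of the file with Python's standard-library `csv` module. This is a stand-in and not pandas' own tokenizer. It assumes the same rule is at work: a quote character that follows a space is not at the start of the field, so it is kept as a literal character and the commas inside it split the field.

```python
import csv
import io

# Third line of the sample file, the one named in the error message.
LINE = '"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"\n'

# Default settings: the space after each comma becomes the first character
# of the next field, so the following quote is not treated as a quote.
default_fields = next(csv.reader(io.StringIO(LINE)))

# skipinitialspace=True discards that space, so quoting works again.
trimmed_fields = next(csv.reader(io.StringIO(LINE), skipinitialspace=True))

print(len(default_fields))  # 14
print(len(trimmed_fields))  # 6
print(trimmed_fields)
```

The first count matches the number in the error message, which fits the idea that the space after each comma, rather than the quoting itself, is what breaks the parse.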
