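Python Pandas: reading a CSV file with a specific line terminator
I am trying to create a DataFrame from the sample CSV file I was given (shown below), but I get the error "Error tokenizing data. C error: EOF inside string starting at line 0". I don't have much experience handling bad lines like these, and I would really like to learn how to deal with them better. I have tried many different read_csv options, such as setting error_bad_lines=False, but none of them fixed the problem.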
CParserError: Error tokenizing data. C error: EOF inside string starting at line 0
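My guess is that the trailing ," at the end of each line is causing the problem. I think the best approach is to process the file line by line, so, with some help from others, I came up with the generator below and hope I am close to a solution. I would also like to learn how to use generators and yield along the way.
Sample data: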
"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","
"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","
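I wrote the code below, which strips the trailing ," (and the newline) from each line, but I am not sure how to build a DataFrame from the resulting lines: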
import pandas as pd

def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            # Drop the trailing ',"' and the newline character
            yield line[:-3]

gen = big_table_generator('../data/test_sun_file.csv')
pd.DataFrame(gen)
2 Answers
0
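This is the solution I came up with, although I had wanted to avoid building a list with append and use a generator instead. I am not familiar enough with generators yet, so I have not managed that.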
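import csv
from numpy import nan
import pandas as pd

def parse_file(filename):
    rows = []
    # Python 3: open in text mode with newline='' for the csv module
    with open(filename, 'r', newline='') as f:
        reader = csv.reader(f, quoting=csv.QUOTE_NONE)
        for row in reader:
            # Drop the last field (the lone '"') and strip quotes from the rest
            rows.append([s.strip('"') for s in row[:-1]])
    df = pd.DataFrame(rows)
    # Replace empty strings with NaN (applymap is deprecated in pandas 2.1+ in favor of DataFrame.map)
    df = df.applymap(lambda x: nan if len(x) == 0 else x).astype(object)
    return df

df = parse_file(filename)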
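For what it's worth, the same cleanup can probably be written as a generator, which is what the answer above was aiming for. This is only a minimal sketch: the name clean_rows and the two shortened sample rows are made up for illustration, and it assumes every line ends with the same stray ," as in the question's data.

```python
import csv
import io

import pandas as pd


def clean_rows(lines):
    """Yield one cleaned list of fields per input line (hypothetical helper)."""
    # QUOTE_NONE makes the reader treat '"' as an ordinary character,
    # so the lone trailing quote cannot start an unterminated string.
    for row in csv.reader(lines, quoting=csv.QUOTE_NONE):
        cleaned = []
        # Drop the last field (the stray '"') and strip quotes from the rest.
        for field in row[:-1]:
            value = field.strip('"')
            # Empty fields become NaN, as in the list-based version.
            cleaned.append(value if value else float("nan"))
        yield cleaned


# Two shortened rows in the same shape as the sample data (illustrative only).
sample = io.StringIO(
    '"USNC3255","27","US","","545","\n'
    '"USNC3256","27","US","sc007","517","\n'
)

# list(...) makes it explicit that the rows are materialized before
# the DataFrame is built; the generator itself stays lazy.
df = pd.DataFrame(list(clean_rows(sample)))
print(df.shape)
```

With a real file, the io.StringIO object would be replaced by a file handle opened with open(filename, newline='').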
4
I ran into a similar error and solved it by passing quoting=csv.QUOTE_NONE to read_csv.
For example:
df = pd.read_csv(csvfile, header = None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
Some background on why this works can be found in the second comment here: https://github.com/pydata/pandas/issues/5500
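Applied to the comma-separated sample from the question, this might look roughly like the sketch below. The two shortened rows are made up for illustration, and the cleanup afterwards is an assumption about what the asker wants: with QUOTE_NONE the quote characters stay in the values, and the stray trailing " ends up as an extra last column.

```python
import csv
import io

import pandas as pd

# Two shortened rows in the same shape as the sample data (illustrative only).
sample = io.StringIO(
    '"USNC3255","27","US","","545","\n'
    '"USNC3256","27","US","sc007","517","\n'
)

# QUOTE_NONE disables quote handling, so the unbalanced trailing '"'
# no longer triggers "EOF inside string". dtype=str keeps values as text.
df = pd.read_csv(sample, header=None, quoting=csv.QUOTE_NONE, dtype=str)

# The last column contains only the stray '"', so drop it.
df = df.iloc[:, :-1]

# The quotes are still part of each value; strip them column by column.
df = df.apply(lambda col: col.str.strip('"'))
print(df.shape)
```

Empty fields come out as empty strings here; they can be converted to NaN afterwards if that is what you need.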