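Python Pandas and generators for processing CSV rows
I hope this question is acceptable on StackOverflow. I'd like to know how to convert the code below, which processes each line of a file and builds a DataFrame, to use a generator and yield, because the current list-and-append approach is far too slow.
This is the solution I came up with, but I'd really like to avoid the very slow list and append operations. I'd prefer a neat generator/yield solution, but I'm not yet comfortable with generators.
Sample lines from the file: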
"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT","T72305006","","","NCC031","NCZ095","","545","28594","America/New_York","34.65266","-77.07661","7","RDU","893727","
"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ","T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York","35.29374","-81.46537","797","CLT","317845","
Current solution:
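import csv
import numpy as np
import pandas as pd

def parse_file(filename):
    rows = []
    # newline='' lets the csv module handle line endings itself (Python 3)
    with open(filename, newline='') as f:
        reader = csv.reader(f, quoting=csv.QUOTE_NONE)
        for row in reader:
            # drop the stray last field and strip the literal quote characters
            rows.append([s.strip('"') for s in row[:-1]])
    df = pd.DataFrame(rows)
    # turn empty strings into NaN
    df = df.replace('', np.nan).astype(object)
    return df

df = parse_file(filename)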
Running this on the sample lines above produces a DataFrame with 23 columns and two rows.
1 Answer
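The only problem with your file is that every line ends with a trailing ," (a comma followed by an unmatched quote). This confuses the parser. If you strip that trailing comma and quote, you can use the regular parser.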
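import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

# Strip the trailing ," from the end of every line before parsing
with open('example.txt') as myfile:
    data = myfile.read().replace(',"\n', '\n')
df = pd.read_csv(StringIO(data), header=None)
print(df)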
This is what I get:
0 1 2 3 4 5 6 7 8 9 \
0 USNC3255 27 US NC LANDS END 72305006 KNJM KNCA KNKT T72305006
1 USNC3256 27 US NC LANDSDOWN 72314058 KEHO KAKH KIPJ T72314058
... 13 14 15 16 17 18 19 \
0 ... NCZ095 NaN 545 28594 America/New_York 34.65266 -77.07661
1 ... NCZ068 sc007 517 28150 America/New_York 35.29374 -81.46537
20 21 22
0 7 RDU 893727
1 797 CLT 317845
[2 rows x 23 columns]
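For what it's worth, the generator/yield version asked about in the question could look roughly like the sketch below. It is only an illustration under a few assumptions: Python 3, the trailing ," is present on (almost) every line, and empty fields should become missing values. The helper names `clean_lines` and `parse_rows` are made up for this example, and the two sample lines from the question are inlined through `io.StringIO` in place of a real file.

```python
import csv
import io

import pandas as pd


def clean_lines(lines):
    """Yield each line with its line ending and trailing ',"' removed."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.endswith(',"'):
            line = line[:-2]
        yield line


def parse_rows(lines):
    """Lazily parse the cleaned lines; empty fields become None."""
    for row in csv.reader(clean_lines(lines)):
        yield [field if field else None for field in row]


# Stand-in for an open file: the two sample lines from the question.
sample = io.StringIO(
    '"USNC3255","27","US","NC","LANDS END","72305006","KNJM","KNCA","KNKT",'
    '"T72305006","","","NCC031","NCZ095","","545","28594","America/New_York",'
    '"34.65266","-77.07661","7","RDU","893727","\n'
    '"USNC3256","27","US","NC","LANDSDOWN","72314058","KEHO","KAKH","KIPJ",'
    '"T72314058","","","NCC045","NCZ068","sc007","517","28150","America/New_York",'
    '"35.29374","-81.46537","797","CLT","317845","\n'
)

# pandas consumes the generator row by row when building the frame.
df = pd.DataFrame(parse_rows(sample))
print(df.shape)
```

With a real file, opening it with `open('example.txt', newline='')` and passing the file object to `parse_rows` should work the same way. Note that pandas still has to hold every row in memory to build the frame, so this mainly removes the manual append loop; whether it is actually faster than the `read_csv` approach above is something to measure on the real data.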