Python: How can I optimize a CSV parsing loop?

Published 2024-04-18 11:58:22


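I wrote this loop to parse a .csv file with 1 million lines. It works, but it only processes about 7k lines per minute. Is there a reasonable way to make it run faster?

The loop currently converts each block of data into a single row, strips the unwanted characters, and writes each row to a new .csv file.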

import csv
import re

pattern = re.compile(r",{2,}")

with open("OceanData.csv") as infile, open("OceanParsed.csv", "w", newline="") as fout:
    outfile = csv.writer(fout)
    data = []
    for line in infile:
        if line.startswith("#--------------------------------------------------------------------------------"):
            outfile.writerow(data)
            continue
        for ch in ["[", "]", "'", " ", "\n"]:
            if ch in line:
                line = line.replace(ch, "")
        for i in line:
            line = re.sub(pattern, ",", line)
            continue

        if not line: continue
        data.append(line)

Sample data: http://www.sharecsv.com/s/674dc42035c29eb4f250b5c2365c8dc6/OceanParseTest.csv


1 answer
User
#1 · Posted 2024-04-18 11:58:22

Don't reinvent the wheel to read CSV files.

You can use pandas.

import pandas as pd

df = pd.read_csv('file.csv')

You can also use the csv standard library.
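As a rough sketch of how the loop from the question could be tightened while still using the csv module for output: delete the unwanted characters with one `str.translate` call, run the regex once per line rather than once per character, and start a new list after each section. The names `clean`, `parse_sections` and `convert` are made up for this illustration, and it assumes that every line starting with `#` closes a section and that each section should become one output row.

```python
import csv
import re

MULTI_COMMA = re.compile(r",{2,}")
# Delete the same characters the question strips one by one: [ ] ' space newline
STRIP_CHARS = str.maketrans("", "", "[]' \n")


def clean(line):
    """Strip unwanted characters, then collapse comma runs in a single pass."""
    return MULTI_COMMA.sub(",", line.translate(STRIP_CHARS))


def parse_sections(lines, separator_prefix="#"):
    """Yield one list of cleaned lines per section.

    Assumption: a line starting with `separator_prefix` closes the current section.
    """
    data = []
    for line in lines:
        if line.startswith(separator_prefix):
            if data:
                yield data
            data = []  # start a fresh section instead of growing one list forever
            continue
        line = clean(line)
        if line:
            data.append(line)
    if data:  # flush a trailing section that has no closing separator
        yield data


def convert(in_path, out_path):
    """Write one CSV row per section (hypothetical helper, paths are up to you)."""
    with open(in_path) as infile, open(out_path, "w", newline="") as fout:
        csv.writer(fout).writerows(parse_sections(infile))
```

Calling `convert("OceanData.csv", "OceanParsed.csv")` would mirror the file names from the question. Whether this is fast enough depends on your data, so time it on a sample first.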

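To read a large CSV file when the methods above don't work, you can split the file into smaller files and start one process to read each of them.

About your data sample.

I don't think your file is really in CSV format. Suppose you have a section like this: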
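A minimal sketch of the split-and-parallelize idea, with one simplification: instead of writing small files to disk, it cuts the lines into in-memory chunks and hands each chunk to a worker process. `chunked`, `count_data_lines` and `process_in_parallel` are hypothetical names, the per-chunk work is just a placeholder, and the chunk size and worker count are arbitrary. Note that fixed-size chunks can cut a section in half, so real parsing would need to split on section boundaries.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def chunked(lines, size):
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_data_lines(chunk):
    """Placeholder per-chunk work: count lines that are not '#' markers."""
    return sum(1 for line in chunk if not line.startswith("#"))


def process_in_parallel(lines, size=50_000, workers=4):
    """Run `count_data_lines` on each chunk in a separate worker process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_data_lines, chunked(lines, size)))
```

On platforms that spawn worker processes (Windows, recent macOS), call `process_in_parallel` from under an `if __name__ == "__main__":` guard. For a file of only tens of megabytes, the cost of shipping chunks between processes may outweigh the gain, so measure before committing to this.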

#                                        ,,,,,,
CAST                        ,,9285001,WOD Unique Cast Number,WOD code,,
NODC Cruise ID              ,,US-10209       ,,,,
Originators Station ID      ,,82,,,integer,
Originators Cruise ID       ,,               ,,,,
Latitude                    ,,-76.477,decimal degrees,,,
Longitude                   ,,166.3137,decimal degrees,,,
Year                        ,,1997,,,,
Month                       ,,1,,,,
Day                         ,,1,,,,
Time                        ,,3.9931,decimal hours (UT),,,
METADATA,,,,,,
Country                     ,,             US,NODC code,UNITED STATES,,
Accession Number            ,,520,NODC code,,,
Project                     ,,406,NODC code,RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA,,
Platform                    ,,3596,OCL code,NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725,,
Institute                   ,,431,NODC code,US DOC NOAA NESDIS,,
Cast/Tow Number             ,,1,,,,
High resolution CTD - Bottle,,9182488,,,,
probe_type                  ,,7,OCL_code,bottle/rossette/net,,
scale            ,Temperature,103,WOD code,Temperature: ITS-90,,
Instrument       ,Temperature,411,WOD code,CTD: SBE 911plus (Sea-Bird Electronics, Inc.),
VARIABLES ,Depth     ,F,O,Temperatur ,F,O
UNITS     ,m         , , ,degrees C ,, 
Prof-Flag ,          ,0, ,          ,0, 
1,0,0, ,-1.591,0, 
2,5,0, ,-1.668,0, 
3,10,0, ,-1.702,0, 
4,15,0, ,-1.733,0, 
5,20,0, ,-1.746,0, 
6,25,0, ,-1.76,0, 
7,30,0, ,-1.773,0, 
8,35,0, ,-1.785,0, 
9,40,0, ,-1.796,0, 
10,45,0, ,-1.805,0, 
11,50,0, ,-1.813,0, 
12,55,0, ,-1.823,0, 
13,60,0, ,-1.832,0, 
14,65,0, ,-1.84,0, 
15,70,0, ,-1.848,0, 
16,75,0, ,-1.855,0, 
17,80,0, ,-1.861,0, 
18,85,0, ,-1.867,0, 
19,90,0, ,-1.873,0, 
20,95,0, ,-1.878,0, 
21,100,0, ,-1.882,0, 
22,125,0, ,-1.892,0, 
23,150,0, ,    -0 -,0, 
24,175,0, ,    -0 -,0, 
25,200,0, ,    -0 -,0, 
26,225,0, ,    -0 -,0, 
27,250,0, ,    -0 -,0, 
28,275,0, ,    -0 -,0, 
29,300,0, ,    -0 -,0, 
30,325,0, ,    -0 -,0, 
31,350,0, ,    -0 -,0, 
32,375,0, ,    -0 -,0, 
33,400,0, ,    -0 -,0, 
34,425,0, ,    -0 -,0, 
35,450,0, ,    -0 -,0, 
36,475,0, ,    -0 -,0, 
37,500,0, ,    -0 -,0, 
38,550,0, ,-1.898,0, 
END OF VARIABLES SECTION,,,,,,

To clean this section:

format.sh

#!/usr/bin/env bash
# use : bash format.sh pathname    

cat "$1" | \
    grep -v '^#\|^END' | \
    sed 's/,/ /g' | tr -s " " | sed 's/ /,/' 

This gives:

CAST,9285001 WOD Unique Cast Number WOD code 
NODC,Cruise ID US-10209 
Originators,Station ID 82 integer 
Originators,Cruise ID 
Latitude,-76.477 decimal degrees 
Longitude,166.3137 decimal degrees 
Year,1997 
Month,1 
Day,1 
Time,3.9931 decimal hours (UT) 
METADATA,
Country,US NODC code UNITED STATES 
Accession,Number 520 NODC code 
Project,406 NODC code RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA 
Platform,3596 OCL code NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725 
Institute,431 NODC code US DOC NOAA NESDIS 
Cast/Tow,Number 1 
High,resolution CTD - Bottle 9182488 
probe_type,7 OCL_code bottle/rossette/net 
scale,Temperature 103 WOD code Temperature: ITS-90 
Instrument,Temperature 411 WOD code CTD: SBE 911plus (Sea-Bird Electronics Inc.) 
VARIABLES,Depth F O Temperatur F O
UNITS,m degrees C 
Prof-Flag,0 0 
1,0 0 -1.591 0 
2,5 0 -1.668 0 
3,10 0 -1.702 0 
4,15 0 -1.733 0 
5,20 0 -1.746 0 
6,25 0 -1.76 0 
7,30 0 -1.773 0 
8,35 0 -1.785 0 
9,40 0 -1.796 0 
10,45 0 -1.805 0 
11,50 0 -1.813 0 
12,55 0 -1.823 0 
13,60 0 -1.832 0 
14,65 0 -1.84 0 
15,70 0 -1.848 0 
16,75 0 -1.855 0 
17,80 0 -1.861 0 
18,85 0 -1.867 0 
19,90 0 -1.873 0 
20,95 0 -1.878 0 
21,100 0 -1.882 0 
22,125 0 -1.892 0 
23,150 0  -0 - 0 
24,175 0  -0 - 0 
25,200 0  -0 - 0 
26,225 0  -0 - 0 
27,250 0  -0 - 0 
28,275 0  -0 - 0 
29,300 0  -0 - 0 
30,325 0  -0 - 0 
31,350 0  -0 - 0 
32,375 0  -0 - 0 
33,400 0  -0 - 0 
34,425 0  -0 - 0 
35,450 0  -0 - 0 
36,475 0  -0 - 0 
37,500 0  -0 - 0 
38,550 0 -1.898 0 
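If you would rather stay in Python, the same cleanup can be sketched as follows. It mirrors the shell pipeline step by step: skip lines starting with `#` or `END`, turn commas into spaces, squeeze runs of spaces, then turn the first space back into a comma. The function names `clean_line` and `clean_lines` are made up for this example.

```python
import re


def clean_line(line):
    """Apply the sed/tr steps of format.sh to a single line."""
    line = line.rstrip("\n").replace(",", " ")  # sed 's/,/ /g'
    line = re.sub(r" +", " ", line)             # tr -s " "
    return line.replace(" ", ",", 1)            # sed 's/ /,/' (first space only)


def clean_lines(lines):
    """Yield cleaned lines, skipping '#' and 'END' lines like grep -v."""
    for line in lines:
        if line.startswith(("#", "END")):
            continue
        yield clean_line(line)
```

For example, `clean_line("1,0,0, ,-1.591,0, ")` returns `"1,0 0 -1.591 0 "`, which matches the first data row of the output above.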

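With 1 million lines and sections of about 64 lines each, I think you have about 15,000 sections.

I generated a test file like this: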

for _ in `seq 1 15000`; do cat one_section.txt >> data.txt; done

Check:

grep -n ^# data.txt | cut -d : -f1 | wc -l
wc -l data.txt
ls -sh data.txt   

This gives 15,000 sections, 960,000 lines, and 34 MB.
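The same test file can be built and checked from Python. This is only a sketch: `build_test_file` and `summarize` are hypothetical helpers, it assumes one section is saved in `one_section.txt`, and unlike the `>>` in the shell loop it overwrites the output file instead of appending to it.

```python
from pathlib import Path


def build_test_file(section_path, out_path, copies=15000):
    """Write `copies` back-to-back copies of one section to `out_path`."""
    section = Path(section_path).read_text()
    with open(out_path, "w") as out:
        for _ in range(copies):
            out.write(section)


def summarize(path):
    """Return (sections, lines), treating each line starting with '#' as a section marker."""
    sections = lines = 0
    with open(path) as f:
        for line in f:
            lines += 1
            if line.startswith("#"):
                sections += 1
    return sections, lines
```

After `build_test_file("one_section.txt", "data.txt")`, `summarize("data.txt")` reports the same two counts as the `grep` and `wc` commands.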

