Reading and selecting items from a CSV file with a varying number of columns

Posted 2024-05-13 23:39:14


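I am trying to extract some items from a CSV file, but its rows have different numbers of columns, so I cannot read it with pandas.read_csv(filepath). I need to open it so that I can select some of the items it contains. The CSV file looks like this (a blank line is added between rows to make it easier to read):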

"Path","File","Date Acquired","Sample","Misc"

"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "

"INT FID1A.CH"

"Mon Mar 25 17:48:31 2019"

"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"

1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349

2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240

3, 3.343, 3.224, 3.403,"BB ",93165657,2220621038, 44.85, 20.788

4, 5.538, 5.409, 5.598,"BB ",51783798,1975386485, 39.90, 18.492

5, 5.744, 5.693, 5.803,"BB ",24084957,360235490, 7.28, 3.372

6, 8.716, 8.676, 8.776,"BB ",8566883, 80973220, 1.64, 0.758

"Path","File","Date Acquired","Sample","Misc"

"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "

"INT FID1A.CH"

"Mon Mar 25 12:31:45 2019"

"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"

1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673

2, 2.318, 2.282, 2.384,"BB ",37916041,587535474, 11.18, 6.671

3, 3.322, 3.241, 3.381,"BB ",67715293,1373898201, 26.14, 15.600

4, 5.509, 5.406, 5.569,"BB ",39502747,1227609422, 23.36, 13.939

5, 5.731, 5.689, 5.791,"BB ",17799521,230201751, 4.38, 2.614

6, 8.717, 8.674, 8.776,"BB ",12367646,132409300, 2.52, 1.503

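What I need to do is read the entries under the headers Peak, R.T., Start, End, PK TY, ... but I cannot, because those rows have a different length from the rows before them (the Path, File, Date Acquired, ... header and the metadata lines). I cannot simply use skiprows to drop rows 0-3 and 11-14, because the number of rows in the part I want to read is not always the same (this type of file is generated by an external program, and I cannot change its structure). Is there a way to read only the parts of the CSV that fall under the headers I want, so that I can then select the data I need from those values?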

Thanks in advance for your help.


Tags: file, csv, item, function, path, header, sample, data
2 answers

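You need to do some preprocessing. When you handle data from an external system, having to account for integration points like this one is very common.

The external file contains structured data: a sequence of CSV blocks, each of which begins with 5 header lines. The last header line contains the CSV column labels.

Read the content of the external file. Adapt the code below to your needs.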

external_file_content = r'''
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "
"INT FID1A.CH"
"Mon Mar 25 17:48:31 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349
2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240
3, 3.343, 3.224, 3.403,"BB ",93165657,2220621038, 44.85, 20.788
4, 5.538, 5.409, 5.598,"BB ",51783798,1975386485, 39.90, 18.492
5, 5.744, 5.693, 5.803,"BB ",24084957,360235490, 7.28, 3.372
6, 8.716, 8.676, 8.776,"BB ",8566883, 80973220, 1.64, 0.758
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "
"INT FID1A.CH"
"Mon Mar 25 12:31:45 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673
2, 2.318, 2.282, 2.384,"BB ",37916041,587535474, 11.18, 6.671
3, 3.322, 3.241, 3.381,"BB ",67715293,1373898201, 26.14, 15.600
4, 5.509, 5.406, 5.569,"BB ",39502747,1227609422, 23.36, 13.939
5, 5.731, 5.689, 5.791,"BB ",17799521,230201751, 4.38, 2.614
6, 8.717, 8.674, 8.776,"BB ",12367646,132409300, 2.52, 1.503
'''

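Split the sequence into its individual blocks using a well-defined delimiter (here, the first header line of each block):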

parts = external_file_content.split('"Path","File","Date Acquired","Sample","Misc"')
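To see what the split actually returns, here is a minimal sketch on a shortened, made-up sample (the `sample` string and the `DELIMITER` and `blocks` names are illustrative assumptions, not part of the original file). Because the text starts with the delimiter, the first element holds nothing useful, so the real blocks start at index 1.

```python
DELIMITER = '"Path","File","Date Acquired","Sample","Misc"'

# Shortened, made-up sample with two blocks (assumption for illustration only).
sample = (
    DELIMITER + '\n"a.D"\n1,2.0\n'
    + DELIMITER + '\n"b.D"\n1,3.0\n'
)

parts = sample.split(DELIMITER)

# Drop pieces that are empty or whitespace-only, such as the text
# that precedes the first delimiter.
blocks = [part for part in parts if part.strip()]

print(len(parts), len(blocks))
```

Filtering on `part.strip()` avoids hard-coding the index of the first real block, which may help if the file sometimes starts with extra blank lines.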

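Select a single part to process further into a DataFrame. Because the content starts with the delimiter, parts[0] holds only the text before it (a single newline here), so the first block is parts[1]. Each part begins with the newline left over from the delimiter line, followed by three metadata lines, so configure pd.read_csv to skip 4 lines.

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(parts[1]), skiprows=4)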

Display the first rows of the DataFrame:

df.head(5)


   Peak   R.T.  Start    End  PK TY     Height        Area  Pct Max  Pct Total
0     1  2.082  2.063  2.189     BB  223849319  4951058782   100.00     46.349
1     2  2.317  2.281  2.386     BB   73209942  1093871144    22.09     10.240
2     3  3.343  3.224  3.403     BB   93165657  2220621038    44.85     20.788
3     4  5.538  5.409  5.598     BB   51783798  1975386485    39.90     18.492
4     5  5.744  5.693  5.803     BB   24084957   360235490     7.28      3.372
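The same idea can be extended to every block at once. The following is only a sketch under a few assumptions: the helper name `read_blocks`, the added `File` column, and the shortened `sample` text are mine, and the file name is pulled out with a naive `split('","')` that assumes no field contains that exact sequence. Instead of a fixed `skiprows=4`, it looks for the line that starts with `"Peak"`, which should tolerate a varying number of metadata lines.

```python
from io import StringIO

import pandas as pd

DELIMITER = '"Path","File","Date Acquired","Sample","Misc"'

# Shortened copy of the sample data (fewer rows per block, for illustration).
sample = r'''
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "
"INT FID1A.CH"
"Mon Mar 25 17:48:31 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349
2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "
"INT FID1A.CH"
"Mon Mar 25 12:31:45 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673
'''


def read_blocks(text):
    """Return one DataFrame holding the peak tables of every block in `text`."""
    frames = []
    for block in text.split(DELIMITER):
        if not block.strip():
            continue  # skip the empty piece before the first delimiter
        lines = block.strip().splitlines()
        # Locate the column-label line instead of assuming a fixed offset.
        start = next(i for i, line in enumerate(lines)
                     if line.startswith('"Peak"'))
        df = pd.read_csv(StringIO("\n".join(lines[start:])),
                         skipinitialspace=True)
        # The second field of the first metadata line is the file name
        # (naive split; assumes '","' never occurs inside a field).
        df["File"] = lines[0].split('","')[1]
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


combined = read_blocks(sample)
print(combined[["File", "Peak", "R.T.", "Area"]])
```

With the blocks combined and labelled, selecting data becomes ordinary DataFrame filtering, for example `combined[combined["File"] == "DGM_CPTIS003 2h.D"]`.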

Filter out the non-numeric rows:

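import csv
import pandas as pd

def gen_rows(stream):
    for row in csv.reader(stream, skipinitialspace=True):
        # Keep only data rows: the first field must be a number (the peak index).
        # Indexing instead of row.pop(0) keeps the "Peak" value, so each row
        # still has nine fields to match the nine column names below.
        if row and row[0].strip().isdigit():
            yield row

with open('data.csv', newline='') as fo:
    df = pd.DataFrame.from_records(
        gen_rows(fo),
        columns=["Peak", "R.T.", "Start", "End", "PK TY",
                 "Height", "Area", "Pct Max", "Pct Total"])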
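A variation on the same filter, sketched here with only the standard library: rather than hard-coding the column names, it remembers the most recent `"Peak"` label row and pairs each data row with it. The function name `iter_peak_rows` and the shortened `sample` text are assumptions for illustration; the values are left as strings.

```python
import csv
from io import StringIO

# Shortened copy of the sample data (fewer rows per block, for illustration).
sample = r'''
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "
"INT FID1A.CH"
"Mon Mar 25 17:48:31 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349
2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "
"INT FID1A.CH"
"Mon Mar 25 12:31:45 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673
'''


def iter_peak_rows(stream):
    """Yield one dict per data row that follows a "Peak" column-label line."""
    header = None
    for row in csv.reader(stream, skipinitialspace=True):
        if not row:
            continue            # blank line
        if row[0] == "Peak":
            header = row        # column labels of the current block
        elif row[0] == "Path":
            header = None       # a new block starts; metadata lines follow
        elif header is not None and row[0].strip().isdigit():
            # Pair each value with its label, trimming stray spaces.
            yield dict(zip(header, (cell.strip() for cell in row)))


rows = list(iter_peak_rows(StringIO(sample)))
print(len(rows), rows[0]["R.T."], rows[0]["PK TY"])
```

The resulting list of dicts can be handed to `pd.DataFrame(rows)` if a DataFrame is needed; numeric columns would then still have to be converted, for instance with `pd.to_numeric`.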
