How do I read 10 records at a time from a CSV file using pandas?

Posted 2024-06-12 06:46:54


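I want to read a CSV file that has 1000 rows, so I decided to read it in chunks. However, I ran into a problem while reading the file.

On the first iteration I want to read the first 10 records and convert specific columns into a Python dictionary. On the second iteration I want to skip the first 10 records and read the next 10 in the same way, and so on.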

input.csv:

time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
1468332421672000,206,60079,60079,60079,2,"[60000,2]"
1468332421818000,206,44664,44664,44664,2,"[40000,2]"
1468332422164000,206,48500,48500,48500,2,"[40000,2]"
1468332423490000,206,39469,37894,38206,12,"[30000,12]"
1468332422538000,206,44023,44023,44023,2,"[40000,2]"
1468332423491000,206,38813,38813,38813,2,"[30000,2]"
1468332423528000,206,75970,75970,75970,2,"[70000,2]"
1468332423533000,206,42546,42470,42508,4,"[40000,4]"
1468332423536000,206,41065,40888,40976,4,"[40000,4]"
1468332423566000,206,66401,62453,64549,6,"[60000,6]"

Code:

^{pr2}$

I am getting this error:

AttributeError: 'DataFrame' object has no attribute 'time'

I understand that on the second iteration it cannot recognize the time and split_counts attributes, but is there a way to do what I want?


2 Answers

You can use the chunksize parameter of read_csv:

import pandas as pd
import io

temp=u'''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
1468332421672000,206,60079,60079,60079,2,"[60000,2]"
1468332421818000,206,44664,44664,44664,2,"[40000,2]"
1468332422164000,206,48500,48500,48500,2,"[40000,2]"
1468332423490000,206,39469,37894,38206,12,"[30000,12]"
1468332422538000,206,44023,44023,44023,2,"[40000,2]"
1468332423491000,206,38813,38813,38813,2,"[30000,2]"
1468332423528000,206,75970,75970,75970,2,"[70000,2]"
1468332423533000,206,42546,42470,42508,4,"[40000,4]"
1468332423536000,206,41065,40888,40976,4,"[40000,4]"
1468332423566000,206,66401,62453,64549,6,"[60000,6]"'''
# After testing, replace io.StringIO(temp) with the filename

# chunksize=2 is used here only for testing
reader = pd.read_csv(io.StringIO(temp), chunksize=2)
print(reader)
# <pandas.io.parsers.TextFileReader object at 0x000000000AD1CD68>
^{pr2}$

See the pandas documentation.
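As a minimal sketch of the chunked approach applied to the question's goal, the code below builds one dictionary per chunk from the time and split_counts columns. The helper name chunk_dicts and the shortened sample data are assumptions made for illustration, and the chunk size is reduced to 2 so the small sample yields several chunks; with the real file you would pass the filename and chunksize=10.

```python
import io
import pandas as pd

# Shortened stand-in for input.csv (assumption: same columns, fewer rows).
csv_text = '''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
'''

def chunk_dicts(source, chunksize=10):
    """Yield one {time: split_counts} dict per chunk (hypothetical helper)."""
    # With chunksize set, read_csv returns an iterator of DataFrames.
    # The header row is parsed once, so every chunk keeps the column names.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        yield dict(zip(chunk["time"], chunk["split_counts"]))

dicts = list(chunk_dicts(io.StringIO(csv_text), chunksize=2))
for d in dicts:
    print(d)
```

Because the reader handles the header itself, there is no need to skip rows manually between iterations.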

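The first iteration should work fine, but every later iteration is problematic.

read_csv has a header kwarg whose default value is 'infer' (effectively 0). This means the first row of the parsed CSV is used as the column names of the DataFrame.

read_csv also has another kwarg, names.

As stated in the documentation: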
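The effect of header inference can be reproduced with a short sketch. It assumes the question's loop skipped already-read rows with skiprows (the original code is not shown, so this is a guess), and it uses a shortened sample of the data.

```python
import io
import pandas as pd

# Shortened stand-in for input.csv (assumption: same columns, fewer rows).
csv_text = '''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
'''

# Skipping the first 3 lines drops the real header plus 2 data rows,
# so the next data row is taken as the header.
df = pd.read_csv(io.StringIO(csv_text), skiprows=3, nrows=2)

print(list(df.columns))        # values from a data row, not the real names
print("time" in df.columns)    # the 'time' column no longer exists
```

Accessing df.time on such a frame raises the AttributeError shown in the question.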

header : int or list of ints, default ‘infer’ Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None

You should pass header=None together with names to read_csv.
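A minimal sketch of this suggestion follows. The helper read_batch, the COLUMNS list, and the skiprows/nrows arithmetic are assumptions for illustration; the batch size is 2 here only because the sample is short.

```python
import io
import pandas as pd

# Column names taken from the header line of input.csv.
COLUMNS = ["time", "line_id", "high", "low", "avg", "total", "split_counts"]

# Shortened stand-in for input.csv (assumption: same columns, fewer rows).
csv_text = '''time,line_id,high,low,avg,total,split_counts
1468332421098000,206,50879,50879,50879,2,"[50000,2]"
1468332421195000,206,39556,39556,39556,2,"[30000,2]"
1468332421383000,206,61636,61636,61636,2,"[60000,2]"
1468332423568000,206,47315,38931,43123,4,"[30000,2][40000,2]"
1468332423489000,206,38514,38445,38475,6,"[30000,6]"
'''

def read_batch(text, batch_index, batch_size=10):
    """Read one batch of rows with explicit column names (hypothetical helper)."""
    return pd.read_csv(
        io.StringIO(text),
        header=None,                            # do not infer a header row
        names=COLUMNS,                          # supply the column names ourselves
        skiprows=1 + batch_index * batch_size,  # header line + earlier batches
        nrows=batch_size,
    )

# The sample has 5 data rows, so batches 0..2 exist for batch_size=2.
for i in range(3):
    batch = read_batch(csv_text, i, batch_size=2)
    print(dict(zip(batch.time, batch.split_counts)))
```

Each call re-reads the file from the start, so for large files the chunksize iterator is likely the cheaper option.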
