Python csv distorts file position (tell)

3 votes

2 answers

1171 views

Asked 2025-04-17 15:53

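I am trying to work out how far through a CSV file I am, as a percentage, while reading it. I know the file object's tell() method can do this, but when I read the file object with csv.reader and loop over the reader's rows with a for loop, tell() always returns a value as if I were at the end of the file, no matter where I am in the loop. How can I find my current position?

Current code: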

with open(FILE_PERSON, 'rb') as csvfile:
    spamreader = csv.reader(csvfile)
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "|", justtesting

I added "justtesting" only to show that tell() really does return 0 before the for loop starts.

This prints the same result for every line of my CSV file: 579 of 579 | 0

What am I doing wrong?


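2 Answers

The csv.reader documentation says:

... csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called ...

So I made a small modification to the original code: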

import csv
import os
filename = "tar.data"
with open(filename, 'rb') as csvfile:
    spamreader = csv.reader(csvfile)
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "|", justtesting
###############################################
def generator(csvfile):
    # readline seems to be the key
    while True:
        line = csvfile.readline()
        if not line:
            break
        yield line
###############################################
print
with open(filename, 'rb', 0) as csvfile:
    spamreader = csv.reader(generator(csvfile))
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "-", justtesting

Running this on my test data gives the output below, which shows that the two approaches produce different results.

224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0

16 of 224 - 0
32 of 224 - 0
48 of 224 - 0
64 of 224 - 0
80 of 224 - 0
96 of 224 - 0
112 of 224 - 0
128 of 224 - 0
144 of 224 - 0
160 of 224 - 0
176 of 224 - 0
192 of 224 - 0
208 of 224 - 0
224 of 224 - 0

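I set zero buffering on open, but that made no difference; the key is the readline in the generator. This is consistent with the Python 2 documentation for file.next(), which says that iterating over a file uses a hidden read-ahead buffer, whereas readline() does not go through it.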
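For readers on Python 3, the same idea can be sketched roughly as follows. This is an adaptation under stated assumptions, not the code from the answer above: the file is read in binary mode, the data is assumed to be UTF-8, and the helper name `lines_with_progress` and the `progress` dict are hypothetical. Instead of calling tell(), the generator adds up the bytes it has handed to csv.reader, so the result does not depend on any read-ahead buffering.

```python
import csv
import io


def lines_with_progress(binary_file, progress, encoding="utf-8"):
    """Yield decoded lines and record how many bytes have been consumed.

    `progress` is a plain dict (hypothetical helper state) updated in place.
    """
    while True:
        raw = binary_file.readline()
        if not raw:
            break
        progress["bytes_read"] += len(raw)
        yield raw.decode(encoding)


# In-memory stand-in for open(path, "rb"); a real file should behave the same.
data = b"name,age\r\nalice,30\r\nbob,25\r\n"
size = len(data)
progress = {"bytes_read": 0}
positions = []

with io.BytesIO(data) as csvfile:
    for row in csv.reader(lines_with_progress(csvfile, progress)):
        positions.append(progress["bytes_read"])
        print(progress["bytes_read"], "of", size, "|", row)
```

Counting bytes in the generator keeps the progress figure tied to what csv.reader has actually been given. A record that spans several physical lines simply advances the counter by more than one line at once.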

Answered 2025-04-17 by Python大师


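Reading through csv.reader is buffered, so the file pointer jumps ahead in large blocks instead of advancing one line at a time.

The data is read in larger blocks, which makes parsing simpler. Also, CSV values can contain newlines inside quotes, so one record is not always one physical line, and treating the data strictly line by line would break.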
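The point about quoted newlines can be illustrated with a small Python 3 sketch. The sample data here is made up for illustration; it only shows that the number of physical lines and the number of CSV records can differ.

```python
import csv

# Three physical lines, but only two CSV records: the quoted field
# in the second record contains a newline.
physical_lines = [
    'id,comment\r\n',
    '1,"first line\r\n',
    'second line"\r\n',
]

rows = list(csv.reader(physical_lines))
print(len(physical_lines), "physical lines ->", len(rows), "rows")
print(rows[1])
```

Any progress figure based on counting physical lines would therefore overstate the number of records in such a file.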

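If you must provide progress reporting, you need to count the lines in advance. The code below only works correctly if your CSV file has no newlines embedded in column values: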

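import csv

with open(FILE_PERSON, 'rb') as csvfile:
    # First pass: count physical lines, then rewind for the CSV reader.
    linecount = sum(1 for _ in csvfile)
    csvfile.seek(0)
    spamreader = csv.reader(csvfile)
    # Start counting at 1 so the last row reports "N of N".
    for line, row in enumerate(spamreader, 1):
        print '{} of {}'.format(line, linecount)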

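There are other ways to count lines (see "How to get line count cheaply in Python?"), but since you are reading the file anyway to process it as CSV, you may as well use the file you already have open. I am not sure whether opening the file as a memory map and then reading it again as a normal file would perform any better.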
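If embedded newlines are a possibility, one option is to let csv.reader do the counting in the first pass as well, so that the total is a number of records rather than physical lines. Below is a minimal Python 3 sketch, assuming that parsing the file twice is acceptable. The helper name `iter_rows_with_progress` is hypothetical, and an in-memory io.StringIO stands in for the real file.

```python
import csv
import io


def iter_rows_with_progress(csvfile):
    """Yield (row_number, total_rows, row) for a seekable text file."""
    # First pass: count CSV records, not physical lines.
    total = sum(1 for _ in csv.reader(csvfile))
    csvfile.seek(0)
    # Second pass: yield rows with a 1-based counter.
    for number, row in enumerate(csv.reader(csvfile), 1):
        yield number, total, row


# Stand-in for open(path, newline=""); one quoted field contains a newline.
sample = io.StringIO('id,comment\r\n1,"two\r\nlines"\r\n2,plain\r\n', newline="")

report = []
for number, total, row in iter_rows_with_progress(sample):
    report.append((number, total))
    print("{} of {}".format(number, total))
```

The trade-off is that the file is parsed twice, which may be noticeable on very large inputs. In exchange, the final report reads "N of N" even when some values span several lines.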

Answered 2025-04-17 by Python大师
