Python Pandas: reading CSV files with a variable-length preamble

4 votes

2 answers

2387 views

Asked 2025-04-17 15:54

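Hi, I'm using pandas to read a series of files and combine them into a single DataFrame. Each file starts with a block of junk lines of variable length that I want to ignore. pd.read_csv() has a skiprows parameter that can skip those lines. I wrote a function to find the number of lines to skip, but I have to open each file twice to make it work. Is there a better way?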

import pandas as pd

HEADER = '#Start'

def header_index(file_name):
    with open(file_name) as fp:
        for ind, line in enumerate(fp):
            if line.startswith(HEADER):
                return ind

for row in directories:
    path2file = '%s%s%s' % (path2data, row, suffix)
    myDF = pd.read_csv(path2file, skiprows=header_index(path2file), header=0, delimiter='\t')

Any help is greatly appreciated.

Tags: data processing, dataframe, csv, file merging, skiprows

2 Answers

-1

Because read_csv() also accepts file-like objects, you can skip the leading junk lines yourself and then pass the file object, instead of passing the file name.

For example, replace

df = pd.read_csv(filename, skiprows=no_junk_lines(filename), ...)

with:

def forward_csv(f, prefix):
    pos = 0
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)
            return f
        pos += len(line.encode('utf-8'))

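# Open as UTF-8 with newline='' so the byte count in forward_csv() matches the file
with open(filename, encoding='utf-8', newline='') as f:
    df = pd.read_csv(forward_csv(f, HEADER), ...)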

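Notes:

readline() returns an empty string when it reaches the end of the file
Tracking the position manually instead of calling tell() saves some lseek system calls
The last line of forward_csv() assumes the input file is encoded in ASCII or UTF-8; if it is not, adjust that line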
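
The encoding assumption in the last note can be sidestepped by opening the file in binary mode, where offsets are plain byte counts. The following is only a minimal sketch under that assumption: `skip_preamble` is a hypothetical helper name (not part of pandas), the marker is assumed to be ASCII, and an in-memory buffer stands in for a real file opened with `open(filename, 'rb')`.

```python
import io

def skip_preamble(f, prefix):
    """Position binary file f at the start of the first line that begins with prefix.

    Hypothetical helper for illustration. If no such line exists, f is left at EOF.
    """
    pos = f.tell()
    for line in iter(f.readline, b''):  # readline() returns b'' at end of file
        if line.startswith(prefix):
            f.seek(pos)  # rewind to the start of the header line
            break
        pos += len(line)  # in binary mode, len(line) is the exact byte count
    return f

# In-memory stand-in for a file with two junk lines before the header
sample = io.BytesIO(b'junk line\nmore junk\n#Start\tvalue\n1\t2\n3\t4\n')
header_offset = skip_preamble(sample, b'#Start').tell()
```

The returned handle can then be passed to pd.read_csv() in place of a file name, for example with delimiter='\t' as in the question.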

Answered 2025-04-17 by Python大师


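You can now pass read_csv() an open file object that has already been advanced to the header line (I don't know whether this worked in earlier versions):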

import pandas as pd

# path2file and HEADER are the names used in the question
with open(path2file, newline='') as fp:
    pos = 0
    oldpos = None

    while pos != oldpos:  # stop once the position no longer advances (EOF)
        line = fp.readline()
        if line.startswith(HEADER):
            # set the read position to the start of the line
            # so pandas can read the header
            fp.seek(pos)
            break
        oldpos = pos
        pos = fp.tell()  # remember this position as the start of the next line

    df = pd.read_csv(fp, ...your options here...)
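
To connect this with the question's goal of combining several files into one DataFrame, the same tell()/seek() idea can be wrapped in a function and the results joined with pd.concat(). This is a rough sketch, not the answerer's code: `read_after_preamble` is a hypothetical name, and the temporary files and their contents are made up purely so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

HEADER = '#Start'

def read_after_preamble(path, **kwargs):
    """Open path once, skip every line before HEADER, and parse the rest.

    Hypothetical helper. If HEADER never appears, the file is left at EOF
    and pd.read_csv() raises pandas.errors.EmptyDataError.
    """
    with open(path, newline='') as fp:
        pos = fp.tell()
        while True:
            line = fp.readline()
            if not line or line.startswith(HEADER):
                break
            pos = fp.tell()  # start of the next line
        fp.seek(pos)  # back to the start of the header line (or EOF)
        return pd.read_csv(fp, **kwargs)

# Made-up sample files with preambles of different lengths
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, junk in enumerate(['junk\n', 'junk\nmore junk\n\n']):
        path = os.path.join(tmp, f'file{i}.tsv')
        with open(path, 'w', newline='') as out:
            out.write(junk + '#Start\tvalue\n' + f'{i}\t{i * 10}\n')
        paths.append(path)

    # Each file is opened exactly once; the frames are stacked row-wise
    combined = pd.concat(
        (read_after_preamble(p, delimiter='\t') for p in paths),
        ignore_index=True,
    )
```

Collecting the frames and calling pd.concat() once avoids overwriting the result on every loop iteration, which is what happens to myDF in the question's loop.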

Answered 2025-04-17 by Python大师
