Reading and concatenating 3,000 files into a pandas DataFrame, starting from a specific value (Python)

2 votes

3 answers

521 views

Asked 2025-04-18 12:44

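I have 3,000 .dat files that I want to read and concatenate into a single pandas DataFrame. They all share the same format (4 columns, no header), but some files begin with a description and others do not. To concatenate them, I first need to strip those leading lines. Because the number of lines to skip differs from file to file, the skiprows option of pandas.read_csv() cannot be used with a single fixed value (incidentally, I use pandas.read_csv() rather than pandas.read_table() because the files are comma-separated).

However, in all 3,000 files the first value after the lines to be skipped is the same: "2004", the first data point of my dataset.

Is there an option similar to skiprows that lets me say "start reading at '2004' and skip everything before it", applied to every file?

I am really stuck at this point and would appreciate any help.

Thanks!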

Tags: data processing, data cleaning, comma-separated, pandas, DataFrame, dataset, file merging, row skipping

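3 Answers

Use a skip_to() helper that advances the file handle to the first line starting with the given text, then pass the handle to pandas: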

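import pandas as pd


def skip_to(f, text):
    """Advance f to the start of the first line beginning with text."""
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            # reached end of file without finding the marker
            return False
        if line.startswith(text):
            # rewind so the matching line is the next one read
            f.seek(last_pos)
            return True


with open("tmp.txt") as f:
    if skip_to(f, "2004"):
        df = pd.read_csv(f, header=None)
        print(df)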
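A quick way to check how skip_to() behaves without touching real files is to feed it an in-memory buffer. The sketch below is only an illustration: the sample text is made up, and it assumes the data rows look like the four comma-separated columns described in the question.

```python
import io

import pandas as pd


def skip_to(f, text):
    """Position f at the start of the first line that begins with text."""
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            return False  # marker never found
        if line.startswith(text):
            f.seek(last_pos)  # rewind so the marker line is read next
            return True


# Made-up sample: two description lines followed by two data rows.
sample = (
    "Station 17 export\n"
    "generated for testing, not real data\n"
    "2004,1.0,2.0,3.0\n"
    "2005,4.0,5.0,6.0\n"
)

buffer = io.StringIO(sample)
found = skip_to(buffer, "2004")
# Only parse when the marker was found; otherwise fall back to an empty frame.
df = pd.read_csv(buffer, header=None) if found else pd.DataFrame()
print(found, df.shape)
```

Note that startswith("2004") would also match a description line that happens to begin with those characters; passing "2004," as the marker is a slightly stricter test if that is a concern.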

Answered 2025-04-18 by Python大师

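There is no real need to be clever here. If you have a convenient criterion, you may as well use it to work out what skiprows should be for each file, along these lines: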

import pandas as pd
import csv

def find_skip(filename):
    with open(filename, newline="") as fp:
        # (use open(filename, "rb") in Python 2)
        reader = csv.reader(fp)
        for i, row in enumerate(reader):
            # blank lines come back as [], so check the row is non-empty
            if row and row[0] == "2004":
                return i
    # no row started with "2004"
    return None

# filenames is assumed to be a list of paths to the .dat files
for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    this_df = pd.read_csv(filename, skiprows=skiprows, header=None)
    # do something here, e.g. append this_df to a list and concatenate it after the loop
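The "append to a list and concatenate after the loop" step could look roughly like the sketch below. The names load_all and marker are hypothetical, introduced here for illustration only, and the sketch assumes the description lines contain no quoted multi-line fields, so that a csv row index matches a physical line count.

```python
import csv

import pandas as pd


def find_skip(filename, marker="2004"):
    """Return the index of the first row whose first field equals marker."""
    with open(filename, newline="") as fp:
        for i, row in enumerate(csv.reader(fp)):
            if row and row[0] == marker:
                return i
    return None


def load_all(filenames, marker="2004"):
    """Read each file from its marker row onward and stack the results."""
    frames = []
    for filename in filenames:
        skiprows = find_skip(filename, marker)
        if skiprows is None:
            raise ValueError(f"no row starting with {marker!r} in {filename}")
        frames.append(pd.read_csv(filename, skiprows=skiprows, header=None))
    # A single concat at the end avoids re-copying a growing DataFrame
    # on every iteration.
    return pd.concat(frames, ignore_index=True)
```

It might be called as load_all(sorted(glob.glob("*.dat"))), where the "*.dat" pattern is a placeholder for wherever the files actually live. Collecting the frames first and concatenating once tends to matter with thousands of files, since growing a DataFrame inside the loop copies the accumulated data each time.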

Answered 2025-04-18 by Python大师

You can handle this with a loop that skips lines until it reaches the one starting with 2004.

Roughly like this:

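import io
import itertools
import pandas as pd

with open("tmp.txt") as f:
    # drop leading lines until the first one that starts with "2004"
    data_lines = itertools.dropwhile(lambda line: not line.startswith("2004"), f)
    df = pd.read_csv(io.StringIO("".join(data_lines)), header=None)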

Answered 2025-04-18 by Python大师
