混乱的CSV自动标题提取器

-1 投票

1 回答

45 浏览

提问于 2025-04-14 18:04

我有一堆（超过100个）CSV文件。每个文件里可能有空行，或者一些我不需要的行（比如一些模糊的信息，比如“恭喜你们，大家 bla bla”）。在用Pandas读取这些文件时，我需要指定哪一行是表头。处理多个文件时，这工作量可真不小。而且要注意的是，所有文件的格式都不一样。

现在我是在遍历所有行，检查每一行的所有单元格是否都是字符串，然后选定那一行作为表头。

我需要一个更好的方法，把一堆字符串压缩成一个置信度分数（这样我就能看到哪一行是最有可能的表头）

def detect_header(df, num_header_rows=2, threshold=0):
    potential_header_rows = []
    
    # Iterate through each row and compare it with the previous row
    for i in range(1, len(df)):
        row1 = df.iloc[i].apply(preprocess_string)
        row2 = df.iloc[i - 1].apply(preprocess_string)
        if list(row1).count('') > 0.5 * len(row1):
            # skip Nones
            continue
        # method that needs better implementation
        similarity = string_similarity(row1, row2)
        print(similarity)
        if similarity >= threshold:
            potential_header_rows.append(i)
            if len(potential_header_rows) == num_header_rows:
                break 
    return potential_header_rows

你能帮帮我吗？

数据清洗 pandas csv文件行处理自动标题提取置信度评分数据格式识别

1 个回答

不久前，我也遇到了类似的需求，我的做法如下。

这个函数接收已经加载好的数据表，如果第一行不是数字，它可以创建表头（如果是数字，我们就保持表格不变）：

def reset_headers(df):
    indexes = []
    indexes_final = []
    for index, row in df.iterrows():
        #Check if all the cells in these first rows are not numeric:
        if row.apply(lambda x: False if represents_number(x) else True).all():
            #So, this one should be considered as part of the header:
            indexes.append(df.iloc[index])
        else:
            break
    #The data of the dataframe discarding the rows that are considered header:
    df = df.iloc[index:, :].copy()
    # Concatenate the information in the header if there are more than one row considered header:
    if len(indexes) > 0:
        for row in zip(*indexes):
            temp = ""
            for i in row:
                temp += i + " "
            indexes_final.append(temp.strip())
        df.columns = indexes_final
    # At the end, rename properly if there are some duplicate column names:
    duplicated_cols = df.columns.duplicated()
    duplicated = df.columns[duplicated_cols].unique()
    rename_cols = []
    i=1
    for col in df.columns:
        if col in duplicated:
            rename_cols.extend([col + '_' + str(i)])
            i=i+1
        else:
            rename_cols.extend([col])
    df.columns = rename_cols

    return df

要执行上面的函数，你可能需要另一个函数来检查一个字符串是否代表一个数字，像下面这样：

def represents_number(s):
    try:
        float(s)
    except ValueError:
        return False
    except TypeError:
        return False
    else:
        return True

希望这能帮助你开始尝试...

回答于 2025-04-14 由 Python大师

分享举报

混乱的CSV自动标题提取器

1 个回答

撰写回答