混乱的CSV自动标题提取器

-1 投票
1 回答
45 浏览
提问于 2025-04-14 18:04

我有一堆(超过100个)CSV文件。每个文件里可能有空行,或者一些我不需要的行(比如一些模糊的信息,比如“恭喜你们,大家 bla bla”)。在用Pandas读取这些文件时,我需要指定哪一行是表头。处理多个文件时,这工作量可真不小。而且要注意的是,所有文件的格式都不一样。

现在我是在遍历所有行,检查每一行的所有单元格是否都是字符串,然后选定那一行作为表头。

我需要一个更好的方法,把一堆字符串压缩成一个置信度分数(这样我就能看到哪一行是最有可能的表头)

def detect_header(df, num_header_rows=2, threshold=0):
    potential_header_rows = []
    
    # Iterate through each row and compare it with the previous row
    for i in range(1, len(df)):
        row1 = df.iloc[i].apply(preprocess_string)
        row2 = df.iloc[i - 1].apply(preprocess_string)
        if list(row1).count('') > 0.5 * len(row1):
            # skip Nones
            continue
        # method that needs better implementation
        similarity = string_similarity(row1, row2)
        print(similarity)
        if similarity >= threshold:
            potential_header_rows.append(i)
            if len(potential_header_rows) == num_header_rows:
                break 
    return potential_header_rows

你能帮帮我吗?

1 个回答

1

不久前,我也遇到了类似的需求,我的做法如下。

这个函数接收已经加载好的数据表,如果第一行不是数字,它可以创建表头(如果是数字,我们就保持表格不变):

def reset_headers(df):
    indexes = []
    indexes_final = []
    for index, row in df.iterrows():
        #Check if all the cells in these first rows are not numeric:
        if row.apply(lambda x: False if represents_number(x) else True).all():
            #So, this one should be considered as part of the header:
            indexes.append(df.iloc[index])
        else:
            break
    #The data of the dataframe discarding the rows that are considered header:
    df = df.iloc[index:, :].copy()
    # Concatenate the information in the header if there are more than one row considered header:
    if len(indexes) > 0:
        for row in zip(*indexes):
            temp = ""
            for i in row:
                temp += i + " "
            indexes_final.append(temp.strip())
        df.columns = indexes_final
    # At the end, rename properly if there are some duplicate column names:
    duplicated_cols = df.columns.duplicated()
    duplicated = df.columns[duplicated_cols].unique()
    rename_cols = []
    i=1
    for col in df.columns:
        if col in duplicated:
            rename_cols.extend([col + '_' + str(i)])
            i=i+1
        else:
            rename_cols.extend([col])
    df.columns = rename_cols

    return df

要执行上面的函数,你可能需要另一个函数来检查一个字符串是否代表一个数字,像下面这样:

def represents_number(s):
    try:
        float(s)
    except ValueError:
        return False
    except TypeError:
        return False
    else:
        return True

希望这能帮助你开始尝试...

撰写回答