混乱的CSV自动标题提取器
我有一堆(超过100个)CSV文件。每个文件里可能有空行,或者一些我不需要的行(比如一些模糊的信息,比如“恭喜你们,大家 bla bla”)。在用Pandas读取这些文件时,我需要指定哪一行是表头。处理多个文件时,这工作量可真不小。而且要注意的是,所有文件的格式都不一样。
现在我是在遍历所有行,检查每一行的所有单元格是否都是字符串,然后选定那一行作为表头。
我需要一个更好的方法,把一堆字符串压缩成一个置信度分数(这样我就能看到哪一行是最有可能的表头)
def detect_header(df, num_header_rows=2, threshold=0):
potential_header_rows = []
# Iterate through each row and compare it with the previous row
for i in range(1, len(df)):
row1 = df.iloc[i].apply(preprocess_string)
row2 = df.iloc[i - 1].apply(preprocess_string)
if list(row1).count('') > 0.5 * len(row1):
# skip Nones
continue
# method that needs better implementation
similarity = string_similarity(row1, row2)
print(similarity)
if similarity >= threshold:
potential_header_rows.append(i)
if len(potential_header_rows) == num_header_rows:
break
return potential_header_rows
你能帮帮我吗?
1 个回答
1
不久前,我也遇到了类似的需求,我的做法如下。
这个函数接收已经加载好的数据表,如果第一行不是数字,它可以创建表头(如果是数字,我们就保持表格不变):
def reset_headers(df):
indexes = []
indexes_final = []
for index, row in df.iterrows():
#Check if all the cells in these first rows are not numeric:
if row.apply(lambda x: False if represents_number(x) else True).all():
#So, this one should be considered as part of the header:
indexes.append(df.iloc[index])
else:
break
#The data of the dataframe discarding the rows that are considered header:
df = df.iloc[index:, :].copy()
# Concatenate the information in the header if there are more than one row considered header:
if len(indexes) > 0:
for row in zip(*indexes):
temp = ""
for i in row:
temp += i + " "
indexes_final.append(temp.strip())
df.columns = indexes_final
# At the end, rename properly if there are some duplicate column names:
duplicated_cols = df.columns.duplicated()
duplicated = df.columns[duplicated_cols].unique()
rename_cols = []
i=1
for col in df.columns:
if col in duplicated:
rename_cols.extend([col + '_' + str(i)])
i=i+1
else:
rename_cols.extend([col])
df.columns = rename_cols
return df
要执行上面的函数,你可能需要另一个函数来检查一个字符串是否代表一个数字,像下面这样:
def represents_number(s):
try:
float(s)
except ValueError:
return False
except TypeError:
return False
else:
return True
希望这能帮助你开始尝试...