当您的数据较大时,是否有有效的方法使用第二个表填写正确的不一致数据?

2024-04-24 07:25:44 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个数据不一致的表,如下所示:

表1:

^{tb1}$

这是不一致的,因为某些字段不符合指定的格式。例如,engine_编号不应等于“000000”,也不应不存在(第2行和第3行)。我想用坏字段和相应的正确值创建另一个表(指示符表),我有另一个表(非常大),我可以用它来创建这样的指示符表

表2:

^{tb2}$

我想要得到的指示器表如下所示

表3:*指标表*

^{tb3}$

正如您所注意到的,两个表(表1和表2)都有相同的“飞机尾”、“年”和“月”列。但是,我不能合并它们来创建指示符表,因为我处理的是连续数据,而且表的维度非常大。我尝试使用模糊模糊匹配方法来匹配飞机的尾翼、年份和填写好的发动机号,但由于数据量太大,再次失败。有什么想法可以创建这样的指标表吗

我是这个领域的新手:)) 谢谢


1条回答
网友
1楼 · 发布于 2024-04-24 07:25:44

我想你可以和我一起去

import pandas as pd

df1 = pd.DataFrame(data=[
    {"flight_id":"000000_20180121","engine_number":"000000",
     "aircraft_tail":"G-RHBZ","year":"2018","month":"01"},
    {"flight_id":"258741_20171021","engine_number":"258741",
     "aircraft_tail":"H-RZBE","year":"2017","month":"10"},
    {"flight_id":"_20150214","engine_number":"",
     "aircraft_tail":"V-RDER","year":"2015","month":"02"},
    {"flight_id":"_20110287","engine_number":"NO-NUMBER", 
     "aircraft_tail":"G-EHRK","year":"2011","month":"12"}]
)
df2 = pd.DataFrame(data=[
    {"engine_number":"258741","aircraft_tail":"H-RZBE","year":"2017","month":"10"},
    {"engine_number":"348741","aircraft_tail":"V-RDER","year":"2015","month":"02"},
    {"engine_number":"348741","aircraft_tail":"V-RDER","year":"2015","month":"03"},
    {"engine_number":"589745","aircraft_tail":"G-RHBZ","year":"2018","month":"01"},
    {"engine_number":"587981","aircraft_tail":"G-EHRK","year":"2011","month":"12"}]
    )

# Validator function
def bad_engine_number_detector(engine_number):

    lst_invalid_engine_number = ["000000", "NO-NUMBER"]

    is_bad_engine_number = False
    if engine_number == "":
        is_bad_engine_number = True
    elif engine_number in lst_invalid_engine_number:
        is_bad_engine_number = True

    return is_bad_engine_number
    
# Identify invalid entries on df1
mask = df1["engine_number"].apply(bad_engine_number_detector)

# Merge both tables (df1 filtered only with bad entries)
df1.loc[mask].merge(df2, 
                    on=["aircraft_tail","year","month"],
                    suffixes=["_bad","_good"])
^{tb1}$

相关问题 更多 >