Data quality framework
Detailed description of the contessa Python project
Quick usage
    from contessa import ContessaRunner, NOT_NULL, GT, SQL

    # Custom SQL rule: yields false for rows that violate the check.
    no_bags_sql = """
        SELECT
            CASE WHEN is_no_bags_booking = 'T' AND bags > 0 THEN false ELSE true END
        FROM {{table_fullname}};
    """

    contessa = ContessaRunner("postgres://:@localhost:5432")

    RULES = [
        {
            "name": NOT_NULL,
            "columns": ["status", "market", "src", "dst"],
        },
        {
            "name": GT,
            "value": 0,
            "columns": ["initial_price", "turnover_before_refunds"],
        },
        {
            "name": SQL,
            "sql": no_bags_sql,
            "description": "No bags booking should have bags = 0",
        },
    ]

    ts_nodash = "20191010T101010"  # should be set dynamically (e.g. by airflow), just example here
    contessa.run(
        raw_rules=RULES,
        check_table={"schema_name": "temporary", "table_name": f"my_table_{ts_nodash}"},
        result_table={"schema_name": "dq", "table_name": "my_table"},
    )
This produces the table dq.quality_check_my_table, in which each row looks like:
    class QualityCheck:
        id = Column(BIGINT, primary_key=True)
        attribute = Column(TEXT)
        rule_name = Column(TEXT)
        rule_description = Column(TEXT)
        total_records = Column(INTEGER)

        failed = Column(INTEGER)
        median_30_day_failed = Column(DOUBLE_PRECISION)
        failed_percentage = Column(DOUBLE_PRECISION)

        passed = Column(INTEGER)
        median_30_day_passed = Column(DOUBLE_PRECISION)
        passed_percentage = Column(DOUBLE_PRECISION)

        status = Column(TEXT)
        time_filter = Column(TEXT)
        task_ts = Column(TIMESTAMP(timezone=True), nullable=False, index=True)
        created_at = Column(
            DateTime(timezone=True),
            server_default=text("NOW()"),
            nullable=False,
            index=True,
        )
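To make the link between a SQL rule and the counts in such a row more concrete, here is a minimal sketch. It does not use contessa: it runs a rule of the same shape against an in-memory SQLite table (a hypothetical stand-in named `my_table`) and derives `total_records`, `passed`, `failed` and `failed_percentage` by hand. Treating the percentage as the share of all checked records is an assumption, not something confirmed by the library.

```python
import sqlite3

# Hypothetical stand-in table; contessa itself targets a real database such as Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (is_no_bags_booking TEXT, bags INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("T", 0), ("T", 2), ("F", 1), ("F", 0)],
)

# Same shape as the no-bags rule, using 1/0 instead of true/false for SQLite portability.
rule_sql = """
    SELECT CASE WHEN is_no_bags_booking = 'T' AND bags > 0 THEN 0 ELSE 1 END
    FROM my_table
"""
results = [row[0] for row in conn.execute(rule_sql)]

total_records = len(results)
passed = sum(results)              # rows where the rule evaluated to 1
failed = total_records - passed    # rows where the rule evaluated to 0
# Assumed definition: percentage of all checked records that failed.
failed_percentage = 100.0 * failed / total_records

print(total_records, passed, failed, failed_percentage)
```

Only the second sample row (a no-bags booking that still carries two bags) violates the rule, so it is the single record counted as failed.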
How to run tests
    $ make test-up                     # run postgres + app
    $ make test args="/app/test -s"    # args for pytest
    $ make test-down                   # delete containers + volumes
To run only the unit tests (no database needed):
$ pytest test/unit/test_operator.py
Context
Each run has its own context, which is used mainly for templating the final SQL. The context looks like this:
    {
      "table_fullname": "public.my_cool_table"
      "task_ts": passed by the client or datetime.now()
    }
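As a rough illustration of how such a context could be applied to a rule's SQL, the sketch below substitutes `{{name}}` placeholders using only the standard library. The `render` helper is hypothetical and is not contessa's API; the library's actual template engine may behave differently. The `task_ts` key name is assumed from the column of the same name in the result table.

```python
import re
from datetime import datetime, timezone


def render(template: str, context: dict) -> str:
    """Replace each {{ name }} placeholder with str(context[name])."""

    def substitute(match: re.Match) -> str:
        # A missing key raises KeyError, so typos in templates fail loudly.
        return str(context[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Example context; the values here are illustrative only.
context = {
    "table_fullname": "public.my_cool_table",
    "task_ts": datetime(2019, 10, 10, 10, 10, 10, tzinfo=timezone.utc),
}

sql = render("SELECT COUNT(*) FROM {{table_fullname}};", context)
print(sql)
```

With this context the placeholder resolves to the fully qualified table name, which is why a rule can be written once and reused against differently named check tables.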