A command-line tool for testing datasets.



The JSON file `example_config/configuration.json` contains an example configuration for Dtest and Spark, along with the data elements and tests to execute.

There are two main connection types:

  • Database connection
  • File connection (to be split into local and S3)

A data definition describes one of the following three things:

  • A database table
  • A file (CSV or Parquet)
  • A database query

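The tests section defines the tests that can be executed. The following tests are currently available:

`unique` - checks the uniqueness of a list of fields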

Required:
{ "fields" : [list of fields to check for uniqueness],
  "dataset" : [the dataset against which you're running the test]
}

Optional:
{ "filter" : [a sql syntax filter] }

Example:

        "product-id-uniqueness": {
            "description": "product_id unique check",
            "test_type": "unique",
            "dataset": "table_name",
            "field": ["product_id"],
            "severity": "Error"
        }
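
One way a uniqueness check like this could be expressed is a `GROUP BY ... HAVING COUNT(*) > 1` query. The sketch below is only an illustration: it uses the standard-library `sqlite3` module instead of Spark, and the helper `count_duplicate_keys` and the `products` table are hypothetical names, not part of testaton.

```python
import sqlite3


def count_duplicate_keys(conn, dataset, fields, filter_sql=None):
    """Return how many distinct key combinations occur more than once.

    Hypothetical helper: identifiers are interpolated directly into the SQL,
    so it assumes they come from a trusted configuration file.
    """
    cols = ", ".join(fields)
    where = f"WHERE {filter_sql}" if filter_sql else ""
    query = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {cols} FROM {dataset} {where} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1) AS dup"
    )
    return conn.execute(query).fetchone()[0]


# Small in-memory dataset in which product_id 2 appears twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "c"), (3, "d")],
)

duplicates = count_duplicate_keys(conn, "products", ["product_id"])
print(duplicates)  # 1
```

A result greater than zero would correspond to a failed test; passing a `filter_sql` value narrows the rows that are checked.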

`foreign_key` - enforces a relational foreign key constraint by checking that a field in the child table exists in the parent table.

Required:
{
    "parent_dataset" : [the parent dataset (one with primary key)]
    "parent_field" : [the field name of the parent dataset]
    "child_dataset" : [the child dataset]
    "child_field" : [the field in the child dataset]
}

Optional:
{ "filter" : [a sql syntax filter that is applied to both tables] }

Example:

        "customer-transaction-fk": {
            "description": "customer vs transaction test",
            "test_type": "foreign_key",
            "parent_dataset": "table_name",
            "parent_field": "customer_id",
            "child_dataset": "table_name",
            "child_field": "transaction_id",
            "filter" : "product_id is not null",
            "severity": "Error"
        }
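
A foreign key check of this kind can be thought of as counting child rows with no matching parent row. The following is a minimal sketch using the standard-library `sqlite3` module; the helper `count_orphans` and the sample tables are hypothetical, and treating NULL child keys as non-violations is an assumption rather than documented testaton behaviour.

```python
import sqlite3


def count_orphans(conn, parent_dataset, parent_field, child_dataset, child_field):
    """Count child rows whose key has no matching row in the parent dataset.

    Hypothetical helper: NULL child keys are ignored (an assumption), and the
    identifiers are assumed to come from a trusted configuration file.
    """
    query = (
        f"SELECT COUNT(*) FROM {child_dataset} AS c "
        f"LEFT JOIN {parent_dataset} AS p "
        f"ON c.{child_field} = p.{parent_field} "
        f"WHERE c.{child_field} IS NOT NULL AND p.{parent_field} IS NULL"
    )
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER)")
conn.execute(
    "CREATE TABLE transactions (transaction_id INTEGER, customer_id INTEGER)"
)
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
# Transaction 12 points at customer 3, which does not exist.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(10, 1), (11, 2), (12, 3)],
)

orphans = count_orphans(
    conn, "customers", "customer_id", "transactions", "customer_id"
)
print(orphans)  # 1
```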

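`filter` - counts the records that match a filter. If the count is greater than 0, the test fails. The value returned on failure is the number of matching records.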

Required:
{
    "filter": [a valid SQL filter for the dataset in question]
    "dataset" : [the dataset against which you're running the test]
}

Example:

        "gender-null": {
            "description": "gender null",
            "test_type": "filter",
            "dataset": "table_name",
            "filter": "gender is null",
            "severity": "Info"
        }
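
The pass/fail rule described above (fail when any record matches, and report the match count) could be sketched as follows. This uses the standard-library `sqlite3` module for illustration; `run_filter_test`, the result keys, and the `people` table are hypothetical.

```python
import sqlite3


def run_filter_test(conn, dataset, filter_sql):
    """Count rows matching the filter; the test passes only if none match.

    Hypothetical helper: the dataset name and filter are interpolated as-is,
    so they are assumed to come from a trusted configuration file.
    """
    query = f"SELECT COUNT(*) FROM {dataset} WHERE {filter_sql}"
    matches = conn.execute(query).fetchone()[0]
    return {"passed": matches == 0, "value": matches}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("ann", "F"), ("bob", None), ("cy", None)],
)

result = run_filter_test(conn, "people", "gender is null")
print(result)  # {'passed': False, 'value': 2}
```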

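`field_accuracy` - compares two fields that should contain the same data and computes statistics about the accuracy of the data. This test does not pass or fail; it returns a table with statistics about the dataset.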

Required:
{
    "fields" : [an array with the two fields to compare in the dataset]
    "dataset" : [the dataset against which you're running the test]
}

Example:

        "accuracy-check": {
            "description": "Compare the value of two fields",
            "test_type": "field_accuracy",
            "dataset": "some-file",
            "fields": [
                "field1",
                "field1_b"
            ]
        }
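
The README does not list which statistics are produced, so the sketch below only illustrates the general idea with a match count and match rate. The function `field_accuracy` and the names of the returned statistics are assumptions, not testaton's actual output.

```python
def field_accuracy(rows, field_a, field_b):
    """Compare two fields row by row and summarise how often they agree.

    Hypothetical helper: the statistics chosen here (total, matches,
    mismatches, match_rate) are illustrative only.
    """
    total = len(rows)
    matches = sum(1 for row in rows if row[field_a] == row[field_b])
    return {
        "total": total,
        "matches": matches,
        "mismatches": total - matches,
        "match_rate": matches / total if total else None,
    }


rows = [
    {"field1": 1, "field1_b": 1},
    {"field1": 2, "field1_b": 2},
    {"field1": 3, "field1_b": 4},  # the only disagreement
    {"field1": None, "field1_b": None},
]

stats = field_accuracy(rows, "field1", "field1_b")
print(stats)
```

In this sketch two missing values count as a match; whether that is desirable depends on the dataset.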

`data_load_check` - a test that confirms data has been loaded for every date in a range

Required:
{
        "date_field": [the date field to check in the dataset]
        "dataset" : [the dataset to check]
        "start_date" : [the start date for the date load check, format YYYYMMDD]
        "end_date" : [the end date for the date load check, format YYYYMMDD]
        "date_table" : [the name of the date table]
        "date_type" : [the type of date that will be used, must be one of "string_8ch", "string_dash", "date"]
}

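Note: a date table is required to run this test. The table should list every date in the required period.

Two date formats are supported:

  • A string in yyyymmdd format
  • A string or date in yyyy-mm-dd format

The date table should have a date field named `date_id` (format yyyymmdd).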

Example:

        "sfmc-send-job-load": {
            "description": "Check if the send job table has data loaded for all days in May",
            "test_type": "data_load_check",
            "date_field": "event_date_id",
            "dataset": "sfmc-open",
            "start_date": "20190501",
            "end_date": "20190531",
            "date_table": "date-table",
            "severity": "Warn",
            "date_type": "date"
        }
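
Conceptually, this check compares the dates that should be present (the date table restricted to the start and end dates) with the dates actually found in the dataset. The sketch below shows that comparison in plain Python for the yyyymmdd string format; `missing_load_dates` is a hypothetical helper, and it generates the expected dates itself instead of reading a date table.

```python
from datetime import datetime, timedelta


def missing_load_dates(loaded_date_ids, start_date, end_date):
    """Return the yyyymmdd dates in [start_date, end_date] that have no data.

    Hypothetical helper: the expected dates are generated here, standing in
    for the date table that the real test requires.
    """
    start = datetime.strptime(start_date, "%Y%m%d").date()
    end = datetime.strptime(end_date, "%Y%m%d").date()
    expected = [
        (start + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range((end - start).days + 1)
    ]
    loaded = set(loaded_date_ids)
    return [day for day in expected if day not in loaded]


# Data was loaded for 1, 2 and 4 May only.
loaded = ["20190501", "20190502", "20190504"]
missing = missing_load_dates(loaded, "20190501", "20190505")
print(missing)  # ['20190503', '20190505']
```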

`dataset_size` - a test that ensures the dataset in use has a row count within a specific range (inclusive).


Required:
{
    "min_value" : [the lowest acceptable value of rows needed in the dataset]
    "max_value" : [the highest number of rows allowed in the dataset]
}

Example:

        "dataset_size_test":{
            "description": "check the number of rows in dataset",
            "test_type": "dataset_size",
            "dataset": "flights",
            "min_value": "5000",
            "max_value": "6000",
            "filter": "carrier != 'American Airlines'",
            "severity": "Error"
        }
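
Because the range is inclusive, a row count equal to either bound passes. A minimal sketch of that comparison follows; `dataset_size_ok` is a hypothetical helper, and converting the bounds with `int()` is an assumption based on the example above, where they are written as strings.

```python
def dataset_size_ok(row_count, min_value, max_value):
    """Return True if row_count lies within [min_value, max_value], inclusive.

    Hypothetical helper: the bounds are converted with int() because the
    example configuration stores them as strings.
    """
    return int(min_value) <= row_count <= int(max_value)


print(dataset_size_ok(5000, "5000", "6000"))  # True
print(dataset_size_ok(6001, "5000", "6000"))  # False
```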

### Optional fields supported in all tests

The following fields are supported in all tests:

*severity*  - The severity level of the test failure. Can be one of (Error, Warn, Info)

*disabled* - Enables a test to be disabled in the script. Can be either true or false
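
As an illustration of how the `disabled` flag might be honoured, the sketch below filters a dictionary of test definitions before execution. The helper `enabled_tests` is hypothetical, and it assumes `disabled` is a JSON boolean that defaults to false when omitted.

```python
def enabled_tests(tests):
    """Return only the test definitions that are not marked as disabled.

    Hypothetical helper: a missing "disabled" key is treated as false.
    """
    return {
        name: definition
        for name, definition in tests.items()
        if not definition.get("disabled", False)
    }


tests = {
    "gender-null": {"test_type": "filter", "severity": "Info"},
    "product-id-uniqueness": {
        "test_type": "unique",
        "severity": "Error",
        "disabled": True,
    },
}

active = enabled_tests(tests)
print(list(active))  # ['gender-null']
```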

#### Date decoding

For date filters, you can specify a value of TODAY, optionally with an offset in days, in place of a fixed date.

The format for specifying a date offset is `{TODAY}` or `{TODAY-x}`, where `x` is the number of days before today.

For example:

    "sfmc-send-job-load": {
        "description": "Check if the send job table has data loaded for all days",
        "test_type": "data_load_check",
        "date_field": "event_date_id",
        "dataset": "sfmc-open",
        "start_date": "20190501",
        "end_date": "{TODAY-1}",
        "date_table": "date-table",
        "severity": "Warn"
    }

By default the value is resolved to a date string in the format yyyy-mm-dd, which supports queries against a date-typed field in the database. If you need a plain string instead, for example to compare with a `date_id` field, add `:STR` to the definition, e.g. `TODAY:STR` or `TODAY:STR-1` (yesterday in string format).
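
The decoding rules above could be sketched roughly as follows. This is not testaton's parser: the function `decode_dates`, the requirement that every token is wrapped in braces, and the assumption that `:STR` yields yyyymmdd are all illustrative guesses based on the description.

```python
import re
from datetime import date, timedelta

# Matches {TODAY}, {TODAY-3}, {TODAY:STR} and {TODAY:STR-3} (assumed syntax).
_TOKEN = re.compile(r"\{TODAY(?P<as_str>:STR)?(?:-(?P<days>\d+))?\}")


def decode_dates(text, today=None):
    """Replace TODAY tokens in text with concrete dates.

    Hypothetical helper: plain tokens become yyyy-mm-dd, and tokens with
    :STR become yyyymmdd (an assumption about the string format).
    """
    today = today or date.today()

    def _replace(match):
        offset = int(match.group("days") or 0)
        resolved = today - timedelta(days=offset)
        fmt = "%Y%m%d" if match.group("as_str") else "%Y-%m-%d"
        return resolved.strftime(fmt)

    return _TOKEN.sub(_replace, text)


fixed_today = date(2019, 6, 1)  # pinned so the output is reproducible
print(decode_dates("{TODAY-1}", fixed_today))      # 2019-05-31
print(decode_dates("{TODAY:STR-1}", fixed_today))  # 20190531
```

Passing `today` explicitly keeps the function easy to test; in normal use it would fall back to the current date.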

## Installation

`pip install testaton`

## Requirements

A local installation of Spark is required if `spark-config:master` is set to `local`.

## Execution 

`testaton configuration-file.json`

## Configuration
#### Dtest
See the [Dtest](https://github.com/sjensen85/dtest) documentation.

- `test-suite-metadata` is translated to the `metadata` argument
- `message-broker-config` is translated to the `connectionConfig` argument

#### Spark
The configuration values for Spark are the master node and the application name. These translate to the corresponding arguments needed to build a SparkSession. More information can be found in the official [SparkSession documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.Builder).

The `master` configuration variable sets the Spark master URL to connect to, such as `local` to run locally, `local[4]` to run locally with 4 cores, or `spark://ip-of-master:7077` to run on a Spark standalone cluster.

The `app-name` configuration variable sets a name for the application, which will be shown in the Spark web UI.
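
To make the mapping concrete, the sketch below shows how these two values could be passed to `SparkSession.builder`. The helper names and the nesting of `master` and `app-name` under a `spark-config` key are assumptions inferred from the names used in this README; `pyspark` is imported lazily so the settings helper can be used without Spark installed.

```python
def spark_builder_settings(config):
    """Extract the (master, app name) pair from a parsed configuration dict.

    Hypothetical helper: assumes both values live under "spark-config".
    """
    spark_config = config["spark-config"]
    return spark_config["master"], spark_config["app-name"]


def build_spark_session(config):
    """Build a SparkSession from the configuration (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    master, app_name = spark_builder_settings(config)
    return SparkSession.builder.master(master).appName(app_name).getOrCreate()


example_config = {
    "spark-config": {"master": "local[4]", "app-name": "testaton"}
}

settings = spark_builder_settings(example_config)
print(settings)  # ('local[4]', 'testaton')
```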

## TODO

**Testing the testaton**
- [ ] test all the current available tests on a spark cluster
- [ ] add unit tests
	- [ ] add unit tests for the generate sql code statements 

**Enhancements to current tests**
- [ ] update the unique filter test to check uniqueness of multiple fields
- [ ] update the daily check test query to support row count validation
- [ ] design a structure for a generic sql test, e.g. 
    "raw-query-test-example" : {
        "description" : "NOT IMPLEMENTED!! example of a raw sql test", 
        "test_type" : "custom_sql",
        "table" : "cinema-file",
        "sql_code" : "select count(1) error_cells from cinema where cinema_id < 1000",
        "validation" : "df['error_cells'] < 100"
    }

**New tests and test enhancements**
- [x] create a test to check that the number of rows in a table is within a range
- [ ] count of yesterday's record > today + 10%
- [ ] add optional threshold ranges to the tests

**Other**
- [ ] json configuration validator (syntax)
	- [ ] validation of the existence of files, configurations, etc (semantics)
- [ ] convert testing code into an extendable class
- [ ] cross environment test execution (e.g. a table in a database and a file in parquet)

## Done

- [x] add timing calculation to the execution of the test
- [x] count of null fields > amount 
- [x] complete Dtest integration to the suite (sending the message) 
- [x] add a score function test against two variables from two data sets
- [x] remove username and password from test file
- [x] filter : a number is out of range (e.g. mileage < 0)
- [x] update the documentation to explain the different types of tests 
- [x] ensure that the integration with dtest 0.19 works
- [x] ensure that sending sample data to the UI works
