Apache Griffin Measure Configuration Guide

The Apache Griffin measure module needs two configuration files to define its execution parameters: one for the environment and one for the DQ job.

Environment Parameters

{
  "spark": {
    "log.level": "WARN",
    "checkpoint.dir": "hdfs:///griffin/streaming/cp",
    "batch.interval": "1m",
    "process.interval": "5m",
    "config": {
      "spark.default.parallelism": 5,
      "spark.task.maxFailures": 5,
      "spark.streaming.kafkaMaxRatePerPartition": 1000,
      "spark.streaming.concurrentJobs": 4,
      "spark.yarn.maxAppAttempts": 5,
      "spark.yarn.am.attemptFailuresValidityInterval": "1h",
      "spark.yarn.max.executor.failures": 120,
      "spark.yarn.executor.failuresValidityInterval": "1h",
      "spark.hadoop.fs.hdfs.impl.disable.cache": true
    }
  },

  "sinks": [
    {
      "name": "ConsoleSink",
      "type": "console",
      "config": {
        "max.log.lines": 100
      }
    }, {
      "name": "HdfsSink",
      "type": "hdfs",
      "config": {
        "path": "hdfs:///griffin/streaming/persist",
        "max.lines.per.file": 10000
      }
    }
  ],

  "griffin.checkpoint": [
    {
      "type": "zk",
      "config": {
        "hosts": "<zookeeper host ip>:2181",
        "namespace": "griffin/infocache",
        "lock.path": "lock",
        "mode": "persist",
        "init.clear": true,
        "close.clear": false
      }
    }
  ]
}

The example above lists the environment parameters.

  • spark: This field configures Spark and Spark Streaming parameters.
    • log.level: Level of the Spark log.
    • checkpoint.dir: Checkpoint directory of Spark Streaming, for streaming mode.
    • batch.interval: Interval of dumping streaming data, for streaming mode.
    • process.interval: Interval of processing dumped streaming data, for streaming mode.
    • config: Configuration of Spark parameters.
  • sinks: This field configures the list of sink definitions used to persist both records and metrics. Details of sink configuration are in the Sinks section below.
  • griffin.checkpoint: This field configures the list of Griffin checkpoint parameters; multiple cache types are supported. It applies only to streaming DQ jobs. Details of the info cache configuration are in the Griffin Checkpoint section below.
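
Since checkpoint.dir, batch.interval, process.interval and griffin.checkpoint are described above as streaming-only, a batch job should be able to run with a much smaller environment file. The following is a minimal sketch under that assumption; whether every streaming field can really be omitted depends on the Griffin version in use, and the sink name is only illustrative.

```json
{
  "spark": {
    "log.level": "WARN"
  },
  "sinks": [
    {
      "name": "ConsoleSink",
      "type": "console",
      "config": {
        "numRows": 20
      }
    }
  ],
  "griffin.checkpoint": []
}
```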

Sinks

Sinks allow persistence of job metrics and bad data (source records that violated the defined rules) to external storage systems. Sinks must be defined in the Env Config, and their names are referenced in the Job Config.

List of supported sinks:

  • Console
  • HDFS
  • MongoDB
  • ElasticSearch
  • Custom Implementations

Configuration

A sample sink configuration is as follows:

...

"sinks": [
   {
     "name": "ConsoleSink",
     "type": "CONSOLE",
     "config": {
       "numRows": 10,
       "truncate": false
     }
   }
 ]

...
Key Parameters:

| Name | Type | Description | Supported Values |
|------|------|-------------|------------------|
| name | String | User defined unique name for Sink | |
| type | String | Type of Sink (Value is case insensitive) | console, hdfs, elasticsearch, mongodb, custom |
| config | Object | Configuration params of the sink | Depends on sink type (see below) |

For Custom Sinks:
  • The config object must contain the key class, whose value specifies the class name of the user-provided sink implementation. This class should implement the org.apache.griffin.measure.sink.Sink trait.
  • Example:
    ...
    
    "sinks": [
      {
        "name": "MyCustomSink",
        "type": "CUSTOM",
        "config": {
          "class": "my.package.sink.MyCustomSinkImpl",
          ...
        }
      }
    ]
    
    ...
    

Note: The user-provided sink must be present on the Spark job's class path, either by passing the custom jar to spark-submit with the --jars parameter or by setting spark.jars in the spark -> config section of the environment config.
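
As a rough illustration of the second option in the note above, the environment config could carry the jar location through the standard Spark property spark.jars. This is only a sketch: the jar path is hypothetical, and the class name is the placeholder from the example above.

```json
{
  "spark": {
    "log.level": "WARN",
    "config": {
      "spark.jars": "hdfs:///path/to/my-custom-sink.jar"
    }
  },
  "sinks": [
    {
      "name": "MyCustomSink",
      "type": "CUSTOM",
      "config": {
        "class": "my.package.sink.MyCustomSinkImpl"
      }
    }
  ]
}
```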

For Console Sink:
  • Console Sink supports the following configurations. Aliases such as 'Log' are also accepted as the value for type.

    | Name | Type | Description | Default Values |
    |------|------|-------------|----------------|
    | numRows | Integer | Number of records to log | 20 |
    | truncate | Boolean | If true, strings more than 20 characters will be truncated and all cells will be aligned right | true |
  • Example:

    ...
    
    "sinks": [
        {
          "name": "ConsoleSink",
          "type": "CONSOLE",
          "config": {
            "numRows": 10,
            "truncate": false
          }
        }
      ]
    
    ...
    
For HDFS Sink:
  • HDFS Sink supports the following configurations.

    | Name | Type | Description | Default Values |
    |------|------|-------------|----------------|
    | path | String | HDFS base path to sink metrics | |
    | max.persist.lines | Integer | the max lines of total sink data | -1 |
    | max.lines.per.file | Integer | the max lines of each sink file | 1000000 |
  • Example:

    ...
    
    "sinks": [
        {
          "name": "hdfsSink",
          "type": "HDFS",
          "config": {
            "path": "hdfs://localhost/griffin/batch/persist",
            "max.persist.lines": 10000,
            "max.lines.per.file": 10000
          }
        }
      ]
    
    ...
    
For MongoDB Sink:
  • MongoDB Sink supports the following configurations. Aliases such as 'Mongo' are also accepted as the value for type.

    | Name | Type | Description | Default Values |
    |------|------|-------------|----------------|
    | url | String | URL of MongoDB | |
    | database | String | Database name | |
    | collection | String | Collection name | |
    | over.time | Long | Wait Duration | -1 |
    | retry | Int | Number of retries | 10 |
  • Example:

    ...
    
    "sinks": [
        {
          "name": "MongoDBSink",
          "type": "MongoDB",
          "config": {
            ...
          }
        }
      ]
    
    ...
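
For orientation, a filled-in MongoDB sink might look like the sketch below. Every value here (connection URL, database and collection names) is a hypothetical placeholder rather than a value taken from Griffin, and over.time is left out so that its default applies.

```json
{
  "sinks": [
    {
      "name": "MongoDBSink",
      "type": "MongoDB",
      "config": {
        "url": "mongodb://localhost:27017",
        "database": "griffin",
        "collection": "metrics",
        "retry": 10
      }
    }
  ]
}
```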
    
For Elasticsearch Sink:
  • Elasticsearch Sink supports the following configurations. Aliases such as 'ES' and 'HTTP' are also accepted as the value for type.

    | Name | Type | Description | Default Values |
    |------|------|-------------|----------------|
    | api | String | API endpoint to submit sink metrics to | |
    | method | String | HTTP method | post |
    | connection.timeout | Long | Wait Duration | -1 |
    | retry | Integer | Number of retries | 10 |
  • Example:

    ...
    
    "sinks": [
        {
          "name": "ElasticsearchSink",
          "type": "Elasticsearch",
          "config": {
            ...
          }
        }
      ]
    
    ...
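
A filled-in Elasticsearch sink could be sketched as follows. The api URL is a hypothetical endpoint that must be adapted to the actual cluster and index, and connection.timeout is left out so that its default applies.

```json
{
  "sinks": [
    {
      "name": "ElasticsearchSink",
      "type": "Elasticsearch",
      "config": {
        "api": "http://localhost:9200/griffin/accuracy",
        "method": "post",
        "retry": 10
      }
    }
  ]
}
```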
    

Griffin Checkpoint

  • type: Griffin checkpoint type, "zk" for ZooKeeper checkpoint.
  • config: Configuration parameters of the Griffin checkpoint type.
    • ZooKeeper checkpoint
      • hosts: ZooKeeper hosts list as a comma-separated string.
      • namespace: namespace of cache info, "" by default.
      • lock.path: path of lock info, "lock" by default.
      • mode: create mode of the ZooKeeper node, "persist" by default.
      • init.clear: whether to clear cache info on initialization, true by default.
      • close.clear: whether to clear cache info when the connection closes, false by default.

DQ Job Parameters

{
  "name": "accu_batch",
  "process.type": "BATCH",
  "data.sources": [
    {
      "name": "src",
      "connector": {
        "type": "AVRO",
        "config": {
          "file.path": "<path>/<to>",
          "file.name": "<source-file>.avro"
        }
      }
    },
    {
      "name": "tgt",
      "connector": {
        "type": "AVRO",
        "config": {
          "file.path": "<path>/<to>",
          "file.name": "<target-file>.avro"
        }
      }
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "ACCURACY",
        "out.dataframe.name": "accu",
        "rule": "source.user_id = target.user_id AND upper(source.first_name) = upper(target.first_name) AND source.last_name = target.last_name AND source.address = target.address AND source.email = target.email AND source.phone = target.phone AND source.post_code = target.post_code",
        "details": {
          "source": "source",
          "target": "target",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "out": [
          {
            "type": "metric",
            "name": "accu"
          },
          {
            "type": "record"
          }
        ]
      }
    ]
  },
  "sinks": [
    "CONSOLESink",
    "HTTPSink",
    "HDFSSink"
  ]
}

The example above lists the DQ job configuration parameters.

  • name: Name of DQ job.
  • process.type: Process type of DQ job, "BATCH" or "STREAMING".
  • data.sources: List of data sources in this DQ job.
    • name: Name of this data source; it must be unique among the data sources.
    • connector: Data connector for this data source. Details of data connector configuration are in the Data Connector section below.
  • evaluate.rule: Evaluate rule parameters of this DQ job.
    • dsl.type: Default DSL type of all the rules.
    • rules: List of rules, defining every rule step. Details of rule configuration are in the Rule section below.
  • sinks: Whitelisted sinks for this job, referenced by the names defined in the Env Config. Note: no sinks will be used if this list is empty or omitted.
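
To make the link between the two files concrete, the fragment below sketches a job whose sinks list points at the ConsoleSink and HdfsSink entries from the environment example at the top of this guide. It assumes that entries are matched against the name field of the environment sinks, as the Sinks section states; data.sources and evaluate.rule are left out for brevity, so this is not a complete job file.

```json
{
  "name": "accu_batch",
  "process.type": "BATCH",
  "sinks": [
    "ConsoleSink",
    "HdfsSink"
  ]
}
```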

Data Connector

Data connectors connect to the external sources on which DQ checks can be applied.

List of supported data connectors:

  • Hive
  • Kafka (Streaming only)
  • ElasticSearch (Batch only)
  • File based: Parquet, Avro, ORC, CSV, TSV, Text.
  • JDBC based: MySQL, PostgreSQL etc.
  • Custom: Cassandra

Configuration

A sample data connector configuration is as follows:

...

"connector": {
   "type": "file",
   "config": {
     "key1": "value1",
     "key2": "value2"
   }
 }

...
Key Parameters:

| Name | Type | Description | Supported Values |
|------|------|-------------|------------------|
| type | String | Type of the Connector | file, hive, kafka (streaming only), jdbc, custom |
| config | Object | Configuration params of the connector | Depends on connector type (see below) |

For Custom Data Connectors:
  • config object must contain the key class whose value specifies class name for user-provided data connector implementation.
    • For Batch it should implement BatchDataConnector trait.
    • For Streaming it should implement StreamingDataConnector trait.
  • Example:
    "connector": {
       "type": "custom",
       "config": {
         "class": "org.apache.griffin.measure.datasource.connector.batch.CassandraDataConnector",
         ...
       }
     }
    

Note: The user-provided data connector must be present on the Spark job's class path, either by passing the custom jar to spark-submit with the --jars parameter or by setting spark.jars in the spark -> config section of the environment config.

[Deprecated] For ElasticSearch Custom Data Connectors:
  • Currently supports SQL mode (for ElasticSearch with the SQL plugin) and NORMAL mode.
  • For NORMAL mode, config object supports the following keys,

| Name | Type | Description | Default Values |
|------|------|-------------|----------------|
| index | String | ElasticSearch index name | default |
| type | String | ElasticSearch data type | accuracy |
| host | String | ElasticSearch url host | Empty |
| port | String | ElasticSearch url port | Empty |
| fields | List | list of columns | Empty |
| size | Integer | data size (lines) to load | 100 |
| metric.name | String | metric name to load | * |

  • Example:
    "connectors": [
     { 
       "type": "custom",
       "config": {
         "class": "org.apache.griffin.measure.datasource.connector.batch.ElasticSearchGriffinDataConnector",
         "index": "test-index-v1",
         "type": "metric",
         "host": "test.es-xxx.org",
         "port": "80",
         "fields": ["col_a", "col_b", "col_c"],
         "size": 20
       }
     }
    ]
    
  • For SQL mode, config object supports the following keys,

| Name | Type | Description | Default Values |
|------|------|-------------|----------------|
| host | String | ElasticSearch url host | Empty |
| port | String | ElasticSearch url port | Empty |
| sql.mode | Boolean | use sql mode | false |
| sql | String | ElasticSearch SQL | Empty |

  • Example:
    "connectors": [
     { 
       "type": "custom",
       "config": {
         "class": "org.apache.griffin.measure.datasource.connector.batch.ElasticSearchGriffinDataConnector",
         "host": "test.es-xxx.org",
         "port": "80",
         "sql.mode": true,
         "sql": "select col_a, col_b, col_c from test-index-v1 limit 20"
       }
     }
    ]
    
For File based Data Connectors:
  • Currently supports formats like Parquet, ORC, AVRO, Text and Delimited types like CSV, TSV etc.
  • Local files can also be read by prepending the file:// scheme to the path.
  • config object supports the following keys,

| Name | Type | Description | Supported Values | Default Values |
|------|------|-------------|------------------|----------------|
| format | String | type of file source | parquet, avro, orc, csv, tsv, text | parquet |
| paths | List | path(s) to be read | | Empty |
| options | Object | format specific options | | Empty |
| skipOnError | Boolean | whether to continue execution if one or more paths are invalid | true, false | false |
| schema | List | given as list of key value pairs | See example below | null |

  • Example:
    "connector": {
       "type": "file",
       "config": {
         "format": "csv",
         "paths": [
           "/path/to/csv/dir/*",
           "/path/to/dir/test.csv"
         ],
         "options": {
           "header": "true"
         },
         "skipOnError": "false",
         "schema": [
           {
             "name": "user_id",
             "type": "string",
             "nullable": "true"
           },
           {
             "name": "age",
             "type": "int",
             "nullable": "false"
           }
         ]
       }
     }
    
    

Note: Additional examples of schema:

  • "schema":[{"name":"user_id","type":"string","nullable":"true"},{"name":"age","type":"int","nullable":"false"}]
  • "schema":[{"name":"user_id","type":"decimal(5,2)","nullable":"true"}]
  • "schema":[{"name":"my_struct","type":"struct<f1:int,f2:string>","nullable":"true"}]
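
Self-describing formats such as Parquet carry their own schema, so a file connector for them can usually omit schema and options. The sketch below reads a single local file through the file:// prefix mentioned above; the path is a hypothetical placeholder.

```json
{
  "connector": {
    "type": "file",
    "config": {
      "format": "parquet",
      "paths": [
        "file:///tmp/griffin/users.parquet"
      ],
      "skipOnError": false
    }
  }
}
```
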
For ElasticSearch Data Connectors:
  • Elasticsearch Data Connector supports the following configurations.

| Name | Type | Description | Default Values |
|------|------|-------------|----------------|
| paths | List | Elasticsearch indices | (Required) |
| filterExprs | List | List of string expressions that act as where conditions (row filters) | Empty |
| selectionExprs | List | List of string expressions that act as selection conditions (column filters) | Empty |
| options | Object | Additional elasticsearch options. Refer to ConfigurationOptions for options | Empty |

  • Example:
    "connector": {
           "type": "elasticsearch",
           "config": {
             "selectionExprs": [
               "account_number",
               "city",
               "gender",
               "age > 18"
             ],
             "filterExprs": [
               "account_number < 10"
             ],
             "paths": [
               "bank",
               "customer"
             ],
             "options": {
               "es.nodes": "localhost",
               "es.port": 9200
             }
           }
         }
    
For Hive Data Connectors:
  • config object supports the following keys,
    • database: database name, optional, "default" by default.
    • table.name: table name.
    • where: where-condition strings separated by ",", optional. e.g. dt=20170410 AND hour=15, dt=20170411 AND hour=15, dt=20170412 AND hour=15
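
Put together, a Hive connector built from the keys above might look like this sketch. The table name demo_src is a hypothetical placeholder, and the where value reuses the partition conditions from the example above.

```json
{
  "connector": {
    "type": "hive",
    "config": {
      "database": "default",
      "table.name": "demo_src",
      "where": "dt=20170410 AND hour=15, dt=20170411 AND hour=15"
    }
  }
}
```
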
For JDBC based Data Connectors:
  • config object supports the following keys,

| Name | Type | Description | Default Values |
|------|------|-------------|----------------|
| database | String | database name | default |
| tablename | String | table name to be read | Empty |
| url | String | the connection string URL to database | Empty |
| user | String | user for connection to database | Empty |
| password | String | password for connection to database | null |
| driver | String | driver class for JDBC connection to database | com.mysql.jdbc.Driver |
| where | String | condition for reading data from table | Empty |

  • Example:
    "connector": {
       "type": "jdbc",
       "config": {
         "database": "default",
         "tablename": "test",
         "url": "jdbc:mysql://localhost:3306/default",
         "user": "test_u",
         "password": "test_p",
         "driver": "com.mysql.jdbc.Driver",
         "where": ""
       }
     } 
    
    

Note: The jar containing the driver class must be present on the Spark job's class path, either by passing the jar to spark-submit with the --jars parameter or by setting spark.jars in the spark -> config section of the environment config.

Rule

  • dsl.type: Rule DSL type: "spark-sql", "df-ops" or "griffin-dsl".
  • dq.type: DQ type of this rule, only for "griffin-dsl" type. Supported types: "ACCURACY", "PROFILING", "TIMELINESS", "UNIQUENESS", "COMPLETENESS".
  • out.dataframe.name (step information): Output table name of this rule, could be used in the following rules.
  • in.dataframe.name (step information): Input table name of this rule, only used for "df-ops" type.
  • rule: The rule string.
  • details: Details of this rule, optional.
    • accuracy dq type detail configuration
      • source: the name of the data source used as source in accuracy; defaults to the name of the first data source in "data.sources" if not configured.
      • target: the name of the data source used as target in accuracy; defaults to the name of the second data source in "data.sources" if not configured.
      • miss: the miss count name in metric, optional.
      • total: the total count name in metric, optional.
      • matched: the matched count name in metric, optional.
    • profiling dq type detail configuration
      • source: the name of the data source used as source in profiling; defaults to the name of the first data source in "data.sources" if not configured. If the griffin-dsl rule contains a from clause, this parameter is ignored.
    • distinctness dq type detail configuration
      • source: name of data source to measure distinctness.
      • target: name of data source to compare with. It is either the same as source or a superset of it.
      • distinct: the unique count name in metric, optional.
      • total: the total count name in metric, optional.
      • dup: the duplicate count name in metric, optional.
      • accu_dup: the accumulate duplicate count name in metric, optional, only in streaming mode and "with.accumulate" enabled.
      • num: the duplicate number name in metric, optional.
      • duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
      • with.accumulate: optional, default is true, if set as false, in streaming mode, the data set will not compare with old data to check distinctness.
    • uniqueness dq type detail configuration
      • source: name of data source to measure uniqueness.
      • target: name of data source to compare with. It is either the same as source or a superset of it.
      • unique: the unique count name in metric, optional.
      • total: the total count name in metric, optional.
      • dup: the duplicate count name in metric, optional.
      • num: the duplicate number name in metric, optional.
      • duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
    • completeness dq type detail configuration
      • source: name of data source to measure completeness.
      • total: name of data source to compare with. It is always the same as source, or more than source.
      • complete: the column name in metric, optional. The number of not null values.
      • incomplete: the column name in metric, optional. The number of null values.
    • timeliness dq type detail configuration
      • source: name of data source to measure timeliness.
      • latency: the latency column name in metric, optional.
      • total: column name, optional.
      • avg: column name, optional. The average latency.
      • step: column name, optional. The histogram bin, computed as step=floor(latency/step.size).
      • count: column name, optional. The number of latencies that fall into the given step.
      • percentile: column name, optional.
      • threshold: optional, if set as a time string like "1h", the items with latency of more than 1 hour will be recorded.
      • step.size: optional, used to build the histogram of latencies, in milliseconds (e.g. "100").
      • percentile.values: optional, used to compute the percentile metrics, values between 0 and 1. For instance, setting [0.1, 0.9] shows the fastest and slowest latencies.
  • cache: Cache output dataframe. Optional, valid only for "spark-sql" and "df-ops" mode. Defaults to false if not specified.
  • out: List of output sinks for the job.
    • Metric output.
      • type: "metric"
      • name: Metric name, semantics depends on "flatten" field value.
      • flatten: Aggregation method used before sending data frame result into the sink:
        • default: use "array" if data frame returned multiple records, otherwise use "entries"
        • entries: sends the first row of the data frame as metric results, like {"agg_col": "value"}
        • array: wraps all metrics into a map, like {"my_out_name": [{"agg_col": "value"}]}
        • map: wraps first row of data frame into a map, like {"my_out_name": {"agg_col": "value"}}
    • Record output. Currently handled only by HDFS sink.
      • type: "record"
      • name: File name within sink output folder to dump files to.
    • Data source cache update for streaming jobs.
      • type: "dsc-update"
      • name: Data source name to update cache.
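
The fields above can be combined into a small two-step pipeline. The sketch below is illustrative only: it assumes a data source named src with an age column that can be queried by name from "spark-sql" rules, and the dataframe and metric names are made up. The first rule caches an intermediate dataframe, and the second reads it through its out.dataframe.name and emits a metric.

```json
{
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "spark-sql",
        "out.dataframe.name": "adults",
        "rule": "SELECT * FROM src WHERE age >= 18",
        "cache": true
      },
      {
        "dsl.type": "spark-sql",
        "out.dataframe.name": "adult_count",
        "rule": "SELECT count(*) AS cnt FROM adults",
        "out": [
          {
            "type": "metric",
            "name": "adult_count",
            "flatten": "map"
          }
        ]
      }
    ]
  }
}
```

Following the description of flatten above, "map" would wrap the single result row under the metric name, giving a metric shaped like {"adult_count": {"cnt": ...}}.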