Pipeforge

Purpose

Pipeforge uses JDBC metadata to build the tables.yml file used with Pipewrench.

CLI Args

  • configuration: Builds Pipewrench tables.yml and environment.yml files from JDBC metadata. Writes the results to the configured output directory.
    • -e, --environment: Path to source environment.yml file
    • -p, --password: Optional. Source database user password; the user will be prompted for the password if it is not supplied
    • -t, --tables: Optional. Path to an existing tables.yml file. A merge operation will be performed: newly parsed tables are appended to the original table definitions.
  • merge: Executes Pipewrench merge and stores the Pipewrench output in the specified directory.
    • -d, --directory: Pipewrench configuration directory containing both the Pipewrench environment.yml and tables.yml
    • -t, --template: Pipewrench template name
  • rest-api: Starts a process for interacting with Pipeforge via REST and Yaml documents
    • -p, --port: Port to expose the pipewrench endpoints on
  • validate-schema: Validates the source RDBMS schema against an Impala database. The application.conf properties impala.hostname and impala.port are required for this functionality to work.
    • -e, --environment: Path to source environment.yml file
    • -p, --database-password: Optional. Source database user password; the user will be prompted for the password if it is not supplied
    • -h, --hadoop-password: Optional. Hadoop user password; the user will be prompted for the password if it is not supplied
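
For example, hypothetical invocations of the merge and validate-schema subcommands using the flags above (all paths, the template name, and $INSTALL_LOCATION are placeholders; see Distributing application below):

# Run a Pipewrench merge against an existing configuration directory
$INSTALL_LOCATION/bin/pipeforge merge \
  -d <pipewrench configuration directory> \
  -t <pipewrench template name>

# Validate the source schema against Impala; passwords are prompted for when omitted
$INSTALL_LOCATION/bin/pipeforge validate-schema \
  -e <path to environment.yml>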

Configuration Files

Application Configuration application.conf

application.conf Example

To provide a custom application configuration, create an application.conf like the following and pass it with -Dconfig.file=<path to application.conf> at runtime (see Execution below).

pipewrench { 
  virtualInstall = true # Determines whether Pipeforge should clone Pipewrench and set up a Python virtual environment.  Set to false if Pipewrench is already installed.
  git {
    url = "https://github.com/Cargill/pipewrench" # Git location of the pipewrench version to install.
  }
  directory {
    # Directory path containing the installation scripts `requirements.sh` and `generate-scripts.sh`.
    # When deploying Pipeforge via the bundled zip file, this value should be set to `conf/`.
    install = "src/main/resources"
    # Pipewrench installation location.
    pipewrench = "<fully qualified path where Pipewrench should be installed>"
    # Fully qualified path location of Pipewrench templates directory.
    templates = "<pipewrench template directory>"
    # Path where Pipewrench configuration files and output scripts are written
    ingest = "." 
  }
}
impala {
  # Environment-specific impala-shell command.  Note: -f is required because Pipewrench passes SQL files to the impala-shell command
  cmd = "<impala-shell command> -f "
  # OPTIONAL: FQDN of an Impala daemon or a load-balanced URL.  This value is used when validating schemas between a source system and an existing Impala database
  hostname = "<Impala daemon FQDN>"
  # OPTIONAL: Port for JDBC communication with the Impala daemon.  This value is used when validating schemas between a source system and an existing Impala database
  port = <Impala daemon port>
}
# OPTIONAL: Used to query the Hive metastore for column-level comments. If left blank, Impala comments will not be parsed
hive {
  metastore {
    # OPTIONAL: JDBC url for Hive Metastore ex. jdbc:mysql://<host>:<port>/<schema>
    url = "<hive metastore jdbc url>"
    # OPTIONAL: Hive metastore schema
    schema = "<hive metastore schema>"
    # OPTIONAL: Hive metastore read only user
    username = "<hive metastore username>"
    # OPTIONAL: Hive metastore read only user password
    password = "<hive metastore password>"
    # OPTIONAL: MySQL is the only supported option at this time
    databaseType = "mysql"
  }
}
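
A custom application.conf can be supplied at runtime with -Dconfig.file. A minimal sketch, assuming the bundled zip layout described under Distributing application (the conf/ path is a placeholder for wherever the file actually lives):

$INSTALL_LOCATION/bin/pipeforge configuration \
  -e <path to environment.yml> \
  -Dconfig.file=conf/application.conf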

Source Environment environment.yml

Yaml Example

name: dev.employee # Unique name for data ingestion
group: edh_dev_employee # Associated AD group for ingestion
databaseType: mysql # Database type; must be one of mysql, oracle, mssql, hana, teradata, as400, redshift, or impala
schema: employees # RDBMS or Impala database / schema
jdbcUrl: "jdbc:mysql://localhost:3306/employees" # JDBC url for connecting to the database. Impala example: jdbc:hive2://<host>:21050/<database>;AuthMech=3;ssl=true
username: employee # RDBMS username, or AD user / service account user name (Impala)
objectType: table # Database object type to be ingested must be either table or view
# tables: # Optionally add a whitelist of tables to ingest
#   - employee
metadata: # Metadata map to be applied to every table's tblproperties. https://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_table.html
  SOURCE: employee database # Source database identifier
  LOAD_FREQUENCY: Daily # Frequency of data ingestion
  CONTACT: [email protected] # Distribution list for data owners
hadoopUser: ps_dev_employee # Hadoop user; using a process account is recommended
passwordFile: hdfs:///user/developer/.employee_db_password # Location of Sqoop's password file; an HDFS location is recommended
stagingDatabase:
  name: employee
  path: hdfs:///ns/user/developer/staging/db
rawDatabase:
  name: employee_raw
  path: hdfs:///ns/user/developer/raw/db
# checkColumn: last_updated # Optional check column parameter. This value will be added to all pipewrench table configs for incremental ingest.
# userDefined: # Optional user-defined parameters.  These key/value pairs will be added to the Pipewrench environment and configuration files for use in templates
#  table_suffix: _ingest
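
The passwordFile value is read by Sqoop at ingest time. A minimal sketch of creating it on HDFS, assuming the hdfs CLI is available and using the path from the example above:

# Write the password (no trailing newline) to the password file and restrict access
echo -n "<source database password>" | hdfs dfs -put - /user/developer/.employee_db_password
hdfs dfs -chmod 400 /user/developer/.employee_db_password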

Distributing application

sbt clean universal:packageBin

This creates a bundled zip application with the required dependencies and executable scripts (via sbt-native-packager). The package is written to target/universal/pipeforge-<version>.zip.

  1. Copy the zip file to $INSTALL_LOCATION
  2. Change directory to $INSTALL_LOCATION
  3. Execute unzip pipeforge-<version>.zip
  4. Change directory into pipeforge-<version>
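
The same steps as shell commands (the version and $INSTALL_LOCATION are placeholders):

cp pipeforge-<version>.zip $INSTALL_LOCATION
cd $INSTALL_LOCATION
unzip pipeforge-<version>.zip
cd pipeforge-<version>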

Execution

Pipewrench Configuration Builder

$INSTALL_LOCATION/bin/pipeforge configuration \
  -e <path to pipeforge environment file>

Pipeforge REST API

$INSTALL_LOCATION/bin/pipeforge rest-api \
  -p <api port> -Dconfig.file=<path to application.conf>
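
Once the REST API is running, environment Yaml documents can be posted to it. A hypothetical request is sketched below; the endpoint path is illustrative only, and the actual routes should be confirmed in the source:

# Hypothetical endpoint path -- confirm the real route before use
curl -X POST -H "Content-Type: application/yaml" \
  --data-binary @environment.yml \
  http://localhost:<api port>/<endpoint>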

Documentation

Testing

Unit tests

sbt clean test
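
To run a single suite, sbt's testOnly task can be used (the spec name below is a placeholder):

sbt "testOnly *<SpecName>"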

Integration Tests

sbt clean it:test

NOTE: Docker must be installed on the system running integration tests. Docker is used to spin up test databases during testing.
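
A minimal sketch of verifying that the Docker daemon is reachable before running the integration suite (docker info exits non-zero when it is not):

docker info > /dev/null 2>&1 || { echo "Docker daemon is not reachable"; exit 1; }
sbt clean it:test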