ETL Platform for building JSON schema based data and property graphs

bmeg/sifter

Sifter

Sifter is an Extract, Transform, Load (ETL) engine. It can extract from a number of different data resources, including TSV files, SQLDump files and external databases. It includes a pipeline description language for defining a set of transform steps that create object messages, which can then be validated against a JSON schema.

Finally, Sifter has a loader module that takes JSON message streams and loads them into a property graph using rules described by Gen3-based JSON schema files.

ETL Process

  1. Download external artifacts (files, database dumps)
  2. Transform elements into JSON Schema compliant object streams. Each stream is a single file of "\n"-delimited JSON objects. The file name is ..json.gz
  3. Graph Transform
     3.1) Reformat records to fix GIDs and look up unfinished edge IDs
     3.2) Take the 'link' commands from the Gen3-formatted JsonSchema files to generate 'Vertex.json.gz' and 'Edge.json.gz' files
     3.3) Check for vertices that are used on edges but missing from vertex files
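
The stream format from step 2 and the missing-vertex check from step 3.3 can be sketched in a few lines of Python. This is only an illustration, not Sifter's own implementation: the helper names and the "gid", "from" and "to" field names are assumptions made for the example, and only the 'Vertex.json.gz' and 'Edge.json.gz' file names come from the description above.

```python
import gzip
import json
import os
import tempfile


def write_stream(path, records):
    """Write records as newline-delimited JSON into a gzip file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_stream(path):
    """Yield one decoded JSON object per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def missing_vertices(vertex_path, edge_path):
    """Return GIDs referenced by edges but absent from the vertex file.

    The "gid", "from" and "to" field names are assumptions for this sketch.
    """
    known = {vertex["gid"] for vertex in read_stream(vertex_path)}
    missing = set()
    for edge in read_stream(edge_path):
        for end in (edge["from"], edge["to"]):
            if end not in known:
                missing.add(end)
    return missing


def demo():
    """Build two tiny streams in a temp directory and run the check."""
    with tempfile.TemporaryDirectory() as workdir:
        vertex_path = os.path.join(workdir, "Vertex.json.gz")
        edge_path = os.path.join(workdir, "Edge.json.gz")
        write_stream(vertex_path, [{"gid": "a"}, {"gid": "b"}])
        write_stream(edge_path, [{"from": "a", "to": "b"}, {"from": "a", "to": "c"}])
        return missing_vertices(vertex_path, edge_path)
```

In the demo, the second edge points at a vertex "c" that was never written, so `demo()` reports it as missing.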

Example Extract/Transform Playbook

More detailed descriptions can be found in our Playbook manual.

class: sifter
name: census_2010

config:
  census: ../data/census_2010_byzip.json
  date: "2010-01-01"
  schema: ../covid19_datadictionary/gdcdictionary/schemas/

inputs:
  censusData:
    jsonLoad:
      input: "{{config.census}}"

pipelines:
  transform:
    - from: censusData
    - map:
        #fix weird formatting of zip code
        gpython: >
          def f(x):
            # zero-pad the zip code to five digits
            x['zipcode'] = "%05d" % int(x['zipcode'])
            return x
        method: f
    - project:
        mapping:
          submitter_id: "{{row.geo_id}}:{{config.date}}"
          type: census_report
          date: "{{config.date}}"
          summary_location: "{{row.zipcode}}"
    - objectValidate:
        title: census_report
        schema: "{{config.schema}}"

Running Sifter

sifter run examples/genome.yaml

Python Exec

Sifter can run Python code. For this to work, the Python environment needs to have gRPC installed. To install it, run:

pip install grpcio-tools
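
To confirm that the install worked, one option is to check whether the `grpc` module can be found by the interpreter that will run the Python steps. This is a small sketch using only the standard library; the `grpc_available` helper name is made up for the example.

```python
import importlib.util


def grpc_available():
    """Return True if the grpc package can be located by this interpreter."""
    return importlib.util.find_spec("grpc") is not None


if __name__ == "__main__":
    print("grpc importable:", grpc_available())
```

If this reports that the module is not importable, the package was probably installed into a different Python environment.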

Go Tests

Run the Go tests with:

go clean -testcache
go test ./test/... -v