metaphacts ETL pipeline

The Extract-Transform-Load (ETL) pipeline provides a means to convert structured data to RDF, perform post-processing steps, and ingest it into a graph database.

The pipeline follows the principles described in Concepts and is based on an opinionated selection of components and tools.

Features

The ETL pipeline has the following features:

  • read source files from an S3 bucket
  • convert source files to RDF using RML mappings
  • supported source formats are CSV, XML, JSON, and JSONL, optionally in compressed (gzipped) form
  • write the resulting RDF files to an S3 bucket, one RDF file per source file
  • ingest the RDF files into a graph database using the GraphDB Preload tool
  • files added to the source bucket after the initial ingestion are processed as incremental updates
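To illustrate the conversion step, the following is a minimal sketch of an RML mapping that turns rows of a CSV file into RDF. The file name, IRIs, and property names (`persons.csv`, `ex:Person`, `ex:name`) are illustrative assumptions; actual mappings depend on the source data and the target ontology.

```
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/> .

# Map each row of persons.csv to an ex:Person resource
ex:PersonMapping a rr:TriplesMap ;
  rml:logicalSource [
    rml:source "persons.csv" ;
    rml:referenceFormulation ql:CSV
  ] ;
  rr:subjectMap [
    # build the subject IRI from the "id" column
    rr:template "http://example.org/person/{id}" ;
    rr:class ex:Person
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:name ;
    # take the literal value from the "name" column
    rr:objectMap [ rml:reference "name" ]
  ] .
```

A row `id=42,name=Ada` would yield triples such as `<http://example.org/person/42> a ex:Person ; ex:name "Ada" .`, written to one RDF output file per source file.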

Setup and Operation

See ETL Pipeline Setup for how to set up and run the pipeline.

Architecture

The following diagram shows the architecture of the ETL pipeline:

See Architecture for a detailed description.

Copyright

All content in this repository is (c) 2023 by metaphacts.
