The Extract-Transform-Load (ETL) pipeline provides a means to convert structured data to RDF, perform post-processing steps, and ingest the result into a graph database.
The pipeline follows the principles described in Concepts and is based on an opinionated selection of components and tools:
- Amazon Web Services (AWS) as the cloud environment
- AWS services such as S3, CloudFormation, Step Functions, Lambda, and EC2 for the individual pipeline stages
- RDF Mapping Language (RML) as the declarative mapping language, with Carml as the mapping engine (see the mapping sketch after this list)
- Ontotext GraphDB as the RDF database
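To make the declarative approach concrete, the following minimal sketch shows what an RML mapping could look like for a hypothetical CSV source with `id` and `name` columns. The `ex:` namespace, the file name `persons.csv`, and the column names are purely illustrative and not part of the actual pipeline; in practice such mappings are authored as Turtle files and executed by Carml. The mapping is embedded in a Python constant here only for presentation:

```python
# Minimal, illustrative RML mapping (Turtle) for a hypothetical CSV
# source with columns "id" and "name". The ex: namespace, persons.csv,
# and the column names are examples, not the pipeline's actual mappings.
RML_MAPPING = """
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/> .

ex:PersonMapping a rr:TriplesMap ;
  rml:logicalSource [
    rml:source "persons.csv" ;
    rml:referenceFormulation ql:CSV
  ] ;
  rr:subjectMap [
    rr:template "http://example.org/person/{id}"
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:name ;
    rr:objectMap [ rml:reference "name" ]
  ] .
"""
```

Carml evaluates the `rml:logicalSource` against each source file and materializes the subject and predicate-object maps as RDF triples.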
The ETL pipeline has the following features:
- read source files from an S3 bucket
- convert source files to RDF using RML mappings
  - supported source formats are CSV, XML, JSON, and JSONL, also in compressed (gzipped) form
- write the resulting RDF files to an S3 bucket, one RDF file per source file
- ingest the RDF files into a GraphDB repository using the GraphDB Preload tool
- process files added to the source bucket after the initial ingestion as incremental updates (see the sketch after this list)
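The following Python sketch illustrates the per-file conversion and the incremental-update behaviour described above. The bucket names, the output key layout, and the `convert_with_carml` placeholder are assumptions made for illustration; the actual pipeline implements this flow with Step Functions, Lambda, and EC2:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

SOURCE_BUCKET = "etl-source-bucket"  # hypothetical bucket name
TARGET_BUCKET = "etl-rdf-bucket"     # hypothetical bucket name


def rdf_key(source_key: str) -> str:
    """Derive the output key: one RDF file per source file."""
    return source_key.rsplit(".", 1)[0] + ".ttl"


def already_converted(target_key: str) -> bool:
    """Treat an existing RDF object as proof the source was processed."""
    try:
        s3.head_object(Bucket=TARGET_BUCKET, Key=target_key)
        return True
    except ClientError:
        return False


def convert_with_carml(source_key: str, target_key: str) -> None:
    """Placeholder for the RML transformation performed by Carml."""
    raise NotImplementedError  # the real pipeline runs the Carml engine here


def run_conversion() -> None:
    """Convert every source file that has no RDF counterpart yet.

    Files uploaded after the initial ingestion simply lack an output
    object, so the next run picks them up as incremental updates.
    """
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            target = rdf_key(obj["Key"])
            if not already_converted(target):
                convert_with_carml(obj["Key"], target)
```

Keying outputs one-to-one to source files keeps the conversion idempotent: re-running the loop only processes files that do not yet have an RDF counterpart, while the GraphDB Preload tool handles the subsequent bulk load.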
See ETL Pipeline Setup for how to set up and run the pipeline.
The following diagram shows the architecture of the ETL pipeline:
See Architecture for a detailed description.
All content in this repository is (c) 2023 by metaphacts.