
Data Pipeline

Emil Balitzki edited this page Jul 14, 2024 · 16 revisions

Information and usage

The data pipeline is a CLI application used to insert a dataset into the database. Currently, it supports the following file types: CSV, SHAPE, and CITYGML. Each data source is described in a separate .yaml file, but multiple files can point to the same dataset ID.

  • CLI execution:
DataPipeline datasource.yaml
  • Options:

-b | --behaviour

Behaviour when inserting into the database:

  • replace: drop the existing table before inserting.
  • skip: do not insert when the table already exists.
  • ignore: always try to insert, regardless of whether the table already exists.
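The decision logic implied by these options can be sketched as follows (purely illustrative Python; the pipeline's actual implementation is not shown here):

```python
def resolve_insert_action(behaviour: str, table_exists: bool) -> str:
    """Illustrative mapping of the -b/--behaviour options to an action.

    Sketch of the documented semantics only, not the pipeline's code.
    """
    if not table_exists:
        return "insert"  # a missing table is always created and filled
    actions = {
        "replace": "drop_then_insert",  # drop the existing table first
        "skip": "skip",                 # leave the existing table untouched
        "ignore": "insert",             # insert regardless
    }
    return actions[behaviour]
```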

Example .yaml file:

# describe the source
source:
  # link | filepath
  type: URL
  # filepath or URL pointing to the file.
  location: https://data.bundesnetzagentur.de/Bundesnetzagentur/SharedDocs/Downloads/DE/Sachgebiete/Energie/Unternehmen_Institutionen/E_Mobilitaet/Ladesaeulenregister.csv
  # the format of the data. Options: CSV, SHAPE
  # SHAPE expects .zip file.
  data_format: CSV
options:
  # skip lines at the beginning
  skip_lines: 0
  # discard any rows that have null values
  discard_null_rows: false
  # how to deal with existing table. Options: ignore, replace, skip (default).
  if_table_exists: replace
# the name of the table in the database
table_name: EV_charging_stations
# the delimiter used for the CSV data format.
delimiter: ";"
table_cols:
  - name: Betreiber
    name_in_table: operator
    # if this is true, will discard any row that has this column as null. defaults to false.
    is_not_nullable: true
  - name: Bundesland
    name_in_table: state
    # the sql type of this column. defaults to VARCHAR(500)
    type: INT
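To illustrate how the table_cols section feeds into the table definition, here is a hypothetical sketch (build_create_table is an invented helper, not the pipeline's actual code) that applies the documented VARCHAR(500) default:

```python
def build_create_table(table_name: str, table_cols: list) -> str:
    """Build a CREATE TABLE statement from a table_cols section.

    Hypothetical sketch: columns default to VARCHAR(500) unless a
    'type' is given, mirroring the defaults documented above.
    """
    cols = []
    for col in table_cols:
        sql_type = col.get("type", "VARCHAR(500)")
        cols.append(f"[{col['name_in_table']}] {sql_type}")
    return f"CREATE TABLE [{table_name}] ({', '.join(cols)})"

# Columns from the example .yaml file above:
stmt = build_create_table(
    "EV_charging_stations",
    [
        {"name": "Betreiber", "name_in_table": "operator", "is_not_nullable": True},
        {"name": "Bundesland", "name_in_table": "state", "type": "INT"},
    ],
)
```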

Setting up Visual Studio to take a path as an argument in debug mode

To make the import of a .yaml file work when starting the application from inside Visual Studio, you need to configure it in the project properties.

  1. Right-click on the solution -> Properties.
  2. Go to Debug.
  3. Click "Open debug launch profiles UI".
  4. Set the arguments to test.yaml (or another .yaml file) and the working directory to the /yaml directory inside the project's directory.

Installing and setting up SQL Server locally

  • Download SQL Server Developer edition from here
  • Then install SQL Server
  • After that, download and install SQL Server Management Studio (SSMS) from here
  • Now you will be able to use SQL Server from SSMS.
  • In SSMS, right-click on the database server and go to Properties -> Security. Under Server authentication, select "SQL Server and Windows Authentication mode".
  • Then restart the SQL Server.
  • First, create the database using the command below:
USE [master]
GO

IF EXISTS (SELECT 1 FROM sys.databases WHERE [name] = 'BIEDB')
    DROP DATABASE [BIEDB]
GO

CREATE DATABASE [BIEDB]
GO

USE [BIEDB]
GO
  • Now create a user by following the steps below:

    1. In SSMS, navigate to Security -> Logins, right-click, and select "New Login...".
    2. Enter the login name.
    3. Select SQL Server authentication and enter a password.
    4. In User Mapping, select the database you created and check db_owner.
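The same login and user can also be created in T-SQL instead of through the SSMS dialogs. A sketch, where the login name and password are placeholders:

USE [master]
GO
CREATE LOGIN [bie_user] WITH PASSWORD = 'MyPass@1234';
GO
USE [BIEDB]
GO
CREATE USER [bie_user] FOR LOGIN [bie_user];
ALTER ROLE [db_owner] ADD MEMBER [bie_user];
GO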
  • After the user is created, the necessary permissions need to be granted to it.

    1. Go into the database properties.
    2. Then into Permissions.
    3. Grant the relevant permissions, such as CREATE TABLE and INSERT.
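Equivalently, the permissions from step 3 can be granted in T-SQL (the user name is a placeholder):

USE [BIEDB]
GO
GRANT CREATE TABLE TO [bie_user];
GRANT INSERT TO [bie_user];
GO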

Establishing a DB Connection

To establish a connection with SQL Server, you need to pass the following environment variables:

 - DB_NAME=BIEDB             // Specifies the name of the database.
 - DB_PASSWORD=MyPass@1234   // Specifies the password for the SQL Server user profile.
 - DB_SERVER=db              // Specifies the hostname or IP address of the SQL Server instance.
 - DB_USERNAME=sa            // Specifies the SQL Server user profile name.
 - DB_TYPE=SQL               // Specifies the type of database server being used.
 - TRUSTED=False             // Setting this to False indicates that Windows Authentication is not used; SQL Server login credentials are provided instead.
  • For local use, you can pass environment variables by following these steps:

    1. Right-click on the solution -> Properties.
    2. Go to Debug.
    3. Click "Open debug launch profiles UI".
    4. Then set the environment variables.
  • On the server, you can pass these parameters to the DataPipeline service as environment variables in the docker-compose file.
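For example, a docker-compose service entry could pass them like this (the service and image names are placeholders; the values mirror the list above):

services:
  datapipeline:
    image: datapipeline:latest   # placeholder image name
    environment:
      - DB_NAME=BIEDB
      - DB_PASSWORD=MyPass@1234
      - DB_SERVER=db
      - DB_USERNAME=sa
      - DB_TYPE=SQL
      - TRUSTED=False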

CSV-Importer documentation

The CSV importer is used to read the data from a given CSV file and prepare it for storage in the database. The path to the CSV file and other arguments are given via the .yaml file.

The CSV importer can work with local CSV files (given as a file path) or remote CSV files (using URL links).

The filtering process inside the CSV importer includes:

  • Removing empty rows
  • Removing columns not mentioned in the .yaml file
  • Removing rows with empty fields in columns that are marked as not nullable in the .yaml file.
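A minimal sketch of that filtering in Python (illustrative only; the actual importer is part of the C# pipeline and is not shown here):

```python
import csv
import io

def filter_rows(csv_text, delimiter, kept_columns, not_nullable):
    """Filter CSV rows as described above (illustrative sketch).

    - drops completely empty rows
    - keeps only the columns listed in kept_columns
    - drops rows where a not-nullable column is empty
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = []
    for row in reader:
        if all(not v for v in row.values()):
            continue  # skip completely empty rows
        if any(not row.get(col) for col in not_nullable):
            continue  # skip rows missing a required value
        rows.append({col: row.get(col, "") for col in kept_columns})
    return rows

# Hypothetical sample using the column names from the example .yaml file:
data = "Betreiber;Bundesland;Extra\nE.ON;Bayern;x\n;Hessen;y\n;;\n"
filtered = filter_rows(data, ";", ["Betreiber", "Bundesland"], ["Betreiber"])
```

Here the second data row is dropped because its not-nullable Betreiber field is empty, the third because it is entirely empty, and the Extra column is discarded because it is not listed.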