Merge pull request #147 from luigi-asprino/master

Upload FBDA query generator and executor
oeg-upm · Jul 22, 2024 · f368e31
2 parents 3b6c8db + b451bc1
Showing 42 changed files with 1,631 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -200,6 +200,8 @@ Additionally to the generator engine, that provides the data at desirable scales
 
 Our experiences testing (virtual) knowledge graph engines have revealed the difficulties for setting up an infrastructure where many variables and resources are involved: databases, raw data, mappings, queries, data paths, mapping paths, databases connections, etc. For that reason, and in order to facilitate the use of the benchmark to any developer or practitioner, we provide a set of [utils](https://github.com/oeg-upm/gtfs-bench/tree/master/utils) such as docker-compose templates or evaluation bash scripts that, in our opinion, can reduce the time for preparing the testing set up.
 
+Moreover, the utils folder contains a set of scripts for evaluating Façade-based data access engines (e.g. [SPARQL Anything](https://github.com/SPARQL-Anything/sparql.anything)); see the [FBDA benchmark README](utils/fbda-bench/README.md) for more details.
+
 ## Desirable Metrics:
 
 We highly recommend that (virtualizers or materializers) KG construction engines tested with this benchmark provide (at least) the following metrics:

diff --git a/utils/fbda-bench/README.md b/utils/fbda-bench/README.md
@@ -0,0 +1,69 @@
+# Façade-based Data Access Benchmark
+
+This folder provides a benchmark derived from GTFS-Madrid-Bench for evaluating Façade-based Data Access (FBDA) engines, such as [SPARQL Anything](https://github.com/SPARQL-Anything/sparql.anything).
+
+The extension consists of:
+- a *set of query templates* that translate the GTFS-Madrid-Bench's queries and RML mappings into FBDA queries;
+- a *query executor* which fires the queries and measures the performance of the FBDA engines under four experimental regimes:
+    - In-memory execution over a complete materialised view (in-memory+complete);
+    - In-memory execution optimised by a triple-filtering approach (in-memory+triple-filtering);
+    - In-memory execution over a sliced materialised view and optimised by triple-filtering (sliced+triple-filtering);
+    - On-disk execution optimised by triple-filtering (on-disk+triple-filtering).
+
+More details can be found in this [article](https://www.semantic-web-journal.net/content/materialisation-approaches-fa%C3%A7ade-based-data-access-sparql).
+
+
+## Requirements
+
+Java 11 (or a later version) installed locally.
+
+## Using the FBDA Benchmark
+
+1. Generate data using GTFS-Madrid-Bench and move the result folder it produces into the `experiments` folder. At the moment, only the csv, json and xml formats are supported.
+
+2. Generate the FBDA queries for the scales passed to GTFS-Madrid-Bench (e.g. 1, 10, 100):
+
+```
+./generate_queries.sh "1 10 100" "TMP_FOLDER" "xml csv json"
+```
+
+where:
+- `TMP_FOLDER` is the path to a temporary folder that will be used during the experiments
+- "xml csv json" are the formats passed to GTFS-Madrid-Bench
+
+3. Download the executable jar file of the FBDA engine to evaluate (e.g. [SPARQL Anything v0.9.0](https://github.com/SPARQL-Anything/sparql.anything/releases/download/0.9.0/sparql-anything-0.9.0.jar))
+
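+4. Run the queries (all 18 by default; an optional sixth argument restricts the run to the given query numbers, e.g. `"1 5 7"`):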
+
+```
+./execute_queries.sh /path/to/fbda_engine.jar "1 10 100" "xml csv json" "/path/to/results" "TMP_FOLDER"
+```
+
+where:
+- "1 10 100" are the scales passed to GTFS-Madrid-Bench
+- "xml csv json" are the formats passed to GTFS-Madrid-Bench
+- "/path/to/results" is the path, resolved relative to the current working directory, of the folder where the results of the query executions (i.e. the measurements) will be stored
+- `TMP_FOLDER` is the path to a temporary folder that will be used during the experiments
+
+
+## Analysing the results
+
+The execution of the queries generates two TSV files for each query executed on a given format, namely `time_q<query_id>_<format>.tsv` and `mem_q<query_id>_<format>.tsv`.
+These files trace the execution of the queries in terms of the computational resources used by the engine (i.e. memory footprint, CPU and time).
+
+The files are stored in the directory `/path/to/results` passed as an argument to `execute_queries.sh`.
+
+The `time_q<query_id>_<format>.tsv` file records the execution time of the queries for a given format. The table has the following structure:
+
+| Query | InputSize | Strategy | Slice | Ondisk | MemoryLimit | Run | Time | Unit | Status | STDErr |
+|-------|-----------|----------|-------|--------|-------------|-----|------|------|--------|--------|
+|       |           |          |       |        |             |     |      |      |        |        |
+
+The `mem_q<query_id>_<format>.tsv` file records the engine's CPU and memory usage during the evaluation of the queries. The table has the following structure:
+
+| Query | InputSize | Strategy | Slice | Ondisk | MemoryLimit | Run | PID | %cpu | %mem | vsz | rss |
+|-------|-----------|----------|-------|--------|-------------|-----|-----|------|------|-----|-----|
+|       |           |          |       |        |             |     |     |      |      |     |     |
+
+
+
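The two result tables described in the README above share a `Strategy` column, so a per-strategy average is a natural first look at the data. The following is a minimal sketch only, not part of this PR: `summarise_column` is a hypothetical helper, and it assumes that each TSV file is tab-separated and starts with a header row using the column names listed in the README (for example `Time` or `rss`), and that the chosen column holds plain numbers.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): mean of one numeric column per Strategy.
# Assumes a tab-separated file whose first row is the header described in the README.
#
# usage: summarise_column FILE COLUMN
# prints: <Strategy> <TAB> <number of rows> <TAB> <mean of COLUMN>

summarise_column() {
  awk -F'\t' -v target="$2" '
    NR == 1 {
      # Locate columns by header name rather than by position.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("Strategy" in col) || !(target in col)) {
        print "missing column: Strategy or " target > "/dev/stderr"
        exit 1
      }
      next
    }
    {
      s = $(col["Strategy"])
      sum[s] += $(col[target])
      n[s]++
    }
    END {
      for (s in sum) printf "%s\t%d\t%.2f\n", s, n[s], sum[s] / n[s]
    }
  ' "$1" | sort
}

# Run directly when a file and a column name are given on the command line.
if [ "$#" -ge 2 ]; then
  summarise_column "$1" "$2"
fi
```

For instance, `summarise_column /path/to/results/time_q1_csv.tsv Time` would report the mean time per strategy for query 1 on csv, and the same call with a `mem_q1_csv.tsv` file and `rss` would do so for resident memory. The sketch averages all rows, so it does not distinguish failed runs; filtering on `Status` would require knowing which values the executor writes there.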
diff --git a/utils/fbda-bench/execute_queries.sh b/utils/fbda-bench/execute_queries.sh
@@ -0,0 +1,64 @@
+#!/bin/bash
+#
+# Copyright (c) 2024 SPARQL Anything Contributors @ http://github.com/sparql-anything
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+SPARQL_ANYTHING_JAR="$1"
+RESULTS_DIR="$(pwd)/$4"
+TMP_FOLDER="$5"
+
+if [ ! -d "$RESULTS_DIR" ]; then
+  mkdir -p "$RESULTS_DIR"
+else
+  echo "$RESULTS_DIR already exists!"
+fi
+
+if [ ! -d "$TMP_FOLDER" ]; then
+  mkdir -p "$TMP_FOLDER"
+else
+  echo "$TMP_FOLDER already exists! Cleaning it.."
+  rm -rf "${TMP_FOLDER:?}"/*  # :? aborts if TMP_FOLDER is unset or empty, instead of expanding to /*
+fi
+
+source functions.sh
+
+if [ -n "$6" ]; then
+  QUERIES_TO_EXECUTE=$6
+else
+  QUERIES_TO_EXECUTE="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
+fi
+
+
+for format in $3
+do
+  for size in $2
+  do
+    for query in $QUERIES_TO_EXECUTE
+    do
+
+      #echo "Monitoring q$query strategy0 no_slice size $size $format"
+      #monitor-query $size "q$query" "strategy0" "no_slice" $format
+      #echo "Monitoring q$query strategy1 no_slice size $size $format"
+      #monitor-query $size "q$query" "strategy1" "no_slice" $format
+      #echo "Monitoring q$query strategy1 slice size $size $format"
+      #monitor-query $size "q$query" "strategy1" "slice" $format
+
+      # ON_DISK
+      echo "Monitoring q$query strategy1 no_slice size $size $format ondisk"
+      monitor-query "$size" "q$query" "strategy1" "no_slice" "$format" "$TMP_FOLDER"
+
+    done
+  done
+done
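
The script above reads its positional arguments without checking them, so a missing argument only surfaces once the loops are running. One possible guard is sketched below under stated assumptions: `validate_args` is a hypothetical function that is not part of this PR, and the usage string simply restates the argument order documented in the README.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): check the positional arguments
# of execute_queries.sh before any directory is created or cleaned.

validate_args() {
  # Five arguments are mandatory; the sixth (queries to execute) is optional.
  if [ "$#" -lt 5 ]; then
    echo "usage: execute_queries.sh <engine.jar> <sizes> <formats> <results_dir> <tmp_folder> [queries]" >&2
    return 1
  fi
  # The engine jar must be an existing regular file.
  if [ ! -f "$1" ]; then
    echo "engine jar not found: $1" >&2
    return 1
  fi
  # An empty temporary folder name would make the later clean-up step unsafe.
  if [ -z "$5" ]; then
    echo "tmp folder must not be empty" >&2
    return 1
  fi
  return 0
}
```

A call such as `validate_args "$@" || exit 1` placed before the variable assignments would stop the script early with a usage message.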