SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark
Overview: SPARQL is the W3C standard query language for querying data expressed in RDF (Resource Description Framework). The increasing amount of available RDF data raises a major need and research interest in building efficient and scalable distributed SPARQL query evaluators.
In this context, we propose and share SPARQLGX: our implementation of a distributed RDF datastore based on Apache Spark. SPARQLGX is designed to leverage existing Hadoop infrastructures for evaluating SPARQL queries. SPARQLGX relies on a translation of SPARQL queries into executable Spark code that adopts evaluation strategies according to (1) the storage method used and (2) statistics on data. Using a simple design, SPARQLGX already represents an interesting alternative in several scenarios.
Version: 1.1 (A change log is available in CHANGES.)
- Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda. SPARQLGX: Efficient Distributed Evaluation of SPARQL with Apache Spark. The 15th International Semantic Web Conference, Oct 2016, Kobe, Japan. link
- Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda. SPARQLGX in action: Efficient Distributed Evaluation of SPARQL with Apache Spark. The 15th International Semantic Web Conference, Oct 2016, Kobe, Japan. link
- Damien Graux, Louis Jachiet, Pierre Genevès, Nabil Layaïda. SPARQLGX : Une Solution Distribuée pour RDF Traduisant SPARQL vers Spark. 32ème Conférence sur la Gestion de Données - Principes, Technologies et Applications, Nov 2016, Poitiers, France. link
- Apache Hadoop (+HDFS) version 2.6.0-cdh5.7.0
- Apache Spark version 1.6.0
- OCaml version ≥ 4.0
- Menhir compatible with the OCaml version
- Ocamlfind compatible with the OCaml version
- Yojson compatible with the OCaml version
Note: Menhir, Yojson and Ocamlfind can all be installed through opam (the OCaml package manager).
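For instance, assuming opam is already installed and initialized on your machine, the three OCaml dependencies can typically be obtained with:
opam install menhir yojson ocamlfind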
In this package, we provide sources to load and query RDF datasets with SPARQLGX and SDE (a SPARQL direct evaluator). We also present a test suite where two popular RDF/SPARQL benchmarks can be run: LUBM and WatDiv. For space reasons, these two datasets only contain a few hundred thousand RDF triples, but deterministic generators are available on the benchmarks' webpages.
We provide a Dockerfile to compile and test SPARQLGX in a Docker container.
It can be built and then run with the following commands:
docker build -t sparqlgx .
docker run -it sparqlgx
The image contains an installation of Hadoop and Spark, according to the versions specified in conf/. They are respectively stored in /opt/hadoop and /opt/spark. SPARQLGX is installed in /opt/sparqlgx, which is the home directory of the user of the same name. All the required tools are installed, so that SPARQLGX can be rebuilt using bash compile.sh as described in the next section.
By default, spark-submit will run the computations locally (local[*]). This can be changed by mounting a configuration file as /opt/spark/conf/spark-defaults.conf. The configuration of SPARQLGX can be changed by mounting a file as /opt/sparqlgx/conf/sparqlgx.conf.
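For example, both configuration files can be mounted from the host when starting the container; the local file names used here are only placeholders:
docker run -it \
  -v $(pwd)/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf \
  -v $(pwd)/sparqlgx.conf:/opt/sparqlgx/conf/sparqlgx.conf \
  sparqlgx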
First, clone this repository. Second, check that all the needed commands are available on your (main) machine and that your HDFS is correctly configured. Third, compile the whole project. Fourth, you can modify the parameters listed in the configuration file (in conf/) according to your own cluster.
git clone https://github.com/tyrex-team/sparqlgx.git ;
cd sparqlgx/ ;
bash dependencies.sh ;
bash compile.sh ;
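Assuming the hadoop and spark-submit commands are already on your PATH, a quick sanity check of the environment could be:
hadoop version ;          # the Hadoop client is available
spark-submit --version ;  # Spark is available
hadoop fs -ls / ;         # HDFS is reachable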
SPARQLGX can only load RDF data written in the N-Triples format; however, as many datasets come in other standards (e.g. RDF/XML), we also provide a .jar file (rdf2rdf in bin/), from an external developer, able to translate RDF data from one standard to another.
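As a reminder, each line of an N-Triples file contains a single triple (subject, predicate, object) terminated by a dot. A small example file could be created as follows; the URIs are only placeholders:
cat > local_file.nt <<'EOF'
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
EOF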
Before loading an RDF triple file, you have to copy it onto HDFS. Then, the complete preprocessing routine can be run with the load parameter; it will partition the HDFS triple file and compute statistics on the data. These two distinct steps can also be executed separately with light-load and generate-stat respectively.
hadoop fs -copyFromLocal local_file.nt hdfs_file.nt ;
bash bin/sparqlgx.sh light-load dbName hdfs_file.nt ;
bash bin/sparqlgx.sh generate-stat dbName hdfs_file.nt ;
bash bin/sparqlgx.sh load dbName hdfs_file.nt ;
bash bin/sparqlgx.sh remove dbName ;
To execute a SPARQL query over a loaded RDF dataset, users can use query, which translates the SPARQL query into Scala code, compiles it and runs it with Apache Spark. Moreover, users can call SDE (the SPARQLGX Direct Evaluator) with direct-query, which directly evaluates SPARQL queries on RDF datasets saved on HDFS. Finally, three levels of optimization are available:
- No Optimization: If --no-optim is specified on the command line, SPARQLGX or SDE will execute the given query following exactly the order of the clauses in the WHERE{...}.
- Avoid Cartesian Products [Default]: If no optimization option is given to either SPARQLGX or SDE, the translation engine will try to avoid (if possible) Cartesian products in its translation output.
- Query Planning with Statistics: The --stat option is only available with SPARQLGX since SDE directly evaluates queries without a preprocessing phase; in addition, statistics must already have been generated, either with generate-stat or with load. This option reorders the SPARQL query clauses (the triple patterns in the WHERE{...}) according to the data distribution, and also tries to avoid Cartesian products in the resulting statistics-based order.
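As an illustration, a minimal local_query.rq could be written as follows; the two triple patterns share the variable ?s, so the default optimization produces a join rather than a Cartesian product (the FOAF predicates are only an example vocabulary):
cat > local_query.rq <<'EOF'
SELECT ?s ?name WHERE {
  ?s <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
  ?s <http://xmlns.com/foaf/0.1/name> ?name .
}
EOF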
bash bin/sparqlgx.sh query dbName local_query.rq ;
bash bin/sparqlgx.sh direct-query local_query.rq hdfs_file.nt ;
It is also possible to translate only the SPARQL query (without executing the output) into Scala code. This routine returns the original query, the query that is actually translated after the potential optimizations (--no-optim, --stat, or nothing) and the obtained Scala code.
bash bin/sparqlgx.sh translate dbName local_query.rq ;
We also provide a basic test suite using two popular benchmarks (LUBM and WatDiv). To that purpose, we pre-generated two small RDF datasets and give the various queries required for these benchmarks. The test suite is divided into three parts: preliminaries.sh sets up files and directories on the HDFS, run-benchmarks.sh executes everything (preprocessing, querying with SDE or SPARQLGX and with various optimization options), and clean-all.sh puts everything back in place.
cd tests/ ;
bash preliminaries.sh ;
bash run-benchmarks.sh ; # This step can take a while!
bash clean-all.sh
This project is under the CeCILL license.
Damien Graux
Louis Jachiet
Pierre Genevès
Nabil Layaïda
Tyrex Team, Inria (France)
Research Project and funding body: the ANR CLEAR Project