Skip to content

Latest commit

 

History

History
126 lines (104 loc) · 6.54 KB

README.md

File metadata and controls

126 lines (104 loc) · 6.54 KB

Working Through an LHCb Open Data Set in Clojure and Spark

This project is an example of using Apache Spark through the Clojure programming language. I created the project a demo of the SparkSQL API and I'm extending it to the structured streaming API. Both of these use the Spark Dataset API. I based the project on a workbook by the LHCb collaboration which you can find here in Python. I was very much inspired by the PySpark walk through here.

I gave two talks which reference this project in 2018. One at Voxx Days Bristol 2018 (Me) and the other at Big Data in Vilnius (Me).

This project is a work in progress and not a neatly packaged runnable. Hopefully I'll delete this line. It's also not a physics nor LHCb analysis tutorial, you should go to the source for that.

Quick Start

If you know Clojure and already have a Clojure development environment set up and use leiningen then

where --workspace is the location you downloaded the data sets to; and --dataset can be simulation, magnet-down or magnet-up. Hopefully it's obvious which data set they'll try and load. The data set names are hard coded and expected to be within the workspace directory. To use the out of the box settings you'll need to be able to give your JVM 3G. This is not necessary, it's what I've hard coded into the project.clj file. You can modify this.

The output of lein run is a file called b-mass-plot.json. This is a Vega-Lite file and you can copy and paste it into here. I'm working on getting a better output from this.

With a REPL open in the lhcb-opendata namespace you can start to explore the code. The PySpark workbook will provide some questions and guidance.

Apache Spark and Clojure

Clojure is a dynamic language that runs on the JVM and it has excellent support for Java interop. Spark is a general purpose distributed data processing engine which is written in Scala but has APIs in Java, Python and R all of which allow you to process your data sets with SQL. The Sparkling is a Clojure library that has done a lot of the hard work of enabling Clojure data structures to be used naturally with the Spark Java API. It has also exposed a lot of the Spark core functionality and some of the Spark SQL API as Clojure functions. This project uses Sparkling as well as some additionally libraries exposing the more of the Dataset and related class APIs. This is done in the ignition namespace. Don't be afraid to check out Flambo another Clojure project exposing the Spark API.

This project uses the Spark Dataset API to reconstruct the B candidate's mass and to plot a historgram of the distribution. It primarily provides examples of how to use the SQL API programmatically and work with structured datasets. This mostly means calling the API directly with fairly straight forward conversion of Clojure to Java data structures. There is a more involved example of using an User Defined Function (UDF) in the Dataset API to reduce a column to bin counts. Using a UDF on a column object is a lot easier than writing a Spark Dataset function to work with map as there you must access the row object yourself and also return a new extended row. I've lost the example code but it involves a lot more boiler plate.

Getting Start with Clojure

Head over to leiningen and download and install the binary. This project has a project.clj file which is understood by lein and will allow it to pull in all the dependencies, build a JAR or uberjar, and most importantly start a Clojure REPL. If you're completely new to Clojure I recommend checking out Clojure for the Brave and True and also installing Clojure support for your favourite editor.

Getting Started With BeakerX

My intention was to give this project as a live demo using a notebook and after some searching found BeakerX which extends Jupyter notebooks for JVM languages. This was a partial success. I never managed to make it into a clean demo but was able to use the notebook to run Clojure, Spark and produce some plots. It's a little scrappy at the moment as I gave up on it to write some slides instead. I'll try to fix it and tidy it up.

The docker-compose.yml file will bring up a BeakerX container, bind some local directories for the data required in the project and also a .m2 directory. This is used as a persistent Maven cache so that the container doesn't have to download all it's dependencies every time you run the notebook. Spark and Hadoop are not lightweight JARs. I have bound the necessary port to access the notebook but also expose the local network to the container. This was part of a failed experiment to allow the notebook to connect to a local Spark instance.

docker-compose up

will get you started. You can install Docker and Docker compose from here and here. You can of course install BeakerX locally but I had some Python 2 vs 3 issues so went down the container route.