gitbase-spark-connector is a Scala library that exposes gitbase tables as Spark SQL DataFrames, so you can run scalable analysis and processing pipelines on source code.
- Scala 2.11.12
- Apache Spark 2.3.2 Installation
- gitbase >= v0.18.x
- bblfsh >= 2.10.x
Maven:
<dependency>
<groupId>tech.sourced</groupId>
<artifactId>gitbase-spark-connector</artifactId>
<version>[version]</version>
<type>slim</type>
</dependency>
SBT:
libraryDependencies += "tech.sourced" % "gitbase-spark-connector" % "[version]" classifier "slim"
Note the slim type or classifier. It makes it possible to use this data source as a library. If you don't add it, the retrieved jar will include all the needed dependencies (a fat jar or uber jar), which might cause dependency conflicts in your application.
You can check the available versions here.
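As a rough sketch of how the slim artifact might sit in a build, the build.sbt fragment below pairs it with the Spark and Scala versions listed in the requirements. Marking Spark as Provided is an assumption (it fits jobs launched with spark-submit; drop the qualifier to run directly from sbt), and [version] is left as a placeholder exactly as above.

```scala
// build.sbt (illustrative fragment, not the project's own build file)
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Assumption: Spark is supplied by the runtime (spark-submit), hence Provided.
  "org.apache.spark" %% "spark-sql" % "2.3.2" % Provided,
  // Slim artifact: the connector without its dependencies bundled in.
  "tech.sourced" % "gitbase-spark-connector" % "[version]" classifier "slim"
)
```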
First of all, you'll need a running gitbase instance, which exposes your repositories through a SQL interface. Gitbase depends on bblfsh (Babelfish) to extract UASTs (universal abstract syntax trees) from source code. For instance, if you plan to filter queries by language or run other operations on UASTs, a Babelfish server is required.
The most convenient way is to run all services with docker-compose. This Compose file (docker-compose.yml) defines three services (bblfshd, gitbase and gitbase-spark-connector).
You can run any combination of them, for example only bblfshd and gitbase:
$ docker-compose up bblfshd gitbase
- Note: You must change /path/to/repos in docker-compose.yml (for gitbase volumes) to the actual path where your git repositories are located.
All containers run in the same network. The Babelfish server is exposed on port :9432, the gitbase server is linked to Babelfish and exposed on port :3306, and the Spark connector is linked to both (bblfsh and gitbase) and serves Jupyter Notebook on port :8080.
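Once the containers are up, one quick way to confirm that the ports listed above accept connections is a plain TCP probe. The sketch below is only an illustration: the object name ServiceCheck and the helper isReachable are hypothetical, and it assumes the ports are published on localhost, as in a local docker-compose run.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

object ServiceCheck {
  // Ports as stated above; the service labels are for display only.
  val services: Seq[(String, Int)] = Seq(
    "bblfshd" -> 9432,
    "gitbase" -> 3306,
    "jupyter" -> 8080
  )

  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def isReachable(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Refused connections, timeouts and unknown hosts are all IOExceptions.
      case _: IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: services are published on localhost unless a host is given.
    val host = args.headOption.getOrElse("localhost")
    services.foreach { case (name, port) =>
      val status = if (isReachable(host, port)) "up" else "unreachable"
      println(s"$name ($host:$port): $status")
    }
  }
}
```

A successful probe only shows that something is listening on the port; it does not verify that gitbase or bblfshd is answering queries correctly.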
The command:
$ docker-compose up
runs all services, but first it builds a Docker image (based on Dockerfile) for gitbase-spark-connector.
If all services start without errors, you can go to http://localhost:8080 and use the Jupyter Notebook to query gitbase through the Spark connector.
Finally, you can try it out from your own code. Add the gitbase DataSource and its configuration by registering it in the Spark session.
import org.apache.spark.sql.SparkSession
import tech.sourced.gitbase.spark.GitbaseSessionBuilder

// Build a local session and register the gitbase data source on it.
val spark = SparkSession.builder().appName("test")
  .master("local[*]")
  .config("spark.driver.host", "localhost")
  .registerGitbaseSource()
  .getOrCreate()

// Load the gitbase tables as DataFrames.
val refs = spark.table("ref_commits")
val commits = spark.table("commits")

// Join references with their commits and keep only history_index 0.
val df = refs
  .join(commits, Seq("repository_id", "commit_hash"))
  .filter(refs("history_index") === 0)

df.select("ref_name", "commit_hash", "committer_when").show(false)
Output:
+-------------------------------------------------------------------------------+----------------------------------------+-------------------+
|ref_name |commit_hash |committer_when |
+-------------------------------------------------------------------------------+----------------------------------------+-------------------+
|refs/heads/HEAD/015dcc49-9049-b00c-ba72-b6f5fa98cbe7 |fff7062de8474d10a67d417ccea87ba6f58ca81d|2015-07-28 08:39:11|
|refs/heads/HEAD/015dcc49-90e6-34f2-ac03-df879ee269f3 |fff7062de8474d10a67d417ccea87ba6f58ca81d|2015-07-28 08:39:11|
|refs/heads/develop/015dcc49-9049-b00c-ba72-b6f5fa98cbe7 |880653c14945dbbc915f1145561ed3df3ebaf168|2015-08-19 01:02:38|
|refs/heads/HEAD/015da2f4-6d89-7ec8-5ac9-a38329ea875b |dbfab055c70379219cbcf422f05316fdf4e1aed3|2008-02-01 16:42:40|
+-------------------------------------------------------------------------------+----------------------------------------+-------------------+
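The same join can probably also be written as a SQL string. The sketch below assumes that the tables registered by the connector are visible to spark.sql under the same names used with spark.table above; the object name HeadCommits and its members are hypothetical, while the table and column names are taken from the DataFrame example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object HeadCommits {
  // SQL form of the join and filter from the DataFrame example.
  val query: String =
    """SELECT r.ref_name, r.commit_hash, c.committer_when
      |FROM ref_commits r
      |JOIN commits c
      |  ON r.repository_id = c.repository_id
      | AND r.commit_hash = c.commit_hash
      |WHERE r.history_index = 0""".stripMargin

  /** Runs the query on a session where ref_commits and commits are registered. */
  def headCommits(spark: SparkSession): DataFrame = spark.sql(query)
}
```

With the session from the example, HeadCommits.headCommits(spark).show(false) should print the same three columns as the DataFrame version.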
Apache License 2.0, see LICENSE