gitbase-spark-connector

gitbase-spark-connector is a Scala library that lets you expose gitbase tables as Spark SQL DataFrames to run scalable analysis and processing pipelines on source code.

Pre-requisites

Import as a dependency

Maven:

<dependency>
  <groupId>tech.sourced</groupId>
  <artifactId>gitbase-spark-connector</artifactId>
  <version>[version]</version>
  <type>slim</type>
</dependency>

SBT:

libraryDependencies += "tech.sourced" % "gitbase-spark-connector" % "[version]" classifier "slim"

Note the slim type or classifier. It is intended to make it possible to use this data source as a library. If you don't add it, the retrieved jar will include all needed dependencies (a fat jar or uber jar), which might cause dependency conflicts in your application.

You can check the available versions here.

Usage

First of all, you'll need a gitbase instance running. It exposes your repositories through a SQL interface. gitbase depends on bblfsh to extract UASTs (universal abstract syntax trees) from source code, so if you plan to filter queries by language or run operations on UASTs, a Babelfish server is also required.

The most convenient way to run all the services is with docker-compose. The Compose file (docker-compose.yml) defines three services: bblfshd, gitbase and gitbase-spark-connector.

You can run any combination of them, e.g. only bblfshd and gitbase:

$ docker-compose up bblfshd gitbase

  • Note: you must change /path/to/repos in docker-compose.yml (under the gitbase service's volumes) to the actual path where your git repositories are located.

All containers run in the same network. The Babelfish server is exposed on port 9432, the gitbase server is linked to Babelfish and exposed on port 3306, and the Spark connector is linked to both (bblfsh and gitbase) and serves a Jupyter Notebook on port 8080.
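If you just want to verify that gitbase is reachable before wiring up Spark, you can talk to its MySQL interface directly. Below is a minimal Scala sketch assuming gitbase's default root user with an empty password, a database named gitbase, and the MySQL Connector/J driver on your classpath (all of these are assumptions, not part of this README):

import java.sql.DriverManager

object GitbaseSmokeTest extends App {
  // Assumed defaults: root user, empty password, database "gitbase".
  val conn = DriverManager.getConnection(
    "jdbc:mysql://127.0.0.1:3306/gitbase", "root", "")
  try {
    // repositories is one of the tables gitbase exposes over SQL.
    val rs = conn.createStatement()
      .executeQuery("SELECT repository_id FROM repositories LIMIT 5")
    while (rs.next()) println(rs.getString("repository_id"))
  } finally conn.close()
}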

The command:

$ docker-compose up

runs all services, but first builds a Docker image (based on the Dockerfile) for gitbase-spark-connector. If all services started without errors, you can go to http://localhost:8080 and use the Jupyter Notebook to query gitbase through the Spark connector.

Finally, you can try it out from your own code. Register the gitbase data source and its configuration in the Spark session:

import org.apache.spark.sql.SparkSession
import tech.sourced.gitbase.spark.GitbaseSessionBuilder

val spark = SparkSession.builder().appName("test")
    .master("local[*]")
    .config("spark.driver.host", "localhost")
    .registerGitbaseSource()
    .getOrCreate()

// gitbase tables are exposed as regular Spark SQL DataFrames.
val refs = spark.table("ref_commits")
val commits = spark.table("commits")

// Keep only the commit each reference currently points to (history_index == 0).
val df = refs
  .join(commits, Seq("repository_id", "commit_hash"))
  .filter(refs("history_index") === 0)

df.select("ref_name", "commit_hash", "committer_when").show(false)

Output:

+-------------------------------------------------------------------------------+----------------------------------------+-------------------+
|ref_name                                                                       |commit_hash                             |committer_when     |
+-------------------------------------------------------------------------------+----------------------------------------+-------------------+
|refs/heads/HEAD/015dcc49-9049-b00c-ba72-b6f5fa98cbe7                           |fff7062de8474d10a67d417ccea87ba6f58ca81d|2015-07-28 08:39:11|
|refs/heads/HEAD/015dcc49-90e6-34f2-ac03-df879ee269f3                           |fff7062de8474d10a67d417ccea87ba6f58ca81d|2015-07-28 08:39:11|
|refs/heads/develop/015dcc49-9049-b00c-ba72-b6f5fa98cbe7                        |880653c14945dbbc915f1145561ed3df3ebaf168|2015-08-19 01:02:38|
|refs/heads/HEAD/015da2f4-6d89-7ec8-5ac9-a38329ea875b                           |dbfab055c70379219cbcf422f05316fdf4e1aed3|2008-02-01 16:42:40|
+-------------------------------------------------------------------------------+----------------------------------------+-------------------+
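From here on it is plain Spark. For example, building on the df defined above (nothing gitbase-specific, just the DataFrame API), you could count how many references each repository has:

import org.apache.spark.sql.functions.desc

// Each row of df is a reference head, so grouping by repository
// gives the number of references per repository.
val refsPerRepo = df
  .groupBy("repository_id")
  .count()
  .orderBy(desc("count"))

refsPerRepo.show(10, false)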

License

Apache License 2.0, see LICENSE