Skip to content

igorgatis/spark-httpjson

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HTTP JSON Data Source for Apache Spark

HTTP JSON is protocol specification which makes it easy to source data from your system into Apache Spark using HTTP requests with JSON payloads.

It is useful for scenarios where you need to fetch data from a system for which there is no native Spark connector or reading data directly from the data source is not possible, too complex or just undesired.

How it works

Apache Spark uses spark-httpjson client library to connect to your backend. It issues a couple of discovery requests asking for data schema and partition list. Then Spark distributes reading tasks among its workers which read partitions in parallel. Check the Protocol page for details.

Your server needs to implement both discovery and partition reading operations. This repository includes the following reference implementations for the HTTP handling:

Usage

First, make sure you include the Spark-HttpJson library as part of your Spark cluster (or job), check Releases.

The following code creates a DataFrame using spark.sources.datasourcev2.httpjson data source format. It tells to read data from https://hostname.domain.com:443/spark-api endpoint using Basic authentication for user someusername and password S3cr3t. It also includes options which are meant to instruct your backend.

val df = spark.read
  .format("spark.sources.datasourcev2.httpjson")
  .option("http.url", "https://hostname.domain.com:443/spark-api")
  .option("http.user", "someusername")
  .option("http.password", "S3cr3t")
  .option("http.gzip", "true")
  .option("http.ignorecertificates", "true")
  .option("http.X-My-Header", "blah")
  .option("myservice.Entity", "Pets")
  .option("myservice.PartitionSize", "1000")
  .option("myservice.MinimumWeightInKg", "5")
  .load()

The options with http. prefix control the behaviour of HTTP requests. Some of them are handled specially:

Option Description
http.url Sets endpoint URL (required).
http.user Sets Basic Auth user.
http.password Sets Basic Auth password.
http.gzip Tells the library to use compressed data (default: false).
http.ignorecertificates Tells the library should ignore certificate errors (default: false).

All others http. prefixed options are passed as regular HTTP headers (eg. X-My-Header: blah). This means you can use http.Authentication to use a Bearer token or http.Cookie to send a cookie as part of the request. WARNING: although gzip compression reduces request/response sizes, it does imply on CPU costs. Make sure you try with and without compression to make sure it yields expected performance gains.

All non http. prefixed options are passed as JSON payload to requests made to your backend.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages