Home
BlinkDB is a large-scale data warehouse system built on Shark and Spark and is designed to be compatible with Apache Hive. It can answer HiveQL queries up to 200-300 times faster than Hive by executing them on user-specified samples of data and providing approximate answers that are augmented with meaningful error bars. BlinkDB 0.1.0 is an alpha developer release that supports creating/deleting samples on any input table and/or materialized view and executing approximate HiveQL queries with those aggregates that have statistical closed forms (i.e., AVG, SUM, COUNT, VAR and STDEV).
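To make "aggregates that have statistical closed forms" concrete, here is a minimal Python sketch of how AVG, SUM and COUNT can be estimated from a uniform random sample with error bars. It is an illustration under stated assumptions, not BlinkDB's actual implementation: the function names `approx_avg_sum` and `approx_count` are hypothetical, the sample is assumed to be uniform, the error bars use the normal approximation at roughly 95% confidence (`z = 1.96`), and no finite-population correction is applied.

```python
import math
import random
import statistics


def approx_avg_sum(sample, population_size, z=1.96):
    """Estimate AVG and SUM of a column from a uniform random sample.

    Returns {"avg": (estimate, error), "sum": (estimate, error)}, where
    each error is the half-width of a ~95% confidence interval (z = 1.96).
    Hypothetical helper for illustration only.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Closed form: standard error of the mean = s / sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    avg_err = z * std_err
    # SUM is the mean scaled up to the full table, and so is its error
    return {
        "avg": (mean, avg_err),
        "sum": (population_size * mean, population_size * avg_err),
    }


def approx_count(sample, population_size, predicate, z=1.96):
    """Estimate COUNT(*) WHERE predicate holds, from a uniform sample.

    Uses the sample proportion p and its binomial standard error
    sqrt(p * (1 - p) / n), scaled up to the full table.
    Hypothetical helper for illustration only.
    """
    n = len(sample)
    p = sum(1 for row in sample if predicate(row)) / n
    err = z * population_size * math.sqrt(p * (1 - p) / n)
    return population_size * p, err


if __name__ == "__main__":
    random.seed(0)
    # Synthetic "table" of 100,000 values and a 1% uniform sample of it
    table = [random.gauss(100, 15) for _ in range(100_000)]
    sample = random.sample(table, 1_000)

    estimates = approx_avg_sum(sample, len(table))
    count, count_err = approx_count(sample, len(table), lambda x: x > 120)
    print("AVG   ~ %.2f +/- %.2f" % estimates["avg"])
    print("SUM   ~ %.0f +/- %.0f" % estimates["sum"])
    print("COUNT ~ %.0f +/- %.0f" % (count, count_err))
```

The error bars shrink in proportion to 1/sqrt(n), so quadrupling the sample size roughly halves the error; this is the trade-off between response time and accuracy that sampling-based query answering exposes.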
- Create/Delete Samples on Native Table and/or Materialized View
- Approximate Answers with Error Bars for AVG, SUM and COUNT queries
- Complete support for GROUP BYs and FILTERs
- Scala 2.9.3
- Spark 0.8.x
- OpenJDK 7 or Oracle HotSpot JDK 7 or Oracle HotSpot JDK 6u23+
BlinkDB is built on Shark and Spark and shares a large portion of its setup instructions with these two codebases. Here are the specific instructions for running BlinkDB locally, on a cluster, or on EC2. In case of any problems, please open a GitHub issue or send us an email at sameerag [AT] cs.berkeley.edu.
Running BlinkDB Locally: Get BlinkDB up and running on a single node for a quick spin in about 5 minutes.
Running BlinkDB on a Cluster: Get BlinkDB up and running on your own cluster.
Running BlinkDB on EC2: Launch a BlinkDB cluster on Amazon EC2 in about 10 minutes, including examples of how to query data in S3.
BlinkDB User Guide: An introduction to running BlinkDB and its API.
BlinkDB is being developed in the UC Berkeley AMP Lab. This research and development is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Samsung, Splunk, VMware and Yahoo!. Sameer Agarwal is supported by the Qualcomm Innovation Fellowship during 2012-13 and the Facebook Graduate Fellowship during 2013-14.
This wiki is closely modeled on the Shark wiki.
Shark: Hive on Spark.
Spark: The in-memory cluster computing framework that powers Shark.
Apache Hive: The Apache Hive data warehouse system.
Apache Mesos: A cluster manager that provides efficient resource isolation and sharing across distributed applications.