euriion edited this page Mar 5, 2012 · 28 revisions

RHive is an R package that facilitates distributed computing via Hive queries. The package has been implemented and tested with Hive 0.8.0-SNAPSHOT.

What
RHive is an R package that integrates the R environment with Hive, the open source data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.

Using RHive, you can write HQL in R, launch the query from R, and interact with Hive. RHive can also export R functions and R objects to Hive and run them there.

In RHive, small data is processed in R and large data is processed in Hive.

Why
Many enterprises now collect data at the most detailed level possible, so they need large repositories that can hold terabytes to petabytes. They also want to extract whatever knowledge these huge datasets may contain. R is a highly popular statistical analysis environment, and many analysts use it and are familiar with it, but R alone cannot analyze data at this scale. Analysts sometimes work around this by sampling, which can give a misleading picture of the dataset. MapReduce in Hadoop can handle data of this scale, but many analysts are unfamiliar with the framework, let alone know how to use it. They are more likely to be comfortable using SQL to explore and preprocess a dataset. Like SQL, Hive offers ad-hoc queries, which it executes on Hadoop. By integrating R and Hive, we provide a practical way to handle and analyze big data.

Thus we need a design that integrates R and Hive.

RHive was motivated by this need: the analysis of big data.

RHive consists of the following components:

udf – functions that let users call R functions and use R objects in Hive
rhive – functions to interact with Hive from within R
rhive.hdfs – functions to interact with HDFS from within R


Presentations and Papers about RHive


User Guide (tutorial) (last updated: Mar 5, 2012)
These are the official documents for RHive users.


Example


How to contribute
