The easiest way is to download a demo VM with Hadoop, Hive, and HBase installed. Cloudera Demo VMs are available. You will need a 64-bit OS on your computer.
- Install a VM player: VirtualBox (recommended), VMware Player for Windows or Linux, or VMware Fusion for Mac. (Again, a 64-bit host OS is required.)
- Install the VM image for the lab: Download here. (Note: be sure to install the correct image for whatever player you have, and be sure to unpack the file before using.)
- If using a PC, confirm that your laptop is configured to support virtualization. (Enter the BIOS, find the "Virtualization" settings [usually under "Security"], and enable all the virtualization options.)
For common problems during installation, read Ryan Blue's troubleshooting tips here. Ensure that your virtual machine can connect to the internet. FYI, if you are running VirtualBox on Ubuntu 12.10, you may be hitting a known bug related to the internet connectivity of the Demo VM. See here for more details.
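On a Linux host, the 64-bit and virtualization requirements above can be checked from a terminal before opening the BIOS. The following is a minimal sketch, assuming a Linux host that exposes `/proc/cpuinfo`; the helper names `is_64bit` and `has_virt_flags` are hypothetical. The CPU flags only show that the processor supports virtualization, so the BIOS setting may still need to be enabled.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any official tooling.

# Succeeds if the given machine string (from `uname -m`) is a 64-bit x86 host.
is_64bit() {
    case "$1" in
        x86_64|amd64) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads cpuinfo text on stdin; succeeds if the CPU advertises
# Intel VT-x (vmx) or AMD-V (svm).
has_virt_flags() {
    grep -Eq '(^|[[:space:]])(vmx|svm)([[:space:]]|$)'
}

if is_64bit "$(uname -m)"; then
    echo "64-bit host: yes"
else
    echo "64-bit host: no"
fi

# /proc/cpuinfo only exists on Linux, so skip the check elsewhere.
if [ -r /proc/cpuinfo ]; then
    if has_virt_flags < /proc/cpuinfo; then
        echo "CPU virtualization flags: found"
    else
        echo "CPU virtualization flags: not found"
    fi
fi
```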
On your demo VM, download the datasets by git cloning this repository:
cd ~
# Install git in case you don't already have it
sudo yum install git
# GitHub no longer serves the unauthenticated git:// protocol, so clone over HTTPS
git clone https://github.com/markgrover/bdtc-hive.git
# This may take a minute because of the large datasets
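If the steps above need to be re-run, for example after a dropped connection, it can help to skip work that is already done. This is a small sketch under stated assumptions: `have_cmd` and `clone_once` are hypothetical helper names, and the destination directory is whatever you pass in.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Succeeds if the named command is already on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Clone only when the destination is not already a git checkout,
# so the step can be repeated safely.
clone_once() {
    url=$1
    dest=$2
    if [ -d "$dest/.git" ]; then
        echo "already cloned: $dest"
    else
        git clone "$url" "$dest"
    fi
}

if have_cmd git; then
    echo "git is installed"
else
    echo "git is missing; run: sudo yum install git"
fi
```

To use it, call `clone_once "$REPO_URL" ~/bdtc-hive`, where `REPO_URL` is a variable you set to the repository URL from the clone command above.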
There are two datasets in the repo.
a) The first dataset contains on-time flight performance data from 2008, originally released by the Research and Innovative Technology Administration (RITA). The source of this dataset is http://stat-computing.org/dataexpo/2009/the-data.html. The dataset
b) The second dataset contains a listing of airport codes in the continental US, Puerto Rico, and the US Virgin Islands. The source of this dataset is http://www.world-airport-codes.com/. The data was scraped from this website and then cleansed into its present CSV form.
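A quick way to get a feel for the two datasets is to print the first few lines of each file. The sketch below assumes the repository was cloned to `~/bdtc-hive` and that the data files are uncompressed and end in `.csv`; neither the file names nor the layout are confirmed here, and `preview_csvs` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the path, size, and first lines of every
# CSV file under a directory (skipping git's own metadata).
preview_csvs() {
    dir=$1
    lines=${2:-3}
    find "$dir" -type f -name '*.csv' ! -path '*/.git/*' | sort |
    while IFS= read -r f; do
        echo "== $f ($(wc -c < "$f" | tr -d ' ') bytes)"
        head -n "$lines" "$f"
    done
}

# Assumed clone location; adjust if you cloned somewhere else.
if [ -d "$HOME/bdtc-hive" ]; then
    preview_csvs "$HOME/bdtc-hive"
fi
```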
By default, the Demo VM doesn't occupy the full size of your computer screen. To change the resolution of the Demo VM so that it can expand to the full size of your screen, you will need to install the VirtualBox Guest Additions.
This is the sequence of commands that worked for me. Not all of them may be necessary, but it wouldn't hurt to run all of them. I tested them using the Demo VM (which is Redhat 6) with VirtualBox on Mac OS X.
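# Quote the glob so the shell passes it to yum unexpanded
sudo yum install make rpmbuild unifdef gcc 'kernel*'
# Export the variable so that child processes can see it
export KERN_DIR=/usr/src/kernels/`uname -r`.`uname -m`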
cd /media/VirtualBoxGuestAdditions
sudo ./VBoxLinuxAdditions.run
sudo reboot
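After the reboot, one way to confirm that the Guest Additions were installed is to check whether the `vboxguest` kernel module is loaded. This is a minimal sketch; `vbox_guest_loaded` is a hypothetical helper that reads `lsmod` output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: reads `lsmod` output on stdin and succeeds
# if the vboxguest kernel module is listed.
vbox_guest_loaded() {
    awk '$1 == "vboxguest" { found = 1 } END { exit !found }'
}

# lsmod is only present on Linux guests, so guard the call.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | vbox_guest_loaded; then
        echo "Guest Additions kernel module is loaded"
    else
        echo "vboxguest module is not loaded; the installer may have failed"
    fi
fi
```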
Troubleshooting
If the above fails at some stage, you may want to do the following:
- Verify that the KERN_DIR environment variable points to a valid directory. If it doesn't, run
ls /usr/src/kernels
and set KERN_DIR to the appropriate subdirectory listed there.
- If the OpenGL build fails (it failed for me too), simply ignore it. It seems pretty benign.
- See more troubleshooting tips here.
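The KERN_DIR check from the first tip can be scripted. The sketch below tries a few directory naming schemes under `/usr/src/kernels` and exports the first one that exists. Only the `<release>.<arch>` form comes from the commands above; the other two patterns are assumptions, and `find_kern_dir` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the first existing kernel source directory
# among <release>, <release>.<arch>, and <release>-<arch>.
# Returns non-zero if none of them exists.
find_kern_dir() {
    base=$1
    release=$2
    arch=$3
    for candidate in "$base/$release" "$base/$release.$arch" "$base/$release-$arch"; do
        if [ -d "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if KERN_DIR=$(find_kern_dir /usr/src/kernels "$(uname -r)" "$(uname -m)"); then
    export KERN_DIR
    echo "KERN_DIR=$KERN_DIR"
else
    echo "No matching directory under /usr/src/kernels; check that the yum install step succeeded"
fi
```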
If internet connectivity is slow, you will be provided with a USB stick containing the datasets. You can copy the datasets onto your laptop and then share them with your VM.
These instructions were tested on VirtualBox; the steps for VMware should be similar.
Click on Devices and then Shared Folders.... Then add a new Machine Folder and select the auto-mount, read-only, and make permanent options. Now reboot your VM.
After the reboot, run df -h in a terminal to find out where the shared folder was auto-mounted, and you are all set!
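As an alternative to scanning the `df -h` output by eye, the mount list can be filtered for the VirtualBox shared-folder filesystem type, `vboxsf`. This is a sketch assuming a Linux guest; `vbox_shared_mounts` is a hypothetical helper, and it does not handle mount points that contain spaces.

```shell
#!/bin/sh
# Hypothetical helper: reads `mount` output on stdin and prints the
# mount point of every VirtualBox shared folder (type vboxsf).
vbox_shared_mounts() {
    awk '$2 == "on" && $4 == "type" && $5 == "vboxsf" { print $3 }'
}

if command -v mount >/dev/null 2>&1; then
    mount | vbox_shared_mounts
fi
```

On Linux guests, auto-mounted folders typically appear under `/media` with an `sf_` prefix. If the folder is listed but cannot be read, your user may need to be added to the `vboxsf` group.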