

Detailed guides: Walkthru/Blog

This page is meant for Beta Testers. If you do not intend to do an installation yourself, please go back to the top page.

On this page we store user stories of how we have installed KAVEs, which procedures we followed, and how well they worked. We also follow up on the various post-install actions we took.

These stories are divided by the package and version that was installed.

AmbariKave 1.2-Beta

Installing a hadoop cluster in situ in a data centre

This was a really interesting project. A group was interested in processing 300 GB of transactional data in a hadoop cluster in order to benchmark it against their standard proprietary solution.

What did we start from?

We started with no hardware, a good understanding of Ambari and hadoop, and the example blueprints. This is a pretty basic place to start!

How long did it take?

  • Resourcing lead time: 1 week
  • Stage A deployment (installing Centos6 nodes/hypervisors): 2 days
  • Stage B deployment (installing Kave from the installer): 5 hours
  • Relocating to the data centre: 1 week
  • Stage C deployment (fine-tuning, loading data and fixing issues): 2 days

So, all in all, it took one week of work plus two weeks of waiting: one week waiting for the blades, and one week waiting for the cluster to be moved to the final datacenter.

How did we do it?

  1. We first considered what software we would be running. In this case we opted for a simple three-node hadoop setup with an additional gateway for the data scientists to have a playground, and a separate machine to run gitlabs for code management.
  2. This helped us to choose the resources we needed. We followed the recommended specs of Ambari and ended up with a 4-blade setup, where each blade had 24 cores, 112 GB of RAM and 12 TB of HDD space.
  3. We allocated three of these blades as physical data nodes and installed Centos6 on them. On the remaining blade we installed a KVM hypervisor and subdivided it into two namenodes, a gateway machine, an ambari machine, and a dev machine for gitlabs.
  4. While this was happening, we started from the simple hadoop blueprint example and iterated over it, installing and re-installing it on Amazon (AWS). This really sped up the whole process. Using the existing AmbariKave test structure we created three files, test.aws.json, test.blueprint.json and test.cluster.json, and we matched the AWS resources to the same or a similar size to what we knew we were getting from the blades. After around 6 hours we had a very workable blueprint for a small hadoop cluster, gitlabs, kavetoolbox and the kavelanding page (a minimal sketch of a blueprint/cluster file pair follows this list).
  5. In what turned out to be a really, really good idea, the network administrator for the data centre configured a fake network architecture for the blades which matched the network setup in the data centre exactly, with one exception: the machines were located in an office where we could reach them for the initial setup, and all nodes were given access to the internet for software installation. The network was otherwise identical down to the assigned IPs, domain names, nameserver, subnet structure and firewall, so we could be sure the nodes would work when rebooted inside the data centre.
  6. The network admin/sys admin followed the requirements shortlist to ensure the cluster was setup correctly.
  7. Once the servers were configured, with Centos6 installed, we went to the office with the blades and began the installation from the pre-prepared blueprint. We needed: (a) the quickstart instructions on the website (b) the deployment package from the repo, to get the deploy_from_blueprint script (http://repos.kave.io/noarch/AmbariKave/1.2-Beta/ambarikave-deployment-1.2-Beta.tar.gz) (c) Our pre-tested blueprint with us on a memory stick (d) A copy of the restart_all_services script. (e) A copy of the clean.sh script.
  8. Installed AmbariKave onto the ambari node following the quickstart instructions on the website.
  9. Tried to deploy our blueprint using deploy_from_blueprint.py our.blueprint.json our.cluster.json (many problems were encountered here; the commands we ran are sketched after this list).
  10. Monitored progress through the web interface on the ambari node on port 8080.
  11. Made fixes and changes where necessary (see below).
  12. Once the cluster was up and running, it was shut down, moved to the datacenter and restarted. This worked great :)
  13. With the cluster back up and running, we obtained access through the new firewalls now in place and we could immediately work on our KAVE.
  14. We found we needed to adjust the yarn resource parameters from Ambari to better match the available CPU and memory of the data nodes. We also needed to add users in FreeIPA and in HUE, and then go to the namenode and add every user expected to use hadoop to the hadoop and hdfs groups separately (sudo usermod -a -G hadoop myuser); a sketch of this user setup follows this list.
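
For reference, a blueprint/cluster file pair of the kind described in step 4 looks roughly like the sketch below. The blueprint name, stack version, host group names, component lists and FQDNs are illustrative assumptions, not the exact contents of our test.blueprint.json and test.cluster.json.

```json
{
  "Blueprints": { "blueprint_name": "small-hadoop", "stack_name": "HDP", "stack_version": "2.2" },
  "host_groups": [
    { "name": "namenode", "cardinality": "1",
      "components": [ { "name": "NAMENODE" }, { "name": "RESOURCEMANAGER" }, { "name": "ZOOKEEPER_SERVER" } ] },
    { "name": "datanodes", "cardinality": "3",
      "components": [ { "name": "DATANODE" }, { "name": "NODEMANAGER" } ] }
  ]
}
```

The matching cluster file then simply maps those host groups onto the real (or AWS) machines:

```json
{
  "blueprint": "small-hadoop",
  "default_password": "changeme",
  "host_groups": [
    { "name": "namenode",  "hosts": [ { "fqdn": "nn001.kave.internal" } ] },
    { "name": "datanodes", "hosts": [ { "fqdn": "data001.kave.internal" },
                                      { "fqdn": "data002.kave.internal" },
                                      { "fqdn": "data003.kave.internal" } ] }
  ]
}
```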
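
The stage B deployment itself (steps 7-9) boils down to a handful of commands on the ambari node. This is a sketch rather than a verbatim transcript; the exact layout inside the deployment tarball and the installer invocation are as described in the quickstart instructions.

```bash
# Fetch and unpack the deployment package (step 7b)
wget http://repos.kave.io/noarch/AmbariKave/1.2-Beta/ambarikave-deployment-1.2-Beta.tar.gz
tar -xzf ambarikave-deployment-1.2-Beta.tar.gz

# After installing AmbariKave per the quickstart instructions (step 8),
# deploy the pre-tested blueprint (step 9) and monitor it in the Ambari
# web interface on port 8080 (step 10).
./deploy_from_blueprint.py our.blueprint.json our.cluster.json
```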
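
The per-user setup from step 14 looked roughly like the following. The username is a placeholder, and the ipa command line is given as an assumed equivalent of what we actually did through the FreeIPA and HUE web interfaces.

```bash
# Create the account in FreeIPA (we used the web interface; this is the CLI equivalent)
ipa user-add myuser --first=My --last=User --password

# On the namenode, add the user to the hadoop and hdfs groups separately
sudo usermod -a -G hadoop myuser
sudo usermod -a -G hdfs myuser
```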

What issues were encountered and how were they fixed?

  1. Ambari agents not talking to the ambari server! The deploy_from_blueprint script is clever: it stops at the first sign of trouble so you can go and investigate. In this case it was telling me that there were nodes that were not connecting to the ambari server, and so not appearing in the hosts list of the ambari server. We first noticed that both selinux and iptables were still running on the nodes. We turned these off (the commands are sketched after this list), and then at least some of the nodes were visible to the ambari server.
  2. Ambari agents still not talking to the ambari server! One of the three data nodes was still refusing to talk to the ambari server. We tried a complete re-install of Ambari using the clean.sh script, and we tried restarting the nodes themselves, but this was not the problem. Checking the ambari agent logs, there was some problem authenticating to the server over SSL. This was odd, since all the other nodes could communicate fine. By coincidence we noticed that data001 was the only node whose system clock happened to be set slightly earlier than the ambari node's. We installed ntp, reset the clocks against the existing DNS server, and then performed a re-install of the agents so that ambari allocated them new SSL certificates (see the sketch after this list). Since many of the machines were running under the same hypervisor, those were all already synchronized, and of the three data nodes which could be out of sync, only one happened to be behind the rest. This was a very interesting problem to debug; in future we will add "install ntp" to the instructions!
  3. KaveToolbox 1.2 installation failing! At one point the KaveToolbox installation uses rsync to copy files. However, KaveToolbox 1.2 does not itself install rsync, and it was not already installed on the Centos6 minimal images we were starting from. So we used pdsh to issue the correct install command to all nodes (pdsh -R ssh -g our yum -y install rsync).
  4. Machines becoming isolated in the network! During the installation of FreeIPA, reverse lookups suddenly stopped working within the cluster, causing a lot of problems, and no external URLs could be resolved, meaning the installations of other services were simply timing out while trying to download/install packages. We installed bind-utils to debug this and quickly realized the problem: the FreeIPA "forwarders" parameter was incorrectly set. FreeIPA was taking over the DNS but not forwarding lookups to the existing DNS gateway to the rest of the world. To fix this, all we needed to do was go to the FreeIPA web interface on the ambari node and configure the forwarders to use the correct IP (a CLI equivalent is sketched after this list).
  5. Hive/tez/oozie apps not existing on hdfs! When we first booted the cluster, running sql queries through Hive was failing: it complained that it could not access certain jar/tar files which were supposed to be on hdfs. This problem fixed itself. Due to all the other installation problems, there were several points where the datanodes could not talk to each other or to the namenode and vice versa, and so these files had not been replicated correctly. By the next time we sat down to debug this, the files had been replicated and there was no more problem to fix.
  6. Optimization: very low usage of the cluster by yarn jobs. The first test jobs we ran used only 10% of the available CPU and less than 10% of the available memory, and took a long time to complete. We needed to optimize the parameters of both yarn and mapreduce. In yarn we needed to match the possible memory size to what was available on the nodes (ssh into the nodes, run free -m, and then leave some margin for the OS). We also upped the RAM for mappers and reducers. In the end we settled on 2 GB per mapper, 4 GB per reducer and a 1 GB sorting buffer (yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.io.sort.mb); the final settings are sketched after this list.
  7. Order-by jobs not working. During optimization we saw errors in some jobs like: Unable to initialize any output collector. This was because we'd accidentally set mapreduce.io.sort.mb too high, as explained here. The fix was to set mapreduce.io.sort.mb to at most half of mapreduce.map.memory.mb.
  8. Error: Java Heap Size. During optimization we saw some errors in jobs like: Error: Java Heap Size. This was because the two other mapreduce properties controlling the java child tasks were not set properly, as explained here. The fix was to set mapreduce.map.java.opts to be equal to mapreduce.map.memory.mb and mapreduce.reduce.java.opts to be equal to mapreduce.reduce.memory.mb.
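
For issue 1, turning off selinux and iptables on the affected nodes is a standard Centos6 recipe, roughly:

```bash
# Disable selinux now and across reboots
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Stop iptables and keep it from starting at boot
service iptables stop
chkconfig iptables off
```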
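
For issue 2, the clock fix on data001 amounted to something like the sketch below; the time source hostname is a placeholder.

```bash
# Install ntp so the clock stays in step with the rest of the cluster
yum -y install ntp
ntpdate ambari.kave.internal    # one-off sync against a reachable time source (placeholder hostname)
service ntpd start
chkconfig ntpd on
# The ambari agent was then re-installed so that ambari issued it a fresh SSL certificate
```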
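
For issue 4, we made the change through the FreeIPA web interface; a CLI equivalent (the forwarder IP is a placeholder for your existing DNS gateway) would be roughly:

```bash
# Check that external lookups are broken, then point FreeIPA's DNS at the upstream nameserver
yum -y install bind-utils
dig +short www.example.com          # times out while the forwarders are wrong
ipa dnsconfig-mod --forwarder=10.0.0.2
```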
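
For issues 6-8, the memory settings we converged on (entered through Ambari's YARN and MapReduce2 configuration screens) were along these lines. The nodemanager figure is illustrative, since it depends on what free -m reports on your data nodes minus some headroom for the OS:

```
yarn.nodemanager.resource.memory-mb = 100000    # roughly the free memory per data node (illustrative)
mapreduce.map.memory.mb             = 2048      # 2 GB per mapper
mapreduce.reduce.memory.mb          = 4096      # 4 GB per reducer
mapreduce.io.sort.mb                = 1024      # 1 GB sort buffer, at most half the mapper memory
mapreduce.map.java.opts             = -Xmx2048m
mapreduce.reduce.java.opts          = -Xmx4096m
```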

It was great to see the cluster working so quickly!

KaveToolbox 1.2-Beta
