Detailed Guides Walkthru
This page is meant for Beta Testers. If you are not intending to do an installation yourself, please go back to the top page.
On this page we will store user stories of how we have installed KAVEs, what procedure we followed and how well this worked. We will also follow up on the different post-install actions we took.
These stories are divided by the package and version that was installed.
Dante, a KAVE beta tester with an already existing Greenplum-based big-data test cluster, was looking to expand capabilities by stepping into the Hadoop world to benefit from the Hadoop ecosystem. Here is what Dante has to say about the experience:
Introduction
I like to think of myself as a contributor and not merely a consumer of free software and support. I test things all the time and I do my best to alert developers of any problems I find.
My clients know I’m accustomed to uncharted waters. For weird state-of-the-art POCs, they come to me. Recently, I was asked “What do you think about this Kave thing that KPMG is working on? Is it just another Hadoop distro? Should we use it? Why don’t you try it out and let us know?” So there I went and here is my first look for the community at large.
Approach
I started browsing the tool site (which, I must say, is concise and well organized) and to my utter surprise found out that KAVE was Open Source. I concede that at first, before I saw that it was open source, I was quite terrified about testing any closed source software; my big fear was that it could be a Trojan or spyware. Also, there were at the time no toy servers to spare, so I would have to use my work laptop for testing. I called it Battlefield KAVE... into the belly of the beast!
I decided to install it on my notebook. It's a nice Dell running CentOS, which I find quite fast. Kave is surprisingly compact, but exercise some caution and allow some time for installation, because it installs a lot of dependencies.
What issues were encountered and how were they fixed?
As you may imagine, I use other tools on this same machine, Greenplum DB for example. A day or two after installing Kave, I repeated a previously successful experiment with Greenplum, and it would not start. During the Kave installation I saw that it installed PostgreSQL, and Greenplum is based on PostgreSQL, so I was inclined to blame Kave. After a period of inquisition (pun intended), I found Kave was innocent.
I also experienced problems with DHCP and non-fixed internal IPs in my environment. [[ Ed.: now added to install guide! ]] Once I'd fixed this temporarily, I would offer this general advice: most big data tools are designed to be run in a closed environment with fixed local IP addresses. Installing them [Ed.: a complete AmbariKave] directly on a laptop or other LAN PC may be a bad idea. Your best option for development work may be to install them in a VM environment with static network addresses.
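To make "static network addresses" concrete: on a CentOS 6 guest this can be done roughly as below. This is only an illustrative sketch; the interface name, addresses and hostname are made-up examples, not values from any KAVE guide.

```bash
# Example only: pin a CentOS 6 VM to a fixed address instead of DHCP.
# eth0, the 192.168.56.x addresses and the hostname are placeholders.
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.56.10
NETMASK=255.255.255.0
GATEWAY=192.168.56.1
EOF
echo "192.168.56.10  kavenode.example.local kavenode" >> /etc/hosts
service network restart
```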
Aside from that, the installation was fine; no retry of any part was needed, and so far I have no issues with running the cluster. The installation unfolded without complications. The tool is promising. If you share my view that open-source software is by definition more transparent and trustworthy, I recommend you consider Kave for your Big Data project.
This was a really interesting project. A group was interested in processing 300GB of transactional data in a hadoop cluster to benchmark it against their standard proprietary solution.
What did we start from?
We started with no hardware, a good understanding of Ambari and hadoop, and the example blueprints. This is a pretty basic place to start!
How long did it take?
- Resourcing lead time: 1 week
- Stage A deployment (installing Centos6 nodes/Hypervisors) : 2 days
- Stage B deployment (installing Kave from the installer): 5 hours
- Relocating to the data centre: 1 week
- Stage C deployment (fine-tuning, loading data and fixing issues): 2 days
So, all in all, it took one week of work plus two weeks of waiting besides: one week waiting for the blades and one week waiting for the cluster to move to the final datacenter.
How did we do it?
- We first considered what software we would be running. In this case we opted for a simple three-node hadoop setup, with an additional gateway as a playground for the data scientists and a separate machine to run gitlabs for code management.
- This helped us to choose the resources we needed. We followed the recommended specs of Ambari and ended up with a 4-blade setup, where each blade had 24 cores, 112 GB of RAM and 12 TB of HDD space.
- We allocated three of these blades as physical data nodes and installed Centos6 on them. On the fourth we installed a kvm hypervisor and sub-divided it into two namenodes, a gateway machine, an ambari machine, and a dev machine for gitlabs.
- While this was happening, we started from the simple hadoop blueprint example and iterated over it, installing and re-installing it on Amazon (AWS). This really sped up the whole process. Using the existing AmbariKave test structure we created three files, test.aws.json, test.blueprint.json and test.cluster.json, and we matched the aws resources to the same or similar size we knew we were getting from the blades. After around 6 hours we had a very workable blueprint for a small hadoop cluster, gitlabs, kavetoolbox and the kavelanding page.
- In what turned out to be a really, really good idea, the network administrator for the data centre configured a fake network architecture for the blades which matched the network setup in the data centre exactly, with one exception: these machines were located in an office where we could reach them for the initial setup, and the nodes were all given access to the internet for software installation. The network was identical down to the assigned IPs, domain names, nameserver, subnet structure and firewall, so that we could be sure the nodes would work when rebooted inside the data centre.
- The network admin/sys admin followed the requirements shortlist to ensure the cluster was set up correctly.
- Once the servers were configured, with Centos6 installed, we went to the office with the blades and began the installation from the pre-prepared blueprint. We needed: (a) the quickstart instructions on the website, (b) the deployment package from the repo, to get the deploy_from_blueprint script (http://repos.kave.io/noarch/AmbariKave/1.2-Beta/ambarikave-deployment-1.2-Beta.tar.gz), (c) our pre-tested blueprint with us on a memory stick, (d) a copy of the restart_all_services script, and (e) a copy of the clean.sh script.
- Installed AmbariKave onto the ambari node from the quickstart instructions on the website
- Tried to deploy our blueprint using deploy_from_blueprint.py our.blueprint.json our.cluster.json (many problems encountered here; a rough sketch of this step follows this list)
- Monitored through the web interface on the ambari node :8080
- Made fixes and changes where necessary (see below)
- Once the cluster was up and running, it was shut down, moved to the datacenter and restarted. This worked great :)
- With the cluster back up and running, we obtained access through the new firewalls now in place and we could immediately work on our KAVE.
- We found we needed to adjust the yarn resource parameters from Ambari to better match the available CPU and memory of the data nodes. We also needed to add users in FreeIPA and in HUE, and to go to the namenode and separately add every user expected to use hadoop to the hadoop group (sudo usermod -a -G hadoop myuser).
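For reference, here is a minimal sketch of what stage B looked like from the ambari node, using the deployment package linked above. The extracted directory name is an assumption and the exact layout may differ between releases, so follow the quickstart for your version.

```bash
# Sketch of our stage-B install, run on the ambari node.
# The extracted directory name below is an assumption; check the tarball contents.
wget http://repos.kave.io/noarch/AmbariKave/1.2-Beta/ambarikave-deployment-1.2-Beta.tar.gz
tar -xzf ambarikave-deployment-1.2-Beta.tar.gz
cd ambarikave-deployment-1.2-Beta
./deploy_from_blueprint.py our.blueprint.json our.cluster.json
# Then monitor progress in the Ambari web interface on port 8080 of the ambari node.
```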
What issues were encountered and how were they fixed?
- Ambari agents not talking to the ambari server! The deploy_from_blueprint script is clever: it stops at the first sign of trouble so you can go and investigate. In this case it was telling me that there were nodes not connecting to the ambari server, and so not appearing in the ambari server's list of hosts. We first noticed that both selinux and iptables were still running on the nodes. We turned these off (see the firewall/clock/DNS command recap after this list), and then at least some of the nodes were visible to the ambari server.
- Ambari agents still not talking to the ambari server! One of the three data nodes was still refusing to talk to the ambari server. We tried a complete re-install of Ambari using the clean.sh script, and we tried restarting the nodes themselves, but this was not the problem. Looking at the ambari agent logs, there was some problem authenticating to the server over SSL. This was odd, since all the other nodes could communicate fine. By coincidence we noticed that data001 was the only node whose system clock happened to be set slightly earlier than the ambari node's. We installed ntp, reset the clocks against the existing DNS server's clock, and then performed a re-install of the agents so that ambari would allocate new SSL certificates. Since many of the machines were running under the same hypervisor, those were all already synchronized, and of the three data nodes which could be out of sync, only data001 was set earlier in time with respect to the rest. This was a very interesting problem to debug; in future we will add "install ntp" to the instructions!
- KaveToolbox 1.2 installation failing! At one point in the KaveToolbox installation, it uses rsync to copy files. However, KaveToolbox 1.2 does not itself install rsync, and it was not already installed on the centos6 minimal images we were starting from. So we used pdsh to issue the correct install command to all nodes (pdsh -R ssh -g our yum -y install rsync).
- Machines becoming isolated in the network! During the installation of FreeIPA, reverse lookups suddenly stopped working within the cluster, causing a lot of problems, and no external URLs could be resolved, meaning the installations of other services were just timing out trying to download/install packages. We installed bind-utils to debug this and quickly realized the problem: the FreeIPA "forwarders" parameter was incorrectly set. FreeIPA was taking over the DNS but not forwarding lookups to the existing DNS gateway to the rest of the world. To fix this, all we needed to do was go to the FreeIPA web interface on the ambari node and configure the forwarders to use the correct IP.
- Hive/tez/oozie apps not existing on hdfs! When we first booted the cluster, running sql queries through Hive was failing: it complained that it could not access certain jar/tar files which were supposed to be on hdfs. This problem fixed itself. Due to all the other installation problems there were several points where the datanodes could not talk to each other or to the namenode and vice versa, and so these files had not been replicated correctly. By the next time we sat down to debug this, the files had been replicated and there was no more problem to fix.
- Optimization: very low usage of the cluster by yarn jobs. The first test jobs we ran used only 10% of the available CPU and less than 10% of the available memory, and took a long time to complete. We needed to optimize the parameters of both yarn and mapreduce. In yarn we needed to match the allowed memory size to what was available on the nodes (ssh into the nodes, run free -m, and then leave some margin for the OS). We also upped the ram for mappers and reducers. In the end we settled on 2GB per mapper, 4GB per reducer, and a 1GB sorting buffer (yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.io.sort.mb); these settings are summarised in the memory-settings sketch below.
- Order-by jobs not working. During optimization we saw errors in some jobs like "Unable to initialize any output collector". This was because we'd accidentally set mapreduce.io.sort.mb too high, as explained here. The fix was to set mapreduce.io.sort.mb to at most 1/2 of mapreduce.map.memory.mb.
- Error: Java Heap Size. During optimization we saw some errors in jobs like "Error: Java Heap Size". This was because the two other mapreduce properties governing the creation of java child tasks were not set properly, as explained here. The fix was to set mapreduce.map.java.opts to be equal to mapreduce.map.memory.mb and mapreduce.reduce.java.opts to be equal to mapreduce.reduce.memory.mb.
- Hive jobs not using memory. Hive also has several memory-usage parameters which needed tuning before jobs really started to use the memory they were assigned: hive.heapsize was set equal to mapreduce.map.memory.mb, and hive.exec.reducers.bytes.per.reducer was set to 1GB. After this tweaking, the hive jobs finally started to use a lot of memory and got a lot faster as a result.
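To make the fixes above concrete, here is a hedged recap of the host-level commands involved in the firewall, clock and DNS issues. These are standard CentOS 6 / FreeIPA commands rather than anything KAVE-specific, the forwarder IP is a placeholder, and "our" is the pdsh group mentioned earlier.

```bash
# Hedged recap of the host-level fixes above (CentOS 6 syntax; IPs are placeholders).
# Stop selinux and iptables on all nodes so the ambari agents can register:
pdsh -R ssh -g our "setenforce 0; service iptables stop; chkconfig iptables off"

# Install and start ntp so all nodes agree on the time (the agents' SSL handshake needs this):
pdsh -R ssh -g our "yum -y install ntp ntpdate && ntpdate pool.ntp.org && service ntpd start && chkconfig ntpd on"

# Point FreeIPA's DNS at the real upstream nameserver so external lookups work again.
# We actually did this through the FreeIPA web interface; 10.0.0.1 is only an example IP.
kinit admin
ipa dnsconfig-mod --forwarder=10.0.0.1
```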
HDP has a set of tools to help optimize Yarn memory settings.
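As a rough summary of the memory arithmetic above, the sketch below recomputes the values we settled on from a data node's point of view. The variable names and the OS headroom figure are only for illustration; we actually applied these numbers through the Ambari web interface.

```bash
# Illustration of the memory settings we converged on; values were applied via the Ambari UI.
NODE_RAM_MB=$(free -m | awk '/^Mem:/{print $2}')   # total RAM on this data node
OS_HEADROOM_MB=8192                                # example margin left for the OS and HDFS daemons
YARN_NM_MEM_MB=$((NODE_RAM_MB - OS_HEADROOM_MB))   # yarn.nodemanager.resource.memory-mb

MAP_MEM_MB=2048                      # mapreduce.map.memory.mb     (2 GB per mapper)
REDUCE_MEM_MB=4096                   # mapreduce.reduce.memory.mb  (4 GB per reducer)
SORT_MB=$((MAP_MEM_MB / 2))          # mapreduce.io.sort.mb, at most half of the map memory
MAP_OPTS="-Xmx${MAP_MEM_MB}m"        # mapreduce.map.java.opts, matched to the map container
REDUCE_OPTS="-Xmx${REDUCE_MEM_MB}m"  # mapreduce.reduce.java.opts, matched to the reduce container

printf '%s\n' \
  "yarn.nodemanager.resource.memory-mb = $YARN_NM_MEM_MB" \
  "mapreduce.map.memory.mb             = $MAP_MEM_MB" \
  "mapreduce.reduce.memory.mb          = $REDUCE_MEM_MB" \
  "mapreduce.io.sort.mb                = $SORT_MB" \
  "mapreduce.map.java.opts             = $MAP_OPTS" \
  "mapreduce.reduce.java.opts          = $REDUCE_OPTS"
# Hive: hive.heapsize = mapreduce.map.memory.mb; hive.exec.reducers.bytes.per.reducer = 1 GB.
```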
It was great to see the cluster working so quickly!