

Detailed guides: Walkthru/Blog

This page is meant for Beta Testers. If you are not intending to do an installation yourself, please go back to the top page.

On this page we collect user stories of how we have installed KAVEs, which procedure we followed, and how well it worked. We also follow up on the various post-install actions we took.

These stories are divided by the package and version that was installed.

AmbariKave 2.0-Beta KPMG DE

An official AmbariKave installation as D&A PoC Reference Implementation for KPMG DE

Introduction

This installation exercise has been carried out by R.V. Vincelli @rvvincelli in collaboration with colleague J. Behrang from KPMG DE. The supervisors are R. Lambert @rwlambert and A. Schlosser.

The cluster is a virtual one, physically located in a KPMG IT Global datacenter in Bulgaria; the contact people are I. Georgiev and A. Atanasov. The installation is extranet-accessible from both the Dutch and German KNets.

A specialized Linux DevOps team is in charge of the physical cluster setup and management.

Cluster

The final cluster was correctly deployed after a couple of improvement iterations. It must be pointed out that this AmbariKave installation is a proof of concept and not meant for real data science work and project assignments.

VMs

  • OS: CentOS6
  • OS disk: 60GBx1
  • Extra data disk: 500GBx1
  • nodes: gateway node, ambari server, two Hadoop namenodes and three Hadoop datanodes (extra data disk: datanodes only)

Disk layouts

gate:

  • /usr/hdp: 4GB
  • /var/lib/ambari-agent: 4GB
  • /opt: 10GB
  • /(OS): 32GB

ambari:

  • /var/log: 10GB
  • /usr/hdp: 4GB
  • /var/lib/ambari-agent: 4GB
  • /var/lib/ambari-server: 4GB
  • / (OS): 38GB

namenode: see gate

datanode OS disk: see gate

datanode data disk: /hadoop, 500GB

Connectivity

  • inbound: SSH to the gateway; access is granted from the KNets only
  • outbound: all (NAT); every node must have Internet access, since the nodes install packages from the Internet

The FQDN (e.g. gate.cloud.kpmg.com) may not be longer than 32 characters. kave.io was chosen, but since this is a real Internet domain we encountered a glitch (see below).
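A quick way to check this on any node (a minimal sketch; the 32-character limit is the requirement stated above):

# print the FQDN and its length in characters
hostname -f
hostname -f | awk '{ print length }'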

General requirements

Available here. This list of requirements is the valid reference when preparing for a new AmbariKave cluster installation.

Screenshots

[root@ambari ~]# ip addr  show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1458 qdisc mq state UP qlen 1000
    link/ether 00:1d:d8:b7:43:1e brd ff:ff:ff:ff:ff:ff
    inet 192.168.55.3/24 brd 192.168.55.255 scope global eth0
    inet6 fe80::21d:d8ff:feb7:431e/64 scope link
       valid_lft forever preferred_lft forever
 

[root@ambari ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.55.4    gate.kave       gate
192.168.55.3    ambari.kave     ambari
192.168.55.5    nno-0.kave      nno-0
192.168.55.6    nno-1.kave      nno-1
192.168.55.7    data-0.kave     data-0
192.168.55.8    data-1.kave     data-1
192.168.55.9    data-2.kave     data-2

[root@ambari ~]# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /selinux
Current mode:                   permissive
Mode from config file:          permissive
Policy version:                 24
Policy from config file:        targeted

[root@ambari ~]# service iptables status
iptables: Firewall is not running.

[root@ambari ~]# lsblk
NAME                                 MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                                    8:0    0   60G  0 disk
+-sda1                                 8:1    0  500M  0 part /boot
+-sda2                                 8:2    0 59.5G  0 part
  +-VolGroup-lv_root (dm-0)          253:0    0 33.1G  0 lvm  /
  +-VolGroup-lv_swap (dm-1)          253:1    0  4.9G  0 lvm  [SWAP]
  +-VolGroup-lv_hdp (dm-2)           253:2    0  3.9G  0 lvm  /usr/hdp
  +-VolGroup-lv_ambari_server (dm-3) 253:3    0  3.9G  0 lvm  /var/lib/ambari-server
  +-VolGroup-lv_ambari_agent (dm-4)  253:4    0  3.9G  0 lvm  /var/lib/ambari-agent
  +-VolGroup-lv_log (dm-5)           253:5    0  9.8G  0 lvm  /var/log
sr0                                   11:0    1 1024M  0 rom


[root@ambari ~]# grep ^server  /etc/ntp.conf
server 0.de.pool.ntp.org
server 1.de.pool.ntp.org
server 2.de.pool.ntp.org
server 3.de.pool.ntp.org


[root@ambari ~]# date
Wed Aug  3 19:31:37 CEST 2016



[root@ambari ~]# hostname
ambari.kave
[root@ambari ~]# hostname -s
ambari
[root@ambari ~]# hostname -d
kave
[root@ambari ~]# hostname -f
ambari.kave
[root@ambari ~]# uname -n
ambari.kave

In particular:

  • sufficient computing resources and OS version: the setup of this cluster is sufficient for an exploratory installation, but a high-performance cluster suited for data science work needs more resources
  • fresh/non-conflicting images: we require only the operating system to be installed, with no additional software except what is temporarily installed during this setup
  • static internal addresses: the VMs must have static local network addresses, i.e. they are NOT obtained via DHCP and do not change after a reboot
  • FQDNs: we set a limit of 32 chars; no capital letters; hostname -d must return the domain part of hostname -f; also, uname -n should show the same hostname as hostname -f
  • forward and reverse lookup: the FQDN is associated to the local IP address and vice versa (arpa lookup); details on the requirements link above
  • passwordless root access: the root user must have access to all of the nodes of the cluster from the ambari node without a password; this means using ssh-copy-id against every node from ambari as root
  • disabling iptables and selinux: these components interfere with our product (a few quick checks are sketched after this list)
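A few quick checks that can be run on each node against the shortlist above (a sketch, not exhaustive):

# naming consistency: compare the short name, the domain part and the FQDN
hostname -f && hostname -s && hostname -d && uname -n
# forward lookup of the FQDN and reverse lookup of the local IP (requires DNS/arpa records or /etc/hosts entries)
getent hosts $(hostname -f)
getent hosts $(hostname -i)
# SELinux and iptables must not get in the way
sestatus
service iptables status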

Setup walkthrough

Before you start

Assumption: you are installing the latest and greatest (and you really should!). This means x.y = master; in our case it was 2.1.

To reset an installation, use clean.sh. This should also remove PostgreSQL and its directories /var/lib/{pgsql, postgresql} and /opt/postgres, as well as /etc/ambari-server/*, /etc/ambari-agent/* and /root/.pgpass.
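If clean.sh leaves anything behind, the extra cleanup amounts to removing the paths listed above (a minimal sketch, run as root):

# remove PostgreSQL data and the Ambari state directories left over from a previous installation
rm -rf /var/lib/pgsql /var/lib/postgresql /opt/postgres
rm -rf /etc/ambari-server/* /etc/ambari-agent/*
rm -f /root/.pgpass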

Passwordless SSH must be enabled from ambari to the following (a setup sketch follows this list):

  • all nodes, eg gate, nno-0...
  • all nodes.domain.xy, eg gate.kave.io, nno-0.kave.io
  • localhost
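One possible way to set this up from the ambari node as root, using the host names of this walkthrough and assuming root's key pair has already been generated with ssh-keygen (adapt the domain suffix to your own):

# copy root's public key to every node, both by short name and by FQDN, plus localhost
for h in gate nno-0 nno-1 data-0 data-1 data-2; do
  ssh-copy-id root@$h
  ssh-copy-id root@$h.kave.io
done
ssh-copy-id root@localhost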

All hosts in your system must be configured for DNS and reverse DNS. If you are unable to configure DNS and reverse DNS, you must edit the hosts file on every host in your cluster so that it contains the address of each of your hosts and sets the fully qualified domain name hostname of each of those hosts. Please refer to your specific operating system documentation for the details for your system. Remember that AmbariKave officially supports CentOS, RedHat and Ubuntu only.

Check the cluster is fresh: uptime.

Setup

yum install -y epel-release
yum install -y pdsh

pdsh -w ambari,gate,nno-0,nno-1,data-0,data-1,data-2 "mkdir /etc/kave; /bin/echo http://repos:[email protected]/ >> /etc/kave/mirror"

yum remove -y pdsh

Upload/paste the Ambari blueprint and cluster files.

service iptables stop
chkconfig iptables off
echo 0 >/selinux/enforce
sed -i s/SELINUX=enforcing/SELINUX=disabled/g /etc/selinux/config
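These commands need to take effect on every node, not only on ambari (see the general requirements above); one possible way to apply them everywhere, assuming passwordless SSH from ambari is already in place, is a loop like this:

# disable iptables and SELinux on all nodes of this walkthrough
for h in gate nno-0 nno-1 data-0 data-1 data-2; do
  ssh root@$h "service iptables stop; chkconfig iptables off; echo 0 >/selinux/enforce; sed -i s/SELINUX=enforcing/SELINUX=disabled/g /etc/selinux/config"
done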

Obtain AmbariKave:

  • yum -y install wget tar zip unzip gzip
  • wget http://repos:[email protected]/centos6/AmbariKave/2.1-Beta/ambarikave-installer-centos6-2.1-Beta.sh

We had to replace repos.kave.io with 94.143.213.26: the DNS configured by the team was authoritative for the domain we chose, kave.io, but of course no Internet subdomain such as repos had been entered, with the net result that the repository domain could not be resolved.
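An alternative workaround, instead of editing the URL, could have been to pin the repository host to its IP address in /etc/hosts on every node (a sketch using the IP above):

# map repos.kave.io to its known address so wget/yum can resolve it despite the authoritative DNS
echo "94.143.213.26 repos.kave.io" >> /etc/hosts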

Obtain AmbariKave dev:

  • wget https://github.com/KaveIO/AmbariKave/archive/master.zip
  • unzip master.zip

pdsh -w ambari,gate,nno-0,nno-1,data-0,data-1,data-2 "yum clean all"

AmbariKave-master/deployment/deploy_from_blueprint.py germany.blueprint.json germany.cluster.json --verbose --not-strict

In case of SSH errors, try to log in to the inaccessible node and make sure you can do so without any prompt. For yum errors, just retry a few times; pass option -d 10 for maximum debug verbosity.

The gate is the only extranet-reachable node, but we are interested in following the installation progress on the Ambari webapp. In order to do so, let us create a tunnel to the ambari machine. How to do it (requires ssh):

  • ssh -L 8081:ambari:8080 [email protected]
  • open your browser at localhost:8081 - you will be redirected to the Ambari webapp
  • credentials are admin/admin, if they have not been changed

If all the services are yellow/install-failed, the problem is most probably the FreeIPA server. Retry the installation and investigate the console output, amending the configuration as needed. Then take care of the FreeIPA clients. Because of a bug in Ambari, their installation must be retried from the command line if needed. To check whether this is needed, i.e. whether the IPA clients were installed correctly, connect to ambari and copy the password found in /root/admin-password; then connect to any other node and type kinit admin, entering that password. If this works, the client is (most probably) correctly installed. Otherwise the command is ipa-client-install; here is a sample interaction:

[root@gate ~]# ipa-client-install
DNS discovery failed to determine your DNS domain
Provide the domain name of your IPA server (ex: example.com): kave.io
Provide your IPA server name (ex: ipa.example.com): ambari.kave.io
The failure to use DNS to find your IPA server indicates that your resolv.conf file is not properly configured.
Autodiscovery of servers for failover cannot work with this configuration.
If you proceed with the installation, services will be configured to always access the discovered server for all operations and will not fail over to other servers in case of failure.
Proceed with fixed values and no DNS discovery? [no]: yes
Hostname: <CURRENTVMNAME>.kave.io
Realm: KAVE.IO
DNS Domain: kave.io
IPA Server: ambari.kave.io
BaseDN: dc=kave,dc=io

Continue to configure the system with these values? [no]: yes
User authorized to enroll computers: admin
Synchronizing time with KDC...
Unable to sync time with IPA NTP server, assuming the time is in sync. Please check that 123 UDP port is opened.
Password for [email protected]:<CONTENTS OF /root/admin-password>
Successfully retrieved CA cert
    Subject:     CN=KAVE.IO Certificate Authority
    Issuer:      CN=KAVE.IO Certificate Authority
    Valid From:  Fri Aug 05 14:48:20 2016 UTC
    Valid Until: Wed Aug 05 14:48:20 2026 UTC

Enrolled in IPA realm KAVE.IO
Attempting to get host TGT...
Created /etc/ipa/default.conf
New SSSD config will be created
Configured sudoers in /etc/nsswitch.conf
Configured /etc/sssd/sssd.conf
Configured /etc/krb5.conf for IPA realm KAVE.IO
trying https://ambari.kave.io/ipa/xml
Forwarding 'env' to server u'https://ambari.kave.io/ipa/xml'
Hostname (gate.kave.io) not found in DNS
Failed to update DNS records.
Adding SSH public key from /etc/ssh/ssh_host_rsa_key.pub
Adding SSH public key from /etc/ssh/ssh_host_dsa_key.pub
Forwarding 'host_mod' to server u'https://ambari.kave.io/ipa/xml'
Could not update DNS SSHFP records.
SSSD enabled
Configuring kave.io as NIS domain
Configured /etc/openldap/ldap.conf
NTP enabled
Configured /etc/ssh/ssh_config
Configured /etc/ssh/sshd_config
Client configuration complete.

Check again that the client was correctly installed by running the kinit admin test above. Now reinstall and start the services as needed from the Ambari webapp.
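The check boils down to the following (a sketch; the admin password is the one stored on the ambari node):

# on the ambari node: read the FreeIPA admin password
cat /root/admin-password
# on the node under test: request a Kerberos ticket as admin and verify it
kinit admin
klist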

Create a sudoer (e.g. kaveadmin) on gate: see here. If this is not enough, as root uncomment the wheel line under "## Allows people in group wheel to run all commands" in the sudo configuration file.
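A minimal sketch of the user creation, assuming the wheel group is the sudo mechanism in use:

# create the kaveadmin user, set a password and add it to the wheel group
useradd kaveadmin
passwd kaveadmin
usermod -aG wheel kaveadmin
# then, as root, run visudo and uncomment the %wheel rule if it is still commented out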

Enable VNC on gate: vncserver. Now we are ready to VNC (hoping no port is blocked...).
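If the VNC port turns out to be blocked from outside, the same tunnelling trick used for the Ambari webapp can help; a sketch, assuming the first VNC display (port 5901) on gate:

# forward local port 5901 to the VNC server running on gate
ssh -L 5901:localhost:5901 [email protected]
# then point your VNC client at localhost:5901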

Change the root password.

Additional configuration could be needed in the future to address security risks. For example, TLS is mandatory when installing XRDP.

AmbariKave 2.0-Beta

Installing a hadoop cluster within a test big-data environment

Dante, a KAVE beta tester with an already existing Greenplum DB-based big-data test cluster (including the general Apache Hadoop distribution), was looking to expand capabilities in the hadoop world to benefit from the hadoop ecosystem. Here is what Dante has to say about the experience:

Introduction

I like to think of myself as a contributor and not merely a consumer of free software and support. I test things all the time and I do my best to alert developers of any problems I find.

My clients know I’m accustomed to uncharted waters. For weird state-of-the-art POCs, they come to me. Recently, I was asked “What do you think about this Kave thing that KPMG is working on? Is it just another Hadoop distro? Should we use it? Why don’t you try it out and let us know?” So there I went and here is my first look for the community at large.

Approach

I started browsing the tool site (which, I must say, is concise and well organized) and to my utter surprise found out that KAVE was open source. I concede that at first, before I saw that it was open source, I was quite terrified about testing any closed-source software; my big fear was that this could be a Trojan, or a spy. Also, there were, at the time, no toy servers to spare. I would have to use my work laptop for testing. I called it Battlefield KAVE... into the belly of the beast!

I decided to install on my notebook. It’s a nice Dell running CentOS, and I find it quite fast. Kave is surprisingly compact. Exercise some caution and allow some time for the installation, because it installs a lot of dependencies.

What issues were encountered and how were they fixed?

As you may imagine, I use other tools on this same machine, Greenplum DB and Apache Hadoop for example. A day or two after installing Kave, I repeated a previously successful experiment with Greenplum DB, and it would not start. During the Kave installation I saw that it installed PostgreSQL, and Greenplum DB is based on PostgreSQL, so I was inclined to blame Kave. After a period of inquisition (pun intended), I found Kave was innocent.

I also experienced problems with DHCP and non-fixed internal IPs in my environment. (Ed.: now added to the install guide!) Once I'd fixed this temporarily, I would offer this general advice: most big data tools are designed to be run in a closed environment, with fixed local IP addresses. Installing them [Ed.: e.g. installing hadoop from AmbariKave] directly on a laptop or other LAN PC may be a bad idea. Your best option for development work may be to install them in a VM environment with static network addresses.

Aside from that, the installation unfolded without complications and no retry of any part was needed. So far I have had no issues running the cluster. The tool is promising. If you share my view that open-source constructs are by definition more transparent and trustworthy, I recommend you consider Kave for your Big Data project.

AmbariKave 1.2-Beta

Installing a hadoop cluster in situ in a data centre

This was a really interesting project. A group was interested in processing 300GB of transactional data in a hadoop cluster to benchmark it against their standard proprietary solution.

What did we start from?

We started with having no hardware, having a good understanding of Ambari and hadoop, and the example blueprints. This is a pretty basic place to start!

How long did it take?

  • Resourcing lead time: 1 week
  • Stage A deployment (installing Centos6 nodes/Hypervisors) : 2 days
  • Stage B deployment (installing Kave from the installer): 5 hours
  • Relocating to the data centre: 1 week
  • Stage C deployment (fine-tuning, loading data and fixing issues): 2 days

So, all in all, it took one week of work and two weeks of downtime besides: waiting for the blades, and waiting for the cluster to move to the final datacenter.

How did we do it?

  1. We first considered what software we would be running. In this case we opted for a simple three-node hadoop setup with an additional gateway for the data scientists to have a playground, and a separate machine to run gitlabs for code management.
  2. This helped us to choose the resources we needed. We followed the recommended specs of Ambari, and ended up with a 4-blade setup, where each blade had 24 cores, 112 GB of RAM and 12 TB of HDD space.
  3. We allocated three of these blades as physical data nodes, and installed Centos6 on them. On the remaining blade we installed a KVM hypervisor and sub-divided it into two namenodes, a gateway machine, an ambari machine, and a dev machine for gitlabs.
  4. While this was happening, we started from the simple hadoop blueprint example, and we iterated over it, installing and re-installing it, on Amazon (AWS). This really sped up the whole process. Using the existing AmbariKave test structure we created three files, test.aws.json test.blueprint.json and test.cluster.json, and we matched the aws resources to the same or similar size we knew we were getting from the blades. After around 6 hours we had a very workable blueprint for a small hadoop cluster, gitlabs, kavetoolbox and the kavelanding page.
  5. In what turned out to be a really, really good idea, the network administrator for the data centre configured a fake network architecture for the blades which matched exactly the network setup in the data centre, with one exception: these machines were located in an office where we could reach them for the initial setup, and the nodes were all given access to the internet for software installation. The network was identical down to the assigned IPs, domain names, nameserver, subnet structure and firewall, so that we could be sure that the nodes would work when rebooted inside the data centre.
  6. The network admin/sys admin followed the requirements shortlist to ensure the cluster was set up correctly.
  7. Once the servers were configured, with Centos6 installed, we went to the office with the blades and began the installation from the pre-prepared blueprint. We needed: (a) the quickstart instructions on the website (b) the deployment package from the repo, to get the deploy_from_blueprint script (http://repos.kave.io/noarch/AmbariKave/1.2-Beta/ambarikave-deployment-1.2-Beta.tar.gz) (c) Our pre-tested blueprint with us on a memory stick (d) A copy of the restart_all_services script. (e) A copy of the clean.sh script.
  8. Installed AmbariKave onto the ambari node from the quickstart instructions on the website
  9. Tried to deploy our blueprint using deploy_from_blueprint.py our.blueprint.json our.cluster.json (many problems encountered here)
  10. Monitored through the web interface on the ambari node :8080
  11. Made fixes and changes where necessary (see below)
  12. Once the cluster was up and running, it was shut down, moved to the datacenter and restarted. This worked great :)
  13. With the cluster back up and running, we obtained access through the new firewalls now in place and we could immediately work on our KAVE.
  14. We found we needed to adjust the yarn resource parameters from Ambari to better match the available CPU and memory of the data nodes. We also needed to add users in FreeIPA and on HUE, and to go to the namenode and add all users expected to use hadoop to the hadoop group separately (sudo usermod -a -G hadoop myuser); a sketch follows this list.
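A sketch of the user-onboarding step from item 14, where "myuser" is just a placeholder name:

# on the FreeIPA server (in this setup, the ambari node), create the user via the FreeIPA CLI
ipa user-add myuser --first=My --last=User --password
# on the namenode, add the same user to the hadoop group
sudo usermod -a -G hadoop myuser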

What issues were encountered and how were they fixed?

  1. Ambari agents not talking to the ambari server! The deploy_from_blueprint script is clever: it stops at the first sign of trouble so you can go and investigate. In this case it was telling me that there were nodes not connecting to the ambari server, and so not appearing in the host list of the ambari server. We first noticed that both selinux and iptables were still running on the nodes. We turned these off, and then at least some of the nodes were visible to the ambari server.
  2. Ambari agents still not talking to the ambari server! One of the three data nodes was still refusing to talk to the ambari server. We tried a complete re-install of Ambari, using the clean.sh script, and we tried restarting the nodes themselves, but this was not the problem. Checking the ambari agent logs, there was some problem authenticating to the server over SSL. This was odd, since all the other nodes could communicate fine. By coincidence we noticed that data001 was the only node whose system clock was randomly set slightly earlier than the ambari node. We installed ntp, reset the clocks based on the existing DNS clock, and then performed a re-install of the agents to get new SSL certificates allocated by ambari. Since many of the machines were running under the same hypervisor, those were all already synchronized, and of the three data nodes which could be out of sync, only data001 was set earlier in time with respect to the rest. This was a very interesting problem to debug; in future we will add "install ntp" to the instructions!
  3. KaveToolbox 1.2 installation failing! At one point the KaveToolbox installation uses rsync to copy files. However, KaveToolbox 1.2 does not itself install rsync, and it was not already installed on the centos6 minimal images we were starting from. So, we used pdsh to issue the correct install command to all nodes (pdsh -R ssh -g our yum -y install rsync).
  4. Machines becoming isolated in the network! During the installation of FreeIPA, reverse lookups suddenly stopped working within the cluster, causing a lot of problems, and no external urls could be resolved, meaning the installations of other services were just timing out trying to download/install packages. We installed bind-utils to try and debug this and quickly realized the problem. The FreeIPA "forwarders" parameter was incorrectly set. FreeIPA was taking over the DNS, but not forwarding lookups to the existing DNS gateway to the rest of the world. To fix this all we needed to do was go to the FreeIPA web interface on the ambari node and configure the forwarders to use the correct IP.
  5. Hive/tez/oozie apps not existing on hdfs! When we first booted the cluster, running sql queries through Hive was failing: it was complaining it could not access certain jar/tar files which were supposed to be on hdfs. This problem fixed itself. Due to all the other installation problems there were several points where the datanodes could not talk to each other or to the namenode and vice versa, and so these files had not been replicated correctly. The next time we sat down to debug this, the files had been replicated and there was no more problem to fix.
  6. Optimization: very low usage of the cluster by yarn jobs. The first test jobs we ran used only 10% of the available CPU and <10% of the available memory, and took a long time to complete. We needed to optimize the parameters of both yarn and mapreduce. In yarn we needed to match the possible memory size to what was available on the nodes (ssh into the nodes, run free -m, and then leave some margin for the OS). We also upped the ram for mappers and reducers. In the end we settled on 2GB per mapper, 4GB per reducer, and 1GB sorting buffer. (yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.io.sort.mb)
  7. Order-by jobs not working. During optimization we saw errors in some jobs like: Unable to initialize any output collector. This was because we'd accidentally set mapreduce.io.sort.mb too high, as explained here. The fix was to set mapreduce.io.sort.mb to at most 1/2 of mapreduce.map.memory.mb.
  8. Error: Java Heap Size. During optimization we saw some errors in jobs like: Error: Java Heap Size. This was because the two other mapreduce properties for creating java child tasks were not set properly, as explained here. The fix was to set mapreduce.map.java.opts to be equal to mapreduce.map.memory.mb and to set mapreduce.reduce.java.opts to be equal to mapreduce.reduce.memory.mb.
  9. Hive jobs not using memory. Hive also has several memory-usage parameters which needed tuning before jobs really started to use the memory they were assigned. hive.heapsize was set equal to mapreduce.map.memory.mb. hive.exec.reducers.bytes.per.reducer was set to 1GB. After this tweaking, the hive jobs finally started to use a lot of memory and got a lot faster as a result.

HDP has a set of tools to help optimize YARN memory settings.

It was great to see the cluster working so quickly!

KaveToolbox 1.2-Beta
