Detailed Guides Examples

[email protected] edited this page May 1, 2015 · 5 revisions

Detailed Guides - Examples

The KAVE comes with a whole raft of tools aimed at solving the problems you may encounter in data science. This is useful to know, but without concrete examples, how can you tell whether those tools apply to your own data science problem?

In this short guide we present a few different "solved problems", drawn from our personal experience of working with KAVE. For more concrete examples, see the KAVE in Action document on the main beta-testing website.

Case 1: Vendor-locked data silo for a retailer

Case premise:

A retailer employs a third party contractor to manage a relational database of all of their transactions. The contractor also manages the infrastructure, from hardware through to network, the access rights and in principle the schemas of the database.

Because the data are stored in a proprietary format, and because of existing policies, it is extremely difficult to work with these data in an exploratory and scientific way. For the time being it is also prohibited to allow an external data science team into the original data silo, or to install third-party software within it.

Case solution:

Here we would use:

  • KAVE hosted at a third-party hosting provider (ISO27001/SOC2) with a dedicated contract. The original data owner holds this contract, so they still control the hardware on which the data reside, as an effective extension of their existing data centre.

  • ETL from the existing database into a relational database stored on Hadoop (Hive). ETL is a complex process and can take many successive steps to address data quality issues and validate incoming data. We would use Pentaho Kettle to manage the workflow. The workflow is initiated by the infrastructure engineers of the original data owner, meaning they remain the only persons with access to their original data warehouse. Because the data are moved outside of the production environment, the analysis cannot interfere with production systems in any way. The transfer itself can be done with Sqoop, which works with any JDBC-compliant database and also has dedicated connectors for Oracle and other popular databases.

  • Hive and iPython for data exploration. Hive exposes an SQL interface to your relational data, which is a very common way to interact with data. Hive can be used to parallelize the largest part of the data handling, using the full power of the Hadoop cluster. iPython contains the tools we are familiar with and integrates with ROOT and R, giving our data scientists a common interface they already know. A whole host of examples of complex statistical methods are available online for bootstrapping into your code and getting as close to the data as possible. Apache Spark can be used to bridge the gap between what is possible in the stand-alone iPython notebook and what is possible with the full power of Hadoop, and coupling with ROOT can avoid the limitations of local RAM size.

  • GitLab and TWiki can be combined for co-developing a codebase of analytical functions, SQL queries and even complete workflows without stepping on each other's toes, and they provide a route to a production-quality embedded system.

  • MongoDB and JBOSS. The end point of the analysis is likely to be a series of reports and plots. These can be stored directly into the intermediate database of MongoDB and served back into the existing business infrastructure via JBOSS. In this case the firewall of the KAVE would be specifically configured to allow JBOSS to be connected to one or more external web applications, or to re-load data back into an existing BI system.
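
The ETL bullet above mentions steps that address data quality issues and validate incoming data. As a rough illustration only, such a validation step might be sketched in Python as follows. The field names, the timestamp format and the function names are hypothetical and not taken from any real schema or from Pentaho Kettle; in practice this logic would live inside the ETL workflow itself.

```python
from datetime import datetime

# Hypothetical required fields for one transaction record (an assumption).
REQUIRED_FIELDS = ("transaction_id", "store_id", "amount", "timestamp")


def validate_row(row):
    """Return a list of problems found in one record (empty if it is clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            problems.append("missing " + field)
    if row.get("amount") not in (None, ""):
        try:
            float(row["amount"])
        except (TypeError, ValueError):
            problems.append("amount is not numeric")
    if row.get("timestamp") not in (None, ""):
        try:
            # Assumed timestamp layout; adapt to the real source system.
            datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        except (TypeError, ValueError):
            problems.append("timestamp is not YYYY-MM-DD HH:MM:SS")
    return problems


def split_valid_invalid(rows):
    """Separate clean rows from rejected rows, keeping the rejection reasons."""
    valid, rejected = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected
```

Keeping the rejected rows together with their reasons, rather than silently dropping them, gives the data owner's engineers something concrete to review before the next load.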

Case 2: The sudden data influx short-term PoC

Case premise:

A new dataset suddenly becomes available that appears a little too large to process effectively on one laptop. It contains no personally identifiable information but is considered in some ways sensitive. Some of the data are relational, other parts are plain free-text data. The data must be processed on dedicated hardware in the building of the data owner, but the lead time available to get started is only a matter of days. Within a week or so a business case must be developed for using these data within the organization.

Case solution:

Here we would use:

  • KAVE hosted on off-the-shelf hardware brought into the data owner's building, installed with a single click from a previously tested blueprint. To this we would connect perhaps a couple of development laptops: powerful machines with the KaveToolbox installed, one per analyst.

  • Hadoop provides a common place to store the relational data in a Hive database, and also the non-relational data, in a well-backed-up manner. Mining free-text data is one of the key purposes for which Hadoop was first developed, and it retains its strengths there to this day. Data can be processed on Hadoop to form summary information from the free-text data, and coupled to complex tools to perform sentiment analysis, for example.

  • KaveToolbox provides common tools across the developer laptops and the KAVE, enabling fast analysis turn-around thanks to an integrated common approach. Small amounts of data would be transferred back to the laptops as a result of Hive queries or exploratory Hadoop jobs.

  • iPython notebooks are again the key tool here. In a fast-paced PoC, Python is ideal: it is a very quick language in which to prototype code, and it has the tools necessary for the most powerful statistical analyses thanks to integration with ROOT, R and numpy.
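
To make the idea of forming summary information from free text more concrete, here is a minimal sketch of a map/reduce-style term count, written in plain Python. It only emulates the map, sort and reduce phases locally so that the logic can be prototyped in a notebook; it does not call any Hadoop API, and the function names are illustrative assumptions. On a cluster, the same two functions would need to be wired into a real job, for example via Hadoop Streaming.

```python
import re
from itertools import groupby

# Simple tokenizer: lower-case words, apostrophes allowed (an assumption).
TOKEN = re.compile(r"[a-z']+")


def mapper(line):
    """Map phase: emit a (term, 1) pair for every word in one line of text."""
    for term in TOKEN.findall(line.lower()):
        yield term, 1


def reducer(term, counts):
    """Reduce phase: sum the counts emitted for a single term."""
    return term, sum(counts)


def run_local(lines):
    """Emulate map, shuffle/sort and reduce on one machine for prototyping."""
    pairs = sorted(pair for line in lines for pair in mapper(line))
    return dict(
        reducer(term, (count for _, count in group))
        for term, group in groupby(pairs, key=lambda pair: pair[0])
    )
```

For example, `run_local(["Great service, great price", "Poor service"])` produces a per-term count that could feed a later sentiment analysis step.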

Case 3: Realtime data just keeps coming

Case premise:

A new opportunity has been identified to mine streaming data to improve the customer experience. However, the data volumes expected are very large and the output will need to be embedded within production systems in near-realtime. The eventual use cases for the data collected are unknown and all personally identifiable information must be stripped before storage.

Case solution:

This is the most complicated possible case, requiring every feature of the KAVE, and is the purpose for which the KAVE was originally designed. The main problem with realtime data is that the data size just continues to grow with time. As more data become available, what you would like to do with those data also grows, meaning scalability is needed not only horizontally but also functionally.

Here we would use:

  • KAVE hosted at a third-party hosting provider (ISO27001/SOC2) with a dedicated contract. If the data after stripping PII are not considered very sensitive it makes sense to consider hosting in a virtual private cloud with one of the large providers.

  • A dedicated contract with a trusted third party to pseudonymise the data as they are streaming, replacing personal information with unique random strings before the data ever enter the KAVE.

  • KAVE running a complete lambda architecture, with a common Java codebase (managed with Archiva) that can run equally well in batch mode on Hadoop or in real-time mode on Storm. Kafka and Storm are used to ingest the data, and Hadoop and MongoDB are used to store the processed or pre-processed data.

  • The complete development line and analytics tool set. Data scientists are granted access to a gateway machine and from there can explore the available data, work on improvements to the code and to the analytics applied, and embed the solution back into the existing workflow, with continuous integration and code quality checks provided by Jenkins, GitLab and SonarQube.
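
The central idea of the lambda architecture mentioned above is that one piece of business logic serves both the batch path and the real-time path. The sketch below illustrates that idea only. It is written in Python for brevity, whereas the text describes a common Java codebase on Hadoop and Storm; it uses no Hadoop, Storm or Kafka API, and all names and the event layout are hypothetical.

```python
from collections import defaultdict


def update(totals, event):
    """Shared business logic: add one event to the running totals per key."""
    totals[event["key"]] += event["value"]
    return totals


def batch_view(events):
    """Batch layer: recompute the view from the full stored history."""
    totals = defaultdict(float)
    for event in events:
        update(totals, event)
    return dict(totals)


class SpeedLayer:
    """Speed layer: apply the same logic to events as they arrive."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, event):
        update(self.totals, event)

    def view(self):
        return dict(self.totals)


def merged_view(batch, realtime):
    """Serving layer: combine the batch view with events not yet batched."""
    merged = dict(batch)
    for key, value in realtime.items():
        merged[key] = merged.get(key, 0.0) + value
    return merged
```

Because `update` is the only place where the aggregation is defined, the merged result of the batch view and the speed-layer view should match a full recomputation over all events, which is the property a shared codebase is meant to guarantee.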
