
Access Security and Privacy


Administration

The KAVE software stack is almost always installed within a dedicated secure network. In this context we recognise the following general security roles, and KAVE provides administrative interfaces for each of them. When only a small team (or one person) installs and uses a KAVE, that team will likely assume all of these roles.

  • Infrastructure Administrator: provisions the cluster (creates/manages hypervisor/vm configuration), configures firewalls and other network elements. The Infrastructure Admin will use existing services such as VCloud or those provided by their cloud provider.
  • Ambari Administrator: Ambari installs and configures services within the cluster. The Ambari Admin manages these configurations and installations via the Ambari admin interface which includes a configuration management framework and auditable logging.
  • User Administrator: FreeIPA provides a common user administration framework for KAVE, providing system users, LDAP and Kerberos. A user administrator interacts with the FreeIPA web interface to administer all users, including the creation of superusers (system administrators) and host-based access control (a command-line sketch follows this list).
  • System Administrator: Users on the linux command line can be promoted to super users (systems administrators) by the User Administrator, giving them access to all services and system tools on specified groups of machines in the cluster. This system admin can then effect repairs or troubleshooting across the cluster; at least two System Admins are usually required.
  • Service Administrator: The Ambari Admin and User Admin can together delegate service administration to individual service admins, for example delegating Jenkins user control or JBoss management privileges to a specific user. This is usually assessed on a case-by-case basis. A combination of FreeIPA, Ambari and the service itself provides the management interfaces.
  • Data Strategy and Governance: In the case of personal or sensitive data, a mature environment will have a dedicated team structure, procedure, hierarchy and tools for data governance. Apache Knox, Ranger, Falcon and Atlas are the HDP solutions for data governance with Hadoop; they are included with KAVE and are used especially within a data-lake paradigm.
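For illustration, assuming the FreeIPA command-line client is available to the User Administrator, and using hypothetical names (jsmith, analysts, gate-001), typical user and host-based access control administration might look like this:

    # authenticate as a FreeIPA administrator (Kerberos)
    kinit admin

    # create a new analyst account and add it to an existing analyst group
    ipa user-add jsmith --first=Jane --last=Smith
    ipa group-add-member analysts --users=jsmith

    # restrict where the analysts group may log in, via host-based access control
    ipa hbacrule-add analysts_gateway_access
    ipa hbacrule-add-user analysts_gateway_access --groups=analysts
    ipa hbacrule-add-host analysts_gateway_access --hosts=gate-001
    ipa hbacrule-add-service analysts_gateway_access --hbacsvcs=sshd

The same operations are available through the FreeIPA web interface; the command-line form is shown only because it is easy to audit and to script.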

Separate from these roles we see two distinct types of users of the system, which differ as follows:

  • Data Science Team: works within the existing services, inside the cluster, adding data or working with existing data, and escalating any issues to the relevant administrator
  • Insights consumer: an external webpage or individual who is granted rights by the combination of Infrastructure Administrator, Data Governor and User Administrator to view the processed data being output over whatever frontend is created.

Control and Governance

Security Standards

KAVE software is modular, is configured by the end user, and is provided as-is with no warranties or guarantees of any kind. However, we take security concerns very seriously during KAVE development and include security reviews in our release cycle.

Security can be split into three main categories: security of the software itself and its development process, security of an implemented/provisioned KAVE cluster, and security of the cluster's usage in terms of access procedures and data transfer. We (the KAVE team) release the software, and so we take care as far as possible to validate the software itself and the development process. We (the KAVE team) have no responsibility for the implementation/provisioning of KAVEs by unknown third parties, nor for their usage. However, we try to be as helpful as possible by performing regular penetration testing on our own internal test KAVE instances, which we have provisioned with security in mind, and by giving the following guidelines to help you decide how to set up and run your cluster.

The software itself and its development process are in principle OWASP level 1 compliant. Higher levels are achievable through careful provisioning, control and implementation by the party installing KAVE.

Why is this important?

Data security is of paramount importance when that data falls into one of the following categories:

  1. Personally Identifiable Information (PII)
  2. Sensitive (internal) information
  3. Secret: Information of interest to competitors, particularly financial information
  4. Financial data used to generate financial statements, tax statements, or reports whose handling must be understood from key auditable principles or is covered by relevant financial regulations

In your country there is likely to be a specific legal framework which applies to the nature of the data in question and requires you to make very strict decisions on how you handle your own data. KAVE is built with this in mind. We know you have many different requirements, and these can often reduce the flexibility needed to take advantage of the full power of your data. KAVE is designed to give you back your peace of mind.

  1. Strip: remove PII or sensitive data where possible. Only send to the KAVE the data you choose to send there; all communication should be end-to-end asymmetrically encrypted.
  2. Think: consider if the PII you are sending falls within the purposes for which you have gathered it, and also consider if it is absolutely necessary for the solution you envisage.
  3. Pseudonymise: replace PII with pseudonyms. A trusted third party can be used to remove the 1:1 relationship between customers/individuals and entries in your database while still allowing the combination of different sources through a pseudonym; "Frank Jones" is replaced by "RandomCustID#2202291919".
  4. Separate: Provision KAVEs for different purposes as necessary. Throwing all your data together into a multi-tenant environment is costly, time-consuming, and dangerous. It provides a honey-pot for hackers. It is also difficult to then guarantee analyst independence and to track usage for auditability. Avoid this by creating dedicated KAVEs for certain purposes and only sharing the insights between them.
  5. Blind: Consider a "blind analysis". Blind analysis can refer to many different approaches; here we refer to the case where an analysis is developed on simulated data and a small subsample, and then applied to the full sample later without the possibility of adjusting the final result. This engenders impartiality and statistical rigour in your outcomes. It also helps reduce the impact of using PII and prevents discriminatory profiling.

Recommended network features

  • very specific firewall rules
  • INBOUND: only on ssh or https ports (guaranteed encrypted)
  • INBOUND: only to the gateway or JBOSS machine (only two edge nodes), also to the gitlabs machine if you wish to re-use the code in other environments
  • OUTBOUND: permit everything, ideally not using a proxy or blacklist
  • INTERNAL: restrict outgoing destinations of the JBOSS server to forbid direct access to the source data
  • INTERNAL: Permit all other traffic on all ports
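As an illustration only: assuming a plain iptables-based edge firewall in front of the cluster, and using hypothetical host names and an internal subnet (gate-001, jboss-001, data-001, 10.0.0.0/24), rules along the following lines would express the recommendations above. In practice your hosting provider's firewall or security groups are usually the better place to implement them.

    # order matters: more specific rules first
    # INTERNAL: forbid the JBoss edge node direct access to the source-data node
    iptables -A FORWARD -s jboss-001 -d data-001 -j DROP
    # INBOUND: only ssh to the gateway and https to the JBoss edge node
    iptables -A FORWARD -p tcp -d gate-001  --dport 22  -j ACCEPT
    iptables -A FORWARD -p tcp -d jboss-001 --dport 443 -j ACCEPT
    # OUTBOUND/INTERNAL: traffic originating inside the cluster is permitted,
    # as are replies to connections the cluster itself opened
    iptables -A FORWARD -s 10.0.0.0/24 -j ACCEPT
    iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
    # everything else inbound is dropped by default
    iptables -P FORWARD DROP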

Data security:

  • Your own internal policy applies: granting access to the KAVE as an analyst is considered the same as granting access to the data stored on the KAVE
  • With a large enough gateway node, data never needs to leave your KAVE, although sometimes you may want the flexibility to allow analysts to download parts of the data
  • Provide the minimal set of data to the KAVE to accomplish your purpose

Privacy concerns:

Privacy of personally identifiable information is the responsibility of all parties concerned. It is not always necessary to take extreme measures, but each situation needs to be evaluated on a case-by-case basis. The privacy of potentially sensitive information is often less legislated and relies on your own judgement, along with potential NDAs which might need to be signed at your discretion.

  • You must understand and comply with local rules and regulations which extend to your own data and how you obtained that data
  • Strip if possible: reduce your risk and ensure compliance by stripping certain data away from the packet before sending it to your KAVE
  • Pseudonymise if necessary: a third party can ensure your data are no longer personally identifiable
  • Consider NDA: perhaps the corporation, or each analyst, will need to sign your NDA
  • Reduce cross-feed: Ensure different KAVEs for different data science teams if working on different data for different solutions
  • Independence and Professionalism: Consider the independence and professionalism of your teams in light of their connections to potential third parties

Remember that the KAVE is "yours", a dedicated environment; in principle you are sending your own internal data to your own internal system.

Remember, with your own KAVE, control of the data is in your hands: only send the data you are comfortable with.

Analyst Access:

  • Heavily restrict the ability to modify the firewall rules; this is the main concern.
  • One account per analyst, ideally able to see all the data stored in the KAVE and use all tools installed there
  • ssh-key based authentication where possible, otherwise an industry-standard password policy (a key-generation sketch follows this list)
  • Consider granting sudoer rights for all analysts on the gateway node, or at least one member of each analytics team
  • Grant at least one member administrator privileges across the cluster
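As a sketch of how key-based analyst access can be set up, assuming OpenSSH on the analyst's machine and the FreeIPA CLI for the User Administrator (the user name is hypothetical):

    # on the analyst's own machine: generate a personal key pair
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_kave

    # the User Administrator then attaches the public key to the FreeIPA account,
    # so it is distributed to all enrolled hosts
    ipa user-mod jsmith --sshpubkey="$(cat ~/.ssh/id_rsa_kave.pub)"

The private key never leaves the analyst's machine; only the public key is stored centrally.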

"the root user"

  • Do not allow anyone to ssh into your system as the root user unless they authenticate with the correct SSH key pair
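One way to enforce this on each node, assuming the stock OpenSSH daemon, is an sshd configuration along these lines (reload sshd afterwards, for example with service sshd reload):

    # /etc/ssh/sshd_config (excerpt)
    # root may only log in with a key, never with a password
    PermitRootLogin without-password
    # optionally, forbid password logins for all users
    PasswordAuthentication no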

Keep up-to-date

  • Ensure your software is kept up to date with regards to security vulnerabilities and mandatory patches

Accessing your KAVE

Your KAVE is not treated as a remote static system where analysts are restricted to a short-list of very basic tools. Instead the KAVE grants your data science team the possibility to work in the way which is best for them; it is flexible enough to adapt to your analysts and your data so you can really use the power of the data.

KAVE is a linux-based system. This implies that at some point your data scientists will need to access a linux system. Most often this is achieved using a combination of firewall rules, VPNs, secure shell connections (ssh) and remote desktop sessions (using either Microsoft RDP, open-source VNC, or other related remote desktop systems such as NoMachine or AWS WorkSpaces).

Exactly how you accomplish this is up to you, the regulations of your department or company, your network configuration, where the KAVE is located, and your level of expertise.

Video tutorial

Accessing KAVE is covered in the video tutorial here: https://www.youtube.com/watch?v=eBgr2wXjOZw

Connecting to KAVE

There is also a text guide below, and a separate user-based wiki page here: User-Manuals

Most basic access

In most cases, granting analysts access to the data is the same as granting them access to your KAVE. To access your KAVE the most basic requirements are:

  • An ssh client locally installed (for example PuTTY on Windows, or the standard client available by default on Mac/Linux)
  • A port open through your firewall to connect to the gateway of your KAVE
  • Knowledge of your local proxy setting (if any)

Given that you have these three very simple and standard things, you can access your KAVE.
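For example, from a Mac/Linux terminal (the host name and user are hypothetical; PuTTY exposes the same options through its GUI):

    # direct connection to the gateway node over ssh
    ssh jsmith@gateway.mykave.example.com

    # the same connection tunnelled through a local corporate https proxy,
    # here using the small 'corkscrew' helper; only needed if a proxy sits in the way
    ssh -o ProxyCommand="corkscrew proxy.mycompany.example 8080 %h %p" \
        jsmith@gateway.mykave.example.com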

Note that:

  • SSH is always encrypted
  • SSH communication is end-to-end encrypted, avoiding man-in-the-middle attacks
  • SSH allows for multiple hops: if you need to access different resources in your KAVE you can double-hop through SSH, and this is then doubly encrypted (see the example after this list)
  • SSH is a globally used standard on a wide range of web servers and solutions
  • SSH does not give anyone else access to your local computer
  • SSH cannot be 'sniffed' for passwords and does not grant anonymous access
  • There is no sufficiently mature alternative to an ssh-based connection
  • Permitting only ssh-key access is a choice which can make brute-force password hacking impossible
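A double hop, assuming an OpenSSH client and hypothetical host names, looks like this:

    # recent clients (OpenSSH 7.3+): hop via the gateway to an internal node in one command
    ssh -J jsmith@gateway.mykave.example.com jsmith@node-001

    # older clients can achieve the same with ProxyCommand
    ssh -o ProxyCommand="ssh -W %h:%p jsmith@gateway.mykave.example.com" jsmith@node-001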

It is likely that you will also want to install a file-copying tool, such as WinSCP.
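On Mac/Linux the standard scp client does the same job as WinSCP, over the same encrypted connection (names again hypothetical):

    # copy a local analysis script up to the gateway
    scp ./my_analysis.py jsmith@gateway.mykave.example.com:~/
    # copy a result file back down
    scp jsmith@gateway.mykave.example.com:~/results.csv .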

If you are unable to install an ssh client, usually because you are on a centrally managed windows machine or behind a very restrictive firewall, then it is likely you will need to set up an external server where an ssh client can be installed. This is not recommended, as every additional server hop in the network implies an additional point of attack and an additional system to maintain and audit.

Advanced option 1: connecting to internal web UIs

Given that you have a working ssh connection, your analysts (or yourself) will most definitely need access to the software services running in your KAVE, and one easy way to do this is through the web UIs that most of our standard software provides. For this there are two options:

  • VPN or dedicated network

Your network administrator and the hosting provider for your KAVE can help you in setting up a VPN or dedicated network. With these methods your local laptop/terminal will behave as if it is within the KAVE network, able to access all resources.

  • An ssh proxy redirect.

If you are unable to set up some sort of VPN, you can route specific network traffic through your secure ssh session. This is equivalent to a VPN, but without the infrastructure overhead: you only need an ssh connection and some local software to apply the correct local proxy settings. This can be accomplished, for example, with Firefox and the FoxyProxy plugin.
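A minimal sketch, assuming an OpenSSH client and a hypothetical gateway name: open a dynamic (SOCKS) proxy over ssh and point the browser at it.

    # open a SOCKS proxy on local port 1080, tunnelled through the gateway
    ssh -D 1080 -N jsmith@gateway.mykave.example.com

    # then configure Firefox/FoxyProxy to use SOCKS host localhost, port 1080;
    # internal web UIs (for example the Ambari interface) become reachable by their internal names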

Advanced option 2: A remote-desktop experience

You may wish the entire development of solutions and visualization of results to take place on your own dedicated KAVE, without the option of copying data back to a developer/analyst laptop. In this case you will need a desktop experience within the KAVE network. There are several options for this as well; one free way is to use VNC over ssh.

VNC is the natural equivalent of Remote Desktop for a linux platform. You can pass your remote desktop through your encrypted ssh connection. This is a very secure way to obtain a full development environment within your KAVE.
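A sketch of the idea, assuming a VNC server (for example TigerVNC) is installed on an internal node and using hypothetical host names:

    # on the internal node: start a VNC server for your desktop session (display :1 = port 5901)
    vncserver :1

    # on your laptop: forward local port 5901 to that node, hopping through the gateway
    # so that only ssh crosses the firewall
    ssh -L 5901:node-001:5901 -N jsmith@gateway.mykave.example.com

    # finally point a VNC viewer at localhost:5901; all desktop traffic stays inside the tunnel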

What do we gain from the complete KAVE graphical experience?

Since you have a full desktop environment, with controlled/administered sudoer privileges and access within the KAVE, encrypted natively, from your windows/mac/linux desktop you gain:

  • Security of code and solutions as well as data. Often the code you use to do your analysis should be considered proprietary and may give insights about the data you have, if it falls into the wrong hands. By implementing one of the two above solutions you secure the code just as well as you secure the data itself.
  • Developing close to the data. The shorter the round-trip between developer and data the better; the analysts will love to get their hands straight into the data from where they are and to use the tools they find the most appropriate.
  • Transferable skillset and very-highly-sought-after expertise. Data Scientists comfortable on a linux platform and able to design their own solutions from first principles are incredibly valuable to your organization.

Typical role-based access

The following roles can be managed separately if required, ordered from least restricted to most restricted:

  • Access to services: an individual with access to the KAVE is granted access to at least some list of services holding at least some of the data. Access to individual services can be managed separately using groups within FreeIPA. It is also possible to configure data access based on roles, but we do not really recommend KAVE for a multi-tenant environment.
  • Administration rights on certain nodes: it is usual for certain data science tasks to require installation of software, and this often needs so-called superuser or sudoer rights (a FreeIPA sudo-rule sketch follows this list).
  • Ambari configuration management: access to the Ambari web interface to control and configure the cluster, add and monitor services.
  • User management role: access to the FreeIPA management interface only. Presumably from a dedicated network.
  • Network and infrastructure administration: configuring the firewall rules, spawning machines within the cluster, maintaining and monitoring the cluster's health.
  • Terminal access to the Ambari administration node: should be heavily restricted, and by default forbidden to all users.
  • Admin access to the Ambari administration node: Granting admin rights on the admin node gives a person total control over the cluster. It is not needed for most purposes, usually access to the Ambari interface and FreeIPA is sufficient, but for certain new installation procedures, security updates, etc, this might be necessary.
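As an illustration of delegating sudoer rights on specific nodes only, assuming FreeIPA-managed sudo and hypothetical names (architects, gate-001):

    # grant the 'architects' group sudo rights on the gateway node only
    ipa sudorule-add architects_gateway_sudo
    ipa sudorule-add-user architects_gateway_sudo --groups=architects
    ipa sudorule-add-host architects_gateway_sudo --hosts=gate-001
    # allow all commands within this rule
    ipa sudorule-mod architects_gateway_sudo --cmdcat=all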

This naturally leads to the following roles in and around a small data science team:

  • Data Scientist: Access to all services and data on the cluster, ingress over SSH on the gateway machine only. Usually members of the team.
  • Data Architect: Access to all services and data on the cluster, ingress over SSH on the gateway machine only, access to the Ambari web interface, sudo rights on certain nodes. Usually at least one member of the data science team.
  • Supporting Administrators: Access to all services and data on the cluster, ingress over SSH on the gateway machine only, sudoer on all nodes, access to the Ambari interface and FreeIPA interface. Perhaps also granted to a member of the data science team.
  • Global Administrator: Complete control, using different usernames/passwords where required, from a very restricted location. Usually a member of the hosting provider's team, or a dedicated infrastructure team.

This naturally leads to more roles in a production environment, potentially with time-dependent role escalations. Read more about FreeIPA inside a KAVE
