Data Storage Workshop – Dr Jean-Paul Ebejer

In this workshop (typically for Data Science and Big-Data sessions) we are going to look at storing data in database systems, as opposed to storing them in text files (e.g. .csv).

Requirements

The following assumes you have a base Linux (Ubuntu 22.04 LTS) installation. Other more recent versions may work too (let me know otherwise). If you are a windows fan, may I suggest you create a beefy VM (using VirtualBox or on the cloud) and install Ubuntu there.

You will need a ton of different software (configuration, etc.) for this workshop. At the very least you will need mysql, memcached, mongodb, cassandra, neo4j, and Python drivers for all of these. Normally, this would be a pain, but we will be using docker so everything is conveniently installed and set-up for you and placed in different docker containers (you're spoilt these days!).

The rest of the workshop (and practical) assumes you have docker installed. This can be achieved readily (on Ubuntu 22.04 LTS) with:

# Install dependencies
sudo apt-get install -y ca-certificates curl gnupg lsb-release
# Add repository key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Add repository for docker packages
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
# Update the apt package index
sudo apt-get update
# Install new packages
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
# Check docker service is running
sudo systemctl status docker
# Add docker role to your currrent user (to avoid sudo each time, but groups are updated at login)
sudo usermod -aG docker ${USER}
# Check all of the installed software works with:
sudo docker run hello-world

After adding the docker group to your user, you will need to log out and in again to show the new group (check with the id Bash command). An alternative to logging in and out again is to use su - $USER (this effectively re-logs you).

The Ubuntu instructions to install docker where taken from the docker site.

Clone this repository using git clone [email protected]:jp-um/data_storage_workshop.git. (If you do not have your ssh keys and/or account set up in github you can simply git clone https://github.com/jp-um/data_storage_workshop.git instead). You should then run this workshop from the root directory of the repository. To install git on Ubuntu, simply sudo apt install git.

SQL (Relational) Databases

We will use the official latest mysql containers for the relational databases part of the workshop.

This is easily achieved as follows:

# Get the latest (official) image for mysql
docker pull mysql/mysql-server:latest
# List all images available locally (you should see mysql)
docker images
# Start a container from the mysql/mysql-server:latest image
# You need to give this some time (takes longish to start)
docker run --name some-mysql -e MYSQL_ROOT_HOST=% -e MYSQL_ROOT_PASSWORD=my-secret-pwd -d mysql/mysql-server:latest
# This will tell you what is going on with your instance
docker logs some-mysql
# What are the docker containers running, you should see mysql
docker ps
# From the sql directory, import the structure of the database (and some 
# data) from the insurance.sql script file.
docker exec -i some-mysql mysql -h0.0.0.0 -uroot -pmy-secret-pwd < sql/insurance.sql
# Use this if you want to run the client directly in the docker container
docker exec -it some-mysql mysql -h0.0.0.0 -uroot -pmy-secret-pwd
# Once in the mysql terminal, list all databases
show databases;
# Use our newly created db
use insurance;
# See the tables
show tables;
# See some data
select * from driver;
# To quit the mysql terminal
\q

Skip to the Jupyter Lab section, if you are only interested in running the mysql exercise.

NoSQL Databases

During this workshop I will show you the basic concepts of NoSQL systems. We will use four NoSQL technologies, each representing a different data model -- Memcached, MongoDB, Cassandra, and Neo4j.

NoSQL Database Engines

We will use (the latest) official docker instances to install the NoSQL database engines. Follow the instructions presented in each section. Each of these steps may take a few minutes. All the names of the containers will be prefixed with some- (this is an arbitrary decision).

Memcached

To run memcached execute:

docker run --init --name some-memcache -p 21211:11211 -d memcached

-d detaches the docker container once initialized.

MongoDB

To run mongodb execute:

docker run --name some-mongo -d mongo:4.4.6

Note that the version at the end (i.e. 4.4.6) is important, as the latest (5+) mongo image requires the CPU to have AVX support which may not be available if running via some virtualization software (e.g. Windows host with an Ubuntu Guest using VirtualBox).

Cassandra

To run cassandra execute:

docker run --name some-cassandra -d cassandra

Neo4J

To run neo4j execute:

docker run --name some-neo4j -e NEO4J_AUTH=neo4j/testtest -p 7474:7474 -p 7687:7687 -d neo4j

Note that the above switches off Neo4J authentication (by setting an environment variable in the container).

You should be able to see all running containers with docker ps. Look at the status column of the resulting output.

Jupyter Lab

I have developed example usages of each NoSQL database, using Jupyter lab. This is a browser-based IDE which uses Jupyter Notebooks. First, you should copy all code (as Jupyter notebooks) from the sql and nosql directories in a local directory on your Ubuntu installation (alternatively it is easier to git clone this repository as suggested above). Remember to change the /REPLACE/WITH/YOUR/LOCAL/PATH path below. The last line -- jpebe/data_storage_workshop -- identifies the image, stored on docker hub, which we will be using. Do not change this.

docker run --init -ti --rm \
    --name data-storage-workshop \
    --link some-mysql:mysql \
    --link some-memcache:memcache \
    --link some-mongo:mongo \
    --link some-cassandra:cassandra \
    --link some-neo4j:neo4j \
    -v /REPLACE/WITH/YOUR/LOCAL/PATH:/notebooks \
    -p 8888:8888 \
    jpebe/data_storage_workshop

This container is linked to all the containers we set up (and are running).

You should then be able to copy the Jupyter notebook URL from the terminal into your browser (ctrl-click will open a browser automatically). Note that you may need to change the hostname to 127.0.0.1 (the local host).

Explore and run the NoSQL and SQL notebooks, and understand each command does.

Limitations

The databases we run within the containers have no volume mounted (from Ubuntu). This means that all data created during our exercises is stored in the container which gets lost when you stop running the container. If you want your data to persist, you have to mount volumes (on your Ubuntu installation) and use these as the data directory of the NoSQL engine. For example, to keep a copy of the neo4j data, simply run the docker container in the following manner:

docker run --name some-neo4j \
    -e NEO4J_AUTH=neo4j/test \
    -p 7474:7474 -p 7687:7687 \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/logs:/logs \
    -d neo4j

Where $HOME/neo4j/data and $HOME/neo4j/logs will be mounted in the container under /data and /logs respectively.

Note that in the above workshop, changes to the jupyter notebooks will, however, be persisted as these are mounted from your own local directory.

How to build

To build this image from the dockerfile, all you need is (from the dockerfile directory)

docker build -t jpebe/data_storage_workshop .

When you are done

You may want to reclaim some space on your Ubuntu installation. Remove all images and containers in the following way.

docker rm -f $(docker ps -a -q)
docker rmi -f $(docker images -a -q)

Conclusion

Pull requests (with fixes, even to this guide) will be recieved with appreciation and thanks.

I hope you enjoy this session!

Dr Jean-Paul Ebejer

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
nosql		nosql
sql		sql
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Storage Workshop – Dr Jean-Paul Ebejer

Requirements

SQL (Relational) Databases

NoSQL Databases

NoSQL Database Engines

Memcached

MongoDB

Cassandra

Neo4J

Jupyter Lab

Limitations

How to build

When you are done

Conclusion

About

Releases

Packages

Contributors 2

Languages

jp-um/data_storage_workshop

Folders and files

Latest commit

History

Repository files navigation

Data Storage Workshop – Dr Jean-Paul Ebejer

Requirements

SQL (Relational) Databases

NoSQL Databases

NoSQL Database Engines

Memcached

MongoDB

Cassandra

Neo4J

Jupyter Lab

Limitations

How to build

When you are done

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages