In this code pattern, historical shopping data is analyzed with Spark and PixieDust. The data is loaded, cleaned, and then analyzed by creating various charts and maps.
When you have completed this code pattern, you will understand how to:
- Use Jupyter Notebooks in IBM Watson Studio
- Load data with PixieDust and clean data with Spark
- Create charts and maps with PixieDust
The intended audience is anyone interested in quickly analyzing data in a Jupyter notebook.
- Log in to IBM Watson Studio
- Load the provided notebook into Watson Studio
- Load the customer data in the notebook
- Transform the data with Apache Spark
- Create charts and maps with PixieDust
- IBM Watson Studio: a suite of tools and a collaborative environment for data scientists, developers and domain experts
- Apache Spark: an open source cluster computing framework optimized for extremely fast and large scale data processing
- Jupyter notebooks: an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text
- PixieDust: an open source Python helper library for Jupyter notebooks that simplifies loading and visualizing data
- Sign up for Watson Studio
- Create a project
- Create a notebook
- Load customer data in the notebook
- Transform the data with Apache Spark
- Create charts and maps with PixieDust
Sign up for IBM Watson Studio. Creating a project in Watson Studio also creates a free tier Object Storage service in your IBM Cloud account.
- Create a new project by clicking on the tile as below. Choose Complete and click OK.
- Give your project a name.
- Select an Object Storage from the drop-down menu or create a new one for free. This is used to store the notebooks and data. Do not forget to click refresh when returning to the Project page.
- Click Create.
- Associate the project with an Apache Spark service instance. Go to the Settings tab at the top of the Project page, and then scroll down to Associated Services. Click + and select Spark from the drop-down menu. Select an existing service or create a new one for free.
- Add a new notebook. Go to the Assets tab at the top of the Project page. Scroll down to Notebooks and click +. Choose new notebook From URL. Give your notebook a name and copy this URL: https://raw.githubusercontent.com/IBM/analyze-customer-data-spark-pixiedust/master/notebooks/analyze-customer-data.ipynb
- Make sure you select Spark as your runtime and click Create Notebook.
- The notebook will load.
- Run the cells one at a time. Select the cell, and then press the Play button in the toolbar.
- Make sure the latest version of PixieDust is installed. If you get a warning, run this code in a new cell: `pip install --user --upgrade pixiedust`.
- Load the data into the notebook.
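The version check and the load step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `installed_version` and `load_customer_data` are hypothetical, and the example assumes the data is loaded with PixieDust's `sampleData()` function, which accepts the URL of a CSV file.

```python
from importlib import metadata


def installed_version(package="pixiedust"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def load_customer_data(url):
    """Load a CSV file from `url` into a DataFrame using PixieDust.

    The import is deferred so this helper can be defined even when
    PixieDust is not installed yet.
    """
    import pixiedust  # assumption: run inside a Jupyter kernel with PixieDust installed

    return pixiedust.sampleData(url)


print("PixieDust version:", installed_version() or "not installed")
```

In the notebook, you would call `load_customer_data()` with the URL of the customer CSV file that the notebook provides and keep the returned DataFrame for the cleaning step.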
Before the data can be analyzed, it needs to be cleaned and formatted. This can be done with a few PySpark commands:
- Select only the columns you are interested in with `df.select()`.
- Convert the AGE column to a numeric data type with a user-defined function (UDF), so you can run calculations on customer age.
- Derive the gender information for each customer based on the salutation with a second UDF, and rename the GenderCode column to GENDER.
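The three cleaning steps above could look something like the sketch below. Only the AGE, GenderCode and GENDER column names come from this document. The other column names, the salutation mapping, and the function names are assumptions made for illustration, so adjust them to the actual data. The Spark imports are deferred into `clean_customers()` so that the plain Python helpers can be tried out without a Spark runtime.

```python
# Hypothetical mapping from salutation to gender; extend it to match the real data.
SALUTATION_TO_GENDER = {"mr": "male", "mrs": "female", "ms": "female", "miss": "female"}


def parse_age(value):
    """Return the age as an int, or None if the value cannot be parsed."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def salutation_to_gender(salutation):
    """Map a salutation such as 'Mr.' to a gender label, defaulting to 'unknown'."""
    if salutation is None:
        return "unknown"
    key = str(salutation).strip().rstrip(".").lower()
    return SALUTATION_TO_GENDER.get(key, "unknown")


def clean_customers(df, columns=("CUST_ID", "CUSTNAME", "AGE", "GenderCode")):
    """Select columns, convert AGE to an integer, and derive GENDER from GenderCode.

    `columns` is an assumed subset; it must include AGE and GenderCode.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType, StringType

    age_udf = udf(parse_age, IntegerType())
    gender_udf = udf(salutation_to_gender, StringType())

    return (
        df.select(*columns)
        .withColumn("AGE", age_udf(col("AGE")))
        .withColumn("GenderCode", gender_udf(col("GenderCode")))
        .withColumnRenamed("GenderCode", "GENDER")
    )
```

For a plain type conversion, Spark's built-in `col("AGE").cast("int")` is usually faster than a UDF. The UDF form is shown here because it is the approach this code pattern describes.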
The data can now be explored with PixieDust:
- With `display()`, explore the data in a table.
- Then click on the button below to create one of the charts in the list.
- Drag and drop the variables you want to display into the Keys and Values fields. Select the aggregation from the drop-down menu and click OK.
- From the menu on the right of the chart, you can select which renderer you want to use; each one visualizes the data in a different way. Other options are clustering by a variable, the size and orientation of the chart, and the display of a legend.
- Below are two examples of a bar chart and a map created in the notebook.
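Separately from those examples, it may help to see what the Keys, Values and aggregation options amount to conceptually: a group-by on the key column followed by an aggregation of the value column. The sketch below illustrates that idea in plain Python on a list of dictionaries. It is not PixieDust's implementation. The helper names are hypothetical, and the aggregation names (SUM, AVG, MIN, MAX, COUNT) are assumed to match the options in the drop-down menu.

```python
from collections import defaultdict


def aggregate(rows, key, value, how="AVG"):
    """Group `rows` by `key` and aggregate `value`, similar to a chart's Keys/Values."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(value) is not None:  # skip rows with a missing value
            groups[row[key]].append(row[value])

    reducers = {
        "SUM": sum,
        "AVG": lambda values: sum(values) / len(values),
        "MIN": min,
        "MAX": max,
        "COUNT": len,
    }
    reducer = reducers[how]
    return {group: reducer(values) for group, values in groups.items()}


def show_in_notebook(df):
    """Open the interactive PixieDust table and chart view for `df`."""
    from pixiedust.display import display  # assumption: PixieDust in a Jupyter kernel

    display(df)


sample = [
    {"GENDER": "male", "AGE": 30},
    {"GENDER": "male", "AGE": 40},
    {"GENDER": "female", "AGE": 50},
]
print(aggregate(sample, key="GENDER", value="AGE", how="AVG"))
```

With GENDER as the key and AGE as the value, the AVG aggregation gives the average age per gender, which is the kind of result a bar chart would plot.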
Build a recommender with Apache Spark and Elasticsearch
Create a web-based mobile health app using Watson services on IBM Cloud and IBM Watson Studio
Use machine learning to predict U.S. opioid prescribers with Watson Studio and scikit-learn
- Watson Studio: Master the art of data science with IBM's Watson Studio
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
- With Watson: Want to take your Watson app to the next level? Looking to utilize Watson Brand assets? Join the With Watson program to leverage exclusive brand, marketing, and tech resources to amplify and accelerate your Watson embedded commercial solution.
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer [Certificate of Origin, Version 1.1 (DCO)](https://developercertificate.org/) and the [Apache Software License, Version 2](http://www.apache.org/licenses/LICENSE-2.0.txt).
ASL FAQ link: http://www.apache.org/foundation/license-faq.html#WhatDoesItMEAN