A Lego inventory dataset was analyzed with Apache Spark.
Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python and R.
It also supports the pandas API on Spark for pandas workloads.
The goal of this analysis was to determine the quantities of items, parts, sets and themes, their properties (for example, colour and transparency), and the stock of these parts in the inventory.
The analysis was performed with PySpark, the Python API for Apache Spark.
Advanced functions for ETL/ELT transformations were used to create this project.
Before the analysis, the data was cleaned using advanced PySpark functions and methods.
All steps are described in detail in PySpark_Project2_Lego.ipynb.
- Minifigure Series 1 [Random Bag] has the greatest quantity of sets (60)
- The Town theme, which belongs to the Dacta Buildings set, has the greatest quantity in the inventory (22)
- The Fence 1 x 4 x 1 part has the greatest quantity in the inventory (100)
- The Minifigs part category has the greatest quantity in the inventory (24)
- The NHL Action Set with Stickers set has the greatest quantity in the inventory (12)
- The majority of parts are black (63)
- The inventory has stock for only 15 parts
- The inventory has only 20 transparent parts
- The oldest set is Bungalow (54 years old)
- The part with the greatest quantity is Technic Pin with Friction Ridges Lengthwise and Center Slots (black) (18056). It belongs to the Technic Pins part category
- There are 2180 themes that have no parts
- The Dacta Buildings, Lego Road Safety Kit Poster, Set K1062 Activity Booklet and {Town Vehicles} sets have the greatest quantity of parts (136)
An in-depth analysis with detailed information is included in PySpark_Project2_Lego.ipynb.