This project performs exploratory data analysis (EDA) on the Microsoft Malware Prediction dataset, which contains details of malware-infected devices. The analysis uses Spark SQL through the PySpark library.
- Kaggle account, needed in order to access the data files: https://www.kaggle.com/c/microsoft-malware-prediction/data
- Install the PySpark library: `pip install pyspark`
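Once PySpark is installed, the data has to be loaded into Spark and exposed to Spark SQL before any query can run. The following is a minimal sketch of that step, not the project's actual code: the function name `load_malware_data`, the app name, and the view name `malware` are illustrative assumptions, and the CSV path is whatever file you downloaded from Kaggle.

```python
def load_malware_data(csv_path: str, view_name: str = "malware"):
    """Read the downloaded CSV into Spark and register it for Spark SQL.

    `view_name` is a hypothetical name chosen for this sketch.
    """
    # Imported inside the function so the definition itself does not
    # require PySpark until the function is actually called.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("malware-eda")  # assumed app name
        .getOrCreate()
    )

    # header=True takes column names from the first row;
    # inferSchema=True makes Spark scan the file to guess column types,
    # which costs an extra pass over a large CSV.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # A temp view lets the DataFrame be queried with spark.sql(...).
    df.createOrReplaceTempView(view_name)
    return spark, df
```

A call such as `spark, df = load_malware_data("train.csv")` would then make the data available as the table `malware` in SQL queries; the file name here is an assumption about how the Kaggle download is named locally.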
- Distribution of the data across the two classes (infected and non-infected)
- Operating System-wise Malware infection count
- Determining distribution of malware-infected touch-enabled devices
- Percentage of infected devices with Engine version
- Percentage of infected devices with App version
- Percentage of infected devices with Antivirus version
- Percentage of infected devices with Antivirus product count
- Percentage of infected devices with OS Platform SubRelease
- Percentage of infected devices with OS Build
- Percentage of infected devices with Processor's Core Count
- Percentage of infected devices with Physical RAM
- Percentage of infected devices with System Chassis Type
- Percentage of infected devices with OS Edition
- Percentage of infected devices with Processor Type
- Percentage of infected devices with OS Genuine State
- Percentage of infected devices with Primary Disk Type
- Writing the data to a Parquet file: `df.write.format("parquet").save("file:///data.parquet")`
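Most of the analyses listed above follow the same pattern: group the devices by one column and compute the share of infected devices in each group. A minimal sketch of that pattern is shown here, under several assumptions that are not stated in this README: the data is registered as a temp view named `malware`, the label column is called `HasDetections` and holds 0 or 1, and the helper names `infection_rate_sql` and `infection_rate_by` are hypothetical.

```python
import re

# Only plain identifiers are accepted, since names are interpolated into SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def infection_rate_sql(column: str, table: str = "malware",
                       label: str = "HasDetections") -> str:
    """Build a query giving the percentage of infected devices per value of `column`.

    `table` and `label` defaults are assumptions; adjust them to your schema.
    """
    for name in (column, table, label):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    return (
        f"SELECT {column}, "
        f"COUNT(*) AS devices, "
        f"SUM({label}) AS infected, "
        f"ROUND(100.0 * SUM({label}) / COUNT(*), 2) AS pct_infected "
        f"FROM {table} "
        f"GROUP BY {column} "
        f"ORDER BY devices DESC"
    )


def infection_rate_by(spark, column: str):
    """Run the query on an existing SparkSession and return a Spark DataFrame."""
    return spark.sql(infection_rate_sql(column))
```

With a session in hand, `infection_rate_by(spark, "Census_IsTouchEnabled").show()` would print the breakdown for touch-enabled devices, assuming the column carries that name in the CSV header. Reporting the group size (`devices`) next to the percentage helps when reading the results, because a high percentage in a group with very few devices says little.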
The dataset contains an equal count of both classes (malware-infected and non-infected devices). Devices running Windows 10 are observed to be more vulnerable to malware, especially touch-enabled devices. Devices with App version 4.18.1807.18075 are at high risk, as most of the affected systems run this version.
From the analysis, we can also observe that devices with more than two antivirus products installed are less affected. Surprisingly, devices with no antivirus installed are also among the least affected. In addition, 64-bit devices are heavily affected, particularly those whose OS is genuine.