sudip-padhye/EDA-of-Malware-Infected-Devices-using-PySpark

Exploratory Data Analysis of Malware infected devices using PySpark

This project performs exploratory data analysis (EDA) on Microsoft's dataset of malware-infected devices, using Spark SQL through the PySpark library.

Requirements

  1. Kaggle account
  2. The pyspark library, installed with pip:

pip install pyspark
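
Once pyspark is installed, the data has to be loaded and exposed to Spark SQL. The following is a minimal sketch, assuming the Kaggle download is a CSV file with a header row. The path `train.csv`, the view name `malware`, and both helper names are illustrative and do not come from this repository.

```python
def build_session(app_name="malware-eda"):
    """Create (or reuse) a SparkSession. The app name is an arbitrary label."""
    # Imported lazily so load_as_view can be used with an existing session.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_as_view(spark, csv_path, view_name="malware"):
    """Read a CSV file and register it as a temporary view for Spark SQL."""
    # header=True uses the first row as column names;
    # inferSchema=True lets Spark guess column types with an extra pass.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.createOrReplaceTempView(view_name)
    return df
```

Calling `load_as_view(build_session(), "train.csv")` would make the data queryable through `spark.sql(...)` under the name `malware`. Schema inference reads the file an additional time, so an explicit schema may be preferable for a large input.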

Analysis Performed

  1. Distribution of the data across the two classes (infected and not infected)

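
The class-wise distribution can be sketched as a single Spark SQL aggregation. This assumes the data is registered as a view named `malware` and that the label column is `HasDetections`, as in Kaggle's Microsoft Malware Prediction data. Both names, and the helper itself, are assumptions rather than code from this repository.

```python
def class_distribution_query(table="malware", label="HasDetections"):
    """Build a Spark SQL query that counts devices per class label."""
    return (
        f"SELECT {label}, COUNT(*) AS devices "
        f"FROM {table} "
        f"GROUP BY {label} "
        f"ORDER BY {label}"
    )
```

With a live session, `spark.sql(class_distribution_query()).show()` would print one row per class. The query is built as a plain string so it can be inspected or reused for other label columns.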

  1. Malware infection count by operating system


  1. Distribution of malware infections among touch-enabled devices


  1. Percentage of infected devices with Engine version

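
The percentage breakdowns in this list all follow one pattern, which might be sketched as below. The sketch reads "percentage of infected devices" as the share of all infected devices that fall under each value of a column; that reading is an assumption. The view name `malware` and the column names `HasDetections` and `EngineVersion` are also assumed (they match Kaggle's Microsoft Malware Prediction data), and the helper is hypothetical.

```python
def infected_share_query(column, table="malware", label="HasDetections"):
    """Build a Spark SQL query: share of infected devices per value of `column`."""
    return (
        f"SELECT {column}, COUNT(*) AS infected, "
        # Scalar subquery: total number of infected devices in the table.
        f"ROUND(100.0 * COUNT(*) / "
        f"(SELECT COUNT(*) FROM {table} WHERE {label} = 1), 2) "
        f"AS percent_of_infected "
        f"FROM {table} WHERE {label} = 1 "
        f"GROUP BY {column} "
        f"ORDER BY infected DESC"
    )
```

For example, `spark.sql(infected_share_query("EngineVersion")).show()` would list engine versions by their share of infected devices. Passing a different column name covers the other breakdowns in this list. If the infection rate within each value is wanted instead, the denominator would be the per-group device count rather than the overall infected total.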

  1. Percentage of infected devices with App version


  1. Percentage of infected devices with Antivirus version


  1. Percentage of infected devices with Antivirus product count


  1. Percentage of infected devices with OS Platform SubRelease


  1. Percentage of infected devices with OS Build


  1. Percentage of infected devices with Processor's Core Count


  1. Percentage of infected devices with Physical RAM


  1. Percentage of infected devices with System Chassis Type


  1. Percentage of infected devices with OS Edition


  1. Percentage of infected devices with Processor Type


  1. Percentage of infected devices with OS Genuine State


  1. Percentage of infected devices with Primary Disk Type


  1. Writing the data to a Parquet file

df.write.format("parquet").save("file:///data.parquet")
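
A slightly more defensive variant of the write step is sketched below, assuming an existing SparkSession `spark` and DataFrame `df`. The helper name and the choice of overwrite mode are illustrative, not part of this repository.

```python
def save_and_reload(spark, df, path):
    """Write `df` as Parquet, replacing earlier output, then read it back."""
    # mode("overwrite") avoids the error Spark raises by default
    # when the target path already exists.
    df.write.mode("overwrite").parquet(path)
    # Reading the result back is a cheap check that the write succeeded.
    return spark.read.parquet(path)
```

Note that a path such as `file:///data.parquet` points at the root of the local filesystem, which is often not writable, so a path under a working directory may be a safer target.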

Conclusion

The dataset has an equal count of both classes (malware-infected and non-infected devices). Devices running Windows 10 are observed to be more vulnerable to malware, especially touch-enabled devices. Devices with App version 4.18.1807.18075 are at high risk, as most of the affected systems run this version.

The analysis also shows that devices with more than two antivirus products installed are less affected. Surprisingly, devices with no antivirus product are also among the least affected. In addition, 64-bit devices are largely affected, particularly those whose OS is genuine.

About

Explore factors associated with Malware Infection using Spark SQL
