-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPersonal Project - EDA Salary Data.Rmd
134 lines (108 loc) · 4.29 KB
/
Personal Project - EDA Salary Data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
title: "Personal Project - Data Gaji"
author: "Agnes Septilia"
date: "23/10/2021"
output:
word_document: default
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Here I’m practicing EDA with the data set of Data Gaji (Salary Data).
Let's start with loading the library and take a look on the dataset.
```{r}
library(tidyverse)
library(dplyr)
library(readxl)
library(scales)
library(descr)
```
```{r}
# take a look on the dataset
salary <- read_xlsx("Data Gaji 2.xlsx")
glimpse(salary)
```
## Task 1 : We want to check whether any duplicate data based on NIP
```{r}
salary %>% count(NIP)
```
## Result on Task 1 : Total data of NIP is equal with total row in dataframe.
So there's no duplicate data (person) on the data.
## Task 2 : Check Turnover rate
```{r}
resign <- salary %>%
select(`Masa Perolehan Awal`, `Masa Perolehan Akhir`) %>%
mutate(Resigned = ifelse(`Masa Perolehan Akhir` == 12, 0, 1)) %>%
group_by(Resigned) %>%
count() %>%
pull(n)
turnover_rate <- (resign[2] / sum(resign) * 100)
cat(sprintf("Turnover rate is %.0f%s", turnover_rate, '%'))
```
## Task 3 : Make new column PTKP amount
Here's the basic rule of PTKP amount:
- if `Status PTKP` = TK -> 54,000,000
- if `Status PTKP` = K -> 58,500,000
- then add each with `Jumlah Tanggungan` * 4,500,000
```{r}
salary <- salary %>%
mutate(PTKP_Amount = ifelse(`Status PTKP` == 'TK', (54000000 + `Jumlah Tanggungan` * 4500000),
(58500000 + `Jumlah Tanggungan` * 4500000))) %>%
relocate(PTKP_Amount, .after = `Jumlah Tanggungan`)
glimpse(salary)
```
## Task 4 : Make new column PTKP_to_Bruto, to check whether `Jumlah Penghasilan Bruto` exceed PTKP or not.
The column will have value: 'Under PTKP' if Bruto <= PTKP, and 'Over PTKP' if otherwise.
Note: In real calculation, there will be element of reduction over `Bruto` before it was compared to PTKP.
But here, we assume the reduction element is none.
```{r}
salary <- salary %>%
mutate(PTKP_to_Bruto = ifelse(PTKP_Amount <= `Jumlah Penghasilan Bruto`, 'Under PTKP', 'Over PTKP'))
glimpse(salary)
```
## Task 5 : Single, Married, or Divorced?
Make new column called `Marriage_Status' with below condition:
- if `Status PTKP` == TK and `Jumlah Tanggungan` == 0 -> Then Single
- if `Status PTKP` == K -> Then Married
- if `Status PTKP` == TK and `Jumlah Tanggungan` != 0 -> Then Divorced
Note: Of course, not all TK/1/2/3 are divorced in real life.
So the category here is only for practice.
```{r}
salary <- salary %>%
mutate(Marriage_Status = case_when(`Status PTKP` == 'K' ~ 'Married',
`Status PTKP` == 'TK' & `Jumlah Tanggungan` == 0 ~ 'Single',
TRUE ~ 'Divorced')) %>%
relocate(Marriage_Status, .after = `Jumlah Tanggungan`)
glimpse(salary)
```
## Task 6 : How much Single that makes over 100jt per year ?
```{r}
rich_single <- salary %>%
filter (Marriage_Status == 'Single', `Jumlah Penghasilan Bruto` > 100000000) %>%
count()
cat(sprintf("There are %d person which are single and make over 100million per year", rich_single$n))
```
## Task 7 : Visualize the correlation between gender and salary using Dot Plot
gender -> using column `Jenis Kelamin`
salary -> using column `Gaji Pokok dan Tunjangan Tetap`
```{r}
salary %>%
ggplot(aes(x=`Jenis Kelamin`,
y = `Gaji Pokok dan Tunjangan Tetap`/1000000,
color = `Jenis Kelamin`)) +
geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.5) +
labs(title = "Salary distribution based on gender",
y = "Fixed Salary in million IDR",
x = "Gender",
col = "Gender")
```
## Result on Task 7 :
- In general, for the same salary amount, there are more female employees than male.
- Few females get paid higher than the rest of the company.
## Task 8 : Check correlation between Gender and Marriage Status using CrossTable
```{r}
CrossTable(x=salary$`Jenis Kelamin`, y=salary$Marriage_Status, prop.c = FALSE, prop.r = TRUE, prop.chisq = FALSE, chisq = TRUE)
```
## Result on Task 8 : p value is less than 0.05 (alpha), so there is correlation between Gender and Marriage Status.