-
Notifications
You must be signed in to change notification settings - Fork 109
/
Copy pathaskamanager_salary_survey.Rmd
93 lines (67 loc) · 6.51 KB
/
askamanager_salary_survey.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
# "Ask A Manager" salary survey dataset
Kliment Mamykin
A survey posted in a popular blog "Ask A Manager"<https://www.askamanager.org> asked the readers to anonymously fill out how much money they were making. After the [blog post](https://www.askamanager.org/2019/04/how-much-money-do-you-make-3.html) was published, it went viral, and 31407 people took time and responded to the survey. A [Google sheet](https://docs.google.com/spreadsheets/d/1rGCKXIKt-7l5gX06NAwO3pjqEHh-oPXtB8ihkp0vGWo) was used to aggregate the responses. The data is interesting, since the self-reported compensation data is hard to come by, and useful for anyone on the job market. However the analysis of this dataset is made difficult by the lack of normalization in the responses. The survey uses freeform entered responses, with little standartization on the values allowed (probably to minimize the time/friciton of filling out the form).
*The goal for this project is to clean up the data and make it available for community for further analysis.* Most of the work was done using [OpenRefine](http://openrefine.org/), with the Industries list normalized with a machine learning model.
The repository is https://github.com/kmamykin/askamanager_salary_survey
## Obtaining the dataset
Version 1 of the cleaned up dataset is located at (https://github.com/kmamykin/askamanager_salary_survey/raw/master/data/v1/Ask-A-Manager-Salary-Survey-2019.csv)
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggplot2)
```
```{r warning=FALSE}
salary_survey <- readr::read_csv("https://github.com/kmamykin/askamanager_salary_survey/raw/master/data/v1/Ask-A-Manager-Salary-Survey-2019.csv")
```
```{r fig.height=4, fig.width=3}
salary_survey %>%
drop_na(Industry) %>%
group_by(Industry) %>%
summarise(Freq = n()) %>%
ggplot(aes(x=fct_reorder(Industry, Freq), y=Freq)) +
geom_bar(stat = "identity", fill = "cornflowerblue") +
geom_text(aes(label = Freq), position = position_dodge(width = 1), hjust = -0.2, size = 2) +
coord_flip() +
ylim(0, 3500) +
xlab("Industry") +
ylab("Responses") +
ggtitle("Responses by Industry") +
theme_bw()
```
## Description of fields
* `Timestamp` - timestamp of the submittd response
* `Age` - age bracket: "18-24", "25-34", "35-44", ... One of the few fields that was a choice and did not need a cleanup
* `Industry` - This is the field that took the most time to normalize. While there are many different industry classification taxonomies, the industry list in https://www.webspidermount.com/features/generic-job-taxonomy/ looked short, simple and targeted for a job search domain. The values were normalized based on `Industry (Original)` using KNN classifier (see below for details).
* `JobTitle` - user entered field, clustered to remove small variations (e.g. "sr.engineer" vs "Sr Engineer"). There is still too much variation and this field need more normalization work.
* `City`, `State`, `Country` - these fields were normalized from `Location (Original)` field by calling Google Locations API and further manually tweaking the data inside OpenRefine.
* `Location` - normalized location in format "Nashville, TN, USA". Only USA, Australia, Canada locations contain states in this dataset, for the other countries the field contains <City>, <Country>, e.g. "Manchester, UK"
* `Experience` - experience in the responder, e.g. "8 - 10 years", "11 - 20 years" etc. Also did not need normalization.
* `Currency` - currency of the Base pay, normalized from a freeform entry.
* `Base` - base salary (in specified `Currency`). This is the main metric. A lot of parsing and regex transforms went to this field to extract the numbers from freeform `AnnualSalary (Original)`. Extra perks got separated into `Extras` field. For the records where it was detected that the compensation was at an hourly rate, the rate was extracted to `HourlyRate` and the `Base` was calculated with `HourlyRate` * 1650 hours/year (this is a big assumption here). You can filter those records out where `HourlyRate` is not NA.
* `HourlyRate` - extracted hourly rate from `AnnualSalary (Original)`. Only some records contain values (hourly rate workers), otherwise the value is empty.
* `Extras` - extracted extra perks/comments from `AnnualSalary (Original)`
* `Notes` - user provided notes
* `AnnualSalary (Original)` - original field
* `Location (Original)` - original field
* `JobTitle (Original)` - original field
* `Industry (Original)` - original field
* `Industry (Clustered)` - `Industry (Original)` clustered with OpenRefine to remove small variations of spelling/capitalization
```{r}
str(salary_survey)
```
## Data cleanup process
[OpenRefine](http://openrefine.org/) was used for most of the heavy lifting of data transformations.
> OpenRefine (previously Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
### Industry classification
The initial attempt to manually cleanup the original `Industry (Original)` field proved to be taking to long, and in the middle I decided to change the taxonomy. To automate this process a simple machine learning model was used to classify the *input* user entered industry into a *target* normalized set of industry taxonomy values.
For both sets (the target industries and the input industries) I scraped Google search result and the resulting organic web sites, creating a rich text document describing each input or target industry. Using bag of words document representation and KNN classifier (k=1), each input industry was classified into a target industry based on their document vector similarity (using eucledian metric)
See [Processing.ipynb](https://github.com/kmamykin/askamanager_salary_survey/Processing.ipynb) and [Classification.ipynb](https://github.com/kmamykin/askamanager_salary_survey/Classification.ipynb) for further details.
### Job Title classification
TODO - transform the freeform entered job titles into `Job Seniority` field and `Job Area` field, e.g. "Senior Data Scientist" would be transformed to "Senior" seniority and "Data Science" job area.
### Contributing
Contributions are welcome! If you notice an error in the dataset, or would like to add further cleanup - please follow this process:
1. Download the dataset and create a project in OpenRefine.
2. Perform desired data operations.
3. Switch to "Undo/Redo" tab and "Extract..." transformation file in JSON.
4. Submit your changes in a Pull Request with transformation file attached.
![](resources/askamanager_salary_survey/Contribute-transforms.png)