03-cleaning.Rmd

# Data transformation

```{r, echo=FALSE,message=FALSE,warning=FALSE}
#install.packages('spotifyr')
library('spotifyr')
#library(here)
library(readr)
library(tidyverse)
library(socviz)
library(dplyr)
library('tidytuesdayR')
```
```{r,echo=FALSE,message=FALSE,warning=FALSE}
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv',show_col_types = FALSE)
#tuesdata <- tidytuesdayR::tt_load('2020-01-21') 
#tuesdata <- tidytuesdayR::tt_load(2020, week = 4)
#spotify_songs <- tuesdata$spotify_songs
```
The dataset contains 12 audio features, which we group into three types:<br>
confidence features: `acousticness`, `liveness`, `speechiness`, and `instrumentalness`<br>
perceptual features: `energy`, `loudness`, `danceability`, and `valence` (positiveness)<br>
descriptive features: `duration_ms`, `tempo`, `key`, and `mode`<br>
<br>
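
The grouping above can be written down as a named list, which makes it easy to pull one group of columns at a time. This is only a minimal sketch: `feature_groups`, `toy_songs`, and `perceptual_view` are hypothetical names, the toy values are made up, and the column names (in particular `duration_ms`) are assumed to match the dataset.

```r
library(dplyr)

# Feature groups as described in the text. Column names are assumed to match
# the dataset; `duration_ms` in particular is an assumption.
feature_groups <- list(
  confidence  = c("acousticness", "liveness", "speechiness", "instrumentalness"),
  perceptual  = c("energy", "loudness", "danceability", "valence"),
  descriptive = c("duration_ms", "tempo", "key", "mode")
)

# Toy one-row data frame standing in for spotify_songs (values are made up).
toy_songs <- tibble::tibble(
  track_popularity = 66, playlist_genre = "pop",
  energy = 0.9, loudness = -2.6, danceability = 0.75, valence = 0.52,
  tempo = 122
)

# Pick popularity, genre, and the perceptual columns by name.
perceptual_view <- toy_songs %>%
  select(track_popularity, playlist_genre, all_of(feature_groups$perceptual))

names(perceptual_view)
```

Keeping the groups in one list means a later chapter could switch to, say, the confidence features by changing a single name.
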
The analysis focuses on the relationship among genre, perceptual features, and popularity, so we work mainly with `track_popularity`, `playlist_genre`, and the perceptual features. We cleaned the dataset in the following steps:

- Keep the columns we need: `track_popularity`, `playlist_genre`, and the audio features, most importantly `energy`, `loudness`, `danceability`, and `valence`.
- Check for duplicates and fix typos.

<br>
After cleaning, the dataset has 32833 rows and 14 columns.<br>
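
The duplicate and typo checks are not shown in this chapter's code, so the following is only a hedged sketch of what such a check might look like. `toy_songs` and its values are made up for illustration, and normalising labels with `tolower()` and `trimws()` is an assumption about the kind of typo involved, not the exact procedure used here.

```r
library(dplyr)

# Toy stand-in for spotify_songs; the values are made up for illustration.
toy_songs <- tibble::tibble(
  track_id         = c("a1", "a1", "b2", "c3"),
  playlist_genre   = c("pop", "pop", "rock", "Rock "),
  track_popularity = c(60, 60, 45, 30)
)

# Count fully duplicated rows, then keep one copy of each.
n_duplicates <- sum(duplicated(toy_songs))
deduped <- distinct(toy_songs)

# Normalise genre labels so that "Rock " and "rock" collapse into one level.
cleaned <- deduped %>%
  mutate(playlist_genre = tolower(trimws(playlist_genre)))

n_duplicates
count(cleaned, playlist_genre)
```

On the real data, running `count()` on `playlist_genre` before and after normalising is a quick way to spot misspelled labels.
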


```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Drop identifier and playlist metadata columns that the analysis does not use
songs_info <- spotify_songs %>%
  select(-c(track_id, track_name, track_artist, track_album_id, track_album_name,
            track_album_release_date, playlist_name, playlist_id, playlist_subgenre))
songs_info
```
We then summarized the songs by genre (track count and average popularity) for use in the analysis in later chapters.
```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Number of tracks and mean popularity per genre
songs_info %>%
  group_by(playlist_genre) %>%
  summarise(genre_total = n(), average_popularity = mean(track_popularity)) %>%
  knitr::kable()

# Bar chart of track counts, genres ordered from most to least frequent
songs_info %>%
  ggplot(aes(x = fct_infreq(playlist_genre), fill = playlist_genre)) +
  geom_bar(width = 0.3) +
  labs(title = 'Genre Counts', x = 'Genre', y = 'Number of tracks')
```