---
title: "DA5030 - Final Project - Fall 2018"
author: "Brian Gridley"
date: "December 8, 2018"
output: pdf_document
---
$$\Large{\textbf{Predicting Boston Airbnb Prices}} $$
$$\textbf{Phase 1: Business Understanding}$$
Airbnb is an online short-term housing rental service where users can rent space in a home for as little as one night. The service has generated considerable controversy over its impact on local neighborhoods, particularly in high-demand real estate markets. Critics claim it reduces the stock of housing units available for residents by turning them into short-term, hotel-style investment units. Boston has already imposed regulations to limit the impact the service will have on the city's housing market, and many other cities are considering following a similar path.
This project aims to understand the drivers behind the pricing of listings on this service. To achieve this, I will examine a dataset containing the price and identifying characteristics of Airbnb listings across the entire city of Boston. I will build three different machine learning models in R (multiple linear regression, neural network, and k nearest neighbor) to predict the price of a listing given its characteristics. All steps of the CRISP-DM process will be followed: collecting and exploring the data, preparing it for modeling by cleaning, formatting, and transforming it, building and testing the three prediction models, evaluating their results, and comparing their accuracies.
$$\textbf{Phase 2: Data Understanding}$$
The data was downloaded as a CSV file from Kaggle.com (the "listings.csv" file located at the following link: https://www.kaggle.com/airbnb/boston?login=true#listings.csv). It contains data on Airbnb listings across Boston. Because Airbnb does not offer a public API or release its data directly, the dataset was scraped from the Airbnb website. Here, I will import the data, explore it, and examine its quality for potential issues.
```{r}
# Import the data
library(tidyverse)
# I am importing it with the read_csv() function from the tidyverse
# package, which imports strings as characters rather than as factors
listings <- as.data.frame(read_csv("listings.csv"))
#head(listings)
# looks like it imported correctly
# take a look at the structure of the dataset
str(listings)
# there are 3,585 records and 95 columns
# Looking at the variables, they are a mixture of character, numeric, and date variables.
# The data includes the price of each individual unit (our response variable), full text
# descriptions of the unit, neighborhood details, unit and listing details, host details,
# and review scores.
# a lot of the variables are not useful for this analysis and will be excluded
# look at a summary of the full dataset
summary(listings)
# This is a lot of information to take in with all variables included, but a few interesting
# things can be noticed from this...
# - The "host_since" field is a date that tells how long the host of the unit has been hosting.
# This can probably be converted into a numeric variable, as a measure of the total time.
# - A lot of the character fields actually represent a number and will have to be cleaned up
# and converted (such as "host_response_rate", "host_acceptance_rate", "zipcode",
# "price", and "cleaning_fee").
# - There appear to be categorical variables that are in character format now that can be
# converted to dummy variables that may be useful predictors (such as "host_is_superhost",
# "neighborhood", "property_type", "room_type", and "bed_type"). I will explore these
# further during cleansing stage.
# - The "square_feet" variable appears to be useless although it would have been a valuable
# predictor variable. It has 3,529 NA values out of the 3,585 records, so imputation would be
# close to impossible since there is no data to use for that.
# - most of the other numeric variables that may be used have a much lower number of NA records
# so data imputation will be done.
# - Maybe the most valuable information is that there is no date field, and the listing data
# appears to be a snapshot of listings from one day (the "last_scraped" field is the same day
# for all records). This is great news for the prediction models because there will not be
# seasonality in the price data, so no additional preprocessing will need to be done to
# account for that.
# - The range of the numeric variables all vary quite widely, so data transforms will be needed
# to convert to similar scales.
# I will explore the data further in the next section; first I will need to clean it to
# get it into a workable format before examining the distributions, outliers, collinearity,
# and correlations in more detail.
```
After examining the data, there is a lot of pre-processing work to be done in the next step to prepare the data for modeling and to examine it further. First, some re-formatting is needed. After the data is formatted appropriately (mainly converting numeric fields stored as text to numeric format), I can further examine the distributions to continue the data understanding phase. Also, missing values will need to be dealt with through imputation and deletion, transforms may be needed, and categorical variables will need to be dummy coded. There do appear to be a good number of potentially valuable predictor variables that can be used in the models after the data is cleansed. Although the data requires a good amount of cleansing and pre-processing, I would say it is high-quality data that should support useful prediction models.
$$\textbf{Phase 3: Data Preparation}$$
In this stage, I will select the attributes to use in the modeling, re-format certain variables to make them usable, create dummy codes for categorical variables, re-examine the data to better understand it, impute missing values, and re-scale the data. The neural network and kNN algorithms work best with features on a similar, small scale, so I will min/max normalize the features to bring them within a range of 0 to 1.
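For reference, min/max normalization rescales each value $x_i$ of a feature $x$ onto the interval $[0, 1]$:
$$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$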
```{r}
# SELECT FEATURES TO BE USED
# First, I want to select the features that will be useful for the prediction models
# then I will move on to the cleansing
#head(listings)
# I will keep the "id" field for now, as it might come in handy when checking for duplicate
# records and splitting data into training and testing sets
# To avoid any unnecessary and complicated text analysis, I will not be using any of the descriptive
# text fields that provide detailed descriptions of the listings, units, hosts, or neighborhoods.
# I will stick to the numeric and categorical variables.
# Looking into a few fields that may be of interest before deciding to delete or keep...
# Checking unique records in the "experiences_offered" field
listings %>%
group_by(experiences_offered) %>%
summarise(count = n()) %>%
arrange(desc(count))
# this field is useless...
# look at host_location field
head(listings %>%
group_by(host_location) %>%
summarise(count = n()) %>%
arrange(desc(count)))
# I will keep this field and recode it... there is useful information here,
# I will bin the data and turn this into a binary variable ("Local Host") which will be TRUE if
# in Massachusetts and FALSE if outside of Massachusetts. It might be a good predictor of price.
# neighbourhood_cleansed
listings %>%
group_by(neighbourhood_cleansed) %>%
summarise(count = n()) %>%
arrange(desc(count))
# There are 25 different neighborhoods. This is too many to be used as a categorical variable,
# as it would require 24 dummy variables... I'll look into the "zipcode" field instead
# zipcode
listings %>%
group_by(zipcode) %>%
summarise(count = n()) %>%
arrange(desc(count))
# Again, way too many to be used categorically... and although it's a numeric variable,
# it technically shouldn't be used as a predictor in that way because an increase in zipcode
# has no meaning.
# However, I would like to use a location variable in the model because it should be a good
# predictive variable. If there were fewer neighborhood or zipcode unique values, I could use
# the appropriate method and create dummy variables with a 0 or 1 for each neighborhood. If
# there was a distance to city center field or distance to public transit field, that could
# be used numerically. Unfortunately, the data doesn't offer either of those options, so I will
# use the "zipcode" field as a proxy for location by using it numerically. I will note that it is
# important to recognize that an increase in zipcode has no real meaning and should not be used
# when examining the effect on price. Rather, the zipcode field will just be used to see if a
# change in zipcode is significant in regards to the listing price (which I expect it to be).
# property_type
listings %>%
group_by(property_type) %>%
summarise(count = n()) %>%
arrange(desc(count))
# This could be useful, but it would require a lot of dummy coding... 13 unique values
# room_type
listings %>%
group_by(room_type) %>%
summarise(count = n()) %>%
arrange(desc(count))
# This might be a better predictor variable than "property_type", so I will keep this variable,
# creating dummy variables and will exclude "property_type".
# bed_type
listings %>%
group_by(bed_type) %>%
summarise(count = n()) %>%
arrange(desc(count))
# This seems like it will be very similar in predictive value to room_type,
# so I will just use room_type
# after reviewing each column, I selected the features that I want to keep for my prediction
# analysis based on personal judgement about their relevance to predicting the price of a listing
# (the data mining goal), based on the data type (removing any long text variables that cannot
# be converted to useful variables), and based on data quality (as mentioned earlier, although
# square feet would be a great predictor variable, I will not use it because there are too many
# NA values... there is poor data quality).
# Now I will bring all of the records for the variables of interest into a new table
listings_narrow <- listings %>%
select(c("id","price","host_since","host_location","host_acceptance_rate",
"calculated_host_listings_count","zipcode","room_type","accommodates",
"bathrooms","beds","security_deposit","cleaning_fee","guests_included",
"minimum_nights","maximum_nights","number_of_reviews","review_scores_rating"))
head(listings_narrow)
# CLEAN THE DATA
# ... INCLUDING BINNING AND DUMMY CODING WHERE APPROPRIATE
# There is a lot of work that is needed to make the data workable
# I will look at each variable individually and determine how to clean it... looking at whether
# it needs to be converted to a different data type and whether it needs to be
# re-binned and turned into dummy variables.
# Price...
# It is in character format, with a preceding '$' in each record. I want to convert this to a
# numeric field for analysis
# bring all of the listings_narrow data into a new dataframe to preserve the original data
listings_narrow_clean <- listings_narrow
head(listings_narrow_clean$price)
# Take out the '$' sign and convert to numeric
listings_narrow_clean$price <- as.numeric(gsub("\\$","",listings_narrow_clean$price))
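# note: any price strings that also contain a comma thousands separator (e.g. "$1,000.00")
# would be coerced to NA by as.numeric() here; a more thorough cleanup could strip commas
# too, e.g. gsub("[$,]", "", ...). Any NAs produced are handled with the other missing
# values during imputation later on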
summary(listings_narrow_clean$price)
# good, I will explore this data more and impute the NA values later
# 'host_since' field... it is a date, I want to convert it to a numeric field representing the
# time interval (in years) for how long the host has been active on Airbnb, by calculating the
# difference between the date this data was pulled (which is found in the original file in the
# last_scraped field) and the date of the 'host_since' field
summary(listings$last_scraped)
# the date the data was pulled was "2016-09-07", so I will calculate the time between "2016-09-07"
# and the 'host_since' field
summary(listings_narrow_clean$host_since)
# using the difftime function, but its largest unit is "weeks", so I will calculate the
# years from there
listings_narrow_clean$host_since <- as.numeric(difftime(as.Date("2016-09-07"),
as.Date(listings_narrow_clean$host_since),
units = "weeks"))/52
summary(listings_narrow_clean$host_since)
# 'host_location' field... as mentioned earlier, I will convert this into a binary field,
# 'host_local_yes', identifying whether the host lives within Massachusetts ("1") or outside ("0")
# look at the unique records again
head(listings_narrow_clean %>%
group_by(host_location) %>%
summarise(count = n()) %>%
arrange(desc(count)))
# they are in the format "city, state, country", so I will search for the word "Massachusetts"
# in each record and code that as a 1
listings_narrow_clean$host_location <- as.numeric(str_detect(listings_narrow_clean$host_location,
"Massachusetts"))
# rename the column to "host_local_yes"
colnames(listings_narrow_clean)[4] <- "host_local_yes"
head(listings_narrow_clean)
# good
summary(listings_narrow_clean$host_local_yes)
# the NA values will be handled later after all data is cleaned
# 'host_acceptance_rate' field... I want to convert this into a numeric field
# I need to remove the '%' and convert to numeric
listings_narrow_clean$host_acceptance_rate <- as.numeric(gsub("\\%","",
listings_narrow_clean$host_acceptance_rate))
# calculated_host_listings_count... this field is okay as is, no cleaning needed
summary(listings_narrow_clean$calculated_host_listings_count)
# there is a clear right-skew in the distribution, which I will address later
# zipcode... as mentioned earlier, this will be used numerically and as a location proxy.
# I will convert it to numeric format
listings_narrow_clean$zipcode <- as.numeric(listings_narrow_clean$zipcode)
summary(listings_narrow_clean$zipcode)
# there are NA values, which will be addressed later
# room_type... as mentioned earlier, I will create dummy variables for this variable
# There are 3 unique values, so I will create 2 dummy variables... "Entire home/apt" and
# "Private room"
# create the dummy variables
listings_narrow_clean <- listings_narrow_clean %>%
mutate(room_type_Entire_home_apt = ifelse(room_type == "Entire home/apt", 1, 0),
room_type_Private_room = ifelse(room_type == "Private room", 1, 0))
# verify counts are accurate
listings_narrow_clean %>%
group_by(room_type) %>%
summarise(count = n())
# there are 2127 "Entire home/apt" and 1378 "Private room"
sum(listings_narrow_clean$room_type_Entire_home_apt)
# 2127
sum(listings_narrow_clean$room_type_Private_room)
# 1378
# that's correct
# now remove the room_type field
listings_narrow_clean <- listings_narrow_clean[,-8]
# accommodates... this field is okay, no cleaning needed
summary(listings_narrow_clean$accommodates)
# bathrooms... this field is okay, no cleaning needed
summary(listings_narrow_clean$bathrooms)
# NA values will be dealt with later
# beds... this field is okay, no cleaning needed
summary(listings_narrow_clean$beds)
# NA values will be dealt with later
# security_deposit... this is a character field representing numeric data, however, I am
# not interested in the numeric information. I am only interested in whether there is a
# security deposit or not. There are 2243 NA values, so a lot of listings do not require
# a security deposit. I would like to test whether requiring a security deposit is a
# significant predictor of price. So I will convert this into a binary variable, where if
# the value is NA it equals 0 and if not NA it equals 1
head(listings_narrow_clean %>%
group_by(security_deposit) %>%
summarise(count = n()) %>%
arrange(desc(count)))
# 2243 NA records
# convert to binary field
listings_narrow_clean$security_deposit <-
as.numeric(ifelse(is.na(listings_narrow_clean$security_deposit),0,1))
# check counts
sum(listings_narrow_clean$security_deposit == 0)
# 2243... this is correct
# rename column
colnames(listings_narrow_clean)[11] <- "security_deposit_yes"
# cleaning_fee... I will do the same thing for this field
head(listings_narrow_clean %>%
group_by(cleaning_fee) %>%
summarise(count = n()) %>%
arrange(desc(count)))
# 1107 NA records
# convert to binary field
listings_narrow_clean$cleaning_fee <-
as.numeric(ifelse(is.na(listings_narrow_clean$cleaning_fee),0,1))
# check counts
sum(listings_narrow_clean$cleaning_fee == 0)
# 1107 this is correct
# rename column
colnames(listings_narrow_clean)[12] <- "cleaning_fee_yes"
# guests_included... this field is okay, no cleaning needed
summary(listings_narrow_clean$guests_included)
# minimum_nights... this field is okay, no cleaning needed
summary(listings_narrow_clean$minimum_nights)
# maximum_nights... this field is okay, no cleaning needed
summary(listings_narrow_clean$maximum_nights)
# the max number appears to be an outlier, which I will address later
# number_of_reviews...this field is okay, no cleaning needed
summary(listings_narrow_clean$number_of_reviews)
# the distribution appears to be right-skewed, which will be addressed later
# review_scores_rating... this field is okay, no cleaning needed
summary(listings_narrow_clean$review_scores_rating)
# the NA values will be addressed later
# now every column in the data has been cleaned/re-formatted where necessary.
# I want to explore it a bit further, to verify everything is okay
# Back to re-examining the data fields, now that they are cleanly formatted
str(listings_narrow_clean)
# the structure looks good now, NA values will need to be addressed through imputation
# and transforms will be needed since the ranges vary widely
# check that all records are unique, using the 'id' variable
count(listings_narrow_clean %>%
group_by(id) %>%
summarise(count = n()))
# 3585... this is the total number of records in the data, so they are all unique.
# No duplicates
# OUTLIER DETECTION
# identifying and removing outliers
# I will run a 'for loop' to detect and compile all outliers for the non-binary numeric
# variables
# create vector of the column names I want to check in the loop
columns <- c("price", "host_since", "host_acceptance_rate",
"calculated_host_listings_count", "accommodates", "bathrooms",
"beds", "guests_included", "minimum_nights", "maximum_nights",
"number_of_reviews","review_scores_rating")
# create a dataframe with the appropriate headings, which I will compile each iteration
# of the loop into
outliers_total <- head(listings_narrow_clean,1)
outliers_total <- outliers_total[-1,]
# run the loop... identifying outliers for each variable by z-score and compiling them
# into that dataframe. I will detect and compile records that have a z-score greater than
# 3 or less than -3... the standard threshold
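# for reference, the z-score of value x_i within a column x is:
#   z_i = (x_i - mean(x)) / sd(x)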
for (i in 1:12) {
temp_outliers <- filter(mutate(listings_narrow_clean,
zscore = (c(listings_narrow_clean[,columns[i]])-
mean(c(listings_narrow_clean[,columns[i]]), na.rm = TRUE))
/sd(c(listings_narrow_clean[,columns[i]]), na.rm = TRUE)), abs(zscore) > 3)
outliers_total <- rbind(outliers_total, temp_outliers)
}
# check to make sure it worked
head(outliers_total)
# looks good
# count the number of outliers
count(outliers_total)
# 674 total outliers... this is a lot, but there may be duplicates since each column was
# checked separately
# look at unique outliers
count(outliers_total %>%
group_by(id) %>%
summarise(count = n()))
# there are 555 unique records from the data that have outliers...
# this is quite a lot. I would rather not remove this much data...
# it's around 15% of the data
# I will raise the z-score threshold to see if I can retain more data
# run the loop, identifying outliers as above a zscore of 4.5 or below -4.5
# clear the outliers_total dataframe
outliers_total <- head(listings_narrow_clean,1)
outliers_total <- outliers_total[-1,]
# run the loop again...
for (i in 1:12) {
temp_outliers <- filter(mutate(listings_narrow_clean,
zscore = (c(listings_narrow_clean[,columns[i]])-
mean(c(listings_narrow_clean[,columns[i]]), na.rm = TRUE))
/sd(c(listings_narrow_clean[,columns[i]]), na.rm = TRUE)), abs(zscore) > 4.5)
outliers_total <- rbind(outliers_total, temp_outliers)
}
# check it again
head(outliers_total)
# count
count(outliers_total)
# 150
# look at unique outliers
count(outliers_total %>%
group_by(id) %>%
summarise(count = n()))
# 133 outliers, this is much better. I'll use this threshold and delete these records
# before moving on to modeling
# remove the outliers... bringing into new data frame to preserve full clean data frame
# using the unique identifier field to identify each record to exclude
listings_narrow_clean_no_outliers <- filter(listings_narrow_clean, !(id %in% outliers_total$id))
# check appropriate number of records was removed
count(listings_narrow_clean) - count(listings_narrow_clean_no_outliers)
# 133, that's correct
# Good, now the data is clean with extreme outliers removed
# EXPLORATORY PLOTS
# look at distributions of the numeric variables... with pairs.panels chart
library(psych)
pairs.panels(listings_narrow_clean_no_outliers[c("price", "host_since",
"host_acceptance_rate", "calculated_host_listings_count",
"accommodates", "bathrooms", "beds", "guests_included",
"minimum_nights", "maximum_nights", "number_of_reviews",
"review_scores_rating")])
# the price field looks okay, with a relatively normal distribution,
# but I'll see if a transform would improve it a lot
# original distribution
hist(listings_narrow_clean_no_outliers$price)
# it is a little right-skewed even with outliers removed
# does a log transform improve it?
hist(log10(listings_narrow_clean_no_outliers$price))
# not perfect, maybe a little better
# look at sqrt transform
hist(sqrt(listings_narrow_clean_no_outliers$price))
# same, but I will keep it as the original data without a transform,
# as it is relatively normal
# host_since
hist(listings_narrow_clean_no_outliers$host_since)
# this is relatively normal, will keep as is
# host_acceptance_rate
hist(listings_narrow_clean_no_outliers$host_acceptance_rate)
# skewed left
# does log transform help?
hist(log10(listings_narrow_clean_no_outliers$host_acceptance_rate))
# not really... what about sqrt transform?
hist(sqrt(listings_narrow_clean_no_outliers$host_acceptance_rate))
# no transforms will improve it, so leave as is
# calculated_host_listings_count
hist(listings_narrow_clean_no_outliers$calculated_host_listings_count)
# skewed right
# does log transform help?
hist(log10(listings_narrow_clean_no_outliers$calculated_host_listings_count))
# not really... what about sqrt transform?
hist(sqrt(listings_narrow_clean_no_outliers$calculated_host_listings_count))
# no transforms will improve it, so leave as is
# accommodates
hist(listings_narrow_clean_no_outliers$accommodates)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$accommodates))
hist(sqrt(listings_narrow_clean_no_outliers$accommodates))
# I will keep it as is
# bathrooms
hist(listings_narrow_clean_no_outliers$bathrooms)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$bathrooms))
hist(sqrt(listings_narrow_clean_no_outliers$bathrooms))
# I will keep it as is, transform doesn't improve
# beds
hist(listings_narrow_clean_no_outliers$beds)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$beds))
hist(sqrt(listings_narrow_clean_no_outliers$beds))
# I will keep it as is, transform doesn't improve
# guests_included
hist(listings_narrow_clean_no_outliers$guests_included)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$guests_included))
hist(sqrt(listings_narrow_clean_no_outliers$guests_included))
# I will keep it as is, transform not a huge improvement
# minimum_nights
hist(listings_narrow_clean_no_outliers$minimum_nights)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$minimum_nights))
hist(sqrt(listings_narrow_clean_no_outliers$minimum_nights))
# I will keep it as is, transform doesn't improve
# maximum_nights
hist(listings_narrow_clean_no_outliers$maximum_nights)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$maximum_nights))
hist(sqrt(listings_narrow_clean_no_outliers$maximum_nights))
# keep as is
# number_of_reviews
hist(listings_narrow_clean_no_outliers$number_of_reviews)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$number_of_reviews))
hist(sqrt(listings_narrow_clean_no_outliers$number_of_reviews))
# I will keep it as is, transforms don't really improve it
# review_scores_rating
hist(listings_narrow_clean_no_outliers$review_scores_rating)
# look at transforms
hist(log10(listings_narrow_clean_no_outliers$review_scores_rating))
hist(sqrt(listings_narrow_clean_no_outliers$review_scores_rating))
# I will keep it as is, transforms don't help
# after looking into possible transforms, it looks like they will not improve
# much, so I will leave the data as is without any transforms
# IMPUTATION
# Next I want to impute the missing values, now that the outliers are removed and
# won't affect the imputed values
summary(listings_narrow_clean_no_outliers)
# 7 of the variables have missing values
# price, host_local_yes, host_acceptance_rate, zipcode,
# bathrooms, beds, review_scores_rating
# investigate imputation methods
# price variable
# look at overall average
mean(listings_narrow_clean_no_outliers$price, na.rm = TRUE)
# look at average by number of beds, which seems an appropriate price grouping
listings_narrow_clean_no_outliers %>%
group_by(beds) %>%
summarise(avg = mean(price, na.rm = TRUE), count = n())
# it varies widely... this is a better way to impute
# I could also impute by using the kNN algorithm or with multiple regression, to take
# all variables into account and predict the missing values, but it looks like enough
# information can be taken from the other variables through grouping or through
# judgement calls
listings_narrow_clean_no_outliers %>%
filter(is.na(price)) %>%
group_by(beds) %>%
summarise(count = n())
# there are 9 missing values... with values of 1,2, and 5 for # beds
# impute by using the mean rounded price by # of beds
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$price) &
listings_narrow_clean_no_outliers$beds == 1,c("price")] <- 129
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$price) &
listings_narrow_clean_no_outliers$beds == 2,c("price")] <- 207
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$price) &
listings_narrow_clean_no_outliers$beds == 5,c("price")] <- 291
# host_local_yes variable
# this is binary, look at overall counts
listings_narrow_clean_no_outliers %>%
group_by(host_local_yes) %>%
summarise(count = n())
# 2566 '1' values... 75% of the data
# 10 NA values
# since nothing in the data seems really indicative of a host being local or not,
# I will assign the NA values as '1' because it is the most common value
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$host_local_yes),
c("host_local_yes")] <- 1
# host_acceptance_rate variable
# look into groupings
listings_narrow_clean_no_outliers %>%
group_by(beds) %>%
summarise(avg = mean(host_acceptance_rate, na.rm = TRUE), count = n())
# this has no variation
listings_narrow_clean_no_outliers %>%
group_by(host_local_yes) %>%
summarise(avg = mean(host_acceptance_rate, na.rm = TRUE), count = n())
# this has pretty good variation, I will use this for imputation
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$host_acceptance_rate) &
listings_narrow_clean_no_outliers$host_local_yes == 0,
c("host_acceptance_rate")] <- 75
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$host_acceptance_rate) &
listings_narrow_clean_no_outliers$host_local_yes == 1,
c("host_acceptance_rate")] <- 87
# zipcode variable
# for this, I will assign the zipcode that occurs the most often
# identify the most common zipcode
head(listings_narrow_clean_no_outliers %>%
group_by(zipcode) %>%
summarise(count = n()) %>%
arrange(desc(count)),1)
# zipcode = 2116 (i.e. 02116, with the leading zero dropped by the numeric conversion)
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$zipcode),
c("zipcode")] <- 2116
# bathrooms variable
# number of beds should be a good indicator
listings_narrow_clean_no_outliers %>%
group_by(beds) %>%
summarise(avg = mean(bathrooms, na.rm = TRUE), count = n())
# look at beds values of missing bathrooms records
c(listings_narrow_clean_no_outliers %>%
filter(is.na(bathrooms)) %>%
select(beds))
# they are all NA, 1, or 2
# according to the previous table, they will all most likely have 1 bathroom
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$bathrooms),
c("bathrooms")] <- 1
# beds variable
# look at overall distribution first
listings_narrow_clean_no_outliers %>%
group_by(beds) %>%
summarise(count = n())
# the vast majority have 1 or 2 beds... 9 NA values
# can # bathrooms provide a good indicator?
listings_narrow_clean_no_outliers %>%
group_by(bathrooms) %>%
summarise(avg = mean(beds, na.rm = TRUE), count = n())
# this looks okay, look at the bathroom values for NA bed records
c(listings_narrow_clean_no_outliers %>%
filter(is.na(beds)) %>%
select(bathrooms))
# they all have 1 bathroom
# I will impute them as 1 bed
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$beds),
c("beds")] <- 1
# review_scores_rating variable
# since this is a subjective rating by the customer, depending on their experience,
# I don't think there are any variables that will help with imputation here
# so I will assign the average rating
mean(listings_narrow_clean_no_outliers$review_scores_rating, na.rm = TRUE)
# will impute as 92, since this is an integer field
listings_narrow_clean_no_outliers[is.na(listings_narrow_clean_no_outliers$review_scores_rating),
c("review_scores_rating")] <- 92
# now check to make sure there are no NA values in the data
summary(listings_narrow_clean_no_outliers)
# looks good
# CORRELATION/COLLINEARITY
# investigating the pairwise correlations, before moving onto modeling
cor(listings_narrow_clean_no_outliers[c("price", "host_since", "host_local_yes",
"host_acceptance_rate", "calculated_host_listings_count",
"zipcode", "accommodates", "bathrooms", "beds",
"security_deposit_yes", "cleaning_fee_yes",
"guests_included", "minimum_nights", "maximum_nights",
"number_of_reviews", "review_scores_rating",
"room_type_Entire_home_apt", "room_type_Private_room")])
# looking at the correlations that price has with the other variables in this chart,
# I expect 'accommodates', 'beds', and 'room_type_Entire_home_apt' to have the
# strongest positive impact on price, while 'room_type_Private_room' should have a
# fairly strong negative impact on price. The other correlations don't suggest much
# of a relationship
# looking at collinearity of the full data set
# There are too many variables to create a legible pairs.panels plot.
# Looking into the correlations between predictor variables in the previous chart,
# I can see that there is a high correlation between 'beds' and 'accommodates' (0.80),
# which makes perfect sense.
# With this in mind, I will remove the 'beds' variable from my modeling,
# and just include 'accommodates'.
# I will bring the data (minus the 'beds' field) into a new dataframe that I will then normalize
listings_normalized <- listings_narrow_clean_no_outliers[,-10]
# NORMALIZE FEATURES
# Now I want to rescale the numeric variables onto a common, small interval.
# Both the kNN and neural network algorithms work best when features are on
# similar, small scales, and the ranges here vary quite a bit as is.
# I will apply min/max normalization to all variables
# first create a normalize function
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
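# quick sanity check of the function on a toy vector (illustrative only)...
# the values should map onto the [0, 1] interval
normalize(c(10, 20, 30, 40))
# 0.0000000 0.3333333 0.6666667 1.0000000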
# Apply the function to all columns excluding the id variable. The id won't be
# needed anymore
listings_normalized <- as.data.frame(lapply(listings_normalized[c(2:18)], normalize))
head(listings_normalized)
# looks good
# original range of numeric data
summary(listings_narrow_clean_no_outliers)
# normalized range
summary(listings_normalized)
# good all are between 0 and 1 now
# the data is ready for modeling
```
$$\textbf{Phase 4: Modeling}$$
In this stage I will create the training, validation, and testing subsets, construct the three machine learning prediction models that will be used (kNN, multiple linear regression, and neural network), and tune the models to end up with an optimal model for each. I chose to build kNN, regression, and neural network models for my analysis because my goal is to come up with a numeric prediction of the price of Airbnb listings based on their features. These models are all able to handle numeric predictions well, while some of the other models learned in class are better for handling categorical classification problems.
```{r}
# CREATE TRAINING/TESTING SUBSETS
# I will create a training subset, a validation subset, and a testing subset
# because I will be using the holdout method to evaluate the models.
# The training dataset will train the model and I will use the validation
# dataset to tune the model to find the optimal parameters for minimizing error.
# I will hold out the testing dataset to be used at the very end for final
# numeric predictions on the optimal tuned models.
# This is ideal because the data from the testing subset will be completely
# independent, having yet to be seen by the models.
# I will randomly split the data into training/validation/testing with
# a 50%/25%/25% split
# I am randomly splitting the records because I do not know if they are arranged
# in any sort of order in the original dataset. So taking a random sample ensures
# this.
# calculate sizes of each set based on the percentage split
listings_size <- nrow(listings_normalized)
listings_train_size <- round(listings_size * .50)
listings_validation_size <- round(listings_size * .25)
listings_test_size <- listings_size-listings_train_size-listings_validation_size
# create random order of records for the random splits
set.seed(250)
random_order <- order(runif(listings_size))
# split based on that random order
listings_train <- listings_normalized[random_order[1:listings_train_size], ]
listings_validate <- listings_normalized[random_order[(listings_train_size+1):
(listings_train_size+listings_validation_size)], ]
listings_test <-
listings_normalized[random_order[(listings_train_size+listings_validation_size+1):
listings_size], ]
# BUILD MODELS
# 1) kNN
# I will start with the kNN model. I will attempt to build my own function for this.
# Since I am building a numeric prediction model, I want to return the mean of the
# neighbors.
# I will first create a distance function to calculate the euclidean distance
# between two vectors, p and q
dist <- function(p,q)
{
d <- 0
for (i in 1:length(p)) {
d <- d + (p[i] - q[i])^2
}
dist <- sqrt(d)
}
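# note: defining a function named 'dist' masks the built-in stats::dist() for the rest of
# this session; the built-in version can still be reached explicitly as stats::dist()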
# testing the distance function on the first row of the full data and the second row
p <- listings_normalized[1,]
q <- listings_normalized[2,]
w <- dist(p,q)
w
# 1.841978
# now do the manual calculation to see if the function is working properly
sqrt(((p[1]-q[1])^2) + ((p[2]-q[2])^2) + ((p[3]-q[3])^2) +
((p[4]-q[4])^2) + ((p[5]-q[5])^2) + ((p[6]-q[6])^2) +
((p[7]-q[7])^2) + ((p[8]-q[8])^2) + ((p[9]-q[9])^2) +
((p[10]-q[10])^2) + ((p[11]-q[11])^2) + ((p[12]-q[12])^2) +
((p[13]-q[13])^2) + ((p[14]-q[14])^2) + ((p[15]-q[15])^2) +
((p[16]-q[16])^2) + ((p[17]-q[17])^2))
# 1.841978, it matches what the dist function gave
# now I will create a function to calculate the distances for all rows in the training data
all_dist <- function(training, unknown)
{
m <- nrow(training)
ds <- numeric(m)
q <- unknown
for (i in 1:m) {
p <- training[i,]
ds[i] <- dist(p,q)
}
all_dist <- ds
}
# check to see if the function works, using the first case in the validation set
n <- all_dist(listings_train[,2:17],listings_validate[1,2:17])
head(n)
# it works
# now identify the k nearest neighbors
nearest_neighbors <- function(neighbors,k)
{
ordered_neighbors <- order(neighbors)
nearest_neighbors <- ordered_neighbors[1:k]
}
# look at the 10 closest neighbors from the 'n' object I created previously
f <- nearest_neighbors(n,10)
f
# it works, returning the index of the records
# now I can combine these into a kNN function that predicts the price for
# a single record
knn_average <- function(training, unknown, k)
{
nb <- all_dist(training[,names(training) != "price"], unknown)
f <- nearest_neighbors(nb,k)
knn_average <- mean(training$price[f])
}
# now predicting the price for the first record in validation set... using k = 10
nn1 <- knn_average(listings_train, listings_validate[1,2:17], 10)
nn1
# the prediction is 0.1609375... remember this is the normalized value
# good, it works
# since I will want to test every record in the validation dataset, I will create a
# final function that predicts each record separately and combines them into
# a vector of all predicted prices in the validation set
knn_average_all <- function(training, validation, k)
{
m <- nrow(validation)
knns <- numeric(m)
for (i in 1:m) {
unknown <- validation[i,]
knns[i] <- knn_average(training, unknown, k)
}
knn_average_all <- knns
}
# now ready to run the kNN model... run it with k = 10
# kNN_predictions10 <- knn_average_all(listings_train, listings_validate[,2:17], 10)
# I tried running the kNN function that I built and it takes far too long. It got stuck with
# all of the calculations and did not finish. It is not a very efficient function, so
# I will use the knnreg function from the caret package, which returns the mean of the neighbors
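# (Aside) As a rough, illustrative sketch only: a vectorized re-write of the functions above,
# replacing the nested loops with matrix algebra, would likely be fast enough to run on the
# full validation set. It is shown here for reference and is not used for the results below.
knn_average_all_vec <- function(training, validation, k) {
  # drop the target column and convert to numeric matrices
  train_x <- as.matrix(training[, names(training) != "price"])
  valid_x <- as.matrix(validation[, names(validation) != "price"])
  # squared Euclidean distance between every validation row and every training row
  d2 <- outer(rowSums(valid_x^2), rowSums(train_x^2), "+") - 2 * valid_x %*% t(train_x)
  # for each validation row, average the price of the k nearest training rows
  apply(d2, 1, function(row) mean(training$price[order(row)[1:k]]))
}
# example call (not run): knn_average_all_vec(listings_train, listings_validate, 10)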
library(caret)
# try it with k = 10