---
title: "Final Project"
author: "Hanh Ta"
date: "12/12/2019"
output:
  html_document:
    toc: true
    toc_depth: 4
    theme: united
    highlight: tango
---
<h1> DC Airbnb Analysis </h1>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
<h2> Import the data </h2>
```{r import, message=FALSE, comment=FALSE}
library(tidyverse)
airbnb_df <- read_csv("listings_2.csv")
```
<h2> Clean the data </h2>
```{r clean}
airbnb_df[,c("host_response_rate",
"bathrooms",
"weekly_price",
"monthly_price",
"cleaning_fee",
"security_deposit",
"guests_included",
"extra_people",
"review_scores_rating")] <- NULL
```
```{r, eval=FALSE,include=FALSE}
#str(airbnb_df)
#$neighbourhood <- as.factor(airbnb_df$neighbourhood)
#airbnb_df$last_review <- as.Date(airbnb_df$last_review, "%m/%d/%y")
```
Several categorical variables of interest need to be converted to factors.
```{r convert}
airbnb_df$neighbourhood_cleansed <- as.factor(airbnb_df$neighbourhood_cleansed)
airbnb_df$neighbourhood <- as.factor(airbnb_df$neighbourhood)
airbnb_df$property_type <- as.factor(airbnb_df$property_type)
airbnb_df$room_type <- as.factor(airbnb_df$room_type)
airbnb_df$bed_type <- as.factor(airbnb_df$bed_type)
airbnb_df$cancellation_policy <- as.factor(airbnb_df$cancellation_policy)
```
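
The same conversion can also be written once over a vector of column names. The following is only a sketch on a small made-up data frame (`toy_df` and `factor_cols` are hypothetical names used for illustration); the same pattern should carry over to `airbnb_df`.

```{r, eval=FALSE}
# Hypothetical toy data frame; column names mirror two of the real ones
toy_df <- data.frame(
  room_type = c("Private room", "Entire home/apt", "Private room"),
  bed_type  = c("Real Bed", "Futon", "Real Bed"),
  price     = c(80, 150, 95),
  stringsAsFactors = FALSE
)

# Columns to convert, listed once
factor_cols <- c("room_type", "bed_type")

# Apply as.factor() to each listed column and assign the results back
toy_df[factor_cols] <- lapply(toy_df[factor_cols], as.factor)

# Check the resulting column classes
sapply(toy_df, class)
```

Listing the columns in one place keeps the conversion step short when more categorical variables are added later.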
### 1. Describe your dataset. Give the source of the dataset and a metadata listing for each variable.
<p>Source: Detailed Listings data for Washington, D.C. from Inside Airbnb (http://insideairbnb.com/get-the-data.html)</p>
Variable | Type | Description
-----------------------|----------|-------------------------------------------------
host_id | num | Host identification number
host_name | char | Name of host
neighbourhood_cleansed | factor | Property's neighborhood group
neighbourhood | factor | Property's neighborhood
zipcode | char | Property's zipcode
latitude | num | Latitude coordinate of property
longitude | num | Longitude coordinate of property
property_type | factor | Type of property
room_type | factor | Type of room
accommodates | num | Number of people the property can accommodate
bedrooms | num | Number of available bedrooms
beds | num | Number of available beds
bed_type | factor | Type of bed
price | num | Listing price
minimum_nights | num | Minimum number of nights per stay
availability_365 | num | Property's availability in the next 365 days
availability_30 | num | Property's availability in the next 30 days
availability_60 | num | Property's availability in the next 60 days
availability_90 | num | Property's availability in the next 90 days
reviews_per_month | num | Number of reviews per month
cancellation_policy | factor | Cancellation policy
### 2. Read in your dataset and calculate
#### a. The number of missing values in your dataset
#### b. The percentage of missing values in your dataset.
```{r}
airbnb_df %>% summarise(count=sum(is.na(airbnb_df)))
sum(is.na(airbnb_df))/prod(nrow(airbnb_df),ncol(airbnb_df))*100
```
There are 2000 missing values (1.04%) in this dataset recognized by R as NA.
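
To see which variables the missing values come from, a per-column breakdown can help. The following is a minimal sketch on a small made-up data frame (`toy_df` and its columns are hypothetical, not part of the Airbnb data); the same calls should apply to `airbnb_df`.

```{r, eval=FALSE}
# Hypothetical toy data frame standing in for airbnb_df
toy_df <- data.frame(
  price = c(100, NA, 80, 120),
  beds  = c(1, 2, NA, NA),
  room  = c("Private room", "Entire home/apt", NA, "Shared room")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_df))

# Overall percentage of missing cells
na_percent <- sum(is.na(toy_df)) / (nrow(toy_df) * ncol(toy_df)) * 100

na_per_column
na_percent
```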
### 3. Give TWO questions about your dataset that you are going to investigate.
<p><b>Question 1:</b> What are the most common Airbnb properties in D.C.? What is the variation in price for different types of property?</p>
<p><b>Question 2:</b> How does location influence property rental price?</p>
### 4. Perform EDA to answer your research questions.
#### Question 1. What are the most common Airbnb properties in D.C.? What is the variation in price for different types of property?
```{r}
summary(airbnb_df$price)
sd(airbnb_df$price)
```
<p>
<ul>
<li>$115 is the average listing price for properties on Airbnb in this data sample.</li>
<li>Minimum price: $10</li>
<li>Maximum price: $10,000</li>
</ul>
</p>
```{r}
prop_type <- airbnb_df %>% group_by(property_type) %>% summarise(average_price=mean(price,na.rm = TRUE),
min_price=min(price, na.rm=TRUE),
max_price=max(price, na.rm = TRUE),
std=sd(price,na.rm = TRUE),
min_night=min(minimum_nights,na.rm=TRUE),
accom=median(accommodates, na.rm=TRUE),
range=max(price, na.rm=TRUE)-min(price,na.rm=TRUE))
head(arrange(prop_type, average_price, min_night),3)
tail(arrange(prop_type, average_price, min_night),3)
```
<p>
<ul>
<li>Hostel has the lowest mean price of all, followed by Bungalow and Guest Suite.</li>
<li>Dome house and Resort have the highest mean prices. Their standard deviations are undefined because only one listing is recorded for each of these property types.</li>
</ul>
</p>
```{r,echo=FALSE, fig.align="center" }
ggplot(airbnb_df, aes(price,property_type)) +
geom_count(color="#D99543") +
labs(title="Common Listing on Airbnb",
x="Price [$]",
y="Accommodation Types") +
theme(legend.position = "bottom",
legend.key.size = unit(c(0.3),"lines")) +
scale_fill_brewer(palette="Set3")
```
<p> From the visualization, apartments appear to be the most common property type on Airbnb. Now let's look at the frequency table of property types to confirm:</p>
```{r}
table_prop <- as.data.frame(table(airbnb_df$property_type)/length(airbnb_df$property_type)*100)
head(arrange(table_prop,desc(Freq)),10)
```
<ul>
<li>Apartment: 46% of listings in this data sample</li>
<li>House: 21.2% of listings</li>
<li>Townhouse: 15.4% of listings</li>
<li>Condominium: 8% of listings</li>
</ul>
```{r, eval=FALSE,include=FALSE}
sub_prop <- subset(airbnb_df,
property_type == "Apartment"
| property_type == "Townhouse"
| property_type == "Condominium"
| property_type == "Hostel"
| property_type == "House")
```
```{r, echo=FALSE, fig.align="center"}
ggplot(airbnb_df,
aes(x = reorder(property_type, price, FUN = median),
y = price,
group=property_type,
fill=property_type)) +
geom_boxplot(outlier.alpha=0.2,
outlier.size = 0.5,
lwd=0.3,
mapping= aes(fill=property_type)) +
labs(title="Price Variation of Property Type",
y="Price [$]",
x="Property Type") +
theme(axis.text.x = element_text(angle = 0, hjust=0.5, size=9),
legend.position = "none") +
coord_flip(ylim=c(0,1500))
```
<p>
<ul>
<li>Resort has the highest median price, but there is only one listing of this property type.</li>
<li>House has the most outliers and the most variation in price compared to other property types.</li>
<li>Hostel is the cheapest Airbnb property type.</li>
<li>Townhouse and Condominium appear to have similar price ranges, but Condominium is slightly cheaper.</li>
</ul>
</p>
#### Question 2. How does location influence property rental price?
```{r, eval=FALSE,include=FALSE}
#levels(neighborhood_df$neighbourhood_cleansed) <- gsub("\n","",levels(neighborhood_df$neighbourhood_cleansed))
neighborhood_df <- airbnb_df %>% group_by(neighbourhood_cleansed) %>%
summarise(average_price=mean(price,na.rm = TRUE),
min_price=min(price, na.rm=TRUE),
max_price=max(price, na.rm = TRUE),
std=sd(price,na.rm = TRUE),
median_price=median(price, na.rm=TRUE),
min_night=min(minimum_nights,na.rm=TRUE),
accom=median(accommodates, na.rm=TRUE),
range=max(price, na.rm=TRUE)-min(price,na.rm=TRUE))
neighborhood_df %>% mutate(neighbourhood_cleansed=fct_reorder(neighbourhood_cleansed,median_price,.desc = FALSE)) %>%
ggplot(neighborhood_df,
mapping = aes(x=neighbourhood_cleansed,
y=median_price,fill=median_price)) +
geom_col() +
coord_flip() +
labs(title="Median Price Among D.C. Neighborhoods ",
y="Median Price [$]") +
theme(axis.text.y = element_text(angle = 0, hjust=1, size=7),
#aspect.ratio = 1.2,
axis.title.y = element_blank(),
legend.position = "none") +
#plot.title = element_text(hjust=4,vjust=0.2))
scale_fill_distiller(palette="Blues") +
scale_x_discrete(labels=c("Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir" = "Spring Valley, Palisades",
"Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont" = "Twining, Fairlawn, Randle Highlands",
"Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point" = "Southwest/Waterfront",
"North Michigan Park, Michigan Park, University Heights" = "North Michigan Park",
"Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill" = "Fort Totten, Pleasant Hill",
"Kalorama Heights, Adams Morgan, Lanier Heights" = "Kalorama Heights, Adams Morgan",
"Friendship Heights, American University Park, Tenleytown" = "Friendship Heights, Tenleytown",
"Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street" = "Downtown, Chinatown, Penn Quarters",
"Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights" = "Deanwood, Burrville, Grant Park",
"Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace" = "Cleveland Park, Woodley Park",
"Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View" = "Columbia Heights, Mt. Pleasant",
"Fairfax Village, Naylor Gardens, Hillcrest, Summit Park" = "Fairfax Village, Naylor Gardens",
"Woodland/Fort Stanton, Garfield Heights, Knox Hill" = "Woodland/Fort Stanton"))
neighborhood_df %>% mutate(neighbourhood_cleansed=fct_reorder(neighbourhood_cleansed,average_price,.desc = FALSE)) %>%
ggplot(neighborhood_df,
mapping = aes(x=neighbourhood_cleansed,
y=average_price,fill=average_price)) +
geom_col() +
coord_flip() +
labs(title="Average Price Among D.C. Neighborhoods ",
y="Average Price [$]") +
theme(axis.text.y = element_text(angle = 0, hjust=1, size=6),
aspect.ratio = 1.2,
axis.title.y = element_blank(),
legend.position = "none",
plot.title = element_text(hjust=5,
vjust=0.2)) +
scale_fill_distiller(palette="Blues")
```
```{r, echo=FALSE, fig.align="center",comment=FALSE}
neighborhood_df <- airbnb_df %>% group_by(neighbourhood_cleansed) %>%
summarise(average_price=mean(price,na.rm = TRUE),
min_price=min(price, na.rm=TRUE),
max_price=max(price, na.rm = TRUE),
std=sd(price,na.rm = TRUE),
median_price=median(price, na.rm=TRUE),
min_night=min(minimum_nights,na.rm=TRUE),
accom=median(accommodates, na.rm=TRUE),
range=max(price, na.rm=TRUE)-min(price,na.rm=TRUE))
neighborhood_df %>%
  mutate(neighbourhood_cleansed=fct_reorder(neighbourhood_cleansed,std,.desc = FALSE)) %>%
  # The piped data frame is already the plot data, so it is not passed a second time
  ggplot(mapping = aes(x=neighbourhood_cleansed,
                       y=std,fill=std)) +
geom_col() +
coord_flip() +
labs(title="Standard Deviation in Price Among D.C. Neighborhoods ",
y="Standard Deviation [$]") +
theme(axis.text.y = element_text(angle = 0, hjust=1, size=6),
#aspect.ratio = 1.2,
axis.title.y = element_blank(),
legend.position = "none") +
#plot.title = element_text(hjust=1.5, vjust=0.2)) +
scale_fill_distiller(palette="Blues") +
scale_x_discrete(labels=c("Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir" = "Spring Valley, Palisades",
"Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont" = "Twining, Fairlawn, Randle Highlands",
"Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point" = "Southwest/Waterfront",
"North Michigan Park, Michigan Park, University Heights" = "North Michigan Park",
"Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill" = "Fort Totten, Pleasant Hill",
"Kalorama Heights, Adams Morgan, Lanier Heights" = "Kalorama Heights, Adams Morgan",
"Friendship Heights, American University Park, Tenleytown" = "Friendship Heights, Tenleytown",
"Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street" = "Downtown, Chinatown, Penn Quarters",
"Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights" = "Deanwood, Burrville, Grant Park",
"Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace" = "Cleveland Park, Woodley Park",
"Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View" = "Columbia Heights, Mt. Pleasant",
"Fairfax Village, Naylor Gardens, Hillcrest, Summit Park" = "Fairfax Village, Naylor Gardens",
"Woodland/Fort Stanton, Garfield Heights, Knox Hill" = "Woodland/Fort Stanton"))
```
<p>
<ul>
<li>The Georgetown area has the highest standard deviation in listing price.</li>
<li>The Sheridan, Barry Farm, Buena Vista area has the smallest dispersion in price.</li>
</ul>
</p>
<p>Let's draw some boxplots to get further insight into the prices in each neighborhood:</p>
```{r, echo=FALSE, fig.align="center"}
ggplot(airbnb_df,
aes(x = reorder(neighbourhood_cleansed, price, FUN = median),
y = price)) +
geom_boxplot(outlier.alpha=0.2,
outlier.size = 0.5,
lwd=0.3,
aes(group=neighbourhood_cleansed,
fill=neighbourhood_cleansed)) +
labs(title="Price Variation Among D.C. Neighborhoods",
y="Price [$]",
x="Neighborhood") +
theme(axis.text.x = element_text(angle = 0, hjust=0.5, size=9),
axis.text.y = element_text(size = 7),
axis.title.y = element_blank(),
legend.position = "none") +
coord_flip(ylim=c(0,1200)) +
scale_x_discrete(labels=c("Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir" = "Spring Valley, Palisades",
"Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont" = "Twining, Fairlawn, Randle Highlands",
"Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point" = "Southwest/Waterfront",
"North Michigan Park, Michigan Park, University Heights" = "North Michigan Park",
"Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill" = "Fort Totten, Pleasant Hill",
"Kalorama Heights, Adams Morgan, Lanier Heights" = "Kalorama Heights, Adams Morgan",
"Friendship Heights, American University Park, Tenleytown" = "Friendship Heights, Tenleytown",
"Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street" = "Downtown, Chinatown, Penn Quarters",
"Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights" = "Deanwood, Burrville, Grant Park",
"Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace" = "Cleveland Park, Woodley Park",
"Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View" = "Columbia Heights, Mt. Pleasant",
"Fairfax Village, Naylor Gardens, Hillcrest, Summit Park" = "Fairfax Village, Naylor Gardens",
"Woodland/Fort Stanton, Garfield Heights, Knox Hill" = "Woodland/Fort Stanton"))
```
<p>
<ul>
<li>The West End, Foggy Bottom, GWU neighborhood group has the most variation in price, followed by Southwest/Waterfront.</li>
<li>The Downtown, Chinatown, Penn Quarters area has the highest median price.</li>
<li>The Eastland Gardens, Kenilworth area has the lowest median price.</li>
<li>Columbia Heights-Mt. Pleasant and Cathedral Heights appear to have similar price ranges, with Cathedral Heights having a slightly lower median price.</li>
</ul>
</p>
```{r,eval=FALSE,include=FALSE, fig.align="center"}
ggplot(airbnb_df,
mapping = aes(neighbourhood_cleansed,
fill=room_type)) +
geom_bar() +
coord_flip() +
labs(title="Room Type Distribution Among D.C. Neighborhoods") +
theme(legend.position = "bottom",
axis.text.y = element_text(angle = 0, hjust=1, size=6),
axis.title.y = element_blank(),
legend.title = element_blank(),
legend.key.size = unit(c(0.6),"lines"),
plot.title = element_text(hjust=2.5,vjust=0.2)) +
guides(fill = guide_legend(nrow = 2, byrow=TRUE)) +
scale_fill_brewer(palette="Set3") +
scale_x_discrete(labels=c("Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir" = "Spring Valley, Palisades",
"Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont" = "Twining, Fairlawn, Randle Highlands",
"Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point" = "Southwest/Waterfront",
"North Michigan Park, Michigan Park, University Heights" = "North Michigan Park",
"Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill" = "Fort Totten, Pleasant Hill",
"Kalorama Heights, Adams Morgan, Lanier Heights" = "Kalorama Heights, Adams Morgan",
"Friendship Heights, American University Park, Tenleytown" = "Friendship Heights, Tenleytown",
"Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street" = "Downtown, Chinatown, Penn Quarters",
"Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights" = "Deanwood, Burrville, Grant Park",
"Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace" = "Cleveland Park, Woodley Park",
"Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View" = "Columbia Heights, Mt. Pleasant",
"Fairfax Village, Naylor Gardens, Hillcrest, Summit Park" = "Fairfax Village, Naylor Gardens",
"Woodland/Fort Stanton, Garfield Heights, Knox Hill" = "Woodland/Fort Stanton"))
```
### 5. Write a two paragraph summary about what your EDA is telling you about your data.
<p> On average, Airbnb users visiting D.C. should expect to pay \$115 per night. The most expensive accommodation costs \$10,000 with a four-night minimum, and the cheapest option costs only \$10 per night. The three most affordable property types are hostels (\$46/night), bungalows (\$90/night), and guest suites (\$100/night), with standard deviations of \$32, \$70, and \$74 respectively. On the other hand, the resort is the most expensive lodging, with an undefined standard deviation since there is only one listing of this property type. The most common listings in D.C. are apartments, which are fairly low priced and account for 46% of all listings, followed by houses (21.2%), townhouses (15.4%), and condominiums (8%). In the "Price Variation of Property Type" graph, the house category shows the greatest variation in price and the highest price outlier. </p>
<p> To determine the Airbnb price range for D.C. neighborhoods, we first look at the standard deviation visualization. The Georgetown, Burleith-Hillandale area has the greatest dispersion in price. This is due to an outlier: the \$10,000 Historic Georgetown Residence. In contrast, the Sheridan, Barry Farm, Buena Vista neighborhood has the lowest standard deviation in price. Furthermore, the Downtown, Chinatown, Penn Quarters area has the highest median price. Columbia Heights-Mt. Pleasant and Cathedral Heights appear to have similar price ranges, with Cathedral Heights having a slightly lower median price. </p>
```{r not use, eval=FALSE,include=FALSE,warning=FALSE,echo=FALSE}
library(geojsonio)
dc_spdf <- geojson_read("neighbourhoods.geojson", what = "sp")
library(sp)
library(broom)
library(mapproj)
#'fortify' the data to get a dataframe format required by ggplot2
dc_fortified <- tidy(dc_spdf)
ggplot() +
geom_polygon(data=dc_fortified,
aes(x=long, y=lat, group=group),
fill="#69b3a2",
color="white") +
theme_void() +
coord_map() +
labs(title="D.C. Boundary") +
geom_point(data = point_coord, size = 0.2, shape = 19, na.rm=TRUE,
aes(x=longitude, y=latitude, colour=price)) +
scale_fill_gradient(low="red", high="yellow")
library(leaflet.extras)
point_coord[1:5000,c("longitude","latitude","price","neighbourhood")] %>%
leaflet() %>%
addTiles() %>%
addHeatmap(lng = ~longitude,
lat = ~latitude,
intensity= ~price,
group = ~neighbourhood,
minOpacity = 0.05,
max = 1,
radius = 20,
blur = 15)
```
<br>
<p>In case you are bored, here is an interactive map with detailed listings of Airbnb properties in D.C.</p>
```{r,echo=FALSE,warning=FALSE,message=FALSE,fig.align="center"}
library(leaflet)
library(leaflet.extras)
point_coord <- data.frame(airbnb_df[,c("latitude","longitude","price","neighbourhood","property_type")])
point_coord[1:9152,c("longitude","latitude","price","property_type")] %>%
leaflet() %>%
addTiles() %>%
  addMarkers(clusterOptions = markerClusterOptions(),
             # Formula popups are evaluated on the piped rows, so labels stay aligned with markers
             popup = ~paste("Property Type:", property_type, "<br>",
                            "Price: $", price)) %>%
addResetMapButton()
```
### 6. Give a third question about your dataset that you want to investigate using a two-sample t-test.
<p><b>Question:</b> Do townhouses have a higher average price compared to condominiums?</p>
<p> A one-sided two-sample t-test will be performed at the 95% confidence level. </p>
Null hypothesis | Alternative hypothesis
-----------------------------------------------------|-------------------------------------------------------
The average price of townhouses is equal to condos | The average price of townhouses is higher than condos.
$H_{o}:\mu_{T} = \mu_{C}$ | $H_{a}:\mu_{T} > \mu_{C}$
```{r}
townhouse <- airbnb_df %>% filter(property_type == "Townhouse")
condo <- airbnb_df %>% filter(property_type == "Condominium")
t.test(townhouse$price, condo$price, alternative="greater", conf.level = 0.95)
```
<p><b>P-value:</b> 0.1729 > $0.05 = \alpha$ </p>
<p><b>Conclusion:</b> Fail to reject null hypothesis. </p>
<p><b>Real-world interpretation:</b> The difference between the two sample means is not statistically significant. There is not enough evidence in our data to conclude that the average price of townhouses is higher than that of condominiums. The price difference observed in the visualization is plausibly due to chance. </p>
<p> A 95% confidence level means that intervals constructed this way contain the true difference between the two group means 95% of the time. Because the test is one-sided, the interval is one-sided as well: it runs from -6.36 to $\infty$. This interval contains 0, which implies that 0 is a reasonable possibility for the true difference. Hence, we fail to reject the null hypothesis. </p>
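
<p> For reference, `t.test()` runs Welch's unequal-variance test by default (`var.equal = FALSE`), so the statistic behind this result has the form below. The symbols $\bar{x}$, $s^{2}$, and $n$ are introduced here for illustration and denote the sample mean, sample variance, and sample size of the townhouse (T) and condominium (C) groups. </p>

$$t = \frac{\bar{x}_{T} - \bar{x}_{C}}{\sqrt{\dfrac{s_{T}^{2}}{n_{T}} + \dfrac{s_{C}^{2}}{n_{C}}}}$$

<p> Only a sufficiently large positive value of $t$ would support $H_{a}:\mu_{T} > \mu_{C}$. </p>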
```{r, echo=FALSE, warning=FALSE,message=FALSE,fig.align="center"}
library(plotly)
# Combine both groups so that a single boxplot layer draws them side by side;
# outlier.alpha is a geom_boxplot() argument, not an aesthetic
p <- ggplot(bind_rows(townhouse, condo),
            aes(x = property_type, y = price, fill = property_type)) +
  geom_boxplot(outlier.alpha = 0.1) +
  labs(title="Price Distribution of Townhouses and Condominiums", y="Price [$]", x="Property Type") +
  theme(legend.position = "none",
        axis.title.x = element_blank())
ggplotly(p)
```
### 7. Give a fourth question about your dataset that you can investigate using a Chi-Square test.
<p><b>Question:</b> Is there a relationship between room type and bed type? </p>
We'll perform a Chi-square test with $\alpha = 0.05$.
Null hypothesis | Alternative hypothesis
------------------------------------------------------|------------------------------------------------------------
$H_{o}:$ Room type and bed type are independent. | $H_{a}:$ Room type and bed type are dependent.
```{r, echo=FALSE, fig.align="center",comment=FALSE}
room_bed <- airbnb_df %>% count(room_type, bed_type)
#levels(neighbor_prop$neighbourhood_cleansed) <- gsub("\n", " ", levels(neighbor_prop$neighbourhood_cleansed))
ggplot(data=room_bed) +
geom_tile(mapping = aes(y = room_type,
x = bed_type,
fill = n)) +
theme(axis.title = element_blank(),
legend.key.size = unit(0.8, "lines"),
axis.text.x = element_text(angle = 0,
size = 9,
hjust=0.5,
vjust=0.5),
axis.text.y = element_text(size = 9),
panel.grid.major = element_blank()) +
scale_fill_continuous(limits=c(0,6500), breaks=seq(0,6500,by=1000))
```
```{r, eval=FALSE,include=FALSE}
scale_y_discrete(labels=c("Spring Valley, Palisades, Wesley Heights, Foxhall Crescent, Foxhall Village, Georgetown Reservoir" = "Spring Valley, Palisades",
"Twining, Fairlawn, Randle Highlands, Penn Branch, Fort Davis Park, Fort Dupont" = "Twining, Fairlawn, Randle Highlands",
"Southwest Employment Area, Southwest/Waterfront, Fort McNair, Buzzard Point" = "Southwest/Waterfront",
"North Michigan Park, Michigan Park, University Heights" = "North Michigan Park",
"Lamont Riggs, Queens Chapel, Fort Totten, Pleasant Hill" = "Fort Totten, Pleasant Hill",
"Kalorama Heights, Adams Morgan, Lanier Heights" = "Kalorama Heights, Adams Morgan",
"Friendship Heights, American University Park, Tenleytown" = "Friendship Heights, Tenleytown",
"Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street" = "Downtown, Chinatown, Penn Quarters",
"Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights" = "Deanwood, Burrville, Grant Park",
"Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace" = "Cleveland Park, Woodley Park",
"Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View" = "Columbia Heights, Mt. Pleasant",
"Fairfax Village, Naylor Gardens, Hillcrest, Summit Park" = "Fairfax Village, Naylor Gardens",
"Woodland/Fort Stanton, Garfield Heights, Knox Hill" = "Woodland/Fort Stanton"))
```
```{r}
table_room_bed <- table(airbnb_df$room_type, airbnb_df$bed_type)
result <- chisq.test(table_room_bed)
result
```
<p><b>P-value:</b> 3.491E-14 < $0.05 = \alpha$.</p>
<p><b>Conclusion:</b> Reject null hypothesis.</p>
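<p><b>Real-world interpretation:</b> There is enough evidence to show that room type and bed type are related. However, R warns that the chi-squared approximation may be incorrect because some expected cell counts are small, so this result should be interpreted with caution.</p>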
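
<p> `chisq.test()` warns "Chi-squared approximation may be incorrect" when some expected cell counts fall below 5. The sketch below shows one way to inspect the expected counts and to request a Monte Carlo p-value instead. It uses a made-up contingency table (`toy_table` and its counts are hypothetical, not the Airbnb data); the same calls could be applied to `table_room_bed`. </p>

```{r, eval=FALSE}
# Hypothetical contingency table with some very small cell counts
toy_table <- matrix(c(50, 2, 1,
                      30, 1, 3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(room = c("Entire home/apt", "Private room"),
                                    bed  = c("Real Bed", "Futon", "Airbed")))

# Expected counts under independence; suppressWarnings() hides the
# approximation warning that this check is meant to explain
expected_counts <- suppressWarnings(chisq.test(toy_table))$expected
expected_counts

# Number of cells whose expected count is below 5
sum(expected_counts < 5)

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(1)
chisq.test(toy_table, simulate.p.value = TRUE, B = 2000)
```

<p> If the simulated p-value leads to the same decision as the original test, the conclusion is less likely to be an artifact of the small cells. </p>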
### 8. Give a fifth question about your dataset that involves the covariation of two quantitative variables.
<p><b>Question:</b> Suppose I am a hostel owner who wants to predict the price from the number of beds I have. How does the price per night relate to the number of beds?</p>
<p><b>Null hypothesis:</b> $H_{o}:$ There is no correlation between the number of beds and the price.</p>
<p><b>Alternative hypothesis:</b> $H_{a}:$ There is a correlation between the number of beds and the price.</p>
```{r,eval=FALSE,include=FALSE}
#apartment[c("price","accommodates","minimum_nights","availability_365","review_scores_rating","reviews_per_month")]
#pairs(airbnb_data, col="blue")
```
```{r, echo=FALSE,fig.align="center"}
ggplot(data = airbnb_df, mapping = aes(y = price, x = beds)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title="Price Prediction based on the Number of Beds",
x="Number of Beds",
y="Price [$]")
```
```{r}
lin_model <- lm(price ~ beds, data = airbnb_df)
lin_model
summary(lin_model)
```
The linear model is: $$\text{price} = 50.52 \times \text{number of beds} + 95.96$$
<p><b>P-value:</b> less than 2.2E-16, which is below $\alpha = 0.05$.</p>
<p>The slope is statistically significant, so there is evidence of a linear relationship between the number of beds and the price.</p>
```{r}
cor(airbnb_df$price, airbnb_df$beds, use = 'na.or.complete')
```
<p> With $r^2 = 0.06401$, about 6.4% of the variation in price is explained by the number of beds. </p>
<p> $r = 0.2532$ indicates a weak positive correlation between the two variables.</p>
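
<p> To illustrate how the fitted line would be used, the sketch below plugs a few bed counts into the reported equation. `predict_price()` is a hypothetical helper written for this illustration, and with such a low $r^2$ its predictions should be read as rough guides only. </p>

```{r, eval=FALSE}
# Coefficients as reported for the linear model in this section
intercept <- 95.96
slope     <- 50.52

# Hypothetical helper: predicted nightly price for a given number of beds
predict_price <- function(beds) {
  intercept + slope * beds
}

# Predicted prices for listings with 1, 2, and 4 beds
predict_price(c(1, 2, 4))
```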
### 9. Write a two-paragraph summary of any ethical concerns about your dataset and/or project.
<p>There are some limitations I encountered while analyzing the dataset. The Airbnb data from Inside Airbnb was last retrieved on November 22, 2019, so the information is based solely on what was scraped from the Airbnb website on that date. In addition, historical data for the property prices is not available in this dataset.</p>
<p> While performing the Chi-squared test for the two categorical variables room_type and bed_type, the test results came with the warning "Chi-squared approximation may be incorrect". This refers to the small expected counts for some combinations of the variables in the dataset; hence, the approximation may be poor. Since the p-value is very small compared to alpha, the null hypothesis was still rejected.</p>