---
title: "SICSS-HSE Tutorial: Reddit as a source of data"
output: html_document
---
## Prepared by Elizaveta Sivak, SICSS-HSE
[Reddit](https://www.reddit.com) is a discussion website, or a network of communities based on people's interests. People can use Reddit as a newsfeed, reading posts in different communities, or topical subforums (called "subreddits"), or they can post something themselves and discuss it with other users in the comments. Reddit is very popular, mostly in the US but also in other countries (the UK, Australia, Canada, much of Europe, etc.).
Reddit is also an interesting source of data for research in the social sciences.
Its main strength as a data source is that it gives access to hard-to-reach populations and to people with relatively uncommon characteristics, interests, or kinds of behavior: people with depression, stock market enthusiasts, QAnon casualties, and so on. These people are conveniently clustered in the related subreddits, which can be seen as sampling frames.
Reddit is often used for research. Some examples:
1. [Maria Antoniak et al. (2019)](https://dl.acm.org/doi/abs/10.1145/3359190) used r/BabyBumps (a subreddit with birth stories) to analyse narratives about birth and the positions of power of different agents
2. [Munmun de Choudhury et al.](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/viewFile/8075/8107) analysed r/Depression
For methodological questions about using Reddit for research purposes, see Adams, Artigiani and Wish (2019), who discuss the comparative strengths of Twitter and Reddit for conducting social media drug research, and Shatz (2016), who describes the advantages and disadvantages of recruiting participants on Reddit.
In this tutorial we will give some introductory information about accessing Reddit data.
The first question is how to select subreddits for your research. There are several options:
1. Use the subreddit search: http://www.reddit.com/reddits
2. Consult with people who have knowledge of the platform
3. Identify the subreddits that are most active for your keyword (code is below)
Ways to get Reddit data:
1. Use the [Pushshift repository](https://files.pushshift.io/reddit/). Thanks to Jason Baumgartner we actually have all Reddit data (posts and comments). Please consider a donation if you download large amounts of data. The main drawback is that the data are posted with a time lag, so you can't access the most recent posts (for instance, the 2020 Reddit data were published on Pushshift in July 2021). So when you need recent data instantly, or when you need only a specific subreddit or time period, downloading the full dumps from Pushshift is impossible or not very rational.
2. Use the Reddit API by building a URL of the form
https://www.reddit.com/r/{subreddit}/{listing}.json?limit={count}&t={timeframe}
where listing = controversial, best, hot, new, random, rising, top and timeframe = hour, day, week, month, year, all.
Example: https://www.reddit.com/r/depression/top.json?limit=100&t=year
You can construct the link for whichever subreddit and time frame you'd like, access the page, and then parse its content with standard web-scraping tools (see the sketch below).
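As an illustration, here is a minimal sketch (not evaluated) of reading such a listing directly into R. It assumes the jsonlite package is installed; the field names come from Reddit's listing JSON and may change, and Reddit may throttle anonymous requests.
```{r eval=FALSE}
# Sketch: fetch the top posts of r/depression for the past year from the JSON endpoint
# and flatten them into a data frame. Assumes jsonlite is installed.
library(jsonlite)
url <- "https://www.reddit.com/r/depression/top.json?limit=100&t=year"
raw <- fromJSON(url)                  # downloads and parses the listing
posts <- raw$data$children$data       # one row per submission (field names may vary)
head(posts[, c("title", "score", "num_comments", "created_utc")])
```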
However, there are already more convenient tools for getting data from Reddit using its API:
3. If you are a Python user, there is the [PRAW package](https://praw.readthedocs.io/en/stable/). Limitation: it can't extract submissions between specific dates. Given the Reddit API's cap of roughly 1000 posts per listing, it is not possible to get all the data from a subreddit (only about the 1000 most recent or 1000 most popular submissions).
4. For R users there is the [RedditExtractoR package](https://cran.r-project.org/web/packages/RedditExtractoR/RedditExtractoR.pdf) (author Ivan Rivera). Although with this package you can easily get neat data frames, the limitation is the same: it can't extract submissions between specific dates.
5. The good news is that there is a solution! Again thanks to Jason (JSON) Baumgartner, one can use the [Pushshift API](https://github.com/pushshift/api) to access data between specific dates.
All the instructions for constructing a request with this API are on its GitHub page. For instance, to search all subreddits for comments mentioning the term "parenting" and made between 2 and 4 days ago, you construct the link like this (there is an analogous endpoint for submissions):
https://api.pushshift.io/reddit/search/comment/?q=parenting&after=4d&before=2d&sort=asc
Another good thing about the Pushshift API is that it returns data in a more human-readable format than the standard Reddit API (open the link above and see for yourself). A sketch of querying it directly from R follows this list.
6. One can also use Pushshift not directly but via dedicated packages. We will use the [pushshiftR package](https://github.com/nathancunn/pushshiftR) (author Nathan Cunningham) to access the Pushshift API in R.
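Before turning to the packages, here is a minimal sketch (not evaluated) of querying the Pushshift comment-search endpoint directly from R. It assumes the jsonlite package is installed; the endpoint's availability, rate limits, and field names may change over time.
```{r eval=FALSE}
# Sketch: ask Pushshift for comments mentioning "parenting" made between 4 and 2 days ago
library(jsonlite)
ps_url <- "https://api.pushshift.io/reddit/search/comment/?q=parenting&after=4d&before=2d&sort=asc&size=25"
ps_raw <- fromJSON(ps_url)
comments <- ps_raw$data               # one row per comment (field names may vary)
head(comments[, c("subreddit", "author", "created_utc", "body")])
```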
Let's try two ways to get Reddit data: via RedditExtractoR and via pushshiftR.
### WAY 1 (RedditExtractoR)
Installing and loading the package:
```{r echo=T, results='hide',message=FALSE, warning=FALSE}
install.packages("RedditExtractoR", repos = "http://cran.us.r-project.org")
library(RedditExtractoR)
```
First, let's find out which subreddits are most active for our keyword (in this case "parenting"). The page_threshold argument controls how many pages will be searched for a given search term (1 by default). You can either get the most recent submissions (sort_by = 'new') or the submissions with the most comments (sort_by = 'comments').
```{r echo=T, results='hide',message=FALSE, warning=FALSE}
# search Reddit for submissions mentioning "parenting", across 10 result pages
links <- reddit_urls(
  search_terms   = "parenting",
  page_threshold = 10,
  sort_by        = 'new'
)
```
Then we can access the content of a submission using reddit_content():
```{r echo=T, results='hide',message=FALSE, warning=FALSE}
# pull the full thread (post plus comments) for the third URL in the search results
content <- reddit_content(links$URL[3])
```
Let's visualize the result:
```{r message=FALSE, warning=FALSE}
# install.packages('dplyr')
# install.packages('ggplot2')
library(dplyr)
library(ggplot2)

# keep only the subreddits that appear more than twice in the search results
links2 <- links %>%
  group_by(subreddit) %>%
  mutate(count = n()) %>%
  filter(count > 2)

# bar chart of these subreddits, ordered by how often they appear
g <- ggplot(links2, aes(x = reorder(subreddit, -count))) +
  geom_bar(stat = "count") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
g
```
The resulting plot shows the communities (subreddits) most relevant to getting Reddit data on parenting, where relevance means mentioning the term "parenting". One of these subreddits is r/Parenting. Let's extract some content from this subreddit with the get_reddit() function.
As before, we can get either the most recent posts (sort_by = 'new') or the posts with the most comments (sort_by = 'comments').
The dataset includes posts, the comments on each post, and context information (the date of each comment, the number of upvotes, etc.). An interesting feature is that we also get the structure of the comments. All of this with only one line of code!
```{r echo=T, results='hide',message=FALSE, warning=FALSE}
# download the 2 most recent result pages of posts (and their comments) from r/Parenting
prnt <- get_reddit(subreddit = "Parenting", page_threshold = 2, sort_by = 'new')
```
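To get a feel for what get_reddit() returned, you can inspect the object. Roughly speaking, there is one row per comment, with the post information repeated; the exact column names depend on the RedditExtractoR version, so treat the ones below as an assumption.
```{r eval=FALSE}
# quick inspection of the get_reddit() output (column names may differ across versions)
dim(prnt)                    # roughly one row per comment
str(prnt, max.level = 1)     # overview of the available columns
length(unique(prnt$title))   # number of distinct posts in the sample
```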
But, as mentioned before, with this function we can't get a whole subreddit (except for very new or very small subreddits).
How to get submissions between specific dates?
### WAY 2 (pushshiftR)
First, let's install the package. Installing pushshiftR takes a bit longer: you first need to install the devtools package.
```{r echo=T, results='hide',message=FALSE, warning=FALSE}
install.packages("devtools", repos = "http://cran.us.r-project.org")
library(devtools)
devtools::install_github("https://github.com/nathancunn/pushshiftR",force = TRUE)
library(pushshiftR)
# may be you'll be asked to install Rtools35 too. In this case install it via link https://cran.r-project.org/bin/windows/Rtools/history.html
```
Now we can extract data using pushshiftR. We can get either posts ("submission") or comments ("comment"). Using the "after" parameter, combined with a loop, we can also get older posts, not only the most recent ones; a sketch of such a loop follows the next chunk.
```{r echo=T, results='hide',message=FALSE, warning=FALSE}
# get 10 submissions from r/Parenting posted after 1 January 2019
prnt2 <- getPushshiftData(postType = "submission",
                          size = 10,
                          after = "1546300800", # Unix time, see https://www.unixtimestamp.com/index.php
                          subreddit = "Parenting",
                          nest_level = 1)
```
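Here is a minimal sketch (not evaluated) of the loop idea mentioned above: repeatedly call getPushshiftData(), each time moving the "after" cutoff to the newest timestamp seen so far, and bind the batches together. It assumes the returned data frame includes a created_utc column (a standard Pushshift field); adjust the number of batches and the batch size to your needs.
```{r eval=FALSE}
# Sketch: page through r/Parenting submissions in batches of 10, starting from 1 January 2019
library(dplyr)

after_ts <- "1546300800"                 # start of 2019 in Unix time
batches <- list()
for (i in 1:5) {                         # 5 batches, purely as an illustration
  batch <- getPushshiftData(postType = "submission",
                            size = 10,
                            after = after_ts,
                            subreddit = "Parenting",
                            nest_level = 1)
  if (nrow(batch) == 0) break            # stop when nothing is left
  batches[[i]] <- batch
  after_ts <- as.character(max(batch$created_utc))  # continue from the newest post seen
  Sys.sleep(1)                           # be polite to the API
}
prnt_all <- bind_rows(batches)
```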