<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>
<meta charset="utf-8">
<meta name="generator" content="quarto-1.3.353">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="author" content="Paul Binder">
<meta name="author" content="Julian Kohne">
<meta name="author" content="Johanna Mehltretter">
<meta name="author" content="[Mark Sparhuber]">
<meta name="author" content="Birgit Zeyer-Gliozzo">
<title>Relationship Advice on Reddit</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
ul.task-list li input[type="checkbox"] {
width: 0.8em;
margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */
vertical-align: middle;
}
/* CSS for syntax highlighting */
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
}
pre.numberSource { margin-left: 3em; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
</style>
<script src="SICSS_Group_project_Reddit_files/libs/clipboard/clipboard.min.js"></script>
<script src="SICSS_Group_project_Reddit_files/libs/quarto-html/quarto.js"></script>
<script src="SICSS_Group_project_Reddit_files/libs/quarto-html/popper.min.js"></script>
<script src="SICSS_Group_project_Reddit_files/libs/quarto-html/tippy.umd.min.js"></script>
<script src="SICSS_Group_project_Reddit_files/libs/quarto-html/anchor.min.js"></script>
<link href="SICSS_Group_project_Reddit_files/libs/quarto-html/tippy.css" rel="stylesheet">
<link href="SICSS_Group_project_Reddit_files/libs/quarto-html/quarto-syntax-highlighting.css" rel="stylesheet" id="quarto-text-highlighting-styles">
<script src="SICSS_Group_project_Reddit_files/libs/bootstrap/bootstrap.min.js"></script>
<link href="SICSS_Group_project_Reddit_files/libs/bootstrap/bootstrap-icons.css" rel="stylesheet">
<link href="SICSS_Group_project_Reddit_files/libs/bootstrap/bootstrap.min.css" rel="stylesheet" id="quarto-bootstrap" data-mode="light">
</head>
<body>
<div id="quarto-content" class="page-columns page-rows-contents page-layout-article">
<div id="quarto-margin-sidebar" class="sidebar margin-sidebar">
<nav id="TOC" role="doc-toc" class="toc-active">
<h2 id="toc-title">Table of contents</h2>
<ul>
<li><a href="#setup" id="toc-setup" class="nav-link active" data-scroll-target="#setup">Setup</a>
<ul>
<li><a href="#basic-idea" id="toc-basic-idea" class="nav-link" data-scroll-target="#basic-idea">Basic Idea</a></li>
<li><a href="#research-questions" id="toc-research-questions" class="nav-link" data-scroll-target="#research-questions">Research Questions</a></li>
<li><a href="#data-selection" id="toc-data-selection" class="nav-link" data-scroll-target="#data-selection">Data Selection</a></li>
<li><a href="#collecting-data" id="toc-collecting-data" class="nav-link" data-scroll-target="#collecting-data">Collecting Data</a>
<ul>
<li><a href="#things-we-tried" id="toc-things-we-tried" class="nav-link" data-scroll-target="#things-we-tried">Things we tried</a></li>
<li><a href="#main-problems" id="toc-main-problems" class="nav-link" data-scroll-target="#main-problems">Main Problems</a></li>
<li><a href="#our-solution" id="toc-our-solution" class="nav-link" data-scroll-target="#our-solution">Our Solution</a></li>
</ul></li>
</ul></li>
<li><a href="#preprocessing" id="toc-preprocessing" class="nav-link" data-scroll-target="#preprocessing">(Pre)processing</a>
<ul>
<li><a href="#set-up-loading-packages-and-data" id="toc-set-up-loading-packages-and-data" class="nav-link" data-scroll-target="#set-up-loading-packages-and-data">Set-Up: Loading Packages and Data</a></li>
<li><a href="#extracting-age-gender-information" id="toc-extracting-age-gender-information" class="nav-link" data-scroll-target="#extracting-age-gender-information">Extracting Age & Gender Information</a>
<ul>
<li><a href="#checking-with-visual-sanity-check" id="toc-checking-with-visual-sanity-check" class="nav-link" data-scroll-target="#checking-with-visual-sanity-check">Checking with visual sanity check</a></li>
<li><a href="#multiple-ages" id="toc-multiple-ages" class="nav-link" data-scroll-target="#multiple-ages">Multiple ages</a></li>
<li><a href="#taking-first-instances" id="toc-taking-first-instances" class="nav-link" data-scroll-target="#taking-first-instances">Taking first instances</a></li>
<li><a href="#dealing-with-weird-cases" id="toc-dealing-with-weird-cases" class="nav-link" data-scroll-target="#dealing-with-weird-cases">Dealing with weird cases</a></li>
<li><a href="#distribution-of-reasonable-age-range" id="toc-distribution-of-reasonable-age-range" class="nav-link" data-scroll-target="#distribution-of-reasonable-age-range">Distribution of reasonable age range</a></li>
<li><a href="#building-age-groups" id="toc-building-age-groups" class="nav-link" data-scroll-target="#building-age-groups">Building age groups</a></li>
<li><a href="#non-cis-gendered-authors" id="toc-non-cis-gendered-authors" class="nav-link" data-scroll-target="#non-cis-gendered-authors">Non cis-gendered Authors</a></li>
</ul></li>
<li><a href="#automated-gender-and-age-detection" id="toc-automated-gender-and-age-detection" class="nav-link" data-scroll-target="#automated-gender-and-age-detection">Automated Gender and Age detection</a></li>
<li><a href="#data-wrangling-formatting-and-subsetting" id="toc-data-wrangling-formatting-and-subsetting" class="nav-link" data-scroll-target="#data-wrangling-formatting-and-subsetting">Data Wrangling: Formatting and Subsetting</a></li>
<li><a href="#preprocessing-text-data" id="toc-preprocessing-text-data" class="nav-link" data-scroll-target="#preprocessing-text-data">Preprocessing Text Data</a></li>
</ul></li>
<li><a href="#analysis" id="toc-analysis" class="nav-link" data-scroll-target="#analysis">Analysis</a>
<ul>
<li><a href="#descriptive-statistics" id="toc-descriptive-statistics" class="nav-link" data-scroll-target="#descriptive-statistics">Descriptive Statistics</a>
<ul>
<li><a href="#distribution-of-author-age-groups" id="toc-distribution-of-author-age-groups" class="nav-link" data-scroll-target="#distribution-of-author-age-groups">Distribution of Author Age Groups</a></li>
<li><a href="#age-as-a-continuous-variable" id="toc-age-as-a-continuous-variable" class="nav-link" data-scroll-target="#age-as-a-continuous-variable">Age as a continuous variable</a></li>
<li><a href="#age-distribution-by-subreddit-and-gender-for-non-cis-gender-authors" id="toc-age-distribution-by-subreddit-and-gender-for-non-cis-gender-authors" class="nav-link" data-scroll-target="#age-distribution-by-subreddit-and-gender-for-non-cis-gender-authors">Age distribution by subreddit and gender for non-cis gender authors</a></li>
<li><a href="#distribution-of-author-gender" id="toc-distribution-of-author-gender" class="nav-link" data-scroll-target="#distribution-of-author-gender">Distribution of Author Gender</a></li>
</ul></li>
<li><a href="#structural-topic-modelling" id="toc-structural-topic-modelling" class="nav-link" data-scroll-target="#structural-topic-modelling">Structural Topic Modelling</a></li>
<li><a href="#assigning-meaning-to-topics" id="toc-assigning-meaning-to-topics" class="nav-link" data-scroll-target="#assigning-meaning-to-topics">Assigning Meaning to Topics</a></li>
<li><a href="#topics" id="toc-topics" class="nav-link" data-scroll-target="#topics">Topics</a></li>
<li><a href="#resulting-topics-sorted-by-frequency" id="toc-resulting-topics-sorted-by-frequency" class="nav-link" data-scroll-target="#resulting-topics-sorted-by-frequency">Resulting Topics (sorted by frequency)</a></li>
</ul></li>
<li><a href="#results" id="toc-results" class="nav-link" data-scroll-target="#results">Results</a>
<ul>
<li><a href="#effects-by-age" id="toc-effects-by-age" class="nav-link" data-scroll-target="#effects-by-age">Effects by Age</a></li>
<li><a href="#effects-by-gender" id="toc-effects-by-gender" class="nav-link" data-scroll-target="#effects-by-gender">Effects by Gender</a></li>
<li><a href="#relationships-between-topics" id="toc-relationships-between-topics" class="nav-link" data-scroll-target="#relationships-between-topics">Relationships between topics</a></li>
<li><a href="#if-we-had-unlimited-time-we-would-continue-by" id="toc-if-we-had-unlimited-time-we-would-continue-by" class="nav-link" data-scroll-target="#if-we-had-unlimited-time-we-would-continue-by">If we had unlimited time, we would continue by…</a></li>
</ul></li>
<li><a href="#takeaways" id="toc-takeaways" class="nav-link" data-scroll-target="#takeaways">Takeaways</a></li>
<li><a href="#thank-you" id="toc-thank-you" class="nav-link" data-scroll-target="#thank-you">Thank You!</a></li>
</ul>
</nav>
</div>
<main class="content" id="quarto-document-content">
<header id="title-block-header" class="quarto-title-block default">
<div class="quarto-title">
<div class="quarto-title-block"><div><h1 class="title">Relationship Advice on Reddit</h1><button type="button" class="btn code-tools-button" id="quarto-code-tools-source"><i class="bi"></i> Code</button></div></div>
</div>
<div class="quarto-title-meta">
<div>
<div class="quarto-title-meta-heading">Authors</div>
<div class="quarto-title-meta-contents">
<p>Paul Binder </p>
<p>Julian Kohne </p>
<p>Johanna Mehltretter </p>
<p>[Mark Sparhuber] </p>
<p>Birgit Zeyer-Gliozzo </p>
</div>
</div>
</div>
</header>
<section id="setup" class="level1">
<h1>Setup</h1>
<section id="basic-idea" class="level2">
<h2 class="anchored" data-anchor-id="basic-idea">Basic Idea</h2>
<p>We use data from subreddits where people ask for relationship advice, specifically <a href="https://www.reddit.com/r/relationships/">r/relationships</a> and <a href="https://www.reddit.com/r/relationship_advice/">r/relationship_advice</a>, to understand how users discuss relationship problems online. As a first step, we aim to categorize posts using structural topic models in order to explore prevalent topics and examine patterns regarding sociodemographic factors, such as age and gender. Additionally, we might scrape the comments on these posts and use sentiment analysis to investigate reactions to topics as well as redditors’ interest in and judgement of different topics.</p>
</section>
<section id="research-questions" class="level2">
<h2 class="anchored" data-anchor-id="research-questions">Research Questions</h2>
<p>We set out to answer the following questions:</p>
<ul>
<li><p>Which topics/relationship issues are discussed in Reddit relationship posts?</p></li>
<li><p>Are there differences in the discussed topics according to gender/age/time?</p></li>
<li><p><strong>[BONUS]</strong> How do users react to posts about relationship issues?</p>
<ul>
<li><p>Do the upvotes of posts differ according to topic/gender/age/time?</p></li>
<li><p>How are different issues discussed/perceived/liked?</p></li>
</ul></li>
</ul>
</section>
<section id="data-selection" class="level2">
<h2 class="anchored" data-anchor-id="data-selection">Data Selection</h2>
<p>We decided to use two specific subreddits that are thematically relevant and have a high number of posts and active users.</p>
<ul>
<li><a href="https://www.reddit.com/r/relationships/">r/relationships</a>:
<ul>
<li><p><strong>Description:</strong> /r/Relationships is a community built around helping people and the goal of providing a platform for interpersonal relationship advice between redditors. We seek posts from users who have specific and personal relationship quandaries that other redditors can help them try to solve.</p></li>
<li><p><strong>No of subscribers:</strong> 3.4 Million</p></li>
</ul></li>
<li><a href="https://www.reddit.com/r/relationship_advice/">r/relationship_advice</a>:
<ul>
<li><p><strong>Description:</strong> Need help with your relationship? Whether it’s romance, friendship, family, co-workers, or basic human interaction: we’re here to help!</p></li>
<li><p><strong>No of subscribers:</strong> 9.6 Million</p></li>
</ul></li>
</ul>
</section>
<section id="collecting-data" class="level2">
<h2 class="anchored" data-anchor-id="collecting-data">Collecting Data</h2>
<p>Our aim was to collect as much data as possible to:</p>
<ol type="a">
<li><p>Enable us to see trends and patterns over time</p></li>
<li><p>Achieve sufficient sample sizes even for marginalized groups (trans people, non-binary redditors)</p></li>
<li><p>Enable us to look for temporal patterns between different sociodemographic groups</p></li>
</ol>
<section id="things-we-tried" class="level3">
<h3 class="anchored" data-anchor-id="things-we-tried">Things we tried</h3>
<p>We tried <strong>many</strong> different approaches because, unfortunately, many of them no longer worked:</p>
<ul>
<li><strong>R wrappers</strong>
<ul>
<li><a href="https://github.com/ivan-rivera/RedditExtractor">RedditExtractoR</a> ⚠️</li>
<li><a href="https://github.com/schochastics/PSAWR">PSAWR</a> 🛑</li>
<li><a href="https://github.com/mkearney/rreddit">rreddit</a> 🛑</li>
</ul></li>
<li><strong>Python wrapper</strong>
<ul>
<li><a href="https://github.com/praw-dev/praw">PRAW</a> ⚠️</li>
</ul></li>
<li><strong>Reddit API</strong>
<ul>
<li><a href="https://www.redditinc.com/policies/data-api-terms">Free API</a> ⚠️</li>
<li><a href="https://www.reddit.com/wiki/api/">Paid API</a> ⚠️</li>
</ul></li>
<li><strong>Webscraping</strong>
<ul>
<li><a href="https://github.com/tidyverse/rvest">rvest</a> 🛑</li>
<li><a href="https://github.com/ropensci/RSelenium">RSelenium</a> 🛑</li>
</ul></li>
</ul>
</section>
<section id="main-problems" class="level3">
<h3 class="anchored" data-anchor-id="main-problems">Main Problems</h3>
<ul>
<li><p>As of June 30th, 2023, Reddit closed down its public API, likely to prevent providers of LLMs from using its data for free as training data (see <a href="https://www.theverge.com/2023/6/9/23755640/reddit-api-changes-apps-apollo-shut-down-ama-spez-steve-huffman">here</a>). All wrappers that rely directly on Reddit’s API are thus heavily limited or outright broken.</p></li>
<li><p>In addition, Reddit implemented measures to prevent webscraping, specifically <a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM">shadow DOMs</a>. While it is still possible to get data by injecting JavaScript, this is extremely tedious, takes a lot of time, and is probably against Reddit’s TOS.</p></li>
</ul>
</section>
<section id="our-solution" class="level3">
<h3 class="anchored" data-anchor-id="our-solution">Our Solution</h3>
<ul>
<li><strong>Pre-scraped data</strong>
<ul>
<li>Luckily, we had some data that a student assistant had scraped before the API was closed</li>
</ul></li>
<li><strong>Historic data dumps</strong>
<ul>
<li>Pushshift used to provide an API for accessing Reddit data. With Reddit’s own API shutting down, Pushshift has shut down too. However, dumps of historic data that were extracted before Reddit’s API shut down are still available. These dumps contain all posts and comments from the most frequently used subreddits from 2005 to 2013.</li>
</ul></li>
</ul>
<p><strong><em>Course of Action:</em></strong> We will first use the pre-scraped data to develop our methods and write our code, and then run the code on the (much bigger) data dump, which we will download and preprocess in the meantime.</p>
</section>
</section>
</section>
<section id="preprocessing" class="level1">
<h1>(Pre)processing</h1>
<section id="set-up-loading-packages-and-data" class="level2">
<h2 class="anchored" data-anchor-id="set-up-loading-packages-and-data">Set-Up: Loading Packages and Data</h2>
<p>We start by installing and loading all packages that we need for our project.</p>
<p>Next, we import the raw data that was previously scraped.</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>data <span class="ot"><-</span> <span class="fu">readRDS</span>(<span class="st">'data/previous/RedditData_Raw.rds'</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section>
<section id="extracting-age-gender-information" class="level2">
<h2 class="anchored" data-anchor-id="extracting-age-gender-information">Extracting Age & Gender Information</h2>
<p><img src="./img/Age_Gender_info_example_red.png" class="img-fluid"></p>
<p>In the next step, we extracted age and gender information about the authors of the posts from the post titles. As a first approach, we did this extraction with regular expressions:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="./img/Age_Gender_info_example_green.png" class="img-fluid figure-img"></p>
<figcaption class="figure-caption">Demographic Information on r/relationships</figcaption>
</figure>
</div>
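<p>The extraction code itself is not shown here. As a minimal sketch of how such a regex step could look in base R, one could anchor on “I” or “me” directly before a bracketed age and gender. The pattern, the example vector <code>titles_example</code>, and the result <code>author_info_example</code> are illustrative assumptions, not necessarily the pattern used for the results in this report:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Example titles in the style found on r/relationships
titles_example <- c(
  "I (26F) feel suffocated in my relationship with bf (25M) because he never wants alone time",
  "I (23F) want to break up with my financially irresponsible boyfriend (25M)",
  "Navigating an Open ( 38m) Relationship with 34F"
)

# Assumed pattern: "i" or "me", an opening bracket, then either
# age + gender letter or gender letter + age, then a closing bracket
pattern <- "\\b(i|me)\\s*[\\(\\[]\\s*([0-9]{1,3}\\s*[mf]|[mf]\\s*[0-9]{1,3})\\s*[\\)\\]]"

# Find all matches per title (case-insensitive via tolower)
lower_titles <- tolower(titles_example)
matches <- regmatches(lower_titles, gregexpr(pattern, lower_titles, perl = TRUE))

# Strip the leading "i"/"me" and the brackets, keeping only digits and m/f
author_info_example <- lapply(matches, function(m) {
  gsub("[^0-9mf]", "", sub("^(i|me)", "", m))
})

# Number of detected author info boxes per title
table(sapply(author_info_example, length))</code></pre></div>
</div>
<p>In this sketch, a title without an “I” or “me” directly before the bracket yields an empty element, so it is counted as having zero author info boxes.</p>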
<p>Next, we check how many author info boxes we detect per post:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># how many author info boxes we detect per post</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="fu">table</span>(<span class="fu">sapply</span>(AuthorInfo,length))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
0 1 2 3
2181 70742 3189 24 </code></pre>
</div>
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># how many posts have more than one author info box detected</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="fu">table</span>(<span class="fu">sapply</span>(AuthorInfo,length) <span class="sc">></span> <span class="dv">1</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
FALSE TRUE
72923 3213 </code></pre>
</div>
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="fu">prop.table</span>(<span class="fu">table</span>(<span class="fu">sapply</span>(AuthorInfo,length) <span class="sc">></span> <span class="dv">1</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
FALSE TRUE
0.9577992 0.0422008 </code></pre>
</div>
</div>
<p><strong>PROBLEM:</strong> How do we decide, based on “I” and “My” indicators, which author info to take if a post has more than one?</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="./img/Age_Gender_info_example_green.png" class="img-fluid figure-img"></p>
<figcaption class="figure-caption">Author vs. other Information on r/relationships</figcaption>
</figure>
</div>
<section id="checking-with-visual-sanity-check" class="level3">
<h3 class="anchored" data-anchor-id="checking-with-visual-sanity-check">Checking with visual sanity check</h3>
<div class="cell">
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Visual sanity check</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>Test <span class="ot"><-</span> <span class="fu">cbind.data.frame</span>(data<span class="sc">$</span>title,AuthorInfo)</span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="fu">rownames</span>(Test) <span class="ot"><-</span> <span class="cn">NULL</span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(Test[,<span class="fu">c</span>(<span class="dv">1</span>,<span class="dv">2</span>)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> data$title
1 I (26F) feel suffocated in my relationship with bf (25M) because he never wants alone time
2 I'm 33M and she's 32F, I just don't what to do.
3 Navigating an Open ( 38m) Relationship with 34F
4 Was it awkward I unintentionally looked at this gas ls coworker and she noticed? [33f] [22m]
5 How do I (26F) stop thinking about my last ex (24M), when I am a week from getting married to my fiance (21M)?
6 I (23F) want to break up with my financially irresponsible boyfriend (25M)
AuthorInfo
1 26f
2 <NA>
3 <NA>
4 22m
5 26f
6 23f</code></pre>
</div>
</div>
<h3>Splitting Age and Gender information</h3>
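<p>A minimal sketch of this splitting step in base R, assuming author info strings in the lowercase format shown in the sanity check (for example <code>26f</code>). The vector <code>author_info_strings</code> and the names <code>age_example</code> and <code>gender_example</code> are hypothetical and only serve as an illustration:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical author info strings, including a missing value
author_info_strings <- c("26f", "m30", NA, "23f")

# Age: drop everything that is not a digit, then convert to numeric
age_example <- as.numeric(gsub("[^0-9]", "", author_info_strings))

# Gender: drop everything that is not "m" or "f"
gender_example <- gsub("[^mf]", "", author_info_strings)

# Treat strings without a gender letter as missing
gender_example[!is.na(gender_example) & gender_example == ""] <- NA

data.frame(author_info_strings, age_example, gender_example)</code></pre></div>
</div>
<p>Missing author info stays missing in both columns, so posts without detected information can still be filtered out later.</p>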
</section>
<section id="multiple-ages" class="level3">
<h3 class="anchored" data-anchor-id="multiple-ages">Multiple ages</h3>
<div class="cell">
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># there are some instances where we extracted multiple numbers</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="co"># this is because some people are writing the information of multiple</span></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="co"># people into one set of brackets. This is only the case however</span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="co"># for less than 0.003 % of posts</span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="co"># Examples:</span></span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(data[<span class="fu">sapply</span>(author_age,length) <span class="sc">></span> <span class="dv">1</span>,]<span class="sc">$</span>title[<span class="dv">1</span><span class="sc">:</span><span class="dv">100</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Seeing her with another guy has crushed me (M30 F28)"
[2] "What is happening? (26M) (18-24 F)"
[3] "(19m, 19f) Im kinda starting to freak out haha. Girl I'm talking too is a bad texter and it's throwing me off"
[4] "Keep forgetting stuff my friends and people tell me (23f (23f)"
[5] "(21m, 24f) Should I cut things off with her?"
[6] "(18F, 19M) my bf vents too much" </code></pre>
</div>
</div>
</section>
<section id="taking-first-instances" class="level3">
<h3 class="anchored" data-anchor-id="taking-first-instances">Taking first instances</h3>
<div class="cell">
<div class="sourceCode cell-code" id="cb12"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># From visual inspection, it seems safe to assume that the first</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="co"># indication is usually the age of the author, consequently, we</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="co"># always take the first value</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a>author_age <span class="ot"><-</span> <span class="fu">sapply</span>(author_age, <span class="st">`</span><span class="at">[[</span><span class="st">`</span>, <span class="dv">1</span>)</span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a><span class="co"># rechecking</span></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a><span class="co">#table(sapply(author_age,length))</span></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a><span class="co">#prop.table(table(sapply(author_age,length)))</span></span>
<span id="cb12-9"><a href="#cb12-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a><span class="co"># finally, we transform to numeric</span></span>
<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a>author_age <span class="ot"><-</span> <span class="fu">as.numeric</span>(author_age)</span>
<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-13"><a href="#cb12-13" aria-hidden="true" tabindex="-1"></a><span class="co"># first results</span></span>
<span id="cb12-14"><a href="#cb12-14" aria-hidden="true" tabindex="-1"></a><span class="fu">hist</span>(author_age)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/unnamed-chunk-8-1.png" class="img-fluid" width="672"></p>
</div>
</div>
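<p>To illustrate the “first value” rule on a small scale, the following sketch uses hypothetical bracket strings (<code>brackets_example</code>) rather than our data. It uses <code>`[`</code> instead of <code>`[[`</code>, an assumption made here so that brackets without any number return <code>NA</code> instead of raising an error:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical bracket contents; the last one contains no age at all
brackets_example <- c("(m30 f28)", "(19m, 19f)", "(26f)", "(f)")

# Extract every run of digits per bracket
ages_per_bracket <- regmatches(brackets_example, gregexpr("[0-9]+", brackets_example))

# How many numbers were found in each bracket
lengths(ages_per_bracket)

# Keep only the first number; `[` yields NA for empty results
first_age_example <- as.numeric(sapply(ages_per_bracket, `[`, 1))
first_age_example</code></pre></div>
</div>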
</section>
<section id="dealing-with-weird-cases" class="level3">
<h3 class="anchored" data-anchor-id="dealing-with-weird-cases">Dealing with weird cases</h3>
<div class="cell">
<div class="sourceCode cell-code" id="cb13"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># we have some unreasonably large and small numbers in there</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="co"># (and some zeros)</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="co"># let's check what went wrong there:</span></span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a><span class="co"># The zeros come almost exclusively from people pasting the example</span></span>
<span id="cb13-6"><a href="#cb13-6" aria-hidden="true" tabindex="-1"></a><span class="co"># title format as their own title; we should exclude those posts</span></span>
<span id="cb13-7"><a href="#cb13-7" aria-hidden="true" tabindex="-1"></a><span class="co"># later</span></span>
<span id="cb13-8"><a href="#cb13-8" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(data[author_age <span class="sc">==</span> <span class="dv">0</span> <span class="sc">&</span> <span class="sc">!</span><span class="fu">is.na</span>(author_age),]<span class="sc">$</span>title)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Me [00 M/F/N] with my ___ [00 M/F/N] duration, short-description"
[2] "Me [00 M/F] with my ___ [00 M/F] duration, short-description;text"
[3] "Me [00 M/F/N] with my ___ [00 M/F/N] duration, short-description"
[4] "Me [00 M/F/N] with my ___ [00 M/F/N] last night, my neighbor sends me a weird text. I'm not sure if I responded correctly/appropriately because I'm autistic."</code></pre>
</div>
<div class="sourceCode cell-code" id="cb15"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Ages of 100 or above are roughly a 50/50 split between typos</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="co"># and trolls. We can discard them, as there are very few such</span></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a><span class="co"># cases; there are also few enough to inspect manually. This</span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a><span class="co"># sample contains none, so the result here is empty</span></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(data[author_age <span class="sc">>=</span> <span class="dv">100</span> <span class="sc">&</span> <span class="sc">!</span><span class="fu">is.na</span>(author_age),]<span class="sc">$</span>title)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>character(0)</code></pre>
</div>
<div class="sourceCode cell-code" id="cb17"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="co"># These are usually people talking about their children, trolls,</span></span>
<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a><span class="co"># or people with an unusual infobox, e.g. (M2F) for transgender.</span></span>
<span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a><span class="co"># As there are not many of these cases, we can exclude them.</span></span>
<span id="cb17-4"><a href="#cb17-4" aria-hidden="true" tabindex="-1"></a><span class="co"># They also frequently contain typos like 2F instead of 23F or</span></span>
<span id="cb17-5"><a href="#cb17-5" aria-hidden="true" tabindex="-1"></a><span class="co"># anonymizations like 2XM for a man in his twenties</span></span>
<span id="cb17-6"><a href="#cb17-6" aria-hidden="true" tabindex="-1"></a>UnderageFrame <span class="ot"><-</span> data[(author_age <span class="sc"><</span> <span class="dv">11</span> <span class="sc">&</span></span>
<span id="cb17-7"><a href="#cb17-7" aria-hidden="true" tabindex="-1"></a> author_age <span class="sc">!=</span> <span class="dv">0</span> <span class="sc">&</span></span>
<span id="cb17-8"><a href="#cb17-8" aria-hidden="true" tabindex="-1"></a> <span class="sc">!</span><span class="fu">is.na</span>(author_age)),]</span>
<span id="cb17-9"><a href="#cb17-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-10"><a href="#cb17-10" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(UnderageFrame<span class="sc">$</span>title)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "My girlfriend (F23) has an incredibly low sex drive with me (M24) (2 year relationship)"
[2] "Do long-distance relationships work? (20f) + (22m) [5yrs]"
[3] "My gf (21F) (of 6 months) slept with my brother (23M) 2 years ago when we weren’t together"
[4] "(2m, 24f) how do I move on from someone while still seeing them?"
[5] "My (M 27) fiancée (F 27) is dragging her feet on moving in with me (7-year relationship)"
[6] "[PART 2] My (M21) girlfriend’s (F21) family hates us for moving in with my family" </code></pre>
</div>
</div>
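<p>The exclusions discussed above (the zeros from pasted title templates, ages below 11, and ages of 100 or more) could be applied in a single step. The following is a minimal sketch, not the authors' code: the helper name <code>clean_age</code> and its <code>lower</code>/<code>upper</code> arguments are hypothetical, and the toy vector is made up for illustration. Implausible values are set to <code>NA</code> rather than dropped, so the vector stays aligned with the rows of <code>data</code>.</p>

```r
# Hypothetical helper: set implausible ages to NA.
# The default bounds mirror the thresholds used in this section
# (below 11 and 100 or above are treated as implausible).
clean_age <- function(age, lower = 11, upper = 99) {
  age <- as.numeric(age)
  implausible <- !is.na(age) & (age < lower | age > upper)
  age[implausible] <- NA
  age
}

# Toy input for illustration only, not taken from the dataset
toy_age <- c(0, 2, 11, 23, 89, 120, NA)
clean_age(toy_age)
```

<p>Whether to apply such a filter before or after building the age groups is a design choice; setting values to <code>NA</code> keeps the affected posts available for later inspection.</p>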
</section>
<section id="distribution-of-reasonable-age-range" class="level3">
<h3 class="anchored" data-anchor-id="distribution-of-reasonable-age-range">Distribution of reasonable age range</h3>
<div class="cell">
<div class="sourceCode cell-code" id="cb19"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="co"># plotting the distribution of author age for ages above 10 and below 100</span></span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a><span class="fu">hist</span>(<span class="fu">as.numeric</span>(author_age[author_age <span class="sc">></span> <span class="dv">10</span> <span class="sc">&</span> author_age <span class="sc"><</span> <span class="dv">100</span>]))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid" width="672"></p>
</div>
</div>
</section>
<section id="building-age-groups" class="level3">
<h3 class="anchored" data-anchor-id="building-age-groups">Building age groups</h3>
<div class="cell">
<div class="sourceCode cell-code" id="cb20"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="co"># We also create a factor for age groups for easier breakdown in further analysis; levels are <21, 21-30, 31-40, 41-50 and >50</span></span>
<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a>author_age_group <span class="ot"><-</span> author_age</span>
<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a>author_age_group[author_age_group <span class="sc"><=</span> <span class="dv">20</span>] <span class="ot"><-</span> <span class="st">"<21"</span></span>
<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a>author_age_group[author_age_group <span class="sc">>=</span> <span class="dv">21</span> <span class="sc">&</span> author_age_group <span class="sc"><=</span> <span class="dv">30</span>] <span class="ot"><-</span> <span class="st">"21-30"</span></span>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a>author_age_group[author_age_group <span class="sc">>=</span> <span class="dv">31</span> <span class="sc">&</span> author_age_group <span class="sc"><=</span> <span class="dv">40</span>] <span class="ot"><-</span> <span class="st">"31-40"</span></span>
<span id="cb20-6"><a href="#cb20-6" aria-hidden="true" tabindex="-1"></a>author_age_group[author_age_group <span class="sc">>=</span> <span class="dv">41</span> <span class="sc">&</span> author_age_group <span class="sc"><=</span> <span class="dv">50</span>] <span class="ot"><-</span> <span class="st">"41-50"</span></span>
<span id="cb20-7"><a href="#cb20-7" aria-hidden="true" tabindex="-1"></a>author_age_group[author_age_group <span class="sc">!=</span> <span class="st">"<21"</span> <span class="sc">&</span></span>
<span id="cb20-8"><a href="#cb20-8" aria-hidden="true" tabindex="-1"></a> author_age_group <span class="sc">!=</span> <span class="st">"21-30"</span> <span class="sc">&</span> </span>
<span id="cb20-9"><a href="#cb20-9" aria-hidden="true" tabindex="-1"></a> author_age_group <span class="sc">!=</span> <span class="st">"31-40"</span> <span class="sc">&</span></span>
<span id="cb20-10"><a href="#cb20-10" aria-hidden="true" tabindex="-1"></a> author_age_group <span class="sc">!=</span> <span class="st">"41-50"</span>] <span class="ot"><-</span> <span class="st">">50"</span></span>
<span id="cb20-11"><a href="#cb20-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-12"><a href="#cb20-12" aria-hidden="true" tabindex="-1"></a><span class="co"># checking variable</span></span>
<span id="cb20-13"><a href="#cb20-13" aria-hidden="true" tabindex="-1"></a><span class="co">#class(author_age_group)</span></span>
<span id="cb20-14"><a href="#cb20-14" aria-hidden="true" tabindex="-1"></a>author_age_group <span class="ot"><-</span> <span class="fu">factor</span>(author_age_group,</span>
<span id="cb20-15"><a href="#cb20-15" aria-hidden="true" tabindex="-1"></a> <span class="at">levels =</span> <span class="fu">c</span>(<span class="st">"<21"</span>,</span>
<span id="cb20-16"><a href="#cb20-16" aria-hidden="true" tabindex="-1"></a> <span class="st">"21-30"</span>,</span>
<span id="cb20-17"><a href="#cb20-17" aria-hidden="true" tabindex="-1"></a> <span class="st">"31-40"</span>,</span>
<span id="cb20-18"><a href="#cb20-18" aria-hidden="true" tabindex="-1"></a> <span class="st">"41-50"</span>,</span>
<span id="cb20-19"><a href="#cb20-19" aria-hidden="true" tabindex="-1"></a> <span class="st">">50"</span>))</span>
<span id="cb20-20"><a href="#cb20-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-21"><a href="#cb20-21" aria-hidden="true" tabindex="-1"></a><span class="co"># table</span></span>
<span id="cb20-22"><a href="#cb20-22" aria-hidden="true" tabindex="-1"></a><span class="fu">table</span>(author_age,author_age_group, <span class="at">useNA =</span> <span class="st">"ifany"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> author_age_group
author_age <21 21-30 31-40 41-50 >50 <NA>
0 4 0 0 0 0 0
2 5 0 0 0 0 0
3 1 0 0 0 0 0
4 1 0 0 0 0 0
5 2 0 0 0 0 0
6 2 0 0 0 0 0
7 1 0 0 0 0 0
11 1 0 0 0 0 0
12 1 0 0 0 0 0
13 28 0 0 0 0 0
14 71 0 0 0 0 0
15 149 0 0 0 0 0
16 313 0 0 0 0 0
17 417 0 0 0 0 0
18 2324 0 0 0 0 0
19 2557 0 0 0 0 0
20 3126 0 0 0 0 0
21 0 3142 0 0 0 0
22 0 3572 0 0 0 0
23 0 3553 0 0 0 0
24 0 3400 0 0 0 0
25 0 3726 0 0 0 0
26 0 2778 0 0 0 0
27 0 2497 0 0 0 0
28 0 2148 0 0 0 0
29 0 1719 0 0 0 0
30 0 2027 0 0 0 0
31 0 0 1057 0 0 0
32 0 0 1071 0 0 0
33 0 0 741 0 0 0
34 0 0 629 0 0 0
35 0 0 698 0 0 0
36 0 0 400 0 0 0
37 0 0 282 0 0 0
38 0 0 259 0 0 0
39 0 0 181 0 0 0
40 0 0 233 0 0 0
41 0 0 0 98 0 0
42 0 0 0 96 0 0
43 0 0 0 76 0 0
44 0 0 0 59 0 0
45 0 0 0 65 0 0
46 0 0 0 37 0 0
47 0 0 0 34 0 0
48 0 0 0 33 0 0
49 0 0 0 18 0 0
50 0 0 0 34 0 0
51 0 0 0 0 15 0
52 0 0 0 0 15 0
53 0 0 0 0 9 0
54 0 0 0 0 5 0
55 0 0 0 0 13 0
56 0 0 0 0 11 0
57 0 0 0 0 3 0
58 0 0 0 0 9 0
59 0 0 0 0 12 0
60 0 0 0 0 9 0
61 0 0 0 0 3 0
62 0 0 0 0 3 0
63 0 0 0 0 2 0
64 0 0 0 0 2 0
65 0 0 0 0 6 0
67 0 0 0 0 1 0
70 0 0 0 0 1 0
71 0 0 0 0 1 0
78 0 0 0 0 1 0
89 0 0 0 0 1 0
<NA> 0 0 0 0 0 32348</code></pre>
</div>
<div class="sourceCode cell-code" id="cb22"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Missing values</span></span>
<span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a><span class="fu">table</span>(<span class="fu">is.na</span>(author_age_group))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
FALSE TRUE
43788 32348 </code></pre>
</div>
<div class="sourceCode cell-code" id="cb24"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a><span class="fu">prop.table</span>(<span class="fu">table</span>(<span class="fu">is.na</span>(author_age_group)))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>
FALSE TRUE
0.5751287 0.4248713 </code></pre>
</div>
<div class="sourceCode cell-code" id="cb26"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a><span class="co"># age group</span></span>
<span id="cb26-2"><a href="#cb26-2" aria-hidden="true" tabindex="-1"></a><span class="co">#table(author_age_group, author_age, useNA = "ifany")</span></span>
<span id="cb26-3"><a href="#cb26-3" aria-hidden="true" tabindex="-1"></a><span class="co">#prop.table(table(author_age_group))</span></span>
<span id="cb26-4"><a href="#cb26-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-5"><a href="#cb26-5" aria-hidden="true" tabindex="-1"></a><span class="co"># let's bind everything together for easier subsetting</span></span>
<span id="cb26-6"><a href="#cb26-6" aria-hidden="true" tabindex="-1"></a>data <span class="ot"><-</span> <span class="fu">cbind.data.frame</span>(data,author_age,author_age_group, <span class="at">stringsAsFactors =</span> <span class="cn">FALSE</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
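<p>In the step-by-step recoding above, the first assignment of a string turns the numeric vector into a character vector, so the later range checks compare strings rather than numbers. That gives the intended result for the ages in this data, but a more compact alternative is base R's <code>cut()</code>. This is a sketch under assumptions: the helper name <code>age_to_group</code> and the toy ages are hypothetical, while the breaks and labels are chosen to reproduce the five groups defined above.</p>

```r
# Hypothetical helper: bin numeric ages into the five groups used above.
# Intervals are closed on the right, i.e. (-Inf, 20], (20, 30], ..., (50, Inf).
age_to_group <- function(age) {
  cut(age,
      breaks = c(-Inf, 20, 30, 40, 50, Inf),
      labels = c("<21", "21-30", "31-40", "41-50", ">50"),
      right = TRUE)
}

# Toy input for illustration only, not taken from the dataset
toy_age <- c(0, 20, 21, 30, 31, 50, 51, 89, NA)
toy_group <- age_to_group(toy_age)

# Cross-tabulate to check that every age lands in the expected group
table(toy_age, toy_group, useNA = "ifany")
```

<p><code>cut()</code> returns a factor with the levels in the order given, so no separate <code>factor()</code> call is needed, and missing ages stay <code>NA</code>.</p>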
</section>
<section id="non-cis-gendered-authors" class="level3">
<h3 class="anchored" data-anchor-id="non-cis-gendered-authors">Non cis-gendered Authors</h3>
<div class="cell">
<div class="sourceCode cell-code" id="cb27"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="co"># having a look at the unusual cases</span></span>
<span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">sapply</span>(AuthorInfo,nchar) <span class="sc">></span> <span class="dv">4</span> <span class="sc">&</span> <span class="sc">!</span><span class="fu">is.na</span>(AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "m30 f28" "nb-25" " 21 f " "19ftm" "18-24 f" "19m, 19f"</code></pre>
</div>
<div class="sourceCode cell-code" id="cb29"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a><span class="co"># there are a few specific things that pop up often</span></span>
<span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"relationship"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2 year relationship" "relationship with" "7-year relationship"
[4] "2 year relationship" "4mo relationship" </code></pre>
</div>
<div class="sourceCode cell-code" id="cb31"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"update"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "update" "update" "update" "update" "update" "update"</code></pre>
</div>
<div class="sourceCode cell-code" id="cb33"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"advice"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "advice needed" "seeking advice" "advice wanted" "advice"
[5] "need advice" "advice" </code></pre>
</div>
<div class="sourceCode cell-code" id="cb35"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb35-1"><a href="#cb35-1" aria-hidden="true" tabindex="-1"></a><span class="co"># In these cases, we often have information that doesn't fit the standard formatting.</span></span>
<span id="cb35-2"><a href="#cb35-2" aria-hidden="true" tabindex="-1"></a><span class="co"># For the most part, it's safe to just extract the instance of m or f as the gender from</span></span>
<span id="cb35-3"><a href="#cb35-3" aria-hidden="true" tabindex="-1"></a><span class="co"># the author info; however, we need to pay special attention to the terms</span></span>
<span id="cb35-4"><a href="#cb35-4" aria-hidden="true" tabindex="-1"></a><span class="co"># -trans</span></span>
<span id="cb35-5"><a href="#cb35-5" aria-hidden="true" tabindex="-1"></a><span class="co"># -mtf</span></span>
<span id="cb35-6"><a href="#cb35-6" aria-hidden="true" tabindex="-1"></a><span class="co"># -ftm</span></span>
<span id="cb35-7"><a href="#cb35-7" aria-hidden="true" tabindex="-1"></a><span class="co"># -nb</span></span>
<span id="cb35-8"><a href="#cb35-8" aria-hidden="true" tabindex="-1"></a><span class="co"># -non-binary</span></span>
<span id="cb35-9"><a href="#cb35-9" aria-hidden="true" tabindex="-1"></a><span class="co"># We should also check how often the following terms appear</span></span>
<span id="cb35-10"><a href="#cb35-10" aria-hidden="true" tabindex="-1"></a><span class="co">#- gay</span></span>
<span id="cb35-11"><a href="#cb35-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-12"><a href="#cb35-12" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"trans"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "transwoman 33" "36transfem" "18transm" </code></pre>
</div>
<div class="sourceCode cell-code" id="cb37"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb37-1"><a href="#cb37-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"mtf"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "mtf18" "21mtf" "20mtf" "21mtf" "29 mtf" "mtf-23"</code></pre>
</div>
<div class="sourceCode cell-code" id="cb39"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb39-1"><a href="#cb39-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"m2f"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>character(0)</code></pre>
</div>
<div class="sourceCode cell-code" id="cb41"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb41-1"><a href="#cb41-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"ftm"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "19ftm" "19ftm" "ftm 25" "18ftm" "ftm22" "19ftm" </code></pre>
</div>
<div class="sourceCode cell-code" id="cb43"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb43-1"><a href="#cb43-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"f2m"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>character(0)</code></pre>
</div>
<div class="sourceCode cell-code" id="cb45"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb45-1"><a href="#cb45-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"nb"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "nb-25" "21nb" "nb28" "nb22" "29nb" "22nb" </code></pre>
</div>
<div class="sourceCode cell-code" id="cb47"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb47-1"><a href="#cb47-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"non-binary"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "22m non-binary" "27 non-binary" "26 non-binary"
[4] "actually non-binary" "f-non-binary, 45" </code></pre>
</div>
<div class="sourceCode cell-code" id="cb49"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb49-1"><a href="#cb49-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"gay"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "27m, gay" "33m, openly gay"</code></pre>
</div>
<div class="sourceCode cell-code" id="cb51"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb51-1"><a href="#cb51-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">grep</span>(<span class="st">"lesbian"</span>,AuthorInfo)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "lesbian"</code></pre>
</div>
<div class="sourceCode cell-code" id="cb53"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb53-1"><a href="#cb53-1" aria-hidden="true" tabindex="-1"></a><span class="co"># We should first extract the gender only using the (m/f) indicator</span></span>
<span id="cb53-2"><a href="#cb53-2" aria-hidden="true" tabindex="-1"></a><span class="co"># and then add corrections wherever necessary</span></span>
<span id="cb53-3"><a href="#cb53-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb53-4"><a href="#cb53-4" aria-hidden="true" tabindex="-1"></a><span class="co"># extracting gender from author variable</span></span>
<span id="cb53-5"><a href="#cb53-5" aria-hidden="true" tabindex="-1"></a>author_gender <span class="ot"><-</span> <span class="fu">str_extract_all</span>(AuthorInfo,<span class="st">"(f|m)"</span>)</span>
<span id="cb53-6"><a href="#cb53-6" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">lengths</span>(author_gender) <span class="sc">==</span> <span class="dv">0</span>] <span class="ot"><-</span> <span class="cn">NA</span></span>
<span id="cb53-7"><a href="#cb53-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb53-8"><a href="#cb53-8" aria-hidden="true" tabindex="-1"></a><span class="co"># There is an issue: we sometimes get two matching gender indicators from the titles.</span></span>
<span id="cb53-9"><a href="#cb53-9" aria-hidden="true" tabindex="-1"></a><span class="co"># This usually happens when people mention something else within their self-description in brackets,</span></span>
<span id="cb53-10"><a href="#cb53-10" aria-hidden="true" tabindex="-1"></a><span class="co"># such as their sexual orientation or something unrelated. By convention, we just take the first instance</span></span>
<span id="cb53-11"><a href="#cb53-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb53-12"><a href="#cb53-12" aria-hidden="true" tabindex="-1"></a><span class="co"># checking how many ambiguities are present and where they are</span></span>
<span id="cb53-13"><a href="#cb53-13" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(author_gender[<span class="fu">sapply</span>(author_gender,length) <span class="sc">></span> <span class="dv">1</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[[1]]
[1] "m" "f"
[[2]]
[1] "f" "m"
[[3]]
[1] "m" "f"
[[4]]
[1] "f" "f"
[[5]]
[1] "m" "f"
[[6]]
[1] "f" "m"</code></pre>
</div>
<div class="sourceCode cell-code" id="cb55"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb55-1"><a href="#cb55-1" aria-hidden="true" tabindex="-1"></a><span class="co">#which(sapply(author_gender,length) > 1)</span></span>
<span id="cb55-2"><a href="#cb55-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb55-3"><a href="#cb55-3" aria-hidden="true" tabindex="-1"></a><span class="co"># checking with which titles it occurs</span></span>
<span id="cb55-4"><a href="#cb55-4" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(data<span class="sc">$</span>title[<span class="fu">which</span>(<span class="fu">sapply</span>(author_gender,length) <span class="sc">></span> <span class="dv">1</span>)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Seeing her with another guy has crushed me (M30 F28)"
[2] "I (19FTM) have a secret partner (32M)"
[3] "(19m, 19f) Im kinda starting to freak out haha. Girl I'm talking too is a bad texter and it's throwing me off"
[4] "Keep forgetting stuff my friends and people tell me (23f (23f)"
[5] "(21m, 24f) Should I cut things off with her?"
[6] "I (19FTM) dislike my mom (48F). I feel somewhat bad for having to rely on her." </code></pre>
</div>
<div class="sourceCode cell-code" id="cb57"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb57-1"><a href="#cb57-1" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(AuthorInfo[<span class="fu">which</span>(<span class="fu">sapply</span>(author_gender,length) <span class="sc">></span> <span class="dv">1</span>)])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "m30 f28" "19ftm" "19m, 19f" "23f (23f" "21m, 24f" "19ftm" </code></pre>
</div>
<div class="sourceCode cell-code" id="cb59"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb59-1"><a href="#cb59-1" aria-hidden="true" tabindex="-1"></a><span class="co"># taking only first element of every sublist</span></span>
<span id="cb59-2"><a href="#cb59-2" aria-hidden="true" tabindex="-1"></a>author_gender <span class="ot"><-</span> <span class="fu">sapply</span>(author_gender, <span class="st">`</span><span class="at">[[</span><span class="st">`</span>, <span class="dv">1</span>)</span>
<span id="cb59-3"><a href="#cb59-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb59-4"><a href="#cb59-4" aria-hidden="true" tabindex="-1"></a><span class="co"># including trans indications</span></span>
<span id="cb59-5"><a href="#cb59-5" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grep</span>(<span class="st">"mtf"</span>,AuthorInfo)] <span class="ot"><-</span> <span class="st">"mtf"</span></span>
<span id="cb59-6"><a href="#cb59-6" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grep</span>(<span class="st">"m2f"</span>,AuthorInfo)] <span class="ot"><-</span> <span class="st">"mtf"</span></span>
<span id="cb59-7"><a href="#cb59-7" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grep</span>(<span class="st">"ftm"</span>,AuthorInfo)] <span class="ot"><-</span> <span class="st">"ftm"</span></span>
<span id="cb59-8"><a href="#cb59-8" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grep</span>(<span class="st">"f2m"</span>,AuthorInfo)] <span class="ot"><-</span> <span class="st">"ftm"</span></span>
<span id="cb59-9"><a href="#cb59-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb59-10"><a href="#cb59-10" aria-hidden="true" tabindex="-1"></a><span class="co"># setting trans indicators </span></span>
<span id="cb59-11"><a href="#cb59-11" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grepl</span>(<span class="st">"trans"</span>,AuthorInfo) <span class="sc">&</span> author_gender <span class="sc">==</span> <span class="st">"m"</span>] <span class="ot"><-</span> <span class="st">"ftm"</span></span>
<span id="cb59-12"><a href="#cb59-12" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grepl</span>(<span class="st">"trans"</span>,AuthorInfo) <span class="sc">&</span> author_gender <span class="sc">==</span> <span class="st">"f"</span>] <span class="ot"><-</span> <span class="st">"mtf"</span></span>
<span id="cb59-13"><a href="#cb59-13" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grepl</span>(<span class="st">"trans"</span>,AuthorInfo) <span class="sc">&</span> <span class="fu">is.na</span>(author_gender)] <span class="ot"><-</span> <span class="st">"trans"</span></span>
<span id="cb59-14"><a href="#cb59-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb59-15"><a href="#cb59-15" aria-hidden="true" tabindex="-1"></a><span class="co"># setting non-binary indicator</span></span>
<span id="cb59-16"><a href="#cb59-16" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grepl</span>(<span class="st">"nb"</span>,AuthorInfo)] <span class="ot"><-</span> <span class="st">"non-binary"</span></span>
<span id="cb59-17"><a href="#cb59-17" aria-hidden="true" tabindex="-1"></a>author_gender[<span class="fu">grepl</span>(<span class="st">"non-binary"</span>,AuthorInfo)] <span class="ot"><-</span> <span class="st">"non-binary"</span></span>
<span id="cb59-18"><a href="#cb59-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb59-19"><a href="#cb59-19" aria-hidden="true" tabindex="-1"></a><span class="co"># first results</span></span>
<span id="cb59-20"><a href="#cb59-20" aria-hidden="true" tabindex="-1"></a><span class="fu">table</span>(author_gender)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>author_gender
f ftm m mtf non-binary
24864 65 18074 18 312 </code></pre>
</div>
<div class="sourceCode cell-code" id="cb61"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb61-1"><a href="#cb61-1" aria-hidden="true" tabindex="-1"></a><span class="fu">barplot</span>(<span class="fu">table</span>(author_gender))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" width="672"></p>
</div>
<div class="sourceCode cell-code" id="cb62"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb62-1"><a href="#cb62-1" aria-hidden="true" tabindex="-1"></a><span class="co"># adding author gender to the dataframe</span></span>
<span id="cb62-2"><a href="#cb62-2" aria-hidden="true" tabindex="-1"></a>data <span class="ot"><-</span> <span class="fu">cbind.data.frame</span>(data,author_gender, <span class="at">stringsAsFactors =</span> <span class="cn">FALSE</span>)</span>
<span id="cb62-3"><a href="#cb62-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb62-4"><a href="#cb62-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Visual Sanity Check</span></span>
<span id="cb62-5"><a href="#cb62-5" aria-hidden="true" tabindex="-1"></a><span class="co">#View(data)</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section>
</section>
<section id="automated-gender-and-age-detection" class="level2">
<h2 class="anchored" data-anchor-id="automated-gender-and-age-detection">Automated Gender and Age detection</h2>
<div class="cell">
<div class="sourceCode cell-code" id="cb63"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb63-1"><a href="#cb63-1" aria-hidden="true" tabindex="-1"></a><span class="do">#### script for interacting with gpt3 ####</span></span>
<span id="cb63-2"><a href="#cb63-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-3"><a href="#cb63-3" aria-hidden="true" tabindex="-1"></a><span class="do">## setting up access token as described here:</span></span>
<span id="cb63-4"><a href="#cb63-4" aria-hidden="true" tabindex="-1"></a><span class="co"># https://github.com/ben-aaron188/rgpt3</span></span>
<span id="cb63-5"><a href="#cb63-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-6"><a href="#cb63-6" aria-hidden="true" tabindex="-1"></a><span class="co"># installing package</span></span>
<span id="cb63-7"><a href="#cb63-7" aria-hidden="true" tabindex="-1"></a>devtools<span class="sc">::</span><span class="fu">install_github</span>(<span class="st">"ben-aaron188/rgpt3"</span>)</span>
<span id="cb63-8"><a href="#cb63-8" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(rgpt3)</span>
<span id="cb63-9"><a href="#cb63-9" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(quanteda)</span>
<span id="cb63-10"><a href="#cb63-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-11"><a href="#cb63-11" aria-hidden="true" tabindex="-1"></a><span class="co"># authenticate access</span></span>
<span id="cb63-12"><a href="#cb63-12" aria-hidden="true" tabindex="-1"></a><span class="co"># </span><span class="al">TODO</span><span class="co">: BE CAREFUL, THIS IS LINKED TO CARSTEN'S PRIVATE CREDIT CARD!</span></span>
<span id="cb63-13"><a href="#cb63-13" aria-hidden="true" tabindex="-1"></a><span class="fu">gpt3_authenticate</span>(<span class="st">"./GPT_access_token//gpt3_token_carsten2.txt"</span>)</span>
<span id="cb63-14"><a href="#cb63-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-15"><a href="#cb63-15" aria-hidden="true" tabindex="-1"></a><span class="co"># make test request</span></span>
<span id="cb63-16"><a href="#cb63-16" aria-hidden="true" tabindex="-1"></a><span class="fu">gpt3_test_completion</span>()</span>
<span id="cb63-17"><a href="#cb63-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-18"><a href="#cb63-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-19"><a href="#cb63-19" aria-hidden="true" tabindex="-1"></a><span class="do">#### Testing author age and gender attribution via gpt3 ####</span></span>
<span id="cb63-20"><a href="#cb63-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-21"><a href="#cb63-21" aria-hidden="true" tabindex="-1"></a><span class="co"># loading in data</span></span>
<span id="cb63-22"><a href="#cb63-22" aria-hidden="true" tabindex="-1"></a><span class="fu">setwd</span>(<span class="st">"./data_final/"</span>)</span>
<span id="cb63-23"><a href="#cb63-23" aria-hidden="true" tabindex="-1"></a><span class="fu">list.files</span>()</span>
<span id="cb63-24"><a href="#cb63-24" aria-hidden="true" tabindex="-1"></a><span class="fu">load</span>(<span class="st">"Reddit_relationship_data_final.Rdata"</span>)</span>
<span id="cb63-25"><a href="#cb63-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-26"><a href="#cb63-26" aria-hidden="true" tabindex="-1"></a><span class="co"># subsetting data</span></span>
<span id="cb63-27"><a href="#cb63-27" aria-hidden="true" tabindex="-1"></a><span class="co">#sample_size <- 100</span></span>
<span id="cb63-28"><a href="#cb63-28" aria-hidden="true" tabindex="-1"></a><span class="co">#data_subset <- data[sample(1:dim(data)[1],sample_size),]</span></span>
<span id="cb63-29"><a href="#cb63-29" aria-hidden="true" tabindex="-1"></a>data_subset <span class="ot"><-</span> data</span>
<span id="cb63-30"><a href="#cb63-30" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-31"><a href="#cb63-31" aria-hidden="true" tabindex="-1"></a><span class="co"># constructing subset dataframe</span></span>
<span id="cb63-32"><a href="#cb63-32" aria-hidden="true" tabindex="-1"></a>preface <span class="ot"><-</span> <span class="st">"Extract only the authors' self-indicated age and gender. Ignore anyone but the author: "</span></span>
<span id="cb63-33"><a href="#cb63-33" aria-hidden="true" tabindex="-1"></a>addon <span class="ot"><-</span> <span class="st">"</span><span class="sc">\n</span><span class="st"> </span><span class="sc">\n</span><span class="st"> Do not include any additional text. If no info is available for the author, return NA"</span></span>
<span id="cb63-34"><a href="#cb63-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-35"><a href="#cb63-35" aria-hidden="true" tabindex="-1"></a>prompt_frame <span class="ot"><-</span> <span class="fu">data.frame</span>(<span class="at">prompt_role_var =</span> <span class="fu">rep</span>(<span class="st">"user"</span>,<span class="fu">dim</span>(data_subset)[<span class="dv">1</span>]),</span>
<span id="cb63-36"><a href="#cb63-36" aria-hidden="true" tabindex="-1"></a> <span class="at">prompt_content_var =</span> <span class="fu">paste</span>(preface,data_subset<span class="sc">$</span>title,addon),</span>
<span id="cb63-37"><a href="#cb63-37" aria-hidden="true" tabindex="-1"></a> <span class="at">id_var =</span> data_subset<span class="sc">$</span>id)</span>
<span id="cb63-38"><a href="#cb63-38" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-39"><a href="#cb63-39" aria-hidden="true" tabindex="-1"></a><span class="co"># sending to gpt3</span></span>
<span id="cb63-40"><a href="#cb63-40" aria-hidden="true" tabindex="-1"></a><span class="co"># </span><span class="al">TODO</span><span class="co">: ALWAYS DOUBLE CHECK TO NOT SEND TOO MUCH DATA!</span></span>
<span id="cb63-41"><a href="#cb63-41" aria-hidden="true" tabindex="-1"></a><span class="fu">dim</span>(prompt_frame)</span>
<span id="cb63-42"><a href="#cb63-42" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-43"><a href="#cb63-43" aria-hidden="true" tabindex="-1"></a><span class="co"># rough token count (whitespace-separated words; the model's tokenizer will count differently)</span></span>
<span id="cb63-44"><a href="#cb63-44" aria-hidden="true" tabindex="-1"></a><span class="fu">length</span>(<span class="fu">unlist</span>(<span class="fu">strsplit</span>(prompt_frame<span class="sc">$</span>prompt_content_var,<span class="st">" "</span>)))</span>
<span id="cb63-45"><a href="#cb63-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-46"><a href="#cb63-46" aria-hidden="true" tabindex="-1"></a><span class="co"># pricing</span></span>
<span id="cb63-47"><a href="#cb63-47" aria-hidden="true" tabindex="-1"></a><span class="co"># https://openai.com/pricing</span></span>
<span id="cb63-48"><a href="#cb63-48" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-49"><a href="#cb63-49" aria-hidden="true" tabindex="-1"></a><span class="co"># GPT3 (4k context):</span></span>
<span id="cb63-50"><a href="#cb63-50" aria-hidden="true" tabindex="-1"></a><span class="co"># in: 0.03$ / 1000 tokens</span></span>
<span id="cb63-51"><a href="#cb63-51" aria-hidden="true" tabindex="-1"></a><span class="co"># out: 0.002$ / 1000 tokens</span></span>
<span id="cb63-52"><a href="#cb63-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-53"><a href="#cb63-53" aria-hidden="true" tabindex="-1"></a><span class="co"># input tokens</span></span>
<span id="cb63-54"><a href="#cb63-54" aria-hidden="true" tabindex="-1"></a>input_tokens <span class="ot"><-</span> <span class="fu">length</span>(<span class="fu">unlist</span>(<span class="fu">strsplit</span>(prompt_frame<span class="sc">$</span>prompt_content_var,<span class="st">" "</span>)))</span>
<span id="cb63-55"><a href="#cb63-55" aria-hidden="true" tabindex="-1"></a><span class="co"># input costs</span></span>
<span id="cb63-56"><a href="#cb63-56" aria-hidden="true" tabindex="-1"></a>input_costs <span class="ot"><-</span> (input_tokens <span class="sc">/</span> <span class="dv">1000</span>) <span class="sc">*</span> <span class="fl">0.03</span>;input_costs <span class="co"># = price in dollar </span></span>
<span id="cb63-57"><a href="#cb63-57" aria-hidden="true" tabindex="-1"></a><span class="co"># output costs (max: 50 output tokens per prompt, priced at 0.06$ / 1000 tokens)</span></span>
<span id="cb63-58"><a href="#cb63-58" aria-hidden="true" tabindex="-1"></a>output_costs <span class="ot"><-</span> ((<span class="dv">50</span> <span class="sc">*</span> <span class="fu">length</span>(prompt_frame<span class="sc">$</span>prompt_content_var)) <span class="sc">/</span> <span class="dv">1000</span>) <span class="sc">*</span> <span class="fl">0.06</span>;output_costs</span>
<span id="cb63-59"><a href="#cb63-59" aria-hidden="true" tabindex="-1"></a><span class="co"># Total costs (in DOLLARS)</span></span>
<span id="cb63-60"><a href="#cb63-60" aria-hidden="true" tabindex="-1"></a>total_costs <span class="ot"><-</span> input_costs <span class="sc">+</span> output_costs;total_costs</span>
<span id="cb63-61"><a href="#cb63-61" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-62"><a href="#cb63-62" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-63"><a href="#cb63-63" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-64"><a href="#cb63-64" aria-hidden="true" tabindex="-1"></a><span class="do">### We do this with tryCatch to not break everything when something goes wrong </span><span class="al">###</span></span>
<span id="cb63-65"><a href="#cb63-65" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-66"><a href="#cb63-66" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-67"><a href="#cb63-67" aria-hidden="true" tabindex="-1"></a><span class="co"># sending the model off to compute</span></span>
<span id="cb63-68"><a href="#cb63-68" aria-hidden="true" tabindex="-1"></a>gpt_age_gender <span class="ot"><-</span> <span class="fu">chatgpt</span>(<span class="at">prompt_role_var =</span> prompt_frame<span class="sc">$</span>prompt_role_var,</span>
<span id="cb63-69"><a href="#cb63-69" aria-hidden="true" tabindex="-1"></a> <span class="at">prompt_content_var =</span> prompt_frame<span class="sc">$</span>prompt_content_var,</span>
<span id="cb63-70"><a href="#cb63-70" aria-hidden="true" tabindex="-1"></a> <span class="at">id_var =</span> prompt_frame<span class="sc">$</span>id_var,</span>
<span id="cb63-71"><a href="#cb63-71" aria-hidden="true" tabindex="-1"></a> <span class="at">param_max_tokens =</span> <span class="dv">50</span>,</span>
<span id="cb63-72"><a href="#cb63-72" aria-hidden="true" tabindex="-1"></a> <span class="at">param_temperature =</span> <span class="dv">0</span>,</span>
<span id="cb63-73"><a href="#cb63-73" aria-hidden="true" tabindex="-1"></a> <span class="at">param_n =</span> <span class="dv">1</span>,</span>
<span id="cb63-74"><a href="#cb63-74" aria-hidden="true" tabindex="-1"></a> <span class="at">param_model =</span> <span class="st">"gpt-3.5-turbo"</span>)</span>
<span id="cb63-75"><a href="#cb63-75" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-76"><a href="#cb63-76" aria-hidden="true" tabindex="-1"></a><span class="co"># saving results</span></span>
<span id="cb63-77"><a href="#cb63-77" aria-hidden="true" tabindex="-1"></a><span class="fu">saveRDS</span>(gpt_age_gender,<span class="at">file =</span> <span class="st">"GPT3_Age_Gender_annotations.rds"</span>,<span class="at">ver=</span><span class="dv">2</span>)</span>
<span id="cb63-78"><a href="#cb63-78" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb63-79"><a href="#cb63-79" aria-hidden="true" tabindex="-1"></a><span class="co"># checking results</span></span>
<span id="cb63-80"><a href="#cb63-80" aria-hidden="true" tabindex="-1"></a>gpt_age_gender[[<span class="dv">1</span>]]<span class="sc">$</span>chatgpt_content</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
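<p>The script comments mention <code>tryCatch</code>, but the <code>chatgpt()</code> call itself is sent unwrapped. A minimal sketch of how per-request error handling could look is given here; <code>safe_map()</code> and <code>fake_request()</code> are hypothetical names used for illustration and are not part of <code>rgpt3</code>.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: call `fun` once per id and catch errors individually
safe_map <- function(ids, fun) {
  results <- vector("list", length(ids))
  names(results) <- as.character(ids)
  for (i in seq_along(ids)) {
    results[[i]] <- tryCatch(
      fun(ids[[i]]),
      error = function(e) {
        # report the failure and fall back to NA for this id only
        message("Request failed for id ", ids[[i]], ": ", conditionMessage(e))
        NA_character_
      }
    )
  }
  results
}

# Stand-in for an API call (assumption): fails for one id to show the fallback
fake_request <- function(id) {
  if (id == "b") stop("simulated API error")
  paste("answer for", id)
}

out <- safe_map(c("a", "b", "c"), fake_request)</code></pre></div>
</div>
<p>In practice, <code>fake_request()</code> would be replaced by a function that sends one prompt (or one small batch) to the API, so that a failed request yields <code>NA</code> for that id instead of aborting the whole run.</p>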
</section>
<section id="data-wrangling-formatting-and-subsetting" class="level2">
<h2 class="anchored" data-anchor-id="data-wrangling-formatting-and-subsetting">Data Wrangling: Formatting and Subsetting</h2>
<p>To clean the data further, we first make sure that all variables are stored in the right format. Second, we remove (a) posts with missing text, since there is nothing to analyze in them, and (b) posts from minors or from authors who indicated an unrealistic age (above 100). Lastly, we save the cleaned data set.</p>
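<p>These filtering steps can be sketched as follows. This is an illustration on toy data, not the code used for the project: the column name <code>age</code> and the cut-off of 18 for minors are assumptions, while <code>selftext</code> is the text column used elsewhere in this document.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Toy data (made up); `age` is a hypothetical column name
posts <- data.frame(
  id = c("p1", "p2", "p3", "p4", "p5"),
  selftext = c("Some text", NA, "", "More text", "Other text"),
  age = c("25", "30", "22", "16", "140"),
  stringsAsFactors = FALSE
)

posts_clean <- posts %>%
  mutate(age = as.numeric(age)) %>%             # store age as a number
  filter(!is.na(selftext), selftext != "") %>%  # drop missing or empty texts
  filter(age >= 18, age <= 100)                 # drop minors and ages above 100

posts_clean</code></pre></div>
</div>
<p>Note that <code>filter()</code> also drops rows in which the condition evaluates to <code>NA</code>, so posts without an age would be removed by this sketch as well.</p>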
</section>
<section id="preprocessing-text-data" class="level2">
<h2 class="anchored" data-anchor-id="preprocessing-text-data">Preprocessing Text Data</h2>
<div class="cell">
<div class="sourceCode cell-code" id="cb64"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb64-1"><a href="#cb64-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(quanteda)</span>
<span id="cb64-2"><a href="#cb64-2" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(quanteda.textmodels)</span>
<span id="cb64-3"><a href="#cb64-3" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(quanteda.textplots)</span>
<span id="cb64-4"><a href="#cb64-4" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(quanteda.textstats)</span>
<span id="cb64-5"><a href="#cb64-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-6"><a href="#cb64-6" aria-hidden="true" tabindex="-1"></a><span class="co"># loading data</span></span>
<span id="cb64-7"><a href="#cb64-7" aria-hidden="true" tabindex="-1"></a><span class="fu">load</span>(<span class="st">'data/RedditData_Cleaned.Rdata'</span>)</span>
<span id="cb64-8"><a href="#cb64-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-9"><a href="#cb64-9" aria-hidden="true" tabindex="-1"></a><span class="co"># Add title to selftext</span></span>
<span id="cb64-10"><a href="#cb64-10" aria-hidden="true" tabindex="-1"></a>data <span class="ot"><-</span> data <span class="sc">%>%</span></span>
<span id="cb64-11"><a href="#cb64-11" aria-hidden="true" tabindex="-1"></a> <span class="fu">mutate</span>(<span class="at">text =</span> <span class="fu">paste</span>(title, selftext))</span>
<span id="cb64-12"><a href="#cb64-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-13"><a href="#cb64-13" aria-hidden="true" tabindex="-1"></a><span class="co"># Remove text enclosed in [...] or (...); the emptied brackets are dropped later by remove_punct</span></span>
<span id="cb64-14"><a href="#cb64-14" aria-hidden="true" tabindex="-1"></a>data<span class="sc">$</span>text <span class="ot"><-</span> <span class="fu">str_replace_all</span>(data<span class="sc">$</span>text, <span class="st">"(?<=(</span><span class="sc">\\</span><span class="st">[|</span><span class="sc">\\</span><span class="st">())(.*?)(?=(</span><span class="sc">\\</span><span class="st">]|</span><span class="sc">\\</span><span class="st">)))"</span>, <span class="st">""</span>)</span>
<span id="cb64-15"><a href="#cb64-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-16"><a href="#cb64-16" aria-hidden="true" tabindex="-1"></a><span class="co"># Generate corpus</span></span>
<span id="cb64-17"><a href="#cb64-17" aria-hidden="true" tabindex="-1"></a>reddit_corpus <span class="ot"><-</span> <span class="fu">corpus</span>(data,</span>
<span id="cb64-18"><a href="#cb64-18" aria-hidden="true" tabindex="-1"></a> <span class="at">text_field =</span> <span class="st">"text"</span>,</span>
<span id="cb64-19"><a href="#cb64-19" aria-hidden="true" tabindex="-1"></a> <span class="at">docid_field =</span> <span class="st">"id"</span>)</span>
<span id="cb64-20"><a href="#cb64-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-21"><a href="#cb64-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-22"><a href="#cb64-22" aria-hidden="true" tabindex="-1"></a><span class="fu">docvars</span>(reddit_corpus)<span class="sc">$</span>text <span class="ot"><-</span> data<span class="sc">$</span>text</span>
<span id="cb64-23"><a href="#cb64-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-24"><a href="#cb64-24" aria-hidden="true" tabindex="-1"></a><span class="co"># From corpus to tokens</span></span>
<span id="cb64-25"><a href="#cb64-25" aria-hidden="true" tabindex="-1"></a>reddit_tokens <span class="ot"><-</span> <span class="fu">tokens</span>(reddit_corpus,</span>
<span id="cb64-26"><a href="#cb64-26" aria-hidden="true" tabindex="-1"></a> <span class="at">remove_url =</span> <span class="cn">TRUE</span>, <span class="co"># remove URLs</span></span>
<span id="cb64-27"><a href="#cb64-27" aria-hidden="true" tabindex="-1"></a> <span class="at">remove_numbers =</span> <span class="cn">TRUE</span>, <span class="co"># remove numbers</span></span>
<span id="cb64-28"><a href="#cb64-28" aria-hidden="true" tabindex="-1"></a> <span class="at">remove_punct =</span> <span class="cn">TRUE</span>, <span class="co"># remove punctuation</span></span>
<span id="cb64-29"><a href="#cb64-29" aria-hidden="true" tabindex="-1"></a> <span class="at">remove_separators =</span> <span class="cn">TRUE</span>,</span>
<span id="cb64-30"><a href="#cb64-30" aria-hidden="true" tabindex="-1"></a> <span class="co"># remove empty spaces/line breaks</span></span>
<span id="cb64-31"><a href="#cb64-31" aria-hidden="true" tabindex="-1"></a> <span class="at">remove_symbols =</span> <span class="cn">TRUE</span>)</span>
<span id="cb64-32"><a href="#cb64-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-33"><a href="#cb64-33" aria-hidden="true" tabindex="-1"></a><span class="co"># Further pre-processing of tokens:</span></span>
<span id="cb64-34"><a href="#cb64-34" aria-hidden="true" tabindex="-1"></a>own_stopwords <span class="ot"><-</span> <span class="fu">c</span>(<span class="st">"i’m"</span>, <span class="st">"it’s"</span>, <span class="st">"i’ve"</span>,</span>
<span id="cb64-35"><a href="#cb64-35" aria-hidden="true" tabindex="-1"></a> <span class="st">"he’s"</span>, <span class="st">"she’s"</span>,</span>
<span id="cb64-36"><a href="#cb64-36" aria-hidden="true" tabindex="-1"></a> <span class="st">"doesn’t"</span>, <span class="st">"don’t"</span>, <span class="st">"didn’t"</span>,</span>
<span id="cb64-37"><a href="#cb64-37" aria-hidden="true" tabindex="-1"></a> <span class="st">"they’ve"</span>, <span class="st">"they’re"</span>, <span class="st">"tl"</span>,</span>
<span id="cb64-38"><a href="#cb64-38" aria-hidden="true" tabindex="-1"></a> <span class="st">"dr"</span>, <span class="st">"tldr"</span>, <span class="st">"ldr"</span>, <span class="st">"tld"</span>,</span>
<span id="cb64-39"><a href="#cb64-39" aria-hidden="true" tabindex="-1"></a> <span class="st">"can’t"</span>, <span class="st">"cant"</span>, <span class="st">"can"</span>,</span>
<span id="cb64-40"><a href="#cb64-40" aria-hidden="true" tabindex="-1"></a> <span class="st">"cannot"</span>, <span class="st">"wont"</span>, <span class="st">"im"</span>,</span>
<span id="cb64-41"><a href="#cb64-41" aria-hidden="true" tabindex="-1"></a> <span class="st">"its"</span>, <span class="st">"hes"</span>, <span class="st">"shes"</span>, <span class="st">"theyve"</span>,</span>
<span id="cb64-42"><a href="#cb64-42" aria-hidden="true" tabindex="-1"></a> <span class="st">"theyre"</span>, <span class="st">"hed"</span>, <span class="st">"shed"</span>, <span class="st">"_+"</span>,</span>
<span id="cb64-43"><a href="#cb64-43" aria-hidden="true" tabindex="-1"></a> <span class="st">"arent"</span>, <span class="st">"post"</span>,</span>
<span id="cb64-44"><a href="#cb64-44" aria-hidden="true" tabindex="-1"></a> <span class="st">"crosspost"</span>, <span class="st">"update"</span>, <span class="st">"edit"</span>,</span>
<span id="cb64-45"><a href="#cb64-45" aria-hidden="true" tabindex="-1"></a> <span class="st">"hi"</span>, <span class="st">"hello"</span>, <span class="st">"everyone"</span>,</span>
<span id="cb64-46"><a href="#cb64-46" aria-hidden="true" tabindex="-1"></a> <span class="st">"we’ve"</span>, <span class="st">"he’d"</span>, <span class="st">"#x200b"</span>,</span>
<span id="cb64-47"><a href="#cb64-47" aria-hidden="true" tabindex="-1"></a> <span class="st">"aren’t"</span>, <span class="st">"he’ll"</span>, <span class="st">"i’d"</span>,</span>
<span id="cb64-48"><a href="#cb64-48" aria-hidden="true" tabindex="-1"></a> <span class="st">"really"</span>, <span class="st">"ive"</span>, <span class="st">"dont"</span>,</span>
<span id="cb64-49"><a href="#cb64-49" aria-hidden="true" tabindex="-1"></a> <span class="st">"doesnt"</span>, <span class="st">"didnt"</span>)</span>
<span id="cb64-50"><a href="#cb64-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-51"><a href="#cb64-51" aria-hidden="true" tabindex="-1"></a><span class="co"># Lowercase, then remove standard English and custom stopwords:</span></span>
<span id="cb64-52"><a href="#cb64-52" aria-hidden="true" tabindex="-1"></a>reddit_tokens <span class="ot"><-</span> reddit_tokens <span class="sc">%>%</span> </span>
<span id="cb64-53"><a href="#cb64-53" aria-hidden="true" tabindex="-1"></a> <span class="fu">tokens_tolower</span>() <span class="sc">%>%</span> </span>
<span id="cb64-54"><a href="#cb64-54" aria-hidden="true" tabindex="-1"></a> <span class="fu">tokens_remove</span>(<span class="fu">stopwords</span>(<span class="st">'english'</span>), </span>
<span id="cb64-55"><a href="#cb64-55" aria-hidden="true" tabindex="-1"></a> <span class="at">padding =</span> <span class="cn">FALSE</span>) <span class="sc">%>%</span></span>
<span id="cb64-56"><a href="#cb64-56" aria-hidden="true" tabindex="-1"></a> <span class="fu">tokens_remove</span>(own_stopwords)</span>
<span id="cb64-57"><a href="#cb64-57" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-58"><a href="#cb64-58" aria-hidden="true" tabindex="-1"></a><span class="co"># Inspect results:</span></span>
<span id="cb64-59"><a href="#cb64-59" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(reddit_tokens[[<span class="dv">1</span>]])</span>
<span id="cb64-60"><a href="#cb64-60" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-61"><a href="#cb64-61" aria-hidden="true" tabindex="-1"></a><span class="co"># Detect collocations</span></span>
<span id="cb64-62"><a href="#cb64-62" aria-hidden="true" tabindex="-1"></a>colls <span class="ot"><-</span> <span class="fu">textstat_collocations</span>(reddit_tokens,</span>
<span id="cb64-63"><a href="#cb64-63" aria-hidden="true" tabindex="-1"></a> <span class="at">min_count =</span> <span class="dv">400</span>)</span>
<span id="cb64-64"><a href="#cb64-64" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-65"><a href="#cb64-65" aria-hidden="true" tabindex="-1"></a><span class="co"># look at terms</span></span>
<span id="cb64-66"><a href="#cb64-66" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(colls[<span class="fu">order</span>(<span class="fu">desc</span>(colls<span class="sc">$</span>lambda)),])</span>
<span id="cb64-67"><a href="#cb64-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-68"><a href="#cb64-68" aria-hidden="true" tabindex="-1"></a><span class="co"># only consider collocations with a minimum lambda of 4</span></span>
<span id="cb64-69"><a href="#cb64-69" aria-hidden="true" tabindex="-1"></a>colls <span class="ot"><-</span> colls[colls<span class="sc">$</span>lambda <span class="sc">>=</span> <span class="dv">4</span>, ]</span>
<span id="cb64-70"><a href="#cb64-70" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-71"><a href="#cb64-71" aria-hidden="true" tabindex="-1"></a><span class="co"># Some last refinements after further inspection of the collocations:</span></span>
<span id="cb64-72"><a href="#cb64-72" aria-hidden="true" tabindex="-1"></a>reddit_tokens_c <span class="ot"><-</span> <span class="fu">tokens_compound</span>(reddit_tokens, colls) <span class="sc">%>%</span></span>
<span id="cb64-73"><a href="#cb64-73" aria-hidden="true" tabindex="-1"></a> <span class="fu">tokens_remove</span>(<span class="st">''</span>) <span class="co"># remove empty strings</span></span>
<span id="cb64-74"><a href="#cb64-74" aria-hidden="true" tabindex="-1"></a>reddit_tokens_c <span class="ot"><-</span> reddit_tokens_c <span class="sc">%>%</span></span>
<span id="cb64-75"><a href="#cb64-75" aria-hidden="true" tabindex="-1"></a> <span class="fu">tokens_remove</span>(<span class="fu">c</span>(<span class="st">"title_says"</span>, <span class="st">"long_story"</span>, <span class="st">"story_short"</span>))</span>
<span id="cb64-76"><a href="#cb64-76" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-77"><a href="#cb64-77" aria-hidden="true" tabindex="-1"></a><span class="co"># From tokens to dfm</span></span>
<span id="cb64-78"><a href="#cb64-78" aria-hidden="true" tabindex="-1"></a>reddit_dfm <span class="ot"><-</span> <span class="fu">dfm</span>(reddit_tokens_c)</span>
<span id="cb64-79"><a href="#cb64-79" aria-hidden="true" tabindex="-1"></a><span class="fu">head</span>(reddit_dfm)</span>
<span id="cb64-80"><a href="#cb64-80" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-81"><a href="#cb64-81" aria-hidden="true" tabindex="-1"></a><span class="co"># Only keep terms that appear in at least 10 documents & are at least 2</span></span>
<span id="cb64-82"><a href="#cb64-82" aria-hidden="true" tabindex="-1"></a><span class="co"># characters long</span></span>
<span id="cb64-83"><a href="#cb64-83" aria-hidden="true" tabindex="-1"></a>reddit_dfm <span class="ot"><-</span> reddit_dfm <span class="sc">%>%</span></span>
<span id="cb64-84"><a href="#cb64-84" aria-hidden="true" tabindex="-1"></a> <span class="fu">dfm_trim</span>(<span class="at">min_docfreq =</span> <span class="dv">10</span>) <span class="sc">%>%</span></span>
<span id="cb64-85"><a href="#cb64-85" aria-hidden="true" tabindex="-1"></a> <span class="fu">dfm_keep</span>(<span class="at">min_nchar =</span> <span class="dv">2</span>)</span>
<span id="cb64-86"><a href="#cb64-86" aria-hidden="true" tabindex="-1"></a><span class="co">#View(reddit_dfm)</span></span>
<span id="cb64-87"><a href="#cb64-87" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-88"><a href="#cb64-88" aria-hidden="true" tabindex="-1"></a><span class="co"># Save dfm</span></span>
<span id="cb64-89"><a href="#cb64-89" aria-hidden="true" tabindex="-1"></a><span class="co">#saveRDS(reddit_dfm, "reddit_dfm.rds")</span></span>
<span id="cb64-90"><a href="#cb64-90" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-91"><a href="#cb64-91" aria-hidden="true" tabindex="-1"></a><span class="do">## Create STM object</span></span>
<span id="cb64-92"><a href="#cb64-92" aria-hidden="true" tabindex="-1"></a>reddit_stm <span class="ot"><-</span> quanteda<span class="sc">::</span><span class="fu">convert</span>(reddit_dfm, <span class="at">to =</span> <span class="st">'stm'</span>)</span>
<span id="cb64-93"><a href="#cb64-93" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb64-94"><a href="#cb64-94" aria-hidden="true" tabindex="-1"></a><span class="co"># Save stm object</span></span>
<span id="cb64-95"><a href="#cb64-95" aria-hidden="true" tabindex="-1"></a><span class="co">#saveRDS(reddit_stm, "reddit_stm.rds")</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
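<p>To see what the bracket pattern in the script above does, it can be applied to a couple of made-up titles. The pattern string is copied from the script; the example texts and the names <code>bracket_pattern</code>, <code>examples</code>, and <code>stripped</code> are assumptions for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(stringr)

# Same pattern as in the preprocessing script: match whatever sits between
# an opening "[" or "(" and the next closing "]" or ")"
bracket_pattern <- "(?<=(\\[|\\())(.*?)(?=(\\]|\\)))"

# Made-up example titles
examples <- c("My boyfriend [27M] and I (25F) keep arguing",
              "No brackets here")

stripped <- str_replace_all(examples, bracket_pattern, "")
print(stripped)</code></pre></div>
</div>
<p>The pattern only removes the content between the delimiters; the emptied <code>[]</code> and <code>()</code> are expected to remain and are removed later, when <code>tokens()</code> is called with <code>remove_punct = TRUE</code>.</p>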
</section>
</section>
<section id="analysis" class="level1">
<h1>Analysis</h1>
<section id="descriptive-statistics" class="level2">
<h2 class="anchored" data-anchor-id="descriptive-statistics">Descriptive Statistics</h2>
<section id="distribution-of-author-age-groups" class="level4">
<h4 class="anchored" data-anchor-id="distribution-of-author-age-groups">Distribution of Author Age Groups</h4>
<p>We now examine the distribution of various variables in our data set by age group. First, we look at how old the writers of our Reddit posts are overall.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures1-1.png" class="img-fluid" width="672"></p>
</div>
</div>
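<p>The code behind this figure is not shown. As a rough sketch, ages could be binned into groups with <code>cut()</code> and then tabulated; the ages and bin edges below are made up, since the cut-offs actually used are not stated here.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical ages and bin edges (assumptions, not the project's values)
age <- c(18, 19, 24, 25, 31, 38, 45, 52, 67)

age_group <- cut(age,
                 breaks = c(18, 25, 35, 50, Inf),
                 labels = c("18-24", "25-34", "35-49", "50+"),
                 right = FALSE)  # intervals are closed on the left: [lower, upper)

# Count posts per age group and plot the counts
age_counts <- table(age_group)
barplot(age_counts, xlab = "Age group", ylab = "Number of posts")</code></pre></div>
</div>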
<p>Now, we look at the distribution of author age groups across subreddits.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures3-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Age groups over time:</p>
<p>Over the years:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures6-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Across weekdays:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures7-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Across hours of the day:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures8-1.png" class="img-fluid" width="672"></p>
</div>
</div>
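<p>The breakdowns by year, weekday, and hour require splitting each post's timestamp into its components. A minimal sketch, assuming posts carry a Unix timestamp in a column such as <code>created_utc</code> (a hypothetical name here; the timestamps below are made up):</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical Unix timestamps (seconds since 1970-01-01 00:00:00 UTC)
created_utc <- c(0, 86400 + 3600 * 13, 1672531200)

# Convert to date-time objects in UTC
created <- as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC")

post_year    <- as.integer(format(created, "%Y"))  # calendar year
post_hour    <- as.integer(format(created, "%H"))  # hour of day, 0 to 23
post_weekday <- format(created, "%u")              # weekday number, 1 = Monday, 7 = Sunday

data.frame(created, post_year, post_weekday, post_hour)</code></pre></div>
</div>
<p>Hours derived this way are in UTC and not in the author's local time, which matters when interpreting posting activity across the day.</p>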
<p>Across the whole observation period:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures9-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Distribution of upvotes by age group and subreddit:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures10-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Distribution of the number of comments by age group and subreddit:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-figures11-1.png" class="img-fluid" width="672"></p>
</div>
</div>
</section>
<section id="age-as-a-continuous-variable" class="level4">
<h4 class="anchored" data-anchor-id="age-as-a-continuous-variable">Age as a continuous variable</h4>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-cont1-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Distribution by subreddit:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-cont2-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Age distribution by subreddit and gender:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-cont4-1.png" class="img-fluid" width="672"></p>
</div>
</div>
</section>
<section id="age-distribution-by-subreddit-and-gender-for-non-cis-gender-authors" class="level4">
<h4 class="anchored" data-anchor-id="age-distribution-by-subreddit-and-gender-for-non-cis-gender-authors">Age distribution by subreddit and gender for non-cis gender authors</h4>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/age-cont6-1.png" class="img-fluid" width="672"></p>
</div>
</div>
</section>
<section id="distribution-of-author-gender" class="level4">
<h4 class="anchored" data-anchor-id="distribution-of-author-gender">Distribution of Author Gender</h4>
<p>In the next steps, we look at the distribution of various variables in our data set by gender (sometimes in combination with the subreddit in which a post was posted). We start by checking the gender of the writers of our posts:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/gender-figures-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Next, we look at the distribution of posts across subreddits:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="SICSS_Group_project_Reddit_files/figure-html/gender-figures2-1.png" class="img-fluid" width="672"></p>
</div>
</div>
<p>Now, gender over time:</p>
<div class="cell">
<div class="cell-output-display">