# -*- coding: utf-8 -*-
"""SC1015_Mini_Project.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1Ui0s3H-CxEHLlRc0PvCCcfNcxYIcEwzE
# **Classification and Detection of Disaster Tweets**
---
Natural disasters cause an average of 60,000 deaths worldwide each year. When a natural disaster strikes, witnesses often report it on social media in real time, for instance on Twitter or Facebook, and many people turn to social media for news because it is faster than traditional media. Since eyewitness reports appear there first, rescue operators need to respond quickly; however, there is currently no system in place to alert rescue operators about a disaster posted on social media.
The goal of this project is to identify tweets that are deemed `Disaster Tweets` through the use of machine learning.
In order to achieve the goals set out, we will need to:
* Find a suitable dataset
* Clean the dataset
* Find a suitable model for training
* Implement the idea
## Prerequisite
---
### Import Libraries for project
---
Before we begin, we will import the following libraries:
> `numpy` - Arrays for data manipulation
> `pandas` - Data manipulation for the source dataset
> `seaborn`, `matplotlib` - Data visualization libraries
> `wordcloud` - Wordcloud visualization
> `sklearn` - Used to split train and test data
> `tensorflow` - Machine learning library
> `re` - Regular expressions for cleaning data
> `string` - Lists all punctuation for cleaning data
> `nltk` - Tokenizer used for word counts during exploratory analysis
> `collections` - `Counter` for word frequencies
"""
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense, Dropout, LSTM, Bidirectional, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy,BinaryCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy,BinaryAccuracy
import re
import string
import nltk
from nltk.tokenize import RegexpTokenizer
import collections
"""### Import Dataset from CSV File
---
Dataset from [Kaggle](https://www.kaggle.com/competitions/nlp-getting-started/data).
Since it is formatted as a CSV file, we can import the dataset into Python through the `read_csv()` function.
The `head()` function is used to verify that the dataset was successfully imported.
"""
fileURL = "https://raw.githubusercontent.com/woonyee28/mini-project/main/data/train.csv" #Assign link of dataset to variable
original_tweets = pd.read_csv(fileURL) #Import data into original_tweets
#original_tweets = pd.read_csv("/content/Dataset/mini_project_train.csv") #Import data from CSV File
original_tweets.head() #Verify that data is imported
"""### Count number of dataset for each category
---
The dataset `original_tweets` has a column named `target` that stores values `0`, and `1`.
The value represents whether data in the row is classified as Disaster Text, where `1` represents **Disaster Text**, and `0` represents **Non-Disaster Text**.
We will change the column name to `isDisaster` to make it clearer, then count the number of dataset in each category of `isDisaster`
"""
original_tweets.rename(columns={'target': 'isDisaster'}, inplace=True) #Rename the column 'target' to 'isDisaster'
original_tweets.groupby('isDisaster').count() #Group based on the category, and count the number of entries for each category
"""Number of tweets classified as **Disaster Speech**: **3,271**
Number of tweets classified as **Non-Disaster Speech**: **4,342**
The dataset has a disproportionate number of Disaster Speech to Non-Disaster Speech. We will need to balance the dataset. We will be doing that in the **Imbalanced Dataset** section
## Data Cleaning
---
The data we chose contains special characters that would interfere with training, so we will need to clean up all unnecessary characters before continuing.
### Data Cleaning Functions
---
To use the data, we will need to clean up any unnecessary characters that may cause issues:
> `remove_user()` - Removes `@user` mentions found in tweets. Removing `@user` increases the accuracy of the result; otherwise it shows up in the wordcloud for both Disaster Text and Non-Disaster Text.
> `remove_URL()` - Removes URLs found in tweets.
> `remove_HTML()` - Removes HTML tags, if any.
> `replace_HTML_reserve()` - Replaces HTML entity references such as `&amp;`, `&lt;`, `&gt;` with the original characters `&`, `<`, `>`.
> `remove_emoji()` - Removes any emojis in tweets.
> `decontraction()` - Expands contractions, e.g., `let's` into its original form `let us`.
> `remove_non_alphanumspace()` - Removes stray characters such as `Â` and `ð`, punctuation, and other special characters.
> `seperate_alphanumeric()` - Separates words like `gr8` into `gr 8` for data processing later on.
> `cont_rep_char()` - Replaces a run of 3 or more repeated characters with 2 repeated characters, e.g., `eee` to `ee`.
> `unique_char()` - Finds repeated characters and passes them to the `cont_rep_char()` function, which then replaces them.
> `remove_all()` - Takes the parameters `(dataset, column)` and executes the above functions on the given column of the dataset.
"""
def remove_user(text):
    user = re.compile(r'@user')
    return user.sub(r'', text)

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_HTML(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def replace_HTML_reserve(text):
    #Decode HTML entity references back into the characters they stand for
    text = re.sub(r"&amp;", "&", text)
    text = re.sub(r"&lt;", "<", text)
    text = re.sub(r"&gt;", ">", text)
    text = re.sub(r"&le;", "<=", text)
    text = re.sub(r"&ge;", ">=", text)
    return text
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
def decontraction(text):
    #Longer patterns come first so that e.g. "won't've" is not pre-empted by "won't"
    text = re.sub(r"won\'t've", "will not have", text)
    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"can\'t've", "can not have", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"don\'t", "do not", text)
    text = re.sub(r"ma\'am", "madam", text)
    text = re.sub(r"let\'s", "let us", text)
    text = re.sub(r"ain\'t", "am not", text)
    text = re.sub(r"sha\'n\'t", "shall not", text)
    text = re.sub(r"shan\'t", "shall not", text)
    text = re.sub(r"o\'clock", "of the clock", text)
    text = re.sub(r"y\'all", "you all", text)
    text = re.sub(r"n\'t've", " not have", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"(\S)\'d've", r"\1 would have", text)
    text = re.sub(r"(\S)\'d", r"\1 would", text)
    text = re.sub(r"(\S)\'ll've", r"\1 will have", text)
    text = re.sub(r"(\S)\'ll", r"\1 will", text)
    text = re.sub(r"(\S)\'re", r"\1 are", text)
    text = re.sub(r"(\S)\'s", r"\1 is", text)
    text = re.sub(r"(\S)\'t", r"\1 not", text)
    text = re.sub(r"(\S)\'ve", r"\1 have", text)
    text = re.sub(r"(\S)\'m", r"\1 am", text)
    return text
"""
def remove_punct(text):
table=str.maketrans('','',string.punctuation)
return text.translate(table)
"""
def seperate_alphanumeric(text):
    words = re.findall(r"[^\W\d_]+|\d+", text)
    return " ".join(words)

def cont_rep_char(text):
    tchr = text.group(0)
    if len(tchr) > 1:
        return tchr[0:2]

def unique_char(rep, text):
    substitute = re.sub(r'([A-Za-z])\1+', rep, text)
    return substitute

def remove_non_alphanumspace(text):
    pattern = re.compile(r'[^0-9a-zA-Z\s]+')
    return pattern.sub(r'', text)

def remove_all(dataset, column):
    dataset[column] = dataset[column].apply(remove_user)
    dataset[column] = dataset[column].apply(remove_URL)
    dataset[column] = dataset[column].apply(remove_HTML)
    dataset[column] = dataset[column].apply(replace_HTML_reserve)
    dataset[column] = dataset[column].apply(remove_emoji)
    dataset[column] = dataset[column].apply(decontraction)
    dataset[column] = dataset[column].apply(remove_non_alphanumspace)
    dataset[column] = dataset[column].apply(seperate_alphanumeric)
    dataset[column] = dataset[column].apply(lambda x: unique_char(cont_rep_char, x))
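"""As a quick sanity check, here is the full pipeline applied to a single made-up tweet (a hypothetical example, not taken from the dataset):
"""
demo = pd.DataFrame({'text': ["@user Fiiire spreading fast!!! gr8 chaos, see https://t.co/abc &amp; stay safe"]})
remove_all(demo, 'text')
print(demo['text'].iloc[0]) #-> "Fiire spreading fast gr 8 chaos see stay safe"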
"""### Cleaning the data
---
Once we construct the different functions to clean our data, we can start cleaning our dataset and verify it.
"""
remove_all(original_tweets, 'text') #Removes all unnecessary characters from the 'text' column
original_tweets.drop(['keyword', 'location'], axis=1, inplace=True) #Removes unnecessary columns from the dataset
original_tweets.head(10) #Verify that all unnecessary characters have been removed from the dataset
"""### Filtering Disaster Text vs Non-Disaster Text
---
We will filter out disaster text and non-disaster text by the value of column `isDisaster`. We will store them into `isDisaster_tweets` and `isNotDisaster_tweets` respectively.
"""
isDisaster_tweets = original_tweets[original_tweets.isDisaster == 1] #Filters out all disaster texts and stores it into isDisaster_tweets
isNotDisaster_tweets = original_tweets[original_tweets.isDisaster == 0] #Filters out all non-disaster text and stores it into isNotDisaster_tweets
"""#### Verify filter"""
isDisaster_tweets.head(10)
isNotDisaster_tweets.head(10)
"""## Imbalanced Dataset
---
Previously, we have identified that there is a disproportionate number of Disaster Text to Non-Disaster Text. We will be doing data balancing here.
### Finding percentage of Disaster Text and Non-Disaster Text
---
"""
plt.figure(figsize = (8,6))
cntPlot = sns.countplot(x = 'isDisaster', data = original_tweets)
plt.title('Before Downsampling', fontsize=20)
for p in cntPlot.patches:
    percentage = '{:.2f}%'.format(100 * p.get_height() / float(len(isNotDisaster_tweets) + len(isDisaster_tweets)))
    x = p.get_x() + p.get_width()
    y = p.get_height()
    cntPlot.annotate(percentage, (x, y))
plt.show()
print(isNotDisaster_tweets.shape, isDisaster_tweets.shape)
"""From the graph, we can see that the percentage is far from the ideal `50%`. The number of data in each category is also imbalanced (**4,342** vs **3,271**). We will need to downsample `isDisaster = 0` to produce a balanced training dataset.
### Downsample Data
---
Using the `sample()` function, we can downsample `isDisaster = 0` to match the number of entries in `isDisaster = 1`.
NOTE: comment out the line below to train on the original, imbalanced data instead.
"""
isNotDisaster_tweets = isNotDisaster_tweets.sample(n=len(isDisaster_tweets)) #Randomly downsample; pass random_state here for a reproducible sample
"""### Verify downsampling
---
"""
tweets_concat = pd.concat([isNotDisaster_tweets,isDisaster_tweets]).reset_index(drop=True)
plt.figure(figsize=(8,6))
plt.title('After Downsampling', fontsize=20)
cntPlot = sns.countplot(x=tweets_concat.isDisaster)
for p in cntPlot.patches:
    percentage = '{:.2f}%'.format(100 * p.get_height() / float(len(isNotDisaster_tweets) + len(isDisaster_tweets)))
    x = p.get_x() + p.get_width()
    y = p.get_height()
    cntPlot.annotate(percentage, (x, y))
plt.show()
print(isNotDisaster_tweets.shape, isDisaster_tweets.shape)
"""Now we have successfully balanced out the percentage to `50%` and it has matching number of data (**3,271**). We can proceed to visualize the data.
## Data Visualization
---
### Preparation for Wordcloud
---
To prepare the data for visualization, we join every tweet in each category into a single string.
"""
isDisaster_tweets_numpy = " ".join(isDisaster_tweets.text.to_numpy().tolist())
isNotDisaster_tweets_numpy = " ".join(isNotDisaster_tweets.text.to_numpy().tolist())
"""### Wordcloud!
---
We can generate wordclouds for both `isDisaster_tweets` and `isNotDisaster_tweets` to visualize the data. From the visualization, we can observe the most common keywords used in each category.
#### Wordcloud for **isDisaster_tweets**
---
"""
isDisaster_tweets_wordcloud = WordCloud(width=520, height=260, stopwords=STOPWORDS, max_font_size=50, background_color="black", colormap='Blues').generate(isDisaster_tweets_numpy)
plt.figure(figsize=(16,10)) #Size of figure
plt.imshow(isDisaster_tweets_wordcloud, interpolation='bilinear') #Render the wordcloud image; bilinear interpolation smooths edges
plt.axis('off') #Turn off axis
plt.show() #Display the wordcloud
"""#### Wordcloud for **isNotDisaster_tweets**
---
"""
isNotDisaster_tweets_wordcloud = WordCloud(width=520, height=260, stopwords=STOPWORDS, max_font_size=50, background_color="black", colormap='Blues').generate(isNotDisaster_tweets_numpy)
plt.figure(figsize=(16,10)) #Size of figure
plt.imshow(isNotDisaster_tweets_wordcloud, interpolation='bilinear') #Render the wordcloud image; bilinear interpolation smooths edges
plt.axis('off') #Turn off axis
plt.show() #Display the wordcloud
"""### Other Data Visualization
---
By using other data visualization tools (mainly histograms), we are able to:
* Compare the `length` of tweets for both categories
* `Count the most common words` in the tweets for both categories
"""
def get_wrd_count(text_lst):
    """Get word counters for EDA"""
    all_wrds = []
    tokenizer = RegexpTokenizer(r'\w+')
    for txt in text_lst:
        wrds = tokenizer.tokenize(txt)
        all_wrds.extend(wrds)
    wrd_counter = collections.Counter(all_wrds)
    return wrd_counter
#Common words for each category
isDisaster_wrd = get_wrd_count(isDisaster_tweets['text'].tolist())
isNotDisaster_wrd = get_wrd_count(isNotDisaster_tweets['text'].tolist())
#Keep the 23rd to 32nd most common words in each category, skipping the very top of the list
isDisaster_wrd_cnt_sorted_32 = isDisaster_wrd.most_common(n=32)
isDisaster_wrd_cnt_sorted_22 = isDisaster_wrd.most_common(n=22)
isDisaster_wrd_cnt_sorted = [item for item in isDisaster_wrd_cnt_sorted_32 if item not in isDisaster_wrd_cnt_sorted_22]
isNotDisaster_wrd_cnt_sorted_32 = isNotDisaster_wrd.most_common(n=32)
isNotDisaster_wrd_cnt_sorted_22 = isNotDisaster_wrd.most_common(n=22)
isNotDisaster_wrd_cnt_sorted = [item for item in isNotDisaster_wrd_cnt_sorted_32 if item not in isNotDisaster_wrd_cnt_sorted_22]
#Unpack the (word, count) pairs into label and height lists for plotting
l0, h0, l1, h1 = [], [], [], []
for wrd, cnt in isDisaster_wrd_cnt_sorted:
    l0.append(wrd)
    h0.append(cnt)
for wrd, cnt in isNotDisaster_wrd_cnt_sorted:
    l1.append(wrd)
    h1.append(cnt)
#print(l1, h1)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(13, 6))
sns.barplot(x=list(range(len(l0))), y=h0, ax=ax[0])
ax[0].set_ylim(top=200)
ax[0].set_xticks(ticks=list(range(len(l0))))
ax[0].set_xticklabels(l0)
ax[0].set_xlabel('Words')
ax[0].set_ylabel('Count')
ax[0].set_title("Some common words for disaster tweets")
sns.barplot(x=list(range(len(l1))), y=h1, ax=ax[1])
ax[1].set_ylim(top=200)
ax[1].set_xticks(ticks=list(range(len(l1))))
ax[1].set_xticklabels(l1)
ax[1].set_xlabel('Words')
ax[1].set_ylabel('Count')
ax[1].set_title("Some common words for non-disaster tweets")
"""From this histogram, we are able to obtain the count of words that most commonly appear in both the categories. This is very similar to the wordcloud, but we can obtain a numerical value of the words in this histogram."""
isDisaster_tweets_len = isDisaster_tweets['text'].apply(len)
isNotDisaster_tweets_len = isNotDisaster_tweets['text'].apply(len)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
sns.histplot(isDisaster_tweets_len, ax=ax[0])
ax[0].set_title("Tweet Length distribution for Disaster Tweets")
sns.histplot(isNotDisaster_tweets_len, ax=ax[1])
ax[1].set_title("Tweet Length distribution for Non-Disaster Tweets")
plt.show()
"""From this histogram, we are able to identify that disaster tweets have a huge spike around 100 to 110. Non-disaster tweets have a smoother peak at around 90 - 115. From this graph, we can use it to estimate the average length of tweet and use that information for the tokenization part.
## Data Pre-processing
---
### Splitting dataset into train and test dataset
---
We will split the dataset into `80%` train and `20%` test
"""
tweets_train, tweets_test, isDisaster_train, isDisaster_test = train_test_split(tweets_concat['text'], tweets_concat['isDisaster'], test_size=0.2)
"""### Text Processing
---
In this part, we will utilize TensorFlow to prepare the data for deep learning models. We will carry out 3 procedures:
> `Tokenization` - The process of turning text into numbers, required because deep learning models do not understand text. Produces a dictionary mapping words to numbers.
> `Sequencing` - The process of reconstructing each text using the numbers found during Tokenization. Produces a sequence of numbers as the result.
> `Padding` - The process of adding 0's to sequences, required because the model needs inputs of the same size.
#### Preparation for Data Pre-processing
---
In this part, we will be setting up parameters for Tokenization, Sequencing, and Padding.
"""
tweets_concat['text_length'] = tweets_concat['text'].apply(len) #Create new column called text_length which stores the length of tweets
labels = tweets_concat.groupby('isDisaster').mean() #Gets the mean of text_length i.e., average length of tweet.
#Comparing and taking the higher average between isDisaster = 0 and isDisaster = 1
largest_mean_length = max(labels['text_length'][0], labels['text_length'][1])
labels
"""The parameters used here are as follows:
> `max_len` - Sets the maximum tweet length we will train with. Tweets with `len(tweet) < max_len` will be padded, while those with `len(tweet) > max_len` will be truncated. We can use the average length as a gauge. By default it would use the longest tweet length, but processing time increases when training the model and the extra effort may not drastically increase accuracy.
> `trunc_type` - Tweets with `len(tweet) > max_len` will be truncated. `post` means sequences are truncated at the end.
> `padding_type` - Tweets with `len(tweet) < max_len` will be padded. `post` means sequences are padded at the end.
> `oov_token` - Used when a word does not appear in the word list built from the train dataset.
> `number_of_tokens` - Sets the maximum number of most-frequent unique words we will keep. Processing time increases as this value increases, so we need to balance it to find an optimal value.
"""
max_len = 75 #round(largest_mean_length)
trunc_type = "post"
padding_type = "post"
oov_token = "<OOV>"
number_of_tokens = 1000
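"""Before processing the real tweets, here is a minimal sketch of all three procedures on a toy corpus (the sentences below are made up for illustration):
"""
toy_tokenizer = Tokenizer(num_words=50, oov_token=oov_token)
toy_tokenizer.fit_on_texts(["fire in the city", "the city is safe"]) #Tokenization: build the word-to-number dictionary
print(toy_tokenizer.word_index) #{'<OOV>': 1, 'the': 2, 'city': 3, 'fire': 4, ...}
toy_seq = toy_tokenizer.texts_to_sequences(["fire in the unknown city"]) #Sequencing: "unknown" maps to the <OOV> index
print(toy_seq) #[[4, 5, 2, 1, 3]]
print(pad_sequences(toy_seq, maxlen=8, padding=padding_type, truncating=trunc_type)) #Padding: zeros appended up to length 8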
"""#### Tokenization
---
"""
tokenizer = Tokenizer(num_words=number_of_tokens, lower=True, oov_token=oov_token) #Initializing the Tokenizer
tokenizer.fit_on_texts(tweets_train) #Builds the word-to-number dictionary from the training tweets
word_index = tokenizer.word_index
"""##### Verify Tokenization
---
"""
print(word_index)
print("Total Unique Tokens in tweets_train: {}".format(len(word_index)))
"""#### Sequencing
---
In this part, we will use the dictionary obtained from the `Tokenizer` to map the words in each tweet to their respective numbers.
"""
training_sequences = tokenizer.texts_to_sequences(tweets_train)
testing_sequences = tokenizer.texts_to_sequences(tweets_test)
"""##### Verify Sequencing
---
"""
print("Training Data:", training_sequences)
print("Testing Data:", testing_sequences)
"""#### Padding
---
In this part, we will use the sequences obtained from `Sequencing` and pad/truncate the tweets to ensure all sequences have equal length.
"""
training_padded = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
testing_padded = pad_sequences(testing_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
"""##### Verify Padding
---
"""
print("Training Data pad/truncate:", training_padded)
print("Testing Data pad/truncate:", testing_padded)
print("Training Tensor\n"\
"---------------\n"\
"Rows: " + str(training_padded.shape[0]) + "\tColumns: " + str(training_padded.shape[1]) + "\n"\
"Percentage: " + str(round((training_padded.shape[0]/(training_padded.shape[0]+testing_padded.shape[0]))*100)) + "%\n")
print("Testing Tensor\n"\
"---------------\n"\
"Rows: " + str(testing_padded.shape[0]) + "\tColumns: " + str(testing_padded.shape[1]) + "\n"\
"Percentage: " + str(round((testing_padded.shape[0]/(training_padded.shape[0]+testing_padded.shape[0]))*100)) + "%\n")
"""## Checkpoint \#01
---
At this point, we have:
* Cleaned our data to remove any special characters and non-alphanumeric characters
* Balanced our data to have 50% `isDisaster = 1` and 50% `isDisaster = 0`
* Visualised our data using various data visualization tools
* Pre-processed our data to prepare it for deep learning
* Dataset:
> `training_padded` - Tweets for training
> `testing_padded` - Tweets for testing
> `isDisaster_train` - Category for tweet, training data
> `isDisaster_test` - Category for tweet, testing data
## Dense Network
---
The code below shows how we implemented the dense model architecture.
"""
vocab_size = 13000 #Size of the Embedding layer vocabulary; must cover every token index the tokenizer can produce
embeding_dim = 16
drop_value = 0.2
n_dense = 24
"""Implementation of the dense model
* The `sequential` calls for keras sequential model in which layers are added in a sequence
* First layer: `Embedding layer` where it takes the integer-encoded vocabularly which was performed by tokenization function during the pre-processing of data which looks up the embedding vector for each word index
* `GlobalAveragePooling layer` then returns a fixed length output vector for each example by averaging over the sequence dimension which allows the model to handle input of variable length in the simplest way and we have converted layer to 1 dimension
* `Dense layer` with activation function `'relu'`
* `Dropouut layer` that prevents overfitting of data
* `Dense layer` with `sigmoid` activation function that outputs the probabilities between 0 and 1 to classify our output
"""
model = Sequential()
model.add(Embedding(vocab_size,embeding_dim,input_length=max_len))
model.add(GlobalAveragePooling1D())
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(drop_value))
model.add(Dense(1,activation='sigmoid'))
model.summary()
"""Compile and train the model using the `Adam optimizer` which is an efficient stochastic gradient descent because it automatically tunes itself and gives good results in a wide range of problems and the `BinaryCrossentropy loss` which is for binary classification problems"""
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
"""### Fitting of Model
The training process will run for a fixed number of iterations through the dataset which is called `Epochs`.
`Epoch:` Number of times the learning algorithm will work through the entire training data set
`Callbacks: `callbacks is used to pass the early stopping parameter
`EarlyStopping` (monitor='val_loss',patience=2) is used to define that we want to monitor the validation loss and if the validation loss is not improved after two epochs, then the model training is stopped. It helps to avoid overfitting problem and indicates when to stop training before the learner begins over-fit
"""
num_epochs = 30
early_stop = EarlyStopping(monitor='val_loss',patience=3)
history = model.fit(training_padded, isDisaster_train, epochs=num_epochs, validation_data=(testing_padded, isDisaster_test),callbacks=[early_stop],verbose=2)
"""**Evaluate our Model**
After training our neural network using the dense model on the entire dataset, we can now evaluate the performance of the network on the same dataset.
This will provide us the idea of how well we have modeled the dataset (e.g. train accuracy).
By using the evaluate() function it will return a list with two values. The first value will be the loss of the model on the dataset while the second will be the accuracy of the model on the dataset.
"""
loss, accuracy = model.evaluate(testing_padded,isDisaster_test)
print('Accuracy:', round((accuracy*100),2),'%')
print('Loss:',round((loss*100),2),'%')
metrics = pd.DataFrame(history.history)
metrics.rename(columns = {'loss':'Training_Loss','accuracy': 'Training_Accuracy', 'val_loss': 'Validation_Loss', 'val_accuracy': 'Validation_Accuracy'},inplace=True)
def plot_graphs1(var1, var2, string):
    metrics[[var1, var2]].plot()
    plt.title('Training and Validation ' + string)
    plt.xlabel('Number of epochs')
    plt.ylabel(string)
    plt.legend([var1, var2])
plot_graphs1('Training_Loss','Validation_Loss','loss')
plot_graphs1('Training_Accuracy','Validation_Accuracy','accuracy')
"""## Dense Network (Overfitting Model)
This model as shown below is most likely to be overfitting. The model as shown as 2 densely connected layers of 64 elements.
In the beginning, the validation loss decreases. However, at approximately epochs 4, validation loss does not continue to decrease but instead, it increases rapidly. This shows that this is where it begins to overfit.
"""
vocab_size = 13000 #Size of the Embedding layer vocabulary; must cover every token index the tokenizer can produce
embeding_dim = 16
drop_value = 0.2
n_dense = 24
overfitmodel = Sequential()
overfitmodel.add(Embedding(vocab_size,embeding_dim,input_length=max_len))
overfitmodel.add(GlobalAveragePooling1D())
overfitmodel.add(Dense(64,activation='relu'))
overfitmodel.add(Dense(64,activation='relu'))
overfitmodel.add(Dense(1,activation='sigmoid'))
overfitmodel.summary()
overfitmodel.compile(loss='binary_crossentropy',optimizer = 'adam', metrics=['accuracy'])
num_epochs = 30
history = overfitmodel.fit(training_padded, isDisaster_train, epochs=num_epochs, validation_data=(testing_padded, isDisaster_test),verbose=2)
loss, accuracy = overfitmodel.evaluate(testing_padded,isDisaster_test)
print('Accuracy:', round((accuracy*100),2),'%')
print('Loss:',round((loss*100),2),'%')
metrics = pd.DataFrame(history.history)
metrics.rename(columns = {'loss':'Training_Loss','accuracy': 'Training_Accuracy', 'val_loss': 'Validation_Loss', 'val_accuracy': 'Validation_Accuracy'},inplace=True)
def plot_graphs1(var1, var2, string):
    metrics[[var1, var2]].plot()
    plt.title('Training and Validation ' + string)
    plt.xlabel('Number of epochs')
    plt.ylabel(string)
    plt.legend([var1, var2])
plot_graphs1('Training_Loss','Validation_Loss','loss')
plot_graphs1('Training_Accuracy','Validation_Accuracy','accuracy')
"""From the training and validation accuracy graph, we can see that as the training accuracy continues to increase, the validation accuracy decreases slightly which increases the discrepancy between the training accuracy and validation accuracy. Therefore, showing that our data may have some extent of overfitting of data.
### Confusion matrices: Overfitted data
The confusion matrix provides us with the model's accuracy, precision, recall, F1 score, and false positive rate.
"""
# Visualise data
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
from sklearn.metrics import confusion_matrix
# Confusion matrix for train and test set
#Train prediction
train_predicted = (overfitmodel.predict(training_padded)>=0.5).astype('int64')
#Test prediction
test_predicted = (overfitmodel.predict(testing_padded)>=0.5).astype('int64')
#Plotting of confusion matrix
f, axes = plt.subplots(1, 2, figsize=(12, 4))
f.suptitle('Training VS Test Performance',fontweight = 'bold', fontsize = 'x-large')
#Train confusion matrix
sb.heatmap(confusion_matrix(isDisaster_train, train_predicted),
           annot=True, fmt=".0f", annot_kws={"size": 18}, ax=axes[0])
#Test confusion matrix
sb.heatmap(confusion_matrix(isDisaster_test, test_predicted),
           annot=True, fmt=".0f", annot_kws={"size": 18}, ax=axes[1])
#Calculate metrics for the training set
CM_train = confusion_matrix(isDisaster_train, train_predicted)
TN = CM_train[0][0]
FN = CM_train[1][0]
TP = CM_train[1][1]
FP = CM_train[0][1]
Accuracy_train = (TP+TN)/(TP+FN+TN+FP)
Precision_train = TP/(TP+FP)
Recall_train = TP/(TP+FN)
FPR = FP/(TN+FP)
F1_score_train = 2*Precision_train*Recall_train/(Precision_train+Recall_train)
Train_stats = "\n\nAccuracy={:0.2f}\nPrecision={:0.2f}\nRecall={:0.2f}\nFalse positive rate={:0.2f}\nF1 Score={:0.2f}".format(
    Accuracy_train, Precision_train, Recall_train, FPR, F1_score_train)
axes[0].set(xlabel='Predicted' + Train_stats, ylabel='Actual', title='Training set')
#Calculate metrics for the test set
CM_test = confusion_matrix(isDisaster_test, test_predicted)
TN = CM_test[0][0]
FN = CM_test[1][0]
TP = CM_test[1][1]
FP = CM_test[0][1]
Accuracy_test = (TP+TN)/(TP+FN+TN+FP)
Precision_test = TP/(TP+FP)
Recall_test = TP/(TP+FN)
FPR = FP/(TN+FP)
F1_score_test = 2*Precision_test*Recall_test/(Precision_test+Recall_test)
Test_stats = "\n\nAccuracy={:0.2f}\nPrecision={:0.2f}\nRecall={:0.2f}\nFalse positive rate={:0.2f}\nF1 Score={:0.2f}".format(
    Accuracy_test, Precision_test, Recall_test, FPR, F1_score_test)
axes[1].set(xlabel='Predicted' + Test_stats, ylabel='Actual', title='Testing set')
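"""For reference, `sklearn` can compute most of these metrics in one call; a minimal sketch, not part of the original analysis:
"""
from sklearn.metrics import classification_report
print(classification_report(isDisaster_test, test_predicted, target_names=['Non-Disaster', 'Disaster']))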
"""## Dense Network (After handling overfit data)
---
To handle the overfitting of data, we:
1. Reduce the network's capacity by removing layers and reducing the number of units in the hidden layer
2. Include a dropout layer to prevent overfitting
3. Include early stopping to stop fitting the model after the validation loss fails to improve for 2 consecutive epochs
"""
vocab_size = 13000 #Size of the Embedding layer vocabulary; must cover every token index the tokenizer can produce
embeding_dim = 16
drop_value = 0.2
n_dense = 24
model = Sequential()
model.add(Embedding(vocab_size,embeding_dim,input_length=max_len))
model.add(GlobalAveragePooling1D())
model.add(Dense(16,activation='relu'))
model.add(Dropout(drop_value))
model.add(Dense(1,activation='sigmoid'))
model.summary()
model.compile(loss='binary_crossentropy',optimizer = 'adam', metrics=['accuracy'])
num_epochs = 30
early_stop = EarlyStopping(monitor='val_loss',patience=2)
history = model.fit(training_padded, isDisaster_train, epochs=num_epochs, validation_data=(testing_padded, isDisaster_test),callbacks=[early_stop],verbose=2)
loss, accuracy = model.evaluate(testing_padded,isDisaster_test)
print('Accuracy:', round((accuracy*100),2),'%')
print('Loss:',round((loss*100),2),'%')
metrics = pd.DataFrame(history.history)
metrics.rename(columns = {'loss':'Training_Loss','accuracy': 'Training_Accuracy', 'val_loss': 'Validation_Loss', 'val_accuracy': 'Validation_Accuracy'},inplace=True)
def plot_graphs1(var1, var2, string):
    metrics[[var1, var2]].plot()
    plt.title('Training and Validation ' + string)
    plt.xlabel('Number of epochs')
    plt.ylabel(string)
    plt.legend([var1, var2])
plot_graphs1('Training_Loss','Validation_Loss','loss')
plot_graphs1('Training_Accuracy','Validation_Accuracy','accuracy')
"""### Confusion matrices: After handling overfitted data
The confusion matrix provides us with the model's accuracy, precision, recall, F1 score, and false positive rate.
"""
# Visualise data
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
from sklearn.metrics import confusion_matrix
# Confusion matrix for train and test set
#Train prediction
train_predicted = (model.predict(training_padded)>=0.5).astype('int64')
#Test prediction
test_predicted = (model.predict(testing_padded)>=0.5).astype('int64')
#Plotting of confusion matrix
f, axes = plt.subplots(1, 2, figsize=(12, 4))
f.suptitle('Training VS Test Performance',fontweight = 'bold', fontsize = 'x-large')
#Train confusion matrix
sb.heatmap(confusion_matrix(isDisaster_train, train_predicted),
           annot=True, fmt=".0f", annot_kws={"size": 18}, ax=axes[0])
#Test confusion matrix
sb.heatmap(confusion_matrix(isDisaster_test, test_predicted),
           annot=True, fmt=".0f", annot_kws={"size": 18}, ax=axes[1])
#Calculate metrics for the training set
CM_train = confusion_matrix(isDisaster_train, train_predicted)
TN = CM_train[0][0]
FN = CM_train[1][0]
TP = CM_train[1][1]
FP = CM_train[0][1]
Accuracy_train = (TP+TN)/(TP+FN+TN+FP)
Precision_train = TP/(TP+FP)
Recall_train = TP/(TP+FN)
FPR = FP/(TN+FP)
F1_score_train = 2*Precision_train*Recall_train/(Precision_train+Recall_train)
Train_stats = "\n\nAccuracy={:0.2f}\nPrecision={:0.2f}\nRecall={:0.2f}\nFalse positive rate={:0.2f}\nF1 Score={:0.2f}".format(
    Accuracy_train, Precision_train, Recall_train, FPR, F1_score_train)
axes[0].set(xlabel='Predicted' + Train_stats, ylabel='Actual', title='Training set')
#Calculate metrics for the test set
CM_test = confusion_matrix(isDisaster_test, test_predicted)
TN = CM_test[0][0]
FN = CM_test[1][0]
TP = CM_test[1][1]
FP = CM_test[0][1]
Accuracy_test = (TP+TN)/(TP+FN+TN+FP)
Precision_test = TP/(TP+FP)
Recall_test = TP/(TP+FN)
FPR = FP/(TN+FP)
F1_score_test = 2*Precision_test*Recall_test/(Precision_test+Recall_test)
Test_stats = "\n\nAccuracy={:0.2f}\nPrecision={:0.2f}\nRecall={:0.2f}\nFalse positive rate={:0.2f}\nF1 Score={:0.2f}".format(
    Accuracy_test, Precision_test, Recall_test, FPR, F1_score_test)
axes[1].set(xlabel='Predicted' + Test_stats, ylabel='Actual', title='Testing set')
"""After handling the overfitting of data,
* Accuracy have improved
* Smaller discrepancy between the training and validation set
## Comparision of results (Dense Network)
---
The bar plot helps to compare between the accuracy of the data that were overfitted and the accurcay of the data after we had handled the overfitting of data.
Clearly, after handling the overfitting of our data, it have help to increase the accuracy of our data.
"""
# Visualise data before and after we fix overfitting data in terms of accuracy
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
overfitDense_loss,overfitDense_accuracy = overfitmodel.evaluate(testing_padded,isDisaster_test)
Dense_loss,Dense_accuracy = model.evaluate(testing_padded,isDisaster_test)
models = ['Overfit','Normal']
accuracy = [overfitDense_accuracy,Dense_accuracy]
df = pd.DataFrame({"Models":models,"Accuracy":accuracy})
df
plt.figure(figsize=(8,6))
sns.barplot(x='Models',y="Accuracy",data=df,order=df.sort_values('Accuracy').Models)
plt.xlabel("Models",size = 15)
plt.ylabel("Accuracy",size = 15)
plt.title("Accuracy of different models",size = 15)
plt.tight_layout()
"""## LSTM Model
---
Long Short-Term Memory (LSTM) networks are a special kind of RNN capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem; remembering information for long periods of time is practically their default behaviour.
"""
n_lstm = 20
drop_lstm = 0.2
model1 = Sequential()
model1.add(Embedding(vocab_size, embeding_dim, input_length = max_len))
model1.add(LSTM(n_lstm, dropout=drop_lstm)) #Return only the final hidden state so the Dense layer outputs one probability per tweet
model1.add(Dense(1,activation='sigmoid'))
model1.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy'])
num_epochs = 30
early_stop = EarlyStopping(monitor='val_loss',patience=2)
history = model1.fit(training_padded,isDisaster_train,epochs=num_epochs,validation_data=(testing_padded,isDisaster_test),callbacks=[early_stop],verbose=2)
model1.evaluate(testing_padded,isDisaster_test)
metrics = pd.DataFrame(history.history)
metrics.rename(columns = {'loss': 'Training_Loss', 'accuracy': 'Training_Accuracy',
'val_loss': 'Validation_Loss', 'val_accuracy': 'Validation_Accuracy'}, inplace = True)
def plot_graphs1(var1, var2, string):
    metrics[[var1, var2]].plot()
    plt.title('LSTM Model: Training and Validation ' + string)
    plt.xlabel('Number of epochs')
    plt.ylabel(string)
    plt.legend([var1, var2])
plot_graphs1('Training_Loss', 'Validation_Loss', 'loss')
plot_graphs1('Training_Accuracy', 'Validation_Accuracy', 'accuracy')
"""## Bi-directional LSTM
---
Lastly, we use a Bidirectional LSTM to train our model. A Bidirectional LSTM gives the neural network sequence information in both directions: the input flows both forwards and backwards, which distinguishes it from the plain LSTM we used before.
"""
model2 = Sequential()
model2.add(Embedding(vocab_size, embeding_dim, input_length=max_len))
model2.add(Bidirectional(LSTM(n_lstm, dropout=drop_lstm))) #Return only the final hidden state so the Dense layer outputs one probability per tweet
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])
num_epochs = 30
early_stop = EarlyStopping(monitor='val_loss', patience=2)
history = model2.fit(training_padded, isDisaster_train, epochs=num_epochs,
validation_data=(testing_padded, isDisaster_test),callbacks =[early_stop], verbose=2)
metrics = pd.DataFrame(history.history)
metrics.rename(columns = {'loss': 'Training_Loss', 'accuracy': 'Training_Accuracy',
'val_loss': 'Validation_Loss', 'val_accuracy': 'Validation_Accuracy'}, inplace = True)
def plot_graphs1(var1, var2, string):
    metrics[[var1, var2]].plot()
    plt.title('BiLSTM Model: Training and Validation ' + string)
    plt.xlabel('Number of epochs')
    plt.ylabel(string)
    plt.legend([var1, var2])
plot_graphs1('Training_Loss', 'Validation_Loss', 'loss')
plot_graphs1('Training_Accuracy', 'Validation_Accuracy', 'accuracy')
"""## Comparing three different models """
print(f"Dense architecture loss and accuracy: {model.evaluate(testing_padded,isDisaster_test)}")
print(f"LSTM architecture loss and accuracy: {model1.evaluate(testing_padded,isDisaster_test)}")
print(f"Bi-LSTM architecture loss and accuracy: {model2.evaluate(testing_padded,isDisaster_test)}")
"""We uses a bar plot to help with the visualisation of our results. Hence, by sorting the data to be shown in an ascending order in the bar plot, our results show that out of the three models, the dense model have provided us with the best accuracy result.
However, we also note that this three models may not provide us with a high accuracy of 90%, hence, in the future, we can try out other natural language process model such as the BERT model.
"""
# Visualisation to compare the different 3 models in terms of their accuracy
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
Dense_loss,Dense_accuracy = model.evaluate(testing_padded,isDisaster_test)
LSTM_loss,LSTM_accuracy = model1.evaluate(testing_padded,isDisaster_test)
BiLSTM_loss,BiLSTM_accuracy = model2.evaluate(testing_padded,isDisaster_test)
models = ['Dense','LSTM','Bi-LSTM']
accuracy = [Dense_accuracy,LSTM_accuracy,BiLSTM_accuracy]
df = pd.DataFrame({"Models":models,"Accuracy":accuracy})
df
plt.figure(figsize=(8,6))
sns.barplot(x='Models',y="Accuracy",data=df,order=df.sort_values('Accuracy').Models)
plt.xlabel("Models",size = 15)
plt.ylabel("Accuracy",size = 15)
plt.title("Accuracy of different models",size = 15)
plt.tight_layout()
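"""As a final sanity check, here is a minimal sketch of classifying a new, made-up tweet with the trained dense model (the tweet text is hypothetical; in practice the raw text should first go through the same cleaning pipeline as the training data):
"""
new_tweet = ["Forest fire near the highway evacuations underway"] #Hypothetical tweet
new_seq = tokenizer.texts_to_sequences(new_tweet)
new_padded = pad_sequences(new_seq, maxlen=max_len, padding=padding_type, truncating=trunc_type)
probability = model.predict(new_padded)[0][0] #Sigmoid output in [0, 1]
print("Disaster" if probability >= 0.5 else "Non-Disaster", probability)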
"""## Other methods: Decision Tree Classifier
---
Decision Tree is a supervised machine learning algorithm that works similarly to how humans make decisions.
Here, we use a decision tree to obtain classification results: the intuition is that the tree learns decision rules from the training data to predict the output.
We create a decision tree with `max_depth = 4` and obtain the results shown in the confusion matrices.
"""
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sb
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 4) # create the decision tree object
dectree.fit(training_padded, isDisaster_train) # train the decision tree model
# Predict
y_train_pred = dectree.predict(training_padded)
y_test_pred = dectree.predict(testing_padded)
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(training_padded, isDisaster_train))
print()
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(testing_padded,isDisaster_test))
print()
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(isDisaster_train, y_train_pred),
           annot=True, fmt=".0f", annot_kws={"size": 18}, ax=axes[0])
sb.heatmap(confusion_matrix(isDisaster_test, y_test_pred),
           annot=True, fmt=".0f", annot_kws={"size": 18}, ax=axes[1])
"""However, as seen from the classification accuracy, the accuracy of the test dataset is approximately 59%. Hence, suggessting that it might not be a good representation of our dataset.
## Other methods : Random Forest Classifier
"""
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_curve, classification_report
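"""A minimal sketch of how these imports might be combined (the hyperparameter grid below is an illustrative assumption, not a tuned result):
"""
param_grid = {'n_estimators': [100, 200], 'max_depth': [4, 8, None]} #Hypothetical search space
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, scoring='accuracy')
grid.fit(training_padded, isDisaster_train) #Cross-validated search over the training set
rf_pred = grid.best_estimator_.predict(testing_padded)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", accuracy_score(isDisaster_test, rf_pred))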