---
title: "Interpol Criminal Data Interpretation"
author: "Sudeep Chinna Kandukuri"
date: "19 July 2018"
output:
  html_document: default
  word_document: default
  pdf_document: default
---
Factly attempts to derive insights from the `Interpol Most Wanted Criminal Records`.
The Interpol criminal records give a comprehensive view of serious crimes and of the individuals wanted for them. The records also provide the descriptive information used to identify each person, such as `hair`, `eyes`, `weight`, `height` and `language`, together with `gender`, `nationality`, `wanted by country` and `criminal charges imposed`.
The R packages required for the Interpol criminal data analysis are loaded first.
```{r Package Chunk, echo=TRUE, message=FALSE, warning=FALSE, results='hide'}
#Load the packages
library(readxl)
library(tm)
library(wordcloud)
library(stringr)
library(ggplot2)
library(SnowballC)
library(dplyr)
library(tidyverse)
library(qdap)
library(magrittr)
library(rJava)
library(textclean)
library(rmarkdown)
library(topicmodels)
library(tidytext)
library(xlsx)
library(plotly)
```
The `Interpol Criminal Data` is imported into the RStudio IDE. Only the variables essential to the interpretation are retained.
For the purposes of this analysis, the general physical description could be misleading when reasoning about the occurrence of a crime, so those variables are removed. In addition, a considerable number of criminal records are in `Spanish` and `French`; these were translated with `Google Translate` and added to the master data-set `criminal`.
```{r echo = FALSE}
#Load the data-set
criminal <- read_excel("criminal_data_clean_2.xlsx")
#Subset / Removal of Labels
criminal_1 <- subset(criminal, select = -c(weight, eyes, hair,
forename, present_family_name, charges))
```
## Interpol Criminal Data Variables
A glimpse of all variables in the Interpol criminal data (`criminal`) and of the retained variables (`criminal_1`).
```{r}
glimpse(criminal)
glimpse(criminal_1)
```
## Text Pre-Processing
The charge-sheet content for each person listed as most wanted is subjected to `text cleansing`. As part of this, a blank space is inserted after every comma, period, semicolon and closing bracket.
```{r echo = TRUE}
#TEXT PARSING
#Prevent two words from being concatenated when punctuation marks are removed.
#Insert a space after every comma
crim <- add_comma_space(criminal_1$Translation)
#Insert a space after every period
crim_1 <- gsub("\\.", ". ", crim)
#Insert a space after every semicolon
crim_2 <- gsub("\\;", "; ", crim_1)
#Insert a space after every closing bracket
crim_3 <- gsub("\\)", ") ", crim_2)
```
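To illustrate why the spacing step matters, the following sketch uses base R only on a made-up string. The helper name `add_space_after_punct` and the sample text are assumptions for illustration and are not part of the report's pipeline.

```r
# Hypothetical helper: insert a space after , . ; ) when a non-space character follows
add_space_after_punct <- function(x) {
  gsub("([,.;)])(?=\\S)", "\\1 ", x, perl = TRUE)
}

# Made-up charge text with no spaces after the punctuation marks
raw <- "murder,robbery;fraud.(armed)assault"

# Removing punctuation directly glues the words together
glued <- gsub("[[:punct:]]", "", raw)

# Spacing first keeps the words separate once punctuation is removed
spaced <- add_space_after_punct(raw)
separated <- gsub("[[:punct:]]", "", spaced)

print(glued)
print(separated)
```

Without the spacing step the five charges collapse into a single token, which would then be counted as one unknown term in the later frequency analysis.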
## Data Transformation
Following the pre-processing above, the data is converted to a `text corpus`. The corpus is then subjected to string normalisation as described below:
a. The charge-sheet contents are converted to `lower-case alphabets`.
b. `Numbers`, `punctuation` and `English stop-words` are removed.
c. Extra white space between the charge-sheet terms is stripped.
```{r}
#Create a Vector Corpus
textcorpus <- VCorpus(VectorSource(crim_3))
#Data Transformation
textcorpus <- tm_map(textcorpus, content_transformer(tolower))
textcorpus <- tm_map(textcorpus, removeNumbers)
textcorpus <- tm_map(textcorpus, removePunctuation)
textcorpus <- tm_map(textcorpus, removeWords, stopwords("english"))
textcorpus <- tm_map(textcorpus, stripWhitespace)
```
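The effect of these steps on a single string can be sketched with base R. This is only an approximation of the `tm` transformations above: the sample sentence and the three-word `stop_words` list are assumptions, and the real English stop-word list is considerably longer.

```r
# Made-up charge text
x <- "Murder of 2 persons, and the Theft of 3 vehicles."

# Tiny stand-in stop-word list (assumption; the real list is much longer)
stop_words <- c("of", "and", "the")

x <- tolower(x)                          # convert to lower case
x <- gsub("[[:digit:]]+", "", x)         # remove numbers
x <- gsub("[[:punct:]]+", "", x)         # remove punctuation
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
x <- gsub(pattern, "", x, perl = TRUE)   # remove whole-word stop-words
x <- gsub("\\s+", " ", x)                # collapse repeated white space
normalised <- trimws(x)                  # drop leading/trailing space

print(normalised)
```

The order matters in the same way as in the chunk above: lower-casing comes first so that the stop-word match does not miss capitalised words, and white space is collapsed last because the removals leave gaps behind.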
## Document-Term Matrix
Next, the normalised text corpus is used to build a `document-term matrix ~ dtm`. The dtm is then converted into a matrix for operational convenience.
```{r}
#Create Document - Term Matrix
in_dtm <- DocumentTermMatrix(textcorpus)
in_dtm
#Convert Document - Term Matrix to a Matrix
in_dtm_df <- as.matrix(in_dtm)
```
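As a rough picture of what a document-term matrix holds, the sketch below builds one by hand in base R for two made-up documents. The names `docs` and `toy_dtm` are illustrative assumptions; the report itself relies on `DocumentTermMatrix`.

```r
# Two made-up, already-normalised documents
docs <- c("murder fraud", "fraud fraud theft")

# Split each document into terms and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term; each cell holds a term count
toy_dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(toy_dtm)

# Column sums give the corpus-wide frequency of each term
term_freq <- sort(colSums(toy_dtm), decreasing = TRUE)
print(term_freq)
```

Summing the columns of such a matrix is what turns per-document counts into the term-frequency list used in the next section.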
## Term Frequency Inventory List
An inventory is taken to identify the top 20 recurring terms across all Interpol criminal charge-sheets.
```{r, echo=FALSE}
#Create Term Frequency Data Frame
tf <- (sort(colSums(in_dtm_df), decreasing = TRUE))
tf_df <- data.frame(word = names(tf), Frequency = tf)
tf_df[1:20,]
```
## Term - Frequency Barplot
The recurring terms in the Interpol criminal charge-sheets are represented graphically below:
```{r, echo=FALSE}
plotly::ggplotly(ggplot(tf_df[1:20,], aes(reorder(word, -Frequency), Frequency)) +
geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
labs(title = "Interpol Criminal Charges - Term Frequency", x = "Term", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
```
## Correlation
The terms describing the nature of each crime are identified and used to find correlated terms. For each term, `findAssocs` lists the associated terms whose correlation is at least `0.1`, ordered from `high` to `low`.
```{r}
#Mapping Associated Words
sexual <- findAssocs(in_dtm, "sexual", 0.1)
sexual
murder <- findAssocs(in_dtm, "murder", 0.1)
murder
firearm <- findAssocs(in_dtm, "firearm", 0.1)
firearm
terrorist <- findAssocs(in_dtm, "terrorist", 0.1)
terrorist
conspiracy <- findAssocs(in_dtm, "conspiracy", 0.1)
conspiracy
money <- findAssocs(in_dtm, "money", 0.1)
money
fraud <- findAssocs(in_dtm, "fraud", 0.1)
fraud
```
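The idea behind these associations can be sketched in base R, assuming the association score is the Pearson correlation between the count columns of two terms. The toy matrix, the `target` term and the `cor_limit` value are made up for illustration; this is not the `tm` implementation.

```r
# Made-up document-term counts: 4 documents, 3 terms
toy_dtm <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 0, 1,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("murder", "firearm", "fraud"))
)

# Pearson correlation of "murder" with every other term column
target <- "murder"
others <- setdiff(colnames(toy_dtm), target)
assoc <- vapply(
  others,
  function(term) cor(toy_dtm[, target], toy_dtm[, term]),
  numeric(1)
)

# Keep terms at or above the lower limit, strongest first
cor_limit <- 0.1
assoc <- sort(assoc[assoc >= cor_limit], decreasing = TRUE)
print(round(assoc, 2))
```

Here `firearm` tends to appear in the same documents as `murder` and is kept, while `fraud` never co-occurs with it, has a negative correlation and is dropped by the limit.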
## Post Text Process Analysis
`textcorpus` is a corpus of `PlainTextDocument` objects, which cannot be used directly for data interpretation and graphical representations. Therefore `textcorpus`, i.e., the cleansed charge-sheets, is converted into a `data.frame` and appended to the master data-set.
```{r}
#Conversion of PlainTextDocument to Character Data.Frame
text <- sapply(textcorpus, as.character)
text_df <- as.data.frame(text, stringsAsFactors = FALSE)
#Appending Cleansed Charges to Original Data-set
crim_bind <- cbind(criminal, text_df)
```
## Criminal Nature Classifier Function
Building on the correlation step, patterns of correlated terms are identified for each of the crime categories listed below:
`sexual abuse`, `murder`, `illegal firearms`, `terrorism and disruptive activity`, `narcotic drugs and psychotropic substances`, `forgery and fraud`, `tax evasion and money laundering`, `robbery and dacoity`, `conspiracy and its consequential crimes`, `copyright infringement and piracy`, `human trafficking and adultery`, `illegal wild-life trade`, `unlawful circulation of precious stones and metals`, `illegal group crime activities`, `prison break`, `traffic offences` and `deprivation of liberty`.
The Criminal Nature Classifier function detects these term patterns. As a result, a new column for each crime category is appended to the master data-set. Each new column only indicates whether an individual is associated with that crime, based on the criminal charges imposed in that individual's record.
```{r}
#Adding a New Column - Crime_Nature to Original Data-set
crim_bind <- crim_bind %>%
mutate(Sexual_Abuse = grepl("sex|sexual|sexual abuse|molestation|lust|incest|pornography|minor|rape|abduction|fornication|intercourse|child", text)) %>%
mutate(Murder = grepl("murder|homicide|genocide|feminicide|femicide|manslaughter|assassination|kill|killing|death", text)) %>%
mutate(Illegal_Firearms = grepl("firearm|armed|ammunition|weapon|arms", text)) %>%
mutate(Terrorism_and_Disruptive_Activity = grepl("terror|terrorist|terrorism|terrorists", text)) %>%
mutate(Narcotic_Drugs_and_Psychotropic_Substances = grepl("narcotic|drugs|drug|psychotropics|drug trafficking|methylenedioxymethamphetamine|grams|cocaine|smuggling|psychoactive|marijuana|ephedrine|heroin|kilograms|doses|trafficking drugs", text)) %>%
mutate(Forgery_and_Fraud = grepl("fake|falsification|falsifying|false|tampering|forge|forged|certification|counterfeit|forgery|fraud|breach trust", text)) %>%
mutate(Tax_Evasion_and_Money_Laundering = grepl("tax|evasion|tax evasion|money|loan|money laundering|laundering|embezzlement|embezzling|financial", text)) %>%
mutate(Robbery_and_Dacoity = grepl("robbery|burglary|stolen|theft|extortion|swindling|dacoity|misappropriation|swindle|stealing", text)) %>%
mutate(Conspiracy_and_Its_Consequential_Crimes = grepl("conspiracy|bribe|bribery|office|position|corruption|corrupt|cheat|cheating|breach trust|trust", text)) %>%
mutate(Copyright_Infringement_and_Piracy = grepl("copyright|infringement|piracy", text)) %>%
mutate(Human_Trafficking_and_Adultry = grepl("human|human trafficking|women|prostitution|prostitute|trafficking", text)) %>%
mutate(Illegal_Wildlife_Trade = grepl("wildlife|wild", text)) %>%
mutate(Unlawful_Circulation_of_Precious_Metals_and_Stones = grepl("metals|stones", text)) %>%
mutate(Illegal_Group_Crime_Activities = grepl("kidnapping|hooliganism|illegal agrupaciones|criminal|groups|criminal group|unlawful|harm|injure|injury|injuries|wounding|war crime|assault|violence|harassement|torture|violation", text)) %>%
mutate(Prison_Break = grepl("escape|escaping", text)) %>%
mutate(Traffic_Offences = grepl("traffic|vehicles|vehicle|transport", text)) %>%
mutate(Deprivation_of_Liberty = grepl("deprivation|freedom|liberty", text))
```
The classifier chunk above contains very long lines and may render poorly in this report. The full pattern list is available on request.
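One property of this kind of pattern matching is worth keeping in mind: `grepl` matches a pattern anywhere inside a word. The sketch below, on made-up charge texts, compares plain patterns with a hypothetical stricter variant that uses `\\b` word boundaries. The stricter variant is shown only as an option and is not what the report uses.

```r
# Made-up, already-cleansed charge texts
charges <- c(
  "causing bodily harms",
  "illegal possession of arms",
  "drug trafficking",
  "traffic offences"
)

# Plain patterns match anywhere inside a word
loose_arms    <- grepl("arms", charges)
loose_traffic <- grepl("traffic", charges)

# Hypothetical stricter variant: \\b anchors the pattern to whole words
strict_arms    <- grepl("\\barms\\b", charges, perl = TRUE)
strict_traffic <- grepl("\\btraffic\\b", charges, perl = TRUE)

print(data.frame(charges, loose_arms, strict_arms, loose_traffic, strict_traffic))
```

With the plain patterns, "harms" is flagged by `arms` and "trafficking" by `traffic`, so a record can be assigned to more categories than its charges strictly support. Whether that is acceptable depends on how the counts are to be read.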
## Master Data-Set
A glimpse of the master data-set elements used for the graphical representations and application development is shown below:
```{r}
# Create Age - Bins
crim_bind <- crim_bind %>%
mutate(age_bin = case_when(between(age, 15, 19) ~ "15 - 19",
between(age, 20, 30) ~ "20 - 30",
between(age, 31, 40) ~ "31 - 40",
between(age, 41, 50) ~ "41 - 50",
between(age, 51, 60) ~ "51 - 60",
between(age, 61, 70) ~ "61 - 70",
between(age, 71, 80) ~ "71 - 80",
between(age, 81, 89) ~ "81 - 89"))
crim_bind[, c(8,9,12,13,35)] <- lapply(crim_bind[, c(8,9,12,13,35)], as.factor)
glimpse(crim_bind)
```
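The same age bins can also be sketched with base R's `cut()`, shown here on made-up ages. The vector `ages` and the choice of right-closed breaks are assumptions for illustration; for whole-number ages they should give the same bins as the `case_when()` call above.

```r
# Made-up ages, including bin edges and one value outside every bin
ages <- c(18, 25, 30, 31, 89, 95)

labels <- c("15 - 19", "20 - 30", "31 - 40", "41 - 50",
            "51 - 60", "61 - 70", "71 - 80", "81 - 89")

# Right-closed intervals: (14,19], (19,30], (30,40], ..., (80,89]
age_bin <- cut(ages,
               breaks = c(14, 19, 30, 40, 50, 60, 70, 80, 89),
               labels = labels)

print(data.frame(ages, age_bin))
```

Ages that fall outside every bin, such as 95 here, come out as `NA`. The `case_when()` version behaves the same way because it has no fallback branch.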
## Data Visualisations
The graphical representations below exclude criminals aged `90` or above, on the assumption that such records are `outliers` or that the individuals may be `deceased`.
The subset is obtained with the R code below:
```{r}
crim_age <- filter(crim_bind, age < 90)
range(crim_age$age)
```
Therefore, the youngest and oldest criminals in this subset are aged 18 and 89 years respectively.
## Interpol Crime Frequency
```{r}
crim_col <- crim_bind %>%
subset(select = -c(criminal_id, age,text, Translation, wanted_by, sex, dob, height, language, nationality, place_of_birth, weight, eyes, hair, forename, present_family_name, charges, age_bin))
crime_count<- sort(colSums(crim_col, na.rm = T), decreasing = T)
crime_count_df <- data.frame(word = names(crime_count), Frequency = crime_count)
plotly::ggplotly(ggplot(crime_count_df, aes(reorder(word, Frequency), Frequency, fill = Frequency, text = word)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_gradient(low="yellow", high="red") +
labs(y = "Crime Count", x = "Nature of Crime"))
```
Number of criminals involved in more than one crime:
```{r}
crim_bind <- crim_bind %>%
mutate(count = rowSums(crim_col, na.rm = T))
crim_group <- crim_bind %>%
select(criminal_id, age, nationality, sex, wanted_by, count)
plotly::ggplotly(ggplot(crim_bind, aes(factor(count)))+
geom_bar(fill = "tomato2") +
geom_text(stat='count', aes(label=..count..), vjust= -0.5) +
labs(x = "Number of crime categories per criminal", y = "Criminal count"))
```
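The two counts used in this section can be illustrated on a small made-up table of logical crime flags. The data frame `flags` and its three columns are assumptions for illustration only.

```r
# Made-up logical crime flags for three hypothetical records
flags <- data.frame(
  Murder = c(TRUE, FALSE, TRUE),
  Fraud  = c(TRUE, TRUE, FALSE),
  Theft  = c(FALSE, FALSE, FALSE)
)

# Column sums: how many records carry each crime flag
per_crime <- sort(colSums(flags, na.rm = TRUE), decreasing = TRUE)

# Row sums: how many crime categories each record falls into
per_record <- rowSums(flags, na.rm = TRUE)

print(per_crime)
print(per_record)

# Distribution of records by number of crime categories
print(table(per_record))
```

Because `TRUE` counts as 1 and `FALSE` as 0, the column sums correspond to the crime-frequency bar chart and the row sums to the per-criminal `count` column.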
## Top 10 countries
```{r}
nation_split <- crim_bind %>%
select(nationality, sex) %>%
group_by(nationality,sex) %>%
summarise(count = n()) %>%
filter(count >=40)
plotly::ggplotly(ggplot(nation_split, aes(reorder(nationality, count), count, fill = sex)) +
geom_bar(stat = "identity", color = "white", position = "stack") +
labs(x= "Nationality", y = "Criminal Count", title = "Country Vs. Criminals") +
coord_flip())
```
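The grouping and filtering used for this chart can be sketched in base R on made-up records. The data frame `people` and the threshold of 2 are assumptions for illustration; the chunk above uses a threshold of 40.

```r
# Made-up records
people <- data.frame(
  nationality = c("A", "A", "A", "B", "B", "C"),
  sex         = c("M", "M", "F", "M", "M", "F")
)

# Count every nationality/sex pair
pair_counts <- as.data.frame(
  table(nationality = people$nationality, sex = people$sex),
  stringsAsFactors = FALSE
)

# Keep only pairs that reach the hypothetical threshold of 2
kept <- pair_counts[pair_counts$Freq >= 2, ]
print(kept)
```

The threshold applies to each nationality/sex pair rather than to a country's total, so the single female record for country "A" is dropped even though the country itself is kept. The number of countries shown therefore depends on the threshold and is not fixed at ten.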
## Wanted_by Vs. Criminal Count
```{r}
wanted_count <- crim_bind %>%
select(wanted_by, sex) %>%
group_by(wanted_by,sex) %>%
summarise(count_1 = n()) %>%
filter(count_1 >= 40)
plotly::ggplotly(ggplot(wanted_count, aes(reorder(wanted_by, count_1), count_1, fill = sex)) +
geom_bar(stat = "identity", color = "white", position = "stack") +
labs(x= "Wanted by Country", y = "Criminal Count", title = "Wanted-by Vs. Criminals") +
coord_flip())
```
```{r}
crim_gather <- crim_bind %>%
gather(Nature, Status, -c(criminal_id, age,text, Translation, wanted_by, sex, dob, height, language, nationality, place_of_birth, weight, eyes, hair, forename, present_family_name, charges, age_bin, count))
crim_gather <- crim_gather[crim_gather$Status == TRUE,]
crim_gather[, 21] <- as.factor(crim_gather[, 21])
glimpse(crim_gather)
```
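The reshaping done by `gather()` can be pictured with a base R sketch on a made-up two-record table. The data frame `flags` and the vector `crime_cols` are illustrative assumptions.

```r
# Made-up wide table: one logical column per crime category
flags <- data.frame(
  criminal_id = c(1, 2),
  Murder = c(TRUE, FALSE),
  Fraud  = c(TRUE, TRUE)
)
crime_cols <- c("Murder", "Fraud")

# Long format: one row per (record, crime category) pair
long <- data.frame(
  criminal_id = rep(flags$criminal_id, times = length(crime_cols)),
  Nature      = rep(crime_cols, each = nrow(flags)),
  Status      = unlist(flags[crime_cols], use.names = FALSE)
)

# Keep only the pairs where the crime flag is set
long <- long[long$Status, ]
long$Nature <- as.factor(long$Nature)
print(long)
```

A record flagged for several crimes appears once per crime in the long table, which is why counts by `Nature` can add up to more than the number of records.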
```{r}
plotly::ggplotly(ggplot(crim_bind, aes(sex)) +
geom_bar(fill = "tomato2") +
labs(x= "Gender", y = "Count", title = "Interpol Gender Metrics"))
```
```{r}
plotly::ggplotly(ggplot(crim_bind, aes(age_bin)) +
geom_bar(fill = "tomato2") +
labs(x= "Age", y = "Count", title = "Interpol Age Metrics"))
```
```{r}
crim_gather <- crim_gather[crim_gather$Status == TRUE,]
crim_gather[, 21] <- as.factor(crim_gather[, 21])
gen_nat <- crim_gather %>%
select(Nature, sex) %>%
group_by(Nature, sex) %>%
summarise(cnt = n())
gen_nat
plotly::ggplotly(ggplot(gen_nat, aes(Nature, cnt, fill = sex)) +
geom_bar(stat = "identity", color = "white", position = "dodge") +
labs(x= "Crime", y = "Count", title = "Interpol Crime Vs. Gender Metrics") +
coord_flip())
```
```{r}
plotly::ggplotly(ggplot(crim_gather, aes(Nature, ..count..)) +
geom_bar(aes(fill = age_bin), position = "dodge") +
labs(x= "Crime", y = "Count", title = "Interpol Crime Vs. Age Metrics") +
coord_flip())
```
## Note
This interim report covers the acquisition of the Interpol criminal data, its import into the RStudio IDE, text pre-processing, data transformation, creation of the document-term matrix, the term-frequency inventory, identification of correlated terms and construction of the `criminal nature classifier`.
In the process, a clear view of the Interpol criminal records and of the nature of the crimes committed is obtained.
## Upcoming Activities
* Graphical Representations.
* Application build up using R Shiny.
* Miscellaneous Optimisation Operations.