Inter_html.Rmd

---
title: "Interpol Criminal Data Interpretation"
author: "Sudeep Chinna Kandukuri"
date: "19 July 2018"
output:
  html_document: default
  word_document: default
  pdf_document: default
---


Factly makes an earnest attempt in establishment of fervent insights pertaining to `Interpol Most Wanted Criminal Records.` 


Interpol Criminal Records provide a comprehensive view of treacherous, perfidious and treasonous crimes committed by inhumane individuals. In continuance, these records furnishes information about generalised description for the identification of criminals like `hair,` `eyes,` `weight,` `height,` and `language;` `gender,` `nationality`, `wanted by country,` and `criminal charges imposed.` 


The relevant R packages for performing Interpol Criminal Data Analysis are loaded.


```{r Package Chunk, echo=TRUE, message=FALSE, warning=FALSE, results='hide'}
#Laod the packages
library(readxl)
library(tm)
library(wordcloud)
library(stringr)
library(ggplot2)
library(SnowballC)
library(dplyr)
library(tidyverse)
library(qdap)
library(magrittr)
library(rJava)
library(textclean)
library(rmarkdown)
library(topicmodels)
library(tidytext)
library(xlsx)
library(plotly)
```


The `Interpol Criminal Data` is imported to the RStudio IDE. Only imperative elements / variables are subjected to consideration for the data interpretation. 

In the viewpoint of data interpretation the generalised description could be a misguided element to assume in the occurance of the crime. Therefore, the relevant action for the removal of impertinent variables is done. Subsequently, considerable figure of criminal records are in `spanish` and `french` which are subjected to translation in `GOOGLE TRANSLATE` and mutated to the master data-set `criminal.` 


```{r echo = FALSE}
#Load the data-set
criminal <- read_excel("criminal_data_clean_2.xlsx")

#Subset / Removal of Labels
criminal_1 <- subset(criminal, select = -c(weight, eyes, hair,
                    forename, present_family_name, charges))
```


## Interpol Criminal Data Variables.


a glimpse of interpol criminal data all variables `criminal` and imperative variables `criminal_1.` 
```{r}
glimpse(criminal)

glimpse(criminal_1)
```


## Text Pre-Processing

The charge-sheet content pertaining to each criminal enrolled as most wanted is subjected to `text cleansing`. As a part of it, a black is inserted after every comma, period, semicolon and brackets.


```{r echo = TRUE}
#TEXT PARSING
#Preventing concatination of two words while removing punctuaion marks.

#Inserted space after every comma

crim <- add_comma_space(criminal_1$Translation)

#Inserted space after every period

crim_1 <- gsub("\\.", ". ", crim)

#Inserted space after every semicolon 

crim_2 <- gsub("\\;", "; ", crim_1)

#Inserted space after every bracket

crim_3 <- gsub("\\)", ") ", crim_2)

```

## Data Transformation

In continuance to aforementioned above, the preprocessed data is transfigured to a `text corpus`. Further, the data is subjected to string normalisation as mentioned below:

a. the charge-sheet contents are converted to `lower-case aphlabets.`

b. Relevant actions are performed for the removal of `numbers,` `punctuations` and `stop-words in english.`

c. the white space between the charge-sheet terms is stripped.


```{r}
#Create a Vector Corpus
textcorpus <- VCorpus(VectorSource(crim_3))

#Data Transformation
textcorpus <- tm_map(textcorpus, content_transformer(tolower))
textcorpus <- tm_map(textcorpus, removeNumbers)
textcorpus <- tm_map(textcorpus, removePunctuation) 
textcorpus <- tm_map(textcorpus, removeWords, stopwords("english"))
textcorpus <- tm_map(textcorpus, stripWhitespace)

```


## Document-Term Matrix

In favor to progess, the string normalised text corpus is utilised to a `document-term matrix ~ dtm`. Later, the dtm is transfigured into matix for operational convenience.

```{r}
#Create Document - Term Matrix
in_dtm <- DocumentTermMatrix(textcorpus)
in_dtm

#Convert VCorpus to Data Frame

in_dtm_df <- as.matrix(in_dtm)

```


## Term Frequency Inventory List

An inventory activity is intiated to identify top 50 recurring terms in all interpol criminal charge-sheets.

```{r, echo=FALSE}

#Create Term Frequency Data Frame

tf <- (sort(colSums(in_dtm_df), decreasing = TRUE))
tf_df <- data.frame(word = names(tf), Frequency = tf)
tf_df[1:20,]
```

## Term - Frequency Barplot

The graphical representation of recurring terms in the interpol criminal charge-sheets are mentioned below:

```{r, echo=FALSE}
plotly::ggplotly(ggplot(tf_df[1:20,], aes(reorder(word, -Frequency), Frequency)) + 
  geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(title = "Interpol Criminal Charges - Term Frequency", x = "Term", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)))
  
```


## Correlation

The verbiage pertaining to the crime nature is identified and utilised for the observation of correlated terms. The correlation coefficient varies from `high` to `low.`


```{r}
#Mapping Associated Words

sexual <- findAssocs(in_dtm, "sexual", 0.1)
sexual
murder <- findAssocs(in_dtm, "murder", 0.1)
murder
firearm <- findAssocs(in_dtm, "firearm", 0.1)
firearm
terrorist <- findAssocs(in_dtm, "terrorist", 0.1)
terrorist
conspiracy <- findAssocs(in_dtm, "conspiracy", 0.1)
conspiracy
money <- findAssocs(in_dtm, "money", 0.1)
money
fraud <- findAssocs(in_dtm, "fraud", 0.1)
fraud

```


## Post Text Process Analysis

The class of `textcorpus` is `PlainTextDocument.` The `PlainTextDocument` couldn't be utilised for data interpretation and graphical representations. Therefore, `textcorpus` i.e., cleansed charge-sheets are transfigured into `data.frame` and appended to the master data-set.


```{r}
#Conversion of PlainTextDocument to Character Data.Frame
text <- sapply(textcorpus, as.character)
text_df <- as.data.frame(text, stringsAsFactors = F)

#Appending Cleansed Charges to Original Data-set
crim_bind <- cbind(criminal, text_df)

```


## Criminal Nature Classifier Function

In continuance to correlation, the correlated term patterns are identiied pertaining to the criminal nature verbiage which are mention below:

`sexual abuse,` `murder,` `illegal_firearms,` `terrorism and disruptive activity,` `narcotic drugs and psychotropic substances,` `forgery and fraud,` `tax evasion and money laundering,` `robbery and dacoity,` `conspiracy and its consequential crimes,` `copyright infringement and piracy,` `human trafficking and adultry,` `illegal wild-life trade,` `unlawful circulation of precious stones and metals,` `illegal group crime activies,` `prison break,` `traffic offences,` and `deprivation of liberty.`

Criminal Nature Classifier Funtion detects the correlated terms patterns. As a result, new independent columns of criminal nature verbiage are appended to master data-set. the newly appended criminal nature columns only signifies whether the criminal has commited that certain crime based on the criminal charges imposed on certain individual criminal record.


```{r}
#Adding a New Column - Crime_Nature to Original Data-set
crim_bind <- crim_bind %>%
  mutate(Sexual_Abuse = grepl("sex|sexual|sexual abuse|molestation|lust|incest|pornography|minor|rape|abduction|fornication|intercourse|child", text)) %>%
  mutate(Murder =  grepl("murder|homicide|genocide|feminicide|femicide|manslaughter|assassination|kill|killing|death", text)) %>%
  mutate(Illegal_Firearms = grepl("firearm|armed|ammunition|weapon|arms", text)) %>%
  mutate(Terrorism_and_Disruptive_Activity = grepl("terror|terrorist|terrorism|terrorists", text)) %>%
  mutate(Narcotic_Drugs_and_Psychotropic_Substances = grepl("narcotic|drugs|drug|psychotropics|drug trafficking|methylenedioxymethamphetamine|grams|cocaine|smuggling|psychoactive|marijuana|ephedrine|heroin|kilograms|doses|trafficking drugs", text)) %>%
  mutate(Forgery_and_Fraud =  grepl("fake|falsification|falsifying|false|tampering|forge|forged|certification|counterfeit|forgery|fraud|breach trust", text)) %>%
  mutate(Tax_Evasion_and_Money_Laundering = grepl("tax|evasion|tax evasion|money|loan|money laundering|laundering|embezzlement|embezzling|financial", text)) %>%
  mutate(Robbery_and_Dacoity = grepl("robbery|burglary|stolen|theft|extortion|swindling|dacoity|misappropriation|swindle|stealing", text)) %>% 
  mutate(Conspiracy_and_Its_Consequential_Crimes = grepl("conspiracy|bribe|bribery|office|position|corruption|corrupt|cheat|cheating|breach trust|trust", text)) %>% 
  mutate(Copyright_Infringement_and_Piracy = grepl("copyright|infringement|piracy", text)) %>%
  mutate(Human_Trafficking_and_Adultry = grepl("human|human trafficking|women|prostitution|prostitute|trafficking", text)) %>%
  mutate(Illegal_Wildlife_Trade = grepl("wildlife|wild", text)) %>%
  mutate(Unlawful_Circulation_of_Precious_Metals_and_Stones = grepl("metals|stones", text)) %>%
  mutate(Illegal_Group_Crime_Activities = grepl("kidnapping|hooliganism|illegal agrupaciones|criminal|groups|criminal group|unlawful|harm|injure|injury|injuries|wounding|war crime|assault|violence|harassement|torture|violation", text)) %>%
  mutate(Prison_Break = grepl("escape|escaping", text)) %>%
  mutate(Traffic_Offences = grepl("traffic|vehicles|vehicle|transport", text)) %>%
  mutate(Deprivation_of_Liberty = grepl("deprivation|freedom|liberty", text))

```


The above classifier function has a distorted view because of inline length. If you insist to see, kindly, ask.

## Master Data-Set 

A glimpse of master data-set elements which are the subject for graphical representations and application development is mentioned below:


```{r}

# Create Age - Bins

crim_bind <- crim_bind %>%
  mutate(age_bin = case_when(between(age, 15, 19) ~ "15 - 19",
    between(age, 20, 30) ~ "20 - 30",
    between(age, 31, 40) ~ "31 - 40",
    between(age, 41, 50) ~ "41 - 50",
    between(age, 51, 60) ~ "51 - 60",
    between(age, 61, 70) ~ "61 - 70",
    between(age, 71, 80) ~ "71 - 80",
    between(age, 81, 89) ~ "81 - 89"))

crim_bind[, c(8,9,12,13,35)] <- lapply(crim_bind[, c(8,9,12,13,35)], as.factor)


glimpse(crim_bind)
```


## Data Visualisations


Below graphical representations are drawn by the removal of criminals whose age exceeds `90.` In consideration of assumption of criminal aged above `90` are `outliers` or may be `deceased.`

A subset of data is obtained by the below R code:
```{r}
crim_age <- filter(crim_bind, crim_bind$age < 90)

range(crim_age$age)
```

Therefore, the youngest and oldest criminals are aged 18 and 89 years old.

## Interpol Crime Frequency

```{r}
crim_col <- crim_bind %>%
  subset(select = -c(criminal_id, age,text, Translation, wanted_by, sex, dob, height, language, nationality, place_of_birth, weight, eyes, hair, forename, present_family_name, charges, age_bin))
  
crime_count<- sort(colSums(crim_col, na.rm = T), decreasing = T)
crime_count_df <- data.frame(word = names(crime_count), Frequency = crime_count)

plotly::ggplotly(ggplot(crime_count_df, aes(reorder(word, Frequency), Frequency, fill = Frequency, text = word)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_gradient(low="yellow", high="red") +
  labs(y = "Crime Count", x = "Nature of Crime"))
  
  
```

No. of criminals involved in more than one crime

```{r}

crim_bind <- crim_bind %>%
mutate(count = rowSums(crim_col, na.rm = T))

crim_group <- crim_bind %>%
select(criminal_id, age, nationality, sex, wanted_by, count)

 plotly::ggplotly(ggplot(crim_bind, aes(factor(count)))+
  geom_bar(fill = "tomato2") +
  geom_text(stat='count', aes(label=..count..), vjust= -0.5) +
  labs(x = "above figure illustrates the no. of crimes committed by all criminals", y= "crime count"))
 

```

## Top 10 countries 

```{r}
nation_split <- crim_bind %>%
  select(nationality, sex) %>%
  group_by(nationality,sex) %>%
  summarise(count = n()) %>%
  filter(count >=40)

plotly::ggplotly(ggplot(nation_split, aes(reorder(nationality, count), count, fill = sex)) +
  geom_bar(stat = "identity", color = "white", position = "stack") +
  labs(x= "Nationality", y = "Criminal Count", title = "Country Vs. Criminals") +
  coord_flip())
```


## Wanted_by Vs. Criminal Count

```{r}
wanted_count <- crim_bind %>%
  select(wanted_by, sex) %>%
  group_by(wanted_by,sex) %>%
  summarise(count_1 = n()) %>%
  filter(count_1 >= 40) 

plotly::ggplotly(ggplot(wanted_count, aes(reorder(wanted_by, count_1), count_1, fill = sex)) +
  geom_bar(stat = "identity", color = "white", position = "stack") +
  labs(x= "Wanted by Country", y = "Criminal Count", title = "Wanted-by Vs. Criminals") +
  coord_flip())

  
```


```{r}
crim_gather <- crim_bind %>%
  gather(Nature, Status, -c(criminal_id, age,text, Translation, wanted_by, sex, dob, height, language, nationality, place_of_birth, weight, eyes, hair, forename, present_family_name, charges, age_bin, count))

crim_gather <- crim_gather[crim_gather$Status == TRUE,]
crim_gather[, 21] <- as.factor(crim_gather[, 21])
glimpse(crim_gather)

```

```{r}
plotly::ggplotly(ggplot(crim_bind, aes(sex)) +
  geom_bar(fill = "tomato2") +
  labs(x= "Gender", y = "Count", title = "Interpol Gender Metrics"))


```

```{r}
plotly::ggplotly(ggplot(crim_bind, aes(age_bin)) +
  geom_bar(fill = "tomato2") +
  labs(x= "Age", y = "Count", title = "Interpol Age Metrics"))


```


```{r}
crim_gather <- crim_gather[crim_gather$Status == TRUE,]
crim_gather[, 21] <- as.factor(crim_gather[, 21])

gen_nat <- crim_gather %>%
  select(Nature, sex) %>%
  group_by(Nature, sex) %>%
  summarise(cnt = n())
gen_nat
 
plotly::ggplotly(ggplot(gen_nat, aes(Nature, cnt, fill = sex)) +
  geom_bar(stat = "identity", color = "white", position = "dodge") +
  labs(x= "Gender", y = "Count", title = "Interpol Gender Metrics") +
  coord_flip())

```


```{r}
plotly::ggplotly(ggplot(crim_gather, aes(Nature, ..count..)) +
  geom_bar(aes(fill = age_bin), position = "dodge") +
  labs(x= "Crime", y = "Count", title = "Interpol Crime Vs. Age Metrics") +
  coord_flip())

```


## Note

This interim report encapsulates the Interpol criminal data acquisition, importation to the RStudio IDE, text pre-processesing activity, data transformation, creation of document-term matrix, term-frequency inventory activity, identification of correlated terms and construction of `criminal nature classifier.

In due process, a clear viewpoint of the Interpol criminal records and nature of crimes committed are obtained.

## Upcoming Activies

* Graphical Representations.

* Application build up using R Shiny.

* Miscellaneous Optimisation Operations.