Cleaning data part II and Implementing Lasso
- Added Thailand, removed USA. Final list: "Iraq","Pakistan","Afghanistan","India","Philippines","Thailand".
- Removed rows with more than n/2 NAs.
- Removed columns with text.
- Removed additional columns using domain knowledge: "eventid","provstate","city","latitude","longitude","specificity","location","summary","targsubtype1","motive","weapdetail","propcomment","scite1","scite2","dbsource","target1","corp1","nkillter","nkillus","nwoundus","nwoundte".
- Removed INT columns containing NA (-9) values.
- Response variable: nvictim = nkill + nwound
- Vectorized column "gname" and replaced it with "gname.index".
- Workaround for columns with NAs (see "NAVector" for the status of NAs):
  a. column "natlty": assumed to be the country of the incident
  b. column "guncertain1": deleted corresponding rows
  c. column "ishostkid1": deleted corresponding rows
  d. column "nvictim": deleted corresponding rows
  e. columns "nperp" and "nperpcap": not important according to BIC, AIC and other scores; ignoring for now
  f. column "weapsubtype1": trying linear regression to impute the missing 4% of values (WIP)
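
For reference, a few of the steps above can be sketched in plain Python. This is an illustration, not the code in this commit. It assumes rows are dicts with None standing for NA, that n is the number of columns, that "vectorized" means mapping each distinct gname to an integer code (1-based here), and that rows where nvictim cannot be computed are dropped. The function names and the sample data (including the "vicinity" and "extra" columns) are made up for the example.

```python
NA_CODE = -9  # the "NA (-9)" marker from the commit message


def drop_sparse_rows(rows):
    """Keep rows whose NA count is at most n/2 (n = number of columns, assumed)."""
    kept = []
    for row in rows:
        na_count = sum(1 for value in row.values() if value is None)
        if na_count <= len(row) / 2:
            kept.append(row)
    return kept


def drop_int_columns_with_na_code(rows, na_code=NA_CODE):
    """Drop every column in which the integer NA code appears at least once."""
    bad = {
        key
        for row in rows
        for key, value in row.items()
        if isinstance(value, int) and not isinstance(value, bool) and value == na_code
    }
    return [{k: v for k, v in row.items() if k not in bad} for row in rows]


def add_nvictim(rows):
    """Add nvictim = nkill + nwound; drop rows where either term is missing."""
    out = []
    for row in rows:
        if row.get("nkill") is None or row.get("nwound") is None:
            continue
        new_row = dict(row)
        new_row["nvictim"] = new_row["nkill"] + new_row["nwound"]
        out.append(new_row)
    return out


def index_gname(rows):
    """Replace "gname" with "gname.index", one integer code per distinct name."""
    codes = {}
    out = []
    for row in rows:
        new_row = dict(row)
        name = new_row.pop("gname")
        new_row["gname.index"] = codes.setdefault(name, len(codes) + 1)
        out.append(new_row)
    return out


def clean(rows):
    """Apply the sketched steps in the order they are listed above."""
    rows = drop_sparse_rows(rows)
    rows = drop_int_columns_with_na_code(rows)
    rows = add_nvictim(rows)
    return index_gname(rows)


# Made-up sample data, for illustration only.
sample = [
    {"gname": "A", "nkill": 1, "nwound": 2, "vicinity": 0, "extra": None},
    {"gname": "B", "nkill": 0, "nwound": 5, "vicinity": -9, "extra": None},
    {"gname": "A", "nkill": None, "nwound": None, "vicinity": None, "extra": None},
    {"gname": "A", "nkill": None, "nwound": 3, "vicinity": 1, "extra": None},
]
cleaned = clean(sample)
```

The order of the steps matters in this sketch: the sparse-row filter runs before any columns are dropped, so n is still the original column count at that point. Whether the actual code uses the same order is not stated in the commit message.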