This repo is a pipeline that takes HTML/JSON formatted news articles and performs several NLP tasks on them.
An input file passes through all or some of the components of the pipeline. The output of each component is written directly to a copy of the given file, as follows:
- Document classifier -> 'doc_label'=0/1
- Violent classifier -> 'is_violent'=0/1
- Sentence classifier tokenizes the sentences and gets their labels from four different classifiers ->
  - 'sentences'=[]
  - 'sentence_labels'=[]
  - 'sentence_tokens'=[[]]
  - 'Trigger_Semantic_label'=[]
  - 'participant_semantic'=[]
  - 'organizer_semantic'=[]
- Token classifiers -> 'token_labels'=[[]]
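To make the field list concrete, here is a hypothetical record after it has passed through every component. The field names come from the list above; the example values (and the label strings) are purely illustrative, not the classifiers' real label sets.

```python
import json

# Hypothetical document record after all pipeline components have run.
# Field names are from the README; values are illustrative only.
doc = {
    "id": "doc-001",
    "text": "Workers marched downtown. Police dispersed the crowd.",
    "doc_label": 1,                     # added by the document classifier
    "is_violent": 0,                    # added by the violent classifier
    "sentences": ["Workers marched downtown.",
                  "Police dispersed the crowd."],
    "sentence_labels": [1, 1],          # added by the sentence classifier
    "sentence_tokens": [["Workers", "marched", "downtown", "."],
                        ["Police", "dispersed", "the", "crowd", "."]],
    "Trigger_Semantic_label": ["demonstration", "other"],
    "participant_semantic": ["labor", "security"],
    "organizer_semantic": ["union", ""],
    "token_labels": [["B-participant", "B-trigger", "O", "O"],
                     ["B-participant", "B-trigger", "O", "O", "O"]],
}
# Each component only adds keys, so the record stays valid JSON throughout.
print(json.dumps(doc)[:40])
```

Note the per-sentence fields are parallel lists: one entry per sentence, and one token-label list per token list.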
The report script (bin/out_to_csv.py) creates a detailed sentence-based report and merges coreferenced sentences using the [NS] token.
An example output can be found at the end of the page.
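The merging step can be sketched as follows. This is a minimal illustration of joining coreferenced sentences with the `[NS]` separator; `merge_coref` and the cluster format are hypothetical, not the actual interface of bin/out_to_csv.py.

```python
# Sketch: join each cluster of coreferenced sentence indices with the
# "[NS]" token, as the report script does when building sentence rows.
# The helper name and cluster format are assumptions for illustration.
def merge_coref(sentences, clusters):
    """Return one merged string per cluster, joined by ' [NS] '."""
    return [" [NS] ".join(sentences[i] for i in cluster)
            for cluster in clusters]

sentences = ["Protesters gathered.", "They blocked the road.", "Rain fell."]
clusters = [[0, 1], [2]]   # sentences 0 and 1 corefer; 2 stands alone
print(merge_coref(sentences, clusters))
# -> ['Protesters gathered. [NS] They blocked the road.', 'Rain fell.']
```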
- When the cascaded parameter is true, the pipeline runs as illustrated in the flowchart below. No filtering is applied, so files pass through every component at every level.
Flowchart:
- When the cascaded parameter is false, the pipeline runs as illustrated in the flowchart below. Filtering is applied, and only positive sentences of positive documents are passed on.
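The two modes can be sketched in Python with stub components. The real pipeline is wired in Nextflow; these function names and the keyword-matching "classifiers" are stand-ins chosen only to show where the filtering happens.

```python
# Stub components (hypothetical): real ones are BERT/SVM models.
def classify_document(doc):
    doc["doc_label"] = int("protest" in doc["text"].lower())
    return doc

def classify_sentences(doc):
    doc["sentence_labels"] = [int("protest" in s.lower())
                              for s in doc["sentences"]]
    return doc

def classify_tokens(doc):
    doc["token_labels"] = [["O"] * len(s.split()) for s in doc["sentences"]]
    return doc

def run_pipeline(doc, cascaded):
    doc = classify_document(doc)
    if not cascaded and doc["doc_label"] == 0:
        return doc                  # filtering on: drop negative documents
    doc = classify_sentences(doc)
    if not cascaded:
        # filtering on: only positive sentences reach the token classifier
        kept = [(s, l) for s, l in zip(doc["sentences"],
                                       doc["sentence_labels"]) if l]
        doc["sentences"] = [s for s, _ in kept]
        doc["sentence_labels"] = [l for _, l in kept]
    return classify_tokens(doc)

out = run_pipeline({"text": "A protest erupted.",
                    "sentences": ["A protest erupted.", "It rained."]},
                   cascaded=False)
print(out["sentences"])  # -> ['A protest erupted.']
```

With cascaded=true, every component runs on every file; with cascaded=false, negative documents stop early and negative sentences are dropped before token classification.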
- Java 7 or 8
- Python3
install.sh
will install/download the following models into the $HOME/.pytorch_pretrained_bert folder:
- Nextflow
- Python requirements (requirement.txt)
- OsmanMutlu/pytorch-pretrained-BERT
- Document models: protest classifier (BERT) and violent classifier (SVM)
- Sentence models: protest classifier (BERT), participant semantic categorization (BERT), trigger semantic categorization (BERT), organizer semantic categorization (BERT), and coreference model (ALBERT)
- Token model: token classifier
- Change "/PATH/TO/REPO" and ENV_NAME variables accordingly!
echo 'PATH="$PATH:/PATH/TO/REPO/bin"'
conda create -n ENV_NAME python=3.6
conda activate ENV_NAME
cd /PATH/TO/REPO
bash install.sh
The following parameters can be modified in the nextflow.conf file.
- cascaded
true/false
As explained in the 'Pipeline Configuration' section.
- classifier_first
true/false
If false, the pipeline starts with the extractor.
If true, the first component of the pipeline is the classifier, so the input files must be JSON formatted and contain the following fields:
{ "id":str, "length":int, "text":str, "time":str, "title":str }
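A minimal input record matching this schema might look like the following sketch; the values are illustrative only.

```python
import json

# Illustrative input record for classifier_first=true.
# Field names come from the schema above; values are made up.
record = {
    "id": "article-42",
    "length": 52,
    "text": "Hundreds of workers marched through the city centre.",
    "time": "2019-05-01",
    "title": "Workers march in city centre",
}
line = json.dumps(record)
# Round-trips cleanly and contains exactly the required keys.
print(sorted(json.loads(line)))
```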
- gpu_classifier
GPU id or group of GPU ids assigned to the document protest classifier.
A single number or a comma-separated series of numbers, e.g. 1,2,3
- gpu_number_tsc
GPU id or group of GPU ids assigned to the trigger semantic classifier.
A single number or a comma-separated series of numbers, e.g. 1,2,3
- gpu_number_psc
GPU id or group of GPU ids assigned to the participant semantic classifier.
A single number or a comma-separated series of numbers, e.g. 1,2,3
- gpu_number_osc
GPU id or group of GPU ids assigned to the organizer semantic classifier.
A single number or a comma-separated series of numbers, e.g. 1,2,3
- gpu_number_protest
GPU id or group of GPU ids assigned to the sentence protest classifier.
A single number or a comma-separated series of numbers, e.g. 1,2,3
- gpu_token
GPU id or group of GPU ids assigned to the token classifier.
A single number or a comma-separated series of numbers, e.g. 1,2,3
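One plausible way such a comma-separated GPU list is consumed is by exporting it to the worker process, e.g. via `CUDA_VISIBLE_DEVICES`. This is a sketch under that assumption; the helper name is hypothetical and the actual wiring is done inside the pipeline scripts.

```python
import os

# Sketch: hand a nextflow.conf GPU parameter such as "1,2,3" to a
# worker process. set_visible_gpus is a hypothetical helper.
def set_visible_gpus(gpu_param):
    """Normalize '1,2,3' (or a single id) and export it for CUDA."""
    ids = [g.strip() for g in str(gpu_param).split(",") if g.strip()]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(ids)
    return ids

print(set_visible_gpus("1, 2,3"))  # -> ['1', '2', '3']
```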
- input
Input folder path.
"<PATH>", not ending with a slash.
- output
Output folder path.
"<PATH>/jsons/"
- resume
true/false
Nextflow's resume parameter: https://www.nextflow.io/docs/latest/getstarted.html#modify-and-resume
- files_start_with
File name pattern.
e.g. "*", "*json", "\http*"
- doc_batchsize
Document protest classifier batch size.
- token_batchsize
Token classifier batch size.
- prefix
Path to the pipeline scripts.
e.g. "<Path>/emw_pipeline_nf"
- extractor_script_path
Path to the HTML-to-text script.
e.g. $prefix"/bin/extract/peoples_chaina.py"
- out_output_type
Type of the output record file.
csv/json
- out_name_output_file
Name of the output record file.
- out_date_key
The key of the date inside the dataset.
e.g. "time", "date"
- filter_unprotested_sentence
true/false
Filters sentences in the record file by their label.
If true, only positive sentences are included.
If false, all sentences are included irrespective of their label.
- filter_unprotested_documents
true/false
Filters documents in the record file by their label.
If true, only sentences that are in a positive document are included.
If false, all sentences are included irrespective of their document label.
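The interaction of the two filters can be sketched as a per-row predicate. This assumes each sentence row carries its own label and its document's label; the names `doc_label` and `sentence_label` are taken from the output fields above, the helper is hypothetical.

```python
# Sketch: decide whether a sentence row enters the record file.
# Assumes each row carries its sentence label and its document label.
def keep_row(row, filter_unprotested_sentence, filter_unprotested_documents):
    if filter_unprotested_documents and row["doc_label"] != 1:
        return False
    if filter_unprotested_sentence and row["sentence_label"] != 1:
        return False
    return True

rows = [
    {"doc_label": 1, "sentence_label": 1},   # kept by both filters
    {"doc_label": 1, "sentence_label": 0},   # dropped by sentence filter
    {"doc_label": 0, "sentence_label": 1},   # dropped by document filter
]
kept = [r for r in rows if keep_row(r, True, True)]
print(len(kept))  # -> 1
```

With both flags false, every sentence of every document reaches the record file.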
Set your parameters in nextflow.conf first.
The start.sh script starts the Flask APIs of the classifiers, formats the parameters in nextflow.conf into a JSON file, and runs Nextflow. After Nextflow finishes, the reporting step starts.
bash start.sh
- event_id
The id of the event.
- url
Using your browser, you can go to this URL to check the text of the document this event was extracted from.
- title
Title of the document this event was extracted from.
- event_sentence_numbers
All of the information of this event was extracted from these sentences. Note that our tool's sentence separation/numbering might not be perfect.
- district_name
The district of the event. Categorizable. If empty, check state_name.
- state_name
State of the event. Categorizable. If empty, specific_place_name must not be empty.
- specific_place_name
Specific place name returned by the geopy tool. Might be empty since geopy might not have been used.
- year
Publishing year.
- month
Publishing month. Might not be available for some sources.
- day
Publishing day. Might not be available for some sources.
- urbanrural
Indicates whether the event takes place in an "urban" or "rural" setting. Decided on document level.
- violent
Indicates whether the event is "violent" or "non-violent". Decided on document level.
- triggers
Extracted triggers of the event.
- eventcategory
Semantic category of the event.
- participants
Extracted participants of the event.
- participant0category - participant3category
Semantic category of the participants of the event. Goes from most common to least common, "participant0category" being the most common one.
- organizers
Extracted organizers of the event.
- organizer0category - organizer3category
Semantic category of the organizers of the event. Goes from most common to least common, "organizer0category" being the most common one.
- targets
Extracted targets of the event.
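When out_output_type is csv, the record file can be consumed with the standard library. The sketch below fabricates a two-column sample row purely for illustration; only the column names come from the list above.

```python
import csv
import io

# Fabricated sample with a subset of the report columns listed above.
sample = ("event_id,url,title,triggers\n"
          "1,http://example.com,March in city centre,marched\n")
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["triggers"])  # -> marched
```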