A tool for annotating text documents with their knowledge domains

eke-initiative/kda

Knowledge Domain Annotation

Requirements

  • Maven
  • Java 11+
  • Python 3
  • conda

To run the experiment:

  1. Download the dataset from this link
  2. Unpack the dataset:
tar -zxvf KDA_dataset.tar.gz
  3. Unpack the virtual text documents:
cd KDA_dataset/VirtualTextDocuments
tar -zxvf TPB.tar.gz
tar -zxvf LOV.tar.gz
tar -zxvf Laundromat.tar.gz

Alternatively, you can generate the virtual text documents yourself with vdg (this requires Apache Maven to be installed on your machine). To do so:

a. Unpack the input RDF files

cd KDA_dataset/InputRDF
tar -zxvf TPB.tar.gz
tar -zxvf lov.nq.tar.gz
tar -zxvf LOD_Laundromat.tar.gz
tar -zxvf labelMap.tar.gz

b. Download and run vdg

git clone https://github.com/empirical-knowledge-engineering/vdg.git
cd vdg/
mvn clean install
mkdir -p ../VirtualTextDocuments/Laundromat
mvn exec:java -Dexec.cleanupDaemonThreads=false -Dexec.mainClass="it.cnr.istc.stlab.lgu.Main" -Dexec.args="Laundromat ../Laundromat ../labelMap ../VirtualTextDocuments/Laundromat" -DjvmArgs="-Xmx32g"
mkdir -p ../VirtualTextDocuments/TPB
mvn exec:java -Dexec.cleanupDaemonThreads=false -Dexec.mainClass="it.cnr.istc.stlab.lgu.Main" -Dexec.args="TPB ../TPB ../dataset_ids ../VirtualTextDocuments/TPB" -DjvmArgs="-Xmx32g"
mkdir -p ../VirtualTextDocuments/LOV
mvn exec:java -Dexec.cleanupDaemonThreads=false -Dexec.mainClass="it.cnr.istc.stlab.lgu.Main" -Dexec.args="LOV ../lov.nq.tar.gz ../VirtualTextDocuments/LOV" -DjvmArgs="-Xmx32g"
  4. Preprocess the virtual documents:
git clone https://github.com/empirical-knowledge-engineering/kda.git
cd kda/
conda env create -f environment.yml 
conda activate kda
python preprocess_virtual_documents.py --input_folder <KDA_Dataset_path>

By default, documents are vectorised in the binary setting. To use TF-IDF instead, pass --tfidf as an argument to preprocess_virtual_documents.py.

Alternatively, you can use the preprocessed dataset available in KDA_dataset/experiment.
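
To illustrate the difference between the binary and TF-IDF settings mentioned above, here is a minimal standard-library sketch. It is not the code of preprocess_virtual_documents.py: the function names, the whitespace tokenisation, the unsmoothed log(N / df) weighting and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def binary_vectors(docs, vocab):
    """Binary setting: 1 if the token occurs in the document, 0 otherwise."""
    rows = []
    for doc in docs:
        present = set(doc.split())
        rows.append([1 if tok in present else 0 for tok in vocab])
    return rows


def tfidf_vectors(docs, vocab):
    """TF-IDF setting: raw term count times log(N / document frequency).

    Libraries usually add smoothing and normalisation; this sketch does not.
    """
    n_docs = len(docs)
    tokenised = [doc.split() for doc in docs]
    # Document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenised for tok in set(toks))
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        rows.append([tf[tok] * math.log(n_docs / df[tok]) for tok in vocab])
    return rows


# Toy documents (hypothetical, not taken from the KDA dataset)
docs = ["gene protein gene", "protein music", "music concert"]
vocab = build_vocabulary(docs)
print(list(vocab))
print(binary_vectors(docs, vocab))
print(tfidf_vectors(docs, vocab))
```

The binary vectors only record which tokens occur, while the TF-IDF vectors give more weight to tokens that are frequent in a document but rare across the collection.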

  5. Create stratified folds:
python create_folds.py <KDA_Dataset_path>

Pre-computed folds are available in KDA_dataset/experiment.
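
As a rough illustration of what stratification means, the sketch below splits samples into folds that keep label proportions similar. It is a simplification, not the logic of create_folds.py: it assumes one label per document, whereas the MLSMOTE step suggests the data is multi-label, which needs a more involved procedure. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Return n_folds lists of sample indices with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    next_fold = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous label stopped
        for idx in indices:
            folds[next_fold].append(idx)
            next_fold = (next_fold + 1) % n_folds
    return folds


# Toy single-label data (hypothetical domains)
labels = ["biology"] * 6 + ["music"] * 3
folds = stratified_folds(labels, n_folds=3)
for i, fold in enumerate(folds):
    print(i, [labels[j] for j in sorted(fold)])
```

Each fold here receives two "biology" samples and one "music" sample, mirroring the 2:1 ratio of the full set.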

  6. Resample folds with MLSMOTE (from directory KDA_dataset/):
python resample_folds.py <KDA_Dataset_path>

Resampled folds are available in KDA_dataset/experiment.
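
For readers unfamiliar with MLSMOTE, the following is a simplified sketch of the general idea: pick the labels that are rare relative to the most frequent one, then create synthetic samples by interpolating between a minority sample and one of its nearest neighbours. It is not the code of resample_folds.py. The function names, the choice of k, the "at least half" rule for the synthetic label set and the toy data are assumptions of this example.

```python
import random


def minority_labels(Y):
    """Labels whose imbalance ratio (largest label count / own count) exceeds the mean ratio."""
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    top = max(counts)
    ratios = {j: top / c for j, c in enumerate(counts) if c > 0}
    mean_ir = sum(ratios.values()) / len(ratios)
    return [j for j, r in ratios.items() if r > mean_ir]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mlsmote(X, Y, k=3, seed=0):
    """Return synthetic (features, labels) for samples carrying a minority label."""
    rng = random.Random(seed)
    new_X, new_Y = [], []
    for label in minority_labels(Y):
        bag = [i for i, row in enumerate(Y) if row[label] == 1]
        for i in bag:
            # Up to k nearest neighbours of sample i among samples sharing the label
            others = sorted((j for j in bag if j != i),
                            key=lambda j: squared_distance(X[i], X[j]))
            neighbours = others[:k]
            if not neighbours:
                continue
            ref = rng.choice(neighbours)
            gap = rng.random()
            # Features: a random point on the segment between the sample and its neighbour
            new_X.append([a + gap * (b - a) for a, b in zip(X[i], X[ref])])
            # Labels: keep those present in at least half of the sample plus its neighbours
            group = [i] + neighbours
            new_Y.append([1 if 2 * sum(Y[g][c] for g in group) >= len(group) else 0
                          for c in range(len(Y[0]))])
    return new_X, new_Y


# Toy data (hypothetical): label 0 is frequent, label 1 is rare
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]]
Y = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [1, 1]]
new_X, new_Y = mlsmote(X, Y, k=3)
print(minority_labels(Y))
print(new_X)
print(new_Y)
```

In this toy case only the second label is treated as a minority, so the two samples carrying it each produce one synthetic neighbour lying between them in feature space.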

  7. Train and test the classifier:
python train_and_test_classifier.py <KDA_Dataset_path>
