quokka

  1. Download Data

    Download the files from the links below and copy them to the data directory.

    Energy Hub

    Energy Hub Training set - 
    
    Energy Hub Validation set - 
    
    Energy Hub Test set - 
    

    Reuters

    Reuters Training set - 
    
    Reuters Validation set - 
    
    Reuters Test set - 
    
  2. Download Necessary Packages

    • Download the NLTK stopwords:

      ```
      import nltk
      nltk.download('stopwords')
      ```
      
    • Download Mallet from here. Unzip it and copy it to the working directory. A sketch of training Mallet LDA from Python follows this list.

      If you use Google Colab:

      ```
      !wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
      !unzip mallet-2.0.8.zip
      ```
      
    • Download the GloVe embeddings from here. Unzip them and copy them to the working directory. A loading sketch also follows this list.

      If you use Google Colab:

      ```
      !wget https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
      !unzip glove*.zip
      ```
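
    Once Mallet is unzipped, one common way to train LDA through it from Python is gensim's wrapper. This is a minimal sketch, assuming gensim < 4.0 (the wrapper was removed in 4.0) and a toy placeholder corpus; it is not the repository's exact code.

    ```
    # Minimal sketch: Mallet LDA via gensim's wrapper (gensim < 4.0).
    # Assumes mallet-2.0.8/ sits in the working directory, as unzipped above.
    from gensim.corpora import Dictionary
    from gensim.models.wrappers import LdaMallet

    # Toy placeholder corpus; the real pipeline tokenizes the dataset documents.
    docs = [["energy", "grid", "storage"], ["market", "oil", "price"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    mallet_path = "mallet-2.0.8/bin/mallet"  # path from the unzip step above
    lda = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary)
    print(lda.show_topics(num_topics=2))
    ```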
      
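    After unzipping glove.6B.zip, the vectors can be read into a plain dictionary. A minimal sketch; the 100-dimensional file is an assumption, and any of the bundled dimensions loads the same way.

    ```
    # Minimal sketch: load GloVe vectors into a {word: vector} dict.
    # Assumes glove.6B.100d.txt was unzipped into the working directory.
    import numpy as np

    embeddings = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    print(embeddings["energy"].shape)  # (100,)
    ```
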
  3. Build Topic-Entity Triples

    This step involves:

    • Training a topic modeler over the corpus
    • Extracting named entities using spaCy
    • Building triples using a dependency parser and POS tagger
    • Applying a topic-entity filter over these triples

    Run the following Python script:

    ```
    python data_preprocess.py <dataset>
    ```

    Change <dataset> to "energy hub" or "reuters" to select the corpus. A minimal sketch of the spaCy extraction steps follows.
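
    data_preprocess.py implements the full pipeline; the sketch below only illustrates how named entities and subject-verb-object triples can be pulled from text with spaCy's NER, dependency parser, and POS tagger. The SVO heuristic and the example sentence are illustrative assumptions, not the repository's exact filter.

    ```
    # Minimal sketch: spaCy NER plus a naive dependency-based SVO triple
    # extractor. Requires: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Chevron acquired a solar farm in Texas.")

    # Named entities (candidates for the topic-entity filter)
    print([(ent.text, ent.label_) for ent in doc.ents])

    # Naive subject-verb-object triples from the dependency parse
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ == "dobj"]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    print(triples)  # e.g. [('Chevron', 'acquire', 'farm')]
    ```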

  4. Train Models

    Run the following Python script:

    ```
    python train.py <dataset> <model>
    ```

    Change <dataset> to "energy hub" or "reuters" to select the corpus.

    Change <model> to one of the following options:

    • text - GloVe-based text model
    • topics - topic distributions
    • entities - GloVe-enriched named entities
    • triples - GloVe-enriched triples
    • text_topics - text and topic distributions combined
    • text_triples - text (GloVe) and triples (GloVe) combined

    The combined options pair two representations of the same document; an illustrative fusion sketch follows.
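
    As a hedged illustration of how a combined option can fuse two views of a document, the sketch below concatenates an averaged GloVe vector with a topic distribution. The function name, the stand-in vectors, and the fusion itself are assumptions for illustration, not the repository's architecture.

    ```
    # Illustrative sketch: fuse text and topic features by concatenating
    # the mean GloVe vector of a document's tokens with its topic
    # distribution (as produced by the trained topic model).
    import numpy as np

    def document_features(tokens, topic_dist, embeddings, dim=100):
        vectors = [embeddings[t] for t in tokens if t in embeddings]
        text_vec = np.mean(vectors, axis=0) if vectors else np.zeros(dim)
        return np.concatenate([text_vec, topic_dist])

    tokens = ["energy", "grid", "storage"]
    topic_dist = np.array([0.7, 0.3])                      # placeholder 2-topic mix
    embeddings = {t: np.random.rand(100) for t in tokens}  # stand-in GloVe vectors
    print(document_features(tokens, topic_dist, embeddings).shape)  # (102,)
    ```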
