This repository shows how to train a sentence mood classifier using spaCy's new SpanCategorizer component and the Georgetown University Multilayer (GUM) Corpus. The classifier uses a custom span suggester, which returns sentences for classification.
Please note that this repository is only for demonstration. The GUM corpus is too small for training a classifier from scratch and some labels are very rare. The classifier does a decent job with declaratives and interrogatives, but struggles with imperatives and rarer moods.
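
To illustrate what a sentence-level span suggester could look like, here is a minimal sketch modelled on the shape of spaCy's built-in n-gram suggester. The registry name `moodcat.sentence_suggester.v1` and the function names are hypothetical and are not taken from this repository; the sketch also assumes that sentence boundaries have already been set on each `Doc` (for example by a parser or sentencizer).

```python
from typing import Callable, Iterable, List, Optional

from spacy import registry
from spacy.tokens import Doc
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


# Hypothetical registry name, chosen for illustration only.
@registry.misc("moodcat.sentence_suggester.v1")
def build_sentence_suggester() -> Callable[..., Ragged]:
    """Build a suggester that proposes every sentence as a candidate span."""

    def sentence_suggester(
        docs: Iterable[Doc], *, ops: Optional[Ops] = None
    ) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans: List[List[int]] = []  # one [start, end] token pair per sentence
        lengths: List[int] = []      # number of candidate spans per doc
        for doc in docs:
            count = 0
            # Assumes sentence boundaries are available on the doc.
            for sent in doc.sents:
                spans.append([sent.start, sent.end])
                count += 1
            lengths.append(count)
        lengths_array = ops.asarray(lengths, dtype="i")
        if spans:
            data = ops.asarray(spans, dtype="i")
        else:
            # No sentences at all: return an empty (0, 2) array.
            data = ops.xp.zeros((0, 2), dtype="i")
        return Ragged(data, lengths_array)

    return sentence_suggester
```

A suggester registered this way would typically be referenced from the `[components.spancat.suggester]` block of the training config via its `@misc` name.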
For information on classifier performance, see the file `training/metrics.json`.
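
As a quick way to inspect the metrics file, the following sketch loads a JSON file and lists its numeric top-level entries. The helper names are hypothetical, and the exact keys in `training/metrics.json` depend on the pipeline and spaCy version, so no specific key names are assumed here.

```python
import json
from pathlib import Path


def load_scores(path):
    """Return the numeric top-level entries of a metrics JSON file."""
    metrics = json.loads(Path(path).read_text(encoding="utf8"))
    return {
        key: value
        for key, value in metrics.items()
        # Keep plain numbers only; nested per-label tables are skipped.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def format_scores(scores):
    """Format scores as one 'name: value' line per entry, sorted by name."""
    return "\n".join(
        f"{name}: {value:.3f}" for name, value in sorted(scores.items())
    )
```

After running the `evaluate` command, `print(format_scores(load_scores("training/metrics.json")))` would print the overall scores.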
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.
The following commands are defined by the project. They can be executed using `spacy project run [name]`. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `convert` | Convert the CoNLL-U data to spaCy's binary format |
| `debug` | Debug the data for insights on the corpus |
| `train` | Train the model for sentence mood classification |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Package the trained model as a pip package |

The following workflows are defined by the project. They can be executed using `spacy project run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `convert` → `train` → `evaluate` → `package` |

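
The idea that a command is only re-run when its inputs have changed can be pictured with a small checksum comparison. This is a simplified illustration of the general technique and not spaCy's actual implementation; the function names and the in-memory `recorded` dictionary are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def record_run(deps):
    """Remember the checksum of every input after a successful run."""
    return {str(dep): file_checksum(dep) for dep in deps}


def needs_rerun(deps, recorded):
    """Return True if any input is missing, new, or has changed."""
    for dep in deps:
        path = Path(dep)
        if not path.exists():
            return True
        if recorded.get(str(dep)) != file_checksum(path):
            return True
    return False
```

In this sketch, a runner would call `needs_rerun` before each step and `record_run` afterwards, skipping any step whose inputs still match the recorded checksums.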
The following assets are defined by the project. They can be fetched by running `spacy project assets` in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/gum` | Git | The Georgetown University Multilayer (GUM) Corpus |

- Run the command `python setup.py install` in the directory `packages/en_moodcat-0.0.1` to install the pipeline.
- Run the file `moodcat_demo.py`.