WikiDoMiner is a tool that automatically generates domain-specific corpora by crawling Wikipedia.
Clone the repository and install the required libraries:
git clone https://github.com/SNTSVV/WikiDoMiner.git
cd WikiDoMiner
pip install -r requirements.txt
To run WikiDoMiner from the command line, pass your input document, an output directory for the corpus, and the Wikipedia crawl depth:
python WikiDoMiner.py --doc Xfile.txt --output-path ../research/nlp --wiki-depth 1
Check the available arguments with:
python WikiDoMiner.py --help
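Alternatively, you can call WikiDoMiner's functions directly from Python (for example, in a notebook or your own script):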
# getKeywords, getCorpus, and saveCorpus are provided by WikiDoMiner
import spacy
# load a spaCy pipeline and read your input document (model and path are examples)
spacy_pipeline = spacy.load('en_core_web_sm')
with open('Xfile.txt', encoding='utf-8') as f:
    document = f.read()
# extract keywords from the document
keywords = getKeywords(document, spacy_pipeline)
# query Wikipedia to build your corpus
corpus = getCorpus(keywords, depth=1)
# save your corpus locally
saveCorpus(corpus, parent_dir='Documents', folder='Corpus')
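After building the corpus, you may want a quick look at what was saved. The sketch below is only an illustration: it assumes saveCorpus writes one plain-text file per retrieved Wikipedia article under parent_dir/folder (here Documents/Corpus); adjust the path and extension check if the actual layout differs.
import os

corpus_dir = os.path.join('Documents', 'Corpus')
# list the saved article files (assumed to be .txt) and report a rough word count for each
files = [name for name in os.listdir(corpus_dir) if name.endswith('.txt')]
print(f'{len(files)} articles saved in {corpus_dir}')
for name in files[:5]:
    with open(os.path.join(corpus_dir, name), encoding='utf-8') as f:
        print(name, '-', len(f.read().split()), 'words')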
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.