Subset of pages from Wikipedia annotated with WordNet (WN) senses.
Steps:
- get a set of pages from Wikipedia with its own export tool
https://en.wikipedia.org/wiki/Special:Export
The list of pages we used is in `WikipediaPages.txt`
output: Wikipedia-XXXXXX.xml
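  The export can also be scripted. Below is a minimal sketch, assuming the
  Python requests library and that Special:Export accepts a POST with a
  newline-separated list of titles in `pages` (and `curonly=1` for current
  revisions only); the output file name is illustrative, the real export is
  date-stamped (Wikipedia-XXXXXX.xml).

    import requests

    # Read the page titles (one per line) and POST them to Special:Export,
    # which returns a single XML dump with one <page> element per title.
    with open("WikipediaPages.txt", encoding="utf-8") as f:
        titles = "\n".join(line.strip() for line in f if line.strip())

    resp = requests.post(
        "https://en.wikipedia.org/wiki/Special:Export",
        data={"pages": titles, "curonly": "1"},
    )
    resp.raise_for_status()

    # Illustrative output name; the actual export file is date-stamped.
    with open("Wikipedia-export.xml", "w", encoding="utf-8") as out:
        out.write(resp.text)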
- run WikiExtractor http://attardi.github.io/wikiextractor/ twice to produce a version with links and a version without links:
python3.7 PATH-TO/wikiextractor/WikiExtractor.py -s -l -o out1/ Wikipedia-20190806182211.xml
python3.7 PATH-TO/wikiextractor/WikiExtractor.py -s -o out2/ Wikipedia-20190806182211.xml
  Manually inspect the files in out2/ to clean up math markup, unnecessary sections, etc.
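  As a quick sanity check after the two runs above, one can verify that out1/
  and out2/ contain the same set of articles. A sketch, assuming the usual
  WikiExtractor output format with <doc id="..." url="..." title="..."> headers:

    import re

    DOC_HEADER = re.compile(r'<doc id="[^"]*" url="[^"]*" title="([^"]*)">')

    def titles(path):
        """Collect article titles from a WikiExtractor output file."""
        with open(path, encoding="utf-8") as f:
            return set(DOC_HEADER.findall(f.read()))

    with_links = titles("out1/AA/wiki_00")
    without_links = titles("out2/AA/wiki_00")
    assert with_links == without_links, with_links ^ without_links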
- Export the links:
  (load "prepare.lisp")
  (get-links "/Users/ar/work/wikipedia-og-corpus/out1/AA/wiki_00" "my.links")
  Then remove the out1/ directory.
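  get-links in prepare.lisp is the tool actually used; for reference, a rough
  Python equivalent of that link extraction (WikiExtractor with -l renders
  links as <a href="...">anchor</a>, with the target title URL-encoded; the
  tab-separated output format below is only illustrative) would be:

    import re
    from urllib.parse import unquote

    LINK = re.compile(r'<a href="([^"]*)">([^<]*)</a>')

    # Collect (anchor text, target title) pairs from the version with links.
    with open("out1/AA/wiki_00", encoding="utf-8") as f, \
         open("my.links", "w", encoding="utf-8") as out:
        for target, anchor in LINK.findall(f.read()):
            out.write(f"{anchor}\t{unquote(target)}\n")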
- Split the files:
  (main "/Users/ar/work/wikipedia-og-corpus/out2/AA/wiki_00")
  Then remove the out2/ directory.
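  main in prepare.lisp does this split; the idea, roughly, is one out/*.txt
  file per <doc> block. A sketch of the equivalent logic in Python (the
  id-based file names are illustrative):

    import os
    import re

    os.makedirs("out", exist_ok=True)
    with open("out2/AA/wiki_00", encoding="utf-8") as f:
        dump = f.read()

    # One .txt file per article, stripping the <doc> wrapper.
    for m in re.finditer(r'<doc id="([^"]*)"[^>]*>\n(.*?)</doc>', dump, re.DOTALL):
        doc_id, body = m.groups()
        with open(f"out/{doc_id}.txt", "w", encoding="utf-8") as out:
            out.write(body)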
- Split sentences in the files in out/*.txt:
for f in *.txt ;
do ~/work/apache-opennlp-1.9.0/bin/opennlp SentenceDetector ~/work/apache-opennlp-1.5.3/models/en-sent.bin < $f > `basename $f .txt`.sent;
done
- See own-pt/sensetion.el#135 for how to fix the files to avoid problems with sensetion/touch.py:
  sed -i .bak 's/\."$/"./' *.sent
  sed -i .bak 's/ / /' *.sent
  sed -i .bak 's/ / /' *.sent
  Then remove the .bak files. I also manually searched (using deadgrep in
  Emacs) for sentence-split errors, looking for lines that start with a
  lowercase character (see the sketch below).
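  The same check can be scripted; a small sketch that flags .sent lines
  starting with a lowercase letter (usually a sign that a sentence was split
  in the middle):

    import glob
    import re

    # Report probable sentence-split errors: file, line number, offending line.
    for path in glob.glob("out/*.sent"):
        with open(path, encoding="utf-8") as f:
            for n, line in enumerate(f, 1):
                if re.match(r"[a-z]", line):
                    print(f"{path}:{n}: {line.rstrip()}")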
- sensetion preprocessing
python touch.py -c ~/hpsg/terg/pet/repp.set ~/work/wikipedia-og-corpus/out/*.sent > ~/work/wikipedia-og-corpus/t0.jsonl
python enrich.py --es ~/work/wikipedia-og-corpus/t0.jsonl > ~/work/wikipedia-og-corpus/t1.jsonl
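  The schema of the .jsonl files is defined by sensetion; as a rough sanity
  check, assuming one JSON object per line, one can verify that every line
  parses and count the records:

    import json

    # Every line of the enriched corpus should be a valid JSON object.
    count = 0
    with open("t1.jsonl", encoding="utf-8") as f:
        for line in f:
            json.loads(line)  # raises if a line is malformed
            count += 1
    print(count, "records")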