own-pt/wikipedia-og-corpus

A subset of Wikipedia pages annotated with WordNet (WN) senses.

Steps:

  1. Get a set of pages from Wikipedia with its own export tool:

https://en.wikipedia.org/wiki/Special:Export

The list of pages we used is in `WikipediaPages.txt`

output: Wikipedia-XXXXXX.xml
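
The export can also be scripted rather than done through the web form. The following is a minimal sketch, not the procedure used for this corpus. It assumes that `WikipediaPages.txt` holds one page title per line and that Special:Export accepts the form fields `pages`, `curonly` and `action=submit`; both assumptions should be checked against the MediaWiki documentation before relying on it. The function names and the output file name are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"


def read_titles(path):
    """Return the non-empty lines of a title list, stripped of whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_export_form(titles):
    """Encode the (assumed) form fields for a current-revision-only export."""
    fields = {
        "pages": "\n".join(titles),  # one title per line
        "curonly": "1",              # latest revision only, no history
        "action": "submit",
    }
    return urlencode(fields).encode("utf-8")


def export_pages(titles, out_path):
    """POST the form and save the returned XML dump to out_path."""
    request = Request(
        EXPORT_URL,
        data=build_export_form(titles),
        headers={"User-Agent": "wikipedia-og-corpus-example/0.1"},
    )
    with urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Example call (hypothetical output name):
# export_pages(read_titles("WikipediaPages.txt"), "Wikipedia-export.xml")
```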

  2. Run WikiExtractor (http://attardi.github.io/wikiextractor/) twice to produce one version with links and one version without links:
python3.7 PATH-TO/wikiextractor/WikiExtractor.py -s -l -o out1/ Wikipedia-20190806182211.xml
python3.7 PATH-TO/wikiextractor/WikiExtractor.py -s -o out2/ Wikipedia-20190806182211.xml

We then manually inspected out2/ to remove maths, unnecessary sections, etc.

  3. Export the links:
(load "prepare.lisp")
(get-links "/Users/ar/work/wikipedia-og-corpus/out1/AA/wiki_00" "my.links")

Then remove the out1/ directory.
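
For readers without a Lisp environment, the idea behind link extraction can be sketched in Python. This is not the implementation in `prepare.lisp`, and it says nothing about the format of `my.links`. It assumes that WikiExtractor, when run with `-l`, keeps links as `<a href="...">anchor</a>` elements with a URL-encoded target; `extract_links` is a hypothetical name.

```python
import re
from urllib.parse import unquote

# Assumed link format in the -l output: <a href="Target%20page">anchor text</a>
LINK_RE = re.compile(r'<a href="([^"]*)">(.*?)</a>')


def extract_links(text):
    """Return (anchor text, decoded target) pairs in order of appearance."""
    return [(anchor, unquote(target)) for target, anchor in LINK_RE.findall(text)]
```

Applied to one line of the extracted text, this would give each anchor together with the page it points to.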

  4. Split the files:
(main "/Users/ar/work/wikipedia-og-corpus/out2/AA/wiki_00")

Remove the out2/ directory.
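
The splitting step can be pictured with the following Python sketch. It is not the `main` function from `prepare.lisp`. It assumes that the WikiExtractor output wraps each page in `<doc id="..." url="..." title="...">` and `</doc>` lines, and it names each output file after the doc id, which is a hypothetical choice and may differ from what `main` does.

```python
import re
from pathlib import Path

# Assumed WikiExtractor layout: <doc id="..." ...> newline, body, </doc>
DOC_RE = re.compile(r'<doc id="([^"]*)"[^>]*>\n(.*?)</doc>', re.DOTALL)


def split_docs(text):
    """Return a list of (doc id, body) pairs, one per <doc> element."""
    return [(doc_id, body.strip()) for doc_id, body in DOC_RE.findall(text)]


def write_docs(wiki_file, out_dir):
    """Write each document body to out_dir/<doc id>.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text = Path(wiki_file).read_text(encoding="utf-8")
    for doc_id, body in split_docs(text):
        (out / f"{doc_id}.txt").write_text(body + "\n", encoding="utf-8")
```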

  5. Split the files in out/*.txt into sentences:
for f in *.txt; do
  ~/work/apache-opennlp-1.9.0/bin/opennlp SentenceDetector ~/work/apache-opennlp-1.5.3/models/en-sent.bin < "$f" > "$(basename "$f" .txt).sent"
done
  6. Fix the files to avoid problems with sensetion/touch.py (see own-pt/sensetion.el#135):
sed -i .bak 's/\."$/"./' *.sent
sed -i .bak 's/ / /' *.sent
sed -i .bak 's/ / /' *.sent

Then remove the .bak files. I also searched manually (using deadgrep in Emacs) for sentence-splitting errors by looking for lines that start with a lowercase character.
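
The lowercase check can also be run without Emacs. Below is a minimal sketch of the same heuristic in Python; the function names are hypothetical, and it only flags candidates that still need to be reviewed by hand.

```python
from pathlib import Path


def suspicious_lines(lines):
    """Yield (line number, line) for lines whose first character is lowercase."""
    for number, line in enumerate(lines, start=1):
        if line[:1].islower():
            yield number, line.rstrip("\n")


def report(directory):
    """Print file:line:text for every suspicious line in directory/*.sent."""
    for path in sorted(Path(directory).glob("*.sent")):
        with path.open(encoding="utf-8") as handle:
            for number, line in suspicious_lines(handle):
                print(f"{path}:{number}:{line}")
```

Calling `report("out")` would list the candidate lines in a grep-like format. Lines that start with a digit, a quote, or any other non-letter are not flagged.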

  7. Run the sensetion preprocessing:
python touch.py -c ~/hpsg/terg/pet/repp.set ~/work/wikipedia-og-corpus/out/*.sent > ~/work/wikipedia-og-corpus/t0.jsonl
python enrich.py --es ~/work/wikipedia-og-corpus/t0.jsonl > ~/work/wikipedia-og-corpus/t1.jsonl
