Subset of pages from Wikipedia annotated with WordNet (WN) senses.
Steps:
- get a set of pages from Wikipedia with its own export tool
https://en.wikipedia.org/wiki/Special:Export
The list of pages we used is in `WikipediaPages.txt`
output: Wikipedia-XXXXXX.xml
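  The export can also be scripted. Below is a minimal sketch, assuming the
  Python requests library and that Special:Export accepts a POST with a
  newline-separated list of titles in `pages` (and `curonly=1` for current
  revisions only); the output file name is illustrative, the real export is
  date-stamped (Wikipedia-XXXXXX.xml).

    import requests

    # Read the page titles (one per line) and POST them to Special:Export,
    # which returns a single XML dump with one <page> element per title.
    with open("WikipediaPages.txt", encoding="utf-8") as f:
        titles = "\n".join(line.strip() for line in f if line.strip())

    resp = requests.post(
        "https://en.wikipedia.org/wiki/Special:Export",
        data={"pages": titles, "curonly": "1"},
    )
    resp.raise_for_status()

    # Illustrative output name; the actual export file is date-stamped.
    with open("Wikipedia-export.xml", "w", encoding="utf-8") as out:
        out.write(resp.text)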
- run WikiExtractor http://attardi.github.io/wikiextractor/ twice to produce a version with links and a version without links:
python3.7 PATH-TO/wikiextractor/WikiExtractor.py -s -l -o out1/ Wikipedia-20190806182211.xml
python3.7 PATH-TO/wikiextractor/WikiExtractor.py -s -o out2/ Wikipedia-20190806182211.xml
  Manually inspect the files in out2/ to clean up math markup, unnecessary sections, etc.
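  As a quick sanity check after the two runs above, one can verify that out1/
  and out2/ contain the same set of articles. A sketch, assuming the usual
  WikiExtractor output format with <doc id="..." url="..." title="..."> headers:

    import re

    DOC_HEADER = re.compile(r'<doc id="[^"]*" url="[^"]*" title="([^"]*)">')

    def titles(path):
        """Collect article titles from a WikiExtractor output file."""
        with open(path, encoding="utf-8") as f:
            return set(DOC_HEADER.findall(f.read()))

    with_links = titles("out1/AA/wiki_00")
    without_links = titles("out2/AA/wiki_00")
    assert with_links == without_links, with_links ^ without_links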
- Export the links:
  (load "prepare.lisp")
  (get-links "/Users/ar/work/wikipedia-og-corpus/out1/AA/wiki_00" "my.links")
  Then remove the out1/ directory.
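  get-links in prepare.lisp is the tool actually used; for reference, a rough
  Python equivalent of that link extraction (WikiExtractor with -l renders
  links as <a href="...">anchor</a>, with the target title URL-encoded; the
  tab-separated output format below is only illustrative) would be:

    import re
    from urllib.parse import unquote

    LINK = re.compile(r'<a href="([^"]*)">([^<]*)</a>')

    # Collect (anchor text, target title) pairs from the version with links.
    with open("out1/AA/wiki_00", encoding="utf-8") as f, \
         open("my.links", "w", encoding="utf-8") as out:
        for target, anchor in LINK.findall(f.read()):
            out.write(f"{anchor}\t{unquote(target)}\n")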
- Split the files:
  (main "/Users/ar/work/wikipedia-og-corpus/out2/AA/wiki_00")
  Then remove the out2/ directory.
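  main in prepare.lisp does this split; the idea, roughly, is one out/*.txt
  file per <doc> block. A sketch of the equivalent logic in Python (the
  id-based file names are illustrative):

    import os
    import re

    os.makedirs("out", exist_ok=True)
    with open("out2/AA/wiki_00", encoding="utf-8") as f:
        dump = f.read()

    # One .txt file per article, stripping the <doc> wrapper.
    for m in re.finditer(r'<doc id="([^"]*)"[^>]*>\n(.*?)</doc>', dump, re.DOTALL):
        doc_id, body = m.groups()
        with open(f"out/{doc_id}.txt", "w", encoding="utf-8") as out:
            out.write(body)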
- Split sentences in the files in out/*.txt:
for f in *.txt ;
do ~/work/apache-opennlp-1.9.0/bin/opennlp SentenceDetector ~/work/apache-opennlp-1.5.3/models/en-sent.bin < $f > `basename $f .txt`.sent;
done
- See own-pt/sensetion.el#135 for how to fix the files to avoid problems with sensetion/touch.py:
  sed -i .bak 's/\."$/"./' *.sent
  sed -i .bak 's/ / /' *.sent
  sed -i .bak 's/ / /' *.sent
  Then remove the .bak files. I also manually searched (using deadgrep in
  Emacs) for sentence-split errors, looking for lines that start with a
  lowercase character (see the sketch below).
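  The same check can be scripted; a small sketch that flags .sent lines
  starting with a lowercase letter (usually a sign that a sentence was split
  in the middle):

    import glob
    import re

    # Report probable sentence-split errors: file, line number, offending line.
    for path in glob.glob("out/*.sent"):
        with open(path, encoding="utf-8") as f:
            for n, line in enumerate(f, 1):
                if re.match(r"[a-z]", line):
                    print(f"{path}:{n}: {line.rstrip()}")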
- sensetion preprocessing
python touch.py -c ~/hpsg/terg/pet/repp.set ~/work/wikipedia-og-corpus/out/*.sent > ~/work/wikipedia-og-corpus/t0.jsonl
python enrich.py --es ~/work/wikipedia-og-corpus/t0.jsonl > ~/work/wikipedia-og-corpus/t1.jsonl
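  The schema of the .jsonl files is defined by sensetion; as a rough sanity
  check, assuming one JSON object per line, one can verify that every line
  parses and count the records:

    import json

    # Every line of the enriched corpus should be a valid JSON object.
    count = 0
    with open("t1.jsonl", encoding="utf-8") as f:
        for line in f:
            json.loads(line)  # raises if a line is malformed
            count += 1
    print(count, "records")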