Commit

Use JSON corpora from kalamine-corpus.
fabi1cazenave committed Dec 9, 2024
1 parent 891c161 commit 9a11b87
Showing 7 changed files with 5 additions and 150,169 deletions.
300 changes: 0 additions & 300 deletions kalamine/www/corpus/LICENSE

This file was deleted.

7 changes: 5 additions & 2 deletions kalamine/www/corpus/README.md
@@ -1,14 +1,17 @@
-# Corpus for layout analysis
+# Corpus for Layout Analysis

All JSON files have been generated with [kalamine-corpus](https://github.com/OneDeadKey/kalamine-corpus?tab=readme-ov-file).

## `fr` / `en`

-Those corpora and stats come from Don Quixote
+Those corpora and stats come from Don Quixote (Cervantes), Gutenberg Project.

## `fra_mixed-typical_2012_1M-sentences`

These stats come from [University of Leipzig](https://wortschatz.uni-leipzig.de/en/download/French#fra_mixed_2012)

### Sources

French Mixed-Typical 2012, 1M sentences file has been extracted, and the
sentence indices have been stripped with `awk '!($1="")'
fra_mixed-typical_2012_1M/fra_mixed-typical_2012_1M-sentences.txt >
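For readers without `awk` at hand, the index-stripping step described in the README can be approximated in Python. This is a minimal sketch, not the method used by the commit: it assumes each line of the sentences file starts with a numeric index followed by whitespace, and the function name `strip_sentence_index` and the sample lines are hypothetical. Unlike the `awk` one-liner, which leaves a leading space where the first field used to be, this version drops that space as well.

```python
def strip_sentence_index(line: str) -> str:
    """Drop the first whitespace-separated field (assumed to be the sentence index)."""
    fields = line.split()
    # Like awk's field rebuild, runs of whitespace collapse to single spaces.
    # Unlike awk, no leading space is kept in place of the removed field.
    return " ".join(fields[1:])


if __name__ == "__main__":
    # Hypothetical sample lines, assuming an "index<TAB>sentence" layout.
    sample = ["1\tLe chat dort.", "2\tIl pleut  encore."]
    for line in sample:
        print(strip_sentence_index(line))
```

Lines that contain only an index, or nothing at all, come out as empty strings rather than being dropped, so the line count of the output matches the input.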
76 changes: 0 additions & 76 deletions kalamine/www/corpus/chardict.py

This file was deleted.
