Paraly: An (annotated) dataset for exploring the concept of paralysis (fr. ‘paralysie’) in a digital corpus of French Literature

Structure

paraly/
├── data/
│   ├── training/
│   │   ├── train_fasttext_dataset.txt
│   │   ├── test_fasttext_dataset.txt
│   │   └── dev_fasttext_dataset.txt
│   ├── model/
│   │   ├── training.log
│   │   ├── test.tsv
│   │   ├── loss.tsv
│   │   └── dev.tsv
│   ├── corpus/
│   │   ├── 20_paraly_metadata.tsv
│   │   ├── 20_paraly_data_TEI.xml/
│   │   ├── 20_paraly_corpus.cec6
│   │   ├── 19_paraly_metadata.tsv
│   │   ├── 19_paraly_data_TEI.xml/
│   │   ├── 19_paraly_corpus.cec6
│   │   ├── 18_paraly_metadata.tsv
│   │   ├── 18_paraly_data_TEI.xml/
│   │   └── 18_paraly_corpus.cec6
│   └── annotations/
│       ├── 20_paraly_annotations.xlsx
│       ├── 20_paraly_annotations.csv
│       ├── 19_paraly_annotations.xlsx
│       ├── 19_paraly_annotations.csv
│       ├── 18_paraly_annotations.xlsx
│       └── 18_paraly_annotations.csv
├── code/
│   ├── training/
│   │   ├── train_fc.py
│   │   └── README_training.md
│   ├── splitting/
│   │   ├── prepare_training_data.py
│   │   └── README_splitting.md
│   ├── merging/
│   │   ├── merge.ipynb
│   │   └── README_merging.md
│   ├── extraction/
│   │   ├── starten.bat
│   │   ├── skript.cecs
│   │   ├── query.txt
│   │   └── README_extraction.md
│   └── collection/
│       ├── get_metadata_for_corpus.ipynb
│       ├── get_metadata_for_all_books.ipynb
│       ├── get_OCRed_books_from_gallica.ipynb
│       └── comment_metadata_in_html_files.ipynb
├── README.md
└── LICENSE
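
The naming scheme in the tree above is regular enough to build paths from a prefix. The following is a minimal sketch, assuming the repository is checked out as a local directory named paraly; the helper name files_for_prefix and the dictionary keys are hypothetical and not part of the repository.

```python
from pathlib import Path

# File prefixes that appear in the tree above.
PREFIXES = ("18", "19", "20")


def files_for_prefix(prefix: str, root: Path = Path("paraly")) -> dict[str, Path]:
    """Return the per-prefix paths implied by the naming scheme in the tree.

    `root` is an assumed local checkout location, not something the
    repository prescribes.
    """
    corpus = root / "data" / "corpus"
    annotations = root / "data" / "annotations"
    return {
        "metadata": corpus / f"{prefix}_paraly_metadata.tsv",
        "tei": corpus / f"{prefix}_paraly_data_TEI.xml",
        "corpus": corpus / f"{prefix}_paraly_corpus.cec6",
        "annotations_csv": annotations / f"{prefix}_paraly_annotations.csv",
        "annotations_xlsx": annotations / f"{prefix}_paraly_annotations.xlsx",
    }


if __name__ == "__main__":
    for prefix in PREFIXES:
        for kind, path in files_for_prefix(prefix).items():
            print(prefix, kind, path)
```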

Collection

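The full digital collection, organized by period, is available on Gallica at https://gallica.bnf.fr/html/und/litteratures/les-classiques-de-la-litterature-acces-par-periode?mode=desktop. This dataset draws on three of its collections, identified by the 18_, 19_, and 20_ file prefixes in ./data/corpus/ and ./data/annotations/.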

The OCR-ed books and their metadata were downloaded using the scripts in ./code/collection/.

Annotation

The annotated data in ./data/annotations/ is labeled "c" (concrete), "f" (figurative), or "cf" (an in-between category).
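
To get a quick overview of how the three labels are distributed, something like the following could be used. It is only a sketch: the column name label and the comma delimiter are assumptions, since the layout of the annotation CSV files is not described here, and the sample rows are invented placeholders rather than real annotations.

```python
import csv
import io
from collections import Counter

# The three labels named in the annotation scheme.
VALID_LABELS = {"c", "f", "cf"}


def count_labels(csv_file, label_column="label"):
    """Count label occurrences in an annotation CSV.

    `label_column` is a hypothetical column name; adjust it to the
    actual header of the annotation files.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        label = (row.get(label_column) or "").strip()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label {label!r}")
        counts[label] += 1
    return counts


if __name__ == "__main__":
    # Invented placeholder rows, for illustration only.
    sample = io.StringIO(
        "text,label\n"
        "placeholder sentence 1,c\n"
        "placeholder sentence 2,f\n"
        "placeholder sentence 3,cf\n"
    )
    print(count_labels(sample))
```

With a real file, the in-memory sample would be replaced by an opened file, for example open("data/annotations/19_paraly_annotations.csv", newline="", encoding="utf-8"), where the encoding is also an assumption.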

Model

The multilabel classifier paraly_camembert_large_multilabel was trained with the Flair library using the script in ./code/training/ and is openly available on Hugging Face.
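
The files in ./data/training/ are named *_fasttext_dataset.txt, which suggests the fastText classification format that Flair can read: each line starts with one or more __label__<name> tokens followed by the text. The sketch below parses such a line. It assumes the files follow that convention, and the helper name parse_fasttext_line is hypothetical.

```python
LABEL_PREFIX = "__label__"


def parse_fasttext_line(line):
    """Split a fastText-style line into (labels, text).

    Leading tokens that start with `__label__` are treated as labels;
    everything after them is the text. Whitespace in the text is
    normalized to single spaces as a side effect of tokenizing.
    """
    tokens = line.strip().split()
    labels = []
    i = 0
    while i < len(tokens) and tokens[i].startswith(LABEL_PREFIX):
        labels.append(tokens[i][len(LABEL_PREFIX):])
        i += 1
    return labels, " ".join(tokens[i:])


if __name__ == "__main__":
    # Invented example line; the label names in the real files may differ.
    print(parse_fasttext_line("__label__c __label__f sa jambe paralysée"))
```

A line carrying several __label__ tokens yields several labels, which is how a multilabel example would typically be encoded in this format.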

App

The app for using the classifier is openly available via Hugging Face Spaces.

License

This work is licensed under the MIT license (code) and the Creative Commons Attribution 4.0 International license (everything else). You are free to share and adapt the material for any purpose, even commercially, as long as you provide attribution.