Merge branch 'master' of https://github.com/MiMoText/roman18
JoKons committed Mar 9, 2023
2 parents fe6209f + e23dbee commit 9cc8970
Showing 3 changed files with 17 additions and 13 deletions.
10 changes: 6 additions & 4 deletions README.md
@@ -26,7 +26,7 @@ For a more detailed documentation of our sampling strategy, see our [Jupyter Not

The texts are provided in several different formats. For the texts from the first group, the original double keying files are available. In addition, a cleaned-up XML version closely reflecting the original documents’ layout is available (folder Archiv/XML4OCR).

-The master format for all texts is an XML format following the *Guidelines* of the Text Encoding Initiative (folder XML-TEI). The files are encoded in accordance with a relatively restrictive schema developed in the [COST Action ‘Distant Reading for European Literary History’](https://www.distant-reading.net/).
+The master format for all texts is an XML format following the *Guidelines* of the Text Encoding Initiative (folder XML-TEI). The files are encoded in accordance with a relatively restrictive schema developed in the [COST Action ‘Distant Reading for European Literary History’](https://www.distant-reading.net/) (level-1 encoding).

In addition, we provide plain text versions of the texts. However, these are best generated depending on individual needs using the scripts “tei2txt.py” & "tei2txt_run.py"(in the Scripts folder).

@@ -36,10 +36,12 @@ There is a short and an extensive metadata description in TSV for all TEI/XML fi
* Metadata, long version: https://github.com/MiMoText/roman18/blob/master/XML-TEI/xml-tei_full_metadata.tsv

## Language
-French
+
+The main language of all texts is French.

## Structure of the repository
-* Archiv: here we store files which were generated as intermediate for our digitization pipeline with OCR4all.
+
+* Archive: here we store files which were generated as intermediate for our digitization pipeline with OCR4all.
* Python-Scripts: the scripts folder contains python scripts needed for corpus creation
* Schemas: current versions of the ELTeC schema in RELAX NG are available from this repository
* XML-TEI: our corpus of french novels 1751-1800 in XML/TEI and metadata are stored here
@@ -52,7 +54,7 @@ All texts and scripts are in the public domain and can be reused without restric

## Citation suggestion

-*Collection de romans français du dix-huitième siècle (1750-1800) / Eighteenth-Century French Novels (1750-1800)*, edited by Julia Röttgermann, with contributions from Julia Dudar, Anne Klee, Johanna Konstanciak, Amelie Probst, Sarah Rebecca Ondraszek and Christof Schöch. Release v0.2.0. Trier: TCDH, 2021. URL: https://github.com/mimotext/roman18. DOI: http://doi.org/10.5281/zenodo.5040855.
+*Collection de romans français du dix-huitième siècle (1751-1800) / Eighteenth-Century French Novels (1751-1800)*, edited by Julia Röttgermann, with contributions from Julia Dudar, Henning Gebhard, Anne Klee, Johanna Konstanciak, Damir Padieu, Amelie Probst, Sarah Rebecca Ondraszek and Christof Schöch. Release v0.3.0. Trier: TCDH, 2023. URL: https://github.com/mimotext/roman18. DOI: http://doi.org/10.5281/zenodo.5040855.

## Funding

18 changes: 10 additions & 8 deletions XML-TEI/README.md
@@ -4,29 +4,31 @@ Eighteenth-Century French Novels.

## Introduction

-This repository of Eighteenth-Century French Novels contains digital texts of novels created or first published between 1751 and 1800. The collection is created in the context of Mining and Modeling Text, a project which is located at the Trier Center for Digital Humanities (TCDH) at Trier University. Work on the collection is ongoing.
+This repository of Eighteenth-Century French Novels contains digital texts of novels created or first published between 1751 and 1800. The collection is created in the context of Mining and Modeling Text, a project which is located at the Trier Center for Digital Humanities (TCDH) at Trier University. Work on the collection is ongoing until the end of 2023.

## Corpus building

-In the first step, about 40 novels have been carefully created by double keying. Using this first group of novels, an OCR-model has been trained in cooperation with Christian Reul (University of Würzburg), who is one of the developers of OCR4all. The result is an OCR model for French prints of the late 18th century. This model will shortly be available within OCR4all.
+In the first step, about 40 novels have been carefully created by double keying. Using this first group of novels, an OCR-model has been trained in cooperation with Christian Reul (University of Würzburg), who is one of the developers of OCR4all. The result is an OCR model for French prints of the late 18th century. This model is available within OCR4all.

-Applying this OCR-model to additional scans provided by for instance Gallica (bnf.fr) and HathiTrust, a second group of novels which are not yet digitally available (or only in low quality) is now being produced.
+Applying this OCR-model to additional scans provided by for instance Gallica (bnf.fr), a second group of novels which are not yet digitally available (or only in low quality) is now being produced.

-A third group of texts, based on existing full texts (from Gallica, Google books or Wikisource) will hopefully help us reach about 200 novels by the end of 2022.
+A third group of texts, based on existing full texts (from Gallica, Google books or Wikisource) helped us reach about 200 novels by the end of 2022.

-At the moment, corpus composition depends primarily on pragmatic criteria. We currently collect and plan to provide metadata for the creation of more principled subcorpora. A bibliography documenting the overall production of novels in the period is Angus Martin, Vivienne G. Mylne and Richard Frautschi, Bibliographie du genre romanesque français 1751-1800, 1977. Our goal is to use this metadata to balance our corpus of texts.
+At the beginning, corpus composition depended primarily on pragmatic criteria. We then proceeded and used additional metadata on the literary production to balance the corpus of full texts. A bibliography documenting the overall production of novels in the period is Angus Martin, Vivienne G. Mylne and Richard Frautschi, Bibliographie du genre romanesque français 1751-1800, 1977. We used this metadata to balance our corpus of texts regarding the parameters gender, year of first publication and narrative form.
+
+For a more detailed documentation of our sampling strategy, see our [Jupyter Notebook](https://github.com/MiMoText/balance_novels/blob/main/balance_analysis_newStructure.ipynb).

## Formats

-The texts are provided in several different formats. For the texts from the first group, the original double keying files are available. In addition, a cleaned-up XML version closely reflecting the original documents’ layout is available (folder XML4OCR).
+The texts are provided in several different formats. For the texts from the first group, the original double keying files are available. In addition, a cleaned-up XML version closely reflecting the original documents’ layout is available (folder Archiv/XML4OCR).

The master format for all texts is an XML format following the Guidelines of the Text Encoding Initiative (folder XML-TEI). The files are encoded in accordance with a relatively restrictive schema developed in the COST Action ‘Distant Reading for European Literary History’.

In addition, we provide plain text versions of the texts. However, these are best generated depending on individual needs using the script “get_text.py” (in the Scripts folder).

## Licence

-All texts are in the public domain and can be reused without restrictions. We don’t claim any copyright or other rights on the transcription, markup or metadata. If you use our texts, for example in research or teaching, please reference this collection using the citation suggestion below.
+All texts and scripts are in the public domain and can be reused without restrictions. We don’t claim any copyright or other rights on the transcription, markup or metadata. If you use our texts, for example in research or teaching, please reference this collection using the citation suggestion below.

## Metadata
The tsv-File xml-tei_full_metadata.tsv contains metadata on the XML/TEI-files. There is a short metadata version (xml-tei_metadata.tsv) and an extended metadata version (xml-tei_full_metadata.tsv).
@@ -37,4 +39,4 @@ The tsv-File xml-tei_full_metadata.tsv contains metadata on the XML/TEI-files. T

## Citation suggestion

-Collection de romans français du dix-huitième siècle (1750-1800) / Collection of Eighteenth-Century French Novels (1750-1800), edited by Julia Röttgermann, with contributions from Julia Dudar, Anne Klee, Johanna Konstanciak, Amelie Probst, Sarah Rebecca Ondraszek and Christof Schöch. Release v0.1.0. Trier: TCDH, 2020. URL: https://github.com/mimotext/roman18. DOI: https://doi.org/10.5281/zenodo.4061903
+Collection de romans français du dix-huitième siècle (1751-1800) / Eighteenth-Century French Novels (1751-1800), edited by Julia Röttgermann, with contributions from Julia Dudar, Henning Gebhard, Anne Klee, Johanna Konstanciak, Damir Padieu, Amelie Probst, Sarah Rebecca Ondraszek and Christof Schöch. Release v0.3.0. Trier: TCDH, 2023. URL: https://github.com/mimotext/roman18. DOI: http://doi.org/10.5281/zenodo.5040855.
2 changes: 1 addition & 1 deletion plain/readme.md
@@ -12,4 +12,4 @@ All texts and scripts are in the public domain and can be reused without restric

## Citation suggestion

-Collection de romans français du dix-huitième siècle (1750-1800) / Eighteenth-Century French Novels (1750-1800), edited by Julia Röttgermann, with contributions from Julia Dudar, Anne Klee, Johanna Konstanciak, Amelie Probst, Sarah Rebecca Ondraszek and Christof Schöch. Release v0.2.0. Trier: TCDH, 2021. URL: https://github.com/mimotext/roman18. DOI: http://doi.org/10.5281/zenodo.5040855.
+Collection de romans français du dix-huitième siècle (1751-1800) / Eighteenth-Century French Novels (1751-1800), edited by Julia Röttgermann, with contributions from Julia Dudar, Henning Gebhard, Anne Klee, Johanna Konstanciak, Damir Padieu, Amelie Probst, Sarah Rebecca Ondraszek and Christof Schöch. Release v0.3.0. Trier: TCDH, 2023. URL: https://github.com/mimotext/roman18. DOI: http://doi.org/10.5281/zenodo.5040855.
