From 01cb7470a3d0d099bf6eba6e4d98df7dc8378bd5 Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Thu, 1 Feb 2024 10:53:16 +0100 Subject: [PATCH 01/33] added annotation --- data/JTEI/10_2016-19/jtei-10-haaf-source.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml index 9ec8142f..a9ac8328 100644 --- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml @@ -1210,7 +1210,7 @@ corpora. Our primary goal is to be as inclusive as possible, allowing for other projects to benefit from our resources (i.e., our comprehensive guidelines and documentation as well as the technical infrastructure that includes Schemas, ODDs, and XSLT scripts) and + xml:id="R1" target="#XSLT"/>XSLT scripts) and contribute to our corpora. We also want to ensure interoperability of all data within the DTA corpora. The underlying TEI format has to be continuously maintained and adapted to new necessities with these two premises in mind.
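Note on the pattern introduced here: the hunk above adds the software-annotation markup that the rest of this series applies across the jTEI sources. An empty ptr element carries a local xml:id and a target pointing at an entry in taxonomy/software-list.xml, and, as the escaped markup visible in patch 06 shows, an rs element of type soft.name wraps the software name in the running text and refers back to the ptr by its id. A minimal sketch of the annotated passage, reassembled from the hunks in this series; the attribute values are illustrative, and the lower-case target (#xslt) is the form the series settles on in patch 04:

    ... the technical infrastructure that includes Schemas, ODDs, and
    <ptr type="software" xml:id="R1" target="#xslt"/><rs type="soft.name" ref="R1">XSLT</rs> scripts)
    and contribute to our corpora ...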

From 669e1d7ccabefe17b72228b1a785938a391edf9b Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Thu, 1 Feb 2024 11:52:31 +0100 Subject: [PATCH 02/33] added annotations and software entries --- data/JTEI/10_2016-19/jtei-10-haaf-source.xml | 10 ++++---- .../JTEI/10_2016-19/jtei-10-romary-source.xml | 24 +++++++++++-------- taxonomy/software-list.xml | 18 ++++++++++++++ 3 files changed, 38 insertions(+), 14 deletions(-) diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml index a9ac8328..3c31a814 100644 --- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml @@ -211,9 +211,10 @@ target="http://www.deutschestextarchiv.de/doku/software#cab"/>. as well as collaborative text correction and annotationSee DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv - (Collaborative Quality Assurance within the DTA), accessed January 28, 2017, . On the process of + level="a">DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv + (Collaborative Quality Assurance within the DTA), accessed January 28, 2017, . On the process of quality assurance in the DTA, see, for example, Haaf, Wiegand, and Geyken 2013.) is a matter of supporting scholarly projects in their usage of the DTA infrastructure, which is part of the DTA’s mission. Second, @@ -273,7 +274,8 @@ Since June 2014, nine complete volumes with a total of more than 3,500 manuscript pages have been manually transcribed, annotated in TEI XML, and published via the DTA infrastructure. Most of these manuscripts were keyed manually by a vendor and published at - an early stage in the web-based quality assurance platform DTAQ. There, the transcription + an early stage in the web-based quality assurance platform DTAQ. There, the transcription as well as the annotation of each document was checked and corrected, if necessary; DTAQ also provided the means to add additional markup, such as the tagging of person names (persName), directly at page level. After the process of quality control has diff --git a/data/JTEI/10_2016-19/jtei-10-romary-source.xml b/data/JTEI/10_2016-19/jtei-10-romary-source.xml index 3b75d2b6..39c9020a 100644 --- a/data/JTEI/10_2016-19/jtei-10-romary-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-romary-source.xml @@ -645,8 +645,8 @@ available at . In our proposal, the etym element has to be made recursive in order to allow the fine-grained representations we propose here. The corresponding ODD customization, together with - reference examples, is available on GitHub. and the + reference examples, is available on GitHub. and the fact that a change occurred within the contemporary lexicon (as opposed to its parent language) is indicated by means of xml:lang on the source form.There may also be cases in which it is unknown whether a given etymological process occurred @@ -768,8 +768,8 @@ text.The interested reader may ponder here the possibility to also encode scripts by means of the notation attribute instead of using a cluttering of language subtags on xml:lang. For more on this issue, see the proposal in - the TEI GitHub (GitHub (). This is why we have extended the notation attribute to orth in order to allow for better representation of both language identification and the orthographic content. With this double mechanism, we intend to @@ -987,7 +987,7 @@

The dateThe element date as a child of cit is another example which does not adhere to the current TEI standards. We have allowed this within our ODD document. A feature request proposal will be made on the GitHub page and this feature may or may not appear in future versions of the TEI Guidelines. element is listed within each etymon block; the values of attributes notBefore and notAfter specify the range of time @@ -1486,8 +1486,10 @@ extent of knowledge that is truly necessary to create an accurate model of metaphorical processes. In order to do this, it is necessary to make use of one or more ontologies, which could be locally defined within a project, and of external linked open data sources - such as DBpedia and Wikidata, or some combination thereof. Within + such as DBpedia and Wikidata, or some combination thereof. Within TEI dictionary markup, URIs for existing ontological entries can be referenced in the sense, usg, and ref elements as the value of the attribute corresp.
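Aside on the passage annotated above (from jtei-10-romary-source.xml): it describes pointing the sense, usg, and ref elements at linked open data URIs through the corresp attribute. A minimal sketch of such a reference, not taken from the patch itself; the DBpedia resource URI for horse is assumed to be the standard http://dbpedia.org/resource/Horse, and the entry content is elided:

    <sense corresp="http://dbpedia.org/resource/Horse">
      <def>…</def>
    </sense>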

@@ -1496,7 +1498,8 @@ reference to the source entry’s unique identifier (if such an entry exists within the dataset). In such cases, the etymon pointing to the source entry can be assumed to inherit the source’s domain and sense information, and this information can be automatically - extracted with a fairly simple XSLT program; thus the encoders may choose to leave some or + extracted with a fairly simple XSLT program; thus the encoders may choose to leave some or all of this information out of the etymon section. However, in the case that the dataset does not actually have entries for the source terms, or the encoder wants to be explicit in all aspects of the etymology, as mentioned above, the source domain and the @@ -1556,7 +1559,8 @@ type="metonymy") and the etymon (cit type="etymon") the source term’s URI is referenced in oRef and pRef as the value of corresp (@corresp="#animal").

-

In sense, the URI corresponding to the DBpedia entry for horse is the +

In sense, the URI corresponding to the DBpedia entry for horse is the value for the attribute corresp. Additionally, the date notBefore="…" element–attribute pairing is used to specify that the term has only been used for the horse since 1517 at maximum (corresponding to the first Spanish @@ -2486,7 +2490,7 @@

For the issues regarded as the most fundamentally important to creating a dynamic and sustainable model for both etymology and general lexicographic markup in TEI, we have submitted formal requests for changes to the TEI GitHub, and will continue to + target="#R8"/>GitHub, and will continue to submit change requests as needed. While this work represents a large step in the right direction for those looking for means of representing etymological information, there are still a number of unresolved issues that will need to be addressed. These remaining issues diff --git a/taxonomy/software-list.xml b/taxonomy/software-list.xml index f0dc2917..3a441351 100644 --- a/taxonomy/software-list.xml +++ b/taxonomy/software-list.xml @@ -1532,6 +1532,24 @@ and "born digital" writing. research + + DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv + http://www.deutschestextarchiv.de/dtaq/about + + research + + + DBpedia + http://wiki.dbpedia.org/ + + research + + + Wikidata + https://www.wikidata.org/ + + research + From 9ea4a245de29f54943e484ec85f546852fcc8794 Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Thu, 1 Feb 2024 13:51:23 +0100 Subject: [PATCH 03/33] added annotations and software entries --- .../jtei-cc-ra-bermudez-sabel-137-source.xml | 45 ++++++++++--------- taxonomy/software-list.xml | 21 ++++++++- 2 files changed, 44 insertions(+), 22 deletions(-) diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml index 0b87d585..225980e7 100644 --- a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml +++ b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml @@ -110,10 +110,11 @@ ways in which the variant taxonomy may be linked to the body of the edition.

Although this paper is TEI-centered, other XML technologies will be mentioned. includes a brief commentary on using XSLT + type="software" xml:id="R1" target="#XSLT"/>XSLT to transform a TEI-conformant definition of constraints into schema rules. However, the greatest attention to an additional technology is in , which discusses the use of XQuery to retrieve particular + target="#analyses"/>, which discusses the use of XQuery to retrieve particular loci critici and to deploy quantitative analyses.

@@ -211,13 +212,14 @@ neutralized.This statement is especially significant when dealing with corpora that have been compiled over a long period of time. As is clearly explained in the introduction to the Helsinki Corpus that Irma Taavitsainen and Päivi Pahta prepared for - the Corpus Resource Database (CoRD) (Placing the Helsinki Corpus Middle English Section Introduction into - Context, ): The idea of basing corpus texts directly on + the Corpus Resource Database (CoRD) (Placing the Helsinki Corpus Middle English Section Introduction into + Context, ): The idea of basing corpus texts directly on manuscript sources has been presented more recently The principles of preparing manuscript texts for print have undergone changes during the history of editing.

@@ -445,11 +447,12 @@ definition, its typed-feature modeling facilitates the creation of schema constraints. For instance, I process my declaration to further constrict my schema so the feature structure declaration and its actual application are always synchronized and up to date.I use - XSLT to process the feature structure declaration in order to create all required Schematron rules that will constrict the feature library accordingly. I am currently working on creating a more generic validator (see my Github repository, Github repository, ).
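Aside on the footnote annotated above: it describes an XSLT step that turns the article's TEI feature structure declaration into Schematron rules, so that the declaration and the generated constraints stay synchronized. A rule produced by such a step would look roughly like the following sketch, which assumes an XPath 2.0 query binding; the feature name and its value list are placeholders rather than the author's actual declaration:

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
      <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
      <sch:pattern>
        <sch:rule context="tei:f[@name = 'variant-type']">
          <sch:assert test="tei:symbol/@value = ('substitution', 'addition', 'omission', 'transposition')">
            The value of feature 'variant-type' must be one of the declared symbols.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>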
@@ -541,16 +544,16 @@ >parallel segmentation method (TEI Consortium 2016, 12.2.3) seems to be a popular encoding technique for multi-witness editions, in terms of both the specific tools that have been created for this method and the number - of projects that apply it.Tools include Versioning Machine, CollateX (both - the Java and Python versions), and Juxta. For + of projects that apply it.Tools include Versioning Machine, CollateX (both + the Java and Python versions), and Juxta. For representative projects using the parallel segmentation method see Satire in Circulation: James editions Russell Lowell’s Letter from a volunteer in diff --git a/taxonomy/software-list.xml b/taxonomy/software-list.xml index 3a441351..2f3d374b 100644 --- a/taxonomy/software-list.xml +++ b/taxonomy/software-list.xml @@ -787,7 +787,7 @@ CollateX - + https://collatex.net/ research @@ -1550,6 +1550,25 @@ research + + XQuery + + XML Query Language (XQuery) + programming language + general + + + Corpus Resource Database (CoRD) + + Corpus Resource Database (CoRD) + research + + + Versioning Machine + http://v-machine.org/ + + + From 1b7a4e7bd317ebe31be69246654c74acb6afd43b Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Thu, 1 Feb 2024 14:01:15 +0100 Subject: [PATCH 04/33] revised annotations and software entries (lower case IDs) --- data/JTEI/10_2016-19/jtei-10-haaf-source.xml | 2 +- data/JTEI/10_2016-19/jtei-10-romary-source.xml | 14 +++++++------- .../jtei-cc-ra-bermudez-sabel-137-source.xml | 18 +++++++++--------- taxonomy/software-list.xml | 14 +++++++------- 4 files changed, 24 insertions(+), 24 deletions(-) diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml index 3c31a814..1f4d360a 100644 --- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml @@ -1212,7 +1212,7 @@ corpora. Our primary goal is to be as inclusive as possible, allowing for other projects to benefit from our resources (i.e., our comprehensive guidelines and documentation as well as the technical infrastructure that includes Schemas, ODDs, and XSLT scripts) and + xml:id="R1" target="#xslt"/>XSLT scripts) and contribute to our corpora. We also want to ensure interoperability of all data within the DTA corpora. The underlying TEI format has to be continuously maintained and adapted to new necessities with these two premises in mind.

diff --git a/data/JTEI/10_2016-19/jtei-10-romary-source.xml b/data/JTEI/10_2016-19/jtei-10-romary-source.xml index 39c9020a..6a15e913 100644 --- a/data/JTEI/10_2016-19/jtei-10-romary-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-romary-source.xml @@ -646,7 +646,7 @@ the etym element has to be made recursive in order to allow the fine-grained representations we propose here. The corresponding ODD customization, together with reference examples, is available on GitHub.
and the + target="#github"/>GitHub.
and the fact that a change occurred within the contemporary lexicon (as opposed to its parent language) is indicated by means of xml:lang on the source form.There may also be cases in which it is unknown whether a given etymological process occurred @@ -768,7 +768,7 @@ text.The interested reader may ponder here the possibility to also encode scripts by means of the notation attribute instead of using a cluttering of language subtags on xml:lang. For more on this issue, see the proposal in - the TEI GitHub (). This is why we have extended the notation attribute to orth in order to allow for better representation of both language @@ -987,7 +987,7 @@

The dateThe element date as a child of cit is another example which does not adhere to the current TEI standards. We have allowed this within our ODD document. A feature request proposal will be made on the GitHub page and this feature may or may not appear in future versions of the TEI Guidelines. element is listed within each etymon block; the values of attributes notBefore and notAfter specify the range of time @@ -1488,7 +1488,7 @@ which could be locally defined within a project, and of external linked open data sources such as DBpedia and Wikidata, or some combination thereof. Within TEI dictionary markup, URIs for existing ontological entries can be referenced in the sense, usg, and ref elements as the value of the attribute @@ -1499,7 +1499,7 @@ dataset). In such cases, the etymon pointing to the source entry can be assumed to inherit the source’s domain and sense information, and this information can be automatically extracted with a fairly simple XSLT program; thus the encoders may choose to leave some or + target="#xslt"/>XSLT program; thus the encoders may choose to leave some or all of this information out of the etymon section. However, in the case that the dataset does not actually have entries for the source terms, or the encoder wants to be explicit in all aspects of the etymology, as mentioned above, the source domain and the @@ -2489,8 +2489,8 @@ Problematic and Unresolved Issues

For the issues regarded as the most fundamentally important to creating a dynamic and sustainable model for both etymology and general lexicographic markup in TEI, we have - submitted formal requests for changes to the TEI GitHub, and will continue to + submitted formal requests for changes to the TEI GitHub, and will continue to submit change requests as needed. While this work represents a large step in the right direction for those looking for means of representing etymological information, there are still a number of unresolved issues that will need to be addressed. These remaining issues diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml index 225980e7..698f3c52 100644 --- a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml +++ b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml @@ -110,11 +110,11 @@ ways in which the variant taxonomy may be linked to the body of the edition.

Although this paper is TEI-centered, other XML technologies will be mentioned. includes a brief commentary on using XSLT + type="software" xml:id="R1" target="#xslt"/>XSLT to transform a TEI-conformant definition of constraints into schema rules. However, the greatest attention to an additional technology is in , which discusses the use of XQuery to retrieve particular + target="#xquery"/>XQuery to retrieve particular loci critici and to deploy quantitative analyses.

@@ -447,12 +447,12 @@ definition, its typed-feature modeling facilitates the creation of schema constraints. For instance, I process my declaration to further constrict my schema so the feature structure declaration and its actual application are always synchronized and up to date.I use - XSLT to process the feature structure declaration in order to create all required Schematron rules that will constrict the feature library accordingly. I am currently working on creating a more generic validator (see my Github repository, Github repository, ).
@@ -545,14 +545,14 @@ 2016, 12.2.3) seems to be a popular encoding technique for multi-witness editions, in terms of both the specific tools that have been created for this method and the number of projects that apply it.Tools include Versioning Machine, CollateX (both - the Java and Java and Python versions), and Juxta. For representative projects using the parallel segmentation method see - + Python https://www.python.org/ Python Programming Language general programming language - + Java https://docs.oracle.com/en/java Java Programming Language @@ -473,7 +473,7 @@ Letras and the Daniel Cosío Villegas Library. research - + GitHub https://github.com @@ -542,7 +542,7 @@ Lucene general - + XSLT XSL Transformations (XSLT) @@ -1138,7 +1138,7 @@ - + Juxta https://github.com/performant-software/juxta-desktop Juxta is an open-source tool for comparing and collating @@ -1550,7 +1550,7 @@ research - + XQuery XML Query Language (XQuery) @@ -1563,7 +1563,7 @@ Corpus Resource Database (CoRD) research - + Versioning Machine http://v-machine.org/ From d44e8d517ef42890eb50e4d583fb7ec0f770f3c4 Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Thu, 1 Feb 2024 16:58:07 +0100 Subject: [PATCH 05/33] added annotations and software entries --- ...jtei-cc-ra-hannessschlaeger-164-source.xml | 45 +++++++++++-------- taxonomy/software-list.xml | 12 +++++ 2 files changed, 39 insertions(+), 18 deletions(-) diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml index 309ab6cf..012f3191 100644 --- a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml +++ b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml @@ -119,9 +119,11 @@ the gaps between them. Finally, I will illustrate my findings with the help of a concrete example, which is as TEI-specific as it can get: I will describe the history of the TEI Conference 2016 abstracts, which have, since the conference, been transformed into a TEI - data set that has been published not only on GitHub, but also in an - eXistdb-powered web application. This is by any standard a wonderful development for a + data set that has been published not only on GitHub, but also in an + eXistdb-powered web application. This is by any standard a wonderful development for a collection of textual data—and one that would not have been possible had the abstracts not been published under an open license, especially since their authors come from fourteen different countries.

@@ -444,16 +446,19 @@ explain how licensing played a vital role in enabling this transformation.

Submission of the Abstracts and the joys of ConfTool -

As it is every year, the conference management software ConfTool ProConfTool Conference - Management Software, ConfTool GmbH, accessed August 23, 2019, . was used for the submission of the +

As it is every year, the conference management software ConfTool ProConfTool Conference + Management Software, ConfTool GmbH, accessed August 23, 2019, . was used for the submission of the abstracts of the 2016 TEI conference. When the Vienna team received access to the ConfTool system, the instance for the 2016 conference had been equipped with default - settings based on previous TEI conference settings. As ConfTool is not the most + settings based on previous TEI conference settings. As ConfTool is not the most intuitive system to handle for a first-time administrator,The chair of the 2017 TEI conference program committee Kathryn Tomasek has described the rather tricky - structure of the system as the joys of ConfTool (email message to author, April + structure of the system as the joys of ConfTool (email message to author, April 11, 2017). one aspect was overlooked when setting up the system for the 2016 conference: the Copyright Transfer Terms and Licensing Policy that contributors had to agree to when submitting an abstract remained unchanged. It was @@ -527,23 +532,27 @@ Hannesschläger, and Wissik 2016a). Subsequently, the PDF of this printed book was made available via the conference website under the same license (Resch, Hannesschläger, and Wissik 2016b).

-

The page proofs that were transformed into this PDF had been created with Adobe - InDesign. The real fun started when the InDesign file was exported to XML and +

The page proofs that were transformed into this PDF had been created with Adobe + InDesign. The real fun started when the InDesign file was exported to XML and transformed back into single files (one file per abstract). These files were edited with - the Oxygen XML editor to become proper TEI files with extensive headers. Finally, they + the Oxygen XML editor to become proper TEI files with extensive headers. Finally, they were published as a repository together with the TEI schema on GitHub (GitHub (Hannesschläger and Schopper 2017), again under the same license. This allowed Martin Sievers, one of the abstract authors, to immediately correct a typing error in his abstract that the editors had overlooked (see history of Hannesschläger and Schopper - 2017 on GitHub).

+ 2017 on GitHub).

But the story did not end there. The freely available and processable collection of abstracts inspired Peter Andorfer, a colleague of the editors at the Austrian Centre for - Digital Humanities, to use this text collection to built an eXistdb-powered web - application (Andorfer and Hannesschläger - 2017). In the context of licensing issues, it is important to mention that + Digital Humanities, to use this text collection to built an eXistdb-powered web + application (Andorfer and Hannesschläger + 2017). In the context of licensing issues, it is important to mention that Andorfer was never approached by the editors or explicitly asked to process the TEI files, and he only informed the editors about the web application that he was building when it was already available online (as a work in progress, but diff --git a/taxonomy/software-list.xml b/taxonomy/software-list.xml index 4a7c7a92..17bb192d 100644 --- a/taxonomy/software-list.xml +++ b/taxonomy/software-list.xml @@ -1569,6 +1569,18 @@ + + ConfTool Conference Management Software + http://www.conftool.net/ + + + + + Adobe InDesign + + + + From 5e01e472facb63c08c6faf0b91c85e973031954c Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Sat, 3 Feb 2024 10:13:38 +0100 Subject: [PATCH 06/33] added and adpated annotations --- data/JTEI/10_2016-19/jtei-10-haaf-source.xml | 2 +- data/JTEI/10_2016-19/jtei-10-romary-source.xml | 4 ++-- .../11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml | 6 +++--- .../jtei-cc-ra-hannessschlaeger-164-source.xml | 9 ++++----- 4 files changed, 10 insertions(+), 11 deletions(-) diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml index 1f4d360a..0d66a6c9 100644 --- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml @@ -213,7 +213,7 @@ correction and annotationSee <ptr type="software" xml:id="R3" target="#dtaq"/><rs type="soft.name" ref="R3">DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv</rs> - (Collaborative Quality Assurance within the DTA), accessed January 28, 2017, . On the process of quality assurance in the DTA, see, for example, Haaf, Wiegand, and Geyken 2013.) is a matter of supporting scholarly projects diff --git a/data/JTEI/10_2016-19/jtei-10-romary-source.xml b/data/JTEI/10_2016-19/jtei-10-romary-source.xml index 6a15e913..bfb0c21e 100644 --- a/data/JTEI/10_2016-19/jtei-10-romary-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-romary-source.xml @@ -1487,8 +1487,8 @@ processes. In order to do this, it is necessary to make use of one or more ontologies, which could be locally defined within a project, and of external linked open data sources such as DBpedia and DBpedia and Wikidata, or some combination thereof. Within TEI dictionary markup, URIs for existing ontological entries can be referenced in the sense, usg, and ref elements as the value of the attribute diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml index 698f3c52..8e6a2bdf 100644 --- a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml +++ b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml @@ -546,13 +546,13 @@ in terms of both the specific tools that have been created for this method and the number of projects that apply it.Tools include Versioning Machine, Versioning Machine, CollateX (both the Java and Python versions), and Juxta. 
For representative projects using the parallel segmentation method see GitHub, but also in an eXistdb-powered web application. This is by any standard a wonderful development for a + target="#existdb"/>eXistdb-powered web application. This is by any standard a wonderful development for a collection of textual data—and one that would not have been possible had the abstracts not been published under an open license, especially since their authors come from fourteen different countries.

@@ -447,9 +446,9 @@
Submission of the Abstracts and the joys of ConfTool

As it is every year, the conference management software ConfTool ProConfTool Conference - Management Software, ConfTool GmbH, accessed August 23, 2019, . was used for the submission of the abstracts of the 2016 TEI conference. When the Vienna team received access to the ConfTool system, the instance for the 2016 conference had been equipped with default @@ -551,7 +550,7 @@ abstracts inspired Peter Andorfer, a colleague of the editors at the Austrian Centre for Digital Humanities, to use this text collection to built an eXistdb-powered web - application (Andorfer and Hannesschläger + application (Andorfer and Hannesschläger 2017). In the context of licensing issues, it is important to mention that Andorfer was never approached by the editors or explicitly asked to process the TEI files, and he only informed the editors about the web application that he was building From 0d49c94d8bfce14b6bd99b1a64432b4fe11f1965 Mon Sep 17 00:00:00 2001 From: Daniel Jettka Date: Sat, 3 Feb 2024 12:34:21 +0100 Subject: [PATCH 07/33] corrected XML --- taxonomy/software-list.xml | 1 + 1 file changed, 1 insertion(+) diff --git a/taxonomy/software-list.xml b/taxonomy/software-list.xml index d09b8efc..90a9f40f 100644 --- a/taxonomy/software-list.xml +++ b/taxonomy/software-list.xml @@ -1596,6 +1596,7 @@ + TEI Boilerplate https://github.com/TEI-Boilerplate/TEI-Boilerplate From c4cc2d011ad41891e099744f64fea1d77b1de894 Mon Sep 17 00:00:00 2001 From: GitHub Action Date: Sat, 3 Feb 2024 11:35:03 +0000 Subject: [PATCH 08/33] Add updated odd and generated rng --- schema/tei_software_annotation.rng | 22 +++++++++++++++++++--- schema/tei_software_annotation.xml | 8 ++++++++ 2 files changed, 27 insertions(+), 3 deletions(-) diff --git a/schema/tei_software_annotation.rng b/schema/tei_software_annotation.rng index 04874480..9df1cb91 100644 --- a/schema/tei_software_annotation.rng +++ b/schema/tei_software_annotation.rng @@ -5,7 +5,7 @@ xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://www.tei-c.org/ns/1.0"> @@ -22,5 +23,17 @@ + + + + + + + + + + + diff --git a/utilities/ids2lowercase.xsl b/utilities/ids2lowercase.xsl new file mode 100644 index 00000000..c09a02f1 --- /dev/null +++ b/utilities/ids2lowercase.xsl @@ -0,0 +1,22 @@ + + + + + + + + + + + + + + + + + + + + From 1d54f3365d9e002a92d84c29c66b062d1e516041 Mon Sep 17 00:00:00 2001 From: Anne Ferger Date: Mon, 5 Feb 2024 11:57:19 +0100 Subject: [PATCH 15/33] fix typo in url and adapted schema to new rs types --- schema/tei_jtei_annotated.odd | 336 +++++++++++--------------- utilities/addSoftwareList2Oddjtei.xsl | 4 +- 2 files changed, 147 insertions(+), 193 deletions(-) diff --git a/schema/tei_jtei_annotated.odd b/schema/tei_jtei_annotated.odd index bfa4557e..f68014df 100644 --- a/schema/tei_jtei_annotated.odd +++ b/schema/tei_jtei_annotated.odd @@ -1,12 +1,12 @@ + xmlns:eg="http://www.tei-c.org/ns/Examples" + xmlns:egXML="http://www.tei-c.org/ns/Examples" + xmlns:sch="http://purl.oclc.org/dsdl/schematron" + xmlns:sqf="http://www.schematron-quickfix.com/validator/process" + xmlns:tei="http://www.tei-c.org/ns/1.0" + xmlns:xsl="http://www.w3.org/1999/XSL/Transform" + xml:lang="en"> @@ -2039,17 +2039,17 @@

+ start="TEI" + defaultExceptions="http://www.tei-c.org/ns/1.0 eg:egXML"> + include="abbr author bibl biblScope cit date desc editor email emph foreign gap graphic head hi item label lb list listBibl mentioned name note num p pubPlace publisher q quote ptr ref resp respStmt rs series soCalled term title"/> + include="appInfo application availability catRef change classCode edition encodingDesc fileDesc idno keywords langUsage language licence listChange profileDesc projectDesc publicationStmt rendition revisionDesc seriesStmt sourceDesc tagsDecl teiHeader textClass titleStmt"/> + include="affiliation forename listPerson person placeName orgName roleName surname"/> @@ -2081,8 +2081,8 @@ Add a @type='main' attribute to the first title. main + node-type="attribute" + target="type">main @@ -2141,9 +2141,9 @@ + module="tei" + type="atts" + mode="delete"/> @@ -2153,17 +2153,17 @@ + module="tagdocs" + type="atts" + mode="change"> + module="transcr" + type="atts" + mode="delete"/> @@ -2451,20 +2451,20 @@ + module="namesdates" + type="atts" + mode="delete"/> + module="namesdates" + type="atts" + mode="delete"/> + module="tagdocs" + type="atts" + mode="delete"/> @@ -2492,23 +2492,23 @@ + module="tei" + type="atts" + mode="change"> + module="transcr" + type="atts" + mode="delete"/> + module="tagdocs" + type="atts" + mode="change"> @@ -2516,9 +2516,9 @@ + module="tei" + type="atts" + mode="delete"/> @@ -2532,18 +2532,18 @@ + module="core" + type="atts" + mode="delete"/> + module="header" + type="atts" + mode="delete"/> @@ -2557,9 +2557,9 @@ + module="tei" + type="atts" + mode="change"> @@ -2568,13 +2568,13 @@ + module="tagdocs" + type="atts" + mode="delete"/> + module="tei" + type="atts" + mode="change"> @@ -2594,9 +2594,9 @@ + module="tei" + type="atts" + mode="change"> @@ -2608,9 +2608,9 @@ + module="tei" + type="atts" + mode="change"> @@ -2629,7 +2629,7 @@ Author information in the <titleStmt> must include <name>, + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#structure"> Author information in the <titleStmt> must include <name>, <affiliation> and <email>. @@ -2649,8 +2649,8 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back" + role="warning"> A bibliographic entry should have a unique value for @xml:id. @@ -2659,9 +2659,9 @@ + role="warning"> This bibliographic entry is an orphan: no ref[@type="bibl"] references to it + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> This bibliographic entry is an orphan: no ref[@type="bibl"] references to it occur in the text. @@ -2669,8 +2669,8 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back" + role="warning"> A bibliographic entry should end with a single period. @@ -2679,7 +2679,7 @@ An analytic title and a journal title in a bibliographic entry should only be + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> An analytic title and a journal title in a bibliographic entry should only be separated by a comma or a period (or the end punctuation of the analytic title). @@ -2721,7 +2721,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#quotations"> is normally expected to have a bibliographic reference (ref[@type="bibl"]). Please make sure you intended not to add one here. 
@@ -2747,7 +2747,7 @@ A text division of type may only occur inside + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> A text division of type may only occur inside <front>. @@ -2756,14 +2756,14 @@ Only text divisions of type may appear in the <front>. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> Only text divisions of type may appear in the <front>. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> Bibliography ([@type="bibliography"]) and appendices ([@type="appendix"]) may only occur inside <back>. @@ -2775,7 +2775,7 @@ An editorial introduction ([@type="editorialIntroduction"]) may + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#body"> An editorial introduction ([@type="editorialIntroduction"]) may only occur inside <body>. @@ -2784,7 +2784,7 @@ A must contain a <head>. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#divs"> A must contain a <head>. @@ -3047,23 +3047,23 @@ Footnotes should follow punctuation marks, not precede them. Place your + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> Footnotes should follow punctuation marks, not precede them. Place your <> element after the punctuation mark. Footnotes should precede the dash, not follow it. Place your + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> Footnotes should precede the dash, not follow it. Place your <> element before the dash. Footnotes may be placed before closing parentheses, though this is + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> Footnotes may be placed before closing parentheses, though this is exceptional. Please check if this note's placement is correct. Otherwise, move it after the closing parenthesis. A footnote should end a with a single closing punctuation character. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> A footnote should end a with a single closing punctuation character. @@ -3072,7 +3072,7 @@ No block-level elements (<cit>, <table>, <figure>, + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> No block-level elements (<cit>, <table>, <figure>, <egXML>, <eg>, <list> which do not have the value inline for @rend) are allowed inside . @@ -3091,7 +3091,7 @@ Headings are numbered and labeled automatically, please remove the hard-coded + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#body"> Headings are numbered and labeled automatically, please remove the hard-coded label from the text. @@ -3100,7 +3100,7 @@ Figure titles (<head>) must have a type 'legend' or 'license'. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#figures"> Figure titles (<head>) must have a type 'legend' or 'license'. @@ -3175,7 +3175,7 @@ Multiple values in @target are only allowed for [@type='crossref']. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking"> Multiple values in @target are only allowed for [@type='crossref']. @@ -3215,7 +3215,7 @@ + test="id(substring-after(@source, '#'))/(self::tei:ref[@type eq 'bibl']|self::tei:bibl[ancestor::tei:body])"> must have a @source that points to the @xml:id of either a ref[type='bibl'], or a <bibl> in the <body>. 
@@ -3230,7 +3230,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#external_linking"> with multiple values for @target is not supported. @@ -3239,7 +3239,7 @@ Parentheses are not part of bibliographic references. Please move them out of + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking"> Parentheses are not part of bibliographic references. Please move them out of . @@ -3248,7 +3248,7 @@ A bibliographic reference must point with a @target to the @xml:id of an entry + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking"> A bibliographic reference must point with a @target to the @xml:id of an entry in the div[@type="bibliography"]. @@ -3257,8 +3257,8 @@ A bibliographic reference must be typed as @type="bibl". + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking" + sqf:fix="bibltype.add"> A bibliographic reference must be typed as @type="bibl". Add @type='bibl'. @@ -3343,7 +3343,7 @@ An article must have a keyword list in the header. This should be a list of + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#header"> An article must have a keyword list in the header. This should be a list of <term> elements in TEI/teiHeader/profileDesc/textClass/keywords @@ -3353,7 +3353,7 @@ An article must have a front section with an abstract (div[@type='abstract']). + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> An article must have a front section with an abstract (div[@type='abstract']). @@ -3362,7 +3362,7 @@ An article must have a back section with a bibliography + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> An article must have a back section with a bibliography (div[@type='bibliography']). @@ -3475,7 +3475,7 @@ If contains a div, and that div is not an editorial introduction, + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#body"> If contains a div, and that div is not an editorial introduction, then there should be more than one div. Rather than using only a single div, you may place the content directly in the element. @@ -3487,7 +3487,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> must have a bibliography (div[@type="bibliography"]), which must be organized inside a <listBibl> element. @@ -3498,60 +3498,14 @@ - - - - Bibliographieeintrag für Software - (Bib.Soft) - Die Bibliographie enthält einen - Eintrag für die Software selbst. Dieser Eintrag kann den Namen der Software - selbst enthalten, Namen von Verantwortlichen, eine URL, einen PID, - Versionsangaben, usw. - - - Bibliographieeintrag für - Referenzpublikation (Bib.Ref) - Die Bibliographie enthält einen - Eintrag mit einer Publikation über die Software. - - - Nur namentliche Nennung der - Software (Name.Only) - Die Software ist nur namentlich - genannt. - - - Namentliche Nennung der - Verantwortlichen (Agent) - Personen, Gruppen oder - Institutionen, die für die Entwicklung der Software verantwortlich sind, - werden namentlich genannt. - - - URL - Die Zitation enthält eine URL, die - auf die Software selbst verweist (z.B. zu einer Webseite über die Softwaree, - einem Code-Repositorium, einem Metadatensatz oder einer ausführbaren Version). - URLs zu Publikationen über die Software (z.B. zu Zeitschriftenartikeln, - Büchern oder auch User Manuals) werden nicht gezählt. 
- - - Persistenter Identifikator - (PID) - Die Zitation enthält einen - persistenten Identifikator (PID), z.B. eine DOI, der auf die Software selbst - verweist (z.B. zu einer Webseite über die Software, einem Code-Repositorium, - einem Metadatensatz oder einer ausführbaren Version). PIDs zu Publikationen - über die Software (z.B. zu Zeitschriftenartikeln, Büchern oder auch User - Manuals) werden nicht gezählt. - - - Version (Ver) - Die Zitation enthält die Angabe - einer bestimmten Softwareversion oder -revision und ggf. anderweitig - notwendige Spezifikationen (z.B. eine Version für ein spezifisches - Betriebssystem, ein bestimmtes Softwarepaket oder ein Datum). - + + + + + + + + @@ -3574,20 +3528,20 @@ + match="@target[starts-with(normalize-space(.), '#')]|@rendition[starts-with(normalize-space(.), '#')]" + use="for $i in tokenize(., '\s+') return substring-after($i, '#')"/> + value="('abstract', 'acknowledgements', 'authorNotes', 'editorNotes', 'corrections', 'dedication')"/> + value="'https://jenkins.tei-c.org/job/TEIP5/lastStableBuild/artifact/P5/release/doc/tei-p5-doc/VERSION'"/> + value="if (unparsed-text-available($tei.version.url)) then normalize-space(unparsed-text($tei.version.url)) else ()"/> @@ -3596,7 +3550,7 @@ Tag delimiters such as angle brackets and tag-closing slashes are not allowed + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_technical"> Tag delimiters such as angle brackets and tag-closing slashes are not allowed for : they are completed at processing time via XSLT. @@ -3607,7 +3561,7 @@ Attribute value delimiters are not allowed for : they are completed + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_technical"> Attribute value delimiters are not allowed for : they are completed at processing time via XSLT. @@ -3618,7 +3572,7 @@ Please remove square brackets from : they are completed at + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_rhetorical"> Please remove square brackets from : they are completed at processing time via XSLT. @@ -3629,7 +3583,7 @@ If a bibliographic entry has a formal DOI code, it should be placed at the + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> If a bibliographic entry has a formal DOI code, it should be placed at the very end of the bibliographic description. @@ -3646,7 +3600,7 @@ Width and height in pixels must be specified for any . + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#figures"> Width and height in pixels must be specified for any . @@ -3655,7 +3609,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#figures"> may only occur inside <figure>. @@ -3666,7 +3620,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> must have an abstract (div[@type='abstract']). @@ -3677,7 +3631,7 @@ No tables are allowed inside lists. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#lists"> No tables are allowed inside lists. @@ -3687,7 +3641,7 @@ A element should follow a period rather than precede it when an + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_rhetorical"> A element should follow a period rather than precede it when an ellipsis follows the end of a sentence. @@ -3696,7 +3650,7 @@ A should follow a period directly, without preceding whitespace. 
+ see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_rhetorical"> A should follow a period directly, without preceding whitespace. @@ -3705,10 +3659,10 @@ + role="warning"> + sqf:fix="apostrophe.replace" + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#faq"> "Straight apostrophe" characters are not permitted. Please use the Right Single Quotation Mark (U+2019 or ’) character instead. On the other hand, if the straight apostrophe characters function as quotation marks, please replace them with @@ -3726,9 +3680,9 @@ + role="warning"> + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#faq"> Left and Right Single Quotation Marks should be used in the right place. Please check their placement in this text node. @@ -3738,7 +3692,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#faq"> Quotation marks are not permitted in plain text. Please use appropriate mark-up that will ensure the appropriate quotation marks will be generated consistently. @@ -3763,7 +3717,7 @@ Numeric ranges should not be indicated with a hyphen. + sqf:fix="hyphen.replace"> Numeric ranges should not be indicated with a hyphen. Please use the EN Dash (U+2013 or –) character instead. @@ -3777,7 +3731,7 @@ + role="warning"> You should put a comma after "i.e." and "e.g.". @@ -3792,7 +3746,7 @@ + context="text()[not(ancestor::tei:eg|ancestor::eg:egXML|ancestor::tei:code|ancestor::tei:tag)]"> This text contains a non-breaking space character. Please consider changing this to a normal space character. @@ -3810,7 +3764,7 @@ + value="for $p in tokenize(., '\s+')[starts-with(., '#')] return if (not(id(substring-after($p, '#')))) then $p else ()"/> There's no local target for : . Please make sure you're referring to an existing @xml:id value. @@ -3821,7 +3775,7 @@ + value="for $p in tokenize(., '\s+')[starts-with(., '#')] return for $id in id(substring-after($p, '#'))[not(self::tei:rendition)] return $p"/> point to a <rendition> target: . @@ -3831,14 +3785,14 @@ Quotation mark delimiters are not allowed for + sqf:fix="quotation.remove"> Quotation mark delimiters are not allowed for : they are completed at processing time via XSLT. Remove quotation marks. + select="replace(., concat('^', $double.quotes, '|', $double.quotes, '$'), '')"/> @@ -3846,7 +3800,7 @@ + role="warning"> You're strongly advised to add an @xml:id attribute to to ease formal cross-referencing with (ptr|ref)[@type='crossref'] @@ -3856,7 +3810,7 @@ + role="warning"> Please replace literal references to tables, figures, examples, and sections with a formal crosslink: (ptr|ref)[@type="crossref"] @@ -3866,7 +3820,7 @@ + value="for $p in tokenize(@target, '\s+')[starts-with(., '#')] return for $id in id(substring-after($p, '#'))[not(self::tei:div or self::tei:figure or self::tei:table or self::tei:note)] return $p"/> Cross-links ([@type="crossref"]) should be targeted at <div>, <figure>, <table>, or <note> elements. The target of doesn't satisfy this condition: . @@ -3877,16 +3831,16 @@ Please type internal cross-references as 'crossref' + sqf:fix="crossreftype.add"> Please type internal cross-references as 'crossref' ([@type="crossref"]). Add @type='crossref'. + node-type="attribute" + target="type" + select="'crossref'"/> @@ -3914,8 +3868,8 @@ ). 
+ target="target" + select="replace(., '^https?://(www\.)?tei-c\.org/release/', concat('https://www.tei-c.org/Vault/P5/', $tei.version, '/'))"/> @@ -3923,9 +3877,9 @@ + role="warning"> + value="replace(., '^https?://(www\.)?jtei\.revues\.org/?', 'https://journals.openedition.org/jtei/')"/> Please refer to the correct jTEI URL: . diff --git a/utilities/addSoftwareList2Oddjtei.xsl b/utilities/addSoftwareList2Oddjtei.xsl index 43c8032e..4ab480d4 100644 --- a/utilities/addSoftwareList2Oddjtei.xsl +++ b/utilities/addSoftwareList2Oddjtei.xsl @@ -25,10 +25,10 @@ - + + match="//tei:TEI/tei:text/tei:back/tei:div/tei:schemaSpec/tei:elementSpec[@ident='rs']/tei:attList/tei:attDef[@ident='type']/tei:valList[1]"> From 6587590ea36b9efa879ee3968e1dc0faddb839a7 Mon Sep 17 00:00:00 2001 From: GitHub Action Date: Mon, 5 Feb 2024 10:58:00 +0000 Subject: [PATCH 16/33] Add updated odd and generated rng --- schema/tei_jtei_annotated.odd | 274 ++++++++++++++--------------- schema/tei_jtei_annotated.rng | 23 +-- schema/tei_software_annotation.rng | 6 +- 3 files changed, 144 insertions(+), 159 deletions(-) diff --git a/schema/tei_jtei_annotated.odd b/schema/tei_jtei_annotated.odd index f68014df..7e49308a 100644 --- a/schema/tei_jtei_annotated.odd +++ b/schema/tei_jtei_annotated.odd @@ -1,12 +1,12 @@ + xmlns:eg="http://www.tei-c.org/ns/Examples" + xmlns:egXML="http://www.tei-c.org/ns/Examples" + xmlns:sch="http://purl.oclc.org/dsdl/schematron" + xmlns:sqf="http://www.schematron-quickfix.com/validator/process" + xmlns:tei="http://www.tei-c.org/ns/1.0" + xmlns:xsl="http://www.w3.org/1999/XSL/Transform" + xml:lang="en"> @@ -2039,17 +2039,17 @@
+ start="TEI" + defaultExceptions="http://www.tei-c.org/ns/1.0 eg:egXML"> + include="abbr author bibl biblScope cit date desc editor email emph foreign gap graphic head hi item label lb list listBibl mentioned name note num p pubPlace publisher q quote ptr ref resp respStmt rs series soCalled term title"/> + include="appInfo application availability catRef change classCode edition encodingDesc fileDesc idno keywords langUsage language licence listChange profileDesc projectDesc publicationStmt rendition revisionDesc seriesStmt sourceDesc tagsDecl teiHeader textClass titleStmt"/> + include="affiliation forename listPerson person placeName orgName roleName surname"/> @@ -2081,8 +2081,8 @@ Add a @type='main' attribute to the first title. main + node-type="attribute" + target="type">main @@ -2141,9 +2141,9 @@ + module="tei" + type="atts" + mode="delete"/> @@ -2153,17 +2153,17 @@ + module="tagdocs" + type="atts" + mode="change"> + module="transcr" + type="atts" + mode="delete"/> @@ -2451,20 +2451,20 @@ + module="namesdates" + type="atts" + mode="delete"/> + module="namesdates" + type="atts" + mode="delete"/> + module="tagdocs" + type="atts" + mode="delete"/> @@ -2492,23 +2492,23 @@ + module="tei" + type="atts" + mode="change"> + module="transcr" + type="atts" + mode="delete"/> + module="tagdocs" + type="atts" + mode="change"> @@ -2516,9 +2516,9 @@ + module="tei" + type="atts" + mode="delete"/> @@ -2532,18 +2532,18 @@ + module="core" + type="atts" + mode="delete"/> + module="header" + type="atts" + mode="delete"/> @@ -2557,9 +2557,9 @@ + module="tei" + type="atts" + mode="change"> @@ -2568,13 +2568,13 @@ + module="tagdocs" + type="atts" + mode="delete"/> + module="tei" + type="atts" + mode="change"> @@ -2594,9 +2594,9 @@ + module="tei" + type="atts" + mode="change"> @@ -2608,9 +2608,9 @@ + module="tei" + type="atts" + mode="change"> @@ -2629,7 +2629,7 @@ Author information in the <titleStmt> must include <name>, + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#structure"> Author information in the <titleStmt> must include <name>, <affiliation> and <email>. @@ -2649,8 +2649,8 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back" + role="warning"> A bibliographic entry should have a unique value for @xml:id. @@ -2659,9 +2659,9 @@ + role="warning"> This bibliographic entry is an orphan: no ref[@type="bibl"] references to it + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> This bibliographic entry is an orphan: no ref[@type="bibl"] references to it occur in the text. @@ -2669,8 +2669,8 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back" + role="warning"> A bibliographic entry should end with a single period. @@ -2679,7 +2679,7 @@ An analytic title and a journal title in a bibliographic entry should only be + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> An analytic title and a journal title in a bibliographic entry should only be separated by a comma or a period (or the end punctuation of the analytic title). @@ -2721,7 +2721,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#quotations"> is normally expected to have a bibliographic reference (ref[@type="bibl"]). Please make sure you intended not to add one here. 
@@ -2747,7 +2747,7 @@ A text division of type may only occur inside + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> A text division of type may only occur inside <front>. @@ -2756,14 +2756,14 @@ Only text divisions of type may appear in the <front>. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> Only text divisions of type may appear in the <front>. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> Bibliography ([@type="bibliography"]) and appendices ([@type="appendix"]) may only occur inside <back>. @@ -2775,7 +2775,7 @@ An editorial introduction ([@type="editorialIntroduction"]) may + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#body"> An editorial introduction ([@type="editorialIntroduction"]) may only occur inside <body>. @@ -2784,7 +2784,7 @@ A must contain a <head>. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#divs"> A must contain a <head>. @@ -3047,23 +3047,23 @@ Footnotes should follow punctuation marks, not precede them. Place your + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> Footnotes should follow punctuation marks, not precede them. Place your <> element after the punctuation mark. Footnotes should precede the dash, not follow it. Place your + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> Footnotes should precede the dash, not follow it. Place your <> element before the dash. Footnotes may be placed before closing parentheses, though this is + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> Footnotes may be placed before closing parentheses, though this is exceptional. Please check if this note's placement is correct. Otherwise, move it after the closing parenthesis. A footnote should end a with a single closing punctuation character. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> A footnote should end a with a single closing punctuation character. @@ -3072,7 +3072,7 @@ No block-level elements (<cit>, <table>, <figure>, + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#footnotes"> No block-level elements (<cit>, <table>, <figure>, <egXML>, <eg>, <list> which do not have the value inline for @rend) are allowed inside . @@ -3091,7 +3091,7 @@ Headings are numbered and labeled automatically, please remove the hard-coded + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#body"> Headings are numbered and labeled automatically, please remove the hard-coded label from the text. @@ -3100,7 +3100,7 @@ Figure titles (<head>) must have a type 'legend' or 'license'. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#figures"> Figure titles (<head>) must have a type 'legend' or 'license'. @@ -3175,7 +3175,7 @@ Multiple values in @target are only allowed for [@type='crossref']. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking"> Multiple values in @target are only allowed for [@type='crossref']. @@ -3215,7 +3215,7 @@ + test="id(substring-after(@source, '#'))/(self::tei:ref[@type eq 'bibl']|self::tei:bibl[ancestor::tei:body])"> must have a @source that points to the @xml:id of either a ref[type='bibl'], or a <bibl> in the <body>. 
@@ -3230,7 +3230,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#external_linking"> with multiple values for @target is not supported. @@ -3239,7 +3239,7 @@ Parentheses are not part of bibliographic references. Please move them out of + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking"> Parentheses are not part of bibliographic references. Please move them out of . @@ -3248,7 +3248,7 @@ A bibliographic reference must point with a @target to the @xml:id of an entry + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking"> A bibliographic reference must point with a @target to the @xml:id of an entry in the div[@type="bibliography"]. @@ -3257,8 +3257,8 @@ A bibliographic reference must be typed as @type="bibl". + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#internal_linking" + sqf:fix="bibltype.add"> A bibliographic reference must be typed as @type="bibl". Add @type='bibl'. @@ -3343,7 +3343,7 @@ An article must have a keyword list in the header. This should be a list of + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#header"> An article must have a keyword list in the header. This should be a list of <term> elements in TEI/teiHeader/profileDesc/textClass/keywords @@ -3353,7 +3353,7 @@ An article must have a front section with an abstract (div[@type='abstract']). + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> An article must have a front section with an abstract (div[@type='abstract']). @@ -3362,7 +3362,7 @@ An article must have a back section with a bibliography + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> An article must have a back section with a bibliography (div[@type='bibliography']). @@ -3475,7 +3475,7 @@ If contains a div, and that div is not an editorial introduction, + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#body"> If contains a div, and that div is not an editorial introduction, then there should be more than one div. Rather than using only a single div, you may place the content directly in the element. @@ -3487,7 +3487,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> must have a bibliography (div[@type="bibliography"]), which must be organized inside a <listBibl> element. @@ -3528,20 +3528,20 @@ + match="@target[starts-with(normalize-space(.), '#')]|@rendition[starts-with(normalize-space(.), '#')]" + use="for $i in tokenize(., '\s+') return substring-after($i, '#')"/> + value="('abstract', 'acknowledgements', 'authorNotes', 'editorNotes', 'corrections', 'dedication')"/> + value="'https://jenkins.tei-c.org/job/TEIP5/lastStableBuild/artifact/P5/release/doc/tei-p5-doc/VERSION'"/> + value="if (unparsed-text-available($tei.version.url)) then normalize-space(unparsed-text($tei.version.url)) else ()"/> @@ -3550,7 +3550,7 @@ Tag delimiters such as angle brackets and tag-closing slashes are not allowed + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_technical"> Tag delimiters such as angle brackets and tag-closing slashes are not allowed for : they are completed at processing time via XSLT. 
@@ -3561,7 +3561,7 @@ Attribute value delimiters are not allowed for : they are completed + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_technical"> Attribute value delimiters are not allowed for : they are completed at processing time via XSLT. @@ -3572,7 +3572,7 @@ Please remove square brackets from : they are completed at + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_rhetorical"> Please remove square brackets from : they are completed at processing time via XSLT. @@ -3583,7 +3583,7 @@ If a bibliographic entry has a formal DOI code, it should be placed at the + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#back"> If a bibliographic entry has a formal DOI code, it should be placed at the very end of the bibliographic description. @@ -3600,7 +3600,7 @@ Width and height in pixels must be specified for any . + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#figures"> Width and height in pixels must be specified for any . @@ -3609,7 +3609,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#figures"> may only occur inside <figure>. @@ -3620,7 +3620,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#front"> must have an abstract (div[@type='abstract']). @@ -3631,7 +3631,7 @@ No tables are allowed inside lists. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#lists"> No tables are allowed inside lists. @@ -3641,7 +3641,7 @@ A element should follow a period rather than precede it when an + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_rhetorical"> A element should follow a period rather than precede it when an ellipsis follows the end of a sentence. @@ -3650,7 +3650,7 @@ A should follow a period directly, without preceding whitespace. + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#inline_rhetorical"> A should follow a period directly, without preceding whitespace. @@ -3659,10 +3659,10 @@ + role="warning"> + sqf:fix="apostrophe.replace" + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#faq"> "Straight apostrophe" characters are not permitted. Please use the Right Single Quotation Mark (U+2019 or ’) character instead. On the other hand, if the straight apostrophe characters function as quotation marks, please replace them with @@ -3680,9 +3680,9 @@ + role="warning"> + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#faq"> Left and Right Single Quotation Marks should be used in the right place. Please check their placement in this text node. @@ -3692,7 +3692,7 @@ + see="https://tei-c.org/release/doc/tei-p5-exemplars/html/tei_jtei.doc.html#faq"> Quotation marks are not permitted in plain text. Please use appropriate mark-up that will ensure the appropriate quotation marks will be generated consistently. @@ -3717,7 +3717,7 @@ Numeric ranges should not be indicated with a hyphen. + sqf:fix="hyphen.replace"> Numeric ranges should not be indicated with a hyphen. Please use the EN Dash (U+2013 or –) character instead. @@ -3731,7 +3731,7 @@ + role="warning"> You should put a comma after "i.e." and "e.g.". @@ -3746,7 +3746,7 @@ + context="text()[not(ancestor::tei:eg|ancestor::eg:egXML|ancestor::tei:code|ancestor::tei:tag)]"> This text contains a non-breaking space character. Please consider changing this to a normal space character. 
@@ -3764,7 +3764,7 @@ + value="for $p in tokenize(., '\s+')[starts-with(., '#')] return if (not(id(substring-after($p, '#')))) then $p else ()"/> There's no local target for : . Please make sure you're referring to an existing @xml:id value. @@ -3775,7 +3775,7 @@ + value="for $p in tokenize(., '\s+')[starts-with(., '#')] return for $id in id(substring-after($p, '#'))[not(self::tei:rendition)] return $p"/> point to a <rendition> target: . @@ -3785,14 +3785,14 @@ Quotation mark delimiters are not allowed for + sqf:fix="quotation.remove"> Quotation mark delimiters are not allowed for : they are completed at processing time via XSLT. Remove quotation marks. + select="replace(., concat('^', $double.quotes, '|', $double.quotes, '$'), '')"/> @@ -3800,7 +3800,7 @@ + role="warning"> You're strongly advised to add an @xml:id attribute to to ease formal cross-referencing with (ptr|ref)[@type='crossref'] @@ -3810,7 +3810,7 @@ + role="warning"> Please replace literal references to tables, figures, examples, and sections with a formal crosslink: (ptr|ref)[@type="crossref"] @@ -3820,7 +3820,7 @@ + value="for $p in tokenize(@target, '\s+')[starts-with(., '#')] return for $id in id(substring-after($p, '#'))[not(self::tei:div or self::tei:figure or self::tei:table or self::tei:note)] return $p"/> Cross-links ([@type="crossref"]) should be targeted at <div>, <figure>, <table>, or <note> elements. The target of doesn't satisfy this condition: . @@ -3831,16 +3831,16 @@ Please type internal cross-references as 'crossref' + sqf:fix="crossreftype.add"> Please type internal cross-references as 'crossref' ([@type="crossref"]). Add @type='crossref'. + node-type="attribute" + target="type" + select="'crossref'"/> @@ -3868,8 +3868,8 @@ ). + target="target" + select="replace(., '^https?://(www\.)?tei-c\.org/release/', concat('https://www.tei-c.org/Vault/P5/', $tei.version, '/'))"/> @@ -3877,9 +3877,9 @@ + role="warning"> + value="replace(., '^https?://(www\.)?jtei\.revues\.org/?', 'https://journals.openedition.org/jtei/')"/> Please refer to the correct jTEI URL: . 
diff --git a/schema/tei_jtei_annotated.rng b/schema/tei_jtei_annotated.rng index abd8c137..e807ff99 100644 --- a/schema/tei_jtei_annotated.rng +++ b/schema/tei_jtei_annotated.rng @@ -5,7 +5,7 @@ xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://www.tei-c.org/ns/1.0"> - + diff --git a/schema/tei_jtei_annotated.rng b/schema/tei_jtei_annotated.rng index 0db65ae0..68f6d920 100644 --- a/schema/tei_jtei_annotated.rng +++ b/schema/tei_jtei_annotated.rng @@ -5,7 +5,7 @@ xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://www.tei-c.org/ns/1.0"> - + diff --git a/schema/tei_jtei_annotated.rng b/schema/tei_jtei_annotated.rng index 68f6d920..c050aed3 100644 --- a/schema/tei_jtei_annotated.rng +++ b/schema/tei_jtei_annotated.rng @@ -5,7 +5,7 @@ xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://www.tei-c.org/ns/1.0"> - + From 03cf79dd40bd61e086d42cbb5ad8eb20c5365776 Mon Sep 17 00:00:00 2001 From: GitHub Action Date: Mon, 5 Feb 2024 11:45:13 +0000 Subject: [PATCH 23/33] Add updated odd and generated rng --- schema/tei_jtei_annotated.odd | 2 +- schema/tei_jtei_annotated.rng | 21 +++++++++++++++++---- schema/tei_software_annotation.rng | 6 +++--- 3 files changed, 21 insertions(+), 8 deletions(-) diff --git a/schema/tei_jtei_annotated.odd b/schema/tei_jtei_annotated.odd index 3a34bbf7..579d460e 100644 --- a/schema/tei_jtei_annotated.odd +++ b/schema/tei_jtei_annotated.odd @@ -3498,7 +3498,7 @@ - + diff --git a/schema/tei_jtei_annotated.rng b/schema/tei_jtei_annotated.rng index c050aed3..55233a49 100644 --- a/schema/tei_jtei_annotated.rng +++ b/schema/tei_jtei_annotated.rng @@ -5,7 +5,7 @@ xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://www.tei-c.org/ns/1.0">
@@ -673,7 +743,8 @@

The second example is structured using time references. This example (see and ) corresponds to the Praat example above (see ) corresponds to the + Praat example above (see , ). In this case, each part of the transcription is represented according to the timeline, but there is also a hierarchy which is represented by the spanGrp and span tags. Each @@ -686,8 +757,8 @@ s37).

- ELAN example of a temporal division + ELAN example of a temporal division
@@ -715,17 +786,18 @@

The spanGrp and span offer a generic representation of data coming from relatively unconstrained representations produced by partition software. The - names of the tiers used in the ELAN and Praat tools are given in the content of + names of the tiers used in the ELAN and + Praat tools are given in the content of the type attribute. These names are not used to provide structural information, the structure being represented only by the spanGrp and span hierarchy. However, the organization into spanGrp and span is not always sufficient to represent all the details of the tier organization of each software feature. This is the case for some of the ELAN structures, which can specify the nature of span elements further than in the TEI feature. For example, the timediv - ELAN property specifies that only contiguous temporal division is allowed, whereas the incl property allows non-contiguous elements. It was therefore necessary to include the type of organization in the header of the TEI file, @@ -737,12 +809,16 @@

Exporting to Research Tools -

In the TEICORPO approach, no modification is made to the original format and conversion +

In the + TEICORPO approach, no modification is made to the original format and conversion remains as lossless as possible. This allows for all types of corpora to be stored for long-term preservation purposes. It also allows the corpora to be used with other - editing tools, some of which are suited to specific processing: for example, Praat for - phonetics/phonology; Transcriber/CLAN for raw transcription; and ELAN for gesture + editing tools, some of which are suited to specific processing: for example, + Praat for + phonetics/phonology; + Transcriber/ + CLAN for raw transcription; and ELAN for gesture and visual coding.

However, a large proportion of scientific research and applications done using corpora requires further processing of the data. For example, although querying or using raw @@ -754,7 +830,8 @@ structure. This microstructure is integrated in Schmidt’s approach, in which the TEI file can contain standardized information about words, specific spoken language information, and sometimes even POS information.

-

This approach was not adopted in TEICORPO for several reasons. First, we had to deal +

This approach was not adopted in + TEICORPO for several reasons. First, we had to deal with a large variety of coding approaches, which makes it difficult to conduct work similar to that done in CHILDES (MacWhinney 2000; see ). Second, there was no @@ -769,7 +846,8 @@ span elements without modifying the original u element information. Second, we decided to design another category of tools for processing or making it possible to process the spoken language corpus, and to use powerful tools in corpus - analysis. This part of the TEICORPO library is described in the next section.
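To make the annotation-layer idea above concrete, here is a minimal Python sketch of the kind of structure being described: the original u element is left untouched and a parallel spanGrp carries one analysis tier, with the tier name in the type attribute. The annotationBlock wrapper and the use of span/@target to point back at the utterance are assumptions made for illustration; the article's own figures show the authoritative encoding produced by TEICORPO.

# Schematic illustration only (not TEICORPO code): an utterance plus a parallel
# annotation tier encoded as spanGrp/span, leaving the original u untouched.
# The annotationBlock wrapper and span/@target linkage are assumptions here.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)

def tei(tag: str) -> str:
    return f"{{{TEI_NS}}}{tag}"

block = ET.Element(tei("annotationBlock"))
u = ET.SubElement(block, tei("u"), {XML_ID: "u1"})
u.text = "si c'est comme ça je m'en vais"

# One spanGrp per analysis tier; the tier name goes into @type, as described
# above. The tags follow the PERCEO-style labels quoted later in the article
# (KON, PRO:cls, VER:pres, ...).
pos_tier = ET.SubElement(block, tei("spanGrp"), {"type": "pos"})
span = ET.SubElement(pos_tier, tei("span"), {"target": "#u1"})
span.text = "KON PRO:cls VER:pres ..."

print(ET.tostring(block, encoding="unicode"))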

+ analysis. This part of the + TEICORPO library is described in the next section.

@@ -792,16 +870,22 @@
Basic Import and Export Functions

The command-line interface (see ) can - perform conversions between TEI and the formats used by the following programs: CLAN, - ELAN, Praat, and Transcriber. The conversions can be performed on single files + perform conversions between TEI and the formats used by the following programs: + + CLAN, + ELAN, + Praat, and + Transcriber. The conversions can be performed on single files or on whole directories or on a file tree. The command-line interface is suited to - automatic processing in offline environments. The online interface (see ) can convert one or several files + automatic processing in offline environments. The online interface (see + ) can convert one or several files selected by the user, but not whole directories. Results appear in the user’s download folder.
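As a concrete illustration of driving these conversions from a script, the following minimal Python sketch walks a directory tree and calls the Java command-line interface once per file. Only the general java -cp invocation pattern comes from this article; the converter's entry-point class, the position of the input file, and any per-format options are deliberately left as placeholders to be filled in from the TEICORPO documentation.

# Minimal batch-conversion sketch. The entry-point class and argument layout are
# placeholders (supply them from the TEICORPO documentation); only the general
# "java -cp teicorpo.jar <class> <file>" pattern is taken from this article.
import pathlib
import subprocess

def convert_tree(root_dir: str, entry_point: str, extension: str = ".eaf") -> None:
    """Convert every transcription with the given extension under root_dir."""
    for path in sorted(pathlib.Path(root_dir).rglob(f"*{extension}")):
        subprocess.run(
            ["java", "-cp", "teicorpo.jar", entry_point, str(path)],
            check=True,  # stop on the first failed conversion
        )

if __name__ == "__main__":
    # Hypothetical call: ELAN files under corpus/ converted to the TEI pivot format.
    convert_tree("corpus/", entry_point="<TEICORPO converter class>")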

In addition to the conversion to and from the alignment software, the online version of - TEICORPO offers import and export in common spreadsheet formats (.xlsx and .csv) and + + TEICORPO offers import and export in common spreadsheet formats (.xlsx and .csv) and word processing formats (.docx and .txt). Importing data is useful to create new data, and exporting is used to make reports or examples for a publication and for end users not familiar with transcription tasks or computer software (see

Other features are available in both types of interface (command line and web service). - TEICORPO allows the user to exclude some tiers, for example adult tiers in acquisition + + TEICORPO allows the user to exclude some tiers, for example adult tiers in acquisition research where the user wants to study child production only, or comment tiers which are not necessary for some studies.
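The spreadsheet exports mentioned just above boil down to one row per utterance. The short sketch below shows the kind of flat table such an export produces; the column names and time values are illustrative, not TEICORPO's actual export layout, and the sample utterances echo the IRAMUTEQ example shown later in this section.

# Illustrative only: a flat, one-row-per-utterance table of the kind a
# spreadsheet export produces. Column names and times are examples.
import csv

rows = [
    {"speaker": "MOT", "start": "0.00", "end": "1.80", "text": "you have to rest now ?"},
    {"speaker": "CHI", "start": "1.80", "end": "2.30", "text": "yes ."},
]

with open("transcript.csv", "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["speaker", "start", "end", "text"])
    writer.writeheader()
    writer.writerows(rows)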

Export to Specialized Software -

Another kind of export concerns textometric software. TEICONVERT makes spoken language - data available for TXM (Heiden 2010; see ), Le Trameur (Fleury and Zimina 2014; see ), and Iramuteq (see and de Souza et - al. 2018), providing a dedicated TEI export for these tools. For example, for - the Another kind of export concerns textometric software. + TEICONVERT makes spoken language + data available for TXM (Heiden 2010; see ), + Le Trameur (Fleury and Zimina 2014; see ), and + Iramuteq (see and de Souza et + al. 2018), providing a dedicated TEI export for these tools. For example, for + the TXM software, the export includes a text element made of utterance elements including age and speaker attributes. - presents an example for the TXM software.

+ presents an example for the TXM software.

@@ -902,8 +992,8 @@ - Example of XML for the TXM software + Example of XML for the TXM software

An export has been developed for Lexico and Le Trameur textometric software with a @@ -917,7 +1007,8 @@ Example of export for the Lexico or Le Trameur software

-

Likewise, another export is available for the textometric tool Iramuteq without +

Likewise, another export is available for the textometric tool Iramuteq without timelines (see ).

**** -*MOT you have to rest now ? -*CHI yes . -*MOT from your big singing @@ -925,7 +1016,8 @@ sure was some party . Example of export for the IRAMUTEQ software
-

In all these cases, TEICORPO is able to provide an export file and to remove +

In all these cases, + TEICORPO is able to provide an export file and to remove unnecessary information from the TEI pivot format. This is useful, for example, with textometric software, which works only with orthographic tiers without a timeline or dependent information.
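The kind of reduction described here, keeping only the orthographic utterances and dropping the timeline and dependent tiers, can be sketched in a few lines of Python. This is an illustration of the idea rather than the converter's own code, and it assumes a file in the standard TEI namespace with u elements for the utterances.

# Illustration of the reduction described above: keep only the orthographic
# utterances of a TEI transcription, dropping timeline and dependent tiers.
# Assumes the standard TEI namespace and u elements; not TEICORPO's own code.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def orthographic_only(tei_file: str) -> list[str]:
    lines = []
    for u in ET.parse(tei_file).iter(f"{TEI}u"):
        # itertext() flattens the utterance, ignoring child markup such as
        # anchors; whitespace is then normalized.
        text = " ".join("".join(u.itertext()).split())
        if text:
            lines.append(text)
    return lines

if __name__ == "__main__":
    for line in orthographic_only("sample.tei_corpo.xml"):
        print(line)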

@@ -938,25 +1030,36 @@ linguistic research. A present difficulty with these grammatical analyzers is that most often they run only on raw orthographic material, excluding other information. Moreover, their results are not always in a format that can be used with traditional spoken - language software such as CLAN, ELAN, Praat, Transcriber, nor of course in TEI + language software such as + CLAN + , ELAN, + Praat, + Transcriber, nor of course in TEI format.

-

TEICORPO provides a way to solve this problem by running analyzers and putting the +

+ TEICORPO provides a way to solve this problem by running analyzers and putting the results from the analysis back into TEI format. Once the TEI format has been enriched with grammatical information, it is possible to use the results and convert them back to - ELAN or Praat and use the grammatical information in these spoken language - software packages. It is also possible to export to TXM and to use the grammatical + ELAN or + Praat and use the grammatical information in these spoken language + software packages. It is also possible to export to TXM and to use the grammatical information in the textometric software. Two grammatical analyzers have been implemented - in TEICORPO: TreeTagger and CoreNLP.

+ in + TEICORPO: + TreeTagger and + CoreNLP.

- TreeTagger -

TreeTagger -

Accessed March 11, 2021, .

- (Schmid 1994; 1995) is a tool for annotating text with part-of-speech + + TreeTagger +

+ TreeTagger +

Accessed March 11, 2021, .

+ (Schmid 1994; 1995) is a tool for annotating text with part-of-speech and lemma information. The software is freely available for research, education, and evaluation. It is available in twenty-five languages, provides high-quality results, and can be easily improved by enriching the training set, as was done for instance by @@ -964,9 +1067,14 @@ the PERCEO project. They defined a syntactic model suitable for spoken language corpora, using the training feature of TreeTagger and an iterative process including manual corrections to improve the results of the automatic tool.

-

The command-line version of TEICORPO should be used to generate an annotated file - with lemma and POS information based on Treetagger. TreeTagger should be installed - separately. The implementation of TreeTagger in TEICORPO includes the ability to use +

The command-line version of + TEICORPO should be used to generate an annotated file + with lemma and POS information based on + TreeTagger. + TreeTagger should be installed + separately. The implementation of + TreeTagger in + TEICORPO includes the ability to use any syntactic model. For French data, we used the PERCEO model (Benzitoun, Fort, and Sagot 2012).

The command line to be used is: java -cp TEICORPO.jar @@ -982,13 +1090,15 @@

-model filename

-

filename is the full name of the TreeTagger syntactic model. In +

filename is the full name of the + TreeTagger syntactic model. In our case, we use the PERCEO model.

-program filename

-

filename is the full location of the TreeTagger program, according +

filename is the full location of the + TreeTagger program, according to the system used (Windows, MacOS, or Linux).

@@ -999,7 +1109,8 @@

The environment variable TREE_TAGGER can be used to locate the model and the program. - If no -program option is used, the default name for the TreeTagger + If no -program option is used, the default name for the + TreeTagger program is used.

The -model parameter is mandatory.
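Putting the options above together, a small wrapper might look like the following sketch. The entry-point class after java -cp is not reproduced in this excerpt, so it is passed in by the caller, and the position of the input file on the command line is an assumption; -model, -program, and the TREE_TAGGER environment variable behave as just described.

# Sketch of the TreeTagger annotation call described above. -model is mandatory;
# without -program, the default TreeTagger program name is used (the TREE_TAGGER
# environment variable can also locate the model and the program). The
# entry-point class and the position of the input file are assumptions.
import subprocess

def run_treetagger(tei_file: str, entry_point: str, model: str,
                   program: str | None = None) -> None:
    """Annotate one TEI file with POS and lemma information via TreeTagger."""
    cmd = ["java", "-cp", "teicorpo.jar", entry_point, tei_file, "-model", model]
    if program:
        cmd += ["-program", program]
    subprocess.run(cmd, check=True)
    # The result is written with a name ending in .tei_corpo_ttg.tei_corpo.xml.

if __name__ == "__main__":
    run_treetagger("interview.tei_corpo.xml",
                   entry_point="<TreeTagger entry point from the TEICORPO docs>",
                   model="perceo")  # French model used in the article; path illustrative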

The resulting filename ends with .tei_corpo_ttg.tei_corpo.xml or a @@ -1115,20 +1226,26 @@

- Stanford CoreNLP -

The Stanford Core Natural Language Processing -

Accessed March 11, 2021, .

- (CoreNLP) package is a suite of tools (Manning et al. 2014) that can be used under a GNU General Public License. The + + Stanford CoreNLP +

+ The Stanford Core Natural Language Processing +

Accessed March 11, 2021, .

+ ( + CoreNLP) package is a suite of tools (Manning et al. 2014) that can be used under a GNU General Public License. The suite provides several tools such as a tokenizer, a POS tagger, a parser, a named entity recognizer, temporal tagging, and coreference resolution. All the tools are available for English, but only some of them are available for all languages. All - software libraries are integrated into Java JAR files, so all that is + software libraries are integrated into Java JAR files, so all that is required is to download JAR files from the CoreNLP website

Accessed May 5, 2021, .

-
to use them with TEICORPO. Using the analyzer is similar to using TreeTagger. + to use them with + TEICORPO. Using the analyzer is similar to using + TreeTagger + . The -model and -syntaxformat parameters can be used in a similar way to specify the grammatical model to be used and the output format. A command line example is:

java -cp "teicorpo.jar:directory_for_SNLP/*" fr.ortolang.teicorpo.TeiSNLP @@ -1136,7 +1253,7 @@

The directory_for_SNLP is the name of the location on a computer where all the CoreNLP JAR files can be found. Note that using the CoreNLP software makes heavy demands on the computer’s memory resources and it is necessary to instruct the - Java software to use a large amount of memory (for example to insert parameter -mx5g before parameter -cp to indicate that 5 GB of memory will be used for a full English analysis).
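The full call, with the memory flag recommended above, can be scripted as follows. The entry point and classpath layout are the ones quoted in this section; passing the TEI file as the final argument is an assumption, and -model or -syntaxformat can be appended in the same way as for TreeTagger.

# Sketch of the CoreNLP call shown above, with -mx5g placed before -cp as the
# text recommends. ':' is the Unix classpath separator (use ';' on Windows).
import subprocess

def run_corenlp(tei_file: str, snlp_dir: str = "directory_for_SNLP") -> None:
    cmd = [
        "java", "-mx5g",                      # 5 GB for a full English analysis
        "-cp", f"teicorpo.jar:{snlp_dir}/*",  # TEICORPO plus the CoreNLP jars
        "fr.ortolang.teicorpo.TeiSNLP",       # entry point given in the article
        tei_file,                             # assumption: input passed last
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_corenlp("interview.tei_corpo.xml")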

@@ -1152,18 +1269,20 @@
Exporting the Grammatical Analysis

The results from the grammatical analysis can be used in transcription files such as - those used by Praat and ELAN. A partition-like visual presentation of data + those used by + Praat and ELAN. A partition-like visual presentation of data is very handy to represent a part of speech or a CONLL result. The orthographic line will appear at the top with divisions into words, divisions into parts of speech, and other syntactic information below. As the result of the analysis can contain a large number of tiers (each speaker will have as many tiers as there are elements in the grammatical analysis: for example, word, POS, and lemma for TreeTagger; ten tiers for CoreNLP full analysis), it is helpful to limit the number of visible tiers, either using - the -a option of TEICORPO, or limiting the display with the annotation + the -a option of + TEICORPO, or limiting the display with the annotation tool.

-

An example is presented below in the ELAN tool (see An example is presented below in the ELAN tool (see ). The original utterance was si c’est comme ça je m’en vais (if that’s how it is, I’m leaving). It is displayed in the first line, highlighted in pink. The analysis into words (second line, consisting of numbers), @@ -1173,20 +1292,23 @@ (is).

- Example of TreeTagger analysis representation in a partition + Example of + TreeTagger + analysis representation in a partition software program

Export can be done from TEI into a format used by textometric software (see ). This is the case for TXM, -

See the Textométrie website, last updated June 29, 2020, .

+ type="software" xml:id="R160" target="#txm"/>TXM, +

See the Textométrie website, last updated June 29, 2020, .

a textometric software application. In this case, instead of using a partition representation, the information from the grammatical analysis is inserted at the word level in an XML structure. For example, in the case below, the TXM export includes - Treetagger annotations in POS, adding lemma and pos attributes to + xml:id="R161" target="#txm"/>TXM export includes + + TreeTagger annotations in POS, adding lemma and pos attributes to the word element w.
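Because the grammatical information ends up as plain pos and lemma attributes on w, an export of this shape is easy to consume outside TXM as well. The sketch below tallies part-of-speech frequencies; the element name and attributes are taken from this section, while the tolerant handling of an optional namespace and the file name are assumptions.

# Small consumer for an export of the shape described above: w elements carrying
# pos and lemma attributes. Tolerates an optional namespace prefix on the tag.
import xml.etree.ElementTree as ET
from collections import Counter

def pos_frequencies(txm_file: str) -> Counter:
    counts: Counter = Counter()
    for _, elem in ET.iterparse(txm_file):
        if elem.tag.rsplit("}", 1)[-1] == "w":
            counts[elem.get("pos", "UNKNOWN")] += 1
    return counts

if __name__ == "__main__":
    for pos, n in pos_frequencies("export_for_txm.xml").most_common():
        print(f"{pos}\t{n}")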

@@ -1218,44 +1340,63 @@ - Example of TreeTagger analysis representation that can be imported - into Example of + + TreeTagger analysis representation that can be imported + into TXM
Comparison with Other Software Suites -

The additional functionalities available in the TEICORPO suite are close to those - available in the Weblicht web services (Hinrichs, Hinrichs, and Zastrow 2010). To a certain extent, the two suites of - tools (Weblicht and TEICORPO) have the same purpose and functionalities. They can import +

The additional functionalities available in the + TEICORPO suite are close to those + available in the + Weblicht web services ( Hinrichs, Hinrichs, and Zastrow 2010). To a certain extent, the two suites of + tools ( + + Weblicht and + TEICORPO) have the same purpose and functionalities. They can import data from various formats, run similar processes on the data, and export the data for - scientific uses. In some cases, the services could complement each other or TEICORPO - could be integrated in the Weblicht services. This is the case, for example, for - handling the CHILDES format, which at the time of writing is more functional in TEICORPO - than in Weblicht.

+ scientific uses. In some cases, the services could complement each other or + TEICORPO + could be integrated in the + Weblicht services. This is the case, for example, for + handling the CHILDES format, which at the time of writing is more functional in + TEICORPO + than in + + Weblicht.

A major difference between the two suites is in the way they can be used and in the - type of data they target. TEICORPO is intended to be used not as an independent tool, + type of data they target. + TEICORPO is intended to be used not as an independent tool, but as a utility tool that helps researchers to go from one type of data to another. For example, the syntactic analysis is intended to be used as a first step before being used - in tools such as Praat, ELAN, or TXM. Our more recent developments + in tools such as + Praat, ELAN, or TXM. Our more recent developments (see Badin et al. 2021) made it possible to insert metadata stored in CSV files (including participant metadata) into the TEI files. This makes it possible to achieve more powerful corpus analysis using a tool such as - TXM.

Our approach is somewhat similar to what is suggested in the conclusion of Schmidt, Hedeland, and Jettka (2017), who describe a - mechanism that makes it possible to use the power of Weblicht to process their files - that are in the ISO/TEI format. A similar mechanism could be used within TEICORPO to - take advantage of the tools that are implemented in Weblicht. However, Schmidt, + mechanism that makes it possible to use the power of + Weblicht to process their files + that are in the ISO/TEI format. A similar mechanism could be used within + TEICORPO to + take advantage of the tools that are implemented in + Weblicht. However, Schmidt, Hedeland, and Jettka (2017) suggest in their conclusion that it would be more interesting to work directly on ISO/TEI files - because they contain a richer format. This is exactly what we did in TEICORPO. Our + because they contain a richer format. This is exactly what we did in + TEICORPO. Our suggestion would be to use the tools created by Schmidt, Hedeland, and Jettka (2017) directly with the TEICORPO files, so + type="bibl" target="#schmidt2017">2017) directly with the + TEICORPO files, so that their work would complement ours. Moreover, in this way, the two projects would be compatible and provide either new functionalities when the projects have clearly different goals, or data variants when the goals are closer.

@@ -1263,31 +1404,39 @@
Conclusion -

TEICORPO is a functional tool, created by the CORLI network and ORTOLANG, that converts +

+ TEICORPO is a functional tool, created by the CORLI network and ORTOLANG, that converts files created by software specializing in editing spoken-language data into TEI format. The result is fully compatible with the most recent developments in TEI, especially those that concern spoken-language material.

The TEI files can also be converted back to the original formats or to other formats used in spoken-language editing to take advantage of their functionalities. This makes TEI a - useful pivot format. Moreover, TEICORPO allows conversion to formats used by tools + useful pivot format. Moreover, + TEICORPO allows conversion to formats used by tools dedicated to corpus exploration and browsing.

-

TEICORPO exists as a command-line interface as well as a web service. It can thus be used +

+ TEICORPO exists as a command-line interface as well as a web service. It can thus be used by novice as well as advanced users, or by developers of linguistic software. The tool is free and open source so it can be further used and developed in other projects.

-

TEICORPO is intended to be part of a large set of tools using TEI for linguistic corpus +

+ TEICORPO is intended to be part of a large set of tools using TEI for linguistic corpus research. It can be used in parallel with or as a complement to other tools such as - Weblicht or the EXMARaLDA tools (see EXMARaLDA tools (see Schmidt, Hedeland, and Jettka 2017). A specificity of - TEICORPO is that it is more suitable for processing extended forms of TEI data (especially - forms which are not inside the main u element in the TEI code). TEICORPO is also - linked to TEIMETA, a flexible tool for describing spoken language corpora in a web + + TEICORPO is that it is more suitable for processing extended forms of TEI data (especially + forms which are not inside the main u element in the TEI code). + TEICORPO is also + linked to + TEIMETA, a flexible tool for describing spoken language corpora in a web interface generated from an ODD file (Etienne, Liégois, and Parisse, accepted). As TEI enables metadata and data to be stored in the same file, sharing this format will promote metadata sharing and will keep metadata linked to their data during the life cycle of the data.

Potential further developments could provide wider coverage of different formats such as - CMDI or linked data for editing or data exploration purposes; allow TEICORPO to work with + CMDI or linked data for editing or data exploration purposes; allow + TEICORPO to work with other external tools such as grammatical analyzers; or enable the visualization of multilevel annotations.

diff --git a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml index 0c0dd86e..6475aa0f 100644 --- a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml +++ b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml @@ -162,27 +162,38 @@ />.

in favor of a web-based publication. While this decision was critical in that it allowed us to select the most supported and widely-used medium, we soon discovered that it did not make choices any simpler. On the one hand, the XSLT stylesheets provided by TEI are great for HTML rendering, but do not include support for image-related features (such as the text-image linking available thanks to the P5 version of the TEI schema) and tools (including zoom in/out, magnifying lens, and hot spots) that represent a significant part of a digital facsimile and/or diplomatic edition; other features, such as an XML search engine, would have to be integrated separately, in any case. On the other hand, there are powerful frameworks - based on CMS

The Omeka framework () supports - publishing TEI documents; see also Drupal (‎) and - TEICHI ().

and other web - technologies

Such as the eXist XML database, .

which looked far too complex and + based on CMS

The Omeka framework () supports + publishing TEI documents; see also Drupal () and + + TEICHI ( + ).

and other web + technologies

Such as the + eXist XML database, .

which looked far too complex and expensive, particularly when considering future maintenance needs, for our project’s - purposes. Other solutions, such as the EPPT software

Edition - Production and Presentation Technology, .

developed by K. - Kiernan or the Elwood - viewer

Elwood Viewer, .

created - by G. Lyman, either were not yet ready or were unsuitable for other reasons (proprietary + purposes. Other solutions, such as the + EPPT + software

Edition Production and Presentation Technology, + + .

developed by K. + Kiernan or the + + Elwood viewer

Elwood Viewer, + + .

created + by G. Lyman, either were not yet ready or were unsuitable for other reasons (proprietary software, user interface issues, specific hardware and/or software requirements).

@@ -200,7 +211,8 @@
First Experiments -

At first, however, EVT was more an experimental research project for students at the +

At first, however, + EVT was more an experimental research project for students at the Informatica Umanistica course of the University of Pisa

BA course, .

than a real @@ -222,9 +234,12 @@
- The Current EVT Version + The Current + EVT Version
- EVT v. 2.0: Rebooting the Project + + EVT v. 2.0: + Rebooting the Project

To get out of the impasse we decided to completely reboot the project, removing secondary features and giving priority to fundamental ones. We also found a solution for the data-loading problem: instead of finding a way to load the data into the software we @@ -233,30 +248,38 @@ text, with very little configuration needed to create the edition. This approach also allowed us to quickly test XML files belonging to other edition projects, to check if EVT could go beyond being a project-specific tool. The inspiration for these changes - came from work done in similar projects developed within the TEI community, namely TEI Boilerplate,

TEI - Boilerplate, .

John A. - Walsh’s collection of XSLT stylesheets,

tei2html, .

and Solenne Coutagne’s + came from work done in similar projects developed within the TEI community, namely + + + TEI Boilerplate,

TEI + Boilerplate, + .

+ + John A. + Walsh’s collection of XSLT stylesheets,

tei2html + , .

and Solenne Coutagne’s work for the Berliner Intellektuelle 1800–1830 project.

Digitale Edition Briefe und Texte aus dem intellektuellen Berlin um 1800, .

- Through this approach, we achieved two important results: first, usage of EVT is quite - simple—the user applies an XSLT stylesheet to their already marked-up file(s), + Through this approach, we achieved two important results: first, usage of + EVT is quite + simple—the user applies an XSLT stylesheet to their already marked-up file(s), and when the processing is finished they are presented with a web-ready edition; second, the web edition that is produced is based on a client-only architecture and does not require any additional kind of server software, which means that it can be simply copied on a web server to be used at once, or even on a cloud storage service (provided that it is accessible by the general public).

To ensure that it will be working on all the most recent web browsers, and for as long - as possible on the World Wide Web itself, EVT is built on open and standard web - technologies such as HTML, CSS, and JavaScript. Specific + as possible on the World Wide Web itself, + EVT is built on open and standard web + technologies such as HTML, CSS, and JavaScript. Specific features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen from the best-supported open-source ones to reduce the risk of future incompatibilities. The general architecture of the software, in any case, is modular, so that any component @@ -268,16 +291,17 @@

Our ideal goal was to have a simple, very user-friendly drop-in tool, requiring little work and/or knowledge of anything beyond XML from the editor. To reach this goal, EVT is based on a modular structure where a single stylesheet (evt_builder.xsl) - starts a chain of XSLT 2.0 transformations calling in turn all the + starts a chain of XSLT 2.0 transformations calling in turn all the other modules. The latter belong to two general categories: those devoted to building the HTML site, and the XML processing ones, which extract the edition text lying between folios using the pb element and format it according to the edition level. All - XSLT modules live inside the builder_pack folder, in order to have a clean and well-organized directory hierarchy.

- The EVT builder_pack directory structure. + The + EVT builder_pack directory structure.

Therefore, assuming the available formatting stylesheets meet your project’s criteria, @@ -290,17 +314,19 @@ evt_builder-conf.xsl, to specify for example the number of edition levels or presence of images; you can then apply the evt_builder.xsl stylesheet to your TEI XML - document using the Oxygen XML editor or another XSLT 2–compliant + document using the Oxygen XML editor or another XSLT 2–compliant engine.
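For the last step above, any XSLT 2.0-capable engine can also be driven from a script instead of Oxygen. The sketch below assumes Saxon-HE; the jar name and file paths are illustrative, -s: and -xsl: are standard Saxon options, and no output file is named here because the stylesheet chain writes its own result documents.

# One way to run the transformation chain outside Oxygen: call an XSLT 2.0
# engine such as Saxon-HE. Jar name and paths are assumptions; the stylesheets
# themselves write the generated HTML pages into the output folder.
import subprocess

def build_edition(tei_file: str, saxon_jar: str = "saxon-he.jar") -> None:
    subprocess.run(
        ["java", "-jar", saxon_jar,
         f"-s:{tei_file}",          # the encoded transcription
         "-xsl:evt_builder.xsl"],   # entry stylesheet of the chain (path illustrative)
        check=True,
    )

if __name__ == "__main__":
    build_edition("input_data/transcription.xml")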

- The EVT data directory structure. + The + EVT data directory structure.

-

When the XSLT processing is finished, the starting point for the edition is +

When the XSLT processing is finished, the starting point for the edition is the index.html file in the root directory, and all the HTML pages resulting from the transformations will be stored in the output_data folder. You can delete everything in this latter folder (and the @@ -308,22 +334,24 @@ and everything will be re-created in the assigned places.

- The XSLT stylesheets + The XSLT stylesheets

The transformation chain has two main purposes: generate the HTML files containing the edition and create the home page which will dynamically recall the other HTML files.

The EVT builder’s transformation system is composed of a modular collection of XSLT 2.0 stylesheets: these modules are designed to permit scholars to freely + type="software" xml:id="R25" target="#XSLT"/>XSLT 2.0 stylesheets: these modules are designed to permit scholars to freely add their own stylesheets and to manage the different desired levels of the edition without influencing other parts of the system, for instance the generation of the home page.

-

The transformation is performed applying a specific XSLT stylesheet +

The transformation is performed applying a specific XSLT stylesheet (evt_builder.xsl) which includes links to all the other stylesheets that are part of the transformation chain and that will be applied to the TEI XML document containing the transcription.

-

EVT can be used to create image-based editions with different edition levels starting +

+ EVT can be used to create image-based editions with different edition levels starting from a single encoded text. The text of the transcription must be divided into smaller parts to recreate the physical structure of the manuscript. Therefore, it is essential that paginated XML documents are marked using a TEI page break element (pb) at @@ -364,17 +392,19 @@ xsl:apply-templates select="current-group()" mode="dipl" instruction before its content is inserted into the diplomatic output file.

-

Using XSLT modes it is possible to separate the rules for the different transformations - of a TEI element and to recall other XSLT stylesheets in order to manage the - transformations or send different parts of a document to different parts of the - transformation chain. This permits the extraction of different texts for different +

Using XSLT modes it is possible to separate the rules for the different + transformations of a TEI element and to recall other XSLT stylesheets in order to + manage the transformations or send different parts of a document to different parts of + the transformation chain. This permits the extraction of different texts for different edition levels (diplomatic, diplomatic-interpretative) processing the same XML file, and to save them in the HTML site structure, which is available as a separate XSLT module.
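The grouping performed at pb boundaries is independent of XSLT as such. The following Python sketch walks a transcription in document order and opens a new page bucket at every pb, which is the same idea the stylesheets express declaratively with xsl:for-each-group and modes. It is an illustration only, not part of EVT.

# Illustration only (EVT does this with xsl:for-each-group and XSLT modes):
# walk the transcription in document order and open a new page bucket at each
# pb, so every folio can then be rendered at the desired edition level.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def split_by_page(body: ET.Element) -> dict[str, list[str]]:
    pages: dict[str, list[str]] = {"before-first-pb": []}
    current = "before-first-pb"

    def walk(elem: ET.Element) -> None:
        nonlocal current
        if elem.tag == f"{TEI}pb":
            current = elem.get("n", "unnumbered")
            pages.setdefault(current, [])
        if elem.text:
            pages[current].append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text after a pb already belongs to the new page
                pages[current].append(child.tail)

    walk(body)
    return pages

if __name__ == "__main__":
    tree = ET.parse("input_data/transcription.xml")
    body = tree.find(f".//{TEI}body")
    assert body is not None
    for folio, chunks in split_by_page(body).items():
        print(folio, " ".join("".join(chunks).split())[:60])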

The use of modes also allows users to separate template rules for the different transformations of a TEI element and to place them in different XSLT files or in + xml:id="R31" target="#XSLT"/>XSLT files or in different parts of a single stylesheet. So templates such as the following and personalize the edition generation parameter as shown above; - copy their own XSLT files containing the template rules to + copy their own XSLT files containing the template rules to generate the desired edition levels in the directory that contains the stylesheets used for TEI element transformation (builder_pack/modules/elements); @@ -407,22 +437,26 @@

For the time being, this kind of customization has to be done by hand-editing the - configuration files, but in a future version of EVT we plan to add a more user-friendly + configuration files, but in a future version of + EVT we plan to add a more user-friendly way to configure the system.

Features -

At present, EVT can be used to create image-based editions with two possible edition +

At present, + EVT can be used to create image-based editions with two possible edition levels: diplomatic and diplomatic-interpretative; this means that a transcription encoded using elements belonging to the appropriate TEI module

See chapter 11, Representation of Primary Sources, in the TEI Guidelines.

should already be - compatible with EVT, or require only minor changes to be made compatible. The Vercelli + compatible with + EVT, or require only minor changes to be made compatible. The Vercelli Book transcription schema is based on the standard TEI schema, with no custom elements or attributes added: our tests with similarly encoded texts showed a high grade of compatibility. A critical edition level is currently being researched and it will be added in the future.

-

When the website produced by EVT is loaded in a browser, the viewer will be presented +

When the website produced by + EVT is loaded in a browser, the viewer will be presented with the manuscript image on the left side, and the corresponding text on the right: this is the default view, but on the main toolbar at the top right corner of the browser window there are icons to access all the available views: @@ -449,7 +483,8 @@ required by the editor. The only necessary requirement at the encoding level, in fact, is that the editor should encode folio numbers by means of the pb element including r and v letters to mark recto - and verso pages, respectively. EVT will take care of automatically associating each + and verso pages, respectively. + EVT will take care of automatically associating each folio to the images copied in the input_data/images folder using a verso-recto naming scheme (for example: 104v-105r.png). It is of course possible that in some cases the @@ -458,7 +493,8 @@ independent from the HTML interface; this file will be updated automatically every time the transformation process is started and can be customized by the editor.

Although the different views access different kinds of content, such as single side and - double side images, the navigation algorithms used by EVT allow the user to move from + double side images, the navigation algorithms used by + EVT allow the user to move from one view to another without losing the current browsing position.

All content is shown inside HTML frames designed to be as flexible as possible. No matter what view one is currently in, one can expand the desired frame to focus on its @@ -492,8 +528,8 @@ target="http://www.tapor.uvic.ca/~mholmes/image_markup/">Image Markup Tool

The UVic Image Markup Tool Project, .

software - and was implemented in XSLT and CSS; all the other features are achieved by + and was implemented in XSLT and CSS; all the other features are achieved by using jQuery plug-ins.

In the text frame tool bar you can see three drop-down menus which are useful for choosing texts, specific folios, and edition levels, and an icon that triggers the @@ -503,20 +539,24 @@

A First Use Case -

On December 24, 2013, after extensive testing and bug fixing work, the EVT team +

On December 24, 2013, after extensive testing and bug fixing work, the + + EVT team published a beta version of the Digital Vercelli Book edition,

Full announcement on the project blog, . The beta edition is directly accessible at .

soliciting feedback from all interested parties. Shortly afterwards, the version of the - EVT software we used, improved by more bug fixes and small enhancements, was made + + EVT software we used, improved by more bug fixes and small enhancements, was made available for the academic community on the project’s SourceForge site.

Edition Visualization Technology: Digital edition visualization - software, .

+ software, .

- The Digital Vercelli Book edition based on EVT v. 0.1.48. + The Digital Vercelli Book edition based on + EVT v. 0.1.48. Image-text linking is active.

@@ -524,9 +564,11 @@
Future Developments -

EVT development will continue during 2014 to fix bugs and to improve the current set of +

+ EVT development will continue during 2014 to fix bugs and to improve the current set of features, but there are also several important features that will be added or that we are - currently considering for inclusion in EVT. Some of the planned features will require + currently considering for inclusion in + EVT. Some of the planned features will require fundamental changes to the software architecture to be implemented effectively: this is probably the case for the Digital Lightbox (see ), which requires a client-server architecture (

New Layout -

One important aspect that has been introduced in the current version of EVT is a +

One important aspect that has been introduced in the current version of + EVT is a completely revised layout: the current user interface includes all the features which were deemed necessary for the Digital Vercelli Book beta, but it also is ready to accept the new features planned for the short and medium terms. Note that nontrivial changes to @@ -548,16 +591,23 @@

Search Engine -

The EVT search engine is already working and being tested in a separate development +

The + EVT search engine is already working and being tested in a separate development branch of the software; merging into the main branch is expected as soon as the user interface is finalized. It was implemented with the goal of keeping it simple and usable for both academics and the general public.

To achieve this goal we began by studying various solutions that could be used as a basis for our efforts. In the first phases of this study we looked at the principal XML - databases, such as of BaseX, eXist, etc., and we found a solution by envisioning EVT as + databases, such as of + BaseX, + eXist, etc., and we found a solution by envisioning + + EVT as a distributed application using the client-server architecture. For this test we - selected the eXist

eXist-db, .

open source XML database, and in a + selected the + eXist

eXist-db, + .

open source XML database, and in a relatively short time we created, sometimes by trial-and-error, a prototype that queried the database for keywords and highlighted them in context.

While this model was a step in the right direction and partially operational, we also @@ -568,7 +618,8 @@ could be accessed anywhere, and possibly distributed in optical formats (CD or DVD). Forcing the prerequisites of an Internet connection and of dependency on a server-based XML database would have undermined our original goal. Going the database route was no - longer an option for a client-only EVT and we immediately felt the need to go back to + longer an option for a client-only + EVT and we immediately felt the need to go back to our original architecture to meet this standard. This sudden turnaround marked another chapter in the research process and brought us to the current implementation of EVT Search.

@@ -582,39 +633,48 @@ expected by the user. Essentially, we found that at least two of them were needed in order to make a functional search engine: free-text search and keyword highlighting. To implement them we looked at existing search engines and plug-ins programmed in the most - popular client-side web language: JavaScript. In the - end, our search produced two answers: Tipue Search and DOM + popular client-side web language: JavaScript. In the + end, our search produced two answers: + Tipue Search and DOM manipulation.

- Tipue Search -

Tipue search

Tipue Search, - .

is a jQuery plug-in + + Tipue Search +

+ Tipue search

+ Tipue Search, + .

is a jQuery plug-in search engine released under the MIT license and aimed at indexing and searching large collections of web pages. It can function both offline and online, and it does not necessarily require a web server or a server-side programming/query language (such as SQL, PHP, or Python) in order to work. While technically a plug-in, its architecture is quite interesting and versatile: Tipue uses a combination of client-side JavaScript for the actual bulk of the work, and JSON (or JavaScript object literal) for storing the content. By + type="software" xml:id="R57" target="#JavaScript"/>JavaScript for the actual bulk of the work, and JSON (or JavaScript object literal) for storing the content. By accessing the data structure, this engine is able to search for a relevant term and bring back the matches.

-

Tipue Search operates in three modes: - in Static mode, Tipue Search operates without a web server by +

+ Tipue Search operates in three modes: + in Static mode, + Tipue Search operates without a web server by accessing the contents stored in a specific file (tipuedrop_content.js); these contents are presented in JSON format; - in Live mode, Tipue Search operates with a web server by indexing + in Live mode, + Tipue Search operates with a web server by indexing the web pages included in a specific file (tipuesearch_set.js); - in JSON mode, Tipue Search operates with a web server by using + in JSON mode, + Tipue Search operates with a web server by using AJAX to load JSON data stored in specific files (as defined by the user).

This plug-in suited our needs very well, but had to be modified slightly in order to - accommodate the requirements of the entire project. Before using Tipue to handle the + accommodate the requirements of the entire project. Before using + Tipue to handle the search we needed to generate the data structure that was going to be used by the engine to perform the queries. We explored some existing XSL stylesheets aimed at TEI to JSON transformation, but we found them too complex for the task at hand. So we @@ -627,18 +687,22 @@

These files are produced by including two templates in the overall flow of XSLT transformations that extract crucial data from the TEI documents and format them with JSON syntax. The procedure complements well the entire logic of - automatic self-generation that characterizes EVT.
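The extracted data structure boils down to one record per page of the edition. The sketch below shows the general shape being produced, serialized with Python's json module; the field names ("title", "text", "url") and the wrapping object are assumptions about what the search plug-in expects, and in EVT itself the equivalent file is written by XSLT templates during the same transformation run.

# Shape of the search data set described above: one record per folio, written
# next to the generated pages. Field names and the wrapping object are
# assumptions; folio numbers follow the verso-recto scheme used elsewhere here.
import json

pages = [
    {"title": "Folio 104v", "text": "normalized text of the verso page ...", "url": "104v.html"},
    {"title": "Folio 105r", "text": "normalized text of the recto page ...", "url": "105r.html"},
]

with open("output_data/search_content.json", "w", encoding="utf-8") as out:
    json.dump({"pages": pages}, out, ensure_ascii=False, indent=2)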

+ automatic self-generation that characterizes + EVT.

After we managed to extract the correct data structure, we began to include the - search functionality in EVT. By using the logic behind Tipue JSON mode, we implemented + search functionality in + EVT. By using the logic behind + Tipue JSON mode, we implemented a trigger (under the shape of a select tag) that loaded the desired JSON data structure to handle the search (diplomatic or facsimile, as mentioned above) and a form that managed the query strings and launched the search function. Additionally, we decided to provide the user with a simple virtual keyboard composed of essential keys related to the Anglo-Saxon alphabet used in the Vercelli Book.

-

The performance of Tipue Search was deemed acceptable and our tests showed that even +

The performance of + Tipue Search was deemed acceptable and our tests showed that even large collections of data did not pose any particular problem.

@@ -649,13 +713,15 @@ Keyword Highlighting through DOM Manipulation

The solution to keyword highlighting was found while searching many plug-ins that deal with this very problem. All these plug-ins use JavaScript and DOM manipulation in order to wrap the HTML text nodes that match the query with a specific tag (a span or a user-defined tag) and a CSS class to manage the style of the highlighting. While this implementation was very simple and self-explanatory, making use of simple recursive functions on relevant HTML nodes has - proved to be very difficult to apply to the textual contents handled by EVT.

-

HTML text within EVT is represented as a combination of text nodes and span + proved to be very difficult to apply to the textual contents handled by + EVT.

+

HTML text within + EVT is represented as a combination of text nodes and span elements. These spans are used to define the characteristics of the current selected edition. They contain both philological information about the inner workings of the text and information about its visual representation. Very often the text is composed @@ -704,12 +770,14 @@ information about the image, but is placed inside a zone element, which defines two-dimensional areas within a surface, and is transcribed using one or more line elements.

-

Originally EVT could not handle this particular encoding method, since the Originally + EVT could not handle this particular encoding method, since the XSLT stylesheets could only process TEI XML documents encoded according to the traditional transcription method. Since we think that this is a concrete need in many cases of study (mainly epigraphical inscriptions, but also manuscripts, at least in - some specific cases), we recently added a new feature that will allow EVT to handle + some specific cases), we recently added a new feature that will allow + EVT to handle texts encoded according to the embedded transcription method. This work was possible due to a small grant awarded by EADH.

See EADH Small Grant: Call for Proposals, Support for Critical Edition

One important feature whose development will start at some point this year is the - support for critical editions, since at the present moment EVT allows dealing only + support for critical editions, since at the present moment + EVT allows dealing only with diplomatic and interpretative ones. We aim not only to offer full support for the TEI Critical Apparatus module, but also to find an innovative layout that can take advantage of the digital medium and its dynamic properties to go beyond the @@ -746,7 +815,8 @@

Some of the problems related to this approach are related to the user interface and the way it should be designed in order to be usable and useful: how to conceive and where to place the graphical widgets holding the critical apparatus, how to integrate - these UI elements in EVT, how to contextualize the variants and navigate through the + these UI elements in + EVT, how to contextualize the variants and navigate through the witnesses’ texts, and more. There are other problems, for instance scalability issues (how to deal with very big textual traditions that count tens or even hundreds of witnesses?) or the handling of texts produced by collation software, which strictly @@ -758,12 +828,14 @@

- Digital Lightbox + + Digital Lightbox

Developed first at the University of Pisa, and then at King’s College London as part of the DigiPal

DigiPal: Digital Resource and Database of Palaeography, Manuscript Studies and Diplomatic, .

project, the Digital Lightbox

A beta + target="http://lightbox-dev.dighum.kcl.ac.uk"> + Digital Lightbox

A beta version is available at .

is a web-based visualization framework which aims to support historians, paleographers, art historians, and others in analyzing and studying digital @@ -775,7 +847,8 @@ computational methods are very promising, the results that may be obtained at this time are still significantly less precise (with regard to specific image features, at least) than those produced through human interpretation.

-

Initially developed exclusively for paleographic research, the Digital Lightbox may be +

Initially developed exclusively for paleographic research, the + Digital Lightbox may be used with any type of image because it includes a set of general graphic tools. Indeed, the application allows a detailed and powerful analysis of one or more images, arranged in up to two available workspaces, providing tools for manipulation, management, @@ -799,7 +872,8 @@ Lightbox.

-

Collaboration is a very important characteristic of Digital Lightbox: what makes this +

Collaboration is a very important characteristic of + Digital Lightbox: what makes this tool stand apart from all the image-editing applications available is the possibility of creating and sharing the work done using the software framework. First, you can create collections of images and then export them to the local disk as an XML file; this @@ -812,34 +886,45 @@ work more effective and easy. Thanks to a new HTML5 feature, it is possible to support the importing of images from the local disk to the application without any server-side function.

-

Digital Lightbox has been developed using some of the latest web technologies +

+ Digital Lightbox has been developed using some of the latest web technologies available, such as HTML5, CSS3, the front-end framework Bootstrap,

Bootstrap, .

and the JavaScript (ECMAScript 6) programming language, in combination with the jQuery library.

.

The code architecture has been designed + target="http://getbootstrap.com/">Bootstrap,

Bootstrap, .

and the JavaScript (ECMAScript 6) programming language, in combination with the jQuery library.

.

The code architecture has been designed to be modular and easily extensible by other developers or third parties: indeed, it has - been released as open source software on GitHub,

Digital - Lightbox, .

and is freely available to be downloaded, edited, and tinkered + been released as open source software on GitHub,

+ Digital Lightbox, .

and is freely available to be downloaded, edited, and tinkered with.

-

The Digital Lightbox represents a perfect complementary feature for the EVT project: a +

The + Digital Lightbox represents a perfect complementary feature for the + EVT project: a graphic-oriented tool to explore, visualize, and analyze digital images of manuscripts. - While EVT provides a rich and usable interface to browse and study manuscript texts + While + EVT provides a rich and usable interface to browse and study manuscript texts together with the corresponding images, the tools offered by the Digital Lightbox allow users to identify, gather, and analyze visual details which can be found within the images, and which are important for inquiries relating, for instance, to the style of the handwriting, decorations on manuscript folia, or page layout.

-

An effort to adapt and integrate the Digital Lightbox into EVT is already underway, +

An effort to adapt and integrate the Digital Lightbox into + EVT is already underway, making it available as a separate, image-centered view, but there is a major hurdle to overcome: some of the DL features are only possible within a client-server architecture. - Since EVT or, more precisely, a separate version of EVT will migrate to this + Since + EVT or, more precisely, a separate version of + + EVT will migrate to this architecture, at some point in the future it will be possible to integrate a full version of the DL. Plans for the current, client-only version envision implementing all those features that do not depend on server software: even if this means giving up @@ -853,22 +938,29 @@

In September 2013 we met with researchers of the Clavius on the Web project

See . A - preliminary test using a previous version of EVT is available at + EVT is available at .

to discuss a possible use of - EVT in order to visualize the documents that they are collecting and encoding; the main + + EVT in order to visualize the documents that they are collecting and encoding; the main goal of the project is to produce a web-based edition of all the correspondence of this important sixteenth–seventeenth-century mathematician.

Currently preserved at the Archivio della Pontificia Università Gregoriana.

The integration of - EVT with another web framework used in the project, the eXist XML database, will require + + EVT with another web framework used in the project, the eXist XML database, will require a very important change in how the software works: as mentioned above, everything from - XSLT processing to browsing of the resulting website has been done on the client - side, but the integration with eXist will require a move to the more complex - client-server architecture. A version of EVT based on this architecture would present + side, but the integration with + eXist will require a move to the more complex + client-server architecture. A version of + EVT based on this architecture would present several advantages, not only the integration of a powerful XML database, but also the - implementation of a full version of the Digital Lightbox. We will try to make the move + implementation of a full version of the + Digital Lightbox. We will try to make the move as painless as possible and to preserve the basic simplicity and flexibility that has - been a major feature of EVT so far. The client-only version will not be abandoned, + been a major feature of + EVT so far. The client-only version will not be abandoned, though for quite some time there will be parallel development with features trickling from one version to the other, with the client-only one being preserved as a subset of the more powerful one.

@@ -880,13 +972,14 @@ to the publishing of TEI-encoded digital editions, this software has grown to the point of being a potentially very useful tool for the TEI community: since it requires little configuration, and no knowledge of programming languages or web frameworks except for what - is needed to apply an XSLT stylesheet, it represents a user-friendly method + is needed to apply an XSLT stylesheet, it represents a user-friendly method for producing image-based digital editions. Moreover, its client-only architecture makes it very easy to test the edition-building process (one has only to delete the output folders and start anew) and publish preliminary versions on the web (a shared folder on any cloud-based service such as Dropbox is all that is needed).

-

While EVT has been under development for 3–4 years, it was thanks to the work and focus +

While + EVT has been under development for 3–4 years, it was thanks to the work and focus required by the Digital Vercelli Book release at end of 2013 that we now have a solid foundation on which to build new features and refine the existing ones. Some of the future expansions also pose important research questions: this is the case with the critical @@ -902,8 +995,8 @@ ongoing, see Gabler 2010, Robinson 2005 and 2013, Rosselli Del Turco, - forthcoming.

The collaborative work features of the Digital - Lightbox are also critical to the way modern scholars interact and share their research + forthcoming.

The collaborative work features of the + Digital Lightbox are also critical to the way modern scholars interact and share their research findings. Finally, designing a user interface capable of hosting all the new features, while remaining effective and user-friendly, will itself be very challenging.

diff --git a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml index 3d41b158..347c4899 100644 --- a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml +++ b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml @@ -73,14 +73,14 @@

The paper presents the database Cretan Institutional Inscriptions, which was created as part of a PhD research project carried out at the University of Venice Ca’ Foscari. The database, built using the EpiDoc Front-End Services (EFES) platform, collects the EpiDoc editions of six hundred inscriptions that shed light on the institutions of the political entities of Crete from the seventh to the first century BCE. The aim of the paper is to outline the main issues addressed during the creation of the database and the encoding of the inscriptions and to illustrate the core features of the database, with an emphasis on the advantages deriving from the combined use of the TEI-EpiDoc standard and of the EFES platform.

@@ -124,15 +124,15 @@ document. The editions of these inscriptions, along with a collection of the most relevant literary sources, have been collected in the database Cretan Institutional Inscriptions, which I created using the EpiDoc Front-End Services (EFES) platform. To facilitate consulting the epigraphic records, the database also includes, in addition to the ancient sources, two catalogs providing information about the Cretan political entities and the institutional elements considered.

The aim of this paper is to illustrate the main issues tackled during the creation of the database and to examine the choices made, focusing on the advantages offered by - the use of EpiDoc and EFES.

+ the use of EpiDoc and EFES.

Cretan Epigraphy and Cretan Institutions @@ -246,7 +246,7 @@
Towards the Creation of a Born-Digital Epigraphic Collection with EFES

Once the relevant material had been defined, another major issue that I had to face was to decide how to deal efficiently with it. While I was in the process of starting @@ -269,20 +269,21 @@ collection of editions of the previously selected six hundred inscriptions to creating it as a born-digital epigraphic collection because of another event that also happened in 2017: the appearance of a powerful new tool for digital epigraphy, - EpiDoc Front-End Services (EFES).

GitHub repository, accessed July - 21, 2021, .

Although I + EpiDoc Front-End Services (EFES).

GitHub repository, accessed July + 21, 2021, .

Although I was already aware of the many benefits deriving from a semantic markup of the inscriptions,

On which see and .

what really persuaded me to adopt a TEI-based approach for the creation of my epigraphic - editions was actually the great facilitation that EFES offered in using + editions was actually the great facilitation that EFES offered in using TEI-EpiDoc, which I will discuss in the following section.

- The Benefits of Using EpiDoc and EFES + The Benefits of Using EpiDoc and EFES

I was already familiar with the epigraphic subset of the TEI standard, EpiDoc,

EpiDoc: Epigraphic Documents in TEI XML, accessed July 21, 2021,

This is particularly true for the creation of publishable output of the encoded - inscriptions. The EpiDoc Reference XSLT Stylesheets, created for transformation of - EpiDoc XML files into HTML,

Accessed July 21, 2021, .

require + inscriptions. The EpiDoc Reference + XSLT Stylesheets, created for transformation of + EpiDoc XML files into HTML,

Accessed July 21, 2021, .

require relatively advanced knowledge of XSLT to use them to produce a satisfying HTML edition for online publication or to generate a printable PDF. Not to mention the creation of a complete searchable database to be published online, equipped with @@ -335,58 +337,64 @@ their research work on a collection of ancient documents, without aiming at the publication of the encoded inscriptions. The querying of a set of EpiDoc inscriptions is possible to some extent even without technical support: in some advanced XML - editors, particularly Oxygen, it is possible to perform XPath queries that allow the + editors, particularly + Oxygen, it is possible to perform XPath queries that allow the identification of all the occurrences of specific features in the epigraphic collection according to their markup. The XPath queries in an advanced XML editor also allow the creation of lists of specific elements mentioned in the inscriptions, but to my knowledge the creation of proper indexes—before EFES—was + xml:id="R11" target="#EFES"/>EFES—was almost impossible to achieve without the help of an IT expert.
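To make the kind of interrogation meant here concrete (the markup below is a purely illustrative sketch, not a passage quoted from the Cretan Institutional Inscriptions files): if the text of an edition tags its institutional terms along these lines,

    <div type="edition" xml:lang="grc">
      <ab>
        <!-- the inflected form on the stone, grouped under a dictionary headword -->
        ... τοῖς <w lemma="κόσμος">κόσμοις</w> ...
      </ab>
    </div>

then an XPath query along the lines of //w[@lemma='κόσμος'] (namespace handling aside) typed into the editor retrieves every occurrence of the institution across the collection, whatever form it takes in the individual inscriptions; turning such results into proper cumulative indexes, however, remained the hard part.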

Thus, despite the many benefits that EpiDoc encoding potentially offers, epigraphists might often be discouraged from adopting it by the amount of time that such an approach requires, combined with the fact that in many cases these benefits become tangible only at the end of the work, and only if one has IT support.

In light of these limitations, it is easy to understand how deeply the release of - EFES has transformed the field of digital epigraphy. EFES, developed at the Institute of Classical Studies of the School of + EFES has transformed the field of digital epigraphy. + EFES, developed at the Institute of Classical Studies of the School of Advanced Study of the University of London as the epigraphic specialization of the - Kiln platform,

New Digital Publishing Tool: EpiDoc Front-End + <ptr type="software" xml:id="R14" target="#kiln"/> + <rs type="soft.name" ref="#R14">Kiln platform</rs> + ,<note><p><title level="a">New Digital Publishing Tool: EpiDoc Front-End Services, September 1, 2017, ; see also the Kiln GitHub repository, accessed July 21, 2021,.

is a platform that + target="https://ics.sas.ac.uk/about-us/news/new-digital-publishing-tool-epidoc-front-end-services"/>; + see also the Kiln GitHub repository, accessed July 21, 2021, + .

is a platform that simplifies the creation and management of databases of inscriptions encoded following - the EpiDoc Guidelines. More specifically, EFES was developed to make + the EpiDoc Guidelines. More specifically, + EFES was developed to make it easy for EpiDoc users to view a publishable form of their inscriptions, and to publish them online in a full-featured searchable database, by easily ingesting EpiDoc texts and providing formatting for their display and indexing through the - EpiDoc reference XSLT stylesheets. The ease of configuration of the XSLT + + EpiDoc reference XSLT stylesheets. The ease of configuration of the XSLT transformations, and the possibility of already having, during construction, an immediate front-end visualization of the desired final outcome of the TEI-EpiDoc marked-up documents, allow smooth creation of an epigraphic database even without a - large team or in-depth IT skills. Beyond this, EFES is also remarkable for + large team or in-depth IT skills. Beyond this, EFES is also remarkable for the ease of creation and display of the indexes of the various categories of marked-up terms, which significantly simplifies comparative analysis of the data - under consideration. EFES is thus proving to be an extremely useful + under consideration. EFES is thus proving to be an extremely useful tool not only for publishing inscriptions online, but also for studying them before their publication or even without the intention of publishing them, especially when dealing with large collections of documents and data sets.

See Bodard and Yordanova (2020).
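For readers who have not worked with the format, the files EFES ingests are ordinary EpiDoc documents; a minimal skeleton, greatly simplified and purely illustrative, has roughly this shape:

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt><title><!-- editorial title of the inscription --></title></titleStmt>
          <publicationStmt><p><!-- publication details --></p></publicationStmt>
          <sourceDesc>
            <msDesc><msIdentifier><repository><!-- holding institution --></repository></msIdentifier></msDesc>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <div type="edition" xml:lang="grc"><ab><!-- Leiden-style transcription --></ab></div>
          <div type="apparatus"><listApp><app><note><!-- variant readings --></note></app></listApp></div>
          <div type="translation"><p/></div>
          <div type="commentary"><p/></div>
          <div type="bibliography"><listBibl><bibl/></listBibl></div>
        </body>
      </text>
    </TEI>

EFES picks up such files as they are, applies the EpiDoc reference stylesheets to them, and builds the browsing, search, and index pages around them.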

-

Some of these useful features of EFES are common to other existing tools, - such as TEI Publisher,

Accessed July 21, 2021, 2020).

+

Some of these useful features of EFES are common to other existing tools, + such as TEI Publisher,

Accessed July 21, 2021, .

- TAPAS,

Accessed July 21, 2021, TAPAS,

Accessed July 21, 2021, .

or Kiln itself, which is
                  EFES’s direct ancestor. What makes EFES unique,
                however, is the fact that it is the only one of those tools to have been designed
                specifically for epigraphic purposes and to be deeply integrated with the EpiDoc
                Schema/Guidelines and with its reference stylesheets. Not only does it use, by
@@ -402,20 +410,20 @@
                symbols, numerals, abbreviations, and uninterpreted text fragments. New facets and
                indexes can easily be added even without mastering XSLT, along the lines of the
                existing ones and by following the detailed instructions provided in the EFES Wiki documentation.

Accessed July 21, 2021, . Creation of new facets, last updated April 11, 2018: . Creation of new indexes, last updated May 27, 2020: .

- Furthermore, EFES makes it possible to create an epigraphic concordance of the + Furthermore, EFES makes it possible to create an epigraphic concordance of the various editions of each inscription and to add information pages as TEI XML files (suitable for displaying both information on the database itself and potential additional accompanying information).

Against this background, the combined use of the EpiDoc encoding and of the EFES tool seemed to be a promising approach for the purposes of my research project, and so it was.

I initially aimed to create updated digital editions of the inscriptions mentioning @@ -425,18 +433,18 @@ inscriptions in EpiDoc, totally met my needs, and helped me very much in the identification of recurring patterns. As I was expected to submit my doctoral thesis in PDF format, I also needed to convert the epigraphic editions into PDF, and by - running EFES locally I have been able to view their transformed HTML + running EFES locally I have been able to view their transformed HTML versions on a browser and to naively copy and paste them into a Microsoft Word file.

I am very grateful to Pietro Maria Liuzzo for teaching me how to avoid this conversion step by using XSL-FO, which can be used to generate a PDF directly from the raw XML files. The use of XSL-FO, however, requires some additional skills that are not needed in the copy-and-paste-from-the-browser process.

Although I had not planned it from the beginning, EFES also proved to be useful in the (online) publication of the results of - my research. The ease with which EFES allows the creation of a searchable + my research. The ease with which EFES allows the creation of a searchable epigraphic database, in fact, spontaneously led me to decide to publish it online once completed, making available not only the HTML editions—which can also be downloaded as printable PDFs—but also the raw XML files for reuse. The aim of the @@ -447,8 +455,8 @@
Cretan Institutional Inscriptions: An Overview of the Database -

The core of the EFES-based database Cretan + <p>The core of the <ptr type="software" xml:id="R30" target="#EFES"/><rs + type="soft.name" ref="#R30">EFES</rs>-based database <title level="m">Cretan Institutional Inscriptions consists of the EpiDoc editions of the previously selected six hundred inscriptions, which can be exported both in PDF and in their original XML format. Each edition is composed of an essential descriptive @@ -486,8 +494,8 @@ >Political entities, Institutions, Literary sources, and Bibliographic references, have been added to the database as pages generated from TEI - XML files, which could be natively included in EFES.

+ XML files, which could be natively included in EFES.

As mentioned above, the database also includes several thematic indexes listing the marked-up terms along with the references to the inscriptions in which they occur, divided into institutions, toponyms and ethnic adjectives, lemmata (both of @@ -663,8 +671,8 @@ type="crossref"/> (I.Cret. II 23 5). -

Given the markup described above, EFES was able to generate detailed indexes +

Given the markup described above, EFES was able to generate detailed indexes having the appearance of rich tables, where each piece of information is displayed in a dedicated column and can easily be combined with the other ones at a glance.

In the most complex case, that of the institutions, the index displays for each @@ -702,8 +710,8 @@ An excerpt from the prosopographical index.

In addition to the more tabular institutional and - prosopographical indexes, EFES facilitated the creation of other more + prosopographical indexes, EFES facilitated the creation of other more traditional indexes, including the indexed terms and the references to the inscriptions that mention them. The encoding of the most significant words with w lemma="" led to the creation of a word index of relevant @@ -721,8 +729,8 @@
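To illustrate the mechanism with invented values (not a reading quoted from the database): a clause encoded as

    <ab>... τᾶι <w lemma="πόλις">πόλι</w> ...</ab>

is all that is needed for EFES to file the passage under the headword πόλις in the word index, together with a reference back to the inscription, whatever inflected or dialectal form actually stands on the stone.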

Conclusions

In conclusion, I would like to emphasize how particularly efficient the combined use - of EpiDoc and EFES has proven to be for the creation of a thematic database + of EpiDoc and EFES has proven to be for the creation of a thematic database like Cretan Institutional Inscriptions. By collecting in a searchable database all the inscriptions pertaining to the Cretan institutions, records that were hitherto accessible only in a scattered way, Cretan Institutional Inscriptions is a new diff --git a/taxonomy/software-list.xml b/taxonomy/software-list.xml index 2a929420..38ef49ef 100644 --- a/taxonomy/software-list.xml +++ b/taxonomy/software-list.xml @@ -1331,7 +1331,7 @@ CLAN - http://dali.talkbank.org/clan/ + http://dali.talkbank.org/clan/ From 4e4e2755f87a258822208052620d362f06c6c67c Mon Sep 17 00:00:00 2001 From: Fernanda Alvares Freire Date: Fri, 2 Feb 2024 17:35:22 +0100 Subject: [PATCH 30/33] annotation --- .../jtei-cc-ra-parisse-182-source.xml | 1020 +++++++++-------- .../JTEI/8_2014-15/jtei-8-rosselli-source.xml | 762 ++++++------ .../jtei-vagionakis-204-source.xml | 176 +-- 3 files changed, 1029 insertions(+), 929 deletions(-) diff --git a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml index 363ecf32..d0cd2c31 100644 --- a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml +++ b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml @@ -95,30 +95,36 @@ collect and transcribe spoken language resources, their number is limited and thus corpora need to be interoperable and reusable in order to improve research on themes such as phonology, prosody, interaction, syntax, and textometry. To help researchers reach this - goal, CORLI has designed a pair of tools: - TEICORPO to assist in the conversion and use of - spoken language corpora, and - TEIMETA for metadata purposes. - TEICORPO is based on the - principle of an underlying common format, namely TEI XML as described in its specification - for spoken language use (ISO 2016). This tool enables the conversion of transcriptions - created with alignment software such as - CLAN, - Transcriber, - Praat, or ELAN as well as - common file formats (CSV, XLSX, TXT, or DOCX) and the TEI format, which plays the role of - a lossless pivot format. Backward conversion is possible in many cases, with limitations - inherent in the destination target format. - TEICORPO can run the - Treetagger part-of-speech - tagger and the - Stanford CoreNLP tools on TEI files and can export the resulting files to - textometric tools such as TXM, Le Trameur, or Iramuteq, making it suitable for - spoken language corpora editing as well as for various research purposes.

+ goal, CORLI has designed a pair of tools: + TEICORPO to assist in the conversion and use of spoken + language corpora, and + TEIMETA for metadata purposes. + TEICORPO is based on the principle of an underlying + common format, namely TEI XML as described in its specification for spoken language use + (ISO 2016). This tool enables the conversion of transcriptions created with alignment + software such as + CLAN, + Transcriber, + Praat, or ELAN as well as common file formats + (CSV, XLSX, TXT, or DOCX) and the TEI format, which plays the role of a lossless pivot + format. Backward conversion is possible in many cases, with limitations inherent in the + destination target format. + TEICORPO can run the + Treetagger part-of-speech tagger and the + Stanford CoreNLP tools on TEI files and can export + the resulting files to textometric tools such as TXM, Le Trameur, or + Iramuteq, making it suitable for spoken language corpora editing as well as for + various research purposes.

@@ -225,20 +231,21 @@
Similarities with and Differences from Other Approaches

Many software packages dedicated to editing spoken language transcription contain - utilities that can convert many formats: for example, EXMARaLDA (Schmidt 2004 - ; see ), Anvil ( - Kipp - 2001; see ), and ELAN (Wittenburg et al. 2006; - see ). However, in all cases, the + utilities that can convert many formats: for example, EXMARaLDA (Schmidt 2004 + ; see ), + + Anvil ( + Kipp 2001; see ), and ELAN + (Wittenburg + et al. 2006; see + ). However, in all cases, the conversions are limited to the features implemented in the tool itself—for example, with a limited set of metadata—and they cannot always be used to prepare data to be used by - another tool. + another tool.

While our work is similar to that of Schmidt (2011), several differences make our approaches complementary. First, the two main common features are as follows:

@@ -249,122 +256,136 @@

The list of tools that are considered in the two projects is nearly the same. The only tools missing in the - TEICORPO approach are EXMARaLDA and - FOLKER - (Schmidt and - Schütte 2010; see ), but this was only because the - conversion tools from and to EXMARaLDA, - FOLKER, and TEI already exist. - They are available as XSLT stylesheets in the open-source distribution of - EXMARaLDA (). The other common point is the - use of the TEI format, and especially the more recent ISO version of TEI for spoken - language (ISO/TEI; see ISO 2016). The TEI - format produced by the EXMARaLDA and - FOLKER software fit within the - process chain of - TEICORPO. This demonstrates the usefulness of a well-known and - efficient format such as TEI. + TEICORPO approach are EXMARaLDA and + FOLKER (Schmidt and Schütte + 2010; see ), but this was only because the + conversion tools from and to EXMARaLDA, + FOLKER, and TEI already exist. They are available + as XSLT stylesheets in the open-source distribution of EXMARaLDA (). The other common point is the use of the TEI format, and especially the more + recent ISO version of TEI for spoken language (ISO/TEI; see ISO 2016). The TEI format produced by the EXMARaLDA and + + FOLKER software fit within the process chain of + TEICORPO. This demonstrates the usefulness of a well-known and + efficient format such as TEI.

There are, however, differences between the two projects that make them nonredundant but complementary, each project having specificities that can be useful or damaging - depending on the user’s needs. One minor difference is that the - TEICORPO project is not - a functionality of an editing tool, but is a standalone tool for converting data between - one format and another. This had certain effects on the user interface and explains some - of the choices made in the development of the two tools.

-

There are two major differences between - TEICORPO and Schmidt’s approach, which affected + depending on the user’s needs. One minor difference is that the + TEICORPO project is not a functionality of an + editing tool, but is a standalone tool for converting data between one format and + another. This had certain effects on the user interface and explains some of the choices + made in the development of the two tools.

+

There are two major differences between + TEICORPO and Schmidt’s approach, which affected both the design of the tools and how they can be used. The first difference is that in - developing TEICORPO, - it was decided that the conversion between the original formats and - TEI had to be lossless (or as lossless as possible) because we wanted to offer a means - to store the research data for long-term conservation and dissemination in a standard - XML format instead of in proprietary formats such as those used by - CLAN (MacWhinney 2000; see ), ELAN, - Praat (Boersma and van Heuven 2001; see ), and - Transcriber (Barras et al. 2000; see and ). These proprietary - formats are in XML or Unicode formats so that they can be conserved for the long term. - However, they are not all well described or constrained, at least not in the same way as - TEI—which, moreover, offers a semantically relevant structure as well as an official - format for long-term conservation in France. Moreover, as the durability of these four - pieces of software cannot be guaranteed in the long term, it does not seem safe to keep - corpora in a format available only for a given tool that may disappear or fall into - disuse. -

The second major difference is that the + developing TEICORPO, it was decided that the conversion between the original + formats and TEI had to be lossless (or as lossless as possible) because we wanted to + offer a means to store the research data for long-term conservation and dissemination in + a standard XML format instead of in proprietary formats such as those used by + CLAN (MacWhinney 2000; see ), ELAN, + Praat (Boersma and van Heuven 2001; see ), and + Transcriber (Barras et al. 2000; see and + ). These + proprietary formats are in XML or Unicode formats so that they can be conserved for the + long term. However, they are not all well described or constrained, at least not in the + same way as TEI—which, moreover, offers a semantically relevant structure as well as an + official format for long-term conservation in France. Moreover, as the durability of + these four pieces of software cannot be guaranteed in the long term, it does not seem + safe to keep corpora in a format available only for a given tool that may disappear or + fall into disuse.

+

The second major difference is that the TEICORPO initiative does not target only spoken language, but all types of annotation, including media of any type. This covers all spoken languages, vocal as well as sign languages, and also gesture and any type of multimodal coding. The goal of - TEICORPO was not to advocate a linguistic mode of coding - spoken data as a transcription convention does, but rather to propose a research model - for storing and sharing data about language and other modalities. Consequently, the - focus of the work was not on how the spoken data were coded (i.e., the microstructure), - nor on the standard that should be used for transcribing in orthographic format. - Instead, the - TEICORPO approach focused on how to integrate multiple pieces of - information into the TEI semantics (the macrostructure), as this is possible with tools - such as ELAN or - PRAAT. The goal was to be able to convert a file produced by - these tools so that it can be saved in TEI format for long-term conservation.

-

Data in - PRAAT and ELAN formats can contain information that is - different from what is usually present in an ISO/TEI description, but that nonetheless - remains within the structures authorized in the ISO/TEI. For example, the information is - stored as described below in spanGrp, an element available in the ISO/TEI - description. This means that whenever information is organized according to the - classical approach to spoken language (by + TEICORPO was not to advocate a linguistic mode of + coding spoken data as a transcription convention does, but rather to propose a research + model for storing and sharing data about language and other modalities. Consequently, + the focus of the work was not on how the spoken data were coded (i.e., the + microstructure), nor on the standard that should be used for transcribing in + orthographic format. Instead, the + TEICORPO approach focused on how to integrate + multiple pieces of information into the TEI semantics (the macrostructure), as this is + possible with tools such as ELAN or + PRAAT. The goal was to be able to convert a file + produced by these tools so that it can be saved in TEI format for long-term + conservation.

+

Data in + PRAAT and ELAN formats can contain + information that is different from what is usually present in an ISO/TEI description, + but that nonetheless remains within the structures authorized in the ISO/TEI. For + example, the information is stored as described below in spanGrp, an element + available in the ISO/TEI description. This means that whenever information is organized + according to the classical approach to spoken language (by classical, we mean approaches based on an orthographic transcription represented as a list, as in the script of a play), it will be available - for further processing by using the export features of - TEICORPO (see and further below for export functionalities) - but other types of information are also available. Compared to - PRAAT and ELAN, the integration of tools such as - CLAN or - Transcriber was much more - straightforward, as the organization of the files is less varied and more - classical.

+ for further processing by using the export features of + TEICORPO (see and further below for export functionalities) but other types of + information are also available. Compared to + PRAAT and ELAN, the integration of tools + such as + CLAN or + Transcriber was much more straightforward, as the + organization of the files is less varied and more classical.

Choice of the Microstructure Representation

Processing of the microstructure, with the exception of information already available - in the tools themselves (for example silence in - Transcriber), is not done during the - conversion to TEI. The division into words or other elements such as morphemes or - phonemes is not systematically done in any of the tools used by researchers in CORLI. - When it exists, it is not included in the main transcription line but most often in - dependent lines, as it represents an annotation with its own rules and guidelines. - Division into words or other elements is part of the linguistic analysis rather than a - simple storage operation.

+ in the tools themselves (for example silence in + Transcriber), is not done during the conversion + to TEI. The division into words or other elements such as morphemes or phonemes is not + systematically done in any of the tools used by researchers in CORLI. When it exists, + it is not included in the main transcription line but most often in dependent lines, + as it represents an annotation with its own rules and guidelines. Division into words + or other elements is part of the linguistic analysis rather than a simple storage + operation.

- TEICORPO therefore preserves as long-term storage data both the original information - that was created in the original software—the full unprocessed transcription—and the - other linguistically processed transcriptions and annotations. For - TEICORPO, - microstructure processing, such as division into words, or text standardization when - necessary, belongs to the linguistic analysis of the corpora. Hence, the TEI data file - can be used both for data exploration and for scientific purposes. For example, when a - researcher needs to parse the data, or to explore the data with textometric tools, - then it is necessary to decide which type of preprocessing is necessary. As this - decision often depends on the initial project as well as on linguistic choices, it is - difficult to standardize this task.

+ TEICORPO therefore preserves as long-term storage + data both the original information that was created in the original software—the full + unprocessed transcription—and the other linguistically processed transcriptions and + annotations. For + TEICORPO, microstructure processing, such as + division into words, or text standardization when necessary, belongs to the linguistic + analysis of the corpora. Hence, the TEI data file can be used both for data + exploration and for scientific purposes. For example, when a researcher needs to parse + the data, or to explore the data with textometric tools, then it is necessary to + decide which type of preprocessing is necessary. As this decision often depends on the + initial project as well as on linguistic choices, it is difficult to standardize this + task.

@@ -372,10 +393,10 @@ The TEICORPO Project

The - TEICORPO project contains two different sets of tools. One set focuses on conversion - between various software packages used for spoken language coding and TEI. The other set - focuses on using the TEI format for linguistic analyses (textometric or grammatical - analyses).

+ TEICORPO project contains two different sets of + tools. One set focuses on conversion between various software packages used for spoken + language coding and TEI. The other set focuses on using the TEI format for linguistic + analyses (textometric or grammatical analyses).

Alignment Tools @@ -388,33 +409,35 @@ software are of course possible:

- Transcriber is widely used in sociolinguistics; + Transcriber is widely used in + sociolinguistics; - CLAN is widely used in language acquisition and especially in the Talkbank - project; + CLAN is widely used in language acquisition and + especially in the Talkbank project; Praat is more specialized for phonetic or phonological annotations; - ELAN is recommended for annotating video and particularly - multimodality (for example, components such as gazes, gestures, and movements), and is - often used for rare languages to describe the organization of the segments. + ELAN is recommended for annotating video and particularly multimodality (for + example, components such as gazes, gestures, and movements), and is often used for + rare languages to describe the organization of the segments.

It should be pointed out here that whereas Transcriber and CLAN files nearly always contain classical orthographic transcriptions, this is not the case for Praat and ELAN files. As our goal is to provide a generic solution for - long-term conservation and use for any type of project, conversion of all types of files - produced by the four tools cited above will be possible. It is up to the user to - determine which part of a corpus can be used with a classical approach, which parts - should not, and how they should be processed.

+ ref="#R54">ELAN files. As our goal is to provide a generic solution for long-term + conservation and use for any type of project, conversion of all types of files produced + by the four tools cited above will be possible. It is up to the user to determine which + part of a corpus can be used with a classical approach, which parts should not, and how + they should be processed.

The list of tools reflects the uses and practices in the CORLI network, and is very similar to the list suggested by Schmidt (2011) with the exception of EXMARaLDA and - FOLKER. - These two tools already have built-in conversion features, so adding them to the - + target="#exmaralda"/>EXMARaLDA and + FOLKER. These two tools already have built-in + conversion features, so adding them to the TEICORPO project would be easy at a later date.

Alignment applications deal with two main types of data presentation and organization. The presentation of the data has direct consequences for how the data are exploited, and @@ -450,26 +473,29 @@ offered by the presentation formats, and because the same software, even within the same presentation models, rarely provides a solution for all the needs of all users, researchers often have to use two or more pieces of software.

-

The use of multiple tools is quite common. For example, - Praat and - Transcriber cannot be - used when working on video recordings because these programs are limited to audio - formats. But if researchers need to conduct spectral analysis for some purpose, they - will have to use the - Praat software and convert not only the transcription, but also the - media. In the field of language acquisition, where the - CLAN software is generally used - to describe both the child productions and the adult productions, when researchers are - interested in gestures, they use the ELAN software, importing the CLAN file to add - gesture tiers, as ELAN is more suitable for the fine-grained analysis - of visual data. Another common practice consists in first doing a rapid transcription - using only orthographic annotations in Transcriber and then in a second stage annotating - some more interesting excerpts in greater detail including new information. In this case +

The use of multiple tools is quite common. For example, + Praat and + Transcriber cannot be used when working on video + recordings because these programs are limited to audio formats. But if researchers need + to conduct spectral analysis for some purpose, they will have to use the + Praat software and convert not only the + transcription, but also the media. In the field of language acquisition, where the + CLAN software is generally used to describe both + the child productions and the adult productions, when researchers are interested in + gestures, they use the ELAN software, importing the CLAN file to add gesture + tiers, as ELAN is more suitable for the fine-grained analysis of visual data. + Another common practice consists in first doing a rapid transcription using only + orthographic annotations in Transcriber and then in a second stage annotating some more + interesting excerpts in greater detail including new information. In this case researchers will import the first transcription file into other tools such as Praat or - ELAN and annotate them partially. It is therefore necessary to import or export + ELAN and annotate them partially. It is therefore necessary to import or export files in different formats if researchers need to use different tools for different parts of their work.

Another need concerns the pooling of corpora coming from other resources or other @@ -484,9 +510,9 @@ requirements of the conversion process options. For these reasons, we decided in the CORLI consortium and in collaboration with the ORTOLANG infrastructure to design a common tool that could be used by the whole linguistic community. The goal was to make - open-source software with proper maintenance freely available on - .

+ open-source software with proper maintenance freely available on + .

Conversion to and from TEI @@ -501,31 +527,34 @@ metadata and all the macrostructure information into the TEI format.

Basic Structures -

Converting the metadata is straightforward, as the four tools ( - CLAN, ELAN, - Praat, and - - Transcriber) do not enable a large amount of metadata to be - edited. Most of the metadata available concerns the content of the sequence; some user - metadata is also available, especially in - CLAN - . The insertion of metadata follows the +

Converting the metadata is straightforward, as the four tools ( + CLAN, ELAN, + Praat, and + Transcriber) do not enable a large amount of + metadata to be edited. Most of the metadata available concerns the content of the + sequence; some user metadata is also available, especially in + CLAN . The insertion of metadata follows the indications of the ISO/TEI 24624:2016 standard (ISO 2016).
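By way of illustration only (the element choice below follows general TEI practice for spoken corpora rather than a fixed TEICORPO template), such metadata ends up in the ordinary TEI header:

    <teiHeader>
      <fileDesc>
        <titleStmt><title>mother–child interaction, session 03 (illustrative)</title></titleStmt>
        <publicationStmt><p>unpublished working file</p></publicationStmt>
        <sourceDesc>
          <recordingStmt>
            <recording type="audio">
              <!-- file name invented for the example -->
              <media mimeType="audio/wav" url="session03.wav"/>
            </recording>
          </recordingStmt>
        </sourceDesc>
      </fileDesc>
      <profileDesc>
        <particDesc>
          <person xml:id="MOT" role="mother"/>
          <person xml:id="CHI" role="child"/>
        </particDesc>
      </profileDesc>
    </teiHeader>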

Moreover, some tools, such as Transcriber, include information about silences, - pauses, and events in their XML format. This information is also processed within - - TEICORPO, once again following the recommendations of the ISO/TEI standard.

+ pauses, and events in their XML format. This information is also processed within + TEICORPO, once again following the + recommendations of the ISO/TEI standard.

Conversion of the main data, the transcription and the annotations, cannot always be done solely on the basis of the description provided in the ISO/TEI guidelines. These - guidelines do, however, suffice to fully describe the content of the - CLAN and - - Transcriber software. We took advantage of the new annotationBlock element, - which codes several annotation levels, a function that is commonly required in - spoken-language annotations.

+ guidelines do, however, suffice to fully describe the content of the + CLAN and + Transcriber software. We took advantage of the + new annotationBlock element, which codes several annotation levels, a + function that is commonly required in spoken-language annotations.

The annotationBlock contains two major elements: the u element, which contains the transcription in orthographic form, and the spanGrp elements, which contain tier elements that annotate the utterance described in the @@ -534,25 +563,25 @@ indicated in the parent spanGrp element. and provide an example of conversion from a - CLAN file to illustrate how a production annotated on - different levels (orthography, morphosyntax, dependencies) is represented in TEI with - a first main utterance element u to which two spanGrps are linked, - one for each annotation level, in our case one spanGrp for morphosyntax and - one spanGrp for dependencies (see ). A - timeline element gives the start (T1) and end (T2) - timecodes and an annotationBlock element specifies the speaker with the - who attribute and the start and end attributes with - the timecode anchors #T1 and #T2. The annotationBlock - element includes both the utterance element and the two annotations. No semantic - constraint is imposed on the inner content of the span elements. The content of the - type attribute in the spanGrp element represents and documents - the choice of the researchers who produced the original corpus. The content generated - is preserved as it was in the original file, making backward conversion possible. In - , the mor and - gra attribute values represent grammatical knowledge. Using the content - of these elements to produce advanced grammatical representation in more elaborate TEI - and XML formats is of course possible, but would be a tailored task which is beyond - the scope of the + CLAN file to illustrate how a production + annotated on different levels (orthography, morphosyntax, dependencies) is represented + in TEI with a first main utterance element u to which two spanGrps + are linked, one for each annotation level, in our case one spanGrp for + morphosyntax and one spanGrp for dependencies (see ). A timeline element gives the start (T1) and + end (T2) timecodes and an annotationBlock element specifies the + speaker with the who attribute and the start and end + attributes with the timecode anchors #T1 and #T2. The + annotationBlock element includes both the utterance element and the two + annotations. No semantic constraint is imposed on the inner content of the span + elements. The content of the type attribute in the spanGrp element + represents and documents the choice of the researchers who produced the original + corpus. The content generated is preserved as it was in the original file, making + backward conversion possible. In , the + mor and gra attribute values represent grammatical knowledge. + Using the content of these elements to produce advanced grammatical representation in + more elaborate TEI and XML formats is of course possible, but would be a tailored task + which is beyond the scope of the TEICORPO project.
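A stripped-down sketch of such a block, with timing values taken from the CLAN utterance quoted below and the tier contents reduced to comments (the real conversion keeps the original CHAT strings unchanged, and the actual TEICORPO output may differ in detail):

    <timeline unit="ms" origin="#T0">
      <when xml:id="T0"/>
      <when xml:id="T1" interval="2263675" since="#T0"/>
      <when xml:id="T2" interval="2265197" since="#T0"/>
    </timeline>
    ...
    <annotationBlock who="#MOT" start="#T1" end="#T2">
      <u>look at the tree !</u>
      <!-- one spanGrp per dependent tier; the type attribute preserves the original tier name -->
      <spanGrp type="mor">
        <span><!-- %mor line copied verbatim from the CHAT file --></span>
      </spanGrp>
      <spanGrp type="gra">
        <span><!-- %gra line copied verbatim from the CHAT file --></span>
      </spanGrp>
    </annotationBlock>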

*MOT: look at the tree ! 2263675_2265197 @@ -591,22 +620,25 @@ tools, a single-level annotation structure within the spanGrp elements is insufficient to represent the complex organization that can be constructed with the ELAN and - Praat tools. ELAN is a tool used by many researchers to - describe data of greater complexity than the data presented in the ISO/TEI guidelines. - As the goal of the - TEICORPO project was to convert all types of structure used in the - spoken language community, including ELAN and - Praat, it was necessary to extend - the description method presented in .

-

In ELAN and - Praat, the multitiered annotations can be organized in a - structured manner. These tools take advantage of the partition presentation of the - data, so that the relationship between a parent tier and a child tier can be precisely - organized. There are two main types of organization: symbolic and temporal.

+ >ELAN and + Praat tools. ELAN is a tool used by many + researchers to describe data of greater complexity than the data presented in the + ISO/TEI guidelines. As the goal of the + TEICORPO project was to convert all types of + structure used in the spoken language community, including ELAN and + Praat, it was necessary to extend the description + method presented in .

+

In ELAN and + Praat, the multitiered annotations can be + organized in a structured manner. These tools take advantage of the partition + presentation of the data, so that the relationship between a parent tier and a child + tier can be precisely organized. There are two main types of organization: symbolic + and temporal.

In symbolic division, the elements of a child tier, C1 to Cn, can be related to an element of a parent tier P. For example, a word is divided into morphemes. In , the main @@ -653,7 +685,8 @@ usual spoken language corpus (such as those described in Schmidt 2011). However, as this type of data is produced by members of the CORLI consortium, it needs to be preserved. Encoding the data in TEI using a standard tool makes the process - reproducible, which is one of the goals of + reproducible, which is one of the goals of TEICORPO.

Although this type of data is not described in the ISO/TEI guidelines, it is in fact possible to store it in TEI format using current TEI features. TEI provides a general @@ -662,30 +695,31 @@ attributes that can point to other elements or to timelines. Using this coding schema, it is therefore possible to store any type of structure, symbolic and/or temporal, that can be generated with ELAN or - + type="soft.name" ref="#R88">ELAN or PRAAT, as described above.

To do this, each element which is in a symbolic or temporal relation is represented by a spanGrp element of the TEI. The spanGrp element contains as many span elements as necessary to store all the elements present in the ELAN or - PRAAT representation. The parent element of a spanGrp is the - main annotationBlock element when the division in ELAN or PRAAT is - the first division of a main element. The parent element is another span - element when the division in ELAN or - PRAAT is a subdivision of another element - which is not a main element. This XML structure is complemented by explicit - information as allowed in TEI. The span elements are linked to the element - they depend on, either with a symbolic link using the target attribute of - the span element, or with temporal links using the from and - to attributes of the span element.

+ >ELAN or + PRAAT representation. The parent element of a + spanGrp is the main annotationBlock element when the division in + ELAN or PRAAT is the first division of a main element. The parent element is + another span element when the division in ELAN or + PRAAT is a subdivision of another element which + is not a main element. This XML structure is complemented by explicit information as + allowed in TEI. The span elements are linked to the element they depend on, + either with a symbolic link using the target attribute of the span + element, or with temporal links using the from and to attributes + of the span element.

Two examples of how this is displayed in a TEI file are given below. The first example (see and ) corresponds to the ELAN example above (see ELAN example above (see , ). The TEI encoding represents the words of the sentence from left to right (from gahwat to endi in our example). The detail of the @@ -743,14 +777,14 @@

The second example is structured using time references. This example (see and ) corresponds to the - Praat example above (see , ). In this case, each part - of the transcription is represented according to the timeline, but there is also a - hierarchy which is represented by the spanGrp and span tags. Each - span is part of the parent spanGrp with starting and ending points - (which correspond to the from and to attributes in the example - below). The use of from + />) corresponds to the + Praat example above (see , ). In this case, each part of the transcription is represented according to the + timeline, but there is also a hierarchy which is represented by the spanGrp + and span tags. Each span is part of the parent spanGrp with + starting and ending points (which correspond to the from and to + attributes in the example below). The use of from to versus target is the only difference between the two organizations. In the example below, the syllable Sa is divided into two phonemes, S and a (see xml:id s s34, s36, and @@ -758,7 +792,7 @@
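The two linking mechanisms can be sketched side by side as follows (identifiers and contents are illustrative; only the s34/s36 phoneme case is taken from the example discussed above):

    <!-- symbolic link: each span points at the element it subdivides -->
    <spanGrp type="morphemes">
      <span target="#w3">gahw-</span>
      <span target="#w3">-at</span>
    </spanGrp>

    <!-- temporal link: each span is anchored to points of the shared timeline -->
    <spanGrp type="phonemes">
      <span xml:id="s34" from="#T10" to="#T11">S</span>
      <span xml:id="s36" from="#T11" to="#T12">a</span>
    </spanGrp>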

ELAN example of a temporal division + type="soft.name" ref="#98">ELAN example of a temporal division
@@ -787,9 +821,10 @@

The spanGrp and span offer a generic representation of data coming from relatively unconstrained representations produced by partition software. The names of the tiers used in the ELAN and - Praat tools are given in the content of - the type attribute. These names are not used to provide structural + type="soft.name" ref="#R99">ELAN and + Praat tools are given in the content of the + type attribute. These names are not used to provide structural information, the structure being represented only by the spanGrp and span hierarchy. However, the organization into spanGrp and span is not always sufficient to represent all the details of the tier @@ -810,16 +845,18 @@

Exporting to Research Tools

In the - TEICORPO approach, no modification is made to the original format and conversion - remains as lossless as possible. This allows for all types of corpora to be stored for - long-term preservation purposes. It also allows the corpora to be used with other - editing tools, some of which are suited to specific processing: for example, - Praat for - phonetics/phonology; - Transcriber/ - CLAN for raw transcription; and ELAN for gesture - and visual coding.

+ TEICORPO approach, no modification is made to the + original format and conversion remains as lossless as possible. This allows for all + types of corpora to be stored for long-term preservation purposes. It also allows the + corpora to be used with other editing tools, some of which are suited to specific + processing: for example, + Praat for phonetics/phonology; + Transcriber/ + CLAN for raw transcription; and ELAN for gesture and visual coding.

However, a large proportion of scientific research and applications done using corpora requires further processing of the data. For example, although querying or using raw language forms is possible, many research investigations and tools use words, parts of @@ -831,8 +868,8 @@ TEI file can contain standardized information about words, specific spoken language information, and sometimes even POS information.

This approach was not adopted in - TEICORPO for several reasons. First, we had to deal - with a large variety of coding approaches, which makes it difficult to conduct work + TEICORPO for several reasons. First, we had to + deal with a large variety of coding approaches, which makes it difficult to conduct work similar to that done in CHILDES (MacWhinney 2000; see ). Second, there was no consensus about the way tokenization should be performed, as many researchers consider @@ -847,7 +884,8 @@ Second, we decided to design another category of tools for processing or making it possible to process the spoken language corpus, and to use powerful tools in corpus analysis. This part of the - TEICORPO library is described in the next section.

+ TEICORPO library is described in the next + section.

@@ -870,27 +908,28 @@
Basic Import and Export Functions

The command-line interface (see ) can - perform conversions between TEI and the formats used by the following programs: - - CLAN, - ELAN, - Praat, and - Transcriber. The conversions can be performed on single files - or on whole directories or on a file tree. The command-line interface is suited to - automatic processing in offline environments. The online interface (see - ) can convert one or several files - selected by the user, but not whole directories. Results appear in the user’s download - folder.

+ perform conversions between TEI and the formats used by the following programs: + CLAN, ELAN, + Praat, and + Transcriber. The conversions can be performed on + single files or on whole directories or on a file tree. The command-line interface is + suited to automatic processing in offline environments. The online interface (see + ) can + convert one or several files selected by the user, but not whole directories. Results + appear in the user’s download folder.

In addition to the conversion to and from the alignment software, the online version of - - TEICORPO offers import and export in common spreadsheet formats (.xlsx and .csv) and - word processing formats (.docx and .txt). Importing data is useful to create new data, - and exporting is used to make reports or examples for a publication and for end users - not familiar with transcription tasks or computer software (see and ).

+ + TEICORPO offers import and export in common + spreadsheet formats (.xlsx and .csv) and word processing formats (.docx and .txt). + Importing data is useful to create new data, and exporting is used to make reports or + examples for a publication and for end users not familiar with transcription tasks or + computer software (see and ).

Visual representation of data from Example 1 after being processed through TEI @@ -938,31 +977,34 @@ transcription.

Other features are available in both types of interface (command line and web service). - - TEICORPO allows the user to exclude some tiers, for example adult tiers in acquisition - research where the user wants to study child production only, or comment tiers which are - not necessary for some studies.

+ + TEICORPO allows the user to exclude some tiers, + for example adult tiers in acquisition research where the user wants to study child + production only, or comment tiers which are not necessary for some studies.

Export to Specialized Software -

Another kind of export concerns textometric software. - TEICONVERT makes spoken language - data available for TXM (Heiden 2010; see ), - Le Trameur (Fleury and Zimina 2014; see ), and +

Another kind of export concerns textometric software. + TEICONVERT makes spoken language data available + for TXM (Heiden + 2010; see ), + Le Trameur (Fleury and Zimina 2014; see ), + and Iramuteq (see and de Souza et - al. 2018), providing a dedicated TEI export for these tools. For example, for - the TXM software, the export includes a text element made of utterance elements - including age and speaker attributes. - presents an example for the TXM software.
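A sketch of the shape such an export takes (attribute names and values here are indicative only, not the exact TEICORPO output):

    <text>
      <u who="MOT" age="33">look at the tree !</u>
      <u who="CHI" age="2">yes .</u>
    </text>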

+ target="http://iramuteq.org/"/> and de Souza et al. 2018), providing a + dedicated TEI export for these tools. For example, for the TXM software, the + export includes a text element made of utterance elements including age and speaker + attributes. presents an example for the + TXM software.

@@ -1007,9 +1049,9 @@ Example of export for the Lexico or Le Trameur software
-

Likewise, another export is available for the textometric tool Iramuteq without - timelines (see ).

+

Likewise, another export is available for the textometric tool Iramuteq + without timelines (see ).

**** -*MOT you have to rest now ? -*CHI yes . -*MOT from your big singing extravaganza ? -*CHI yes that was a party . -*MOT woof . -*MOT that was a party that @@ -1017,10 +1059,10 @@ Example of export for the IRAMUTEQ software

In all these cases, - TEICORPO is able to provide an export file and to remove - unnecessary information from the TEI pivot format. This is useful, for example, with - textometric software, which works only with orthographic tiers without a timeline or - dependent information.

+ TEICORPO is able to provide an export file and to + remove unnecessary information from the TEI pivot format. This is useful, for example, + with textometric software, which works only with orthographic tiers without a timeline + or dependent information.

@@ -1031,25 +1073,28 @@ often they run only on raw orthographic material, excluding other information. Moreover, their results are not always in a format that can be used with traditional spoken language software such as - CLAN - , ELAN, - Praat, - Transcriber, nor of course in TEI - format.

+ CLAN , ELAN, + Praat, + Transcriber, nor of course in TEI format.

- TEICORPO provides a way to solve this problem by running analyzers and putting the - results from the analysis back into TEI format. Once the TEI format has been enriched - with grammatical information, it is possible to use the results and convert them back to - ELAN or - Praat and use the grammatical information in these spoken language - software packages. It is also possible to export to TXM and to use the grammatical - information in the textometric software. Two grammatical analyzers have been implemented - in - TEICORPO: - TreeTagger and + TEICORPO provides a way to solve this problem by + running analyzers and putting the results from the analysis back into TEI format. Once + the TEI format has been enriched with grammatical information, it is possible to use the + results and convert them back to ELAN or + Praat and use the grammatical information in these + spoken language software packages. It is also possible to export to TXM and to use the + grammatical information in the textometric software. Two grammatical analyzers have been + implemented in + TEICORPO: + TreeTagger and CoreNLP.

@@ -1057,25 +1102,30 @@

TreeTagger

Accessed March 11, 2021, .

- (Schmid 1994; 1995) is a tool for annotating text with part-of-speech - and lemma information. The software is freely available for research, education, and - evaluation. It is available in twenty-five languages, provides high-quality results, - and can be easily improved by enriching the training set, as was done for instance by - Benzitoun, Fort, and Sagot (2012) in - the PERCEO project. They defined a syntactic model suitable for spoken language - corpora, using the training feature of TreeTagger and an iterative process including - manual corrections to improve the results of the automatic tool.

+ target="https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/"/>.

+ (Schmid + 1994; 1995) is a tool for annotating text with + part-of-speech and lemma information. The software is freely available for research, + education, and evaluation. It is available in twenty-five languages, provides + high-quality results, and can be easily improved by enriching the training set, as was + done for instance by Benzitoun, Fort, and Sagot (2012) in the PERCEO project. They defined a syntactic + model suitable for spoken language corpora, using the training feature of TreeTagger + and an iterative process including manual corrections to improve the results of the + automatic tool.

The command-line version of - TEICORPO should be used to generate an annotated file - with lemma and POS information based on - TreeTagger. - TreeTagger should be installed - separately. The implementation of - TreeTagger in - TEICORPO includes the ability to use - any syntactic model. For French data, we used the PERCEO model (TEICORPO should be used to generate an annotated + file with lemma and POS information based on + TreeTagger. + TreeTagger should be installed separately. The + implementation of + TreeTagger in + TEICORPO includes the ability to use any + syntactic model. For French data, we used the PERCEO model (Benzitoun, Fort, and Sagot 2012).

The command line to be used is: java -cp TEICORPO.jar fr.ortolang.TEICORPO.TeiTreeTagger filenames... with additional @@ -1090,16 +1140,18 @@

-model filename

-

filename is the full name of the - TreeTagger syntactic model. In - our case, we use the PERCEO model.

+

filename is the full name of the + TreeTagger syntactic model. In our case, + we use the PERCEO model.

-program filename

-

filename is the full location of the - TreeTagger program, according - to the system used (Windows, MacOS, or Linux).

+

filename is the full location of the + TreeTagger program, according to the system + used (Windows, MacOS, or Linux).

-normalize @@ -1109,9 +1161,9 @@

The environment variable TREE_TAGGER can be used to locate the model and the program. - If no -program option is used, the default name for the - TreeTagger - program is used.

+ If no -program option is used, the default name for the + TreeTagger program is used.

The -model parameter is mandatory.

The resulting filename ends with .tei_corpo_ttg.tei_corpo.xml or a specific name provided by the user (option -o).

@@ -1230,16 +1282,18 @@ Stanford CoreNLP

The Stanford Core Natural Language Processing -

Accessed March 11, 2021, .

+

Accessed March 11, 2021, .

( - CoreNLP) package is a suite of tools (Manning et al. 2014) that can be used under a GNU General Public License. The - suite provides several tools such as a tokenizer, a POS tagger, a parser, a named - entity recognizer, temporal tagging, and coreference resolution. All the tools are - available for English, but only some of them are available for all languages. All - software libraries are integrated into Java JAR files, so all that is - required is to download JAR files from the CoreNLP website + CoreNLP) package is a suite of tools (Manning et al. + 2014) that can be used under a GNU General Public License. The suite + provides several tools such as a tokenizer, a POS tagger, a parser, a named entity + recognizer, temporal tagging, and coreference resolution. All the tools are available + for English, but only some of them are available for all languages. All software + libraries are integrated into Java JAR files, so all that is required is to + download JAR files from the CoreNLP website

Accessed May 5, 2021, .

to use them with @@ -1253,7 +1307,7 @@

The directory_for_SNLP is the name of the location on a computer where all the CoreNLP JAR files can be found. Note that using the CoreNLP software makes heavy demands on the computer’s memory resources and it is necessary to instruct the - Java software to use a large amount of memory (for example to insert parameter -mx5g before parameter -cp to indicate that 5 GB of memory will be used for a full English analysis).

@@ -1270,19 +1324,20 @@ Exporting the Grammatical Analysis

The results from the grammatical analysis can be used in transcription files such as those used by - Praat and ELAN. A partition-like visual presentation of data - is very handy to represent a part of speech or a CONLL result. The orthographic line - will appear at the top with divisions into words, divisions into parts of speech, and - other syntactic information below. As the result of the analysis can contain a large - number of tiers (each speaker will have as many tiers as there are elements in the - grammatical analysis: for example, word, POS, and lemma for TreeTagger; ten tiers for - CoreNLP full analysis), it is helpful to limit the number of visible tiers, either using - the -a option of - TEICORPO, or limiting the display with the annotation - tool.

+ Praat and ELAN. A partition-like visual + presentation of data is very handy to represent a part of speech or a CONLL result. The + orthographic line will appear at the top with divisions into words, divisions into parts + of speech, and other syntactic information below. As the result of the analysis can + contain a large number of tiers (each speaker will have as many tiers as there are + elements in the grammatical analysis: for example, word, POS, and lemma for TreeTagger; + ten tiers for CoreNLP full analysis), it is helpful to limit the number of visible + tiers, either using the -a option of + TEICORPO, or limiting the display with the + annotation tool.

An example is presented below in the ELAN tool (see ELAN tool (see ). The original utterance was si c’est comme ça je m’en vais (if that’s how it is, I’m leaving). It is displayed in the first line, highlighted in pink. The analysis into words (second line, consisting of numbers), @@ -1293,23 +1348,22 @@

Example of - TreeTagger - analysis representation in a partition - software program + TreeTagger analysis representation in a + partition software program

Export can be done from TEI into a format used by textometric software (see ). This is the case for TXM, -

See the Textométrie website, last updated June 29, 2020, .

+ type="software" xml:id="R160" target="#txm"/>TXM, +

See the Textométrie website, last updated June 29, 2020, .

a textometric software application. In this case, instead of using a partition representation, the information from the grammatical analysis is inserted at the word level in an XML structure. For example, in the case below, the TXM export includes - - TreeTagger annotations in POS, adding lemma and pos attributes to - the word element w.

+ + TreeTagger annotations in POS, adding + lemma and pos attributes to the word element w.
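As a rough illustration of this mechanism, using the sample utterance si c'est comme ça je m'en vais discussed elsewhere in this article: each word element carries the lemma and part of speech proposed by TreeTagger (only some words are shown; the tag values are indicative and depend on the model used, e.g. PERCEO for French, and the speaker identifier is invented):

    <!-- illustrative sketch: TreeTagger lemma and pos folded into w elements -->
    <u who="SPK1">
      <w lemma="si" pos="KON">si</w>
      <w lemma="être" pos="VER:pres">est</w>
      <w lemma="comme" pos="KON">comme</w>
      <w lemma="aller" pos="VER:pres">vais</w>
    </u>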

@@ -1340,105 +1394,113 @@ - Example of - - TreeTagger analysis representation that can be imported - into TXM + Example of + TreeTagger analysis representation that can be + imported into TXM
Comparison with Other Software Suites

The additional functionalities available in the - TEICORPO suite are close to those - available in the - Weblicht web services ( Hinrichs, Hinrichs, and Zastrow 2010). To a certain extent, the two suites of - tools ( - - Weblicht and - TEICORPO) have the same purpose and functionalities. They can import - data from various formats, run similar processes on the data, and export the data for - scientific uses. In some cases, the services could complement each other or - TEICORPO - could be integrated in the - Weblicht services. This is the case, for example, for - handling the CHILDES format, which at the time of writing is more functional in - TEICORPO - than in - +

The additional functionalities available in the + TEICORPO suite are close to those available in the + + Weblicht web services ( Hinrichs, Hinrichs, and Zastrow + 2010). To a certain extent, the two suites of tools ( + Weblicht and + TEICORPO) have the same purpose and + functionalities. They can import data from various formats, run similar processes on the + data, and export the data for scientific uses. In some cases, the services could + complement each other or + TEICORPO could be integrated in the + Weblicht services. This is the case, for example, + for handling the CHILDES format, which at the time of writing is more functional in + TEICORPO than in Weblicht.

A major difference between the two suites is in the way they can be used and in the type of data they target. - TEICORPO is intended to be used not as an independent tool, - but as a utility tool that helps researchers to go from one type of data to another. For - example, the syntactic analysis is intended to be used as a first step before being used - in tools such as - Praat, ELAN, or TXM. Our more recent developments - (see Badin et al. 2021) made it possible to - insert metadata stored in CSV files (including participant metadata) into the TEI files. - This makes it possible to achieve more powerful corpus analysis using a tool such as - TXM.

+ TEICORPO is intended to be used not as an + independent tool, but as a utility tool that helps researchers to go from one type of + data to another. For example, the syntactic analysis is intended to be used as a first + step before being used in tools such as + Praat, ELAN, or TXM. Our more + recent developments (see Badin et al. 2021) + made it possible to insert metadata stored in CSV files (including participant metadata) + into the TEI files. This makes it possible to achieve more powerful corpus analysis + using a tool such as TXM.

Our approach is somewhat similar to what is suggested in the conclusion of Schmidt, Hedeland, and Jettka (2017), who describe a - mechanism that makes it possible to use the power of - Weblicht to process their files - that are in the ISO/TEI format. A similar mechanism could be used within - TEICORPO to - take advantage of the tools that are implemented in - Weblicht. However, Schmidt, - Hedeland, and Jettka (2017) suggest in - their conclusion that it would be more interesting to work directly on ISO/TEI files - because they contain a richer format. This is exactly what we did in - TEICORPO. Our - suggestion would be to use the tools created by Schmidt, Hedeland, and Jettka (2017) directly with the - TEICORPO files, so - that their work would complement ours. Moreover, in this way, the two projects would be - compatible and provide either new functionalities when the projects have clearly - different goals, or data variants when the goals are closer.

+ mechanism that makes it possible to use the power of + Weblicht to process their files that are in the + ISO/TEI format. A similar mechanism could be used within + TEICORPO to take advantage of the tools that are + implemented in + Weblicht. However, Schmidt, Hedeland, and Jettka + (2017) suggest in their conclusion that + it would be more interesting to work directly on ISO/TEI files because they contain a + richer format. This is exactly what we did in + TEICORPO. Our suggestion would be to use the tools + created by Schmidt, Hedeland, and Jettka (2017) directly with the + TEICORPO files, so that their work would + complement ours. Moreover, in this way, the two projects would be compatible and provide + either new functionalities when the projects have clearly different goals, or data + variants when the goals are closer.

Conclusion

- TEICORPO is a functional tool, created by the CORLI network and ORTOLANG, that converts - files created by software specializing in editing spoken-language data into TEI format. - The result is fully compatible with the most recent developments in TEI, especially those - that concern spoken-language material.

+ TEICORPO is a functional tool, created by the CORLI + network and ORTOLANG, that converts files created by software specializing in editing + spoken-language data into TEI format. The result is fully compatible with the most recent + developments in TEI, especially those that concern spoken-language material.

The TEI files can also be converted back to the original formats or to other formats used in spoken-language editing to take advantage of their functionalities. This makes TEI a useful pivot format. Moreover, TEICORPO allows conversion to formats used by tools dedicated to corpus exploration and browsing.

- TEICORPO exists as a command-line interface as well as a web service. It can thus be used - by novice as well as advanced users, or by developers of linguistic software. The tool is - free and open source so it can be further used and developed in other projects.

+ TEICORPO exists as a command-line interface as well + as a web service. It can thus be used by novice as well as advanced users, or by + developers of linguistic software. The tool is free and open source so it can be further + used and developed in other projects.

- TEICORPO is intended to be part of a large set of tools using TEI for linguistic corpus - research. It can be used in parallel with or as a complement to other tools such as - Weblicht or the EXMARaLDA tools (see Schmidt, Hedeland, and Jettka 2017). A specificity of - - TEICORPO is that it is more suitable for processing extended forms of TEI data (especially - forms which are not inside the main u element in the TEI code). - TEICORPO is also - linked to - TEIMETA, a flexible tool for describing spoken language corpora in a web - interface generated from an ODD file (Etienne, Liégois, - and Parisse, accepted). As TEI enables metadata and data to be stored in the same - file, sharing this format will promote metadata sharing and will keep metadata linked to - their data during the life cycle of the data.

+ TEICORPO is intended to be part of a large set of + tools using TEI for linguistic corpus research. It can be used in parallel with or as a + complement to other tools such as Weblicht or the EXMARaLDA tools (see Schmidt, Hedeland, and Jettka 2017). A + specificity of + TEICORPO is that it is more suitable for processing + extended forms of TEI data (especially forms which are not inside the main u + element in the TEI code). + TEICORPO is also linked to + TEIMETA, a flexible tool for describing spoken + language corpora in a web interface generated from an ODD file (Etienne, Liégois, and Parisse, accepted). As TEI enables + metadata and data to be stored in the same file, sharing this format will promote metadata + sharing and will keep metadata linked to their data during the life cycle of the data.

Potential further developments could provide wider coverage of different formats such as - CMDI or linked data for editing or data exploration purposes; allow - TEICORPO to work with - other external tools such as grammatical analyzers; or enable the visualization of - multilevel annotations.

+ CMDI or linked data for editing or data exploration purposes; allow + TEICORPO to work with other external tools such as + grammatical analyzers; or enable the visualization of multilevel annotations.

diff --git a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml index 6475aa0f..94556410 100644 --- a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml +++ b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml @@ -162,39 +162,39 @@ />.

in favor of a web-based publication. While this decision was critical in that it allowed us to select the most supported and widely-used medium, we soon discovered that it did not make choices any simpler. On the one hand, the XSLT stylesheets provided by TEI are great for HTML rendering, but do not - include support for image-related features (such as the text-image linking available - thanks to the P5 version of the TEI schema) and tools (including zoom in/out, magnifying - lens, and hot spots) that represent a significant part of a digital facsimile and/or - diplomatic edition; other features, such as an XML search engine, would have to be - integrated separately, in any case. On the other hand, there are powerful frameworks - based on CMS

The Omeka framework () supports - publishing TEI documents; see also Drupal () and - - TEICHI ( + type="software" xml:id="R1" target="#XSLT"/>XSLT + stylesheets provided by TEI are great for HTML rendering, but do not include support for + image-related features (such as the text-image linking available thanks to the P5 + version of the TEI schema) and tools (including zoom in/out, magnifying lens, and hot + spots) that represent a significant part of a digital facsimile and/or diplomatic + edition; other features, such as an XML search engine, would have to be integrated + separately, in any case. On the other hand, there are powerful frameworks based on + CMS

The Omeka framework () supports publishing TEI documents; see also + Drupal () + and TEICHI ( ).

and other web - technologies

Such as the - eXist XML database, .

which looked far too complex and - expensive, particularly when considering future maintenance needs, for our project’s - purposes. Other solutions, such as the - EPPT - software

Edition Production and Presentation Technology, - - .

developed by K. - Kiernan or the + technologies

Such as the + eXist XML database, .

which looked far too + complex and expensive, particularly when considering future maintenance needs, for our + project’s purposes. Other solutions, such as the + EPPT software

Edition Production and Presentation Technology, + .

developed by K. Kiernan or + the - Elwood viewer

Elwood Viewer, - - .

created - by G. Lyman, either were not yet ready or were unsuitable for other reasons (proprietary - software, user interface issues, specific hardware and/or software requirements).

+ Elwood viewer

Elwood Viewer, + .

created by G. Lyman, either were + not yet ready or were unsuitable for other reasons (proprietary software, user interface + issues, specific hardware and/or software requirements).

Standard vs. Fragmentation @@ -212,9 +212,9 @@
First Experiments

At first, however, - EVT was more an experimental research project for students at the - Informatica Umanistica - course of the University of Pisa

BA course, EVT was more an experimental research project for + students at the Informatica + Umanistica course of the University of Pisa

BA course, .

than a real attempt to solve the digital edition viewer problem. We aimed at investigating some user interface–related aspects of such a viewer, in particular certain usability problems @@ -238,8 +238,8 @@ EVT Version
- EVT v. 2.0: - Rebooting the Project + EVT + v. 2.0: Rebooting the Project

To get out of the impasse we decided to completely reboot the project, removing secondary features and giving priority to fundamental ones. We also found a solution for the data-loading problem: instead of finding a way to load the data into the software we @@ -248,60 +248,63 @@ text, with very little configuration needed to create the edition. This approach also allowed us to quickly test XML files belonging to other edition projects, to check if EVT could go beyond being a project-specific tool. The inspiration for these changes - came from work done in similar projects developed within the TEI community, namely - + came from work done in similar projects developed within the TEI community, namely - TEI Boilerplate,

TEI - Boilerplate, - .

+ TEI Boilerplate,

TEI Boilerplate, + + .

- John A. - Walsh’s collection of XSLT stylesheets,

tei2html + John A. Walsh’s collection of XSLT stylesheets,

+ tei2html , .

and Solenne Coutagne’s - work for the Berliner Intellektuelle 1800–1830 - project.

Digitale Edition Briefe und Texte aus dem - intellektuellen Berlin um 1800, .

and Solenne + Coutagne’s work for the Berliner + Intellektuelle 1800–1830 project.

Digitale Edition Briefe und Texte aus dem intellektuellen Berlin um 1800, .

- Through this approach, we achieved two important results: first, usage of - EVT is quite - simple—the user applies an XSLT stylesheet to their already marked-up file(s), - and when the processing is finished they are presented with a web-ready edition; second, - the web edition that is produced is based on a client-only architecture and does not - require any additional kind of server software, which means that it can be simply copied - on a web server to be used at once, or even on a cloud storage service (provided that it - is accessible by the general public).

+ Through this approach, we achieved two important results: first, usage of + EVT is quite simple—the user applies an XSLT + stylesheet to their already marked-up file(s), and when the processing is finished they + are presented with a web-ready edition; second, the web edition that is produced is + based on a client-only architecture and does not require any additional kind of server + software, which means that it can be simply copied on a web server to be used at once, + or even on a cloud storage service (provided that it is accessible by the general + public).

To ensure that it will be working on all the most recent web browsers, and for as long - as possible on the World Wide Web itself, - EVT is built on open and standard web - technologies such as HTML, CSS, and JavaScript. Specific - features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen - from the best-supported open-source ones to reduce the risk of future incompatibilities. - The general architecture of the software, in any case, is modular, so that any component - which may cause trouble or turn out to be not completely up to the task can be replaced - easily.

+ as possible on the World Wide Web itself, + EVT is built on open and standard web technologies + such as HTML, CSS, and JavaScript. Specific features, such as the magnifying + lens, are entrusted to jQuery plug-ins, again chosen from the best-supported open-source + ones to reduce the risk of future incompatibilities. The general architecture of the + software, in any case, is modular, so that any component which may cause trouble or turn + out to be not completely up to the task can be replaced easily.

How it Works

Our ideal goal was to have a simple, very user-friendly drop-in tool, requiring little work and/or knowledge of anything beyond XML from the editor. To reach this goal, EVT is based on a modular structure where a single stylesheet (evt_builder.xsl) - starts a chain of XSLT 2.0 transformations calling in turn all the - other modules. The latter belong to two general categories: those devoted to building - the HTML site, and the XML processing ones, which extract the edition text lying between - folios using the pb element and format it according to the edition level. All - XSLT modules live inside the builder_pack folder, in order to - have a clean and well-organized directory hierarchy.

+ starts a chain of XSLT + 2.0 transformations calling in turn all the other + modules. The latter belong to two general categories: those devoted to building the HTML + site, and the XML processing ones, which extract the edition text lying between folios + using the pb element and format it according to the edition level. All XSLT + modules live inside the builder_pack folder, in order to have a clean and + well-organized directory hierarchy.
The - EVT builder_pack directory structure. + EVT + builder_pack directory structure.

Therefore, assuming the available formatting stylesheets meet your project’s criteria, @@ -314,9 +317,9 @@ evt_builder-conf.xsl, to specify for example the number of edition levels or presence of images; you can then apply the evt_builder.xsl stylesheet to your TEI XML - document using the Oxygen XML editor or another XSLT 2–compliant + document using the Oxygen XML editor or another XSLT 2–compliant engine.

@@ -326,38 +329,38 @@

When the XSLT processing is finished, the starting point for the edition is - the index.html file in the root directory, and all the HTML pages - resulting from the transformations will be stored in the output_data - folder. You can delete everything in this latter folder (and the - index.html file), modify the configuration options, and start again, - and everything will be re-created in the assigned places.

+ ref="#R23">XSLT processing is finished, the starting point for the edition is the + index.html file in the root directory, and all the HTML pages resulting + from the transformations will be stored in the output_data folder. You + can delete everything in this latter folder (and the index.html file), + modify the configuration options, and start again, and everything will be re-created in + the assigned places.

The XSLT stylesheets + ref="#R24">XSLT stylesheets

The transformation chain has two main purposes: to generate the HTML files containing the edition and to create the home page which dynamically loads the other HTML files.

The EVT builder’s transformation system is composed of a modular collection of XSLT 2.0 stylesheets: these modules are designed to permit scholars to freely - add their own stylesheets and to manage the different desired levels of the edition - without influencing other parts of the system, for instance the generation of the home - page.

+ type="software" xml:id="R25" target="#XSLT"/>XSLT + 2.0 stylesheets: these modules are designed to permit + scholars to freely add their own stylesheets and to manage the different desired levels + of the edition without influencing other parts of the system, for instance the + generation of the home page.

The transformation is performed applying a specific XSLT stylesheet + target="#XSLT"/>XSLT stylesheet (evt_builder.xsl) which includes links to all the other stylesheets that are part of the transformation chain and that will be applied to the TEI XML document containing the transcription.
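Schematically, such a driver stylesheet boils down to a list of inclusions plus an entry template; the module file names below are placeholders rather than the actual names shipped with EVT:

    <!-- sketch of a driver stylesheet chaining the other modules; file names are placeholders -->
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <xsl:include href="evt_builder-conf.xsl"/>
      <xsl:include href="modules/structure/html-site.xsl"/>
      <xsl:include href="modules/elements/tei-elements.xsl"/>
      <xsl:template match="/">
        <!-- start the transformation chain on the TEI document -->
        <xsl:apply-templates select="tei:TEI"/>
      </xsl:template>
    </xsl:stylesheet>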

- EVT can be used to create image-based editions with different edition levels starting - from a single encoded text. The text of the transcription must be divided into smaller - parts to recreate the physical structure of the manuscript. Therefore, it is essential - that paginated XML documents are marked using a TEI page break element (pb) at - the start of each new page or folio side, so that the transformation system will be able - to recognize and handle everything that stands between a pb element and the - next one as the content of a single page.

+ EVT can be used to create image-based editions with + different edition levels starting from a single encoded text. The text of the + transcription must be divided into smaller parts to recreate the physical structure of + the manuscript. Therefore, it is essential that paginated XML documents are marked using + a TEI page break element (pb) at the start of each new page or folio side, so + that the transformation system will be able to recognize and handle everything that + stands between a pb element and the next one as the content of a single + page.

The system is designed to generate an arbitrary number of edition levels: as a consequence, the user is required to indicate how many (and which) output levels they intend to create by modifying the corresponding parameter in the configuration file.
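For instance, the configuration file could expose the requested levels as a simple parameter; the parameter name and values below are invented for illustration, and the real evt_builder-conf.xsl may use a different mechanism:

    <!-- hypothetical configuration fragment: which edition levels to build -->
    <xsl:param name="edition_levels" select="('diplomatic', 'interpretative')"/>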

@@ -400,8 +403,8 @@ the transformation chain. This permits the extraction of different texts for different edition levels (diplomatic, diplomatic-interpretative) processing the same XML file, and to save them in the HTML site structure, which is available as a separate XSLT module.

+ type="software" xml:id="R30" target="#XSLT"/>XSLT + module.

The use of modes also allows users to separate template rules for the different transformations of a TEI element and to place them in different XSLT files or in @@ -425,7 +428,7 @@ required from users is to: personalize the edition generation parameter as shown above; copy their own XSLT files containing the template rules to + type="soft.name" ref="#R32">XSLT files containing the template rules to generate the desired edition levels in the directory that contains the stylesheets used for TEI element transformation (builder_pack/modules/elements); @@ -437,29 +440,30 @@
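To illustrate the use of modes mentioned above with a minimal sketch (the mode names and the element chosen are illustrative, and the tei prefix is assumed to be bound to the TEI namespace): the same element receives one template per edition level, each in its own mode, so the rules for different levels can live in separate stylesheet files:

    <!-- illustrative only: one template per edition level, separated by mode -->
    <xsl:template match="tei:choice" mode="dipl">
      <!-- diplomatic level: keep the abbreviated form -->
      <xsl:apply-templates select="tei:abbr" mode="dipl"/>
    </xsl:template>

    <xsl:template match="tei:choice" mode="interp">
      <!-- diplomatic-interpretative level: prefer the expanded form -->
      <xsl:apply-templates select="tei:expan" mode="interp"/>
    </xsl:template>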

For the time being, this kind of customization has to be done by hand-editing the - configuration files, but in a future version of - EVT we plan to add a more user-friendly - way to configure the system.

+ configuration files, but in a future version of + EVT we plan to add a more user-friendly way to + configure the system.

Features

At present, - EVT can be used to create image-based editions with two possible edition - levels: diplomatic and diplomatic-interpretative; this means that a transcription - encoded using elements belonging to the appropriate TEI module

See chapter 11, - Representation of Primary Sources, in the EVT can be used to create image-based editions with + two possible edition levels: diplomatic and diplomatic-interpretative; this means that a + transcription encoded using elements belonging to the appropriate TEI module

See + chapter 11, Representation of Primary Sources, in the TEI Guidelines.

should already be compatible with - EVT, or require only minor changes to be made compatible. The Vercelli - Book transcription schema is based on the standard TEI schema, with no custom elements - or attributes added: our tests with similarly encoded texts showed a high grade of - compatibility. A critical edition level is currently being researched and it will be - added in the future.

+ EVT, or require only minor changes to be made + compatible. The Vercelli Book transcription schema is based on the standard TEI schema, + with no custom elements or attributes added: our tests with similarly encoded texts + showed a high grade of compatibility. A critical edition level is currently being + researched and it will be added in the future.

When the website produced by - EVT is loaded in a browser, the viewer will be presented - with the manuscript image on the left side, and the corresponding text on the right: - this is the default view, but on the main toolbar at the top right corner of the browser - window there are icons to access all the available views: + EVT is loaded in a browser, the viewer will be + presented with the manuscript image on the left side, and the corresponding text on the + right: this is the default view, but on the main toolbar at the top right corner of the + browser window there are icons to access all the available views: Image-Text view: as mentioned above, this is the default view showing a manuscript folio image and the corresponding text in one or more edition levels; @@ -484,8 +488,8 @@ is that the editor should encode folio numbers by means of the pb element including r and v letters to mark recto and verso pages, respectively. - EVT will take care of automatically associating each - folio to the images copied in the input_data/images folder using a + EVT will take care of automatically associating + each folio to the images copied in the input_data/images folder using a verso-recto naming scheme (for example: 104v-105r.png). It is of course possible that in some cases the transformation process is unable to return the correct result: this is why we decided to @@ -493,9 +497,10 @@ independent from the HTML interface; this file will be updated automatically every time the transformation process is started and can be customized by the editor.

Although the different views access different kinds of content, such as single side and - double side images, the navigation algorithms used by - EVT allow the user to move from - one view to another without losing the current browsing position.

+ double side images, the navigation algorithms used by + EVT allow the user to move from one view to another + without losing the current browsing position.

All content is shown inside HTML frames designed to be as flexible as possible. No matter what view one is currently in, one can expand the desired frame to focus on its specific content, temporarily hiding the other components of the user interface. It is @@ -539,25 +544,27 @@

A First Use Case

On December 24, 2013, after extensive testing and bug fixing work, the - - EVT team - published a beta version of the Digital - Vercelli Book edition,

Full announcement on the project blog, On December 24, 2013, after extensive testing and bug fixing work, the + EVT team published a beta version of the Digital Vercelli Book + edition,

Full announcement on the project blog, . The beta edition is directly accessible at .

soliciting feedback from all interested parties. Shortly afterwards, the version of the - - EVT software we used, improved by more bug fixes and small enhancements, was made - available for the academic community on + EVT software we used, improved by more bug fixes and + small enhancements, was made available for the academic community on the project’s SourceForge site.

Edition Visualization Technology: Digital edition visualization - software, .

+ software, .

- The Digital Vercelli Book edition based on - EVT v. 0.1.48. - Image-text linking is active. + The Digital Vercelli Book edition based on + EVT + v. 0.1.48. Image-text linking is active.

@@ -565,9 +572,10 @@
Future Developments

- EVT development will continue during 2014 to fix bugs and to improve the current set of - features, but there are also several important features that will be added or that we are - currently considering for inclusion in + EVT development will continue during 2014 to fix bugs + and to improve the current set of features, but there are also several important features + that will be added or that we are currently considering for inclusion in EVT. Some of the planned features will require fundamental changes to the software architecture to be implemented effectively: this is probably the case for the Digital Lightbox (see

New Layout -

One important aspect that has been introduced in the current version of - EVT is a - completely revised layout: the current user interface includes all the features which - were deemed necessary for the Digital Vercelli Book beta, but it also is ready to accept - the new features planned for the short and medium terms. Note that nontrivial changes to - the general appearance and layout of the resulting web edition will be necessary, and - this is especially the case for the XML search engine and for the critical edition - support. Fortunately the basic framework is flexible enough to be easily expanded by - means of new views or a redesign of the current ones.

+

One important aspect that has been introduced in the current version of + EVT is a completely revised layout: the current + user interface includes all the features which were deemed necessary for the Digital + Vercelli Book beta, but it also is ready to accept the new features planned for the + short and medium terms. Note that nontrivial changes to the general appearance and + layout of the resulting web edition will be necessary, and this is especially the case + for the XML search engine and for the critical edition support. Fortunately the basic + framework is flexible enough to be easily expanded by means of new views or a redesign + of the current ones.

Search Engine

The - EVT search engine is already working and being tested in a separate development - branch of the software; merging into the main branch is expected as soon as the user - interface is finalized. It was implemented with the goal of keeping it simple and usable - for both academics and the general public.

+ EVT search engine is already working and being + tested in a separate development branch of the software; merging into the main branch is + expected as soon as the user interface is finalized. It was implemented with the goal of + keeping it simple and usable for both academics and the general public.

To achieve this goal we began by studying various solutions that could be used as a basis for our efforts. In the first phases of this study we looked at the principal XML databases, such as of - BaseX, - eXist, etc., and we found a solution by envisioning - - EVT as - a distributed application using the client-server architecture. For this test we - selected the - eXist

eXist-db, - .

open source XML database, and in a - relatively short time we created, sometimes by trial-and-error, a prototype that queried - the database for keywords and highlighted them in context.

+ BaseX, + eXist, etc., and we found a solution by envisioning + + EVT as a distributed application using the + client-server architecture. For this test we selected the + eXist

eXist-db, .

open source XML database, and + in a relatively short time we created, sometimes by trial-and-error, a prototype that + queried the database for keywords and highlighted them in context.

While this model was a step in the right direction and partially operational, we also felt that it was not sufficiently user-friendly, which is a critical goal of the entire project. In fact, forcing the editor to install and configure specific server software @@ -633,52 +643,59 @@ expected by the user. Essentially, we found that at least two of them were needed in order to make a functional search engine: free-text search and keyword highlighting. To implement them we looked at existing search engines and plug-ins programmed in the most - popular client-side web language: JavaScript. In the - end, our search produced two answers: + popular client-side web language: JavaScript. In the end, our search produced two + answers: Tipue Search and DOM manipulation.

Tipue Search -

- Tipue search

- Tipue Search, - .

is a jQuery plug-in +

+ Tipue search

+ Tipue Search, .

is a jQuery plug-in search engine released under the MIT license and aimed at indexing and searching large collections of web pages. It can function both offline and online, and it does not necessarily require a web server or a server-side programming/query language (such as SQL, PHP, or Python) in order to work. While technically a plug-in, its architecture is quite interesting and versatile: Tipue uses a combination of client-side JavaScript for the actual bulk of the work, and JSON (or JavaScript object literal) for storing the content. By - accessing the data structure, this engine is able to search for a relevant term and - bring back the matches.

+ type="software" xml:id="R57" target="#JavaScript"/>JavaScript for the actual bulk of the work, and JSON (or JavaScript + object literal) for storing the content. By accessing the data structure, this engine + is able to search for a relevant term and bring back the matches.

- Tipue Search operates in three modes: - in Static mode, + Tipue Search operates in three modes: + in Static mode, Tipue Search operates without a web server by accessing the contents stored in a specific file (tipuedrop_content.js); these contents are presented in JSON format; - in Live mode, - Tipue Search operates with a web server by indexing - the web pages included in a specific file - (tipuesearch_set.js); - in JSON mode, - Tipue Search operates with a web server by using - AJAX to load JSON data stored in specific files (as defined by the user). + in Live mode, + Tipue Search operates with a web server by + indexing the web pages included in a specific file + (tipuesearch_set.js); + in JSON mode, + Tipue Search operates with a web server by + using AJAX to load JSON data stored in specific files (as defined by the + user).

This plug-in suited our needs very well, but had to be modified slightly in order to - accommodate the requirements of the entire project. Before using - Tipue to handle the - search we needed to generate the data structure that was going to be used by the - engine to perform the queries. We explored some existing XSL stylesheets aimed at TEI - to JSON transformation, but we found them too complex for the task at hand. So we - modified our own stylesheets to produce the desired output.

+ accommodate the requirements of the entire project. Before using + Tipue to handle the search we needed to generate + the data structure that was going to be used by the engine to perform the queries. We + explored some existing XSL stylesheets aimed at TEI to JSON transformation, but we + found them too complex for the task at hand. So we modified our own stylesheets to + produce the desired output.

This output consists of two JSON files: diplomatic.json contains the text of the diplomatic edition of the Vercelli Book; @@ -687,23 +704,25 @@

These files are produced by including two templates in the overall flow of XSLT transformations that extract crucial data from the TEI documents and format them with JSON syntax. The procedure complements well the entire logic of - automatic self-generation that characterizes + automatic self-generation that characterizes EVT.
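A rough sketch of what one of these templates can look like, serializing one JSON record per page (the variable name and key names follow what Tipue Search appears to expect, but they, the mode name, and the selection logic are assumptions here; quoting, escaping, and trailing-comma handling are omitted for brevity):

    <!-- hypothetical sketch: emitting Tipue-style JSON from the TEI source -->
    <xsl:template match="tei:text" mode="json">
      <xsl:text>var tipuesearch = {"pages": [</xsl:text>
      <xsl:for-each select=".//tei:pb">
        <xsl:text>{"title": "</xsl:text>
        <xsl:value-of select="@n"/>
        <xsl:text>", "text": "..."},</xsl:text>
      </xsl:for-each>
      <xsl:text>]};</xsl:text>
    </xsl:template>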

After we managed to extract the correct data structure, we began to include the search functionality in - EVT. By using the logic behind - Tipue JSON mode, we implemented - a trigger (under the shape of a select tag) that loaded the desired JSON data - structure to handle the search (diplomatic or facsimile, as mentioned above) and a - form that managed the query strings and launched the search function. Additionally, we - decided to provide the user with a simple virtual keyboard composed of essential keys - related to the Anglo-Saxon alphabet used in the Vercelli Book.

+ EVT. By using the logic behind + Tipue JSON mode, we implemented a trigger (under + the shape of a select tag) that loaded the desired JSON data structure to handle the + search (diplomatic or facsimile, as mentioned above) and a form that managed the query + strings and launched the search function. Additionally, we decided to provide the user + with a simple virtual keyboard composed of essential keys related to the Anglo-Saxon + alphabet used in the Vercelli Book.

The performance of - Tipue Search was deemed acceptable and our tests showed that even - large collections of data did not pose any particular problem.

+ Tipue Search was deemed acceptable and our tests + showed that even large collections of data did not pose any particular problem.

Experimental search interface. @@ -712,22 +731,23 @@
Keyword Highlighting through DOM Manipulation

The solution to keyword highlighting was found while searching many plug-ins that - deal with this very problem. All these plug-ins use JavaScript and DOM manipulation in order to wrap the HTML text nodes that - match the query with a specific tag (a span or a user-defined tag) and a CSS class to - manage the style of the highlighting. While this implementation was very simple and - self-explanatory, making use of simple recursive functions on relevant HTML nodes has - proved to be very difficult to apply to the textual contents handled by + deal with this very problem. All these plug-ins use JavaScript and DOM + manipulation in order to wrap the HTML text nodes that match the query with a specific + tag (a span or a user-defined tag) and a CSS class to manage the style of the + highlighting. While this implementation was very simple and self-explanatory, making + use of simple recursive functions on relevant HTML nodes has proved to be very + difficult to apply to the textual contents handled by EVT.

HTML text within - EVT is represented as a combination of text nodes and span - elements. These spans are used to define the characteristics of the current selected - edition. They contain both philological information about the inner workings of the - text and information about its visual representation. Very often the text is composed - of spans that handle different versions of words (such as the sub-elements of the TEI - choice element) or highlight an area of a word (based on the TEI - hi element, for example).

+ EVT is represented as a combination of text nodes + and span elements. These spans are used to define the characteristics of the + current selected edition. They contain both philological information about the inner + workings of the text and information about its visual representation. Very often the + text is composed of spans that handle different versions of words (such as the + sub-elements of the TEI choice element) or highlight an area of a word (based + on the TEI hi element, for example).

This type of markup would not have constituted a problem if it had wrapped complete words, since the plug-ins could recursively explore its content and search for a matching term. In certain portions of the text, however, some letters are separated by @@ -771,32 +791,34 @@ defines two-dimensional areas within a surface, and is transcribed using one or more line elements.

Originally - EVT could not handle this particular encoding method, since the XSLT stylesheets could only process TEI XML documents encoded according to the - traditional transcription method. Since we think that this is a concrete need in many - cases of study (mainly epigraphical inscriptions, but also manuscripts, at least in - some specific cases), we recently added a new feature that will allow - EVT to handle - texts encoded according to the embedded transcription method. This work was possible - due to a small grant awarded by EADH.

See EADH Small Grant: Call for - Proposals, .

+ EVT could not handle this particular encoding + method, since the XSLT stylesheets could only process TEI XML + documents encoded according to the traditional transcription method. Since we think + that this is a concrete need in many cases of study (mainly epigraphical inscriptions, + but also manuscripts, at least in some specific cases), we recently added a new + feature that will allow + EVT to handle texts encoded according to the + embedded transcription method. This work was possible due to a small grant awarded by + EADH.

See EADH Small Grant: Call for Proposals, .

Support for Critical Edition

One important feature whose development will start at some point this year is the - support for critical editions, since at the present moment - EVT allows dealing only - with diplomatic and interpretative ones. We aim not only to offer full support for the - TEI Critical Apparatus module, but also to find an innovative layout that can take - advantage of the digital medium and its dynamic properties to go beyond the - traditional, static, printed page: The layers of footnotes, - the multiplicity of textual views, the opportunities for dramatic visualization - interweaving the many with each other and offering different modes of viewing the - one within the many—all this proclaims I am a hypertext: invent a dynamic device - to show me. The computer is exactly this dynamic device (Robinson 2005, § 12).

+ support for critical editions, since at the present moment + EVT allows dealing only with diplomatic and + interpretative ones. We aim not only to offer full support for the TEI Critical + Apparatus module, but also to find an innovative layout that can take advantage of the + digital medium and its dynamic properties to go beyond the traditional, static, + printed page: The layers of footnotes, the multiplicity of + textual views, the opportunities for dramatic visualization interweaving the many + with each other and offering different modes of viewing the one within the many—all + this proclaims I am a hypertext: invent a dynamic device to show me. The + computer is exactly this dynamic device (Robinson 2005, § 12).

A digital edition can, of course, maintain the traditional layout, possibly moving the apparatus from the bottom of the page to a more convenient position, but could and should also explore different ways of organizing and displaying the connection between @@ -816,14 +838,14 @@ the way it should be designed in order to be usable and useful: how to conceive and where to place the graphical widgets holding the critical apparatus, how to integrate these UI elements in - EVT, how to contextualize the variants and navigate through the - witnesses’ texts, and more. There are other problems, for instance scalability issues - (how to deal with very big textual traditions that count tens or even hundreds of - witnesses?) or the handling of texts produced by collation software, which strictly - depend on the current TEI Critical Apparatus module. Considering that there is a - subgroup of the TEI’s Manuscript Special Interest Group devoted to significantly - improving this module, we can only hope that at least some of these problems will be - addressed in a future version.

+ EVT, how to contextualize the variants and + navigate through the witnesses’ texts, and more. There are other problems, for + instance scalability issues (how to deal with very big textual traditions that count + tens or even hundreds of witnesses?) or the handling of texts produced by collation + software, which strictly depend on the current TEI Critical Apparatus module. + Considering that there is a subgroup of the TEI’s Manuscript Special Interest Group + devoted to significantly improving this module, we can only hope that at least some of + these problems will be addressed in a future version.

@@ -834,28 +856,29 @@ the DigiPal

DigiPal: Digital Resource and Database of Palaeography, Manuscript Studies and Diplomatic, .

project, the - Digital Lightbox

A beta - version is available at .

is a web-based visualization framework which aims to support - historians, paleographers, art historians, and others in analyzing and studying digital - reproductions of cultural heritage objects. The methodology of research inspiring - development of this tool is to study paleographic elements in a qualitative way, helping - scholars’ interpretations as much as possible, and therefore to reject any automatic - methods such as pattern recognition and clustering which are supposed to return - quantitative and objective results. Although ongoing projects making use of these - computational methods are very promising, the results that may be obtained at this time - are still significantly less precise (with regard to specific image features, at least) - than those produced through human interpretation.

-

Initially developed exclusively for paleographic research, the - Digital Lightbox may be - used with any type of image because it includes a set of general graphic tools. Indeed, - the application allows a detailed and powerful analysis of one or more images, arranged - in up to two available workspaces, providing tools for manipulation, management, - comparison, and transformation of images. The development of this project is - consistently tested by paleographers at King’s College London working on the DigiPal - project, who are using the web application as a support for analyzing and gathering - samples of paleographic elements.

+ target="http://lightbox-dev.dighum.kcl.ac.uk"> + Digital Lightbox

A beta version is + available at .

is a + web-based visualization framework which aims to support historians, paleographers, art + historians, and others in analyzing and studying digital reproductions of cultural + heritage objects. The methodology of research inspiring development of this tool is to + study paleographic elements in a qualitative way, helping scholars’ interpretations as + much as possible, and therefore to reject any automatic methods such as pattern + recognition and clustering which are supposed to return quantitative and objective + results. Although ongoing projects making use of these computational methods are very + promising, the results that may be obtained at this time are still significantly less + precise (with regard to specific image features, at least) than those produced through + human interpretation.

+

Initially developed exclusively for paleographic research, the + Digital Lightbox may be used with any type of image + because it includes a set of general graphic tools. Indeed, the application allows a + detailed and powerful analysis of one or more images, arranged in up to two available + workspaces, providing tools for manipulation, management, comparison, and transformation + of images. The development of this project is consistently tested by paleographers at + King’s College London working on the DigiPal project, who are using the web application + as a support for analyzing and gathering samples of paleographic elements.

The software offers a rich set of tools: besides basic functions such as resizing, rotation, and dragging, it is possible to use a set of filters—such as opacity, brightness, color inversion, grayscale effect, and contrast—which, used in combination, @@ -872,98 +895,108 @@ Lightbox.

-

Collaboration is a very important characteristic of - Digital Lightbox: what makes this - tool stand apart from all the image-editing applications available is the possibility of - creating and sharing the work done using the software framework. First, you can create - collections of images and then export them to the local disk as an XML file; this - feature not only serves as a way to save the work, but also to share specific - collections with other users. Moreover, it is possible to export (and, consequently, to - import) working sessions, or, in other words, the current status of the work being done - using the application: in fact, all the images, letters, and notes present on the - workspace will be saved when the user leaves and restored when they log in again. These - features have been specifically created to encourage sharing and to make collaborative - work more effective and easy. Thanks to a new HTML5 feature, it is possible to support - the importing of images from the local disk to the application without any server-side +

Collaboration is a very important characteristic of + Digital Lightbox: what makes this tool stand apart + from all the image-editing applications available is the possibility of creating and + sharing the work done using the software framework. First, you can create collections of + images and then export them to the local disk as an XML file; this feature not only + serves as a way to save the work, but also to share specific collections with other + users. Moreover, it is possible to export (and, consequently, to import) working + sessions, or, in other words, the current status of the work being done using the + application: in fact, all the images, letters, and notes present on the workspace will + be saved when the user leaves and restored when they log in again. These features have + been specifically created to encourage sharing and to make collaborative work more + effective and easy. Thanks to a new HTML5 feature, it is possible to support the + importing of images from the local disk to the application without any server-side function.

- Digital Lightbox has been developed using some of the latest web technologies - available, such as HTML5, CSS3, the front-end framework Digital Lightbox has been developed using some of + the latest web technologies available, such as HTML5, CSS3, the front-end framework Bootstrap,

Bootstrap, .

and the JavaScript (ECMAScript 6) programming language, in combination with the jQuery library.

.

The code architecture has been designed - to be modular and easily extensible by other developers or third parties: indeed, it has - been released as open source software on GitHub,

- Digital Lightbox, Bootstrap, .

and the JavaScript (ECMAScript 6) programming language, + in combination with the jQuery + library.

.

The code architecture has been designed to be modular and easily + extensible by other developers or third parties: indeed, it has been released as open + source software on GitHub,

+ Digital Lightbox, .

and is freely available to be downloaded, edited, and tinkered with.

The - Digital Lightbox represents a perfect complementary feature for the - EVT project: a - graphic-oriented tool to explore, visualize, and analyze digital images of manuscripts. - While - EVT provides a rich and usable interface to browse and study manuscript texts - together with the corresponding images, the tools offered by the Digital Lightbox allow - users to identify, gather, and analyze visual details which can be found within the - images, and which are important for inquiries relating, for instance, to the style of - the handwriting, decorations on manuscript folia, or page layout.

-

An effort to adapt and integrate the Digital Lightbox into - EVT is already underway, - making it available as a separate, image-centered view, but there is a major hurdle to - overcome: some of the DL features are only possible within a client-server architecture. - Since - EVT or, more precisely, a separate version of - - EVT will migrate to this - architecture, at some point in the future it will be possible to integrate a full - version of the DL. Plans for the current, client-only version envision implementing all - those features that do not depend on server software: even if this means giving up - interesting features such as collaborative work and annotation, we believe that even a - subset of the available tools will be an invaluable help for manuscript image analysis. - Furthermore, as noted above, thanks to HTML5 and CSS3 it will become more and more - feasible to implement features in a client-only mode.

+ Digital Lightbox represents a perfect complementary + feature for the + EVT project: a graphic-oriented tool to explore, + visualize, and analyze digital images of manuscripts. While + EVT provides a rich and usable interface to browse + and study manuscript texts together with the corresponding images, the tools offered by + the Digital Lightbox allow users to identify, gather, and analyze visual details which + can be found within the images, and which are important for inquiries relating, for + instance, to the style of the handwriting, decorations on manuscript folia, or page + layout.

+

An effort to adapt and integrate the Digital Lightbox into + EVT is already underway, making it available as a + separate, image-centered view, but there is a major hurdle to overcome: some of the DL + features are only possible within a client-server architecture. Since + EVT or, more precisely, a separate version of + EVT will migrate to this architecture, at some point + in the future it will be possible to integrate a full version of the DL. Plans for the + current, client-only version envision implementing all those features that do not depend + on server software: even if this means giving up interesting features such as + collaborative work and annotation, we believe that even a subset of the available tools + will be an invaluable help for manuscript image analysis. Furthermore, as noted above, + thanks to HTML5 and CSS3 it will become more and more feasible to implement features in + a client-only mode.

New Architecture

In September 2013 we met with researchers of the Clavius on the Web project

See . A - preliminary test using a previous version of - EVT is available at + EVT is available at .

to discuss a possible use of - - EVT in order to visualize the documents that they are collecting and encoding; the main - goal of the project is to produce a web-based edition of all the correspondence of this - important sixteenth–seventeenth-century mathematician.

Currently preserved at - the Archivio della Pontificia Università Gregoriana.

The integration of - - EVT with another web framework used in the project, the eXist XML database, will require - a very important change in how the software works: as mentioned above, everything from - XSLT processing to browsing of the resulting website has been done on the client - side, but the integration with + + EVT in order to visualize the documents that they + are collecting and encoding; the main goal of the project is to produce a web-based + edition of all the correspondence of this important sixteenth–seventeenth-century + mathematician.

Currently preserved at the Archivio della Pontificia + Università Gregoriana.

The integration of + EVT with another web framework used in the project, + the eXist XML database, will require a very important change in how the software works: + as mentioned above, everything from XSLT processing to browsing of the resulting + website has been done on the client side, but the integration with eXist will require a move to the more complex - client-server architecture. A version of + client-server architecture. A version of EVT based on this architecture would present several advantages, not only the integration of a powerful XML database, but also the - implementation of a full version of the - Digital Lightbox. We will try to make the move - as painless as possible and to preserve the basic simplicity and flexibility that has - been a major feature of - EVT so far. The client-only version will not be abandoned, - though for quite some time there will be parallel development with features trickling - from one version to the other, with the client-only one being preserved as a subset of - the more powerful one.

+ implementation of a full version of the + Digital Lightbox. We will try to make the move as + painless as possible and to preserve the basic simplicity and flexibility that has been + a major feature of + EVT so far. The client-only version will not be + abandoned, though for quite some time there will be parallel development with features + trickling from one version to the other, with the client-only one being preserved as a + subset of the more powerful one.

@@ -979,26 +1012,29 @@ folders and start anew) and publish preliminary versions on the web (a shared folder on any cloud-based service such as Dropbox is all that is needed).

While - EVT has been under development for 3–4 years, it was thanks to the work and focus - required by the Digital Vercelli Book release at end of 2013 that we now have a solid - foundation on which to build new features and refine the existing ones. Some of the future - expansions also pose important research questions: this is the case with the critical - edition support, which touches a sensitive area of the very recent digital philology - discipline.

Digital philology makes use of ICT methods and tools, such as text - encoding, in the context of textual criticism and philological study of documents to - produce digital editions of texts. While many of the first examples of such editions - were well received (see for instance Kiernan - 2013; also see Siemens 2012 for an - example of the new theoretical possibilities allowed by the digital medium), serious - doubts concerning not only their durability and maintainability, but also their - methodological effectiveness, have been raised by some scholars. The debate is still - ongoing, see Gabler 2010, Robinson 2005 and 2013, Rosselli Del Turco, - forthcoming.

The collaborative work features of the - Digital Lightbox are also critical to the way modern scholars interact and share their research - findings. Finally, designing a user interface capable of hosting all the new features, - while remaining effective and user-friendly, will itself be very challenging.

+ EVT has been under development for 3–4 years, it was + thanks to the work and focus required by the Digital Vercelli Book release at end of 2013 + that we now have a solid foundation on which to build new features and refine the existing + ones. Some of the future expansions also pose important research questions: this is the + case with the critical edition support, which touches a sensitive area of the very recent + digital philology discipline.

Digital philology makes use of ICT methods and + tools, such as text encoding, in the context of textual criticism and philological + study of documents to produce digital editions of texts. While many of the first + examples of such editions were well received (see for instance Kiernan 2013; also see Siemens 2012 for an example of the new theoretical + possibilities allowed by the digital medium), serious doubts concerning not only their + durability and maintainability, but also their methodological effectiveness, have been + raised by some scholars. The debate is still ongoing, see Gabler 2010, Robinson + 2005 and 2013, Rosselli Del Turco, forthcoming.

The + collaborative work features of the + Digital Lightbox are also critical to the way modern + scholars interact and share their research findings. Finally, designing a user interface + capable of hosting all the new features, while remaining effective and user-friendly, will + itself be very challenging.

diff --git a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml index 347c4899..95384ef3 100644 --- a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml +++ b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml @@ -246,7 +246,7 @@
Towards the Creation of a Born-Digital Epigraphic Collection with EFES

Once the relevant material had been defined, another major issue that I had to face was to decide how to deal efficiently with it. While I was in the process of starting @@ -271,19 +271,19 @@ also happened in 2017: the appearance of a powerful new tool for digital epigraphy, EpiDoc Front-End Services (EFES).

GitHub repository, accessed July - 21, 2021, .

Although I - was already aware of the many benefits deriving from a semantic markup of the - inscriptions,

On which see - and .

what - really persuaded me to adopt a TEI-based approach for the creation of my epigraphic - editions was actually the great facilitation that EFES offered in using - TEI-EpiDoc, which I will discuss in the following section.

+ 21, 2021, .

Although I was already aware of the many benefits + deriving from a semantic markup of the inscriptions,

On which see and .

what really persuaded me to + adopt a TEI-based approach for the creation of my epigraphic editions was actually + the great facilitation that EFES offered in using TEI-EpiDoc, which I will + discuss in the following section.

The Benefits of Using EpiDoc and EFES + />EFES

I was already familiar with the epigraphic subset of the TEI standard, EpiDoc,

EpiDoc: Epigraphic Documents in TEI XML, accessed July 21, 2021,

This is particularly true for the creation of publishable output of the encoded - inscriptions. The EpiDoc Reference - XSLT Stylesheets, created for transformation of - EpiDoc XML files into HTML,

Accessed July 21, 2021, .

require - relatively advanced knowledge of XSLT to use them to produce a satisfying HTML - edition for online publication or to generate a printable PDF. Not to mention the - creation of a complete searchable database to be published online, equipped with - indexes and appropriate search filters: this is far beyond the IT skills of the - average epigraphist.

+ inscriptions. The EpiDoc Reference XSLT Stylesheets, created for + transformation of EpiDoc XML files into HTML,

Accessed July 21, 2021, + .

require relatively advanced knowledge of XSLT to use + them to produce a satisfying HTML edition for online publication or to generate a + printable PDF. Not to mention the creation of a complete searchable database to be + published online, equipped with indexes and appropriate search filters: this is far + beyond the IT skills of the average epigraphist.

The situation is a little better for those who use EpiDoc as a tool for simplifying their research work on a collection of ancient documents, without aiming at the publication of the encoded inscriptions. The querying of a set of EpiDoc inscriptions is possible to some extent even without technical support: in some advanced XML editors, particularly - Oxygen, it is possible to perform XPath queries that allow the - identification of all the occurrences of specific features in the epigraphic - collection according to their markup. The XPath queries in an advanced XML editor - also allow the creation of lists of specific elements mentioned in the inscriptions, - but to my knowledge the creation of proper indexes—before EFES—was - almost impossible to achieve without the help of an IT expert.

+ Oxygen, it is possible to perform XPath queries + that allow the identification of all the occurrences of specific features in the + epigraphic collection according to their markup. The XPath queries in an advanced XML + editor also allow the creation of lists of specific elements mentioned in the + inscriptions, but to my knowledge the creation of proper indexes—before EFES—was almost impossible to achieve without the help of an IT expert.
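As a rough illustration of such ad hoc querying (assuming a folder of EpiDoc files and Saxon's collection() URI syntax, neither of which is prescribed here), an XPath like //tei:persName[@ref] lists every referenced personal name, while even a crude frequency list of lemmata already needs a few lines of XSLT:

<!-- Minimal XSLT 2.0 sketch, illustrative only and unrelated to EFES:
     a crude lemma frequency list built from a folder of EpiDoc files. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0" version="2.0">
  <xsl:output method="text"/>
  <xsl:template name="main">
    <!-- collection() with ?select is a Saxon convention; adjust the path as needed -->
    <xsl:for-each-group select="collection('inscriptions/?select=*.xml')//tei:w[@lemma]"
        group-by="@lemma">
      <xsl:sort select="current-grouping-key()"/>
      <!-- one line per lemma with its number of occurrences -->
      <xsl:value-of select="concat(current-grouping-key(), ' (', count(current-group()), ')&#10;')"/>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>

Run, for instance, with Saxon's -it option to start at the named template, this yields a plain-text list; turning such raw matches into the structured, cross-referenced indexes a publication needs is a much larger task.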

Thus, despite the many benefits that EpiDoc encoding potentially offers, epigraphists might often be discouraged from adopting it by the amount of time that such an approach requires, combined with the fact that in many cases these benefits become tangible only at the end of the work, and only if one has IT support.

In light of these limitations, it is easy to understand how deeply the release of - EFES has transformed the field of digital epigraphy. - EFES, developed at the Institute of Classical Studies of the School of - Advanced Study of the University of London as the epigraphic specialization of the - - Kiln platform - ,

New Digital Publishing Tool: EpiDoc Front-End - Services, September 1, 2017, ; - see also the Kiln GitHub repository, accessed July 21, 2021, - .

is a platform that - simplifies the creation and management of databases of inscriptions encoded following - the EpiDoc Guidelines. More specifically, - EFES was developed to make - it easy for EpiDoc users to view a publishable form of their inscriptions, and to - publish them online in a full-featured searchable database, by easily ingesting - EpiDoc texts and providing formatting for their display and indexing through the - - EpiDoc reference XSLT stylesheets. The ease of configuration of the XSLT + EFES has transformed the field of digital epigraphy. EFES, developed + at the Institute of Classical Studies of the School of Advanced Study of the + University of London as the epigraphic specialization of the + Kiln platform ,

New + Digital Publishing Tool: EpiDoc Front-End Services, September 1, + 2017, ; see also the Kiln GitHub repository, accessed July 21, 2021, + .

is a + platform that simplifies the creation and management of databases of inscriptions + encoded following the EpiDoc Guidelines. More specifically, + EFES was developed to make it easy for EpiDoc + users to view a publishable form of their inscriptions, and to publish them online in + a full-featured searchable database, by easily ingesting EpiDoc texts and providing + formatting for their display and indexing through the EpiDoc + reference XSLT stylesheets. The ease of configuration of the XSLT transformations, and the possibility of already having, during construction, an immediate front-end visualization of the desired final outcome of the TEI-EpiDoc marked-up documents, allow smooth creation of an epigraphic database even without a @@ -383,49 +385,49 @@ their publication or even without the intention of publishing them, especially when dealing with large collections of documents and data sets.

See Bodard and Yordanova (2020).
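For readers unfamiliar with the format, the kind of file such a platform ingests looks roughly like the heavily abridged sketch below; a real EpiDoc edition has a much fuller msDesc and follows the EpiDoc Guidelines in detail, so this is an orientation aid rather than a template.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Heavily abridged, hypothetical EpiDoc-style file; not a complete or schema-valid template. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample inscription (illustrative)</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc>
        <msDesc>
          <msIdentifier><idno type="filename">sample001</idno></msIdentifier>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="edition" xml:lang="grc">
        <ab>
          <lb n="1"/><persName ref="#p001">...</persName>
          <lb n="2"/><w lemma="πόλις">πόλιος</w> ...
        </ab>
      </div>
      <div type="commentary"><p>...</p></div>
    </body>
  </text>
</TEI>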

-

Some of these useful features of EFES are common to other existing tools, - such as TEI Publisher,

Accessed July 21, 2021, .

- TAPAS,

Accessed July 21, 2021, 2020).

+

Some of these useful features of EFES are common to other existing tools, + such as TEI Publisher,

Accessed July 21, 2021, + .

+ TAPAS,

Accessed July 21, 2021, .

or Kiln itself, which is - EFES’s direct ancestor. What makes EFES unique, - however, is the fact that it is the only one of those tools to have be designed - specifically for epigraphic purposes and to be deeply integrated with the EpiDoc - Schema/Guidelines and with its reference stylesheets. Not only does it use, by - default, the EpiDoc reference stylesheets for transforming the inscriptions and for - indexing, it also comes with a set of default search facets and indexes that are - specifically meant for epigraphic documents. The default facets include the findspot - of the inscription, its place of origin, its current location, its support material, - its object type, its document type, and the type of evidence of its date. The - search/browse page, moreover, also includes a slider for filtering the inscriptions - by date and a box for textual searches, which can be limited to the indexed forms of - the terms. The default indexes include places, personal names (onomastics), - identifiable persons (prosopography), divinities, institutions, words, lemmata, - symbols, numerals, abbreviations, and uninterpreted text fragments. New facets and - indexes can easily be added even without mastering XSLT, along the lines of the - existing ones and by following the detailed instructions provided in the EFES Wiki documentation.

Accessed July 21, 2021, EFES’s direct ancestor. What makes EFES unique, however, is the + fact that it is the only one of those tools to have be designed specifically for + epigraphic purposes and to be deeply integrated with the EpiDoc Schema/Guidelines and + with its reference stylesheets. Not only does it use, by default, the EpiDoc + reference stylesheets for transforming the inscriptions and for indexing, it also + comes with a set of default search facets and indexes that are specifically meant for + epigraphic documents. The default facets include the findspot of the inscription, its + place of origin, its current location, its support material, its object type, its + document type, and the type of evidence of its date. The search/browse page, + moreover, also includes a slider for filtering the inscriptions by date and a box for + textual searches, which can be limited to the indexed forms of the terms. The default + indexes include places, personal names (onomastics), identifiable persons + (prosopography), divinities, institutions, words, lemmata, symbols, numerals, + abbreviations, and uninterpreted text fragments. New facets and indexes can easily be + added even without mastering XSLT, along the lines of the existing ones and by + following the detailed instructions provided in the EFES Wiki + documentation.

Accessed July 21, 2021, . Creation of new facets, last updated April 11, 2018: . Creation of new indexes, last updated May 27, 2020: .

- Furthermore, EFES makes it possible to create an epigraphic concordance of the + Furthermore, EFES makes it possible to create an epigraphic concordance of the various editions of each inscription and to add information pages as TEI XML files (suitable for displaying both information on the database itself and potential additional accompanying information).

Against this background, the combined use of the EpiDoc encoding and of the EFES tool seemed to be a promising approach for the purposes of my research - project, and so it was.

+ type="software" xml:id="R26" target="#EFES"/> + EFES tool seemed to be a promising approach for + the purposes of my research project, and so it was.

I initially aimed to create updated digital editions of the inscriptions mentioning Cretan institutional elements that could be used to facilitate a comparative analysis of the latter. The ability to generate and view the indexes of the mentioned @@ -441,10 +443,10 @@ directly from the raw XML files. The use of XSL-FO, however, requires some additional skills that are not needed in the copy-and-paste-from-the-browser process.
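The extra effort involved in the XSL-FO route can be sensed from even the smallest formatting-objects document, sketched below purely for illustration; this is generic XSL-FO, not anything produced by EFES or used in the project.

<!-- Minimal, illustrative XSL-FO skeleton: even a bare printable page
     needs an explicit page master and a flow before any text appears. -->
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="A4" page-height="29.7cm" page-width="21cm" margin="2cm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="A4">
    <fo:flow flow-name="xsl-region-body">
      <fo:block font-family="serif" font-size="10pt">Text of the inscription goes here.</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>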

Although I had not planned it from the beginning, EFES also proved to be useful in the (online) publication of the results of my research. The ease with which EFES allows the creation of a searchable + />EFES allows the creation of a searchable epigraphic database, in fact, spontaneously led me to decide to publish it online once completed, making available not only the HTML editions—which can also be downloaded as printable PDFs—but also the raw XML files for reuse. The aim of the @@ -456,7 +458,7 @@ Cretan Institutional Inscriptions: An Overview of the Database

The core of the EFES-based database Cretan + type="soft.name" ref="#R30">EFES</rs>-based database <title level="m">Cretan Institutional Inscriptions consists of the EpiDoc editions of the previously selected six hundred inscriptions, which can be exported both in PDF and in their original XML format. Each edition is composed of an essential descriptive @@ -672,7 +674,7 @@ >I.Cret. II 23 5).

Given the markup described above, EFES was able to generate detailed indexes + />EFES was able to generate detailed indexes having the appearance of rich tables, where each piece of information is displayed in a dedicated column and can easily be combined with the other ones at a glance.
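The markup conventions themselves are described in a part of the article not reproduced in this changeset; in general terms, index generation of this kind relies on each mention pointing to an authority-list entry, along the lines of the hypothetical fragment below (element choice, identifiers, and file names are guesses, not the project's actual practice).

<!-- Hypothetical sketch: institution mentions pointing at an authority list,
     so that an indexing stylesheet can group occurrences by institution. -->
<ab>
  <lb n="1"/>ἔδοξε τοῖς <rs type="institution" ref="institutions.xml#kosmoi">κόσμοις</rs>
  <lb n="2"/>καὶ τᾶι <rs type="institution" ref="institutions.xml#polis">πόλει</rs> ...
</ab>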

In the most complex case, that of the institutions, the index displays for each @@ -730,9 +732,9 @@ Conclusions

In conclusion, I would like to emphasize how particularly efficient the combined use of EpiDoc and EFES has proven to be for the creation of a thematic database - like Cretan Institutional Inscriptions. By collecting in a searchable database all - the inscriptions pertaining to the Cretan institutions, records that were hitherto + ref="#R34">EFES has proven to be for the creation of a thematic database like + Cretan Institutional Inscriptions. By collecting in a searchable database all the + inscriptions pertaining to the Cretan institutions, records that were hitherto accessible only in a scattered way, Cretan Institutional Inscriptions is a new resource that can facilitate the finding, consultation, and reuse of these very heterogeneous documents, many of which offer further points of reflection only when From 1ccd144a47e2943a91167759646050d2ee657877 Mon Sep 17 00:00:00 2001 From: GitHub Action Date: Mon, 5 Feb 2024 14:56:05 +0000 Subject: [PATCH 31/33] Add updated odd and generated rng --- schema/tei_jtei_annotated.rng | 2 +- schema/tei_software_annotation.rng | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/schema/tei_jtei_annotated.rng b/schema/tei_jtei_annotated.rng index e94fc366..5db61162 100644 --- a/schema/tei_jtei_annotated.rng +++ b/schema/tei_jtei_annotated.rng @@ -5,7 +5,7 @@ xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://www.tei-c.org/ns/1.0">SARIT or Buddhist Stonesutras or experiments with EEBO-TCPEarly English Books Online eXist-db app, accessed February 11, 2016, . are more than promising (see, for example, Wicentowski and diff --git a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml index 95384ef3..0eed9e2e 100644 --- a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml +++ b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml @@ -370,7 +370,7 @@ users to view a publishable form of their inscriptions, and to publish them online in a full-featured searchable database, by easily ingesting EpiDoc texts and providing formatting for their display and indexing through the EpiDoc + xml:id="R16" target="#epidocxslt"/> EpiDoc reference XSLT stylesheets. The ease of configuration of the XSLT transformations, and the possibility of already having, during construction, an immediate front-end visualization of the desired final outcome of the TEI-EpiDoc @@ -387,7 +387,7 @@ Yordanova (2020).

Some of these useful features of EFES are common to other existing tools, + />EFES are common to other existing tools, such as TEI Publisher,

Accessed July 21, 2021, .

diff --git a/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml b/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml index d83c37b6..ffa8c151 100644 --- a/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml +++ b/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml @@ -132,7 +132,7 @@ from manuscripts, to be published alongside the catalogue description of the manuscript itself, we have investigated a series of options, among which we have chosen to use the Transkribus software by READ Coop. Accessed February 2, 2022, .

@@ -151,7 +151,7 @@ an historical catalogue that involves copying from the former cataloguer's transcription. Having a new transcription, based on autopsy or at least on the images of the manuscript, would be preferable, and a technology such as Transkribus allows one to obtain this transcription in an almost entirely automated way. Additionally, most of the internal referencing within a manuscript is done with the indication of the ranges of folios, and in TEI with @@ -202,7 +202,7 @@

The following steps have been taken to carry out an investigation of the possibilities for the automated production of text transcriptions based on images of manuscripts, before we opted for Transkribus + target="#transkribus"/>Transkribus and its integration in the workflow to make texts available in the Beta maṣāḥǝft research environment.

@@ -291,7 +291,7 @@ one script.

- Transkribus

This software is freely accessible and has a subscription model based on credits. The platform was created within the framework of the EU projects @@ -304,7 +304,7 @@ platform. The Pattern Recognition and Human Language Technology (PRHLT) group of the Universitat Politècnica de València and the CITlab group of the University of Rostock should be mentioned in particular.

-

Transkribus comes as an expert tool in its downloadable version and its online version,Accessed February 2, 2022,

Thus, the first stage for developing a model was gathering the data and preparing an initial dataset. Also for this aspect, Transkribus + target="#transkribus"/>Transkribus proved superior to all other options, offering support for this step as well. Colleagues whom we asked to contribute could be added to a collection, share their images without publishing them, and add their transcriptions in the tool with a very mild learning curve.

-

Within Within Transkribus we have trained a model called Manuscripts from Ethiopia and Eritrea in Classical Ethiopic (Gǝʿǝz).See, accessed February 2, 2022,

Training a model in Transkribus

Gathering data to train an HTR model in Transkribus + target="#transkribus"/>Transkribus was not easy. Researchers were directly asked to contribute images for which they had already produced a correct transcription. Sets of images with the corresponding transcriptions were thus obtained thanks to the generosity of contributors listed @@ -437,7 +437,7 @@ for the available time of the colleagues to fix the work of the machine, since we intended to train the model again. After three months with a full-time dedicated person, we had more than 50k words in the Transkribus + target="#transkribus"/>Transkribus expert tool, and we could train a model which could be made public, since this is the unofficial threshold to make a model available to everyone.

The features of the final model can be seen in

Adding transcriptions to Beta maṣāḥǝft from Transkribus

Even if a user has already worked through each page of a manuscript to produce a transcription, doing it again with Transkribus + target="#transkribus"/>Transkribus and checking it has many advantages, chiefly the alignment of the text regions and lines on the base image to the transcription. Guidelines are provided for these steps to the users in the project Guidelines, @@ -470,7 +470,7 @@ />.

With the images transcribed, either by hand with the help of the tool or using the HTR model, the export functionalities of the Transkribus tool allow users to download a TEI-encoded version of the transcription; we encourage users to use line breaks (lb) instead of l and to preserve the coordinates of the boxes.
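What preserving the boxes means in practice is easiest to see in a simplified sketch of such an export; the real Transkribus output is considerably richer and varies with the chosen export options, so the fragment below shows only the general shape: line zones with pixel coordinates in a facsimile section, and lb elements pointing back at them.

<!-- Simplified sketch, not a verbatim Transkribus export. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface xml:id="page001" ulx="0" uly="0" lrx="2400" lry="3200">
      <graphic url="page001.jpg"/>
      <zone xml:id="line_1" points="210,310 1880,310 1880,390 210,390"/>
      <zone xml:id="line_2" points="212,400 1878,400 1878,478 212,478"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <p>
        <pb n="1" facs="#page001"/>
        <lb n="1" facs="#line_1"/>(text of line 1)
        <lb n="2" facs="#line_2"/>(text of line 2)
      </p>
    </body>
  </text>
</TEI>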

@@ -487,7 +487,7 @@

We have then prepared a bespoke XSLT transformation which can be used to transform the rich TEI from Transkribus, + target="#transkribus"/>Transkribus, called transkribus2Beta maṣāḥǝft.xsl. This transformation, given a few parameters, @@ -510,7 +510,7 @@
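The stylesheet itself is not reproduced in this changeset; as a rough idea of the kind of transformation involved (parameter names and matched elements below are invented for illustration), the core is an identity transform that keeps the transcription and its line milestones while adapting export-specific markup to the project's conventions.

<!-- Illustrative sketch only; not the actual transkribus2Beta maṣāḥǝft.xsl. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei" version="2.0">
  <!-- hypothetical parameter: the manuscript record the output should be filed under -->
  <xsl:param name="ms-id" select="'EXAMPLE-MS-001'"/>
  <!-- identity transform: copy everything by default -->
  <xsl:template match="@* | node()">
    <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
  </xsl:template>
  <!-- exports that wrap each line in l are converted to lb milestones,
       keeping the link to the line's zone so the coordinates are not lost -->
  <xsl:template match="tei:l">
    <lb><xsl:copy-of select="@n | @facs"/></lb>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>

The real stylesheet presumably does considerably more (metadata mapping, language tagging, filing the text under the correct manuscript record), which is why a ready-made transformation is provided rather than left for each user to write.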

Conclusions -

Working with Working with Transkribus for the Beta maṣāḥǝft project gives the community of users a way to support the process of transcribing to the text on source manuscripts without typing it down. This is not intended to substitute the @@ -607,7 +607,7 @@ Weidemann, Herbert Wurster, and Konstantinos Zagoris. 2019. Transforming scholarship in the archives through handwritten text recognition: <ptr type="software" - xml:id="Transkribus" target="#Transkribus"/><rs type="soft.name" + xml:id="Transkribus" target="#transkribus"/><rs type="soft.name" ref="#Transkribus">Transkribus</rs> as a case study. Journal of Documentation, 75 (5) https://github.com/mandoku/mandoku + + Toolbox + + + + + Ediarum + + + + + Transkribus + + + From c31a115f255e1dcb8f503b1c9354906358cf2c25 Mon Sep 17 00:00:00 2001 From: GitHub Action Date: Mon, 5 Feb 2024 16:07:08 +0000 Subject: [PATCH 33/33] Add updated odd and generated rng --- schema/tei_jtei_annotated.odd | 3 +++ schema/tei_jtei_annotated.rng | 8 +++++++- schema/tei_software_annotation.rng | 12 +++++++++--- schema/tei_software_annotation.xml | 3 +++ 4 files changed, 22 insertions(+), 4 deletions(-) diff --git a/schema/tei_jtei_annotated.odd b/schema/tei_jtei_annotated.odd index 091f2a64..666508b5 100644 --- a/schema/tei_jtei_annotated.odd +++ b/schema/tei_jtei_annotated.odd @@ -2439,6 +2439,9 @@ + + + diff --git a/schema/tei_jtei_annotated.rng b/schema/tei_jtei_annotated.rng index 5db61162..52920f4f 100644 --- a/schema/tei_jtei_annotated.rng +++ b/schema/tei_jtei_annotated.rng @@ -5,7 +5,7 @@ xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://www.tei-c.org/ns/1.0">