diff --git a/data/JTEI/10_2016-19/jtei-10-burghart-source.xml b/data/JTEI/10_2016-19/jtei-10-burghart-source.xml
index 6d5a965..d53da7e 100644
--- a/data/JTEI/10_2016-19/jtei-10-burghart-source.xml
+++ b/data/JTEI/10_2016-19/jtei-10-burghart-source.xml
@@ -1,7 +1,6 @@
-
-
+
+
@@ -183,7 +182,8 @@

-          Towards a Toolbox
+          Towards a Toolbox

After this survey, I started writing my own scripts in order to support my editorial work, as a set of separate tools. It quickly occurred to me that those scripts could easily be grouped together in a toolbox that, with a little bit of
@@ -219,10 +219,11 @@
 Description

The TEI Critical Apparatus Toolbox is an online application for the quick and easy visualization and processing of TEI XML critical editions. It is - not meant to be a publication tool: the Critical Apparatus - Toolbox specifically targets the needs of editors during the preparation of - their ongoing work, allowing them to perform quality controls on their TEI files and to - display their work-in-progress text either in the style of a + not meant to be a publication tool: the Critical Apparatus <ptr + type="software" xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name" + ref="#Toolbox">Toolbox</rs> specifically targets the needs of editors during + the preparation of their ongoing work, allowing them to perform quality controls on + their TEI files and to display their work-in-progress text either in the style of a traditional critical edition, and/or in parallel versions corresponding to each witness.
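For orientation, the annotation pattern this diff applies throughout pairs an empty ptr with an rs element carrying the software name. In isolation (the xml:id value varies with the software being tagged):

<!-- software-citation pattern used throughout this diff -->
<ptr type="software" xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name"
  ref="#Toolbox">Toolbox</rs>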

The requirements are very basic: no account, download, installation, or configuration @@ -242,8 +243,9 @@ for finished editions, but they may not be well adapted to ongoing work. Proposing a visualization for such encoding is not easy, because there might not be an identifiable critical text (yet), and the styles can be mixed in the apparatus. The - Critical Apparatus Toolbox will display each style (lemma - and reading[s], or readings only) differently: + Critical Apparatus <ptr type="software" xml:id="Toolbox" + target="#Toolbox"/><rs type="soft.name" ref="#Toolbox">Toolbox</rs> will + display each style (lemma and reading[s], or readings only) differently: In each case, the content of lem and rdg are highlighted, with a white background. When an app element contains a lem and one or more @@ -264,7 +266,8 @@

Displaying this code in the Critical - Apparatus Toolbox + Apparatus Toolbox
@@ -292,7 +295,8 @@
Displaying this code in the Critical - Apparatus Toolbox + Apparatus Toolbox
The use of reading groups is also supported: the content of each @@ -317,7 +321,8 @@
Displaying this code in the Critical - Apparatus Toolbox + Apparatus Toolbox
@@ -332,25 +337,29 @@ case, it is assumed that the page breaks of this witness are of particular interest to the user, and they are displayed in a more prominent fashion, as blocks with a thin blue line representing each break.

-

So far the Critical Apparatus Toolbox is not very different - from other TEI display tools, except perhaps that it can handle a great variety of - encoding styles within the Parallel Segmentation method. But its most distinctive - feature is the ability to perform automated controls of the encoding.

+

So far the Critical Apparatus <ptr type="software" xml:id="Toolbox" + target="#Toolbox"/><rs type="soft.name" ref="#Toolbox">Toolbox</rs> is not + very different from other TEI display tools, except perhaps that it can handle a great + variety of encoding styles within the Parallel Segmentation method. But its most + distinctive feature is the ability to perform automated controls of the encoding.

Controlling the Consistency of Your Encoding

The preparation of a critical edition involves many sessions of meticulous proofreading, especially to check the accuracy of the apparatus. If the Critical Apparatus Toolbox cannot replace the careful eye of the - editor, it offers an efficient way to control the consistency of the encoding by - detecting small inevitable mistakes, like a typo in the list of sigla or the failure - to record the reading of a particular witness in an apparatus entry.

-

To perform those controls, the Critical Apparatus Toolbox - will scan the teiHeader and front sections of the TEI file for a - listWit, and find all the sigla of the witnesses. Then, it will compare - this list to the manuscripts appearing in the wit attribute of lem - and rdg elements. The nature of the controls will depend on the type of - apparatus used in the edition.

+ level="m">Critical Apparatus Toolbox cannot replace the + careful eye of the editor, it offers an efficient way to control the consistency of + the encoding by detecting small inevitable mistakes, like a typo in the list of sigla + or the failure to record the reading of a particular witness in an apparatus + entry.

+

To perform those controls, the Critical Apparatus <ptr + type="software" xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name" + ref="#Toolbox">Toolbox</rs> will scan the teiHeader and + front sections of the TEI file for a listWit, and find all the + sigla of the witnesses. Then, it will compare this list to the manuscripts appearing + in the wit attribute of lem and rdg elements. The nature + of the controls will depend on the type of apparatus used in the edition.
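A minimal sketch of the two structures this comparison works on (the witness sigla are illustrative; the readings are borrowed from the article's later example):

<listWit>
  <witness xml:id="A">Witness A</witness>
  <witness xml:id="R2">Witness R2</witness>
</listWit>
<!-- later, in the text: every siglum in @wit should resolve to a witness declared above -->
<app>
  <lem wit="#A">leniantur</lem>
  <rdg wit="#R2">laniantur</rdg>
</app>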

Positive Apparatus

In a positive apparatus, the reading of each witness considered for the edition is @@ -361,7 +370,8 @@ edition) because it forces the editor to be more accurate and makes verifications easier. It is the type of apparatus that allows for the most efficient consistency checks.

-

The Critical Apparatus Toolbox can: +

The Critical Apparatus <ptr type="software" xml:id="Toolbox" + target="#Toolbox"/><rs type="soft.name" ref="#Toolbox">Toolbox</rs> can: Highlight apparatus entries that do not use all witnesses: the content of the incomplete app elements which do not explicitly give a text for each witness listed in the listWit will be highlighted in red. @@ -378,7 +388,8 @@
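An entry that the first of these controls would flag might look as follows (a sketch assuming a listWit declaring witnesses A, B, and C):

<app>
  <lem wit="#A">leniantur</lem>
  <rdg wit="#B">laniantur</rdg>
  <!-- no reading recorded for witness C: the entry is highlighted in red -->
</app>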

Displaying this code in the Critical - Apparatus Toolbox + Apparatus Toolbox
Highlight apparatus entries that do not use a specific witness: @@ -422,10 +433,12 @@ end it gives fewer opportunities for automated verification: it is impossible to be sure that the editor added the reading of each witness, since unmentioned manuscripts are assumed by default to correspond to the lemma.

-

In this case, the Critical Apparatus Toolbox can - nevertheless highlight apparatus entries that mention a particular - witness. Practically, this will highlight only apparatus entries where a - witness’s reading differs from the lemma. Example:

+

In this case, the Critical Apparatus <ptr type="software" + xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name" ref="#Toolbox" + >Toolbox</rs> can nevertheless highlight apparatus entries that + mention a particular witness. Practically, this will highlight only + apparatus entries where a witness’s reading differs from the lemma. Example:

dolores corporis leniantur laniantur
@@ -435,16 +448,18 @@
Displaying this code in the Critical - Apparatus Toolbox, when highlighting all apparatus entries mentioning - witness R2 + Apparatus Toolbox, when highlighting all + apparatus entries mentioning witness R2
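A sketch of the kind of negative-apparatus entry at issue (the structure is assumed from the surrounding prose, not quoted from the article):

dolores corporis
<app>
  <lem>leniantur</lem>
  <rdg wit="#R2">laniantur</rdg>
</app>
<!-- witnesses not mentioned in the entry are assumed to read with the lemma -->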

Other Controls -

The Critical Apparatus Toolbox can also highlight - apparatus entries that contain a lem, or that contain only rdg - elements.

+

The Critical Apparatus <ptr type="software" xml:id="Toolbox" + target="#Toolbox"/><rs type="soft.name" ref="#Toolbox">Toolbox</rs> can + also highlight apparatus entries that contain a lem, or that contain only + rdg elements.
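A sketch of the second case, an entry recording readings only, with no identified lemma:

<app>
  <rdg wit="#A">leniantur</rdg>
  <rdg wit="#R2">laniantur</rdg>
</app>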

@@ -480,13 +495,14 @@
Application Design -

The Critical Apparatus Toolbox is an online application built - on a set of XSLT stylesheets served through PHP files, the output of which is made - interactive thanks to Javascript and CSS. It makes use of some parts of TEI Boilerplate, most notably its web design. But despite the similar look - and feel, the core functions are very different: all the parts of the TEI Boilerplate stylesheets pertaining to critical edition elements have been - overridden.

+

The Critical Apparatus <ptr type="software" xml:id="Toolbox" + target="#Toolbox"/><rs type="soft.name" ref="#Toolbox">Toolbox</rs> is an + online application built on a set of XSLT stylesheets served through PHP files, the + output of which is made interactive thanks to Javascript and CSS. It makes use of some + parts of TEI Boilerplate, most notably its web design. But + despite the similar look and feel, the core functions are very different: all the parts + of the TEI Boilerplate stylesheets pertaining to critical + edition elements have been overridden.

The XSLT stylesheets analyze the TEI XML edition, and determine its characteristics.

I have chosen to concentrate here on the presentation of the general framework, rather than getting into technical details about the XSLT @@ -510,37 +526,43 @@ TEI P5 version 2.9.0 release notes

In the future, keeping up with the developments and evolutions of the Critical Apparatus module will be a - priority. We hope that the Critical Apparatus Toolbox will be - able to adapt to these evolutions: since the functions of the interface are powered by - Javascript, updating the XSLT should be enough to adapt to new rules or elements in the - module.

+ priority. We hope that the Critical Apparatus <ptr type="software" + xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name" ref="#Toolbox" + >Toolbox</rs> will be able to adapt to these evolutions: since the functions + of the interface are powered by Javascript, updating the XSLT should be enough to adapt + to new rules or elements in the module.

Future Developments -

The beginning of the development of the Critical Apparatus - Toolbox was a lonely endeavor, but the project has since benefited from the - collaboration of Magdalena Turska

Magdalena Turska created this prototype as part - of her work as an Experienced Researcher Fellow of the <ref - target="http://dixit.uni-koeln.de">Digital Scholarly Editions Initial Training - Network</ref> (DIXIT) program - (accessed December 4, 2016).

who wrote a prototype for the integration of - the Toolbox into an oXygen framework. Decisive help was also found via a collaboration - with the Erasmus SP+ DEMM program (<ref - target="http://www.digitalmanuscripts.eu">Digital Edition of Medieval - Manuscripts</ref>).

Accessed December 4, 2016, The beginning of the development of the Critical Apparatus <ptr + type="software" xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name" + ref="#Toolbox">Toolbox</rs> was a lonely endeavor, but the project has since + benefited from the collaboration of Magdalena Turska

Magdalena Turska created this + prototype as part of her work as an Experienced Researcher Fellow of the <ref target="http://dixit.uni-koeln.de">Digital Scholarly Editions Initial + Training Network</ref> (DIXIT) program (accessed December 4, 2016).

who + wrote a prototype for the integration of the Toolbox into an oXygen + framework. Decisive help was also found via a collaboration with the Erasmus SP+ DEMM + program (<ref target="http://www.digitalmanuscripts.eu">Digital Edition + of Medieval Manuscripts</ref>).

Accessed December 4, 2016,

For the three years beginning in June 2015 DEMM is holding an annual hackathon event where the Critical - Apparatus Toolbox is the base application that small, mixed teams of textual - scholars and computer scientists try to enhance to meet their particular + Apparatus Toolbox is the base application that small, mixed teams of + textual scholars and computer scientists try to enhance to meet their particular needs.

The first of these events took place at Queen Mary University London in June 2015, with the computer scientists Astrid Bin, Daniel Gabana Arellano, Danielle Gilaberte De Almeida, and Chris Sparks, along with all the current DEMM students as textual scholars (Digital Editing of Medieval Manuscripts, People, accessed December 4, 2016, ).

- These events will play an important role in the future developments of the Toolbox, since - they confront us directly with the real-life experience and needs of editors.

+ These events will play an important role in the future developments of the Toolbox, since they confront us directly with the real-life experience and needs + of editors.

New Controls and Features

During the first hackathon, the students and computer scientists were divided into four @@ -549,8 +571,9 @@ text (from various states of transcription showing either abbreviated or expanded words to parallel versions of a text with potentially different branches), and the last concentrated on the relationship between the edited text and images. These themes could - serve as general directions for enhancing the Critical Apparatus - Toolbox: + serve as general directions for enhancing the Critical Apparatus <ptr + type="software" xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name" + ref="#Toolbox">Toolbox</rs>: offering visualization options for named entities, from a simple index to more elaborate links to maps, when possible; taking into account the visualization of transcription features like @@ -559,8 +582,9 @@ display of parallel witnesses; adding some options to link the text to its representation, or to images generally. This poses the problem of access to the images: in the current state of - the Toolbox, users upload their TEI XML edition but not the other files potentially - linked to it, like images. + the Toolbox, users upload their TEI XML edition but not the other + files potentially linked to it, like images.

@@ -572,13 +596,14 @@ Tool.<ref target="http://tapor.uvic.ca/~mholmes/image_markup/">The UVic Image Markup Tool Project</ref>, accessed December 4, 2016, Even if the Critical Apparatus Toolbox is not a publication application, such an - output would provide users with a ready-to-use static version of their edition, a set of - files (HTML, CSS, Javascript, etc.) that they could publish on their website or show in - a demo session. While complex projects will always need a proper publication framework, - this sort of lightweight publication output would provide a simple tool for basic - self-publication.

For a discussion of the question of TEI and - self-publication, see Pierazzo 2015, + level="m">Critical Apparatus Toolbox is not a publication + application, such an output would provide users with a ready-to-use static version of + their edition, a set of files (HTML, CSS, Javascript, etc.) that they could publish on + their website or show in a demo session. While complex projects will always need a + proper publication framework, this sort of lightweight publication output would provide + a simple tool for basic self-publication.

For a discussion of the question of + TEI and self-publication, see Pierazzo 2015, 129–30.

@@ -598,19 +623,20 @@ numbering, and content of the apparatus notes) to deliver as useful an output as possible.

I am preparing a generic TEI-to-LaTeX and TEI-to-PDF conversion feature that will be - implemented in the Critical Apparatus Toolbox. I chose LaTeX as - an intermediary format because it offers all the desired options, thanks to the - reledmac package especially designed for typesetting critical - editions.

The reledmac package, maintained by Maïeul Rouquette, - replaces the established ledmac and eledmac packages. See - CTAN (Comprehensive TeX Archive Network), <code>reledmac</code>—Typeset Scholarly Editions, accessed December 4, - 2016, I thank M. Rouquette - for kindly helping me to learn how to use the package.

It is better - suited to the specific needs of critical editions than XSL:FO. Another advantage of an - intermediary file is that it leaves users the opportunity to edit the LaTeX code to - obtain a better PDF result, which they might prefer over a modification of the XSLT - templates, depending on their skillset.

+ implemented in the Critical Apparatus <ptr type="software" + xml:id="Toolbox" target="#Toolbox"/><rs type="soft.name" ref="#Toolbox" + >Toolbox</rs>. I chose LaTeX as an intermediary format because it offers all + the desired options, thanks to the reledmac package especially designed for + typesetting critical editions.

The reledmac package, maintained by + Maïeul Rouquette, replaces the established ledmac and + eledmac packages. See CTAN (Comprehensive TeX Archive Network), + <code>reledmac</code>—Typeset Scholarly Editions, + accessed December 4, 2016, + I thank M. Rouquette for kindly helping me to learn how to use the + package.

It is better suited to the specific needs of critical editions + than XSL:FO. Another advantage of an intermediary file is that it leaves users the + opportunity to edit the LaTeX code to obtain a better PDF result, which they might + prefer over a modification of the XSLT templates, depending on their skillset.

This feature, still a work-in-progress but well advanced, lets the user customize many parameters of the output through a graphical interface, without requiring any knowledge of LaTeX. When users need heavy customization of the default settings, they can easily @@ -643,13 +669,14 @@ integrate TEI encoding seamlessly into the workflow of textual scholars.See Turska, Cummings and Rahtz 2016

-

The Critical Apparatus Toolbox belongs in this growing galaxy of - lightweight, user-oriented tools. With features demonstrating the immediate benefits of - TEI encoding, it is a good tool for TEI training and workshops. But its main purpose - remains to facilitate the work of textual scholars, a complex task given the widely - diverse habits of different schools, and even of different people. Our first results - demonstrate that despite the diversity, some level of common ground can be found to - provide generic services.

+

The Critical Apparatus <ptr type="software" xml:id="Toolbox" + target="#Toolbox"/><rs type="soft.name" ref="#Toolbox">Toolbox</rs> belongs in + this growing galaxy of lightweight, user-oriented tools. With features demonstrating the + immediate benefits of TEI encoding, it is a good tool for TEI training and workshops. But + its main purpose remains to facilitate the work of textual scholars, a complex task given + the widely diverse habits of different schools, and even of different people. Our first + results demonstrate that despite the diversity, some level of common ground can be found + to provide generic services.

diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml
index a1234e5..9ec8142 100644
--- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml
+++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml
@@ -1,5 +1,6 @@
-
+
+
+
@@ -396,10 +397,11 @@
 >Jupiter noch 18[…]

Another example of similar inline phenomena in manuscripts and printed texts is the underlining of important phrases or keywords, represented in the DTABf as hi
-   rendition="#u" for printed texts and manuscripts alike. Furthermore, though this
-   feature is far more frequent in prints, manuscripts may also contain catchwords or
-   signature marks at the bottom of the page, which we tag as fw with
-   type=catch or type=sig, respectively.

+ rendition="#u" for printed texts and manuscripts alike. Furthermore, + though this feature is far more frequent in prints, manuscripts may also contain + catchwords or signature marks at the bottom of the page, which we tag as fw + with type=catch or type=sig, + respectively.

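Encoded along these lines, the features just described might read as follows (a sketch; the underlined word, signature, and catchword values are illustrative):

<hi rendition="#u">Jupiter</hi>
<!-- at the foot of the page -->
<fw type="sig" place="bottom">A2</fw>
<fw type="catch" place="bottom">ſchien</fw>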
The last two lines of running text, followed by a signature mark and @@ -681,7 +683,7 @@
Als der Enkeſche Comet in Paramala wieder - er- ſchien will HHerr + er- ſchien will HHerr Dummler eine Rotation des Schweifes von[…]

In the following example () it is obvious @@ -1207,7 +1209,8 @@

The DTA project is an example of the application of the TEI Guidelines to large-scale corpora. Our primary goal is to be as inclusive as possible, allowing for other projects to benefit from our resources (i.e., our comprehensive guidelines and documentation as - well as the technical infrastructure that includes Schemas, ODDs, and XSLT scripts) and + well as the technical infrastructure that includes Schemas, ODDs, and XSLT scripts) and contribute to our corpora. We also want to ensure interoperability of all data within the DTA corpora. The underlying TEI format has to be continuously maintained and adapted to new necessities with these two premises in mind.

diff --git a/data/JTEI/10_2016-19/jtei-10-romary-source.xml b/data/JTEI/10_2016-19/jtei-10-romary-source.xml
index 741b72c..3b75d2b 100644
--- a/data/JTEI/10_2016-19/jtei-10-romary-source.xml
+++ b/data/JTEI/10_2016-19/jtei-10-romary-source.xml
@@ -1,5 +1,6 @@
-
+
+
+
@@ -644,13 +645,15 @@
 available at . In our proposal, the etym element has to be made recursive in order to allow the fine-grained representations we propose here. The corresponding ODD customization, together with
-   reference examples, is available on GitHub. and the fact that a change occurred
-   within the contemporary lexicon (as opposed to its parent language) is indicated by
-   means of xml:lang on the source form.There may also be cases in which
-   it is unknown whether a given etymological process occurred within the contemporary
-   language or parent system; in such cases the encoder can just use the main language
-   tag for both the diachronic and synchronic portions of the entry as a default (see,
-   for instance, ).

+ reference examples, is available on GitHub. and the + fact that a change occurred within the contemporary lexicon (as opposed to its parent + language) is indicated by means of xml:lang on the source form.There + may also be cases in which it is unknown whether a given etymological process occurred + within the contemporary language or parent system; in such cases the encoder can just + use the main language tag for both the diachronic and synchronic portions of the entry + as a default (see, for instance, ).

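A sketch of the mechanism described here (the nesting is assumed from the prose; the Latin etymon is illustrative):

<etym>
  <cit type="etymon">
    <form>
      <!-- xml:lang on the source form signals where the change occurred -->
      <orth xml:lang="la">caput</orth>
    </form>
    <!-- etym being made recursive, an earlier etymon may nest here -->
  </cit>
</etym>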
In the TEI encoding, the former two can be respectively labeled as: and The interested reader may ponder here the possibility to also encode scripts by means of the notation attribute instead of using a cluttering of language subtags on xml:lang. For more on this issue, see the proposal in - the TEI GitHub (). This - is why we have extended the notation attribute to orth in order to - allow for better representation of both language identification and the orthographic - content. With this double mechanism, we intend to describe content expressed in the same - language by means of the same language tag, thus allowing more reliable management, - access, and search procedures over our lexical content.

+ the TEI GitHub (). This is why we have extended the notation attribute to + orth in order to allow for better representation of both language + identification and the orthographic content. With this double mechanism, we intend to + describe content expressed in the same language by means of the same language tag, thus + allowing more reliable management, access, and search procedures over our lexical + content.

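A sketch of the extended orth (the attribute extension is the authors'; the language and notation values here are illustrative):

<form>
  <!-- same language tag, two writing systems distinguished by @notation -->
  <orth xml:lang="ar" notation="arabic">كتاب</orth>
  <orth xml:lang="ar" notation="latin">kitāb</orth>
</form>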
We are aware that we open a can of worms here, since such an editorial practice could be easily extended to all text elements in the TEI Guidelines. We have actually identified several cases in the sole context of lexical representations (e.g., @@ -915,39 +920,48 @@ kápŭ - + + kábu - + + k̯áβo̥ - + + t̯áβo - + + t͡sávo̥ - + + t͡šíe̥vo̥ - + + tšíe̥f - + + šyé̥f - + + šé̥f - + + šę́f @@ -972,9 +986,10 @@ respectively.

The dateThe element date as a child of cit is another example which does not adhere to the current TEI standards. We have allowed this - within our ODD document. A feature request proposal will be made on the GitHub page - and this feature may or may not appear in future versions of the TEI - Guidelines. element is listed within each etymon block; the values of + within our ODD document. A feature request proposal will be made on the GitHub page and this feature may or may not appear in future versions of the + TEI Guidelines. element is listed within each etymon block; the values of attributes notBefore and notAfter specify the range of time corresponding to the period of time that the given form was in use according to the authors.In the (French language) source of this example (cardinalNumber ten - + + uni cardinalNumber @@ -2469,11 +2485,12 @@ Problematic and Unresolved Issues

For the issues regarded as the most fundamentally important to creating a dynamic and sustainable model for both etymology and general lexicographic markup in TEI, we have
-   submitted formal requests for changes to the TEI GitHub, and will continue to submit
-   change requests as needed. While this work represents a large step in the right direction
-   for those looking for means of representing etymological information, there are still a
-   number of unresolved issues that will need to be addressed. These remaining issues pertain
-   to:
+   submitted formal requests for changes to the TEI GitHub, and will continue to
+   submit change requests as needed. While this work represents a large step in the right
+   direction for those looking for means of representing etymological information, there are
+   still a number of unresolved issues that will need to be addressed. These remaining issues
+   pertain to:
 (i) expanding the types of etymological information and refining the representation of the processes and features which are covered; and (ii) the need for continued progress in a number of issues within the body of

diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml
index 49ae07a..0b87d58 100644
--- a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml
+++ b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml
@@ -1,6 +1,6 @@
-
-
+
+
@@ -109,8 +109,9 @@
 textual variation. offers examples of one of the ways in which the variant taxonomy may be linked to the body of the edition.

Although this paper is TEI-centered, other XML technologies will be mentioned. includes a brief commentary on using XSLT to - transform a TEI-conformant definition of constraints into schema rules. However, the + type="crossref" target="#validation"/> includes a brief commentary on using XSLT + to transform a TEI-conformant definition of constraints into schema rules. However, the greatest attention to an additional technology is in , which discusses the use of XQuery to retrieve particular loci critici and to deploy quantitative analyses.

@@ -444,9 +445,10 @@ definition, its typed-feature modeling facilitates the creation of schema constraints. For instance, I process my declaration to further constrict my schema so the feature structure declaration and its actual application are always synchronized and up to date.I use - XSLT to process the feature structure declaration in order to create all required - Schematron rules that will constrict the feature library accordingly. I am currently - working on creating a more generic validator (see my XSLT to process the feature structure declaration in order to create all + required Schematron rules that will constrict the feature library accordingly. I am + currently working on creating a more generic validator (see my Github repository, ).
@@ -539,11 +541,18 @@ >parallel segmentation method (TEI Consortium 2016, 12.2.3) seems to be a popular encoding technique for multi-witness editions, in terms of both the specific tools that have been created for this method and the number - of projects that apply it.Tools include Versioning Machine, CollateX (both - the Java and Python versions), and Juxta. For representative projects using the parallel segmentation method see - Tools include Versioning Machine, CollateX (both + the Java and Python versions), and Juxta. For + representative projects using the parallel segmentation method see Satire in Circulation: James editions Russell Lowell’s Letter from a volunteer in Saltillo, Walden: A Fluid-Text Edition, or f[@name eq "description"], as shown in + contents of f[@name eq "description"], as shown in
HTML edition sample

diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml
index ecce37d..309ab6c 100644
--- a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml
+++ b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml
@@ -1,8 +1,6 @@
-
-
-
+
+
@@ -121,11 +119,12 @@

+ data set that has been published not only on GitHub, but also in an + eXistdb-powered web application. This is by any standard a wonderful development for a + collection of textual data—and one that would not have been possible had the abstracts not + been published under an open license, especially since their authors come from fourteen + different countries.

Why Nation Matters, Why It Shouldn’t: National Laws vs. WIPO @@ -532,12 +531,14 @@ InDesign. The real fun started when the InDesign file was exported to XML and transformed back into single files (one file per abstract). These files were edited with the Oxygen XML editor to become proper TEI files with extensive headers. Finally, they - were published as a repository together with the TEI schema on GitHub (Hannesschläger and Schopper 2017), again under the - same license. This allowed Martin Sievers, one of the abstract authors, to immediately - correct a typing error in his abstract that the editors had overlooked (see history of - Hannesschläger and Schopper 2017 on - GitHub).

+ were published as a repository together with the TEI schema on GitHub (Hannesschläger and Schopper 2017), again + under the same license. This allowed Martin Sievers, one of the abstract authors, to + immediately correct a typing error in his abstract that the editors had overlooked (see + history of Hannesschläger and Schopper + 2017 on GitHub).

But the story did not end there. The freely available and processable collection of abstracts inspired Peter Andorfer, a colleague of the editors at the Austrian Centre for Digital Humanities, to use this text collection to build an eXistdb-powered web application.

diff --git a/data/JTEI/12_2019-20/jtei-cc-ra-bauman-170-source.xml b/data/JTEI/12_2019-20/jtei-cc-ra-bauman-170-source.xml
index f041360..70ce74d 100644
--- a/data/JTEI/12_2019-20/jtei-cc-ra-bauman-170-source.xml
+++ b/data/JTEI/12_2019-20/jtei-cc-ra-bauman-170-source.xml
@@ -1,6 +1,6 @@
-
-
-
+
+
+
 requires that there be a date in a particular format in a particular place (say, in W3C format on the when of
-   correspAction type="sent"), then our publication software does
-   not have to worry about what to do if it finds a letter without a date when it
-   tries to sort them. The publication software (or, to be more precise and stop
-   anthropomorphizing code, the computer programmer who writes the publication
-   software) can depend on the fact that every letter will have a date in exactly
-   the right spot.

+ correspAction type="sent"), then our publication + software does not have to worry about what to do if it finds a letter without a + date when it tries to sort them. The publication software (or, to be more + precise and stop anthropomorphizing code, the computer programmer who writes + the publication software) can depend on the fact that every letter will have a + date in exactly the right spot.

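The kind of constraint at issue might be met by markup such as this (a sketch; the date is illustrative):

<correspAction type="sent">
  <persName>...</persName>
  <!-- a W3C-format date in exactly the expected place -->
  <date when="1859-03-02"/>
</correspAction>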
As Wendell Piez points out (Piez 2001), this is exactly analogous to the use of a go/no-go gauge.

@@ -375,8 +375,9 @@
-   XSLT template that converts an lb into a
-   space.
+   XSLT template that converts an
+   lb into a space.

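A minimal template of the kind the caption describes (a sketch, not the article's own code; it assumes the usual tei namespace prefix):

<xsl:template match="tei:lb">
  <xsl:text> </xsl:text>
</xsl:template>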
@@ -773,28 +774,29 @@ the list of possible values for include of moduleRef (grammar-based, lists all possible values), the list of possible values for include of moduleRef - key="core" (rule-based, lists only values from the + key="core" (rule-based, lists only values from the core module), the list of possible values for except of moduleRef (grammar-based, lists all possible values), the list of possible values for except of moduleRef - key="core" (rule-based, lists only values from the + key="core" (rule-based, lists only values from the core module), the list of possible values for key of elementRef, and the list of possible values for ident of elementSpec. Furthermore, the value teiCorpus appears in the value of the - except attribute of the moduleRef key="core" in the - tei_customization.odd (as opposed to the moduleRef it + except attribute of the moduleRef key="core" in + the tei_customization.odd (as opposed to the moduleRef it is defining) — this one should be removed, too, but since current ODD processors would ignore it, would not cause a problem if it were left.
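For reference, the attributes under discussion occur in ODD customizations in contexts such as this (a sketch; the module and element names are illustrative):

<schemaSpec ident="my_customization">
  <!-- @except excludes named elements from the module -->
  <moduleRef key="core" except="sic"/>
  <!-- @key on elementRef pulls in a single element -->
  <elementRef key="schemaSpec"/>
</schemaSpec>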

Either way, in order to avoid this potential maintenance nightmare, tei_customization.odd is not a static file, but rather is - generated by running an XSLT program that reads as its input the source to TEI - P5

Remember, the TEI Guidelines are written in TEI. The source to - all of P5 is a single TEI document, although for convenience it is split - into well over 850 separate files.

and writes + generated by running an XSLT program that reads as its input the + source to TEI P5

Remember, the TEI Guidelines are written in TEI. The + source to all of P5 is a single TEI document, although for convenience it is + split into well over 850 separate files.

and writes tei_customization.odd as its output. Running this program is one of the steps in the build process for creating a new set of TEI P5 generated outputs.

These outputs include schemas in RELAX NG and
@@ -884,9 +886,12 @@

How to Get it and Use it -

The XSLT program used to generate tei_customization.odd can be - found in the TEI - GitHub repository. It is currently called +

The XSLT program used to generate + tei_customization.odd can be found in the TEI GitHub repository. It is currently called TEI-to-tei_customization.xslt. The generated tei_customization ODD file and the schemas generated from it can be found in each release of the TEI from 3.3.0 on.

For example, the diff --git a/data/JTEI/12_2019-20/jtei-cc-ra-flanders-176-source.xml b/data/JTEI/12_2019-20/jtei-cc-ra-flanders-176-source.xml index 623c3a8..f8b706f 100644 --- a/data/JTEI/12_2019-20/jtei-cc-ra-flanders-176-source.xml +++ b/data/JTEI/12_2019-20/jtei-cc-ra-flanders-176-source.xml @@ -1,7 +1,6 @@ - - + + @@ -129,7 +128,8 @@ pedagogy TAPAS - XSLT + XSLT validation @@ -169,17 +169,18 @@ type="bibl">Flanders and Hamlin 2013), infrastructure for TEI publication was at that time (and still remains to some extent) challenging for individual scholars and small projects, because of the costs and logistics of maintaining servers, developing and - supporting XSLT stylesheets, and maintaining technical expertise for troubleshooting and - longer term support. TAPAS was developed to provide an alternative in the form of a - web-based service for publishing TEI data. It offers a growing infrastructure of TEI - publishing tools, a publishing venue that highlights the value of using TEI, and a - long-term guarantee of visibility and access to the TEI data. TAPAS was originally hosted - at Brown University and is now hosted at Northeastern University, which also provides a - guarantee of long-term repository support for TAPAS data. TAPAS has been generously funded - by the TEI Consortium and by an initial planning grant from the Institute for Museum and - Library Services, a digital humanities startup grant from the NEH, a research and - development grant from the NEH, and most recently a digital humanities advancement grant - which has supported TAPAS Classroom.

+ supporting XSLT stylesheets, and maintaining technical expertise for + troubleshooting and longer term support. TAPAS was developed to provide an alternative in + the form of a web-based service for publishing TEI data. It offers a growing + infrastructure of TEI publishing tools, a publishing venue that highlights the value of + using TEI, and a long-term guarantee of visibility and access to the TEI data. TAPAS was + originally hosted at Brown University and is now hosted at Northeastern University, which + also provides a guarantee of long-term repository support for TAPAS data. TAPAS has been + generously funded by the TEI Consortium and by an initial planning grant from the + Institute for Museum and Library Services, a digital humanities startup grant from the + NEH, a research and development grant from the NEH, and most recently a digital humanities + advancement grant which has supported TAPAS Classroom.

TAPAS Classroom and TEI Pedagogy @@ -388,7 +389,9 @@ ensuring that data is stored safely and can be accessed after the course is over. These functions are valuable for instructors at all levels of technical expertise but are particularly enabling for instructors familiar with TEI but not with its - supporting technologies (XSLT, XML databases, web publishing frameworks). + supporting technologies (XSLT, XML databases, web publishing + frameworks). Expose and demystify: The tool experimentation we saw in the Lyrical Ballads course and in Mary Isbell’s Digital Editing course suggested the value of a publication system that makes it possible to see the @@ -429,17 +432,22 @@
View Packages

The display of TEI files through the TAPAS interface is handled—as with almost all modern - web display of TEI data—through XSLT stylesheets, and indeed one of TAPAS’s most important - functions is to enable users to transform and view their TEI data without having to write - or run XSLT on their own. TAPAS’s XSLT stylesheets do not operate in isolation but as part - of a more complex view package that includes several distinct - components: - One or more XSLT stylesheets which transform the source TEI data into another - format (typically XHTML but potentially JSON or other formats that are particularly - suited to some viewing application) + web display of TEI data—through XSLT stylesheets, and indeed one of TAPAS’s most + important functions is to enable users to transform and view their TEI data without having + to write or run XSLT on their own. TAPAS’s XSLT stylesheets do not operate in + isolation but as part of a more complex view package that includes + several distinct components: + One or more XSLT stylesheets which transform the source TEI + data into another format (typically XHTML but potentially JSON or other formats that + are particularly suited to some viewing application) A CSS stylesheet that provides styling information - Optional JavaScript code to produce features of user interactivity (such as - mouse-overs or selection of specific viewing options) + Optional JavaScript code to produce features of user + interactivity (such as mouse-overs or selection of specific viewing options) Optional XProc code which handles the sequential processing of the data by the individual components of the view package Optional RELAX NG or ISO Schematron files that formalize any specific data @@ -579,11 +587,14 @@ view of the document itself through the lens of a schema. Technically, the process involves: running a validation process in XProc - running an XSLT stylesheet that integrates the validation output with the TEI + running an XSLT stylesheet that integrates the validation output with the TEI document - running an XSLT stylesheet that transforms the whole thing into HTML for display - (with JavaScript as needed to support interaction such as navigation between the error - report and the document itself). + running an XSLT stylesheet that transforms the whole thing into HTML for + display (with JavaScript as needed to support interaction + such as navigation between the error report and the document itself). Along the way, the process substitutes error messages that are more intelligible to novices and also more reassuring and detailed, and collapses multiple occurrences of a given error message into a single instance. The result is to position the validation view diff --git a/data/JTEI/13_2020-22/jtei-cc-pn-kuhry-188-source.xml b/data/JTEI/13_2020-22/jtei-cc-pn-kuhry-188-source.xml index 331e28b..7631ac6 100644 --- a/data/JTEI/13_2020-22/jtei-cc-pn-kuhry-188-source.xml +++ b/data/JTEI/13_2020-22/jtei-cc-pn-kuhry-188-source.xml @@ -1,8 +1,6 @@ - - - + + @@ -416,15 +414,17 @@ encoding. So the second part of the project consists in creating a panel of tools for the encoding of ancient textual sources in TEI XML. These tools will enhance existing software, and not create it from scratch.

-

The tools include XSLT stylesheets that convert styled Word or LibreOffice documents to - TEI XML.Sample transformations can already be made using OxGarage (accessed June - 4, 2021, ), and XSLT stylesheets can be - found in the TEI Consortium’s GitHub repository, accessed June 4, 2021,. Pre-encoding a text in a word - processor can be useful, but frequently a deeper level of encoding is needed, which is - difficult to reach working only on the text document. So a second category of tools is a - series of frameworks or encoding environments made through customization of two widely - used XML editors: +

The tools include XSLT stylesheets that convert styled Word or + LibreOffice documents to TEI XML.Sample transformations can already be made using + OxGarage (accessed June 4, 2021, ), and + XSLT stylesheets can be found in the TEI Consortium’s GitHub repository, + accessed June 4, 2021,. + Pre-encoding a text in a word processor can be useful, but frequently a deeper level of + encoding is needed, which is difficult to reach working only on the text document. So a + second category of tools is a series of frameworks or encoding environments made through + customization of two widely used XML editors: XMLmind XML Editor (XXE),Accessed June 4, 2021, . Customization of this editor is a @@ -528,13 +528,15 @@ type="bibl">Burnard 2019).

The challenge in developing encoding frameworks is not so much technical, as they use features available in each application. Frameworks consist of command configuration and - CSS files, sometimes XSLT files. They provide an economic, versatile way to provide - scholars with customized tools for digital scholarly editing.The Ediarum framework - for Oxygen has been developed at the Berlin-Brandenburg Academy of Sciences and - Humanities to help scholars create digital editions of historical documents in TEI - XML. However, this framework does not enable one to encode variant readings and is - therefore not designed for critical editing of medieval sources. See Mertgens (2019) and the Ediarum website: XSLT files. They provide an economic, versatile way + to provide scholars with customized tools for digital scholarly editing.The + Ediarum framework for Oxygen has been developed at the Berlin-Brandenburg Academy of + Sciences and Humanities to help scholars create digital editions of historical + documents in TEI XML. However, this framework does not enable one to encode variant + readings and is therefore not designed for critical editing of medieval sources. See + Mertgens (2019) and the Ediarum website: + . The Caen team Pôle Document numérique has since published a framework for critical editing in TEI Parallel Segmentation based on XMLmind XML Editor software: see See, for example, the <soCalled>Print an edition</soCalled> tool in the TEI Critical Apparatus Toolbox by M. Burghart, - in which the sample XSLT stylesheet to transform TEI XML into LaTeX, to produce - a printable PDF with a traditional critical apparatus, can be downloaded and - modified: <ptr target="http://teicat.huma-num.fr/print.php"/>. Burghart (<ref - target="#burghart16" type="bibl">2016</ref>) provides a survey of + in which the sample <ptr type="software" xml:id="XSLT" target="#XSLT"/><rs + type="soft.name" ref="#XSLT">XSLT</rs> stylesheet to transform TEI XML into + LaTeX, to produce a printable PDF with a traditional critical apparatus, can be + downloaded and modified: <ptr target="http://teicat.huma-num.fr/print.php"/>. + Burghart (<ref target="#burghart16" type="bibl">2016</ref>) provides a survey of transformation and display tools for scholarly editions encoded with TEI. TEI Publisher (<ptr target="https://teipublisher.com/index.html"/>) and the Max display engine (<ptr @@ -731,8 +734,10 @@ </div> <div xml:id="display"> <head>Display Possibilities</head> - <p>A publication prototype has been created using XSLT stylesheets and Javascript to - produce interactive HTML pages. The current version allows one to <list rend="bulleted"> + <p>A publication prototype has been created using <ptr type="software" xml:id="XSLT" + target="#XSLT"/><rs type="soft.name" ref="#XSLT">XSLT</rs> stylesheets and Javascript + to produce interactive HTML pages. 
The current version allows one to <list + rend="bulleted"> <item>read the reconstructed text of each manuscript;</item> <item>open each gloss in a pop-up with a click on the lemma (each pop-up can be moved where one likes);</item> diff --git a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml index 879bb04..c55001f 100644 --- a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml +++ b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml @@ -1,5 +1,7 @@ -<?xml version="1.0" encoding="UTF-8"?><?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_jtei.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?><?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_jtei.rng" type="application/xml" - schematypens="http://purl.oclc.org/dsdl/schematron"?><!--$Id: jtei-cc-ra-parisse-182-source.xml 1047 2021-07-02 13:01:02Z pietro.liuzzo $--> +<?xml version="1.0" encoding="UTF-8"?> +<?xml-model href="https://github.com/DH-RSE/software-citation/raw/main/schema/tei_jtei_annotated.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?> +<?xml-model href="https://github.com/DH-RSE/software-citation/raw/main/schema/tei_jtei_annotated.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> +<!--$Id: jtei-cc-ra-parisse-182-source.xml 1047 2021-07-02 13:01:02Z pietro.liuzzo $--> <TEI xmlns="http://www.tei-c.org/ns/1.0" rend="jTEI"> <teiHeader> <fileDesc> @@ -97,13 +99,15 @@ spoken language corpora, and TEIMETA for metadata purposes. TEICORPO is based on the principle of an underlying common format, namely TEI XML as described in its specification for spoken language use (ISO 2016). This tool enables the conversion of transcriptions - created with alignment software such as CLAN, Transcriber, Praat, or ELAN as well as + created with alignment software such as CLAN, Transcriber, Praat, or <ptr type="software" + xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN">ELAN</rs> as well as common file formats (CSV, XLSX, TXT, or DOCX) and the TEI format, which plays the role of a lossless pivot format. Backward conversion is possible in many cases, with limitations inherent in the destination target format. TEICORPO can run the Treetagger part-of-speech tagger and the Stanford CoreNLP tools on TEI files and can export the resulting files to - textometric tools such as TXM, Le Trameur, or Iramuteq, making it suitable for spoken - language corpora editing as well as for various research purposes.</p> + textometric tools such as <ptr type="software" xml:id="TXM" target="#TXM"/><rs + type="soft.name" ref="#TXM">TXM</rs>, Le Trameur, or Iramuteq, making it suitable for + spoken language corpora editing as well as for various research purposes.</p> </div> </front> <body> @@ -209,15 +213,17 @@ <div xml:id="similarities"> <head>Similarities with and Differences from Other Approaches</head> <p>Many software packages dedicated to editing spoken language transcription contain - utilities that can convert many formats: for example, EXMARaLDA (<ref type="bibl" - target="#schmidt2004">Schmidt 2004</ref>; see <ptr target="https://exmaralda.org"/>), - Anvil (<ref type="bibl" target="#kipp2001">Kipp 2001</ref>; see <ptr - target="https://www.anvil-software.org"/>), and ELAN (<ref type="bibl" - target="#wittenburg2006">Wittenburg et al. 2006</ref>; see <ptr - target="https://archive.mpi.nl/tla/elan"/>). 
However, in all cases, the conversions - are limited to the features implemented in the tool itself—for example, with a limited - set of metadata—and they cannot always be used to prepare data to be used by another - tool.</p> + utilities that can convert many formats: for example, <ptr type="software" + xml:id="EXMARaLDA" target="#EXMARaLDA"/><rs type="soft.name" ref="#EXMARaLDA" + >EXMARaLDA</rs> (<ref type="bibl" target="#schmidt2004">Schmidt 2004</ref>; see <ptr + target="https://exmaralda.org"/>), Anvil (<ref type="bibl" target="#kipp2001">Kipp + 2001</ref>; see <ptr target="https://www.anvil-software.org"/>), and <ptr + type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs> (<ref type="bibl" target="#wittenburg2006">Wittenburg et al. 2006</ref>; + see <ptr target="https://archive.mpi.nl/tla/elan"/>). However, in all cases, the + conversions are limited to the features implemented in the tool itself—for example, with + a limited set of metadata—and they cannot always be used to prepare data to be used by + another tool.</p> <p>While our work is similar to that of Schmidt (<ref type="bibl" target="#schmidt2011" >2011</ref>), several differences make our approaches complementary. First, the two main common features are as follows:</p> @@ -227,17 +233,23 @@ <item>TEI is used as a destination format.</item> </list> <p>The list of tools that are considered in the two projects is nearly the same. The only - tools missing in the TEICORPO approach are EXMARaLDA and FOLKER (<ref type="bibl" - target="#schmidts2010">Schmidt and Schütte 2010</ref>; see <ptr + tools missing in the TEICORPO approach are <ptr type="software" xml:id="EXMARaLDA" + target="#EXMARaLDA"/><rs type="soft.name" ref="#EXMARaLDA">EXMARaLDA</rs> and FOLKER + (<ref type="bibl" target="#schmidts2010">Schmidt and Schütte 2010</ref>; see <ptr target="https://exmaralda.org/en/folker-en/"/>), but this was only because the - conversion tools from and to EXMARaLDA, FOLKER, and TEI already exist. They are - available as XSLT stylesheets in the open-source distribution of EXMARaLDA (<ptr + conversion tools from and to <ptr type="software" xml:id="EXMARaLDA" target="#EXMARaLDA" + /><rs type="soft.name" ref="#EXMARaLDA">EXMARaLDA</rs>, FOLKER, and TEI already exist. + They are available as <ptr type="software" xml:id="XSLT" target="#XSLT"/><rs + type="soft.name" ref="#XSLT">XSLT</rs> stylesheets in the open-source distribution of + <ptr type="software" xml:id="EXMARaLDA" target="#EXMARaLDA"/><rs type="soft.name" + ref="#EXMARaLDA">EXMARaLDA</rs> (<ptr target="https://github.com/Exmaralda-Org/exmaralda"/>). The other common point is the use of the TEI format, and especially the more recent ISO version of TEI for spoken language (ISO/TEI; see <ref type="bibl" target="#iso2016">ISO 2016</ref>). The TEI - format produced by the EXMARaLDA and FOLKER software fit within the process chain of - TEICORPO. This demonstrates the usefulness of a well-known and efficient format such as - TEI.</p> + format produced by the <ptr type="software" xml:id="EXMARaLDA" target="#EXMARaLDA"/><rs + type="soft.name" ref="#EXMARaLDA">EXMARaLDA</rs> and FOLKER software fit within the + process chain of TEICORPO. This demonstrates the usefulness of a well-known and + efficient format such as TEI.</p> <p>There are, however, differences between the two projects that make them nonredundant but complementary, each project having specificities that can be useful or damaging depending on the user’s needs. 
One minor difference is that the TEICORPO project is not @@ -251,7 +263,8 @@ to store the research data for long-term conservation and dissemination in a standard XML format instead of in proprietary formats such as those used by CLAN (<ref type="bibl" target="#macwhinney200">MacWhinney 2000</ref>; see <ptr - target="http://dali.talkbank.org/clan/"/>), ELAN, Praat (<ref type="bibl" + target="http://dali.talkbank.org/clan/"/>), <ptr type="software" xml:id="ELAN" + target="#ELAN"/><rs type="soft.name" ref="#ELAN">ELAN</rs>, Praat (<ref type="bibl" target="#boersma2001">Boersma and van Heuven 2001</ref>; see <ptr target="http://www.fon.hum.uva.nl/praat/"/>), and Transcriber (<ref type="bibl" target="#barras2000">Barras et al. 2000</ref>; see <ptr @@ -274,21 +287,25 @@ nor on the standard that should be used for transcribing in orthographic format. Instead, the TEICORPO approach focused on how to integrate multiple pieces of information into the TEI semantics (the macrostructure), as this is possible with tools - such as ELAN or PRAAT. The goal was to be able to convert a file produced by these tools - so that it can be saved in TEI format for long-term conservation.</p> - <p>Data in PRAAT and ELAN formats can contain information that is different from what is - usually present in an ISO/TEI description, but that nonetheless remains within the - structures authorized in the ISO/TEI. For example, the information is stored as - described below in <gi>spanGrp</gi>, an element available in the ISO/TEI description. - This means that whenever information is organized according to the + such as <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" + ref="#ELAN">ELAN</rs> or PRAAT. The goal was to be able to convert a file produced by + these tools so that it can be saved in TEI format for long-term conservation.</p> + <p>Data in PRAAT and <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> formats can contain information that is + different from what is usually present in an ISO/TEI description, but that nonetheless + remains within the structures authorized in the ISO/TEI. For example, the information is + stored as described below in <gi>spanGrp</gi>, an element available in the ISO/TEI + description. This means that whenever information is organized according to the <soCalled>classical</soCalled> approach to spoken language (by <soCalled>classical</soCalled>, we mean approaches based on an orthographic transcription represented as a list, as in the script of a play), it will be available for further processing by using the export features of TEICORPO (see <ptr target="#exporting" type="crossref"/> and further below for export functionalities) - but other types of information are also available. Compared to PRAAT and ELAN, the - integration of tools such as CLAN or Transcriber was much more straightforward, as the - organization of the files is less varied and more <soCalled>classical</soCalled>.</p> + but other types of information are also available. 
Compared to PRAAT and <ptr + type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs>, the integration of tools such as CLAN or Transcriber was much more + straightforward, as the organization of the files is less varied and more + <soCalled>classical</soCalled>.</p> <div xml:id="microstructures"> <head>Choice of the Microstructure Representation</head> <p>Processing of the microstructure, with the exception of information already available @@ -335,22 +352,25 @@ </list> <list> <item>Praat is more specialized for phonetic or phonological annotations;</item> - <item>ELAN is recommended for annotating video and particularly multimodality (for - example, components such as gazes, gestures, and movements), and is often used for - rare languages to describe the organization of the segments.</item> + <item><ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" + ref="#ELAN">ELAN</rs> is recommended for annotating video and particularly + multimodality (for example, components such as gazes, gestures, and movements), and is + often used for rare languages to describe the organization of the segments.</item> </list> <p>It should be pointed out here that whereas Transcriber and CLAN files nearly always contain <soCalled>classical</soCalled> orthographic transcriptions, this is not the case - for Praat and ELAN files. As our goal is to provide a generic solution for long-term - conservation and use for any type of project, conversion of all types of files produced - by the four tools cited above will be possible. It is up to the user to determine which - part of a corpus can be used with a classical approach, which parts should not, and how - they should be processed.</p> + for Praat and <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" + ref="#ELAN">ELAN</rs> files. As our goal is to provide a generic solution for + long-term conservation and use for any type of project, conversion of all types of files + produced by the four tools cited above will be possible. It is up to the user to + determine which part of a corpus can be used with a classical approach, which parts + should not, and how they should be processed.</p> <p>The list of tools reflects the uses and practices in the CORLI network, and is very similar to the list suggested by Schmidt (<ref target="#schmidt2011" type="bibl" - >2011</ref>) with the exception of EXMARaLDA and FOLKER. These two tools already have - built-in conversion features, so adding them to the TEICORPO project would be easy at a - later date.</p> + >2011</ref>) with the exception of <ptr type="software" xml:id="EXMARaLDA" + target="#EXMARaLDA"/><rs type="soft.name" ref="#EXMARaLDA">EXMARaLDA</rs> and FOLKER. + These two tools already have built-in conversion features, so adding them to the + TEICORPO project would be easy at a later date.</p> <p>Alignment applications deal with two main types of data presentation and organization. The presentation of the data has direct consequences for how the data are exploited, and therefore on the design of the tools that are used.</p> @@ -371,14 +391,16 @@ chronologically but is sorted by the names of the tiers (or any other order), with all the production within the same tier sorted by timeline.</item> </list> - <p>No tool offers both types of presentation. ELAN offers some alternatives to editing or - displaying data with the partition format, but none of the existing tools offer - full-fledged list format editing. 
It is possible to represent the two structures within - a similar model, as demonstrated by Bird and Liberman (<ref type="bibl" + <p>No tool offers both types of presentation. <ptr type="software" xml:id="ELAN" + target="#ELAN"/><rs type="soft.name" ref="#ELAN">ELAN</rs> offers some alternatives to + editing or displaying data with the partition format, but none of the existing tools + offer full-fledged list format editing. It is possible to represent the two structures + within a similar model, as demonstrated by Bird and Liberman (<ref type="bibl" target="#bird2001">2001</ref>). However, this is not the case for the four tools listed above: each of them represents the data in a unique underlying data structure. - Transcriber and CLAN are organized in list format; Praat and ELAN have a partition - format.</p> + Transcriber and CLAN are organized in list format; Praat and <ptr type="software" + xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN">ELAN</rs> have a + partition format.</p> <p>Each presentation format has its own pros and cons. Because of the possibilities offered by the presentation formats, and because the same software, even within the same presentation models, rarely provides a solution for all the needs of all users, @@ -389,15 +411,18 @@ will have to use the Praat software and convert not only the transcription, but also the media. In the field of language acquisition, where the CLAN software is generally used to describe both the child productions and the adult productions, when researchers are - interested in gestures, they use the ELAN software, importing the CLAN file to add - gesture tiers, as ELAN is more suitable for the fine-grained analysis of visual data. - Another common practice consists in first doing a rapid transcription using only - orthographic annotations in Transcriber and then in a second stage annotating some more - interesting excerpts in greater detail including new information. In this case + interested in gestures, they use the <ptr type="software" xml:id="ELAN" target="#ELAN" + /><rs type="soft.name" ref="#ELAN">ELAN</rs> software, importing the CLAN file to add + gesture tiers, as <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> is more suitable for the fine-grained analysis + of visual data. Another common practice consists in first doing a rapid transcription + using only orthographic annotations in Transcriber and then in a second stage annotating + some more interesting excerpts in greater detail including new information. In this case researchers will import the first transcription file into other tools such as Praat or - ELAN and annotate them partially. It is therefore necessary to import or export files in - different formats if researchers need to use different tools for different parts of - their work.</p> + <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs> and annotate them partially. It is therefore necessary to import or export + files in different formats if researchers need to use different tools for different + parts of their work.</p> <p>Another need concerns the pooling of corpora coming from other resources or other projects. 
Conversions are necessary, and can be problematic because going from one piece of software to another often leads to a loss of information, as each tool handles corpus @@ -426,12 +451,13 @@ metadata and all the macrostructure information into the TEI format.</p> <div xml:id="basicstruct"> <head>Basic Structures</head> - <p>Converting the metadata is straightforward, as the four tools (CLAN, ELAN, Praat, and - Transcriber) do not enable a large amount of metadata to be edited. Most of the - metadata available concerns the content of the sequence; some user metadata is also - available, especially in CLAN. The insertion of metadata follows the indications of - the ISO/TEI 24624:2016 standard (<ref type="bibl" target="#iso2016">ISO - 2016</ref>).</p> + <p>Converting the metadata is straightforward, as the four tools (CLAN, <ptr + type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs>, Praat, and Transcriber) do not enable a large amount of metadata to be + edited. Most of the metadata available concerns the content of the sequence; some user + metadata is also available, especially in CLAN. The insertion of metadata follows the + indications of the ISO/TEI 24624:2016 standard (<ref type="bibl" target="#iso2016">ISO + 2016</ref>).</p> <p>Moreover, some tools, such as Transcriber, include information about silences, pauses, and events in their XML format. This information is also processed within TEICORPO, once again following the recommendations of the ISO/TEI standard.</p> @@ -503,15 +529,19 @@ <p>Although the presentation described above can represent the data of many corpora and tools, a single-level annotation structure within the <gi>spanGrp</gi> elements is insufficient to represent the complex organization that can be constructed with the - ELAN and Praat tools. ELAN is a tool used by many researchers to describe data of - greater complexity than the data presented in the ISO/TEI guidelines. As the goal of - the TEICORPO project was to convert all types of structure used in the spoken language - community, including ELAN and Praat, it was necessary to extend the description method - presented in <ptr target="#basicstruct" type="crossref"/>.</p> - <p>In ELAN and Praat, the multitiered annotations can be organized in a structured - manner. These tools take advantage of the partition presentation of the data, so that - the relationship between a parent tier and a child tier can be precisely organized. - There are two main types of organization: symbolic and temporal.</p> + <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs> and Praat tools. <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> is a tool used by many researchers to + describe data of greater complexity than the data presented in the ISO/TEI guidelines. + As the goal of the TEICORPO project was to convert all types of structure used in the + spoken language community, including <ptr type="software" xml:id="ELAN" target="#ELAN" + /><rs type="soft.name" ref="#ELAN">ELAN</rs> and Praat, it was necessary to extend + the description method presented in <ptr target="#basicstruct" type="crossref"/>.</p> + <p>In <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" + ref="#ELAN">ELAN</rs> and Praat, the multitiered annotations can be organized in a + structured manner. 
These tools take advantage of the partition presentation of the + data, so that the relationship between a parent tier and a child tier can be precisely + organized. There are two main types of organization: symbolic and temporal.</p> <p>In symbolic division, the elements of a child tier, C<hi rend="sub">1</hi> to C<hi rend="sub">n</hi>, can be related to an element of a parent tier P. For example, a word is divided into morphemes. In <ptr target="#fig1" type="crossref"/>, the main @@ -525,7 +555,8 @@ links.</p> <figure xml:id="fig1"> <graphic url="media/image1.PNG" width="620px" height="980px"/> - <head type="legend">ELAN annotation with symbolic structures</head> + <head type="legend"><ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> annotation with symbolic structures</head> </figure> <p>In temporal division, the association between the main tier and the dependent tiers @@ -564,22 +595,27 @@ <gi>span</gi> mechanism. Moreover, the <gi>span</gi> and <gi>spanGrp</gi> tags have attributes that can point to other elements or to timelines. Using this coding schema, it is therefore possible to store any type of structure, symbolic and/or temporal, - that can be generated with ELAN or PRAAT, as described above.</p> + that can be generated with <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> or PRAAT, as described above.</p> <p>To do this, each element which is in a symbolic or temporal relation is represented by a <gi>spanGrp</gi> element of the TEI. The <gi>spanGrp</gi> element contains as - many <gi>span</gi> elements as necessary to store all the elements present in the ELAN - or PRAAT representation. The parent element of a <gi>spanGrp</gi> is the main - <gi>annotationBlock</gi> element when the division in ELAN or PRAAT is the first - division of a main element. The parent element is another <gi>span</gi> element when - the division in ELAN or PRAAT is a subdivision of another element which is not a main - element. This XML structure is complemented by explicit information as allowed in TEI. - The <gi>span</gi> elements are linked to the element they depend on, either with a - symbolic link using the <att>target</att> attribute of the <gi>span</gi> element, or - with temporal links using the <att>from</att> and <att>to</att> attributes of the - <gi>span</gi> element.</p> + many <gi>span</gi> elements as necessary to store all the elements present in the <ptr + type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs> or PRAAT representation. The parent element of a <gi>spanGrp</gi> is the + main <gi>annotationBlock</gi> element when the division in <ptr type="software" + xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN">ELAN</rs> or PRAAT is + the first division of a main element. The parent element is another <gi>span</gi> + element when the division in <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> or PRAAT is a subdivision of another element + which is not a main element. This XML structure is complemented by explicit + information as allowed in TEI. The <gi>span</gi> elements are linked to the element + they depend on, either with a symbolic link using the <att>target</att> attribute of + the <gi>span</gi> element, or with temporal links using the <att>from</att> and + <att>to</att> attributes of the <gi>span</gi> element.</p> <p>Two examples of how this is displayed in a TEI file are given below. 
The first example (see <ptr target="#fig3" type="crossref"/> and <ptr target="#example_code_3" - type="crossref"/>) corresponds to the ELAN example above (see <ptr + type="crossref"/>) corresponds to the <ptr type="software" xml:id="ELAN" + target="#ELAN"/><rs type="soft.name" ref="#ELAN">ELAN</rs> example above (see <ptr target="#advancedstructures" type="crossref"/>, <ptr target="#fig1" type="crossref" />). The TEI encoding represents the words of the sentence from left to right (from <term>gahwat</term> to <term>endi</term> in our example). The detail of the @@ -591,7 +627,8 @@ and <term>-DET</term>.</p> <figure xml:id="fig3"> <graphic url="media/image1.PNG" width="620px" height="980px"/> - <head type="legend">ELAN example of a symbolic division</head> + <head type="legend"><ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> example of a symbolic division</head> </figure> <!-- EXACTELY THE SAME AS FIG1: why repeat?--> <figure xml:id="example_code_3"> @@ -649,7 +686,8 @@ s37).</p> <figure xml:id="fig4"> <graphic url="media/image2.PNG" width="620px" height="980px"/> - <head type="legend">ELAN example of a temporal division</head> + <head type="legend"><ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> example of a temporal division</head> </figure> <figure xml:id="example_code_4"> <egXML xmlns="http://www.tei-c.org/ns/Examples"> @@ -669,9 +707,7 @@ <span from="#T7" to="#T6" xml:id="s13">a</span> </spanGrp> </span> - <!-- . . . . . . . . --> </spanGrp> - <!-- . . . . . . . . --> </span> </egXML> <head type="legend">TEI encoding corresponding to <ptr target="#fig4" type="crossref" @@ -679,20 +715,24 @@ </figure> <p>The <gi>spanGrp</gi> and <gi>span</gi> offer a generic representation of data coming from relatively unconstrained representations produced by partition software. The - names of the tiers used in the ELAN and Praat tools are given in the content of the - <att>type</att> attribute. These names are not used to provide structural + names of the tiers used in the <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs> and Praat tools are given in the content of + the <att>type</att> attribute. These names are not used to provide structural information, the structure being represented only by the <gi>spanGrp</gi> and <gi>span</gi> hierarchy. However, the organization into <gi>spanGrp</gi> and <gi>span</gi> is not always sufficient to represent all the details of the tier - organization of each software feature. This is the case for some of the ELAN - structures, which can specify the nature of <gi>span</gi> elements further than in the - TEI feature. For example, the <term>timediv</term> ELAN property specifies that only - contiguous temporal division is allowed, whereas the <term>incl</term> property allows - non-contiguous elements. It was therefore necessary to include the type of - organization in the header of the TEI file, using the <term>note</term> structure. The - <gi>note</gi> element here points to a case where dedicated tags do not currently - exist in TEI, so we used the <gi>note</gi> element as the best way not to lose the - information. Dedicated tags could be added to future versions of TEI.</p> + organization of each software feature. 
This is the case for some of the <ptr + type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs> structures, which can specify the nature of <gi>span</gi> elements + further than in the TEI feature. For example, the <term>timediv</term> + <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs> property specifies that only contiguous temporal division is allowed, + whereas the <term>incl</term> property allows non-contiguous elements. It was + therefore necessary to include the type of organization in the header of the TEI file, + using the <term>note</term> structure. The <gi>note</gi> element here points to a case + where dedicated tags do not currently exist in TEI, so we used the <gi>note</gi> + element as the best way not to lose the information. Dedicated tags could be added to + future versions of TEI.</p> </div> </div> <div xml:id="exporting"> @@ -701,8 +741,9 @@ remains as lossless as possible. This allows for all types of corpora to be stored for long-term preservation purposes. It also allows the corpora to be used with other editing tools, some of which are suited to specific processing: for example, Praat for - phonetics/phonology; Transcriber/CLAN for raw transcription; and ELAN for gesture and - visual coding.</p> + phonetics/phonology; Transcriber/CLAN for raw transcription; and <ptr type="software" + xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN">ELAN</rs> for gesture + and visual coding.</p> <p>However, a large proportion of scientific research and applications done using corpora requires further processing of the data. For example, although querying or using raw language forms is possible, many research investigations and tools use words, parts of @@ -752,9 +793,10 @@ <head>Basic Import and Export Functions</head> <p>The command-line interface (see <ptr target="http://ct3.ortolang.fr/teicorpo/"/>) can perform conversions between TEI and the formats used by the following programs: CLAN, - ELAN, Praat, and Transcriber. The conversions can be performed on single files or on - whole directories or on a file tree. The command-line interface is suited to automatic - processing in offline environments. The online interface (see <ptr + <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs>, Praat, and Transcriber. The conversions can be performed on single files + or on whole directories or on a file tree. The command-line interface is suited to + automatic processing in offline environments. The online interface (see <ptr target="http://ct3.ortolang.fr/teiconvert/"/>) can convert one or several files selected by the user, but not whole directories. Results appear in the user’s download folder.</p> @@ -819,15 +861,18 @@ <div xml:id="exportspec"> <head>Export to Specialized Software</head> <p>Another kind of export concerns textometric software. 
TEICONVERT makes spoken language - data available for TXM (<ref type="bibl" target="#heiden2010">Heiden 2010</ref>; see - <ptr target="http://textometrie.ens-lyon.fr"/>), Le Trameur (<ref type="bibl" + data available for <ptr type="software" xml:id="TXM" target="#TXM"/><rs type="soft.name" + ref="#TXM">TXM</rs> (<ref type="bibl" target="#heiden2010">Heiden 2010</ref>; see <ptr + target="http://textometrie.ens-lyon.fr"/>), Le Trameur (<ref type="bibl" target="#fleury2014">Fleury and Zimina 2014</ref>; see <ptr target="http://www.tal.univ-paris3.fr/trameur/"/>), and Iramuteq (see <ptr target="http://iramuteq.org/"/> and <ref type="bibl" target="#souza2018">de Souza et al. 2018</ref>), providing a dedicated TEI export for these tools. For example, for - the TXM software, the export includes a text element made of utterance elements + the <ptr type="software" xml:id="TXM" target="#TXM"/><rs type="soft.name" ref="#TXM" + >TXM</rs> software, the export includes a text element made of utterance elements including age and speaker attributes. <ptr target="#example_code_5" type="crossref"/> - presents an example for the TXM software.</p> + presents an example for the <ptr type="software" xml:id="TXM" target="#TXM"/><rs + type="soft.name" ref="#TXM">TXM</rs> software.</p> <figure xml:id="example_code_5"> <egXML xmlns="http://www.tei-c.org/ns/Examples"> <TEI file="lily-4-00-02.tei_corpo.xml"> @@ -857,7 +902,8 @@ </text> </TEI> </egXML> - <head type="legend">Example of XML for the TXM software</head> + <head type="legend">Example of XML for the <ptr type="software" xml:id="TXM" + target="#TXM"/><rs type="soft.name" ref="#TXM">TXM</rs> software</head> </figure> <p>An export has been developed for Lexico and Le Trameur textometric software with a @@ -892,15 +938,18 @@ linguistic research. A present difficulty with these grammatical analyzers is that most often they run only on raw orthographic material, excluding other information. Moreover, their results are not always in a format that can be used with traditional spoken - language software such as CLAN, ELAN, Praat, Transcriber, nor of course in TEI + language software such as CLAN, <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs>, Praat, Transcriber, nor of course in TEI format.</p> <p>TEICORPO provides a way to solve this problem by running analyzers and putting the results from the analysis back into TEI format. Once the TEI format has been enriched with grammatical information, it is possible to use the results and convert them back to - ELAN or Praat and use the grammatical information in these spoken language software - packages. It is also possible to export to TXM and to use the grammatical information in - the textometric software. Two grammatical analyzers have been implemented in TEICORPO: - TreeTagger and CoreNLP.</p> + <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs> or Praat and use the grammatical information in these spoken language + software packages. It is also possible to export to <ptr type="software" xml:id="TXM" + target="#TXM"/><rs type="soft.name" ref="#TXM">TXM</rs> and to use the grammatical + information in the textometric software. Two grammatical analyzers have been implemented + in TEICORPO: TreeTagger and CoreNLP.</p> <div xml:id="treetagger"> <head>TreeTagger</head> <p>TreeTagger<note> @@ -1058,7 +1107,6 @@ </span> <!-- . . . . . . . . --> </spanGrp> - <!-- . . . . . . . . 
--> </annotationBlock> </egXML> <head type="legend">Tagging results in <gi>conll</gi> format</head> @@ -1075,8 +1123,9 @@ suite provides several tools such as a tokenizer, a POS tagger, a parser, a named entity recognizer, temporal tagging, and coreference resolution. All the tools are available for English, but only some of them are available for all languages. All - software libraries are integrated into Java JAR files, so all that is required is to - download JAR files from the CoreNLP website<note> + software libraries are integrated into <ptr type="software" xml:id="Java" + target="#Java"/><rs type="soft.name" ref="#Java">Java</rs> JAR files, so all that is + required is to download JAR files from the CoreNLP website<note> <p>Accessed May 5, 2021, <ptr target="https://stanfordnlp.github.io/CoreNLP/index.html#download"/>.</p> </note> to use them with TEICORPO. Using the analyzer is similar to using TreeTagger. @@ -1087,9 +1136,10 @@ <p>The <term>directory_for_SNLP</term> is the name of the location on a computer where all the CoreNLP JAR files can be found. Note that using the CoreNLP software makes heavy demands on the computer’s memory resources and it is necessary to instruct the - Java software to use a large amount of memory (for example to insert parameter -mx5g - before parameter <code>-cp</code> to indicate that 5 GB of memory will be used for a - full English analysis).</p> + <ptr type="software" xml:id="Java" target="#Java"/><rs type="soft.name" ref="#Java" + >Java</rs> software to use a large amount of memory (for example to insert parameter + -mx5g before parameter <code>-cp</code> to indicate that 5 GB of memory will be used + for a full English analysis).</p> <p>The <code>-model</code> parameter can take three values: english (use the full English grammar), french (use the full French grammar), or the name of a CoreNLP parameter file which specifies any type of analysis that is available in CoreNLP.</p> @@ -1102,21 +1152,25 @@ <div xml:id="exportinggrammatical"> <head>Exporting the Grammatical Analysis</head> <p>The results from the grammatical analysis can be used in transcription files such as - those used by Praat and ELAN. A partition-like visual presentation of data is very handy - to represent a part of speech or a CONLL result. The orthographic line will appear at - the top with divisions into words, divisions into parts of speech, and other syntactic - information below. As the result of the analysis can contain a large number of tiers - (each speaker will have as many tiers as there are elements in the grammatical analysis: - for example, word, POS, and lemma for TreeTagger; ten tiers for CoreNLP full analysis), - it is helpful to limit the number of visible tiers, either using the <code>-a</code> - option of TEICORPO, or limiting the display with the annotation tool.</p> - <p>An example is presented below in the ELAN tool (see <ptr target="#fig5" type="crossref" - />). The original utterance was <term>si c’est comme ça je m’en vais</term> (if that’s - how it is, I’m leaving). It is displayed in the first line, highlighted in pink. The - analysis into words (second line, consisting of numbers), lemmas (third line), parts of - speech (fourth line), and orthographic words (final line) is displayed below. 
So, for - example, word 3 has the lemma <term>être</term> (to be) and the POS VER:pres (verb in - the present tense), and it is the word <term>est</term> (is).</p> + those used by Praat and <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs>. A partition-like visual presentation of data + is very handy to represent a part of speech or a CONLL result. The orthographic line + will appear at the top with divisions into words, divisions into parts of speech, and + other syntactic information below. As the result of the analysis can contain a large + number of tiers (each speaker will have as many tiers as there are elements in the + grammatical analysis: for example, word, POS, and lemma for TreeTagger; ten tiers for + CoreNLP full analysis), it is helpful to limit the number of visible tiers, either using + the <code>-a</code> option of TEICORPO, or limiting the display with the annotation + tool.</p> + <p>An example is presented below in the <ptr type="software" xml:id="ELAN" target="#ELAN" + /><rs type="soft.name" ref="#ELAN">ELAN</rs> tool (see <ptr target="#fig5" + type="crossref"/>). The original utterance was <term>si c’est comme ça je m’en + vais</term> (if that’s how it is, I’m leaving). It is displayed in the first line, + highlighted in pink. The analysis into words (second line, consisting of numbers), + lemmas (third line), parts of speech (fourth line), and orthographic words (final line) + is displayed below. So, for example, word 3 has the lemma <term>être</term> (to be) and + the POS VER:pres (verb in the present tense), and it is the word <term>est</term> + (is).</p> <figure xml:id="fig5"> <graphic url="media/image3.png" width="620px" height="980px"/> <head type="legend">Example of TreeTagger analysis representation in a partition @@ -1124,12 +1178,14 @@ </figure> <p>Export can be done from TEI into a format used by textometric software (see <ptr - target="#example_code_11" type="crossref"/>). This is the case for TXM,<note> + target="#example_code_11" type="crossref"/>). This is the case for <ptr + type="software" xml:id="TXM" target="#TXM"/><rs type="soft.name" ref="#TXM">TXM</rs>,<note> <p>See the Textométrie website, last updated June 29, 2020, <ptr target="http://textometrie.ens-lyon.fr/?lang=en"/>.</p> </note> a textometric software application. In this case, instead of using a partition representation, the information from the grammatical analysis is inserted at the word - level in an XML structure. For example, in the case below, the TXM export includes + level in an XML structure. For example, in the case below, the <ptr type="software" + xml:id="TXM" target="#TXM"/><rs type="soft.name" ref="#TXM">TXM</rs> export includes Treetagger annotations in POS, adding <att>lemma</att> and <att>pos</att> attributes to the word element <gi>w</gi>.</p> <figure xml:id="example_code_11"> @@ -1157,13 +1213,14 @@ <w age="28.0" loc="MOT">extravaganza</w> <w age="28.0" loc="MOT">.</w> </u> - <!-- . . . . . . . --> + </text> - <!-- . . . . . . . --> + </TEI> </egXML> <head type="legend">Example of TreeTagger analysis representation that can be imported - into TXM</head> + into <ptr type="software" xml:id="TXM" target="#TXM"/><rs type="soft.name" ref="#TXM" + >TXM</rs></head> </figure> </div> <div xml:id="comparison"> @@ -1181,10 +1238,14 @@ type of data they target. TEICORPO is intended to be used not as an independent tool, but as a utility tool that helps researchers to go from one type of data to another. 
For example, the syntactic analysis is intended to be used as a first step before being used - in tools such as Praat, ELAN, or TXM. Our more recent developments (see <ref type="bibl" - target="#badin2021">Badin et al. 2021</ref>) made it possible to insert metadata - stored in CSV files (including participant metadata) into the TEI files. This makes it - possible to achieve more powerful corpus analysis using a tool such as TXM.</p> + in tools such as Praat, <ptr type="software" xml:id="ELAN" target="#ELAN"/><rs + type="soft.name" ref="#ELAN">ELAN</rs>, or <ptr type="software" xml:id="TXM" + target="#TXM"/><rs type="soft.name" ref="#TXM">TXM</rs>. Our more recent developments + (see <ref type="bibl" target="#badin2021">Badin et al. 2021</ref>) made it possible to + insert metadata stored in CSV files (including participant metadata) into the TEI files. + This makes it possible to achieve more powerful corpus analysis using a tool such as + <ptr type="software" xml:id="TXM" target="#TXM"/><rs type="soft.name" ref="#TXM" + >TXM</rs>.</p> <p>Our approach is somewhat similar to what is suggested in the conclusion of Schmidt, Hedeland, and Jettka (<ref type="bibl" target="#schmidt2017">2017</ref>), who describe a mechanism that makes it possible to use the power of Weblicht to process their files @@ -1215,15 +1276,16 @@ free and open source so it can be further used and developed in other projects.</p> <p>TEICORPO is intended to be part of a large set of tools using TEI for linguistic corpus research. It can be used in parallel with or as a complement to other tools such as - Weblicht or the EXMARaLDA tools (see <ref type="bibl" target="#schmidt2017">Schmidt, - Hedeland, and Jettka 2017</ref>). A specificity of TEICORPO is that it is more suitable - for processing extended forms of TEI data (especially forms which are not inside the main - <gi>u</gi> element in the TEI code). TEICORPO is also linked to TEIMETA, a flexible tool - for describing spoken language corpora in a web interface generated from an ODD file (<ref - type="bibl" target="#etienne">Etienne, Liégois, and Parisse, accepted</ref>). As TEI - enables metadata and data to be stored in the same file, sharing this format will promote - metadata sharing and will keep metadata linked to their data during the life cycle of the - data.</p> + Weblicht or the <ptr type="software" xml:id="EXMARaLDA" target="#EXMARaLDA"/><rs + type="soft.name" ref="#EXMARaLDA">EXMARaLDA</rs> tools (see <ref type="bibl" + target="#schmidt2017">Schmidt, Hedeland, and Jettka 2017</ref>). A specificity of + TEICORPO is that it is more suitable for processing extended forms of TEI data (especially + forms which are not inside the main <gi>u</gi> element in the TEI code). TEICORPO is also + linked to TEIMETA, a flexible tool for describing spoken language corpora in a web + interface generated from an ODD file (<ref type="bibl" target="#etienne">Etienne, Liégois, + and Parisse, accepted</ref>). As TEI enables metadata and data to be stored in the same + file, sharing this format will promote metadata sharing and will keep metadata linked to + their data during the life cycle of the data.</p> <p>Potential further developments could provide wider coverage of different formats such as CMDI or linked data for editing or data exploration purposes; allow TEICORPO to work with other external tools such as grammatical analyzers; or enable the visualization of @@ -1282,13 +1344,14 @@ >57–61</biblScope>. 
N.p.: Dublin City University and Association for Computational Linguistics. <ptr target="https://www.aclweb.org/anthology/C14-2013"/>.</bibl> <bibl xml:id="heiden2010"><author>Heiden, Serge</author>. <date>2010</date>. <title - level="a">The TXM Platform: Building Open-Source Textual Analysis Software Compatible - with the TEI Encoding Scheme. In Proceedings of the 24th Pacific Asia - Conference on Language, Information and Computation (PACLIC24), edited by Ryo - Otoguro, Kiyoshi Ishikawa, Hiroshi Umemoto, - Kei Yoshimoto, and Yasunari Harada, 389–98. Sendai, Japan: Institute for Digital Enhancement of - Cognitive Development, Waseda University. The TXM Platform: Building Open-Source Textual Analysis Software + Compatible with the TEI Encoding Scheme. In Proceedings of the 24th Pacific + Asia Conference on Language, Information and Computation (PACLIC24), edited by + Ryo Otoguro, Kiyoshi Ishikawa, Hiroshi + Umemoto, Kei Yoshimoto, and Yasunari + Harada, 389–98. Sendai, Japan: Institute for + Digital Enhancement of Cognitive Development, Waseda University. . Hinrichs, Erhard, Marie Hinrichs, and Thomas Zastrow. 2010. .</bibl> <bibl xml:id="schmidt2004"><author>Schmidt, Thomas</author> <date>2004</date>. <title level="a">Transcribing and Annotating Spoken Language with - EXMARaLDA. In Proceedings of the LREC-Workshop on XML-based Richly Annotated - Corpora. Paris: ELRA. Author’s version available at EXMARaLDA. In Proceedings of the LREC-Workshop on + XML-based Richly Annotated Corpora. Paris: ELRA. Author’s version available at . Schmidt, Thomas. 2011. A TEI-based Approach to Standardising Spoken Language Transcription. @@ -1387,11 +1451,13 @@ type="DOI">10.1590/S1980-220X2017015003353. Wittenburg, Peter, Hennie Brugman, Albert Russel, Alex Klassmann, and - Han Sloetjes. 2006. ELAN: A - Professional Framework for Multimodality Research. In Proceedings of LREC - 2006, Fifth International Conference on Language Resources and Evaluation, 1556–59. N.p.: European Language Resources Association (ELRA). - . + Han Sloetjes. 2006. <ptr + type="software" xml:id="ELAN" target="#ELAN"/><rs type="soft.name" ref="#ELAN" + >ELAN</rs>: A Professional Framework for Multimodality Research. In + Proceedings of LREC 2006, Fifth International Conference on Language Resources and + Evaluation, 1556–59. N.p.: European Language + Resources Association (ELRA). .

diff --git a/data/JTEI/13_2020-22/jtei-cc-ra-winslow-186-source.xml b/data/JTEI/13_2020-22/jtei-cc-ra-winslow-186-source.xml index 044cd87..d3a469d 100644 --- a/data/JTEI/13_2020-22/jtei-cc-ra-winslow-186-source.xml +++ b/data/JTEI/13_2020-22/jtei-cc-ra-winslow-186-source.xml @@ -1,6 +1,6 @@ - - + + + @@ -530,7 +530,9 @@ favoring one period or type of document over another, a generic element seems both desirable and advisable. The proposed element, as implemented in the TEI_CEI ODD,See the CEI2TEI GitHub repository, accessed June 25, 2021, CEI2TEI GitHub repository, accessed June 25, + 2021, . is simple (attList items suppressed for brevity: they follow the top-level distinctions in the sample controlled vocabulary visualized in In order to control the wide range of different variants of this practice, here is a proposed vocabulary, provided in SKOS (Simple Knowledge Organization System) format (as part of the project’s GitHub repository,Accessed July 13, 2021, + GitHub repository,Accessed July 13, 2021, . at - + + + @@ -354,9 +354,10 @@ since then the master text format ( shows an excerpt from such a text) and a number of published versions derived from it have all been encoded in TEI, with the most recent version in TEI P5 and Unicode.The most recent - version of the whole textual database is available in the CBETA XML P5 GitHub - repository, accessed April 20, 2020, . + version of the whole textual database is available in the CBETA XML P5 GitHub repository, accessed April 20, 2020, .

@@ -488,9 +489,10 @@

By the year 2010, the practice of using separate text files for different witnesses of a text had become well established in our workflow. For tracking changes to these files, we had used version control tools from the start. At some point, we realized that the modern - distributed variety of these tools, Git and GitHub, not only had the potential to solve - the problem of keeping track of changes made to a file, but could also be used to hold all - witnesses of a text in one repository, each of them represented as a + distributed variety of these tools, Git and GitHub, not only had the + potential to solve the problem of keeping track of changes made to a file, but could also + be used to hold all witnesses of a text in one repository, each of them represented as a branch. (In the terminology of version control software, a branch is one current state in the editing history of the file, which has been given a name to make it easy to address it and to track changes along a specific trajectory.)

@@ -500,11 +502,12 @@ texts. As stated already, one of the aims of my work from the outset was to make a digital version of a text at least as versatile as a printed scholarly edition. For me, this also included taking ownership of one specific copy of such an edition and tracking the work by - adding marginal notes, comments, and references directly into the book. With GitHub as a - repository for texts and Git as a means to control the various maintenance tasks, - researchers interested in a text could clone the text, add their own marginal notes, then - make their version of the text available to us or any other researcher to integrate, if we - so chose.

+ adding marginal notes, comments, and references directly into the book. With GitHub as a repository for texts and Git as a means to control the various + maintenance tasks, researchers interested in a text could clone the text, add their own + marginal notes, then make their version of the text available to us or any other + researcher to integrate, if we so chose.

A Git workflow can use any kind of digital material, but it works better with textual material as opposed to images or videos, and even better for texts that use lines as a structural element. This again is where the plain text we used in the Daozang @@ -513,7 +516,7 @@

When I first presented this idea at the TEI conference in Würzburg in October 2011, I got this comment via a tweet from one of the most respected members of the TEI community (: @rahtz: interesting that - @cwittern thinks <> is hard, Git is easy. + @cwittern thinks <> is hard, Git is easy. #tei2011).

@@ -540,8 +543,9 @@ />. (and the corresponding closing tags to convey this information).

From the beginning, the DZJY was in my view itself a pilot project for a much larger project, on which preparatory work started in earnest in 2012: the Kanseki Repository (GitHub username - @kanripo).Accessed June 24, 2020, Kanseki Repository (GitHub + username @kanripo).Accessed June 24, 2020, and . Kanseki here is the Japanese term for premodern Chinese texts, and @@ -560,16 +564,19 @@ sources available on the Internet, we have compiled an initial catalog of about 10,000 titles to be included in a first phase of the project; this catalog is also being supplemented by users who deposit whatever texts they are interested in into the - repository. Since the initial publication on GitHub in September 2015, and the launch of a - dedicated website in March 2016, usage has been increasing slowly but steadily.

+ repository. Since the initial publication on GitHub in September 2015, and + the launch of a dedicated website in March 2016, usage has been increasing slowly but + steadily.

Kanripo Project Details -

All the texts are freely available on GitHub in their source form. This repository of - texts can be accessed through the kanripo.org website, but also through a module of the Emacs editor called - Mandoku. This allows users to query, access, clone, edit, and push the texts directly - from their own computer. Reading, commenting, and editing do not require internet - access.

+

All the texts are freely available on GitHub in their source form. + This repository of texts can be accessed through the kanripo.org website, but also through a module + of the Emacs editor called Mandoku. This allows users to query, access, clone, edit, and + push the texts directly from their own computer. Reading, commenting, and editing do not + require internet access.

While not yet a full realization of the original vision, this project is currently the best compromise I know of between allowing the researcher (user) to take complete ownership of a text—not just in the technical sense, but also in a practical sense of @@ -577,8 +584,9 @@ the context of their aims—and authoritative vetting and editorial quality assurance.

demonstrate the concept and functions of the Kanseki Repository. On the website, users can search for - texts or browse the catalog. Once a text is found, the webserver reads it from the - GitHub repository and serves it to the user. For most texts, there are different + texts or browse the catalog. Once a text is found, the webserver reads it from the GitHub repository and serves it to the user. For most texts, there are different editions to choose from; usually both documentary and interpretative versions exist. For many texts, there is also a digital facsimile, which can be called up alongside the text; if there is more than one edition documented with a digital facsimile, the others @@ -589,11 +597,13 @@ A text in the Kanseki Repository

In the screenshot in , there is a link at the - top of the page labeled GitHub, from which the source of the text can be directly - accessed. A user who wishes to make changes to the text, by correcting, annotating, or - even translating it, can transfer a copy of this text from the public - @kanripo account, either by cloning it to their own account on GitHub, or - by downloading it locally.

+ top of the page labeled GitHub, from which the source of the text + can be directly accessed. A user who wishes to make changes to the text, by correcting, + annotating, or even translating it, can transfer a copy of this text from the public + @kanripo account, either by cloning it to their own account on GitHub, or by downloading it locally.

The user can also log in to the Kanripo website with their GitHub credentials. When this is done for the first time, the user has to grant the Kanseki Repository access to their repositories. In addition, a new repository KR-Workspace is @@ -645,8 +655,9 @@

- The text with translation, now pulled from the user’s GitHub - account + The text with translation, now pulled from the user’s GitHub account
diff --git a/data/JTEI/14_2021-23/jtei-barabuccietal-196-source.xml b/data/JTEI/14_2021-23/jtei-barabuccietal-196-source.xml index 7d03a1f..11f9314 100644 --- a/data/JTEI/14_2021-23/jtei-barabuccietal-196-source.xml +++ b/data/JTEI/14_2021-23/jtei-barabuccietal-196-source.xml @@ -1,6 +1,6 @@ - - + + + @@ -112,15 +112,16 @@ the paradigms and the technologies of the TEI.

This paper describes how we dealt with the encoding and transformation of the punctuation in the Early New High German edition of Marco Polo’s travel account. - Technically, we implemented a set of general rules (as XSLT templates) plus various - exceptions (as descriptive instructions in XML attributes), and applied them in an - automated fashion (using XProc pipelines). In addition to this, we discuss the - philological foundation of this method and, contextually, we address the topic of the - transformation of a single original source into different transcriptions: from a - highly diplomatic edition to an interpretative one, going through a spectrum of - intermediate levels of normalization. We also reflect on the separation between - transcription and analysis, as well as on the role of the editor when the edition is - the output of a semi-automated process.

+ Technically, we implemented a set of general rules (as XSLT + templates) plus various exceptions (as descriptive instructions in XML attributes), + and applied them in an automated fashion (using XProc pipelines). In addition to + this, we discuss the philological foundation of this method and, contextually, we + address the topic of the transformation of a single original source into different + transcriptions: from a highly diplomatic edition to an interpretative one, going + through a spectrum of intermediate levels of normalization. We also reflect on the + separation between transcription and analysis, as well as on the role of the editor + when the edition is the output of a semi-automated process.

@@ -186,8 +187,9 @@

This paper provides an overview of our approach (section 3) and shows how our approach addresses both issues. In particular we show how normalizing punctuation represents a dramatic step beyond the more classical normalization of words. The - current implementation of our approach, based on XProc and XSLT, is also presented in - section 3.

+ current implementation of our approach, based on XProc and XSLT, is also + presented in section 3.

Moving to an editorial workflow with such a level of automation requires a reevaluation of the role of the editor, from wordsmith to formalizer of rules (and exceptions). In section 4 we discuss how our approach fits the recent @@ -876,9 +878,10 @@ the edition files, despite being the main concrete output of the editorial project, are ephemeral and never modified directly. -

The implementation consists of a series of XSLT transformations, each representing - and implementing a single rule, coordinated by three different XProc pipelines, one - for each level of edition. The source code is available at The implementation consists of a series of XSLT transformations, each + representing and implementing a single rule, coordinated by three different XProc + pipelines, one for each level of edition. The source code is available at .
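To make the shape of this coordination concrete, here is a minimal sketch of what one such level pipeline could look like in XProc 1.0; the rule file names (rules/join-split-words.xsl, rules/normalize-punctuation.xsl) are illustrative placeholders, not the project's actual sources.

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Implicit ports: "source" takes the master TEI file,
       "result" emits the edition file for this level. -->
  <!-- Each rule is one small, self-contained XSLT step, applied in order. -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="rules/join-split-words.xsl"/>
    </p:input>
  </p:xslt>
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="rules/normalize-punctuation.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>

Run once per level, three such pipelines would regenerate the three (ephemeral) edition files from the master file.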

This methodology contrasts with the established editorial practice of mingling transcription, normalization, and critical amendments. Instead of just performing the @@ -930,34 +933,39 @@ particular punctuation rules, often modify the structure of the text. A technical system that does not allow for this interaction to happen is not able to deal properly with normalization in general and punctuation in particular.

-

Each rule is implemented as a small and self-contained XSLT transformation. At the - time of writing, the ENHG Marco Polo project comprises about a hundred rules, - grouped in twenty macro categories. On average, the core of each rule is - implemented in less than three lines of XSLT.

+

Each rule is implemented as a small and self-contained XSLT + transformation. At the time of writing, the ENHG Marco Polo project comprises + about a hundred rules, grouped in twenty macro categories. On average, the core of + each rule is implemented in less than three lines of XSLT.

To give the readers an impression of the simplicity of the rule implementation, we - show here the main parts of the XSLT that implement one of the example rules - described above.

+ show here the main parts of the XSLT that implement one of the example + rules described above.

Example: Rule to Join Words Split at the End of a Line

In ENHG, a punctuation sign that we nowadays call a double oblique hyphen was used to mark that a word has been split at the end of a line. In the diplomatic rendition we want to preserve this word division and the forced line break, while in other renditions we want to reconstruct the complete word.

-

The XSLT excerpt in Example 3 - shows how split words are joined when a middle double oblique hyphen is found. - The joining is performed in a lossless way: all information present in the - original witness is preserved. This is possible because this step operates on a - supertextual structure that contains information about the structure of the - text and the positioning of a token (e.g., eol=true). It - must be stressed that this last piece of information, and the supertextual - structure in general, are not part of the master TEI file and have been added - in the preceding steps. It should also be noted that this rule does not delete - any text, it just marks that two words have been joined and what the result of - this operation is. Another rule will take care of modifying the structure, and - yet another will remove the now superfluous partial word in the next line - before creating the final edition file. However, while removing and carrying - out all these changes, comments about what is being done will be added as an - aid to the readers of the final edition file.

+

The XSLT excerpt in Example 3 shows how split words are joined when a middle double + oblique hyphen is found. The joining is performed in a lossless way: all + information present in the original witness is preserved. This is possible + because this step operates on a supertextual structure that contains + information about the structure of the text and the positioning of a token + (e.g., eol=true). It must be stressed that this last + piece of information, and the supertextual structure in general, are not part + of the master TEI file and have been added in the preceding steps. It should + also be noted that this rule does not delete any text, it just marks that two + words have been joined and what the result of this operation is. Another rule + will take care of modifying the structure, and yet another will remove the now + superfluous partial word in the next line before creating the final edition + file. However, while removing and carrying out all these changes, comments + about what is being done will be added as an aid to the readers of the final + edition file.

@@ -985,8 +993,9 @@ - XSLT implementation of the rule Join words split with a - double oblique hyphen. + XSLT implementation of the rule Join + words split with a double oblique hyphen.

The rule in Example 3 is independent from other rules in the pipeline. The scholar is free to use this @@ -1019,12 +1028,13 @@ encoded using TEI elements directly in the master TEI file.

One-off normalizations are marked in the master TEI file using one of the non-standard attributes defined by the ENHG ODD. For example attaching - mp:n1-subst="foo" to any element will force the substitution of the - word foo for the content of that element, but only at the normalization - level denoted by N1 (semi-interpretative). Such exceptions are marked using - w or pc elements, together with the - already cited project-specific mp:nX-subst attributes, as shown in - Example 4.

+ mp:n1-subst="foo" to any element will force the substitution + of the word foo for the content of that element, but only at the + normalization level denoted by N1 (semi-interpretative). Such exceptions + are marked using w or pc elements, + together with the already cited project-specific mp:nX-subst + attributes, as shown in Example + 4.
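Since Example 4 itself falls outside the context shown in this diff, a minimal sketch of such an exception may help; the transcribed form is invented here, and only the attribute mechanism comes from the description above.

<!-- Hypothetical one-off exception: at the semi-interpretative level N1 the
     content of this w is replaced by the value of mp:n1-subst ("foo");
     all other levels keep the transcribed content unchanged. -->
<w mp:n1-subst="foo">fuo</w>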

@@ -1181,15 +1191,19 @@ like to experiment with creating declarative rule generators. Many rules are repetitive in their nature (for example, the normalization of single characters) and it should be possible to express them in a declarative fashion. These abstract - rules would then be translated into XSLT transformations. Another aspect we would - like to reflect on is how the transformation process directed by the pipelines + rules would then be translated into XSLT transformations. Another aspect we + would like to reflect on is how the transformation process directed by the pipelines influences the various levels of abstraction of the document being transformed, drawing parallels with stratified document models such as CMV+P (Barabucci 2019). A final thing we would like to test - is the replacement of the XProc pipelines with pure XSLT pipelines (Birnbaum 2017). Replacing XProc with XSLT pipelines - would reduce the number of technologies that other scholars have to be familiar with - in order to understand the editorial process in its entirety.

+ is the replacement of the XProc pipelines with pure XSLT pipelines + (Birnbaum 2017). Replacing XProc + with XSLT pipelines would reduce the number of technologies that other + scholars have to be familiar with in order to understand the editorial process in its + entirety.

Another future development that we envision is the deconstruction of the visualization of the edition into a series of small, explicit steps, taking place one after the other, just like their counterparts in the pipelines: one click would show @@ -1274,9 +1288,11 @@ Polo. Prima edizione integrale. Firenze: Leo S. Olschki. Birnbaum, David J. 2017. - Patterns and Antipatterns in XSLT Micropipelining. In - Proceedings of Balisage: The Markup Conference 2017. - Balisage Series on Markup Technologies + Patterns and Antipatterns in <ptr type="software" + xml:id="XSLT" target="#XSLT"/><rs type="soft.name" ref="#XSLT">XSLT</rs> + Micropipelining. In Proceedings of Balisage: The + Markup Conference 2017. Balisage Series on Markup + Technologies 19. doi:10.4242/BalisageVol19.Birnbaum01. Bosco Coletsos, Sandra. 2003. diff --git a/data/JTEI/14_2021-23/jtei-bleeker-et-al-199-source.xml b/data/JTEI/14_2021-23/jtei-bleeker-et-al-199-source.xml index 916ebe8..de52cbe 100644 --- a/data/JTEI/14_2021-23/jtei-bleeker-et-al-199-source.xml +++ b/data/JTEI/14_2021-23/jtei-bleeker-et-al-199-source.xml @@ -1,7 +1,6 @@ - - + + @@ -763,8 +762,8 @@ branches is semantically related, the divergence of the text stream can be flagged with <|; the individual branches are separated with a vertical bar | and the converging of the branches is indicated - with a |>. The TAGML notation of the example above would thus be - as in .

|>. The TAGML notation of the example above would thus + be as in .
A TAGML transcription of a grouped revision. @@ -826,7 +825,7 @@ . . . marginal comments which indicate that the dates - date's + date's mentioned in the main body of the text are incorrect. Example of sic and corr @@ -867,8 +866,8 @@ - "Dear Marion: (wrote Ada.) We are all very glad..." + "Dear Marion: (wrote Ada.) We are all very glad..." @@ -1049,11 +1048,12 @@ retrieve all quotes together. The first would not pose a problem for TEI XML, but retrieving the disjointed quotations as one (merged) utterance would only be possible with additional, vocabulary-specific coding. Processing the two q elements - as a single q requires a set of XSLT instructions that check the values of - the xml:id and the next and prev attributes in order - to know which q elements should be stitched together. In TAGML, both - scenarios would be equally straightforward. The hypergraph can be queried for the - q element(s) and their textual content as a whole, or for the + as a single q requires a set of XSLT instructions that check + the values of the xml:id and the next and prev + attributes in order to know which q elements should be stitched together. In + TAGML, both scenarios would be equally straightforward. The hypergraph can be queried + for the q element(s) and their textual content as a whole, or for the q elements that have been suspended and resumed.
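To make the point about vocabulary-specific coding concrete, here is a sketch, assuming XSLT 2.0 and the TEI namespace rather than any of the actual stylesheets discussed here, of instructions that stitch next/prev-linked q elements back together:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:key name="by-id" match="*" use="@xml:id"/>
  <!-- A q without @prev opens a quotation: emit it and pull in its continuations. -->
  <xsl:template match="tei:q[not(@prev)]">
    <xsl:copy>
      <xsl:call-template name="stitch"/>
    </xsl:copy>
  </xsl:template>
  <!-- Copy this part's content, then follow the @next pointer, if any. -->
  <xsl:template name="stitch">
    <xsl:apply-templates/>
    <xsl:for-each select="key('by-id', substring-after(@next, '#'))">
      <xsl:call-template name="stitch"/>
    </xsl:for-each>
  </xsl:template>
  <!-- Continuation parts are suppressed so their text is not emitted twice. -->
  <xsl:template match="tei:q[@prev]"/>
</xsl:stylesheet>

In practice these templates would sit alongside an identity transform for the remaining markup; the point is that the stitching logic is schema-specific and has to be hand-written for each such element.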

Processing discontinuous structures can become quite complex. Consider the following fragment ():

@@ -1091,20 +1091,22 @@ TEI transcription of
To process the text of this fragment correctly, one needs to write a rather - complicated set of XSLT instructions. At the very least, these instructions need to - match the values of the xml:id and prev in order to process the - first part of the deletion, look for the second part of the deletion, and then - concatenate their textual content. At the same time, one has to prevent the second - part from being processed twice (first as the second part of the deletion, and the - second time together with the regular del elements). After some - experimenting and consulting several XSLT specialists, we have come to no less than - three different sets of instructions.The authors are grateful to Peter Boot, - Vincent Neyt, and Frederike Neuber for sharing their expertise and invaluable - insights. And considering the ingenuity and technical expertise of the TEI - community, we are quite certain there are even more ways. In short, it can be a - challenging and time-consuming process to write and tweak vocabulary-specific and - schema-aware tools—a daunting task for any TEI XML user who lacks a certain level of - technical expertise.

+ complicated set of XSLT instructions. At the very least, these + instructions need to match the values of the xml:id and prev in + order to process the first part of the deletion, look for the second part of the + deletion, and then concatenate their textual content. At the same time, one has to + prevent the second part from being processed twice (first as the second part of the + deletion, and the second time together with the regular del elements). After + some experimenting and consulting several XSLT specialists, we have + come to no less than three different sets of instructions.The authors are + grateful to Peter Boot, Vincent Neyt, and Frederike Neuber for sharing their + expertise and invaluable insights. And considering the ingenuity and + technical expertise of the TEI community, we are quite certain there are even more + ways. In short, it can be a challenging and time-consuming process to write and tweak + vocabulary-specific and schema-aware tools—a daunting task for any TEI XML user who + lacks a certain level of technical expertise.

Conclusion diff --git a/data/JTEI/14_2021-23/jtei-burnard-shoch-odebrecht-194-source.xml b/data/JTEI/14_2021-23/jtei-burnard-shoch-odebrecht-194-source.xml index b0072f6..9b9e126 100644 --- a/data/JTEI/14_2021-23/jtei-burnard-shoch-odebrecht-194-source.xml +++ b/data/JTEI/14_2021-23/jtei-burnard-shoch-odebrecht-194-source.xml @@ -1,6 +1,6 @@ - - + + + @@ -200,12 +200,15 @@ projects, though the TEI Consortium website has for many years offered a platform for one: Projects Using the TEI, accessed May 17, 2021, . More recently, the - TEIhub project lists more than 12,500 GitHub-hosted TEI projects (last updated May - 11, 2021, ); an associated bot called - TEI Pelican provides a daily twitter feed of new GitHub repositories containing a - TEI header. We are unaware of any systematic analysis of the application types - indicated by these data sources, but a glance gives the impression that - traditional editorial and resource-building projects predominate. + TEIhub project lists more than 12,500 GitHub-hosted TEI + projects (last updated May 11, 2021, ); + an associated bot called TEI Pelican provides a daily twitter feed of new GitHub repositories containing a TEI header. We are unaware + of any systematic analysis of the application types indicated by these data + sources, but a glance gives the impression that traditional editorial and + resource-building projects predominate.

The work of the ActionFurther information about the Action is available from its website at . For information @@ -224,8 +227,10 @@ of the corpus. Working papers on each of these topics plus a fourth on theoretical issues of sampling and balance were prepared for discussion and approval by the members of WG1, and remain available from the Working Group’s website. These - and other documents are available from the Action’s GitHub page, accessed May 17, - 2021, . + and other documents are available from the Action’s GitHub page, accessed May 17, 2021, .

@@ -281,11 +286,11 @@ chapter, or letter only.An exception is made for epistolary novels which contain only the representation of a sequence of letters, with no other significant content: these may be marked as div - type="letter". For ELTeC purposes, a chapter - is considered to be the smallest subsection of a novel within which paragraphs of - text appear directly. Further subdivisions within a chapter (often indicated - conventionally by ellipses, dashes, stars, etc.) are marked using the - milestone element; larger groupings of div elements are + type="letter". For ELTeC purposes, a + chapter is considered to be the smallest subsection of a + novel within which paragraphs of text appear directly. Further subdivisions within a + chapter (often indicated conventionally by ellipses, dashes, stars, etc.) are marked + using the milestone element; larger groupings of div elements are indicated by div elements, always of type part, whatever their hierarchical level. Headings, at whatever level, are always marked using the head element when appearing at the start of a div, and the @@ -333,8 +338,8 @@ source (typically an illustration) has been left out of the encoding; the elements note and ref may be used to capture the location and content of authorially supplied footnotes or endnotes; wherever they occur in - the source, notes must be collected together in a div type="notes" - within a back element. + the source, notes must be collected together in a div + type="notes" within a back element.
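Schematically, and with placeholder content rather than text from any actual ELTeC novel, the structure just described looks like this:
  <body>
    <div type="part">
      <div type="chapter">
        <head>Chapter I.</head>
        <p>...</p>
        <milestone unit="section"/><!-- subdivision signalled by ellipses, dashes, stars, etc. -->
        <p>...</p>
      </div>
    </div>
  </body>
  <back>
    <div type="notes">
      <note>...</note><!-- authorial footnotes and endnotes collected here -->
    </div>
  </back>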

This list of elements may seem distressingly small. It lacks entirely some elements which every TEI introductory course regards as indispensable (no list or @@ -657,13 +662,16 @@ resulting encoding standard. As with other ODDs, we are then able to produce documentation and formal schemas which reflect exactly the scope of each encoding level.

-

The ODD sources and their outputs are maintained on GitHub and are also published on Zenodo (Odebrecht et al. 2019) along with the - ELTeC corpora.The GitHub repository for the ELTeC collection (last updated May - 17, 2021) is found at ; the Zenodo - community within which it is being published (last updated April 11, 2021) lives - at . +

The ODD sources and their outputs are maintained on GitHub + and are also published on Zenodo + (Odebrecht et al. 2019) along with + the ELTeC corpora.The GitHub repository for the ELTeC + collection (last updated May 17, 2021) is found at ; the Zenodo community within which it + is being published (last updated April 11, 2021) lives at .

@@ -681,7 +689,8 @@ development and are expected to become available during the coming year. As noted above, up-to-date information about the current state of all corpora is publicly visible at , which includes - links to the individual GitHub repositories for each corpus.

+ links to the individual GitHub repositories for each corpus.

As well as continuing to expand the collection, and continuing to fine-tune its composition, we hope to improve the consistency and reliability of the metadata associated with each text, as far as possible automatically. For example, we have @@ -715,7 +724,8 @@ History. Ann Arbor, MI: University of Michigan Press. Burnard, Lou. 2016. ODD Chaining for Beginners. TEI Council Technical Working - Paper. TEI GitHub IO Repository. Available at GitHub IO Repository. Available at . Burnard, Lou. 2019. What Is TEI Conformance, and Why Should You Care? diff --git a/data/JTEI/14_2021-23/jtei-cc-pn-erjavec-195-source.xml b/data/JTEI/14_2021-23/jtei-cc-pn-erjavec-195-source.xml index d61eaf0..0152101 100644 --- a/data/JTEI/14_2021-23/jtei-cc-pn-erjavec-195-source.xml +++ b/data/JTEI/14_2021-23/jtei-cc-pn-erjavec-195-source.xml @@ -1,6 +1,6 @@ - - - + + + @@ -324,15 +324,18 @@

Presentation of Parla-CLARIN

Like the TEI Guidelines, the Parla-CLARIN recommendations are available on GitHub, as a - projectTomaž Erjavec and Andrej Pančur, Parla-CLARIN project GitHub site, last - updated March 17, 2021, . of the CLARIN ERIC collection. The project contains a folder for the schema - (i.e., the Parla-CLARIN ODD document and XML schemas derived from it), a folder for the - programs that convert the ODD into the XML schemas and to the HTML of the prose and - schema definitions, and a folder for examples, which contains an artificial but fully - worked out example of a Parla-CLARIN document and subfolders with various example - resources, where each should contain: + target="https://github.com/clarin-eric/parla-clarin/">GitHub, as a projectTomaž Erjavec and Andrej Pančur, Parla-CLARIN + project GitHub site, last updated March 17, 2021, . of the CLARIN ERIC + collection. The project contains a folder for the schema (i.e., the Parla-CLARIN ODD + document and XML schemas derived from it), a folder for the programs that convert the + ODD into the XML schemas and to the HTML of the prose and schema definitions, and a + folder for examples, which contains an artificial but fully worked out example of a + Parla-CLARIN document and subfolders with various example resources, where each should + contain: a sample of a corpus in its source encoding; XSLT script to convert it into Parla-CLARIN; and the output of the conversion. @@ -570,8 +573,10 @@ straightforward customization of the TEI Guidelines, with the majority of the effort having gone into the writing of the prose guidelines of the Parla-CLARIN recommendations and into developing the conversion from Akoma Ntoso to Parla-CLARIN. We have not included - examples of the encoding, as these are readily available on the GitHub documentation page - of the project, and large Parla-CLARIN encoded corpora are openly available.

+ examples of the encoding, as these are readily available on the GitHub + documentation page of the project, and large Parla-CLARIN encoded corpora are openly + available.

Apart from the siParl 2.0 corpus mentioned above (), the recommendations have already been used to encode the Czech (Hladka, Kopp, and Straňák @@ -613,10 +618,11 @@ recommendations. Furthermore, we plan to change the examples given in the schema specification from the default ones in the TEI Guidelines to ones taken or adapted from the collected parliamentary corpora.

-

Second, as we have already done for ParlaMint, we plan to add to the GitHub Parla-CLARIN - project more down-conversion scripts with which we would increase the usability of the - Parla-CLARIN corpora. As mentioned, work also needs to be done to develop a conversion to - RDF.

+

Second, as we have already done for ParlaMint, we plan to add to the GitHub + Parla-CLARIN project more down-conversion scripts with which we would increase the + usability of the Parla-CLARIN corpora. As mentioned, work also needs to be done to develop + a conversion to RDF.

Last, but not least, one of the great benefits of Git is the ability to support collaborative work, be it through posting issues, or through using pull requests to incorporate changes. While the community has not so far made use of these options, we hope diff --git a/data/JTEI/14_2021-23/jtei-cc-pn-holmes-193-source.xml b/data/JTEI/14_2021-23/jtei-cc-pn-holmes-193-source.xml index 331ed88..ea72479 100644 --- a/data/JTEI/14_2021-23/jtei-cc-pn-holmes-193-source.xml +++ b/data/JTEI/14_2021-23/jtei-cc-pn-holmes-193-source.xml @@ -1,8 +1,6 @@ - - - + + @@ -79,14 +77,16 @@ both an RDB and an XML document collection. Programmers must then integrate these distinct forms of data when building project outputs. This article discusses the Digital Victorian Periodical Poetry (DVPP) project, where metadata on about 15,000 poems from - nineteenth-century periodicals is captured in a MySQL database, and periodically exported - to create a TEI file for each poem. Many of the poems are then transcribed and encoded. - The canonical source of metadata is the RDB, while the canonical source of textual data is - the TEI file. Metadata in the TEI files must be periodically updated from the RDB, without - disturbing the textual encoding. Changes to the RDB data may result in changes to the id - and filename of the related TEI file, so any existing TEI data is migrated to a new file, - and the Subversion repository must be appropriately updated. All of this is done with XSLT - and Ant.

+ nineteenth-century periodicals is captured in a MySQL database, and periodically + exported to create a TEI file for each poem. Many of the poems are then transcribed and + encoded. The canonical source of metadata is the RDB, while the canonical source of + textual data is the TEI file. Metadata in the TEI files must be periodically updated from + the RDB, without disturbing the textual encoding. Changes to the RDB data may result in + changes to the id and filename of the related TEI file, so any existing TEI data is + migrated to a new file, and the Subversion repository must be appropriately updated. All + of this is done with XSLT and Ant.

The project described in this paper is supported by a which is still a hierarchy, but a hierarchy in which participant segments have been rearranged, possibly drastically; in other words, it allows for multiple hierarchies over the same - dataset. However, beginning with the work of E. F. Codd in the 1970s and the rise of SQL, + dataset. However, beginning with the work of E. F. Codd in the 1970s and the rise of SQL, the relational database model familiar today became dominant, and remained so until the relatively recent popularity of NoSQL approaches.

In modeling humanities datasets, both relational databases and XML have notable @@ -121,7 +122,8 @@ 2018). While RDBMSes have traditionally been considered to have advantages in terms of enforceable constraints on linking and data integrity as well as speed, members of the TEI and related communities favor XML, pointing to its flexibility and - extensibility. In recent years, the speed and power of XSLT and XQuery tools, the + extensibility. In recent years, the speed and power of XSLT and XQuery tools, the development of a rich array of schema and validation tools, and the appearance of XML databases have all but eradicated the traditional advantages claimed for relational databases.

@@ -159,7 +161,8 @@ poem, line, and stanza in the middle of its referential domain of study (para 8). Gibson (2012) describes a similar scenario with mixed RDB and - XML data, and how he used Saxonʼs SQL extension functions to overcome the problem.

+ XML data, and how he used Saxonʼs SQL extension functions to overcome the problem.

However, storing XML data in RDB fields is suboptimal. Most serious encoding projects make use of version-control systems such as Git or Subversion, for very good reasons: in a project with many transcribers and encoders, where multiple waves of encoding and @@ -183,26 +186,30 @@ project. This project began life many years ago as a pure-metadata project, capturing information about tens of thousands of poems that appeared in British periodicals during the nineteenth century. At that time, an RDB system seemed a natural and sufficient tool - for the job, so a MySQL database, along with a data-entry interface, was set up for the - researchers, and data collection proceeded rapidly (). However, after some years the project gained an additional - research focus, and more recently funding from the Social Sciences and Research Council of - Canada, to transcribe and encode a subset of these poems; we are focusing primarily on the - decade years (1820, 1830, 1840, and so on through to 1900), and at the time of writing we - have encoded more than 2,000 poems. Meanwhile, indexing of the much larger dataset - continues.

for the job, so a MySQL database, along with a data-entry interface, + was set up for the researchers, and data collection proceeded rapidly (). However, after some years the project gained an + additional research focus, and more recently funding from the Social Sciences and Humanities Research + Council of Canada, to transcribe and encode a subset of these poems; we are focusing + primarily on the decade years (1820, 1830, 1840, and so on through to 1900), and at the + time of writing we have encoded more than 2,000 poems. Meanwhile, indexing of the much + larger dataset continues.

A record in the relational database.
-

The MySQL database is relatively straightforward. The main table is the Poems table, in - which each record corresponds to a specific poem appearing in a given periodical on a - given date. Another table, Organs, contains the list of periodicals, and each poem record - points to a single organ record. A third table, People, contains information about - individuals who have played roles in the production of poems, as authors, translators, or - illustrators; people are linked in one-to-many relationships through role tables, so that - one poem may have multiple illustrators, and the author of one poem may be the translator - of another. The database front end is written in PHP.

+

The MySQL database is relatively straightforward. The main table is the + Poems table, in which each record corresponds to a specific poem appearing in a given + periodical on a given date. Another table, Organs, contains the list of periodicals, and + each poem record points to a single organ record. A third table, People, contains + information about individuals who have played roles in the production of poems, as + authors, translators, or illustrators; people are linked in one-to-many relationships + through role tables, so that one poem may have multiple illustrators, and the author of + one poem may be the translator of another. The database front end is written in PHP.
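In mysqldump's XML output (used later in the integration process), the shape of these tables can be sketched roughly as follows — all table and field names here are invented for the example, not the project's actual schema:
  <table_data name="poems">
    <row>
      <field name="id">12345</field>
      <field name="title">...</field>
      <field name="organ_id">67</field><!-- the periodical this poem appeared in -->
    </row>
  </table_data>
  <table_data name="poem_author_roles"><!-- one row per person-poem link -->
    <row>
      <field name="poem_id">12345</field>
      <field name="person_id">89</field>
    </row>
  </table_data>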

Our long-term plan is for the entire dataset to be in the form of TEI XML files, but for the first few years of the project, data will continue to be added to the RDB system, since we have good methods and protocols for this, as well as trained research assistants @@ -237,7 +244,8 @@ target="https://hcmc.uvic.ca/svn/dvpp/buildTEI.xml">Subversion repository, and details of the process can be found in our project - documentation. and XSLT (). + documentation. and XSLT ().
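The heart of such an update step can be sketched as an identity transform that copies the encoded text untouched and rebuilds only the header from the matching dump record; the parameter name, matching logic, and header content below are assumptions for the example, not the project's actual stylesheet, and namespace handling is omitted:
  <xsl:param name="dump" select="doc('poems_dump.xml')"/>
  <!-- copy the existing file, textual encoding included, as-is -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- ...except the header, which is refreshed from the database record -->
  <xsl:template match="teiHeader">
    <xsl:variable name="rec" select="$dump//row[field[@name='id'] = /TEI/@xml:id]"/>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title><xsl:value-of select="$rec/field[@name='title']"/></title>
        </titleStmt>
        <!-- further metadata rebuilt from $rec -->
      </fileDesc>
    </teiHeader>
  </xsl:template>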

A simple representation of the metadata integration process. @@ -246,13 +254,14 @@

In the initial part of the process, the current state of the database is dumped into an XML file (the application mysqldump can provide data in XML format). This file is stored in the Subversion repository, giving us at least a semblance of version control over the - SQL data, albeit in a rather impoverished fashion. Each poem record in the database is - matched against an equivalent XML file if there is one. If there is no matching file, then - one is created. If there is a matching file and no changes are required to its filename - and id, then its metadata is simply updated based on the poem record from the database. If - there is a matching file, but modifications to the filename and id are needed, then a new - file is created and all relevant TEI data is migrated into that file. Then, + SQL data, albeit in a rather impoverished fashion. Each poem record in the + database is matched against an equivalent XML file if there is one. If there is no + matching file, then one is created. If there is a matching file and no changes are + required to its filename and id, then its metadata is simply updated based on the poem + record from the database. If there is a matching file, but modifications to the filename + and id are needed, then a new file is created and all relevant TEI data is migrated into + that file. Then, If a file is new (i.e., not already tracked by Subversion), it must be added to the repository. If a file has not changed during this operation, that means it is obsolete and @@ -343,7 +352,10 @@ Conference, Graz, Austria, 19 September 2019. . Gibson, Matthew. 2012. Using XSLT’s SQL Extension with Encyclopedia Virginia. + level="a">Using XSLT’s SQL Extension with Encyclopedia + Virginia. Code{4}lib Journal 16. . diff --git a/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml b/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml index 84d088a..4bcce84 100644 --- a/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml +++ b/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml @@ -1,7 +1,6 @@ - - + + @@ -438,13 +437,13 @@

As the IIP documents are stored in an institutional repository, each file also has a persistent identifier (PID) provided by the repository in the form of a URI. The PID is not currently referenced in the IIP header metadata, but like the project ID, it can be - encoded with idno type="BDR". Some projects have developed workflows that - incorporate a repository-generated DOI back into the file, as the file is being deposited - into a repository like Zenodo (Prag 2020; Wagner, n.d.). There may be other types of files - beyond the epigraphic documents that should have unique identifiers: for example, local - authority lists if they are external to the documents, or XSL scripts that can be used to - process the files. + encoded with idno type="BDR". Some projects have developed workflows + that incorporate a repository-generated DOI back into the file, as the file is being + deposited into a repository like Zenodo (Prag + 2020; Wagner, n.d.). There may be other + types of files beyond the epigraphic documents that should have unique identifiers: for + example, local authority lists if they are external to the documents, or XSL scripts that + can be used to process the files. FM-F1B: Identifier persistence []

@@ -619,10 +618,12 @@ ontologies such as CIDOC-CRM, Nomisma, and CRMtexCIDOC (International - Committee for Documentation) Conceptual Reference Model, accessed July 4, 2022, ; Nomisma (knowledge organization system for - numismatics), accessed July 4, 2022, ; CRMtex model - for the study of ancient texts (an extension of CIDOC-CRM), accessed July 4, 2022, Reference Model, + accessed July 4, 2022, ; Nomisma (knowledge + organization system for numismatics), accessed July 4, 2022, ; CRMtex model for the study of ancient texts (an + extension of CIDOC-CRM), accessed July 4, 2022, . to describe inscribed objects (Bodard et al. 2021). Reuse @@ -665,8 +666,9 @@ governing body which might provide the certification, and in fact the Guidelines are, as indicated in their name, not a standard. However, if the TEI consortium and its members recommend how to fulfill the FAIR principles by using the teiHeader as discussed - for each metric above, and provide XSLT and Schematron files for validation, the output of - that file could indicate compliance. As always, it is the responsibility of the encoder + for each metric above, and provide XSLT and Schematron files for validation, the output + of that file could indicate compliance. As always, it is the responsibility of the encoder and project to make sure that the metadata are accurate and detailed.

@@ -678,12 +680,13 @@ the TEI header can provide much of the information required by the FAIR principles in a predictable and machine-readable form. Specifically, the TEI Guidelines and schema indicate where and how to encode licensing information, metadata formats, documentation, - and identifiers and their presence can be verified using XSLT, XPath, and Schematron. - Overall, the affordances of the TEI, best practices of the EpiDoc community, and IIP - archival format decisions have resulted in a set of documents that measure up to the - requirements of FAIR metrics. Furthermore, the best practices adopted by the IIP project - over the course of its development incorporated many FAIR behaviors with little extra - effort.

+ and identifiers and their presence can be verified using XSLT, XPath, and + Schematron. Overall, the affordances of the TEI, best practices of the EpiDoc community, + and IIP archival format decisions have resulted in a set of documents that measure up to + the requirements of FAIR metrics. Furthermore, the best practices adopted by the IIP + project over the course of its development incorporated many FAIR behaviors with little + extra effort.
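For instance, the FM-F1A/FM-F3 requirement of a recorded DOI could be checked with a short Schematron pattern along the following lines — a sketch of the idea, not an official TEI or IIP deliverable:
  <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
    <sch:pattern>
      <sch:rule context="tei:TEI/tei:teiHeader/tei:fileDesc/tei:publicationStmt">
        <!-- FM-F1A / FM-F3: the metadata must carry the identifier of the data -->
        <sch:assert test="tei:idno[@type='DOI']">No DOI recorded in publicationStmt.</sch:assert>
      </sch:rule>
    </sch:pattern>
  </sch:schema>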

However, some issues that are unique to TEI encoding, as well as future directions of FAIR, indicate areas where the TEI community could do more. Generally, these are formal components that have to do primarily with persistence of metadata external to the IIP @@ -769,7 +772,7 @@ FM-F1A: Identifier uniqueness Check for DOI: - /TEI/teiHeader/fileDesc/publicationStmt/idno[@type="DOI"] + /TEI/teiHeader/fileDesc/publicationStmt/idno[@type="DOI"] FM-F1B: Identifier persistence @@ -786,7 +789,7 @@ FM-F3: Metadata clearly and explicitly include the identifier of the data they describe Test for existence of DOI (or other identifiers): - /TEI/teiHeader/fileDesc/publicationStmt/idno[@type="DOI"] + /TEI/teiHeader/fileDesc/publicationStmt/idno[@type="DOI"] FM-F4: (Meta)data are registered or indexed in a searchable resource @@ -829,7 +832,7 @@ FM-R1.2: (Meta)data are associated with detailed provenance /TEI/teiHeader/fileDesc/titleStmt xml version="1.0" encoding="UTF-8" - xmlns="http://www.tei-c.org/ns/1.0" + xmlns="http://www.tei-c.org/ns/1.0" /TEI/teiHeader/encodingDesc/schemaRef diff --git a/data/JTEI/16_2023_spa/jtei-calarco-232-source.xml b/data/JTEI/16_2023_spa/jtei-calarco-232-source.xml index 303b22e..e98c6b9 100644 --- a/data/JTEI/16_2023_spa/jtei-calarco-232-source.xml +++ b/data/JTEI/16_2023_spa/jtei-calarco-232-source.xml @@ -636,23 +636,26 @@ documento TEI codificado mediante esta propuesta:

En pocas de palabras vos quiero destajar
la obra de las armas qu’Aquiles mandó far,
que, si por orden todo lo quesiés’én’ notar,
serié un brevïario que prendrié grant logar.
diff --git a/data/JTEI/16_2023_spa/jtei-rioriande-torresallen-250-source.xml index 0485f45..a4206a0 100644 --- +++ @@ -1,5 +1,6 @@ - + + + @@ -455,14 +456,17 @@

A continuación, preguntamos a los participantes por las tecnologías utilizadas en sus proyectos, proponiéndoles respuestas múltiples que incluían: 1. Personalizaciones de los módulos TEI, 2. Utilización de esquemas (ODD, RelaxNG, - etc.), 3. Bases de datos XML, 4. Bases de datos relacionales (MySQL, PostgreSQL, - etc.), 5. Transformaciones XSLT, 6. Transformaciones XQuery, 7. Vocabularios - controlados o tesauros, 8. Tecnologías sobre Sistemas de Información Geográfica - (SIG), 9. Tecnologías de la información y la comunicación, 9. Tecnologías sobre - Procesamiento del Lenguaje Natural (PLN), 10. Tecnologías sobre web semántica, 11. - Visualización de datos, 12. Otros. El objetivo detrás de algunas de las opciones - que dimos era confirmar si la comunidad de uso de la TEI en español era de hecho - consciente de las mejores prácticas ya en uso a nivel internacional.

etc.), 3. Bases de datos XML, 4. Bases de datos relacionales (MySQL, + PostgreSQL, etc.), 5. Transformaciones XSLT, 6. Transformaciones + XQuery, 7. Vocabularios controlados o tesauros, 8. Tecnologías sobre Sistemas de + Información Geográfica (SIG), 9. Tecnologías de la información y la comunicación, + 10. Tecnologías sobre Procesamiento del Lenguaje Natural (PLN), 11. Tecnologías + sobre web semántica, 12. Visualización de datos, 13. Otros. El objetivo detrás de + algunas de las opciones que dimos era confirmar si la comunidad de uso de la TEI + en español era de hecho consciente de las mejores prácticas ya en uso a nivel + internacional.

Las respuestas revelaron que la mayoría de los proyectos recurren a personalizaciones del esquema TEI (40) y utilizan esquemas y ODD (39), que son algunos de los requisitos para un flujo de trabajo de marcado documentado, @@ -475,22 +479,28 @@

En lo que respecta a la transformación y renderizado de XML, el lenguaje más - utilizado parece seguir siendo el ya veterano XSLT (29),XSLT, fecha de - consulta 16 de julio de 2023, . mientras que las transformaciones XQuery (11)XQuery, fecha de - consulta 16 de julio de 2023, . se utilizan menos. Esto no parece coincidir del todo con la pregunta - anterior sobre el uso de bases de datos XML, ya que la mayoría de las bases de - datos XML nativas utilizan XQuery como principal herramienta de recuperación de - datos, en lugar de XSLT. Entre los participantes existe una curiosa mezcla entre - las formas antiguas y las nuevas.

+ utilizado parece seguir siendo el ya veterano XSLT (29),XSLT, fecha de consulta 16 de julio de 2023, . mientras que las + transformaciones XQuery (11)XQuery, fecha de consulta 16 de julio de 2023, + . se utilizan menos. + Esto no parece coincidir del todo con la pregunta anterior sobre el uso de bases + de datos XML, ya que la mayoría de las bases de datos XML nativas utilizan XQuery + como principal herramienta de recuperación de datos, en lugar de XSLT. Entre los participantes existe una curiosa mezcla entre las formas + antiguas y las nuevas.

Otras prácticas que los participantes adoptan cuando trabajan con TEI son la visualización de datos (31), y los vocabularios controlados o tesauros (22), así como las tecnologías relacionadas con la web semántica (16), los sistemas SIG (11), y el Procesamiento del Lenguaje Natural (12%). En Otros, algunos participantes explicaron que utilizaban scripts de interoperabilidad entre - distintos esquemas (DCAT, DDI-CDI), anotación lingüística de corpus, XSLT - LaTex - - PDF y Cocoon. Lamentablemente, el 9% eligió otros sin especificar más.

distintos esquemas (DCAT, DDI-CDI), anotación lingüística de corpus, XSLT - LaTeX - PDF y Cocoon. Lamentablemente, el 9% eligió otros sin + especificar más.

Tecnologías utilizadas en proyectos de edición digital con @@ -514,9 +524,13 @@ los participantes para publicar sus archivos TEI. El objetivo era controlar si había alguna plataforma que se destacara por su uso. Por ello, propusimos las siguientes: 1. Infraestructura web creada ad hoc (por ejemplo, XML, - XSLT, PHP, Python, etc.), 2. Generadores web estáticos (Jekyll, Gatsby, etc.), 3. - Boilerplate, 4. eXist; 5. TEI Publisher, 6. CETEIcean, 7. Edition Visualization - Technology, y añadimos una opción 8. Otros.

XSLT, PHP, Python, etc.), 2. + Generadores web estáticos (Jekyll, Gatsby, etc.), 3. Boilerplate, 4. eXist, 5. TEI + Publisher, 6. CETEIcean, 7. Edition Visualization Technology, y añadimos una + opción 8. Otros.

No es sorprendente que la puntuación más alta correspondiera a las infraestructuras creadas ad hoc (43). Así pues, la gran mayoría siente la necesidad de construir su propia infraestructura en términos de @@ -540,9 +554,13 @@ probablemente a dos hechos: primero, el auge de la computación mínima que permite la creación de infraestructuras sin necesidad de servidores comerciales o institucionales (por ejemplo, los sitios estáticos pueden vivir en servicios - gratuitos como GitHub y GitHub Pages)GitHub Pages, fecha de consulta 16 de - julio de 2023, . y segundo, la - falta de acceso a infraestructuras digitales especialmente en América + gratuitos como GitHub y GitHub Pages)GitHub Pages, fecha de consulta 16 + de julio de 2023, . y segundo, + la falta de acceso a infraestructuras digitales especialmente en América Latina.Para más información sobre el movimiento de la computación mínima, véase la página del grupo Minimal Computing, fecha de consulta 16 de julio de 2023, . Además, @@ -667,10 +685,11 @@ mejorar la enseñanza y el aprendizaje de la TEI? Algunos de los participantes insistieron en la necesidad de recursos y materiales de formación a todos los niveles y en español, incluyendo además temas específicos (por ejemplo, - transformaciones con XSLT), así como otros tipos de recursos, como referencias - bibliográficas. La necesidad de formación formal e informal dentro y fuera del - mundo académico surge también como otra preocupación legítima. Entre las demandas - de formación aflora el desconocimiento de las etapas finales del proceso + transformaciones con XSLT), así como otros tipos de recursos, como + referencias bibliográficas. La necesidad de formación formal e informal dentro y + fuera del mundo académico surge también como otra preocupación legítima. Entre las + demandas de formación aflora el desconocimiento de las etapas finales del proceso editorial, es decir, la transformación de los archivos codificados en XML-TEI y su publicación en línea. Esto parece indicar que la comunidad está deseosa de avanzar en sus conocimientos técnicos. El deseo de contar con editores XML libres y @@ -700,25 +719,26 @@ instrucciones generales para marcar textos de modo que puedan ser procesados por ordenador. El consorcio de la TEI está insuflado de un espíritu de apertura y colaboración que hace que los propios usuarios puedan proponer mejoras y - modificaciones a través de su repositorio en GitHub. Este espíritu de colaboración y - difusión se manifiesta también a través de un diálogo continuo mediante una lista de - discusión en línea y la organización anual de un congreso internacional. Por último, - el consorcio es un gran paraguas que da lugar a diferentes grupos de trabajo y de - interés para determinados temas, como la creación de diccionarios, el trabajo con - lingüística de corpus, o la descripción de manuscritos, entre otros. Sin embargo, lo - anterior se aplica principalmente al mundo anglófono, donde, como decíamos al - principio, nació la TEI, pero esto no quiere decir que no contemos con proyectos e - iniciativas en países hispanohablantes. Con este estudio hemos tratado de profundizar - en una comunidad particular de practicantes, no sólo definida por la lengua de los - textos con los que trabaja, el español, sino también por sus peculiaridades y - necesidades culturales y sociales específicas. 
En este intento internacional por - parte de la comunidad de la TEI de respetar la diversidad e incluir otros idiomas - además del inglés, está más claro que nunca que tanto el conocimiento lingüístico - como el cultural son necesarios si se quiere entender a las personas y a sus - comunidades (Modern Language Association - 2007). La recopilación de los datos más destacados de nuestra encuesta ha - buscado ofrecer un sucinto estado de la cuestión sobre el uso de la TEI en la - actualidad para textos en español. Las respuestas de los participantes nos han + modificaciones a través de su repositorio en GitHub. Este espíritu de + colaboración y difusión se manifiesta también a través de un diálogo continuo + mediante una lista de discusión en línea y la organización anual de un congreso + internacional. Por último, el consorcio es un gran paraguas que da lugar a diferentes + grupos de trabajo y de interés para determinados temas, como la creación de + diccionarios, el trabajo con lingüística de corpus, o la descripción de manuscritos, + entre otros. Sin embargo, lo anterior se aplica principalmente al mundo anglófono, + donde, como decíamos al principio, nació la TEI, pero esto no quiere decir que no + contemos con proyectos e iniciativas en países hispanohablantes. Con este estudio + hemos tratado de profundizar en una comunidad particular de practicantes, no sólo + definida por la lengua de los textos con los que trabaja, el español, sino también + por sus peculiaridades y necesidades culturales y sociales específicas. En este + intento internacional por parte de la comunidad de la TEI de respetar la diversidad e + incluir otros idiomas además del inglés, está más claro que nunca que tanto el + conocimiento lingüístico como el cultural son necesarios si se quiere entender a las + personas y a sus comunidades (Modern Language + Association 2007). La recopilación de los datos más destacados de nuestra + encuesta ha buscado ofrecer un sucinto estado de la cuestión sobre el uso de la TEI + en la actualidad para textos en español. Las respuestas de los participantes nos han ayudado a constatar algunas de nuestras percepciones. Desde el punto de vista geográfico, ahora tenemos claro que la mayoría de los estudiosos y profesionales de la TEI tienen su sede en España, seguidos de México, Estados Unidos, Colombia y @@ -760,14 +780,16 @@ programación y la alfabetización digital en general. Además, diríamos que han surgido de forma audible otras dos preocupaciones, en primer lugar, la necesidad de materiales de formación para todo el proceso de labor editorial, con especial - insistencia en la transformación del XML-TEI (XSLT, XQuery) y publicación de los - archivos TEI. Esto significa que la comunidad ya ha superado las fases iniciales de - familiarización con la TEI, y que ahora urgen temas más avanzados. En segundo lugar, - han aparecido varias voces que defienden la necesidad de contar con editores XML - gratuitos. El hecho de que el software más popular sea propietario desanima a algunos - usuarios. Ni que decir tiene que ya se han dado algunos pasos para adaptar software - de código abierto para trabajar con archivos TEI, como Visual Studio Code, que se ha - beneficiado del desarrollo del plugin Scholarly + insistencia en la transformación del XML-TEI (XSLT, XQuery) y publicación + de los archivos TEI. Esto significa que la comunidad ya ha superado las fases + iniciales de familiarización con la TEI, y que ahora urgen temas más avanzados. 
En + segundo lugar, han aparecido varias voces que defienden la necesidad de contar con + editores XML gratuitos. El hecho de que el software más popular sea propietario + desanima a algunos usuarios. Ni que decir tiene que ya se han dado algunos pasos para + adaptar software de código abierto para trabajar con archivos TEI, como Visual Studio + Code, que se ha beneficiado del desarrollo del plugin Scholarly XML, que permite una codificación básica pero rigurosa en XML-TEI.Este trabajo se debe a Raffaele Viglianti (Maryland Institute for Technology in the Humanities) y el plugin puede descargarse en línea en: fecha de diff --git a/data/JTEI/7_2014/jtei-7-dee-source.xml b/data/JTEI/7_2014/jtei-7-dee-source.xml index 68d6e30..8792c1f 100644 --- a/data/JTEI/7_2014/jtei-7-dee-source.xml +++ b/data/JTEI/7_2014/jtei-7-dee-source.xml @@ -1,4 +1,6 @@ + + @@ -731,20 +733,23 @@

Integrated Resources

While initiatives such as TAPAS, TEICHI, and CWRC-Writer

Welcome to CWRC Writer, - CWRC-Writer Help, accessed September 7, 2013, .

have begun to - address to different aspects of these needs (Flanders and Hamlin 2013; Pape, Schöch, and - Wegner 2013; Crane 2010), there has yet - to be a deeply comprehensive resource intimately linked to the TEI Guidelines - themselves. New technical infrastructure should support workflows that allow users to - enter the genre with which they are working in a search engine connected to the TEI - Guidelines (e.g., poetry), find a list of relevant tags with explanations - of their functions, and from those tags find projects and files that make use of those - tags; for example, a search that retrieves all TEI-conformant files using an l - tag, and allows the user to search the projects that created these files.

+ target="https://sites.google.com/site/cwrcwriterhelp/">CWRC-Writer

Welcome to CWRC Writer, + CWRC-Writer Help, accessed September 7, + 2013, .

have + begun to address different aspects of these needs (Flanders and Hamlin 2013; Pape, Schöch, and Wegner 2013; Crane + 2010), there has yet to be a deeply comprehensive resource intimately linked to + the TEI Guidelines themselves. New technical infrastructure should support workflows + that allow users to enter the genre with which they are working in a search engine + connected to the TEI Guidelines (e.g., poetry), find a list of relevant + tags with explanations of their functions, and from those tags find projects and files + that make use of those tags; for example, a search that retrieves all TEI-conformant + files using an l tag, and allows the user to search the projects that created + these files.

This vision may be a long way off, and should certainly be modified by community expertise, changing needs, and computational realities. However, this is the kind of organized, integrated, and open plan that the TEI community, both present and potential, diff --git a/data/JTEI/8_2014-15/jtei-8-berti-source.xml b/data/JTEI/8_2014-15/jtei-8-berti-source.xml index 489f877..0269c87 100644 --- a/data/JTEI/8_2014-15/jtei-8-berti-source.xml +++ b/data/JTEI/8_2014-15/jtei-8-berti-source.xml @@ -1,5 +1,6 @@ - - + + + + years + old we were at our house near + + +

This solution is explained in the project’s documentation, and the convention used would diff --git a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml index 008005d..0cefc12 100644 --- a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml +++ b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml @@ -1,4 +1,6 @@ + + @@ -155,14 +157,15 @@ level="m">Electronic Beowulf, .

in favor of a web-based publication. While this decision was critical in that it allowed us to select the most supported and widely-used medium, we soon - discovered that it did not make choices any simpler. On the one hand, the XSLT - stylesheets provided by TEI are great for HTML rendering, but do not include support for - image-related features (such as the text-image linking available thanks to the P5 - version of the TEI schema) and tools (including zoom in/out, magnifying lens, and hot - spots) that represent a significant part of a digital facsimile and/or diplomatic - edition; other features, such as an XML search engine, would have to be integrated - separately, in any case. On the other hand, there are powerful frameworks based on - CMS

The Omeka framework () supports + discovered that it did not make choices any simpler. On the one hand, the XSLT stylesheets provided by TEI are great for HTML rendering, but do not + include support for image-related features (such as the text-image linking available + thanks to the P5 version of the TEI schema) and tools (including zoom in/out, magnifying + lens, and hot spots) that represent a significant part of a digital facsimile and/or + diplomatic edition; other features, such as an XML search engine, would have to be + integrated separately, in any case. On the other hand, there are powerful frameworks + based on CMS

The Omeka framework () supports publishing TEI documents; see also Drupal (‎) and TEICHI ().

and other web technologies

Such as the eXist XML database, TEI Boilerplate,

TEI Boilerplate, .

John A. - Walsh’s collection of XSLT - stylesheets,

tei2html, XSLT stylesheets,

tei2html, .

and Solenne Coutagne’s work for the Berliner Intellektuelle 1800–1830 @@ -238,19 +242,21 @@ intellektuellen Berlin um 1800, .

Through this approach, we achieved two important results: first, usage of EVT is quite - simple—the user applies an XSLT stylesheet to their already marked-up file(s), and when - the processing is finished they are presented with a web-ready edition; second, the web - edition that is produced is based on a client-only architecture and does not require any - additional kind of server software, which means that it can be simply copied on a web - server to be used at once, or even on a cloud storage service (provided that it is - accessible by the general public).

+ simple—the user applies an XSLT stylesheet to their already marked-up file(s), + and when the processing is finished they are presented with a web-ready edition; second, + the web edition that is produced is based on a client-only architecture and does not + require any additional kind of server software, which means that it can be simply copied + on a web server to be used at once, or even on a cloud storage service (provided that it + is accessible by the general public).

To ensure that it will be working on all the most recent web browsers, and for as long as possible on the World Wide Web itself, EVT is built on open and standard web - technologies such as HTML, CSS, and JavaScript. Specific features, such as the - magnifying lens, are entrusted to jQuery plug-ins, again chosen from the best-supported - open-source ones to reduce the risk of future incompatibilities. The general - architecture of the software, in any case, is modular, so that any component which may - cause trouble or turn out to be not completely up to the task can be replaced + technologies such as HTML, CSS, and JavaScript. Specific + features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen + from the best-supported open-source ones to reduce the risk of future incompatibilities. + The general architecture of the software, in any case, is modular, so that any component + which may cause trouble or turn out to be not completely up to the task can be replaced easily.

@@ -258,12 +264,14 @@

Our ideal goal was to have a simple, very user-friendly drop-in tool, requiring little work and/or knowledge of anything beyond XML from the editor. To reach this goal, EVT is based on a modular structure where a single stylesheet (evt_builder.xsl) - starts a chain of XSLT 2.0 transformations calling in turn all the other modules. The - latter belong to two general categories: those devoted to building the HTML site, and - the XML processing ones, which extract the edition text lying between folios using the - pb element and format it according to the edition level. All XSLT modules - live inside the builder_pack folder, in order to have a clean and - well-organized directory hierarchy.

+ starts a chain of XSLT 2.0 transformations calling in turn all the + other modules. The latter belong to two general categories: those devoted to building + the HTML site, and the XML processing ones, which extract the edition text lying between + folios using the pb element and format it according to the edition level. All + XSLT modules live inside the builder_pack folder, in order to + have a clean and well-organized directory hierarchy.
The EVT builder_pack directory structure.
@@ -278,29 +286,36 @@ evt_builder-conf.xsl, to specify for example the number of edition levels or presence of images; you can then apply the evt_builder.xsl stylesheet to your TEI XML - document using the Oxygen XML editor or another XSLT 2–compliant engine. + document using the Oxygen XML editor or another XSLT 2–compliant + engine.
The EVT data directory structure.

-

When the XSLT processing is finished, the starting point for the edition is the - index.html file in the root directory, and all the HTML pages resulting - from the transformations will be stored in the output_data folder. You - can delete everything in this latter folder (and the index.html file), - modify the configuration options, and start again, and everything will be re-created in - the assigned places.

+

When the XSLT processing is finished, the starting point for the edition is + the index.html file in the root directory, and all the HTML pages + resulting from the transformations will be stored in the output_data + folder. You can delete everything in this latter folder (and the + index.html file), modify the configuration options, and start again, + and everything will be re-created in the assigned places.

- The XSLT stylesheets + The XSLT stylesheets

The transformation chain has two main purposes: generate the HTML files containing the edition and create the home page which will dynamically recall the other HTML files.

-

The EVT builder’s transformation system is composed of a modular collection of XSLT 2.0 - stylesheets: these modules are designed to permit scholars to freely add their own - stylesheets and to manage the different desired levels of the edition without - influencing other parts of the system, for instance the generation of the home page.

-

The transformation is performed applying a specific XSLT stylesheet +

The EVT builder’s transformation system is composed of a modular collection of XSLT 2.0 stylesheets: these modules are designed to permit scholars to freely + add their own stylesheets and to manage the different desired levels of the edition + without influencing other parts of the system, for instance the generation of the home + page.

+

The transformation is performed applying a specific XSLT stylesheet (evt_builder.xsl) which includes links to all the other stylesheets that are part of the transformation chain and that will be applied to the TEI XML document containing the transcription.
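Schematically, the entry stylesheet amounts to little more than a chain of inclusions; the module file names below are invented for the example (only the builder_pack/modules/elements location is documented as such):
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:include href="evt_builder-conf.xsl"/><!-- configuration parameters -->
    <xsl:include href="modules/elements/tei_elements.xsl"/><!-- TEI element transformations -->
    <xsl:include href="modules/site/html_site.xsl"/><!-- HTML site structure -->
  </xsl:stylesheet>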

@@ -327,10 +342,10 @@ edition element in the evt_builder-conf.xsl file, and they can also personalize the edition’s name by changing the content of the edition element. For example, if they wish to generate a critical level, they are required to - add <edition>Critical</edition> to the edition_array - variable.

The available edition levels are described in the software - documentation. Adding another edition level requires providing the corresponding - stylesheet.

+ add <edition>Critical</edition> to the + edition_array variable.

The available edition levels are + described in the software documentation. Adding another edition level requires + providing the corresponding stylesheet.
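In practice, then, the relevant part of evt_builder-conf.xsl ends up looking something like this (level names other than Critical are illustrative, and the released file may differ in detail):
  <xsl:variable name="edition_array">
    <edition>Diplomatic</edition>
    <edition>Interpretative</edition>
    <edition>Critical</edition><!-- the newly added level -->
  </xsl:variable>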

Once the XML file is ready and the parameters are set, the EVT builder’s transformation system uses a collection of stylesheets to divide the XML file containing the text of the transcription into smaller portions, each one corresponding to the content of a @@ -350,10 +365,12 @@ transformations or send different parts of a document to different parts of the transformation chain. This permits the extraction of different texts for different edition levels (diplomatic, diplomatic-interpretative) processing the same XML file, and - to save them in the HTML site structure, which is available as a separate XSLT - module.

+ to save them in the HTML site structure, which is available as a separate XSLT module.
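A pair of mode-specific templates for the same TEI element illustrates how the different edition levels are kept apart (the element and the mode names are chosen for the example only):
  <!-- diplomatic level: keep the abbreviation, suppress the expansion -->
  <xsl:template match="choice/abbr" mode="dipl"><xsl:apply-templates mode="dipl"/></xsl:template>
  <xsl:template match="choice/expan" mode="dipl"/>
  <!-- diplomatic-interpretative level: the other way round -->
  <xsl:template match="choice/abbr" mode="interp"/>
  <xsl:template match="choice/expan" mode="interp"><xsl:apply-templates mode="interp"/></xsl:template>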

The use of modes also allows users to separate template rules for the different - transformations of a TEI element and to place them in different XSLT files or in + transformations of a TEI element and to place them in different XSLT files or in different parts of a single stylesheet. So templates such as the following and personalize the edition generation parameter as shown above; - copy their own XSLT files containing the template rules to generate the desired - edition levels in the directory that contains the stylesheets used for TEI element - transformation (builder_pack/modules/elements); + copy their own XSLT files containing the template rules to + generate the desired edition levels in the directory that contains the stylesheets + used for TEI element transformation + (builder_pack/modules/elements); include the new stylesheets in the file used to start the transformation chain (builder_pack/evt_builder.xsl); associate a mode value to the new edition level transformation; @@ -469,8 +488,9 @@ target="http://www.tapor.uvic.ca/~mholmes/image_markup/">Image Markup Tool

The UVic Image Markup Tool Project, .

software - and was implemented in XSLT and CSS; all the other features are achieved by using jQuery - plug-ins.

+ and was implemented in XSLT and CSS; all the other features are achieved by + using jQuery plug-ins.

In the text frame tool bar you can see three drop-down menus which are useful for choosing texts, specific folios, and edition levels, and an icon that triggers the search functionality. Again, the editor can modify the structure.xml file @@ -558,8 +578,10 @@ expected by the user. Essentially, we found that at least two of them were needed in order to make a functional search engine: free-text search and keyword highlighting. To implement them we looked at existing search engines and plug-ins programmed in the most - popular client-side web language: JavaScript. In the end, our search produced two - answers: Tipue Search and DOM manipulation.

+ popular client-side web language: JavaScript. In the + end, our search produced two answers: Tipue Search and DOM + manipulation.

Tipue Search

Tipue search

Tipue Search, @@ -568,10 +590,13 @@ collections of web pages. It can function both offline and online, and it does not necessarily require a web server or a server-side programming/query language (such as SQL, PHP, or Python) in order to work. While technically a plug-in, its architecture - is quite interesting and versatile: Tipue uses a combination of client-side JavaScript - for the actual bulk of the work, and JSON (or JavaScript object literal) for storing - the content. By accessing the data structure, this engine is able to search for a - relevant term and bring back the matches.

+ is quite interesting and versatile: Tipue uses a combination of client-side JavaScript for the actual bulk of the work, and JSON (or JavaScript object literal) for storing the content. By + accessing the data structure, this engine is able to search for a relevant term and + bring back the matches.

Tipue Search operates in three modes: in Static mode, Tipue Search operates without a web server by accessing the contents stored in a specific file @@ -597,10 +622,11 @@ the Vercelli Book.

-

These files are produced by including two templates in the overall flow of XSLT - transformations that extract crucial data from the TEI documents and format them with - JSON syntax. The procedure complements well the entire logic of automatic - self-generation that characterizes EVT.

+

These files are produced by including two templates in the overall flow of XSLT transformations that extract crucial data from the TEI documents and + format them with JSON syntax. The procedure complements well the entire logic of + automatic self-generation that characterizes EVT.
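Such a template can be sketched as follows; the pages/title/text/url layout follows Tipue's content file format, while the folio-level div and the output details are assumptions for the example (real code must also JSON-escape the extracted text):
  <xsl:template name="tipue-content">
    <xsl:text>var tipuesearch = {"pages": [</xsl:text>
    <xsl:for-each select="//div[@type='folio']">
      <xsl:text>{"title": "</xsl:text><xsl:value-of select="@n"/>
      <xsl:text>", "text": "</xsl:text><xsl:value-of select="normalize-space(.)"/>
      <xsl:text>", "url": "</xsl:text><xsl:value-of select="concat(@n, '.html')"/><xsl:text>"}</xsl:text>
      <xsl:if test="position() != last()"><xsl:text>,</xsl:text></xsl:if>
    </xsl:for-each>
    <xsl:text>]};</xsl:text>
  </xsl:template>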

After we managed to extract the correct data structure, we began to include the search functionality in EVT. By using the logic behind Tipue JSON mode, we implemented a trigger (in the form of a select tag) that loaded the desired JSON data

- - Keyword - Highlighting through DOM Manipulation + Keyword Highlighting through DOM Manipulation

The solution to keyword highlighting was found while searching many plug-ins that - deal with this very problem. All these plug-ins use JavaScript and DOM manipulation in - order to wrap the HTML text nodes that match the query with a specific tag (a span or - a user-defined tag) and a CSS class to manage the style of the highlighting. While - this implementation was very simple and self-explanatory, making use of simple - recursive functions on relevant HTML nodes has proved to be very difficult to apply to - the textual contents handled by EVT.

deal with this very problem. All these plug-ins use JavaScript and DOM manipulation in order to wrap the HTML text nodes that + match the query with a specific tag (a span or a user-defined tag) and a CSS class to + manage the style of the highlighting. While this implementation was very simple and + self-explanatory, making use of simple recursive functions on relevant HTML nodes, it + proved very difficult to apply to the textual contents handled by EVT.

HTML text within EVT is represented as a combination of text nodes and span elements. These spans are used to define the characteristics of the current selected edition. They contain both philological information about the inner workings of the @@ -636,12 +661,12 @@

This type of markup would not have constituted a problem if it had wrapped complete words, since the plug-ins could recursively explore its content and search for a matching term. In certain portions of the text, however, some letters are separated by - a span from the rest of the word (for example: <span>H</span>ello). - Since the plug-ins worked on the level of text nodes, a word split by spans would be - seen as two different text nodes (for example: H and - ello) and could not be matched by the query, making the - standard transversal algorithm of the plug-ins impossible to use without any further - substantial modifications.

a span from the rest of the word (for example: + <span>H</span>ello). Since the plug-ins worked on the level + of text nodes, a word split by spans would be seen as two different text nodes (for + example: H and ello) and could not be + matched by the query, making the standard traversal algorithm of the plug-ins + impossible to use without any further substantial modifications.

To solve this problem we recreated the traversal algorithm entirely, making it more intelligent and context-sensitive. In order to do so, we explored the text to be highlighted by using a map that kept track of all the spans and text nodes and their @@ -675,15 +700,16 @@ information about the image, but is placed inside a zone element, which defines two-dimensional areas within a surface, and is transcribed using one or more line elements.
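In schematic form (coordinates and content are placeholders):
  <sourceDoc>
    <surface ulx="0" uly="0" lrx="2000" lry="2800">
      <zone ulx="210" uly="240" lrx="1790" lry="410">
        <line>...</line>
        <line>...</line>
      </zone>
    </surface>
  </sourceDoc>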

-

Originally EVT could not handle this particular encoding method, since the XSLT - stylesheets could only process TEI XML documents encoded according to the traditional - transcription method. Since we think that this is a concrete need in many cases of - study (mainly epigraphical inscriptions, but also manuscripts, at least in some - specific cases), we recently added a new feature that will allow EVT to handle texts - encoded according to the embedded transcription method. This work was possible due to - a small grant awarded by EADH.

See EADH Small Grant: Call for Proposals, .

+

Originally EVT could not handle this particular encoding method, since the XSLT stylesheets could only process TEI XML documents encoded according to the + traditional transcription method. Since we think that this is a concrete need in many + cases of study (mainly epigraphical inscriptions, but also manuscripts, at least in + some specific cases), we recently added a new feature that will allow EVT to handle + texts encoded according to the embedded transcription method. This work was possible + due to a small grant awarded by EADH.

See EADH Small Grant: Call for + Proposals, .

Support for Critical Edition @@ -784,15 +810,21 @@ function.

Digital Lightbox has been developed using some of the latest web technologies available, such as HTML5, CSS3, the front-end framework Bootstrap,

Bootstrap, .

and the JavaScript (ECMAScript 6) - programming language, in combination with the jQuery library.

.

The code - architecture has been designed to be modular and easily extensible by other developers - or third parties: indeed, it has been released as open source software on - GitHub,

Digital Lightbox, .

and is - freely available to be downloaded, edited, and tinkered with.

+ target="http://getbootstrap.com/">Bootstrap,

Bootstrap, .

and the JavaScript (ECMAScript 6) programming language, in combination with the jQuery library.

.

The code architecture has been designed + to be modular and easily extensible by other developers or third parties: indeed, it has + been released as open source software on GitHub,

Digital + Lightbox, .

and is freely available to be downloaded, edited, and tinkered + with.

The Digital Lightbox represents a perfect complementary feature for the EVT project: a graphic-oriented tool to explore, visualize, and analyze digital images of manuscripts. While EVT provides a rich and usable interface to browse and study manuscript texts @@ -825,10 +857,11 @@ the Archivio della Pontificia Università Gregoriana.

The integration of EVT with another web framework used in the project, the eXist XML database, will require a very important change in how the software works: as mentioned above, everything from XSLT processing to browsing of the resulting website has been done on the client side, but the integration with eXist will require a move to the more complex client-server architecture. A version of EVT based on this architecture would present several advantages, not only the integration of a powerful XML database, but also the implementation of a full version of the Digital Lightbox. We will try to make the move as painless as possible and to preserve the basic simplicity and flexibility that has been a major feature of EVT so far. The client-only version will not be abandoned,
@@ -843,11 +876,12 @@
to the publishing of TEI-encoded digital editions, this software has grown to the point of being a potentially very useful tool for the TEI community: since it requires little configuration, and no knowledge of programming languages or web frameworks except for what is needed to apply an XSLT stylesheet, it represents a user-friendly method for producing image-based digital editions. Moreover, its client-only architecture makes it very easy to test the edition-building process (one has only to delete the output folders and start anew) and publish preliminary versions on the web (a shared folder on any cloud-based service such as Dropbox is all that is needed).

While EVT has been under development for 3–4 years, it was thanks to the work and focus required by the Digital Vercelli Book release at the end of 2013 that we now have a solid foundation on which to build new features and refine the existing ones. Some of the future
@@ -878,7 +912,8 @@
Roberto Rosselli Del Turco
Roberto Rosselli Del Turco, Julia Kenny, and Raffaele Masotti
Julia Kenny and Raffaele Masotti
Jacopo Pugliese
diff --git a/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml b/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml
index bbc747f..5fae689 100644
@@ -451,9 +452,9 @@
Représentant(s) de la France
Représentant(s) du Royaume-Uni
Représentant(s) de la République Fédérale d'Allemagne
Représentant(s) de la République Fédérale d'Allemagne, de la France et de l'Italie
Représentant(s) de la France et du Royaume-Uni
Représentant(s) de la délégation française
@@ -514,9 +515,9 @@
Maurice Faure
Représentant(s) du Royaume-Uni
Selwyn Lloyd
Représentant(s) de la République Fédérale d'Allemagne
Heinrich von Brentano

@@ -531,7 +532,7 @@

M. Faure souligne avec force que le succès de l'entreprise dépend de la volonté politique des gouvernements d'assurer une coopération effective. Les propositions de M. von Brentano constituant un pas important dans cette direction et il s'y rallie.

@@ -573,11 +574,11 @@ […]

La réponse de M. HEATH était : "Nous ne sommes pas juridiquement tenus d'autoriser l'inspection de ces dépôts, car ils ont été constitués dans le cadre de l'OTAN, et sont donc, strictement parlant, uniquement soumis à l’inspection de cette organisation."[…].

[…]
Excerpt from a discourse within a discourse. WEU-Diplo: Note Agence pour le contrôle des armements. Division III. Note à l’intention du
@@ -611,28 +612,32 @@
dedicated stylesheet was created for the transformation of the GATE tags (such as Person, Location, Organization, and Date) into corresponding TEI tags (name with the attribute type, and date, respectively). A few examples of name type="person" and name type="org" are presented in the previous examples. Further transformation was necessary during the importing of the annotated corpus into the software for textual analysis (see ).

Decoding

The so-called decoding phase, for corpus analysis and interpretation, consisted of importing and processing the TEI XML annotated documents within a specialized platform, TXM (Heiden 2010), that allows the analysis of a large body of texts by means of lexicometrical and statistical methods. The previous encoding served as a basis for discerning or grouping together different types of semantic or structural elements needed for analysis.

Importing

Since TXM supports XSLT transformation at the moment of import (XML/w+CSV option), an XSLT stylesheet was created to accommodate particular formats or conversions required by the software. Therefore, it was not necessary to store different versions of the corpus, one for TXM analysis, the other for Web publication.

First, a lowercase conversion
All the examples of analysis presented in the paper will consequently be displayed in lowercase.
was provided for consistency reasons relating to the varying ways of capitalizing (e.g., Comité militaire
@@ -648,7 +653,8 @@
in the analysis (in the example, constituant instead of cons and tituant as the software would treat it without a w tag).
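For illustration, an import-time stylesheet of the kind described here could look like the following minimal sketch (written for this discussion, not the project's actual stylesheet): it lowercases the content of w elements so that TXM treats differently capitalized occurrences as one form.

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- Identity transform: copy everything unchanged by default. -->
      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
      </xsl:template>
      <!-- Lowercase the content of w elements, so that, e.g.,
           "Comité" and "comité" are counted as the same form. -->
      <xsl:template match="tei:w/text()">
        <xsl:value-of select="lower-case(.)"/>
      </xsl:template>
    </xsl:stylesheet>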

-

Part-of-speech tagging via the TreeTagger module integrated into TXM was also applied to the corpus at import in order to allow lemma and part-of-speech statistics and queries.

@@ -656,19 +662,22 @@ Analysis

The annotated corpus (only the content inside text tags, without metadata) contained 6,512 items (unique words) with 76,558 occurrences in the text. A subcorpus based on the @lang="fr" property (an attribute of the text element) was created in TXM for the analysis of the documents’ content, excluding the data from the teiHeader. The whole corpus (teiHeader included) comprised 7,015 items and 105,897 occurrences.

Partitioning

Given the identification and annotation of different semantic and structural elements in the encoding phase, TXM allows the creation of partitions (Textométrie 2014, section Construire une partition) by selecting a Structure unit and a corresponding Property (i.e., an XML element and one of its attributes) from the list of structural units and properties recognized by the software for the imported corpus.

For instance, as fragments of discourse spread throughout the documents were assigned to particular countries or institution representatives (), a partition was created based on the said element
@@ -700,14 +709,16 @@
(Textométrie 2014, section Spécificités) allows a comparison of the vocabularies: what is specific (either as overuse or deficit) in a part of a partition, as compared with the parent corpus and a certain threshold. In TXM, it is called the banality threshold, fixed by default at the value of +/- 2.0 for positive and negative specificities scores, respectively. In , the banality thresholds are rendered by (red) horizontal lines. The feature is based on a probabilistic model (Lafon 1980) used in TXM to compute a log10 specificity score of a word property (e.g., word form, lemma, or part of speech) for a given part. In the analysis of the WEU-Diplo corpus, it was assumed that the specificity score may draw attention to forms specific to the discourse of different country/institutional representatives as compared with the whole. shows an extract from the
@@ -807,13 +818,13 @@
said_corresp partition Representative/lemmas
repres_aca | repres_cons_weu | repres_sac | repres_deleg_fr | repres_deleg_uk | repres_fr | repres_uk
améri(cain)(que du
@@ -1079,8 +1090,9 @@
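Concretely, the partition rests on markup of the shape already seen in the Heath excerpt above, where a said element carries a corresp attribute naming the representative group; TXM then uses said as the structure unit and corresp as its property. The fragment below merely restates that earlier example in isolation:

    <said who="#heath" corresp="#repres_cons_weu">Nous ne sommes pas juridiquement
      tenus d'autoriser l'inspection de ces dépôts, car ils ont été constitués
      dans le cadre de l'OTAN ...</said>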

Results Discussion

The TEI XML encoding and TXM analysis related to the research questions on arms design, production, and control within the WEU have enabled a set of more or less predictable results, the latter needing further examination. Among the former, we can mention those referring to the SAC and ACA roles. Arms production and control was a major part of WEU’s work, despite its somewhat mixed record in this area. Protocol IV
@@ -1108,20 +1120,23 @@
Foreign Office, Western Organisations and Co-ordination Department and Foreign and Commonwealth Office, Western Organisations Department: Registered Files (W and WD Series). Western European Union (WEU). Future of Standing Armaments Committee of Western European Union. 01/01/1975–31/12/1975, FCO 41/1749 (Former Reference Dep: WDU 11/1 PART B). The interpretation of the less predictable results is not straightforward, since they may have been determined by an under- or overrepresentation of certain elements in the discourse, based on the selection of documents. The same could be said about the negative specificity score for the control group in repres_fr’s, but this finding is also likely to be associated with the assertion of France’s resistance to submitting its nuclear stocks to the ACA’s controls and the need to avoid making statements on the subject. Since the size of the corpus was relatively small, and not all the information for the documents on the selected topic and their types in the WEU archive was available, extrapolations about the TXM probabilistic model and the observed linguistic patterns at a larger scale than the pilot sample should be avoided at this stage.

-

The TEI XML combined with the TXM analysis tools can also reveal inconsistencies which may draw attention to the need for further encoding and testing additional documents. On the other hand, it is also important to take into consideration how far (or how well) the researcher/user knows the content of the documents, as a lack of
@@ -1210,14 +1225,15 @@
Chicago: University of Chicago Press. Excerpt: .
Heiden, Serge. 2010. The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, edited by Ryo Otoguro, Kiyoshi Ishikawa, Hiroshi Umemoto, Kei Yoshimoto, and Yasunari Harada, 389–398. Tokyo: Institute for Digital Enhancement of Cognitive Development, Waseda University. Accessed July 24, 2017. .
Ihde, Don. 2003. More Material Hermeneutics. Paper presented at the meeting on
Pascaline Winand, 187–234. Brussels: PIE-Peter-Lang.
Textométrie. 2014. Manuel de TXM, Version 0.7. Accessed March 6, 2016. .
Thornborrow, Joanna Sarah. 2002. Power Talk: Language and Interaction in Institutional Discourse.
diff --git a/data/JTEI/9_2016-17/jtei-9-ciotti-source.xml b/data/JTEI/9_2016-17/jtei-9-ciotti-source.xml
index 573ba86..cb3c703 100644
@@ -478,7 +479,7 @@
rend="ordered"> one XML element on its own: e.g., abbr; an XML element/attribute couple or a compound of so-called Janus elements: e.g., <corr resp=> or ]]> one XML element in a given context: e.g., p in text vs.
@@ -650,7 +651,8 @@

In order to finalize the model from an LOD cloud perspective—as regards the collection of TEI-based documents—various methods will have to be explored, beginning with the creation of an RDF triple store by converting some pertinent elements of the refined TEI XML files into RDF through XSLT. Experiments in converting XML files into RDF have already been undertaken: a transformation to RDF has to create the URIs of its resources and connect them through the RDF triple structure consisting of subject, predicate, and object (rdf:description about a node id (e.g., through an ref) that we could manage in XSLT for transforming the ref value into a URI. This approach yields:
SUBJECT = rdf:description about a TEI element (the @ref value)
PREDICATE = an attribute of the element in the subject (for managing cross-references) or, simplest, the child element
OBJECT = literal (the content of the element)
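As a rough sketch of this subject–predicate–object mapping (an illustration only, not the project's actual conversion; the URI base and the use of persName as predicate follow the Vespasiano example discussed below):

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- Subject: a dereferenceable URI built from the xml:id;
           predicate: the child element (persName);
           object: the element's text content as a literal. -->
      <xsl:template match="tei:person">
        <rdf:Description rdf:about="http://www.person.it/about#{@xml:id}">
          <tei:persName>
            <xsl:value-of select="tei:persName"/>
          </tei:persName>
        </rdf:Description>
      </xsl:template>
    </xsl:stylesheet>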

An example of an XML TEI markup in a document:
@@ -676,7 +677,8 @@
That is: an entity (person) with a value (a fragment) referring to an xml:id (persona01) to be converted into a dereferenceable URI (e.g., http://www.person.it/about#persona01) through XSLT, a predicate corresponding to the child element (persname), and a literal as the object (Vespasiano). Another fundamental issue is the identification of pertinent authorities for the data matching (e.g., VIAF, Geonames, Worldcat, SNAP, or DBpedia). In order for the
@@ -737,8 +739,9 @@
. Breitling, Frank. 2009. A Standard Transformation from XML to RDF via XSLT. Astronomische Nachrichten 330(7): 755–60. doi:
@@ -301,24 +302,25 @@
can ultimately confirm its viability, even though results from early adopters like SARIT or Buddhist Stonesutras or experiments with EEBO-TCP
Early English Books Online eXist-db app, accessed February 11, 2016, .
are more than promising (see, for example, Wicentowski and Meier 2015). Second, and more important, the Processing Model covers only the document transformation aspects of an edition; building a working application on top of it still remains a significant challenge for the editorial teams, though general-purpose application frameworks, like html templating for XQuery applications on top of eXistdb, are already there to help with that process. Nevertheless, we believe that the Processing Model is a crucial step in the right direction, addressing the greatest challenge in the publication process, and that it stands a good chance of gaining more traction and becoming part of the infrastructure and recommendations maintained by the TEI Consortium. It will be practical to incorporate the Processing Model into widely used application frameworks, resulting in a promising technology stack that truly empowers editors, as the U.S. Department of State’s Office of the Historian’s recent adoption
U.S. Department of State, Office of the Historian, accessed February 11, 2016, .
of a Processing Model library for eXistdb has demonstrated very clearly (Wicentowski and Meier 2015). There is no reason why this exercise could not be successfully repeated for other XML database systems such as BaseX. Thriving infrastructure projects like TAPAS
TAPAS Project, accessed February 11, 2016,
@@ -71,13 +72,16 @@

The paper presents the database Cretan Institutional Inscriptions, which was created as part of a PhD research project carried out at the University of Venice Ca’ Foscari. The database, built using the EpiDoc Front-End Services (EFES) platform, collects the EpiDoc editions of six hundred inscriptions that shed light on the institutions of the political entities of Crete from the seventh to the first century BCE. The aim of the paper is to outline the main issues addressed during the creation of the database and the encoding of the inscriptions and to illustrate the core features of the database, with an emphasis on the advantages deriving from the combined use of the TEI-EpiDoc standard and of the EFES platform.

I would like to express my gratitude to my PhD supervisors, Claudia Antonetti and
@@ -119,13 +123,16 @@
standard, including a commentary focused on the institutional data offered by the document. The editions of these inscriptions, along with a collection of the most relevant literary sources, have been collected in the database Cretan Institutional Inscriptions, which I created using the EpiDoc Front-End Services (EFES) platform. To facilitate consulting the epigraphic records, the database also includes, in addition to the ancient sources, two catalogs providing information about the Cretan political entities and the institutional elements considered.

The aim of this paper is to illustrate the main issues tackled during the creation of the database and to examine the choices made, focusing on the advantages offered by the use of EpiDoc and EFES.

Cretan Epigraphy and Cretan Institutions
@@ -238,7 +245,9 @@
Crete or concerning Crete, as the documentary base of my study.

Towards the Creation of a Born-Digital Epigraphic Collection with EFES

Once the relevant material had been defined, another major issue that I had to face was to decide how to deal efficiently with it. While I was in the process of starting a more traditional and monographic study of Cretan institutions,
@@ -260,17 +269,20 @@
collection of editions of the previously selected six hundred inscriptions to creating it as a born-digital epigraphic collection because of another event that also happened in 2017: the appearance of a powerful new tool for digital epigraphy, EpiDoc Front-End Services (EFES).

GitHub repository, accessed July 21, 2021, .

Although I was already aware of the many benefits deriving from a semantic markup of the inscriptions,

On which see and .

what really persuaded me to adopt a TEI-based approach for the creation of my epigraphic editions was actually the great facilitation that EFES offered in using TEI-EpiDoc, which I will discuss in the following section.

The Benefits of Using EpiDoc and EFES

I was already familiar with the epigraphic subset of the TEI standard, EpiDoc,

EpiDoc: Epigraphic Documents in TEI XML, accessed July 21, 2021,
but to my knowledge the creation of proper indexes—before EFES—was almost impossible to achieve without the help of an IT expert.

Thus, despite the many benefits that EpiDoc encoding potentially offers, epigraphists might often be discouraged from adopting it by the amount of time that such an approach requires, combined with the fact that in many cases these benefits become tangible only at the end of the work, and only if one has IT support.

In light of these limitations, it is easy to understand how deeply the release of EFES has transformed the field of digital epigraphy. EFES, developed at the Institute of Classical Studies of the School of Advanced Study of the University of London as the epigraphic specialization of the Kiln platform,

New Digital Publishing Tool: EpiDoc Front-End Services, September 1, 2017, ; see also the Kiln GitHub repository, accessed July 21, 2021, .

is a platform that simplifies the creation and management of databases of inscriptions encoded following the EpiDoc Guidelines. More specifically, EFES was developed to make it easy for EpiDoc users to view a publishable form of their inscriptions, and to publish them online in a full-featured searchable database, by easily ingesting EpiDoc texts and providing formatting for their display and indexing through the EpiDoc reference XSLT stylesheets. The ease of configuration of the XSLT transformations, and the possibility of already having, during construction, an immediate front-end visualization of the desired final outcome of the TEI-EpiDoc marked-up documents, allow smooth creation of an epigraphic database even without a large team or in-depth IT skills. Beyond this, EFES is also remarkable for the ease of creation and display of the indexes of the various categories of marked-up terms, which significantly simplifies comparative analysis of the data under consideration. EFES is thus proving to be an extremely useful tool not only for publishing inscriptions online, but also for studying them before their publication or even without the intention of publishing them, especially when dealing with large collections of documents and data sets.

See Bodard and Yordanova (2020).

Some of these useful features of EFES are common to other existing tools, such as TEI Publisher,

Accessed July 21, 2021, .

TAPAS,

Accessed July 21, 2021, .

or Kiln itself, which is EFES’s direct ancestor. What makes EFES unique, however, is the fact that it is the only one of those tools to have been designed specifically for epigraphic purposes and to be deeply integrated with the EpiDoc Schema/Guidelines and with its reference stylesheets. Not only does it use, by default, the EpiDoc reference stylesheets for transforming the inscriptions and for indexing, it also comes with a set of default search facets and indexes that are specifically meant for epigraphic documents. The default facets include the findspot of the inscription, its place of origin, its current location, its support material, its object type, its document type, and the type of evidence of its date. The search/browse page, moreover, also includes a slider for filtering the inscriptions by date and a box for textual searches, which can be limited to the indexed forms of the terms. The default indexes include places, personal names (onomastics), identifiable persons (prosopography), divinities, institutions, words, lemmata, symbols, numerals, abbreviations, and uninterpreted text fragments. New facets and indexes can easily be added even without mastering XSLT, along the lines of the existing ones and by following the detailed instructions provided in the EFES Wiki documentation.

Accessed July 21, 2021, . Creation of new facets, last updated April 11, 2018: . Creation of new indexes, last updated May 27, 2020: .

Furthermore, EFES makes it possible to create an epigraphic concordance of the various editions of each inscription and to add information pages as TEI XML files (suitable for displaying both information on the database itself and potential additional accompanying information).

-

Against this background, the combined use of the EpiDoc encoding and of the EFES tool seemed to be a promising approach for the purposes of my research project, and so it was.

I initially aimed to create updated digital editions of the inscriptions mentioning Cretan institutional elements that could be used to facilitate a comparative analysis of the latter. The ability to generate and view the indexes of the mentioned
@@ -399,33 +425,38 @@
inscriptions in EpiDoc, totally met my needs, and helped me very much in the identification of recurring patterns. As I was expected to submit my doctoral thesis in PDF format, I also needed to convert the epigraphic editions into PDF, and by running EFES locally I have been able to view their transformed HTML versions in a browser and to naively copy and paste them into a Microsoft Word file.

I am very grateful to Pietro Maria Liuzzo for teaching me how to avoid this conversion step by using XSL-FO, which can be used to generate a PDF directly from the raw XML files. The use of XSL-FO, however, requires some additional skills that are not needed in the copy-and-paste-from-the-browser process.

Although I had not planned it from the beginning, EFES also proved to be useful in the (online) publication of the results of my research. The ease with which EFES allows the creation of a searchable epigraphic database, in fact, spontaneously led me to decide to publish it online once completed, making available not only the HTML editions—which can also be downloaded as printable PDFs—but also the raw XML files for reuse. The aim of the online publication, in fact, is to allow other researchers to query the epigraphic collection for their needs and to encourage the reuse of the dataset for unpredictable future digital humanities purposes.

Cretan Institutional Inscriptions: An Overview of the Database

The core of the EFES-based database Cretan Institutional Inscriptions consists of the EpiDoc editions of the previously selected six hundred inscriptions, which can be exported both in PDF and in their original XML format. Each edition is composed of an essential descriptive lemma; a bibliographic lemma, also including links to external online resources; a critical edition of the Greek text; a selective apparatus—where necessary—recording mainly the most significant readings or restorations affecting the institutional terms; and a commentary to contextualize the institutional elements mentioned in the document.

Thanks to the EpiDoc markup of the texts and of their metadata,

On which see .

the inscriptions can be browsed by applying one or more customized search facets, which
@@ -455,7 +486,8 @@
Political entities, Institutions, Literary sources, and Bibliographic references, have been added to the database as pages generated from TEI XML files, which could be natively included in EFES.

As mentioned above, the database also includes several thematic indexes listing the marked-up terms along with the references to the inscriptions in which they occur, divided into institutions, toponyms and ethnic adjectives, lemmata (both of @@ -475,20 +507,20 @@ .

but it proved to be particularly useful in the encoding of the Greek texts (in the div type="edition"), where it was strictly connected to my research questions. The EpiDoc elements, attributes and attribute values I chose to use inside the div type="edition" facilitated my analysis of the institutional elements mentioned in the documents, aiming at the identification, contextualization, and indexing of the institutional terms and of the attested individuals holding an office. A TEI-based approach proved to be especially effective for dealing with the challenging Cretan institutional records,

See above.

because, in particular, the possibility of using multiple attributes in the same element allowed me to extract the institutional occurrences as contextually as possible. In other words, it allowed me to identify and index not only the simple occurrence of the institutional terms, but also a set of connected details that were extremely valuable for examining and then classifying them according to several variables.

Let us now look in detail at the markup of the institutional terms. The EpiDoc Guidelines, in this respect, do not offer many details. The closest case is the one presented in the Titles, Offices, Political Posts, Honorifics,
@@ -602,12 +634,12 @@
/>. The honored individuals, foreign rulers, and theonyms I have marked up similarly to the individual holding an office, by using the persName element and, for honored individuals and foreign rulers, also a nested name nymRef="". Broadly following a consolidated EpiDoc practice, I have added a type attribute to each persName, with the values honoured for honored individuals, ruler for the rulers, and divine for the theonyms. The values attested and divine are recommended by the EpiDoc Guidelines v. 9.2, October 13, 2020, . The value ruler (instead of the emperor suggested by the EpiDoc Guidelines) has also been used by Inscriptions of Greek
@@ -631,9 +663,10 @@
(I.Cret. II 23 5).

-

Given the markup described above, EFES was able to generate detailed indexes having the appearance of rich tables, where each piece of information is displayed in a dedicated column and can easily be combined with the other ones at a glance.

In the most complex case, that of the institutions, the index displays for each occurrence the base form of the term both in transliteration and in Greek (under the headings Istituzione and Termine attestato, obtained from key of
@@ -669,16 +702,17 @@
An excerpt from the prosopographical index.

In addition to the more tabular institutional and prosopographical indexes, EFES facilitated the creation of other more traditional indexes, including the indexed terms and the references to the inscriptions that mention them. The encoding of the most significant words with w lemma="" led to the creation of a word index of relevant terms attested in the inscriptions. Similarly, an index of the attested toponyms and ethnic adjectives was generated from their encoding with placeName ref=""; an index of the attested divinities, nymphs, and heroes from their encoding with persName type="divine" key=""; and an index of personal names—of individuals holding an office, honored individuals, and foreign rulers—from their encoding with name nymRef="".

In total, the semantic markup has involved 8,162 lemmata (w), 4,353 institutional elements (rs), 2,633 toponyms or ethnic adjectives (placeName), 1,694 anthroponyms (name) and 1,651 @@ -687,9 +721,10 @@

Conclusions

In conclusion, I would like to emphasize how particularly efficient the combined use of EpiDoc and EFES has proven to be for the creation of a thematic database like Cretan Institutional Inscriptions. By collecting in a searchable database all the inscriptions pertaining to the Cretan institutions, records that were hitherto accessible only in a scattered way, Cretan Institutional Inscriptions is a new resource that can facilitate the finding, consultation, and reuse of these very heterogeneous documents, many of which offer further points of reflection only when
@@ -722,10 +757,12 @@
Bodard, Gabriel, and Polina Yordanova. 2020. Publication, Testing and Visualization with EFES: A Tool for All Stages of the EpiDoc XML Editing Process. Studia Universitatis Babeș-Bolyai Digitalia, no. 1: 17–35. doi:10.24193/subbdigitalia.2020.1.02.
Dobias-Lalou, Catherine. 2017. Inscriptions of Greek Cyrenaica, in collaboration with Alice
diff --git a/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml b/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml
index dbcdd65..d83c37b 100644
@@ -131,8 +131,10 @@
Bausi. In order to obtain a TEI file with an initial text transcription from manuscripts, to be published alongside the catalogue description of the manuscript itself, we have investigated a series of options, among which we have chosen to use the Transkribus software by READ Coop. Accessed February 2, 2022, .

The transcription matches and complements the cataloguing efforts of the project, which encodes in the teiHeader the description of the manuscript. Within
@@ -148,25 +150,26 @@
of features of the object described in msDesc, but, thinking of the work for an historical catalogue, that involves copying from the former cataloguer's transcription. Having a new transcription, based on autopsy or at least on images of the manuscript, would be preferable, and technology such as Transkribus allows one to obtain this transcription in an almost entirely automated way. Additionally, most of the internal referencing within a manuscript is done by indicating ranges of folios, and in TEI with locus. While a transcription, either in an historical catalogue or done by a researcher by hand, seldom records more than the folio ranges, automated techniques can identify text areas (columns, for example) and each line, and so have the necessary information to encode the pb, cb, and lb elements which would be needed to encode the structure of the manuscript, a tedious task when done manually. Having the structured text completed by the information about the layout, linked to the images, already in the source, and linking to it with the locus elements, allows one to point effectively to an exact section of the transcribed text. The syntax of the textual content of the attributes from, to, and target is defined by the project’s Guidelines, and so is the use of the above elements, which makes the references machine operable. This is the case for Beta maṣāḥǝft, which implements this experimentally, via the Distributed Text Services API Specifications. See Beta maṣāḥǝft Guidelines, accessed February 2, 2022, .
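A minimal sketch of the kind of structure and pointing described here; the folio numbers and IDs are invented for the example, and the exact from/to syntax is the one defined in the project's Guidelines:

    <!-- In the transcription: structural milestones produced automatically -->
    <pb n="5r"/>
    <cb n="a"/>
    <lb n="1"/>

    <!-- In the msDesc: a reference to an exact span of the transcribed text -->
    <locus from="5ra1" to="5rb12">ff. 5ra1–5rb12</locus>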

The output of any automated process will never be perfect. However, we have
@@ -198,8 +201,10 @@
Oriental tradition, this is a vital support for text identification.

The following steps have been taken to carry out an investigation of the possibilities for the automated production of text transcriptions based on images of manuscripts, before we opted for Transkribus and its integration in the workflow to make texts available in the Beta maṣāḥǝft research environment.

Related works
@@ -286,7 +291,8 @@
one script.

Transkribus

This software is freely accessible and has a subscription model based on credits. The platform was created within the framework of the EU projects tranScriptorium and READ (Recognition and Enrichment of Archival Documents)
@@ -298,8 +304,10 @@
platform. The Pattern Recognition and Human Language Technology (PRHLT) group of the Universitat Politècnica de València and the CITlab group of the University of Rostock should be mentioned in particular.

-

Transkribus comes as an expert tool in its downloadable version and its online version (accessed February 2, 2022, ), and it allows one to upload images privately and perform the transcription task without knowledge of neural networks, just by following concise and easily retrievable documentation.

@@ -328,13 +336,16 @@
training data. However, there is no organized and freely available dataset for Ethiopic handwriting character recognition.

Thus, the first stage for developing a model was gathering the data and preparing an initial dataset. Also for this aspect, Transkribus proved superior to all other options, offering support also for this step. Colleagues whom we called on to contribute could be added to a collection, share their images without publishing them, and add their transcriptions in the tool with a very mild learning curve.

Within Transkribus we have trained a model called Manuscripts from Ethiopia and Eritrea in Classical Ethiopic (Gǝʿǝz).

See, accessed February 2, 2022, .

Checked transcriptions for the training set have been kindly provided by

@@ -395,11 +406,15 @@
set.

Training a model in Transkribus

Gathering data to train an HTR model in Transkribus was not easy. Researchers were directly asked to contribute images of which they had already done the correct transcription. Sets of images with the related transcriptions were thus obtained thanks to the generosity of the contributors listed above.

As stated earlier, we have trained a generic model using various styles and manuscripts. The simple fact of having the images and the transcriptions was not enough, of course. These needed to be cleaned up, at least for what concerns the file
@@ -421,9 +436,10 @@
enter more of the available transcription by hand, as discussed above, than to wait for the available time of the colleagues to fix the work of the machine, since we intended to train the model again. After three months with a full-time dedicated person, we had more than 50k words in the Transkribus expert tool, and we could train a model which could be made public, since this is the unofficial threshold for making a model available to everyone.

The features of the final model can be seen in .

@@ -440,18 +456,24 @@ Once the model is publicly available eventually anyone will be able to do so.

- Adding transcriptions to Beta maṣāḥǝft from Transkribus + Adding transcriptions to Beta maṣāḥǝft from Transkribus

Even if a user has already worked through each page of a manuscript to produce a transcription, doing it again with Transkribus and checking it has many advantages, chiefly the alignment of the text regions and lines on the base image to the transcription.

Guidelines are provided for these steps to the users in the project Guidelines, accessed February 2, 2022, .

With the transcribed images, either by hand with the help of the tool, or using the - HTR model, the export functionalities of the Transkribus tool, allow to download a - TEI encoded version of this transcription where we encourage users to use Line Breaks - (lb) instead of l and preserve the coordinates of the boxes.

+ HTR model, the export functionalities of the Transkribus tool, allow to download a TEI encoded version + of this transcription where we encourage users to use Line Breaks (lb) + instead of l and preserve the coordinates of the boxes.

This TEI file contains all the aligned transcription, links between the regions of the image, and the text. It has however to follow the structure of the set of images. If you transcribed images, for example, of openings, logically you will have a page @@ -462,8 +484,11 @@ manuscript and not of the image set. Most of this can be fixed by preparing the image set accurately, but we assume in most real-life use cases this will not be the case.

-

We have then prepared a bespoke XSLT transformation which can be used to transform the rich TEI from Transkribus, called transkribus2Beta maṣāḥǝft.xsl. This transformation, given a few parameters, restructures the TEI to fit the project requirements. The needed parameters are: the
@@ -485,23 +510,24 @@
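The overall shape of such a transformation can be sketched as follows; the parameter names below are placeholders invented for the illustration, and the real ones are those documented with the stylesheet itself:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Placeholder parameters, e.g., the manuscript identifier and
           the folio on which the image set begins. -->
      <xsl:param name="ms-id"/>
      <xsl:param name="start-folio"/>
      <!-- Templates would restructure the Transkribus TEI here,
           e.g., renumbering pb elements relative to $start-folio. -->
    </xsl:stylesheet>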

Conclusions -

Working with Transkribus for the Beta maṣāḥǝft project gives the community of users a - way to support the process of transcribing to the text on source manuscripts without - typing it down. This is not intended to substitute the work of the editor of a text, - but to support it, producing a transcription that still needs a lot of care for its - content and encoding, but also comes with a lot of added value, like the precise - alignment of the text to the image set and its encoding in the TEI. Files thus - obtained are huge and not as easy to maintain in a database or edit directly. - However, even if the text of the transcription is still unchecked and thus subject to - at least the percentage of the error the model provides, several benefits become - immediately available to the users, both encoders and users of the web application - hosting the texts. Encoders can point to a part of the transcription using - locus and avoid keying in the text from the transcription of a cataloguer - in the teiHeader. Similarly, a user of the application can identify an - unknown text using the functionality of the search index, which is capable of - performing fuzzy searches which will return results also where a query term is - partially different from the matching result, e.g., in case this contains an error - originated from the automated transcription process.

+

Working with Transkribus for the Beta maṣāḥǝft project gives the community of users a way to support the process of transcribing the text of source manuscripts without typing it down. This is not intended to substitute the work of the editor of a text, but to support it, producing a transcription that still needs a lot of care for its content and encoding, but also comes with a lot of added value, like the precise alignment of the text to the image set and its encoding in the TEI. Files thus obtained are huge and not as easy to maintain in a database or edit directly. However, even if the text of the transcription is still unchecked and thus subject to at least the percentage of error the model provides, several benefits become immediately available to the users, both encoders and users of the web application hosting the texts. Encoders can point to a part of the transcription using locus and avoid keying in the text from the transcription of a cataloguer in the teiHeader. Similarly, a user of the application can identify an unknown text using the functionality of the search index, which is capable of performing fuzzy searches which will return results also where a query term is partially different from the matching result, e.g., in case this contains an error originating from the automated transcription process.

With this process started, the model publicly available, and thousands of images of manuscripts, we are working toward more transcriptions, more text distributed as TEI on the web, and more collaboration to improve each of these aspects.

@@ -551,7 +577,7 @@
Hadi Samer Jomaa. 2019. Handwritten Amharic Character Recognition Using a Convolutional Neural Network. .
Muehlberger, Guenter, Louise Seaward, Melissa Terras, Sofia Ares Oliveira,
@@ -580,11 +606,12 @@
Vidal, Johanna Walcher, Max Weidemann, Herbert Wurster, and Konstantinos Zagoris. 2019. Transforming Scholarship in the Archives through Handwritten Text Recognition: Transkribus as a Case Study. Journal of Documentation 75 (5): 954–976. doi:10.1108/JD-07-2018-0114.
Andrews, Tara L., and Caroline Macé, eds. 2014. Analysis of Ancient and Medieval Texts and Manuscripts: Digital Approaches. Lectio Studies in the
diff --git a/data/JTEI/rolling_2022/jtei-teilex-207-source.xml b/data/JTEI/rolling_2022/jtei-teilex-207-source.xml
index bfaa354..f3d301a 100644
@@ -182,8 +183,9 @@
2018), an initiative launched in 2016 under the auspices of the DARIAH Working Group on Lexical Resources, which aims to define a pivot format for the integration and querying of heterogeneous TEI-based lexical resources. See the project’s GitHub repository, accessed June 17, 2022, .

The scope of our proposal covers the usage of the following concepts central to etymological description:

cit with type for complex descriptions of linguistic signs and their properties; the two most common usages in the context of etymological representation are etymons (cit[type="etymon"]) and cognates (cit[type="cognate"]). For a complete list of customized type values on cit in TEI Lex-0, see Tasovac et al. 2018, sec. 12.1.19. We follow here the recent developments related to ISO standard 24613–3 (see Khan and …).
lang for language names.
bibl and biblStruct for (complete) bibliographical references presented inline.
ref type="bibl" for pointers to bibliographical entries stored elsewhere, which may also contain a target attribute when the bibliographic description is available.
As an option depending on editorial practices, seg type="desc" for spans of prose that do not represent any of the information types described above.
note for editorial notes that are not part of the actual etymological description (see the previous discussion concerning seg type="desc").
lbl to mark up short intertwining descriptive or connecting markers, particularly in cases of cross-references (e.g., cf. and see).
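A minimal sketch of how these elements can combine inside an etym (the Latin material, the connecting label, and the bibliographic xml:id are invented for illustration):

    <etym>
      <lbl>from</lbl>                                  <!-- connecting marker -->
      <cit type="etymon">
        <lang>lat.</lang>                              <!-- language name -->
        <form><orth>exemplum</orth></form>
      </cit>
      <seg type="desc">with later narrowing of the sense</seg>  <!-- free prose, optional -->
      <ref type="bibl" target="#kluge1975"/>           <!-- pointer to a bibliography entry declared elsewhere -->
    </etym>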
Basic Components of Etymons, Related Forms, and Other Components of Etymologies

Other than seg type="desc", bibl, date, and note, the rest of the most important components of an etymology, which are described in the following sections, are encoded as children of a typed cit element for describing etymons (type="etymon") or cognates (type="cognate"), both of which are discussed in detail below.

The element cit can contain:

form for describing the actual form corresponding to the intended etymon or cognate.
[…] see Crist (2005) for an in-depth discussion of such possible relations.
ref type="bibl" with a target attribute for references to bibliographic entries described elsewhere in the encompassing document, and possibly bibl as an alternative, when no central bibliographical management is anticipated for the current dictionary.
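A sketch of such a cit using the children named above (the word, gloss, and bibliographic pointer are invented; the def child is an assumption about project practice, not a requirement):

    <cit type="etymon">
      <lang>lat.</lang>                            <!-- language of the etymon -->
      <form><orth>aqua</orth></form>               <!-- the etymon form -->
      <def xml:lang="en">water</def>               <!-- semantic description -->
      <ref type="bibl" target="#devaan2008"/>      <!-- pointer in place of an inline bibl -->
    </cit>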

As can be seen already, and in continuity with Bowers and Romary (2017), the TEI Lex-0 recommendation departs […] the data itself. Minimal adherence to TEI Lex-0 Etym requires only that the data be encoded using the elements described above: all text content must be wrapped in the particular element(s) specified for its data type, with seg type="desc" remaining an option depending on editorial practices. Optionally, users can include multiple layered etym elements, which may also be typed. In the sections below we describe each basic possibility, its uses, and the specifics of its encoding.

In an example from Kluge's etymological dictionary of German (Kluge 1975), we demonstrate the minimal encoding of the entry components.

[…] lat. introitus […] of the Latin introitus, which contains the source language. Additionally, the presence of the Middle High German language (<lang>mhd.</lang>) would enable researchers to infer the process of inheritance into Modern German.
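Pulling the quoted fragments together, the minimal encoding might look like this (a speculative reconstruction: only lat. introitus and the mhd. label are attested above, so the Middle High German form is left open):

    <etym>
      <!-- the bare language label already licenses the inheritance inference -->
      <lang>mhd.</lang> <cit type="etymon"><form><orth>…</orth></form></cit>,
      <lang>lat.</lang> <cit type="etymon"><form><orth>introitus</orth></form></cit>
    </etym>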

+ (<lang>mhd.</lang>) would enable researchers to infer + the process of inheritance into Modern German.

In any given project where terminology is consistent,In the case of data sets (original or legacy) that do not use consistent terminology, variation in the terminology should be normalized to allow for maximally systematic search @@ -647,7 +650,7 @@ Use of lbl, date, bibl, and - seg type="desc" with discontinuous prose. (Source: seg type="desc" with discontinuous prose. (Source: Kluge 1975) @@ -659,10 +662,11 @@ often represented by a form and which may include other information typical of any lexical entry: for example, a language name, grammatical properties, usage descriptions, semantic descriptions, or bibliographic sources. Etymons are encoded in - cit type="etymon" and are used analogously to the organization of - entry both conceptually and structurally as shown by the side-by-side - comparison of Entry structure and contents () and basic etymon ().

+ cit type="etymon" and are used analogously to the + organization of entry both conceptually and structurally as shown by the + side-by-side comparison of Entry structure and contents () and basic etymon ().

@@ -731,7 +735,7 @@ description with no form. This is possible because in cases of polysemy, the form of the new meaning/lexical item remains the same as the headword of the entry. In the encoding in , the corresp - attribute added to the cit type="etymon" points to the + attribute added to the cit type="etymon" points to the xml:id value of the source sense. shows the etymon with only sense change; shows the entry to which the etymon is @@ -756,7 +760,7 @@ LM The front of (sth). - nuu ve'e + nuu ve'e the front of the house @@ -916,9 +920,10 @@ source. The structure of a basic representation of cognates mirrors that of etymons and uses the same cit structure, with the difference that the value of type should be cognate. Note that in , ref type="bibl" is used - instead of bibl because in the project from which the examples are taken - all bibliographical sources are listed in the header with xml:ids.


[…] source etymology as a set or list. Often these would have some kind of referential function word or abbreviation: for example, cf. … In such cases it may be desirable to present the list of cognates as the source intended, and thus group them in a single wrapper cit type="cognateSet".

+ them in a single wrapper cit type="cognateSet".

In the case of a cognate set where there is a referential function word or abbreviation, it should be tagged with lbl and included as a child of - cit type="cognateSet", preceding the etymons (see cit type="cognateSet", preceding the etymons (see ). Where there is a common bibliographic - source, bibl or ref type="bibl" (if the bibliographic sources - are already declared elsewhere) should be a child of the cit - type="cognateSet" and placed after the given forms (see bibl or ref type="bibl" (if the + bibliographic sources are already declared elsewhere) should be a child of the + cit type="cognateSet" and placed after the given forms (see ).

… Derivatives: amāritūdō 'bitterness' (Varro+), amāror [m.] 'bitter taste' (Lucr.+).

In this first case, the entry is for the Latin arcessō, -ere / accersō, -ere; within the etymology section of that entry there are references to the two lemma variants. The cross-references are encoded in xr type="crossReference" xml:lang="la" with ref type="entry". This format would also apply to external cross-references. In either case, the editors would also have the option of including a pointer to the given internal or external form(s) with the target attribute on ref.
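Following that description, a minimal sketch (the target value is invented; it stands for the optional pointer mentioned above):

    <xr type="crossReference" xml:lang="la">
      <lbl>cf.</lbl>
      <ref type="entry" target="#accerso">accersō, -ere</ref>
    </xr>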

[…] (next can be used to denote temporal sequences of referenced forms). In such examples as the following, the use of xr encodes an important conceptual distinction in the data, as it allows cit type="etymon" to be reserved for the form(s) in the given entry.


… Nussbaum 2007b gives two more arguments for regarding accerso as original: the noun dorsum → dossum …
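Assuming the dorsum → dossum pair is the kind of sequence the next mechanism would chain together, one hypothetical encoding (the type value and xml:ids are invented):

    <xr type="crossReference">
      <ref type="form" xml:id="f1" next="#f2">dorsum</ref>
      <ref type="form" xml:id="f2">dossum</ref>
    </xr>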

Finally, we have an example in which there is a cross-reference to a sense of an external entry. This is also encoded as xr type="crossReference" but differs in that it is a reference to the sense of a given entry; thus we use ref type="sense" and the embedded gloss within, which needs to have the xml:lang to distinguish it from the value declared at the xr level. In this case there is also a form (included here in a separate ref type="sense"); however, it is also possible that a cross-reference to a sense could occur without an accompanying form.

… a verb in -cessō meaning 'go get' would be favoured by its semantic neighbours …
(Cross-reference to a sense from an external entry.)
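A sketch matching that description (the target value is invented; the gloss's xml:lang overrides the Latin declared on xr):

    <xr type="crossReference" xml:lang="la">
      <ref type="sense">-cessō</ref>               <!-- the accompanying form, in its own ref -->
      <ref type="sense" target="#cesso-sense">
        <gloss xml:lang="en">go get</gloss>
      </ref>
    </xr>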


In the following examples, we have tried to illustrate interesting cases of etymological processes that show how TEI Lex-0 Etym can seamlessly take into account a variety of situations. All examples have been validated and included in the TEI Lex-0 GitHub environment.

Embedded Senses, Metaphor, and Compounding

[…] the Mixtepec-Mixtec TEI dictionary (Bowers 2020), in which the lemma form xini ve'e is a compound with one component that is metaphorical in nature. The portion of the etymology that is metaphorical (etym type="metaphor") is embedded within the etym type="compounding", and, as it is relevant to the process of metaphor, within the etym type="metaphor" is the domain (usg type="domain").
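A structural sketch of that nesting (the component forms come from the surrounding discussion; the domain value and all other detail are left open):

    <etym type="compounding">
      <etym type="metaphor">                      <!-- metaphorical component -->
        <usg type="domain">…</usg>                <!-- domain relevant to the metaphor -->
        <cit type="etymon"><form><orth>xini</orth></form></cit>
      </etym>
      <cit type="etymon"><form><orth>ve'e</orth></form></cit>  <!-- literal component -->
    </etym>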

ve'e: 'house' (en), 'casa' (es)

… Sabellic
, since Latin does not have this word in its lexicon. For a word only occurring in glosses, this is of course possible. Others have proposed an etymology *ad-arti-, with intervocalic d becoming l; the spelling allers would then be analogical to sollers. (In the underlying encoding, the alternating d and l are linked as a sequence via xml:id="c1" and next="#c2".)