diff --git a/.github/workflows/schema.yml b/.github/workflows/schema.yml index 627a11fc..a911337f 100644 --- a/.github/workflows/schema.yml +++ b/.github/workflows/schema.yml @@ -29,6 +29,8 @@ jobs: git config --local user.name "GitHub Action" git add schema/tei_software_annotation.xml git add schema/tei_software_annotation.rng + git add schema/tei_jtei_annotated.odd + git add schema/tei_jtei_annotated.rng git commit -m "Add updated odd and generated rng" - name: Push changes uses: ad-m/github-push-action@master diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml index 9ec8142f..c62b4c2b 100644 --- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml @@ -211,9 +211,10 @@ target="http://www.deutschestextarchiv.de/doku/software#cab"/>. as well as collaborative text correction and annotationSee DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv - (Collaborative Quality Assurance within the DTA), accessed January 28, 2017, . On the process of + level="a">DTAQ: Kollaborative Qualitätssicherung im Deutschen Textarchiv + (Collaborative Quality Assurance within the DTA), accessed January 28, 2017, . On the process of quality assurance in the DTA, see, for example, Haaf, Wiegand, and Geyken 2013.) is a matter of supporting scholarly projects in their usage of the DTA infrastructure, which is part of the DTA’s mission. Second, @@ -273,7 +274,8 @@ Since June 2014, nine complete volumes with a total of more than 3,500 manuscript pages have been manually transcribed, annotated in TEI XML, and published via the DTA infrastructure. Most of these manuscripts were keyed manually by a vendor and published at - an early stage in the web-based quality assurance platform DTAQ. There, the transcription + an early stage in the web-based quality assurance platform DTAQ. There, the transcription as well as the annotation of each document was checked and corrected, if necessary; DTAQ also provided the means to add additional markup, such as the tagging of person names (persName), directly at page level. After the process of quality control has @@ -1210,7 +1212,7 @@ corpora. Our primary goal is to be as inclusive as possible, allowing for other projects to benefit from our resources (i.e., our comprehensive guidelines and documentation as well as the technical infrastructure that includes Schemas, ODDs, and XSLT scripts) and + xml:id="R1" target="#xslt"/>XSLT scripts) and contribute to our corpora. We also want to ensure interoperability of all data within the DTA corpora. The underlying TEI format has to be continuously maintained and adapted to new necessities with these two premises in mind.
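Note on the pattern applied throughout the data/ hunks below: it is the same small change everywhere — an empty pointer element with a per-file identifier (xml:id="R1", "R2", …) and a target naming a canonical tool entry is inserted immediately before each software mention. A minimal sketch of one such change, reconstructed from the attribute fragments visible in the hunks; the element name itself is flattened out of this diff, so ref (which does carry target) is an assumption:

    <!-- before: plain mention -->
    ... the technical infrastructure that includes Schemas, ODDs, and XSLT scripts ...

    <!-- after: an empty software pointer precedes the mention -->
    ... the technical infrastructure that includes Schemas, ODDs, and
    <ref type="software" xml:id="R1" target="#xslt"/>XSLT scripts ...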

diff --git a/data/JTEI/10_2016-19/jtei-10-romary-source.xml b/data/JTEI/10_2016-19/jtei-10-romary-source.xml index 3b75d2b6..bfb0c21e 100644 --- a/data/JTEI/10_2016-19/jtei-10-romary-source.xml +++ b/data/JTEI/10_2016-19/jtei-10-romary-source.xml @@ -645,8 +645,8 @@ available at . In our proposal, the etym element has to be made recursive in order to allow the fine-grained representations we propose here. The corresponding ODD customization, together with - reference examples, is available on GitHub. and the + reference examples, is available on GitHub. and the fact that a change occurred within the contemporary lexicon (as opposed to its parent language) is indicated by means of xml:lang on the source form.There may also be cases in which it is unknown whether a given etymological process occurred @@ -768,8 +768,8 @@ text.The interested reader may ponder here the possibility to also encode scripts by means of the notation attribute instead of using a cluttering of language subtags on xml:lang. For more on this issue, see the proposal in - the TEI GitHub (GitHub (). This is why we have extended the notation attribute to orth in order to allow for better representation of both language identification and the orthographic content. With this double mechanism, we intend to @@ -987,7 +987,7 @@

The dateThe element date as a child of cit is another example which does not adhere to the current TEI standards. We have allowed this within our ODD document. A feature request proposal will be made on the GitHub page and this feature may or may not appear in future versions of the TEI Guidelines. element is listed within each etymon block; the values of attributes notBefore and notAfter specify the range of time @@ -1486,8 +1486,10 @@ extent of knowledge that is truly necessary to create an accurate model of metaphorical processes. In order to do this, it is necessary to make use of one or more ontologies, which could be locally defined within a project, and of external linked open data sources - such as DBpedia and Wikidata, or some combination thereof. Within + such as DBpedia and Wikidata, or some combination thereof. Within TEI dictionary markup, URIs for existing ontological entries can be referenced in the sense, usg, and ref elements as the value of the attribute corresp.

@@ -1496,7 +1498,8 @@ reference to the source entry’s unique identifier (if such an entry exists within the dataset). In such cases, the etymon pointing to the source entry can be assumed to inherit the source’s domain and sense information, and this information can be automatically - extracted with a fairly simple XSLT program; thus the encoders may choose to leave some or + extracted with a fairly simple XSLT program; thus the encoders may choose to leave some or all of this information out of the etymon section. However, in the case that the dataset does not actually have entries for the source terms, or the encoder wants to be explicit in all aspects of the etymology, as mentioned above, the source domain and the @@ -1556,7 +1559,8 @@ type="metonymy") and the etymon (cit type="etymon") the source term’s URI is referenced in oRef and pRef as the value of corresp (@corresp="#animal").
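The corresp mechanism described here is easier to see in miniature. A hedged sketch of an etymon block of the kind the surrounding hunks discuss — the lexeme and most values are placeholders; only #animal, the DBpedia horse URI, and the 1517 date come from the text above and below:

    <entry>
      <sense corresp="http://dbpedia.org/resource/Horse">
        <cit type="metonymy">
          <oRef corresp="#animal">…</oRef>
        </cit>
      </sense>
      <etym>
        <cit type="etymon">
          <oRef corresp="#animal">…</oRef>
          <!-- earliest attestation of the current usage -->
          <date notBefore="1517"/>
        </cit>
      </etym>
    </entry>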

-

In sense, the URI corresponding to the DBpedia entry for horse is the +

In sense, the URI corresponding to the DBpedia entry for horse is the value for the attribute corresp. Additionally, the date notBefore="…" element–attribute pairing is used to specify that the term has only been used for the horse since 1517 at maximum (corresponding to the first Spanish @@ -2485,8 +2489,8 @@ Problematic and Unresolved Issues

For the issues regarded as the most fundamentally important to creating a dynamic and sustainable model for both etymology and general lexicographic markup in TEI, we have - submitted formal requests for changes to the TEI GitHub, and will continue to + submitted formal requests for changes to the TEI GitHub, and will continue to submit change requests as needed. While this work represents a large step in the right direction for those looking for means of representing etymological information, there are still a number of unresolved issues that will need to be addressed. These remaining issues diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml index 0b87d585..8e6a2bdf 100644 --- a/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml +++ b/data/JTEI/11_2019-20/jtei-cc-ra-bermudez-sabel-137-source.xml @@ -110,10 +110,11 @@ ways in which the variant taxonomy may be linked to the body of the edition.

Although this paper is TEI-centered, other XML technologies will be mentioned. includes a brief commentary on using XSLT + type="software" xml:id="R1" target="#xslt"/>XSLT to transform a TEI-conformant definition of constraints into schema rules. However, the greatest attention to an additional technology is in , which discusses the use of XQuery to retrieve particular + target="#analyses"/>, which discusses the use of XQuery to retrieve particular loci critici and to deploy quantitative analyses.

@@ -211,13 +212,14 @@ neutralized.This statement is especially significant when dealing with corpora that have been compiled over a long period of time. As is clearly explained in the introduction to the Helsinki Corpus that Irma Taavitsainen and Päivi Pahta prepared for - the Corpus Resource Database (CoRD) (Placing the Helsinki Corpus Middle English Section Introduction into - Context, ): The idea of basing corpus texts directly on + the Corpus Resource Database (CoRD) (Placing the Helsinki Corpus Middle English Section Introduction into + Context, ): The idea of basing corpus texts directly on manuscript sources has been presented more recently The principles of preparing manuscript texts for print have undergone changes during the history of editing.

@@ -445,11 +447,12 @@ definition, its typed-feature modeling facilitates the creation of schema constraints. For instance, I process my declaration to further constrict my schema so the feature structure declaration and its actual application are always synchronized and up to date.I use - XSLT to process the feature structure declaration in order to create all required Schematron rules that will constrict the feature library accordingly. I am currently working on creating a more generic validator (see my Github repository, Github repository, ).
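The fsDecl-to-Schematron step mentioned in this hunk is compact enough to sketch. A hedged illustration, not the author's actual stylesheet (the rule logic and everything beyond fsDecl/fDecl/symbol are assumptions):

    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0"
        xmlns:sch="http://purl.oclc.org/dsdl/schematron">

      <!-- One Schematron pattern per feature-structure declaration. -->
      <xsl:template match="tei:fsDecl">
        <sch:pattern>
          <xsl:apply-templates select="tei:fDecl"/>
        </sch:pattern>
      </xsl:template>

      <!-- Each declared feature becomes a rule: an f element with that
           name may only carry one of its declared symbol values. -->
      <xsl:template match="tei:fDecl">
        <sch:rule context="tei:f[@name = '{@name}']">
          <sch:assert test="tei:symbol/@value = ({string-join(.//tei:symbol/@value ! concat('''', ., ''''), ', ')})">
            Undeclared value for feature <xsl:value-of select="@name"/>.
          </sch:assert>
        </sch:rule>
      </xsl:template>
    </xsl:stylesheet>

Regenerating the Schematron from the declaration whenever it changes is what keeps the feature library and its constraints synchronized, as the paragraph above describes.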
@@ -541,16 +544,16 @@ >parallel segmentation method (TEI Consortium 2016, 12.2.3) seems to be a popular encoding technique for multi-witness editions, in terms of both the specific tools that have been created for this method and the number - of projects that apply it.Tools include Versioning Machine, CollateX (both - the Java and Python versions), and Juxta. For + of projects that apply it.Tools include Versioning Machine, CollateX (both + the Java and Python versions), and Juxta. For representative projects using the parallel segmentation method see Satire in Circulation: James editions Russell Lowell’s Letter from a volunteer in diff --git a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml index 309ab6cf..6fa4f312 100644 --- a/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml +++ b/data/JTEI/11_2019-20/jtei-cc-ra-hannessschlaeger-164-source.xml @@ -119,9 +119,10 @@ the gaps between them. Finally, I will illustrate my findings with the help of a concrete example, which is as TEI-specific as it can get: I will describe the history of the TEI Conference 2016 abstracts, which have, since the conference, been transformed into a TEI - data set that has been published not only on GitHub, but also in an - eXistdb-powered web application. This is by any standard a wonderful development for a + data set that has been published not only on GitHub, but also in an + eXistdb-powered web application. This is by any standard a wonderful development for a collection of textual data—and one that would not have been possible had the abstracts not been published under an open license, especially since their authors come from fourteen different countries.

@@ -444,16 +445,19 @@ explain how licensing played a vital role in enabling this transformation.

Submission of the Abstracts and the joys of ConfTool -

As it is every year, the conference management software ConfTool ProConfTool Conference - Management Software, ConfTool GmbH, accessed August 23, 2019, . was used for the submission of the +

As it is every year, the conference management software ConfTool ProConfTool Conference + Management Software, ConfTool GmbH, accessed August 23, 2019, . was used for the submission of the abstracts of the 2016 TEI conference. When the Vienna team received access to the ConfTool system, the instance for the 2016 conference had been equipped with default - settings based on previous TEI conference settings. As ConfTool is not the most + settings based on previous TEI conference settings. As ConfTool is not the most intuitive system to handle for a first-time administrator,The chair of the 2017 TEI conference program committee Kathryn Tomasek has described the rather tricky - structure of the system as the joys of ConfTool (email message to author, April + structure of the system as the joys of ConfTool (email message to author, April 11, 2017). one aspect was overlooked when setting up the system for the 2016 conference: the Copyright Transfer Terms and Licensing Policy that contributors had to agree to when submitting an abstract remained unchanged. It was @@ -527,23 +531,26 @@ Hannesschläger, and Wissik 2016a). Subsequently, the PDF of this printed book was made available via the conference website under the same license (Resch, Hannesschläger, and Wissik 2016b).

-

The page proofs that were transformed into this PDF had been created with Adobe - InDesign. The real fun started when the InDesign file was exported to XML and +

The page proofs that were transformed into this PDF had been created with Adobe + InDesign. The real fun started when the InDesign file was exported to XML and transformed back into single files (one file per abstract). These files were edited with - the Oxygen XML editor to become proper TEI files with extensive headers. Finally, they + the Oxygen XML editor to become proper TEI files with extensive headers. Finally, they were published as a repository together with the TEI schema on GitHub (GitHub (Hannesschläger and Schopper 2017), again under the same license. This allowed Martin Sievers, one of the abstract authors, to immediately correct a typing error in his abstract that the editors had overlooked (see history of Hannesschläger and Schopper - 2017 on GitHub).

+ 2017 on GitHub).

But the story did not end there. The freely available and processable collection of abstracts inspired Peter Andorfer, a colleague of the editors at the Austrian Centre for Digital Humanities, to use this text collection to build an eXistdb-powered web - application (Andorfer and Hannesschläger - 2017). In the context of licensing issues, it is important to mention that + application (Andorfer and Hannesschläger + 2017). In the context of licensing issues, it is important to mention that Andorfer was never approached by the editors or explicitly asked to process the TEI files, and he only informed the editors about the web application that he was building when it was already available online (as a work in progress, but diff --git a/data/JTEI/12_2019-20/jtei-cc-ra-bauman-170-source.xml index 70ce74d2..2d92b996 100644 --- a/data/JTEI/12_2019-20/jtei-cc-ra-bauman-170-source.xml +++ b/data/JTEI/12_2019-20/jtei-cc-ra-bauman-170-source.xml @@ -375,8 +375,8 @@ - XSLT template that converts an + XSLT template that converts an lb into a space.

@@ -439,7 +439,8 @@ level="m">A Romance of Many Dimensions) separately and to differentiate it from the main title. She plans to use TEI P5, so her first thought is Does the TEI title element have a type - attribute? So she switches into oXygen, creates a new + attribute? So she switches into oXygen, creates a new tei_all document, puts her cursor immediately before the closing angle bracket of the title start-tag

The TAGC in SGML nomenclature.

and types a space. The result ( Choosing the type attribute from a - drop-down in oXygen + drop-down in oXygen In this fictional—but completely believable—example the encoder - has used the schema (tei_all) through a tool (oXygen) as a + has used the schema (tei_all) through a tool (oXygen) as a way of finding out about the markup language (TEI). Yes, she could just as well have read the demonstrates oXygen helping an encoder. Here oXygen is + type="crossref"/> demonstrates oXygen helping an encoder. Here oXygen is not just answering a common question: is the TEI element for a notes statement notesStmt or noteStmt? It is also requiring that the user enter only one of three elements allowed by the schema (or a comment, etc.).
- Inserting an element in oXygen + Inserting an element in oXygen
By using an editor that understands the schema, an encoder can avoid making common mistakes (like misspelling an element name) at the time of the original encoding. This is extremely helpful in constraining one’s @@ -792,8 +797,8 @@ would ignore it, would not cause a problem if it were left.

Either way, in order to avoid this potential maintenance nightmare, tei_customization.odd is not a static file, but rather is - generated by running an XSLT program that reads as its input the + generated by running an XSLT program that reads as its input the source to TEI P5

Remember, the TEI Guidelines are written in TEI. The source to all of P5 is a single TEI document, although for convenience it is split into well over 850 separate files.

and writes @@ -886,12 +891,12 @@
How to Get it and Use it -

The XSLT program used to generate +

The XSLT program used to generate tei_customization.odd can be found in the TEI GitHub repository. It is currently called + type="software" xml:id="R4" target="#github"/>GitHub repository. It is currently called TEI-to-tei_customization.xslt. The generated tei_customization ODD file and the schemas generated from it can be found in each release of the TEI from 3.3.0 on.
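For readers who have not opened the generated file: a customization ODD of this kind is essentially a schemaSpec whose body is computed from the P5 source. A heavily truncated, hedged sketch of the overall shape (the module selection shown is illustrative, not the actual generated content):

    <schemaSpec ident="tei_customization" start="TEI">
      <!-- modules needed for writing ODD customizations -->
      <moduleRef key="tei"/>
      <moduleRef key="core"/>
      <moduleRef key="header"/>
      <moduleRef key="textstructure"/>
      <moduleRef key="tagdocs"/>
      <!-- plus constraints computed from the P5 source at generation
           time, e.g. closed value lists for attributes -->
    </schemaSpec>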

For example, the @@ -902,10 +907,13 @@ target="http://www.tei-c.org/Vault/P5/3.3.0/xml/tei/custom/schema/relaxng/tei_customization.rnc" />.

Furthermore, the current version of tei_customization is available - from within oXygen as part of the TEI oXygen framework. However, the RELAX NG + from within oXygen as part of the TEI oXygen framework. However, the RELAX NG schema (tei_customization.rng or tei_customization.rnc) has the behavior discussed in . While this is not a bug or broken - in any way, it is likely to be confusing and problematic for most users of oXygen. + in any way, it is likely to be confusing and problematic for most users of oXygen. The TEI Council is interested in finding a way around this difficulty.

@@ -1144,7 +1152,8 @@ customization that she has to do this. The consequences of failing to turn it off are severe, though: although completion pop-up boxes still work, validation (both automatic validation as you type and static validation, for example ⌘-⇧-V) completely - stops working. Furthermore, oXygen leaves this feature on by default for a reason. + stops working. Furthermore, oXygen leaves this feature on by default for a reason. Even though ID/IDREF checking itself is of almost no use to a user working with TEI P5 documents,

Because P5 does not use the ID/IDREF mechanism, the only one of the three added constraints that is useful is (2), that the value of @@ -1157,7 +1166,8 @@ prev). And that is a very well-loved feature of oXygen. There are several possible solutions to this problem, each of which has its drawbacks. The TEI Council will hopefully implement one of them soon, making use of - tei_customization from the oXygen framework much less + tei_customization from the oXygen framework much less problematic.

diff --git a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml index c55001f2..0e723e25 100644 --- a/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml +++ b/data/JTEI/13_2020-22/jtei-cc-ra-parisse-182-source.xml @@ -95,19 +95,36 @@ collect and transcribe spoken language resources, their number is limited and thus corpora need to be interoperable and reusable in order to improve research on themes such as phonology, prosody, interaction, syntax, and textometry. To help researchers reach this - goal, CORLI has designed a pair of tools: TEICORPO to assist in the conversion and use of - spoken language corpora, and TEIMETA for metadata purposes. TEICORPO is based on the - principle of an underlying common format, namely TEI XML as described in its specification - for spoken language use (ISO 2016). This tool enables the conversion of transcriptions - created with alignment software such as CLAN, Transcriber, Praat, or ELAN as well as - common file formats (CSV, XLSX, TXT, or DOCX) and the TEI format, which plays the role of - a lossless pivot format. Backward conversion is possible in many cases, with limitations - inherent in the destination target format. TEICORPO can run the Treetagger part-of-speech - tagger and the Stanford CoreNLP tools on TEI files and can export the resulting files to - textometric tools such as TXM, Le Trameur, or Iramuteq, making it suitable for - spoken language corpora editing as well as for various research purposes.

+ goal, CORLI has designed a pair of tools: + TEICORPO to assist in the conversion and use of spoken + language corpora, and + TEIMETA for metadata purposes. + TEICORPO is based on the principle of an underlying + common format, namely TEI XML as described in its specification for spoken language use + (ISO 2016). This tool enables the conversion of transcriptions created with alignment + software such as + CLAN, + Transcriber, + Praat, or ELAN as well as common file formats + (CSV, XLSX, TXT, or DOCX) and the TEI format, which plays the role of a lossless pivot + format. Backward conversion is possible in many cases, with limitations inherent in the + destination target format. + TEICORPO can run the + Treetagger part-of-speech tagger and the + Stanford CoreNLP tools on TEI files and can export + the resulting files to textometric tools such as TXM, Le Trameur, or + Iramuteq, making it suitable for spoken language corpora editing as well as for + various research purposes.

@@ -173,7 +190,8 @@ limited coverage, even if the corpora involved are very large.

- The TEICORPO Approach + The + TEICORPO Approach

The goal of the CORLI consortium is to make it easier to deposit, share, and reuse data. With this goal in mind, CORLI has always promoted the use of open public repositories and open formats. Our policy is to advocate for the use of a common single @@ -213,14 +231,18 @@

Similarities with and Differences from Other Approaches

Many software packages dedicated to editing spoken language transcription contain - utilities that can convert many formats: for example, EXMARaLDA (Schmidt 2004; see ), Anvil (Kipp - 2001; see ), and ELAN (Wittenburg et al. 2006; - see ). However, in all cases, the + utilities that can convert many formats: for example, EXMARaLDA (Schmidt 2004 + ; see ), + + Anvil ( + Kipp 2001; see ), and ELAN + (Wittenburg + et al. 2006; see + ). However, in all cases, the conversions are limited to the features implemented in the tool itself—for example, with a limited set of metadata—and they cannot always be used to prepare data to be used by another tool.

@@ -233,108 +255,148 @@ TEI is used as a destination format.

The list of tools that are considered in the two projects is nearly the same. The only - tools missing in the TEICORPO approach are EXMARaLDA and FOLKER - (Schmidt and Schütte 2010; see ), but this was only because the - conversion tools from and to EXMARaLDA, FOLKER, and TEI already exist. - They are available as XSLT stylesheets in the open-source distribution of - EXMARaLDA (). The other common point is the - use of the TEI format, and especially the more recent ISO version of TEI for spoken - language (ISO/TEI; see ISO 2016). The TEI - format produced by the EXMARaLDA and FOLKER software fit within the - process chain of TEICORPO. This demonstrates the usefulness of a well-known and + tools missing in the + TEICORPO approach are EXMARaLDA and + FOLKER (Schmidt and Schütte + 2010; see ), but this was only because the + conversion tools from and to EXMARaLDA, + FOLKER, and TEI already exist. They are available + as XSLT stylesheets in the open-source distribution of EXMARaLDA (). The other common point is the use of the TEI format, and especially the more + recent ISO version of TEI for spoken language (ISO/TEI; see ISO 2016). The TEI format produced by the EXMARaLDA and + + FOLKER software fit within the process chain of + TEICORPO. This demonstrates the usefulness of a well-known and efficient format such as TEI.

There are, however, differences between the two projects that make them nonredundant but complementary, each project having specificities that can be useful or damaging - depending on the user’s needs. One minor difference is that the TEICORPO project is not - a functionality of an editing tool, but is a standalone tool for converting data between - one format and another. This had certain effects on the user interface and explains some - of the choices made in the development of the two tools.

-

There are two major differences between TEICORPO and Schmidt’s approach, which affected + depending on the user’s needs. One minor difference is that the + TEICORPO project is not a functionality of an + editing tool, but is a standalone tool for converting data between one format and + another. This had certain effects on the user interface and explains some of the choices + made in the development of the two tools.

+

There are two major differences between + TEICORPO and Schmidt’s approach, which affected both the design of the tools and how they can be used. The first difference is that in - developing TEICORPO, it was decided that the conversion between the original formats and - TEI had to be lossless (or as lossless as possible) because we wanted to offer a means - to store the research data for long-term conservation and dissemination in a standard - XML format instead of in proprietary formats such as those used by CLAN (MacWhinney 2000; see ), ELAN, Praat (Boersma and van Heuven 2001; see ), and Transcriber (Barras et al. 2000; see and ). These proprietary - formats are in XML or Unicode formats so that they can be conserved for the long term. - However, they are not all well described or constrained, at least not in the same way as - TEI—which, moreover, offers a semantically relevant structure as well as an official - format for long-term conservation in France. Moreover, as the durability of these four - pieces of software cannot be guaranteed in the long term, it does not seem safe to keep - corpora in a format available only for a given tool that may disappear or fall into - disuse.

-

The second major difference is that the TEICORPO initiative does not target only spoken + developing TEICORPO, it was decided that the conversion between the original + formats and TEI had to be lossless (or as lossless as possible) because we wanted to + offer a means to store the research data for long-term conservation and dissemination in + a standard XML format instead of in proprietary formats such as those used by + CLAN (MacWhinney 2000; see ), ELAN, + Praat (Boersma and van Heuven 2001; see ), and + Transcriber (Barras et al. 2000; see and + ). These + proprietary formats are in XML or Unicode formats so that they can be conserved for the + long term. However, they are not all well described or constrained, at least not in the + same way as TEI—which, moreover, offers a semantically relevant structure as well as an + official format for long-term conservation in France. Moreover, as the durability of + these four pieces of software cannot be guaranteed in the long term, it does not seem + safe to keep corpora in a format available only for a given tool that may disappear or + fall into disuse.

+

The second major difference is that the + TEICORPO initiative does not target only spoken language, but all types of annotation, including media of any type. This covers all spoken languages, vocal as well as sign languages, and also gesture and any type of - multimodal coding. The goal of TEICORPO was not to advocate a linguistic mode of coding - spoken data as a transcription convention does, but rather to propose a research model - for storing and sharing data about language and other modalities. Consequently, the - focus of the work was not on how the spoken data were coded (i.e., the microstructure), - nor on the standard that should be used for transcribing in orthographic format. - Instead, the TEICORPO approach focused on how to integrate multiple pieces of - information into the TEI semantics (the macrostructure), as this is possible with tools - such as ELAN or PRAAT. The goal was to be able to convert a file produced by - these tools so that it can be saved in TEI format for long-term conservation.

-

Data in PRAAT and ELAN formats can contain information that is - different from what is usually present in an ISO/TEI description, but that nonetheless - remains within the structures authorized in the ISO/TEI. For example, the information is - stored as described below in spanGrp, an element available in the ISO/TEI - description. This means that whenever information is organized according to the - classical approach to spoken language (by + multimodal coding. The goal of + TEICORPO was not to advocate a linguistic mode of + coding spoken data as a transcription convention does, but rather to propose a research + model for storing and sharing data about language and other modalities. Consequently, + the focus of the work was not on how the spoken data were coded (i.e., the + microstructure), nor on the standard that should be used for transcribing in + orthographic format. Instead, the + TEICORPO approach focused on how to integrate + multiple pieces of information into the TEI semantics (the macrostructure), as this is + possible with tools such as ELAN or + PRAAT. The goal was to be able to convert a file + produced by these tools so that it can be saved in TEI format for long-term + conservation.

+

Data in + PRAAT and ELAN formats can contain + information that is different from what is usually present in an ISO/TEI description, + but that nonetheless remains within the structures authorized in the ISO/TEI. For + example, the information is stored as described below in spanGrp, an element + available in the ISO/TEI description. This means that whenever information is organized + according to the classical approach to spoken language (by classical, we mean approaches based on an orthographic transcription represented as a list, as in the script of a play), it will be available - for further processing by using the export features of TEICORPO (see and further below for export functionalities) - but other types of information are also available. Compared to PRAAT and ELAN, the integration of tools such as CLAN or Transcriber was much more - straightforward, as the organization of the files is less varied and more - classical.

+ for further processing by using the export features of + TEICORPO (see and further below for export functionalities) but other types of + information are also available. Compared to + PRAAT and ELAN, the integration of tools + such as + CLAN or + Transcriber was much more straightforward, as the + organization of the files is less varied and more classical.

Choice of the Microstructure Representation

Processing of the microstructure, with the exception of information already available - in the tools themselves (for example silence in Transcriber), is not done during the - conversion to TEI. The division into words or other elements such as morphemes or - phonemes is not systematically done in any of the tools used by researchers in CORLI. - When it exists, it is not included in the main transcription line but most often in - dependent lines, as it represents an annotation with its own rules and guidelines. - Division into words or other elements is part of the linguistic analysis rather than a - simple storage operation.

-

TEICORPO therefore preserves as long-term storage data both the original information - that was created in the original software—the full unprocessed transcription—and the - other linguistically processed transcriptions and annotations. For TEICORPO, - microstructure processing, such as division into words, or text standardization when - necessary, belongs to the linguistic analysis of the corpora. Hence, the TEI data file - can be used both for data exploration and for scientific purposes. For example, when a - researcher needs to parse the data, or to explore the data with textometric tools, - then it is necessary to decide which type of preprocessing is necessary. As this - decision often depends on the initial project as well as on linguistic choices, it is - difficult to standardize this task.

+ in the tools themselves (for example silence in + Transcriber), is not done during the conversion + to TEI. The division into words or other elements such as morphemes or phonemes is not + systematically done in any of the tools used by researchers in CORLI. When it exists, + it is not included in the main transcription line but most often in dependent lines, + as it represents an annotation with its own rules and guidelines. Division into words + or other elements is part of the linguistic analysis rather than a simple storage + operation.

+

+ TEICORPO therefore preserves as long-term storage + data both the original information that was created in the original software—the full + unprocessed transcription—and the other linguistically processed transcriptions and + annotations. For + TEICORPO, microstructure processing, such as + division into words, or text standardization when necessary, belongs to the linguistic + analysis of the corpora. Hence, the TEI data file can be used both for data + exploration and for scientific purposes. For example, when a researcher needs to parse + the data, or to explore the data with textometric tools, then it is necessary to + decide which type of preprocessing is necessary. As this decision often depends on the + initial project as well as on linguistic choices, it is difficult to standardize this + task.

- The TEICORPO Project -

The TEICORPO project contains two different sets of tools. One set focuses on conversion - between various software packages used for spoken language coding and TEI. The other set - focuses on using the TEI format for linguistic analyses (textometric or grammatical - analyses).

+ The + TEICORPO Project +

The + TEICORPO project contains two different sets of + tools. One set focuses on conversion between various software packages used for spoken + language coding and TEI. The other set focuses on using the TEI format for linguistic + analyses (textometric or grammatical analyses).

Alignment Tools @@ -346,31 +408,37 @@

Some common practices have been identified in our community but other uses of the same software are of course possible:

- Transcriber is widely used in sociolinguistics; - CLAN is widely used in language acquisition and especially in the Talkbank - project; + + Transcriber is widely used in + sociolinguistics; + + CLAN is widely used in language acquisition and + especially in the Talkbank project; Praat is more specialized for phonetic or phonological annotations; - ELAN is recommended for annotating video and particularly - multimodality (for example, components such as gazes, gestures, and movements), and is - often used for rare languages to describe the organization of the segments. + ELAN is recommended for annotating video and particularly multimodality (for + example, components such as gazes, gestures, and movements), and is often used for + rare languages to describe the organization of the segments.

It should be pointed out here that whereas Transcriber and CLAN files nearly always contain classical orthographic transcriptions, this is not the case - for Praat and ELAN files. As our goal is to provide a generic solution for - long-term conservation and use for any type of project, conversion of all types of files - produced by the four tools cited above will be possible. It is up to the user to - determine which part of a corpus can be used with a classical approach, which parts - should not, and how they should be processed.

+ for Praat and ELAN files. As our goal is to provide a generic solution for long-term + conservation and use for any type of project, conversion of all types of files produced + by the four tools cited above will be possible. It is up to the user to determine which + part of a corpus can be used with a classical approach, which parts should not, and how + they should be processed.

The list of tools reflects the uses and practices in the CORLI network, and is very similar to the list suggested by Schmidt (2011) with the exception of EXMARaLDA and FOLKER. - These two tools already have built-in conversion features, so adding them to the - TEICORPO project would be easy at a later date.

+ >2011) with the exception of EXMARaLDA and + FOLKER. These two tools already have built-in + conversion features, so adding them to the + TEICORPO project would be easy at a later date.

Alignment applications deal with two main types of data presentation and organization. The presentation of the data has direct consequences for how the data are exploited, and therefore on the design of the tools that are used.

@@ -391,35 +459,42 @@ chronologically but is sorted by the names of the tiers (or any other order), with all the production within the same tier sorted by timeline. -

No tool offers both types of presentation. ELAN offers some alternatives to +

No tool offers both types of presentation. ELAN offers some alternatives to editing or displaying data with the partition format, but none of the existing tools offer full-fledged list format editing. It is possible to represent the two structures within a similar model, as demonstrated by Bird and Liberman (2001). However, this is not the case for the four tools listed above: each of them represents the data in a unique underlying data structure. Transcriber and CLAN are organized in list format; Praat and ELAN have a + xml:id="R59" target="#elan"/>ELAN have a partition format.

Each presentation format has its own pros and cons. Because of the possibilities offered by the presentation formats, and because the same software, even within the same presentation models, rarely provides a solution for all the needs of all users, researchers often have to use two or more pieces of software.

-

The use of multiple tools is quite common. For example, Praat and Transcriber cannot be - used when working on video recordings because these programs are limited to audio - formats. But if researchers need to conduct spectral analysis for some purpose, they - will have to use the Praat software and convert not only the transcription, but also the - media. In the field of language acquisition, where the CLAN software is generally used - to describe both the child productions and the adult productions, when researchers are - interested in gestures, they use the ELAN software, importing the CLAN file to add - gesture tiers, as ELAN is more suitable for the fine-grained analysis - of visual data. Another common practice consists in first doing a rapid transcription - using only orthographic annotations in Transcriber and then in a second stage annotating - some more interesting excerpts in greater detail including new information. In this case +

The use of multiple tools is quite common. For example, + Praat and + Transcriber cannot be used when working on video + recordings because these programs are limited to audio formats. But if researchers need + to conduct spectral analysis for some purpose, they will have to use the + Praat software and convert not only the + transcription, but also the media. In the field of language acquisition, where the + CLAN software is generally used to describe both + the child productions and the adult productions, when researchers are interested in + gestures, they use the ELAN software, importing the CLAN file to add gesture + tiers, as ELAN is more suitable for the fine-grained analysis of visual data. + Another common practice consists in first doing a rapid transcription using only + orthographic annotations in Transcriber and then in a second stage annotating some more + interesting excerpts in greater detail including new information. In this case researchers will import the first transcription file into other tools such as Praat or - ELAN and annotate them partially. It is therefore necessary to import or export files in different formats if researchers need to use different tools for different parts of their work.

@@ -435,8 +510,9 @@ requirements of the conversion process options. For these reasons, we decided in the CORLI consortium and in collaboration with the ORTOLANG infrastructure to design a common tool that could be used by the whole linguistic community. The goal was to make - open-source software with proper maintenance freely available on .

+ open-source software with proper maintenance freely available on + .

Conversion to and from TEI @@ -451,22 +527,34 @@ metadata and all the macrostructure information into the TEI format.

Basic Structures -

Converting the metadata is straightforward, as the four tools (CLAN, ELAN, Praat, and Transcriber) do not enable a large amount of metadata to be - edited. Most of the metadata available concerns the content of the sequence; some user - metadata is also available, especially in CLAN. The insertion of metadata follows the +

Converting the metadata is straightforward, as the four tools ( + CLAN, ELAN, + Praat, and + Transcriber) do not enable a large amount of + metadata to be edited. Most of the metadata available concerns the content of the + sequence; some user metadata is also available, especially in + CLAN . The insertion of metadata follows the indications of the ISO/TEI 24624:2016 standard (ISO 2016).

Moreover, some tools, such as Transcriber, include information about silences, - pauses, and events in their XML format. This information is also processed within - TEICORPO, once again following the recommendations of the ISO/TEI standard.

+ pauses, and events in their XML format. This information is also processed within + TEICORPO, once again following the + recommendations of the ISO/TEI standard.

Conversion of the main data, the transcription and the annotations, cannot always be done solely on the basis of the description provided in the ISO/TEI guidelines. These - guidelines do, however, suffice to fully describe the content of the CLAN and - Transcriber software. We took advantage of the new annotationBlock element, - which codes several annotation levels, a function that is commonly required in - spoken-language annotations.

+ guidelines do, however, suffice to fully describe the content of the + CLAN and + Transcriber software. We took advantage of the + new annotationBlock element, which codes several annotation levels, a + function that is commonly required in spoken-language annotations.

The annotationBlock contains two major elements: the u element, which contains the transcription in orthographic form, and the spanGrp elements, which contain tier elements that annotate the utterance described in the @@ -474,25 +562,27 @@ elements as required. All span elements have the same type of content, as indicated in the parent spanGrp element. and provide an - example of conversion from a CLAN file to illustrate how a production annotated on - different levels (orthography, morphosyntax, dependencies) is represented in TEI with - a first main utterance element u to which two spanGrps are linked, - one for each annotation level, in our case one spanGrp for morphosyntax and - one spanGrp for dependencies (see ). A - timeline element gives the start (T1) and end (T2) - timecodes and an annotationBlock element specifies the speaker with the - who attribute and the start and end attributes with - the timecode anchors #T1 and #T2. The annotationBlock - element includes both the utterance element and the two annotations. No semantic - constraint is imposed on the inner content of the span elements. The content of the - type attribute in the spanGrp element represents and documents - the choice of the researchers who produced the original corpus. The content generated - is preserved as it was in the original file, making backward conversion possible. In - , the mor and - gra attribute values represent grammatical knowledge. Using the content - of these elements to produce advanced grammatical representation in more elaborate TEI - and XML formats is of course possible, but would be a tailored task which is beyond - the scope of the TEICORPO project.

+ example of conversion from a + CLAN file to illustrate how a production + annotated on different levels (orthography, morphosyntax, dependencies) is represented + in TEI with a first main utterance element u to which two spanGrps + are linked, one for each annotation level, in our case one spanGrp for + morphosyntax and one spanGrp for dependencies (see ). A timeline element gives the start (T1) and + end (T2) timecodes and an annotationBlock element specifies the + speaker with the who attribute and the start and end + attributes with the timecode anchors #T1 and #T2. The + annotationBlock element includes both the utterance element and the two + annotations. No semantic constraint is imposed on the inner content of the span + elements. The content of the type attribute in the spanGrp element + represents and documents the choice of the researchers who produced the original + corpus. The content generated is preserved as it was in the original file, making + backward conversion possible. In , the + mor and gra attribute values represent grammatical knowledge. + Using the content of these elements to produce advanced grammatical representation in + more elaborate TEI and XML formats is of course possible, but would be a tailored task + which is beyond the scope of the + TEICORPO project.

*MOT: look at the tree ! 2263675_2265197 %mor: v|look prep|at det|the n|tree ! @@ -529,19 +619,26 @@
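The TEI side of this example is flattened out of the hunk above; a hedged reconstruction of its shape, following the description in the preceding paragraph (the timeline encoding and the %gra content are abbreviated):

    <timeline unit="ms">
      <when xml:id="T1"/> <!-- 2263675 -->
      <when xml:id="T2"/> <!-- 2265197 -->
    </timeline>
    ...
    <annotationBlock who="#MOT" start="#T1" end="#T2">
      <u>look at the tree !</u>
      <spanGrp type="mor">
        <span from="#T1" to="#T2">v|look prep|at det|the n|tree !</span>
      </spanGrp>
      <spanGrp type="gra">
        <span from="#T1" to="#T2"><!-- %gra line content, preserved verbatim --></span>
      </spanGrp>
    </annotationBlock>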

Although the presentation described above can represent the data of many corpora and tools, a single-level annotation structure within the spanGrp elements is insufficient to represent the complex organization that can be constructed with the - ELAN and Praat tools. ELAN is a tool used by many researchers to - describe data of greater complexity than the data presented in the ISO/TEI guidelines. - As the goal of the TEICORPO project was to convert all types of structure used in the - spoken language community, including ELAN and Praat, it was necessary to extend - the description method presented in .

-

In ELAN and Praat, the multitiered annotations can be organized in a - structured manner. These tools take advantage of the partition presentation of the - data, so that the relationship between a parent tier and a child tier can be precisely - organized. There are two main types of organization: symbolic and temporal.

+ ELAN and + Praat tools. ELAN is a tool used by many + researchers to describe data of greater complexity than the data presented in the + ISO/TEI guidelines. As the goal of the + TEICORPO project was to convert all types of + structure used in the spoken language community, including ELAN and + Praat, it was necessary to extend the description + method presented in .

+

In ELAN and + Praat, the multitiered annotations can be + organized in a structured manner. These tools take advantage of the partition + presentation of the data, so that the relationship between a parent tier and a child + tier can be precisely organized. There are two main types of organization: symbolic + and temporal.

In symbolic division, the elements of a child tier, C1 to Cn, can be related to an element of a parent tier P. For example, a word is divided into morphemes. In , the main @@ -555,8 +652,8 @@ links.

- ELAN annotation with symbolic structures + ELAN annotation with symbolic structures

In temporal division, the association between the main tier and the dependent tiers @@ -588,34 +685,41 @@ usual spoken language corpus (such as those described in Schmidt 2011). However, as this type of data is produced by members of the CORLI consortium, it needs to be preserved. Encoding the data in TEI using a standard tool makes the process - reproducible, which is one of the goals of TEICORPO.

+ reproducible, which is one of the goals of + TEICORPO.

Although this type of data is not described in the ISO/TEI guidelines, it is in fact possible to store it in TEI format using current TEI features. TEI provides a general mechanism for storing hierarchically structured data by using the spanGrp and span mechanism. Moreover, the span and spanGrp tags have attributes that can point to other elements or to timelines. Using this coding schema, it is therefore possible to store any type of structure, symbolic and/or temporal, - that can be generated with ELAN or PRAAT, as described above.

+ that can be generated with ELAN or + PRAAT, as described above.

To do this, each element which is in a symbolic or temporal relation is represented by a spanGrp element of the TEI. The spanGrp element contains as many span elements as necessary to store all the elements present in the ELAN or PRAAT representation. The parent element of a spanGrp is the - main annotationBlock element when the division in ELAN or PRAAT is - the first division of a main element. The parent element is another span - element when the division in ELAN or PRAAT is a subdivision of another element - which is not a main element. This XML structure is complemented by explicit - information as allowed in TEI. The span elements are linked to the element - they depend on, either with a symbolic link using the target attribute of - the span element, or with temporal links using the from and - to attributes of the span element.

+ type="software" xml:id="R90" target="#elan"/>ELAN or + PRAAT representation. The parent element of a + spanGrp is the main annotationBlock element when the division in + ELAN or PRAAT is the first division of a main element. The parent element is + another span element when the division in ELAN or + PRAAT is a subdivision of another element which + is not a main element. This XML structure is complemented by explicit information as + allowed in TEI. The span elements are linked to the element they depend on, + either with a symbolic link using the target attribute of the span + element, or with temporal links using the from and to attributes + of the span element.

Two examples of how this is displayed in a TEI file are given below. The first example (see and ) corresponds to the ELAN example above (see ) corresponds to the ELAN example above (see , ). The TEI encoding represents the words of the sentence from left to right (from gahwat to endi in our example). The detail of the @@ -627,8 +731,8 @@ and -DET.
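The TEI listing itself is flattened out of these hunks; the symbolic case reduces to span elements pointing at their parent with target, roughly as follows (ids and tier names are invented for illustration; only the target-based linking is the point):

    <!-- word tier, dependent on utterance u1 -->
    <spanGrp type="words">
      <span xml:id="w1" target="#u1">gahwat</span>
    </spanGrp>
    <!-- morpheme tier: each child span points at its parent word -->
    <spanGrp type="morphemes">
      <span target="#w1">…</span>
      <span target="#w1">…</span>
    </spanGrp>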

- ELAN example of a symbolic division + ELAN example of a symbolic division
@@ -673,21 +777,22 @@

The second example is structured using time references. This example (see and ) corresponds to the Praat example above (see , ). In this case, each part - of the transcription is represented according to the timeline, but there is also a - hierarchy which is represented by the spanGrp and span tags. Each - span is part of the parent spanGrp with starting and ending points - (which correspond to the from and to attributes in the example - below). The use of from + />) corresponds to the + Praat example above (see , ). In this case, each part of the transcription is represented according to the + timeline, but there is also a hierarchy which is represented by the spanGrp + and span tags. Each span is part of the parent spanGrp with + starting and ending points (which correspond to the from and to + attributes in the example below). The use of from to versus target is the only difference between the two organizations. In the example below, the syllable Sa is divided into two phonemes, S and a (see xml:id s s34, s36, and s37).
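Again the TEI listing is flattened out of the hunk; a hedged sketch of the temporal case, reusing the xml:ids named above (the timeline anchors are invented):

    <!-- syllable tier -->
    <spanGrp type="syllables">
      <span xml:id="s34" from="#T1" to="#T3">Sa</span>
    </spanGrp>
    <!-- phoneme tier: children of s34, delimited by shared timeline anchors -->
    <spanGrp type="phonemes">
      <span xml:id="s36" from="#T1" to="#T2">S</span>
      <span xml:id="s37" from="#T2" to="#T3">a</span>
    </spanGrp>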

- ELAN example of a temporal division + ELAN example of a temporal division
@@ -715,17 +820,19 @@

The spanGrp and span offer a generic representation of data coming from relatively unconstrained representations produced by partition software. The - names of the tiers used in the ELAN and Praat tools are given in the content of - the type attribute. These names are not used to provide structural + names of the tiers used in the ELAN and + Praat tools are given in the content of the + type attribute. These names are not used to provide structural information, the structure being represented only by the spanGrp and span hierarchy. However, the organization into spanGrp and span is not always sufficient to represent all the details of the tier organization of each software feature. This is the case for some of the ELAN structures, which can specify the nature of span elements further than in the TEI feature. For example, the timediv - ELAN property specifies that only contiguous temporal division is allowed, whereas the incl property allows non-contiguous elements. It was therefore necessary to include the type of organization in the header of the TEI file, @@ -737,13 +844,19 @@

Exporting to Research Tools -

In the TEICORPO approach, no modification is made to the original format and conversion - remains as lossless as possible. This allows for all types of corpora to be stored for - long-term preservation purposes. It also allows the corpora to be used with other - editing tools, some of which are suited to specific processing: for example, Praat for - phonetics/phonology; Transcriber/CLAN for raw transcription; and ELAN for gesture - and visual coding.

+

In the + TEICORPO approach, no modification is made to the + original format and conversion remains as lossless as possible. This allows for all + types of corpora to be stored for long-term preservation purposes. It also allows the + corpora to be used with other editing tools, some of which are suited to specific + processing: for example, + Praat for phonetics/phonology; + Transcriber/ + CLAN for raw transcription; and ELAN for gesture and visual coding.

However, a large proportion of scientific research and applications done using corpora requires further processing of the data. For example, although querying or using raw language forms is possible, many research investigations and tools use words, parts of @@ -754,8 +867,9 @@ structure. This microstructure is integrated in Schmidt’s approach, in which the TEI file can contain standardized information about words, specific spoken language information, and sometimes even POS information.

-

This approach was not adopted in TEICORPO for several reasons. First, we had to deal - with a large variety of coding approaches, which makes it difficult to conduct work +

This approach was not adopted in + TEICORPO for several reasons. First, we had to + deal with a large variety of coding approaches, which makes it difficult to conduct work similar to that done in CHILDES (MacWhinney 2000; see ). Second, there was no consensus about the way tokenization should be performed, as many researchers consider @@ -769,7 +883,9 @@ span elements without modifying the original u element information. Second, we decided to design another category of tools for processing or making it possible to process the spoken language corpus, and to use powerful tools in corpus - analysis. This part of the TEICORPO library is described in the next section.

+ analysis. This part of the + TEICORPO library is described in the next + section.

@@ -792,21 +908,28 @@
Basic Import and Export Functions

The command-line interface (see ) can - perform conversions between TEI and the formats used by the following programs: CLAN, - ELAN, Praat, and Transcriber. The conversions can be performed on single files - or on whole directories or on a file tree. The command-line interface is suited to - automatic processing in offline environments. The online interface (see ) can convert one or several files - selected by the user, but not whole directories. Results appear in the user’s download - folder.

+ perform conversions between TEI and the formats used by the following programs: + CLAN, ELAN, + Praat, and + Transcriber. The conversions can be performed on + single files or on whole directories or on a file tree. The command-line interface is + suited to automatic processing in offline environments. The online interface (see + ) can + convert one or several files selected by the user, but not whole directories. Results + appear in the user’s download folder.

In addition to the conversion to and from the alignment software, the online version of - TEICORPO offers import and export in common spreadsheet formats (.xlsx and .csv) and - word processing formats (.docx and .txt). Importing data is useful to create new data, - and exporting is used to make reports or examples for a publication and for end users - not familiar with transcription tasks or computer software (see and ).

+ + TEICORPO offers import and export in common + spreadsheet formats (.xlsx and .csv) and word processing formats (.docx and .txt). + Importing data is useful to create new data, and exporting is used to make reports or + examples for a publication and for end users not familiar with transcription tasks or + computer software (see and ).

Visual representation of data from Example 1 after being processed through TEI
@@ -854,25 +977,34 @@
transcription.

Other features are available in both types of interface (command line and web service). TEICORPO allows the user to exclude some tiers, for example adult tiers in acquisition research where the user wants to study child production only, or comment tiers which are not necessary for some studies.

Export to Specialized Software

Another kind of export concerns textometric software. TEICONVERT makes spoken language data available for TXM (Heiden 2010; see ), Le Trameur (Fleury and Zimina 2014; see ), and Iramuteq (see and de Souza et al. 2018), providing a dedicated TEI export for these tools. For example, for the TXM software, the export includes a text element made of utterance elements including age and speaker attributes. presents an example for the TXM software.

@@ -902,8 +1034,8 @@
Example of XML for the TXM software
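The original listing is not preserved above; going only by the description (a text element made of utterance elements carrying age and speaker attributes), the TXM-oriented export plausibly looks something like this hedged sketch, with attribute names and values assumed:

<text>
  <u speaker="MOT" age="34">you have to rest now?</u>
  <u speaker="CHI" age="2.6">yes.</u>
</text>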

An export has been developed for Lexico and Le Trameur textometric software with a
@@ -917,18 +1049,20 @@
Example of export for the Lexico or Le Trameur software

Likewise, another export is available for the textometric tool Iramuteq without timelines (see ).

****
*MOT you have to rest now ?
*CHI yes .
*MOT from your big singing extravaganza ?
*CHI yes that was a party .
*MOT woof .
*MOT that was a party that sure was some party .
Example of export for the IRAMUTEQ software

In all these cases, TEICORPO is able to provide an export file and to remove unnecessary information from the TEI pivot format. This is useful, for example, with textometric software, which works only with orthographic tiers without a timeline or dependent information.

@@ -938,36 +1072,60 @@
linguistic research. A present difficulty with these grammatical analyzers is that most often they run only on raw orthographic material, excluding other information. Moreover, their results are not always in a format that can be used with traditional spoken language software such as CLAN, ELAN, Praat, or Transcriber, nor of course in TEI format.

TEICORPO provides a way to solve this problem by running analyzers and putting the results from the analysis back into TEI format. Once the TEI format has been enriched with grammatical information, it is possible to use the results and convert them back to ELAN or Praat and use the grammatical information in these spoken language software packages. It is also possible to export to TXM and to use the grammatical information in the textometric software. Two grammatical analyzers have been implemented in TEICORPO: TreeTagger and CoreNLP.

TreeTagger

TreeTagger
Accessed March 11, 2021, .
(Schmid 1994; 1995) is a tool for annotating text with part-of-speech and lemma information. The software is freely available for research, education, and evaluation. It is available in twenty-five languages, provides high-quality results, and can be easily improved by enriching the training set, as was done for instance by Benzitoun, Fort, and Sagot (2012) in the PERCEO project. They defined a syntactic model suitable for spoken language corpora, using the training feature of TreeTagger and an iterative process including manual corrections to improve the results of the automatic tool.

The command-line version of TEICORPO should be used to generate an annotated file with lemma and POS information based on TreeTagger. TreeTagger should be installed separately. The implementation of TreeTagger in TEICORPO includes the ability to use any syntactic model. For French data, we used the PERCEO model (Benzitoun, Fort, and Sagot 2012).

The command line to be used is: java -cp TEICORPO.jar fr.ortolang.TEICORPO.TeiTreeTagger filenames... with additional
@@ -982,14 +1140,18 @@

-model filename

filename is the full name of the TreeTagger syntactic model. In our case, we use the PERCEO model.

-program filename

filename is the full location of the TreeTagger program, according to the system used (Windows, MacOS, or Linux).

-normalize
@@ -999,8 +1161,9 @@

The environment variable TREE_TAGGER can be used to locate the model and the program. If no -program option is used, the default name for the TreeTagger program is used.

The -model parameter is mandatory.

The resulting filename ends with .tei_corpo_ttg.tei_corpo.xml or a specific name provided by the user (option -o).

@@ -1115,20 +1278,28 @@
Stanford CoreNLP

The Stanford Core Natural Language Processing
Accessed March 11, 2021, .
(CoreNLP) package is a suite of tools (Manning et al. 2014) that can be used under a GNU General Public License. The suite provides several tools such as a tokenizer, a POS tagger, a parser, a named entity recognizer, temporal tagging, and coreference resolution. All the tools are available for English, but only some of them are available for all languages. All software libraries are integrated into Java JAR files, so all that is required is to download JAR files from the CoreNLP website
Accessed May 5, 2021, .
to use them with TEICORPO. Using the analyzer is similar to using TreeTagger. The -model and -syntaxformat parameters can be used in a similar way to specify the grammatical model to be used and the output format. A command line example is:

java -cp "teicorpo.jar:directory_for_SNLP/*" fr.ortolang.teicorpo.TeiSNLP
@@ -1136,7 +1307,7 @@

The directory_for_SNLP is the name of the location on a computer where all the CoreNLP JAR files can be found. Note that using the CoreNLP software makes heavy demands on the computer's memory resources and it is necessary to instruct the Java software to use a large amount of memory (for example to insert parameter -mx5g before parameter -cp to indicate that 5 GB of memory will be used for a full English analysis).

@@ -1152,18 +1323,21 @@
Exporting the Grammatical Analysis

The results from the grammatical analysis can be used in transcription files such as those used by Praat and ELAN. A partition-like visual presentation of data is very handy to represent a part of speech or a CONLL result. The orthographic line will appear at the top with divisions into words, divisions into parts of speech, and other syntactic information below. As the result of the analysis can contain a large number of tiers (each speaker will have as many tiers as there are elements in the grammatical analysis: for example, word, POS, and lemma for TreeTagger; ten tiers for CoreNLP full analysis), it is helpful to limit the number of visible tiers, either using the -a option of TEICORPO, or limiting the display with the annotation tool.

An example is presented below in the ELAN tool (see ). The original utterance was si c'est comme ça je m'en vais (if that's how it is, I'm leaving). It is displayed in the first line, highlighted in pink. The analysis into words (second line, consisting of numbers),
@@ -1173,21 +1347,23 @@
(is).

Example of TreeTagger analysis representation in a partition software program

Export can be done from TEI into a format used by textometric software (see ). This is the case for TXM,
See the Textométrie website, last updated June 29, 2020, .
a textometric software application. In this case, instead of using a partition representation, the information from the grammatical analysis is inserted at the word level in an XML structure. For example, in the case below, the TXM export includes TreeTagger annotations in POS, adding lemma and pos attributes to the word element w.

@@ -1218,78 +1394,113 @@
Example of TreeTagger analysis representation that can be imported into TXM
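The listing itself is likewise not preserved above; from the prose (lemma and pos attributes added to the word element w), the word-level structure is roughly the following sketch, with tag values assumed from the French TreeTagger tagset:

<u speaker="SPK1">
  <w lemma="si" pos="KON">si</w>
  <w lemma="ce" pos="PRO:DEM">c'</w>
  <w lemma="être" pos="VER:pres">est</w>
  <!-- ... one w element per token ... -->
</u>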
Comparison with Other Software Suites

The additional functionalities available in the TEICORPO suite are close to those available in the Weblicht web services (Hinrichs, Hinrichs, and Zastrow 2010). To a certain extent, the two suites of tools (Weblicht and TEICORPO) have the same purpose and functionalities. They can import data from various formats, run similar processes on the data, and export the data for scientific uses. In some cases, the services could complement each other or TEICORPO could be integrated in the Weblicht services. This is the case, for example, for handling the CHILDES format, which at the time of writing is more functional in TEICORPO than in Weblicht.

A major difference between the two suites is in the way they can be used and in the type of data they target. TEICORPO is intended to be used not as an independent tool, but as a utility tool that helps researchers to go from one type of data to another. For example, the syntactic analysis is intended to be used as a first step before being used in tools such as Praat, ELAN, or TXM. Our more recent developments (see Badin et al. 2021) made it possible to insert metadata stored in CSV files (including participant metadata) into the TEI files. This makes it possible to achieve more powerful corpus analysis using a tool such as TXM.

Our approach is somewhat similar to what is suggested in the conclusion of Schmidt, Hedeland, and Jettka (2017), who describe a mechanism that makes it possible to use the power of Weblicht to process their files that are in the ISO/TEI format. A similar mechanism could be used within TEICORPO to take advantage of the tools that are implemented in Weblicht. However, Schmidt, Hedeland, and Jettka (2017) suggest in their conclusion that it would be more interesting to work directly on ISO/TEI files because they contain a richer format. This is exactly what we did in TEICORPO. Our suggestion would be to use the tools created by Schmidt, Hedeland, and Jettka (2017) directly with the TEICORPO files, so that their work would complement ours. Moreover, in this way, the two projects would be compatible and provide either new functionalities when the projects have clearly different goals, or data variants when the goals are closer.

Conclusion

TEICORPO is a functional tool, created by the CORLI network and ORTOLANG, that converts files created by software specializing in editing spoken-language data into TEI format. The result is fully compatible with the most recent developments in TEI, especially those that concern spoken-language material.

The TEI files can also be converted back to the original formats or to other formats used in spoken-language editing to take advantage of their functionalities. This makes TEI a useful pivot format. Moreover, TEICORPO allows conversion to formats used by tools dedicated to corpus exploration and browsing.

TEICORPO exists as a command-line interface as well as a web service. It can thus be used by novice as well as advanced users, or by developers of linguistic software. The tool is free and open source so it can be further used and developed in other projects.

TEICORPO is intended to be part of a large set of tools using TEI for linguistic corpus research. It can be used in parallel with or as a complement to other tools such as Weblicht or the EXMARaLDA tools (see Schmidt, Hedeland, and Jettka 2017). A specificity of TEICORPO is that it is more suitable for processing extended forms of TEI data (especially forms which are not inside the main u element in the TEI code). TEICORPO is also linked to TEIMETA, a flexible tool for describing spoken language corpora in a web interface generated from an ODD file (Etienne, Liégois, and Parisse, accepted). As TEI enables metadata and data to be stored in the same file, sharing this format will promote metadata sharing and will keep metadata linked to their data during the life cycle of the data.

Potential further developments could provide wider coverage of different formats such as CMDI or linked data for editing or data exploration purposes; allow TEICORPO to work with other external tools such as grammatical analyzers; or enable the visualization of multilevel annotations.

diff --git a/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml b/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml
index 04ea339b..988de0ad 100644
--- a/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml
+++ b/data/JTEI/13_2020-22/jtei-cc-ra-wittern-189-source.xml
@@ -355,7 +355,7 @@
excerpt from such a text) and a number of published versions derived from it have all been encoded in TEI, with the most recent version in TEI P5 and Unicode. The most recent version of the whole textual database is available in the CBETA XML P5 GitHub repository, accessed April 20, 2020, .

@@ -489,30 +489,31 @@

By the year 2010, the practice of using separate text files for different witnesses of a text had become well established in our workflow. For tracking changes to these files, we had used version control tools from the start. At some point, we realized that the modern distributed variety of these tools, Git and GitHub, not only had the potential to solve the problem of keeping track of changes made to a file, but could also be used to hold all witnesses of a text in one repository, each of them represented as a branch. (In the terminology of version control software, a branch is one current state in the editing history of the file, which has been given a name to make it easy to address it and to track changes along a specific trajectory.)

The distributed nature of this toolchain, which unlike earlier version control systems does not require a central authority, also seemed to have the potential to solve another problem I had been trying to solve almost from the beginning of my work with digital texts. As stated already, one of the aims of my work from the outset was to make a digital version of a text at least as versatile as a printed scholarly edition. For me, this also included taking ownership of one specific copy of such an edition and tracking the work by adding marginal notes, comments, and references directly into the book. With GitHub as a repository for texts and Git as a means to control the various maintenance tasks, researchers interested in a text could clone the text, add their own marginal notes, then make their version of the text available to us or any other researcher to integrate, if we so chose.

A Git workflow can use any kind of digital material, but it works better with textual material as opposed to images or videos, and even better for texts that use lines as a structural element. This again is where the plain text we used in the Daozang jiyao project worked better than did the XML tree structure, which is at the core of every TEI file.

When I first presented this idea at the TEI conference in Würzburg in October 2011, I got this comment via a tweet from one of the most respected members of the TEI community (@rahtz: interesting that
@@ -524,28 +525,32 @@

As described in that talk (published as Wittern 2013), the text format used here is not simply plain text, but rather an extended form of the text format used in the Emacs Orgmode,
Accessed May 18, 2020, .
in spirit comparable to the much more frequently seen Markdown, but better. The defining difference here is the more elegant and functional choice of markup elements, and the fact that the format was originally conceived as the base for a note-taking and scheduling application, so the markup itself and the software that operates on it are essentially one unit, and the development of the software (which is itself community driven) informs the choices and considerations for markup constructs. For the DZJY project, we added a few more conventions, to accommodate our specific needs, but without changing any of the essential features. Org mode uses what I called an implicit markup, which is exactly the opposite of XML. Org mode's markup is as short as possible and in many cases derived from context. An asterisk * followed by a space at the start of a line indicates a heading of level one, instead of TEI's div followed by a head
For a full description of this format, see The Mandoku Text Format, accessed April 20, 2020, .
(and the corresponding closing tags to convey this information).
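To make the contrast concrete: where Org mode marks a first-level heading with nothing more than an asterisk and a space (* Heading text), the TEI equivalent spells out the full tree explicitly. A minimal sketch of that TEI side:

<div>
  <head>Heading text</head>
  <!-- ... content of the section, explicitly nested and explicitly closed ... -->
</div>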

From the beginning, the DZJY was in my view itself a pilot project for a much larger project, on which preparatory work started in earnest in 2012: the Kanseki Repository (GitHub username @kanripo).
Accessed June 24, 2020, and .
Kanseki here is the Japanese term for premodern Chinese texts, and
@@ -555,24 +560,22 @@
foundation for the creation of digital textual artifacts, based mostly on the German tradition of scholarly editing and its distinction between documentary edition and interpretative edition. These two types are distinguished through naming conventions for the Git branches. Documentary editions are also represented through digital facsimiles, which can be called up to be displayed side by side with the transcribed text. Interpretative editions may normalize the characters used to modern forms, add punctuation, and also make it possible to add translations and semantic annotations.

From earlier textual projects, such as ZenBase, CBETA, and DZJY, but also from other sources available on the Internet, we have compiled an initial catalog of about 10,000 titles to be included in a first phase of the project; this catalog is also being supplemented by users who deposit whatever texts they are interested in into the repository. Since the initial publication on GitHub in September 2015, and the launch of a dedicated website in March 2016, usage has been increasing slowly but steadily.

Kanripo Project Details

All the texts are freely available on GitHub in their source form. This repository of texts can be accessed through the kanripo.org website, but also through a module of the Emacs editor called Mandoku. This allows users to query, access, clone, edit, and push the texts directly from their own computer. Reading, commenting, and editing do not
@@ -584,32 +587,31 @@
the context of their aims—and authoritative vetting and editorial quality assurance.

demonstrate the concept and functions of the Kanseki Repository. On the website, users can search for texts or browse the catalog. Once a text is found, the webserver reads it from the GitHub repository and serves it to the user. For most texts, there are different editions to choose from; usually both documentary and interpretative versions exist. For many texts, there is also a digital facsimile, which can be called up alongside the text; if there is more than one edition documented with a digital facsimile, the others can also be directly inspected on the page for the text on the Kanseki Repository website.

A text in the Kanseki Repository

In the screenshot in , there is a link at the top of the page labeled GitHub, from which the source of the text can be directly accessed. A user who wishes to make changes to the text, by correcting, annotating, or even translating it, can transfer a copy of this text from the public @kanripo account, either by cloning it to their own account on GitHub, or by downloading it locally.

The user can also log in to the Kanripo website with their Github credentials. When this is done for the first time, the user has to grant the Kanseki Repository access to their repositories. In addition, a new repository KR-Workspace is created; some settings related to the use of the Kanseki Repository are stored here. (Most websites store this kind of information in their own database, with no direct access to it for the user. KR does it in this way to allow the user control over their data and so that the user's preferences and settings can be applied to different applications with which the user might access the KR.
@@ -625,28 +627,33 @@
distant reading, text analysis, and similar purposes, a separate account @kr-shadow
Accessed June 24, 2020, .
has been created on Github. You will find here the texts of the master branch, which is usually the normalized and edited version of the text in a form that makes it easy to download the whole archive at once.

Mandoku

As mentioned, the texts can also be accessed from the text editor Emacs, which is available on all major platforms. This is intended for people who work intensely with a text, for example as the topic for a PhD thesis. The Emacs module Mandoku
Accessed May 18, 2020, .
provides ways to search the KR, clone texts, create new branches, and many other functions. All other Emacs extensions and modules can also be used. shows an example of a text with its digital facsimile, and shows the same poems, rearranged by line, with a translation added. In the middle there is an example of an inline note. And finally, shows the same text, pushed to the user's account and displayed from there on the Kanripo website.

A text from the Kanseki Repository, side by side with a facsimile, displayed using the Emacs module Mandoku
@@ -655,9 +662,8 @@
The text with translation, now pulled from the user's GitHub account
diff --git a/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml b/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml
index 4bcce845..0c9911a9 100644
--- a/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml
+++ b/data/JTEI/14_2021-23/jtei-cc-ra-mylonas-202-source.xml
@@ -619,7 +619,7 @@
Nomisma, and CRMtex: CIDOC (International Committee for Documentation) Conceptual Reference Model, accessed July 4, 2022, ; Nomisma (knowledge organization system for numismatics), accessed July 4, 2022, ; CRMtex model for the study of ancient texts (an
diff --git a/data/JTEI/7_2014/jtei-7-dee-source.xml b/data/JTEI/7_2014/jtei-7-dee-source.xml
index 8792c1ff..8066f09f 100644
--- a/data/JTEI/7_2014/jtei-7-dee-source.xml
+++ b/data/JTEI/7_2014/jtei-7-dee-source.xml
@@ -734,9 +734,9 @@
Integrated Resources

While initiatives such as TAPAS, TEICHI, and CWRC-Writer

Welcome to CWRC Writer, CWRC-Writer Help, accessed September 7, 2013, .

have begun to address different aspects of these needs (
Cocoon and the native XML database eXist-db
eXist-db.
deserve to be mentioned. Specifically for TEI-annotated documents, TUSTEP,
Java objects. The resources are stored and maintained in a native XML database management system (i.e., eXist-db). The APIs and services provided by Lucene, a software library developed and hosted by the Apache Foundation, have been used for indexing the textual data.

@@ -646,7 +646,7 @@

The marshalling and unmarshalling process handles the serialization of the object representation of the TEI document, in order to store and retrieve data on the filesystem or in native XML databases, such as eXist-db.

Performance measurement tools such as JMeter will help to optimize the performance of the library components.

Software currently under development will be available on .
Example rs.

Reference attributes (ref) point to nodes located elsewhere in the TEI dataset. It should be noted that the organization of the TEI dataset and the location of the entity notes therein is of no importance to the reference linking
@@ -471,7 +471,7 @@
software library. One of our goals is to implement the aggregations within the digital edition, and for this we would like to use web technologies only. The D3.js (Data Driven Documents JavaScript library) created by Mike Bostock provides a framework for different visualizations. The list of examples
JavaScript and a reference to the external D3.js library. The second is a JSON file, which contains one object per entity and one associated array per object that includes a list of connected entities. The tree-like structure of XML allows the transformation of any document to a network graph by selecting elements that share the
@@ -526,8 +533,8 @@
headline is Thüringens Geschichte (History of Thuringia), which is also the topic of the following pages. The benefit of the network is that a major topic can be identified with a single view.
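A small sketch of this linking mechanism (identifiers hypothetical): every rs carrying the same ref value becomes an edge to the same entity node, wherever the entity note lives in the dataset:

<!-- in the transcription -->
<rs type="place" ref="#thueringen">Thüringens</rs> Geschichte
<!-- elsewhere in the TEI dataset -->
<place xml:id="thueringen">
  <placeName>Thüringen</placeName>
</place>

Serialized as one JSON object per entity with an array of connected entities, this is exactly the node-and-link structure a D3.js visualization consumes.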

The output of this D3.js application is an SVG graphic which can be further transformed. svg:title elements are used to store the node names, which modern browsers should display on mouseover. To get a better overview of the entities in the notebook, the node names should actually be inserted as nodes, but since there is
@@ -599,7 +599,7 @@
XSLT) and code customization were easily carried out in addition to our regular work within the Fontane edition project. These efforts were facilitated by a spirit of openness shared by all parties involved: both the D3.js library and the SIMILE Timeline widget are open-source software released under a BSD license; the data sources GND, GeoNames, and OpenStreetMap have permissive licenses—Creative Commons Zero (CC0), Creative Commons
diff --git a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml
index 0cefc121..751edd6f 100644
--- a/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml
+++ b/data/JTEI/8_2014-15/jtei-8-rosselli-source.xml
@@ -116,7 +116,11 @@
it comes to image-based digital editions. The different needs of scholars, coupled with the constant search for an effective price/result ratio and the local availability of technical skills, have led to a remarkable fragmentation: publishing solutions range from simple HTML pages produced using the TEI stylesheets (or the TEI Boilerplate software) to very complex frameworks based on CMS and SQL search engines. Researchers of the Digital Vercelli Book project started looking into a simple, user-friendly solution and eventually decided to build their own: EVT (Edition Visualization Technology) has been under
@@ -158,28 +162,39 @@

in favor of a web-based publication. While this decision was critical in that it allowed us to select the most supported and widely-used medium, we soon discovered that it did not make choices any simpler. On the one hand, the XSLT stylesheets provided by TEI are great for HTML rendering, but do not include support for image-related features (such as the text-image linking available thanks to the P5 version of the TEI schema) and tools (including zoom in/out, magnifying lens, and hot spots) that represent a significant part of a digital facsimile and/or diplomatic edition; other features, such as an XML search engine, would have to be integrated separately, in any case. On the other hand, there are powerful frameworks based on CMS
The Omeka framework () supports publishing TEI documents; see also Drupal () and TEICHI ().
and other web technologies
Such as the eXist XML database, .
which looked far too complex and expensive, particularly when considering future maintenance needs, for our project's purposes. Other solutions, such as the EPPT software
Edition Production and Presentation Technology, .
developed by K. Kiernan or the Elwood viewer
Elwood Viewer, .
created by G. Lyman, either were not yet ready or were unsuitable for other reasons (proprietary software, user interface issues, specific hardware and/or software requirements).

Standard vs. Fragmentation
@@ -196,9 +211,10 @@
First Experiments

At first, however, EVT was more an experimental research project for students at the Informatica Umanistica course of the University of Pisa
BA course, .
than a real attempt to solve the digital edition viewer problem. We aimed at investigating some user interface–related aspects of such a viewer, in particular certain usability problems
@@ -218,9 +234,12 @@
- The Current EVT Version + The Current + EVT Version
- EVT v. 2.0: Rebooting the Project + + EVT + v. 2.0: Rebooting the Project

To get out of the impasse we decided to completely reboot the project, removing secondary features and giving priority to fundamental ones. We also found a solution for the data-loading problem: instead of finding a way to load the data into the software we
@@ -229,51 +248,63 @@
text, with very little configuration needed to create the edition. This approach also allowed us to quickly test XML files belonging to other edition projects, to check if EVT could go beyond being a project-specific tool. The inspiration for these changes came from work done in similar projects developed within the TEI community, namely TEI Boilerplate,
TEI Boilerplate, .
John A. Walsh's collection of XSLT stylesheets,
tei2html, .
and Solenne Coutagne's work for the Berliner Intellektuelle 1800–1830 project.
Digitale Edition Briefe und Texte aus dem intellektuellen Berlin um 1800, .

Through this approach, we achieved two important results: first, usage of EVT is quite simple—the user applies an XSLT stylesheet to their already marked-up file(s), and when the processing is finished they are presented with a web-ready edition; second, the web edition that is produced is based on a client-only architecture and does not require any additional kind of server software, which means that it can be simply copied on a web server to be used at once, or even on a cloud storage service (provided that it is accessible by the general public).

To ensure that it will be working on all the most recent web browsers, and for as long as possible on the World Wide Web itself, EVT is built on open and standard web technologies such as HTML, CSS, and JavaScript. Specific features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen from the best-supported open-source ones to reduce the risk of future incompatibilities. The general architecture of the software, in any case, is modular, so that any component which may cause trouble or turn out to be not completely up to the task can be replaced easily.

How it Works

Our ideal goal was to have a simple, very user-friendly drop-in tool, requiring little work and/or knowledge of anything beyond XML from the editor. To reach this goal, EVT is based on a modular structure where a single stylesheet (evt_builder.xsl) starts a chain of XSLT 2.0 transformations calling in turn all the other modules. The latter belong to two general categories: those devoted to building the HTML site, and the XML processing ones, which extract the edition text lying between folios using the pb element and format it according to the edition level. All XSLT modules live inside the builder_pack folder, in order to have a clean and well-organized directory hierarchy.

The EVT builder_pack directory structure.

Therefore, assuming the available formatting stylesheets meet your project's criteria,
@@ -286,46 +317,50 @@
evt_builder-conf.xsl, to specify for example the number of edition levels or presence of images; you can then apply the evt_builder.xsl stylesheet to your TEI XML document using the Oxygen XML editor or another XSLT 2–compliant engine.

The EVT data directory structure.

When the XSLT processing is finished, the starting point for the edition is the index.html file in the root directory, and all the HTML pages resulting from the transformations will be stored in the output_data folder. You can delete everything in this latter folder (and the index.html file), modify the configuration options, and start again, and everything will be re-created in the assigned places.

The XSLT stylesheets

The transformation chain has two main purposes: generate the HTML files containing the edition and create the home page which will dynamically recall the other HTML files.

The EVT builder's transformation system is composed of a modular collection of XSLT 2.0 stylesheets: these modules are designed to permit scholars to freely add their own stylesheets and to manage the different desired levels of the edition without influencing other parts of the system, for instance the generation of the home page.

The transformation is performed applying a specific XSLT stylesheet (evt_builder.xsl) which includes links to all the other stylesheets that are part of the transformation chain and that will be applied to the TEI XML document containing the transcription.
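A minimal sketch of how such a chain-starting stylesheet can be wired; module file names other than evt_builder.xsl and evt_builder-conf.xsl are hypothetical:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- configuration: number of edition levels, presence of images, ... -->
  <xsl:include href="evt_builder-conf.xsl"/>
  <!-- hypothetical modules for the HTML site and for the edition levels -->
  <xsl:include href="modules/site/html_site.xsl"/>
  <xsl:include href="modules/elements/dipl.xsl"/>
  <xsl:template match="/">
    <!-- hand the TEI document to the included modules -->
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>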

EVT can be used to create image-based editions with different edition levels starting from a single encoded text. The text of the transcription must be divided into smaller parts to recreate the physical structure of the manuscript. Therefore, it is essential that paginated XML documents are marked using a TEI page break element (pb) at the start of each new page or folio side, so that the transformation system will be able to recognize and handle everything that stands between a pb element and the next one as the content of a single page.
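A minimal sketch of the required markup, using the recto/verso numbering described below:

<body>
  <pb n="104v"/>
  <!-- everything here belongs to folio 104 verso ... -->
  <pb n="105r"/>
  <!-- ... and everything here to folio 105 recto -->
</body>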

The system is designed to generate an arbitrary number of edition levels: as a consequence, the user is required to indicate how many (and which) output levels they intend to create by modifying the corresponding parameter in the configuration file.

@@ -360,17 +395,19 @@
xsl:apply-templates select="current-group()" mode="dipl" instruction before its content is inserted into the diplomatic output file.

Using XSLT modes it is possible to separate the rules for the different transformations of a TEI element and to recall other XSLT stylesheets in order to manage the transformations or send different parts of a document to different parts of the transformation chain. This permits the extraction of different texts for different edition levels (diplomatic, diplomatic-interpretative) processing the same XML file, and to save them in the HTML site structure, which is available as a separate XSLT module.

The use of modes also allows users to separate template rules for the different transformations of a TEI element and to place them in different XSLT files or in different parts of a single stylesheet. So templates such as the following
and personalize the edition generation parameter as shown above;
copy their own XSLT files containing the template rules to generate the desired edition levels in the directory that contains the stylesheets used for TEI element transformation (builder_pack/modules/elements);
@@ -403,25 +440,30 @@
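Since the original template example is not preserved above, here is a hedged sketch of the mechanism: the same TEI element matched in two modes, one per edition level (assuming the usual tei namespace binding; template bodies are illustrative only):

<xsl:template match="tei:choice" mode="dipl">
  <!-- diplomatic level: keep the abbreviation as written -->
  <xsl:apply-templates select="tei:abbr" mode="dipl"/>
</xsl:template>
<xsl:template match="tei:choice" mode="interp">
  <!-- diplomatic-interpretative level: prefer the expansion -->
  <xsl:apply-templates select="tei:expan" mode="interp"/>
</xsl:template>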

For the time being, this kind of customization has to be done by hand-editing the configuration files, but in a future version of EVT we plan to add a more user-friendly way to configure the system.

Features

At present, EVT can be used to create image-based editions with two possible edition levels: diplomatic and diplomatic-interpretative; this means that a transcription encoded using elements belonging to the appropriate TEI module
See chapter 11, Representation of Primary Sources, in the TEI Guidelines.
should already be compatible with EVT, or require only minor changes to be made compatible. The Vercelli Book transcription schema is based on the standard TEI schema, with no custom elements or attributes added: our tests with similarly encoded texts showed a high grade of compatibility. A critical edition level is currently being researched and it will be added in the future.

When the website produced by EVT is loaded in a browser, the viewer will be presented with the manuscript image on the left side, and the corresponding text on the right: this is the default view, but on the main toolbar at the top right corner of the browser window there are icons to access all the available views:
Image-Text view: as mentioned above, this is the default view showing a manuscript folio image and the corresponding text in one or more edition levels;
@@ -445,8 +487,9 @@
required by the editor. The only necessary requirement at the encoding level, in fact, is that the editor should encode folio numbers by means of the pb element including r and v letters to mark recto and verso pages, respectively. EVT will take care of automatically associating each folio to the images copied in the input_data/images folder using a verso-recto naming scheme (for example: 104v-105r.png). It is of course possible that in some cases the transformation process is unable to return the correct result: this is why we decided to
@@ -454,8 +497,8 @@
independent from the HTML interface; this file will be updated automatically every time the transformation process is started and can be customized by the editor.

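To make the requirement concrete: a transcription needs nothing more elaborate than the following page breaks (the use of the n attribute to carry the folio number is an assumption here; the article does not name the attribute EVT reads):

  <pb n="104v"/>
  <!-- transcription of folio 104 verso -->
  <pb n="105r"/>
  <!-- transcription of folio 105 recto -->

EVT would then associate both folios with the image file 104v-105r.png found in input_data/images.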
Although the different views access different kinds of content, such as single side and - double side images, the navigation algorithms used by EVT allow the user to move from - one view to another without losing the current browsing position.

+ double side images, the navigation algorithms used by + EVT allow the user to move from one view to another + without losing the current browsing position.

All content is shown inside HTML frames designed to be as flexible as possible. No matter what view one is currently in, one can expand the desired frame to focus on its specific content, temporarily hiding the other components of the user interface. It is @@ -488,8 +533,8 @@ target="http://www.tapor.uvic.ca/~mholmes/image_markup/">Image Markup Tool

The UVic Image Markup Tool Project, .

software - and was implemented in XSLT and CSS; all the other features are achieved by + and was implemented in XSLT and CSS; all the other features are achieved by using jQuery plug-ins.

In the text frame tool bar you can see three drop-down menus which are useful for choosing texts, specific folios, and edition levels, and an icon that triggers the @@ -499,30 +544,39 @@

A First Use Case -

On December 24, 2013, after extensive testing and bug fixing work, the EVT team - published a beta version of the Digital - Vercelli Book edition,

Full announcement on the project blog, On December 24, 2013, after extensive testing and bug fixing work, the + EVT team published a beta version of the Digital Vercelli Book + edition,

Full announcement on the project blog, . The beta edition is directly accessible at .

soliciting feedback from all interested parties. Shortly afterwards, the version of the - EVT software we used, improved by more bug fixes and small enhancements, was made - available for the academic community on + EVT software we used, improved by more bug fixes and + small enhancements, was made available for the academic community on the project’s SourceForge site.

Edition Visualization Technology: Digital edition visualization - software, .

+ software, .

- The Digital Vercelli Book edition based on EVT v. 0.1.48. - Image-text linking is active. + The Digital Vercelli Book edition based on + EVT + v. 0.1.48. Image-text linking is active.

Future Developments -

EVT development will continue during 2014 to fix bugs and to improve the current set of - features, but there are also several important features that will be added or that we are - currently considering for inclusion in EVT. Some of the planned features will require +

+ EVT development will continue during 2014 to fix bugs + and to improve the current set of features, but there are also several important features + that will be added or that we are currently considering for inclusion in + EVT. Some of the planned features will require fundamental changes to the software architecture to be implemented effectively: this is probably the case for the Digital Lightbox (see ), which requires a client-server architecture (

New Layout -

One important aspect that has been introduced in the current version of EVT is a - completely revised layout: the current user interface includes all the features which - were deemed necessary for the Digital Vercelli Book beta, but it also is ready to accept - the new features planned for the short and medium terms. Note that nontrivial changes to - the general appearance and layout of the resulting web edition will be necessary, and - this is especially the case for the XML search engine and for the critical edition - support. Fortunately the basic framework is flexible enough to be easily expanded by - means of new views or a redesign of the current ones.

+

One important aspect that has been introduced in the current version of + EVT is a completely revised layout: the current + user interface includes all the features which were deemed necessary for the Digital + Vercelli Book beta, but it also is ready to accept the new features planned for the + short and medium terms. Note that nontrivial changes to the general appearance and + layout of the resulting web edition will be necessary, and this is especially the case + for the XML search engine and for the critical edition support. Fortunately the basic + framework is flexible enough to be easily expanded by means of new views or a redesign + of the current ones.

Search Engine -

The EVT search engine is already working and being tested in a separate development - branch of the software; merging into the main branch is expected as soon as the user - interface is finalized. It was implemented with the goal of keeping it simple and usable - for both academics and the general public.

+

The + EVT search engine is already working and being + tested in a separate development branch of the software; merging into the main branch is + expected as soon as the user interface is finalized. It was implemented with the goal of + keeping it simple and usable for both academics and the general public.

To achieve this goal we began by studying various solutions that could be used as a basis for our efforts. In the first phases of this study we looked at the principal XML - databases, such as BaseX, eXist, etc., and we found a solution by envisioning EVT as - a distributed application using the client-server architecture. For this test we - selected the eXist

eXist-db, .

open source XML database, and in a - relatively short time we created, sometimes by trial-and-error, a prototype that queried - the database for keywords and highlighted them in context.

+ databases, such as + BaseX, + eXist, etc., and we found a solution by envisioning + + EVT as a distributed application using the + client-server architecture. For this test we selected the + eXist

eXist-db, .

open source XML database, and + in a relatively short time we created, sometimes by trial-and-error, a prototype that + queried the database for keywords and highlighted them in context.

While this model was a step in the right direction and partially operational, we also felt that it was not sufficiently user-friendly, which is a critical goal of the entire project. In fact, forcing the editor to install and configure specific server software @@ -564,7 +628,8 @@ could be accessed anywhere, and possibly distributed in optical formats (CD or DVD). Forcing the prerequisites of an Internet connection and of dependency on a server-based XML database would have undermined our original goal. Going the database route was no - longer an option for a client-only EVT and we immediately felt the need to go back to + longer an option for a client-only + EVT and we immediately felt the need to go back to our original architecture to meet this standard. This sudden turnaround marked another chapter in the research process and brought us to the current implementation of EVT Search.

@@ -578,43 +643,59 @@ expected by the user. Essentially, we found that at least two of them were needed in order to make a functional search engine: free-text search and keyword highlighting. To implement them we looked at existing search engines and plug-ins programmed in the most - popular client-side web language: JavaScript. In the - end, our search produced two answers: Tipue Search and DOM + popular client-side web language: JavaScript. In the end, our search produced two + answers: + Tipue Search and DOM manipulation.

- Tipue Search -

Tipue search

Tipue Search, - .

is a jQuery plug-in + + Tipue Search +

+ Tipue search

+ Tipue Search, .

is a jQuery plug-in search engine released under the MIT license and aimed at indexing and searching large collections of web pages. It can function both offline and online, and it does not necessarily require a web server or a server-side programming/query language (such as SQL, PHP, or Python) in order to work. While technically a plug-in, its architecture is quite interesting and versatile: Tipue uses a combination of client-side JavaScript for the actual bulk of the work, and JSON (or JavaScript object literal) for storing the content. By - accessing the data structure, this engine is able to search for a relevant term and - bring back the matches.

-

Tipue Search operates in three modes: - in Static mode, Tipue Search operates without a web server by + type="software" xml:id="R57" target="#JavaScript"/>JavaScript for the actual bulk of the work, and JSON (or JavaScript + object literal) for storing the content. By accessing the data structure, this engine + is able to search for a relevant term and bring back the matches.

+

+ Tipue Search operates in three modes: + in Static mode, + Tipue Search operates without a web server by accessing the contents stored in a specific file (tipuedrop_content.js); these contents are presented in JSON format; - in Live mode, Tipue Search operates with a web server by indexing - the web pages included in a specific file - (tipuesearch_set.js); - in JSON mode, Tipue Search operates with a web server by using - AJAX to load JSON data stored in specific files (as defined by the user). + in Live mode, + Tipue Search operates with a web server by + indexing the web pages included in a specific file + (tipuesearch_set.js); + in JSON mode, + Tipue Search operates with a web server by + using AJAX to load JSON data stored in specific files (as defined by the + user).

This plug-in suited our needs very well, but had to be modified slightly in order to - accommodate the requirements of the entire project. Before using Tipue to handle the - search we needed to generate the data structure that was going to be used by the - engine to perform the queries. We explored some existing XSL stylesheets aimed at TEI - to JSON transformation, but we found them too complex for the task at hand. So we - modified our own stylesheets to produce the desired output.

+ accommodate the requirements of the entire project. Before using + Tipue to handle the search we needed to generate + the data structure that was going to be used by the engine to perform the queries. We + explored some existing XSL stylesheets aimed at TEI to JSON transformation, but we + found them too complex for the task at hand. So we modified our own stylesheets to + produce the desired output.

This output consists of two JSON files: diplomatic.json contains the text of the diplomatic edition of the Vercelli Book; @@ -623,19 +704,25 @@

These files are produced by including two templates in the overall flow of XSLT transformations that extract crucial data from the TEI documents and format them with JSON syntax. The procedure complements well the entire logic of - automatic self-generation that characterizes EVT.

+ automatic self-generation that characterizes + EVT.

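The two templates are not reproduced in the article; a minimal sketch of the idea, assuming a Tipue-style record with title, text, and loc fields (the field names and the page-by-page grouping are assumptions, not the project's actual code):

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Emit one crude JSON record per page break of the edition. -->
  <xsl:template match="/">
    <xsl:text>var tipuesearch = {"pages": [</xsl:text>
    <xsl:for-each select="//tei:pb">
      <xsl:if test="position() &gt; 1"><xsl:text>,</xsl:text></xsl:if>
      <xsl:text>{"title": "fol. </xsl:text>
      <xsl:value-of select="@n"/>
      <xsl:text>", "text": "</xsl:text>
      <!-- Stand-in for the real extraction: a full version would gather
           all text up to the next pb and escape JSON metacharacters. -->
      <xsl:value-of select="normalize-space(following::text()[1])"/>
      <xsl:text>", "loc": "#</xsl:text>
      <xsl:value-of select="@n"/>
      <xsl:text>"}</xsl:text>
    </xsl:for-each>
    <xsl:text>]};</xsl:text>
  </xsl:template>
</xsl:stylesheet>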
After we managed to extract the correct data structure, we began to include the - search functionality in EVT. By using the logic behind Tipue JSON mode, we implemented - a trigger (in the form of a select tag) that loaded the desired JSON data - structure to handle the search (diplomatic or facsimile, as mentioned above) and a - form that managed the query strings and launched the search function. Additionally, we - decided to provide the user with a simple virtual keyboard composed of essential keys - related to the Anglo-Saxon alphabet used in the Vercelli Book.

-

The performance of Tipue Search was deemed acceptable and our tests showed that even - large collections of data did not pose any particular problem.

+ search functionality in + EVT. By using the logic behind + Tipue JSON mode, we implemented a trigger (in + the form of a select tag) that loaded the desired JSON data structure to handle the + search (diplomatic or facsimile, as mentioned above) and a form that managed the query + strings and launched the search function. Additionally, we decided to provide the user + with a simple virtual keyboard composed of essential keys related to the Anglo-Saxon + alphabet used in the Vercelli Book.

+

The performance of + Tipue Search was deemed acceptable and our tests + showed that even large collections of data did not pose any particular problem.

Experimental search interface. @@ -644,20 +731,23 @@
Keyword Highlighting through DOM Manipulation

The solution to keyword highlighting was found while searching many plug-ins that - deal with this very problem. All these plug-ins use JavaScript and DOM manipulation in order to wrap the HTML text nodes that - match the query with a specific tag (a span or a user-defined tag) and a CSS class to - manage the style of the highlighting. While this implementation was very simple and - self-explanatory, making use of simple recursive functions on relevant HTML nodes has - proved to be very difficult to apply to the textual contents handled by EVT.

-

HTML text within EVT is represented as a combination of text nodes and span - elements. These spans are used to define the characteristics of the current selected - edition. They contain both philological information about the inner workings of the - text and information about its visual representation. Very often the text is composed - of spans that handle different versions of words (such as the sub-elements of the TEI - choice element) or highlight an area of a word (based on the TEI - hi element, for example).

+ deal with this very problem. All these plug-ins use JavaScript and DOM + manipulation in order to wrap the HTML text nodes that match the query with a specific + tag (a span or a user-defined tag) and a CSS class to manage the style of the + highlighting. While this implementation was very simple and self-explanatory, making + use of simple recursive functions on relevant HTML nodes has proved to be very + difficult to apply to the textual contents handled by + EVT.

+

HTML text within + EVT is represented as a combination of text nodes + and span elements. These spans are used to define the characteristics of the + current selected edition. They contain both philological information about the inner + workings of the text and information about its visual representation. Very often the + text is composed of spans that handle different versions of words (such as the + sub-elements of the TEI choice element) or highlight an area of a word (based + on the TEI hi element, for example).

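A constructed illustration (class names invented, not EVT's real ones): a TEI fragment such as

  <w><hi rend="initial">H</hi>wæt</w>

ends up in the HTML edition as something like

  <span class="w"><span class="hi-initial">H</span>wæt</span>

so the word Hwæt is split across two text nodes, H and wæt, and a plug-in that scans single text nodes for the string Hwæt never finds a match.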
This type of markup would not have constituted a problem if it had wrapped complete words, since the plug-ins could recursively explore its content and search for a matching term. In certain portions of the text, however, some letters are separated by @@ -700,30 +790,35 @@ information about the image, but is placed inside a zone element, which defines two-dimensional areas within a surface, and is transcribed using one or more line elements.

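A minimal sketch of the embedded transcription method just described (coordinates, image name, and line text are invented for illustration):

  <sourceDoc>
    <surface ulx="0" uly="0" lrx="2000" lry="3000">
      <graphic url="104v.png"/>
      <zone ulx="120" uly="150" lrx="980" lry="210">
        <line>her onginneð seo boc</line>
      </zone>
    </surface>
  </sourceDoc>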
-

Originally EVT could not handle this particular encoding method, since the XSLT stylesheets could only process TEI XML documents encoded according to the - traditional transcription method. Since we think that this is a concrete need in many - cases of study (mainly epigraphical inscriptions, but also manuscripts, at least in - some specific cases), we recently added a new feature that will allow EVT to handle - texts encoded according to the embedded transcription method. This work was possible - due to a small grant awarded by EADH.

See EADH Small Grant: Call for - Proposals, .

+

Originally + EVT could not handle this particular encoding + method, since the XSLT stylesheets could only process TEI XML + documents encoded according to the traditional transcription method. Since we think + that this is a concrete need in many cases of study (mainly epigraphical inscriptions, + but also manuscripts, at least in some specific cases), we recently added a new + feature that will allow + EVT to handle texts encoded according to the + embedded transcription method. This work was possible due to a small grant awarded by + EADH.

See EADH Small Grant: Call for Proposals, .

Support for Critical Edition

One important feature whose development will start at some point this year is the - support for critical editions, since at the present moment EVT allows dealing only - with diplomatic and interpretative ones. We aim not only to offer full support for the - TEI Critical Apparatus module, but also to find an innovative layout that can take - advantage of the digital medium and its dynamic properties to go beyond the - traditional, static, printed page: The layers of footnotes, - the multiplicity of textual views, the opportunities for dramatic visualization - interweaving the many with each other and offering different modes of viewing the - one within the many—all this proclaims I am a hypertext: invent a dynamic device - to show me. The computer is exactly this dynamic device (Robinson 2005, § 12).

+ support for critical editions, since at the present moment + EVT allows dealing only with diplomatic and + interpretative ones. We aim not only to offer full support for the TEI Critical + Apparatus module, but also to find an innovative layout that can take advantage of the + digital medium and its dynamic properties to go beyond the traditional, static, + printed page: The layers of footnotes, the multiplicity of + textual views, the opportunities for dramatic visualization interweaving the many + with each other and offering different modes of viewing the one within the many—all + this proclaims I am a hypertext: invent a dynamic device to show me. The + computer is exactly this dynamic device (Robinson 2005, § 12).

A digital edition can, of course, maintain the traditional layout, possibly moving the apparatus from the bottom of the page to a more convenient position, but could and should also explore different ways of organizing and displaying the connection between @@ -742,43 +837,48 @@
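For reference, the core mechanism of the Critical Apparatus module that EVT will have to render is the app entry; a minimal example recording one variant across two witnesses (sigla and readings invented):

  <app>
    <lem wit="#A">sweordum</lem>
    <rdg wit="#B">swurdum</rdg>
  </app>

The layout question discussed below is precisely where and how such entries should surface in the interface.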

Some of the problems related to this approach are related to the user interface and the way it should be designed in order to be usable and useful: how to conceive and where to place the graphical widgets holding the critical apparatus, how to integrate - these UI elements in EVT, how to contextualize the variants and navigate through the - witnesses’ texts, and more. There are other problems, for instance scalability issues - (how to deal with very big textual traditions that count tens or even hundreds of - witnesses?) or the handling of texts produced by collation software, which strictly - depend on the current TEI Critical Apparatus module. Considering that there is a - subgroup of the TEI’s Manuscript Special Interest Group devoted to significantly - improving this module, we can only hope that at least some of these problems will be - addressed in a future version.

+ these UI elements in + EVT, how to contextualize the variants and + navigate through the witnesses’ texts, and more. There are other problems, for + instance scalability issues (how to deal with very big textual traditions that count + tens or even hundreds of witnesses?) or the handling of texts produced by collation + software, which strictly depend on the current TEI Critical Apparatus module. + Considering that there is a subgroup of the TEI’s Manuscript Special Interest Group + devoted to significantly improving this module, we can only hope that at least some of + these problems will be addressed in a future version.

- Digital Lightbox + + Digital Lightbox

Developed first at the University of Pisa, and then at King’s College London as part of the DigiPal

DigiPal: Digital Resource and Database of Palaeography, Manuscript Studies and Diplomatic, .

project, the Digital Lightbox

A beta - version is available at .

is a web-based visualization framework which aims to support - historians, paleographers, art historians, and others in analyzing and studying digital - reproductions of cultural heritage objects. The methodology of research inspiring - development of this tool is to study paleographic elements in a qualitative way, helping - scholars’ interpretations as much as possible, and therefore to reject any automatic - methods such as pattern recognition and clustering which are supposed to return - quantitative and objective results. Although ongoing projects making use of these - computational methods are very promising, the results that may be obtained at this time - are still significantly less precise (with regard to specific image features, at least) - than those produced through human interpretation.

-

Initially developed exclusively for paleographic research, the Digital Lightbox may be - used with any type of image because it includes a set of general graphic tools. Indeed, - the application allows a detailed and powerful analysis of one or more images, arranged - in up to two available workspaces, providing tools for manipulation, management, - comparison, and transformation of images. The development of this project is - consistently tested by paleographers at King’s College London working on the DigiPal - project, who are using the web application as a support for analyzing and gathering - samples of paleographic elements.

+ target="http://lightbox-dev.dighum.kcl.ac.uk"> + Digital Lightbox

A beta version is + available at .

is a + web-based visualization framework which aims to support historians, paleographers, art + historians, and others in analyzing and studying digital reproductions of cultural + heritage objects. The methodology of research inspiring development of this tool is to + study paleographic elements in a qualitative way, helping scholars’ interpretations as + much as possible, and therefore to reject any automatic methods such as pattern + recognition and clustering which are supposed to return quantitative and objective + results. Although ongoing projects making use of these computational methods are very + promising, the results that may be obtained at this time are still significantly less + precise (with regard to specific image features, at least) than those produced through + human interpretation.

+

Initially developed exclusively for paleographic research, the + Digital Lightbox may be used with any type of image + because it includes a set of general graphic tools. Indeed, the application allows a + detailed and powerful analysis of one or more images, arranged in up to two available + workspaces, providing tools for manipulation, management, comparison, and transformation + of images. The development of this project is consistently tested by paleographers at + King’s College London working on the DigiPal project, who are using the web application + as a support for analyzing and gathering samples of paleographic elements.

The software offers a rich set of tools: besides basic functions such as resizing, rotation, and dragging, it is possible to use a set of filters—such as opacity, brightness, color inversion, grayscale effect, and contrast—which, used in combination, @@ -795,79 +895,108 @@ Lightbox.

-

Collaboration is a very important characteristic of Digital Lightbox: what makes this - tool stand apart from all the image-editing applications available is the possibility of - creating and sharing the work done using the software framework. First, you can create - collections of images and then export them to the local disk as an XML file; this - feature not only serves as a way to save the work, but also to share specific - collections with other users. Moreover, it is possible to export (and, consequently, to - import) working sessions, or, in other words, the current status of the work being done - using the application: in fact, all the images, letters, and notes present on the - workspace will be saved when the user leaves and restored when they log in again. These - features have been specifically created to encourage sharing and to make collaborative - work more effective and easy. Thanks to a new HTML5 feature, it is possible to support - the importing of images from the local disk to the application without any server-side +

Collaboration is a very important characteristic of + Digital Lightbox: what makes this tool stand apart + from all the image-editing applications available is the possibility of creating and + sharing the work done using the software framework. First, you can create collections of + images and then export them to the local disk as an XML file; this feature not only + serves as a way to save the work, but also to share specific collections with other + users. Moreover, it is possible to export (and, consequently, to import) working + sessions, or, in other words, the current status of the work being done using the + application: in fact, all the images, letters, and notes present on the workspace will + be saved when the user leaves and restored when they log in again. These features have + been specifically created to encourage sharing and to make collaborative work more + effective and easy. Thanks to a new HTML5 feature, it is possible to support the + importing of images from the local disk to the application without any server-side function.

-

Digital Lightbox has been developed using some of the latest web technologies - available, such as HTML5, CSS3, the front-end framework Bootstrap,

Bootstrap, .

and the JavaScript (ECMAScript 6) programming language, in combination with the jQuery library.

.

The code architecture has been designed - to be modular and easily extensible by other developers or third parties: indeed, it has - been released as open source software on GitHub,

Digital - Lightbox, .

and is freely available to be downloaded, edited, and tinkered +

+ Digital Lightbox has been developed using some of + the latest web technologies available, such as HTML5, CSS3, the front-end framework Bootstrap,

Bootstrap, .

and the JavaScript (ECMAScript 6) programming language, + in combination with the jQuery + library.

.

The code architecture has been designed to be modular and easily + extensible by other developers or third parties: indeed, it has been released as open + source software on GitHub,

+ Digital Lightbox, .

and is freely available to be downloaded, edited, and tinkered with.

-

The Digital Lightbox represents a perfect complementary feature for the EVT project: a - graphic-oriented tool to explore, visualize, and analyze digital images of manuscripts. - While EVT provides a rich and usable interface to browse and study manuscript texts - together with the corresponding images, the tools offered by the Digital Lightbox allow - users to identify, gather, and analyze visual details which can be found within the - images, and which are important for inquiries relating, for instance, to the style of - the handwriting, decorations on manuscript folia, or page layout.

-

An effort to adapt and integrate the Digital Lightbox into EVT is already underway, - making it available as a separate, image-centered view, but there is a major hurdle to - overcome: some of the DL features are only possible within a client-server architecture. - Since EVT or, more precisely, a separate version of EVT will migrate to this - architecture, at some point in the future it will be possible to integrate a full - version of the DL. Plans for the current, client-only version envision implementing all - those features that do not depend on server software: even if this means giving up - interesting features such as collaborative work and annotation, we believe that even a - subset of the available tools will be an invaluable help for manuscript image analysis. - Furthermore, as noted above, thanks to HTML5 and CSS3 it will become more and more - feasible to implement features in a client-only mode.

+

The + Digital Lightbox represents a perfect complementary + feature for the + EVT project: a graphic-oriented tool to explore, + visualize, and analyze digital images of manuscripts. While + EVT provides a rich and usable interface to browse + and study manuscript texts together with the corresponding images, the tools offered by + the Digital Lightbox allow users to identify, gather, and analyze visual details which + can be found within the images, and which are important for inquiries relating, for + instance, to the style of the handwriting, decorations on manuscript folia, or page + layout.

+

An effort to adapt and integrate the Digital Lightbox into + EVT is already underway, making it available as a + separate, image-centered view, but there is a major hurdle to overcome: some of the DL + features are only possible within a client-server architecture. Since + EVT or, more precisely, a separate version of + EVT will migrate to this architecture, at some point + in the future it will be possible to integrate a full version of the DL. Plans for the + current, client-only version envision implementing all those features that do not depend + on server software: even if this means giving up interesting features such as + collaborative work and annotation, we believe that even a subset of the available tools + will be an invaluable help for manuscript image analysis. Furthermore, as noted above, + thanks to HTML5 and CSS3 it will become more and more feasible to implement features in + a client-only mode.

New Architecture

In September 2013 we met with researchers of the Clavius on the Web project

See . A - preliminary test using a previous version of EVT is available at + EVT is available at .

to discuss a possible use of - EVT in order to visualize the documents that they are collecting and encoding; the main - goal of the project is to produce a web-based edition of all the correspondence of this - important sixteenth–seventeenth-century mathematician.

Currently preserved at - the Archivio della Pontificia Università Gregoriana.

The integration of - EVT with another web framework used in the project, the eXist XML database, will require - a very important change in how the software works: as mentioned above, everything from - XSLT processing to browsing of the resulting website has been done on the client - side, but the integration with eXist will require a move to the more complex - client-server architecture. A version of EVT based on this architecture would present + + EVT in order to visualize the documents that they + are collecting and encoding; the main goal of the project is to produce a web-based + edition of all the correspondence of this important sixteenth–seventeenth-century + mathematician.

Currently preserved at the Archivio della Pontificia + Università Gregoriana.

The integration of + EVT with another web framework used in the project, + the eXist XML database, will require a very important change in how the software works: + as mentioned above, everything from XSLT processing to browsing of the resulting + website has been done on the client side, but the integration with + eXist will require a move to the more complex + client-server architecture. A version of + EVT based on this architecture would present several advantages, not only the integration of a powerful XML database, but also the - implementation of a full version of the Digital Lightbox. We will try to make the move - as painless as possible and to preserve the basic simplicity and flexibility that has - been a major feature of EVT so far. The client-only version will not be abandoned, - though for quite some time there will be parallel development with features trickling - from one version to the other, with the client-only one being preserved as a subset of - the more powerful one.

+ implementation of a full version of the + Digital Lightbox. We will try to make the move as + painless as possible and to preserve the basic simplicity and flexibility that has been + a major feature of + EVT so far. The client-only version will not be + abandoned, though for quite some time there will be parallel development with features + trickling from one version to the other, with the client-only one being preserved as a + subset of the more powerful one.

@@ -876,32 +1005,36 @@ to the publishing of TEI-encoded digital editions, this software has grown to the point of being a potentially very useful tool for the TEI community: since it requires little configuration, and no knowledge of programming languages or web frameworks except for what - is needed to apply an XSLT stylesheet, it represents a user-friendly method + is needed to apply an XSLT stylesheet, it represents a user-friendly method for producing image-based digital editions. Moreover, its client-only architecture makes it very easy to test the edition-building process (one has only to delete the output folders and start anew) and publish preliminary versions on the web (a shared folder on any cloud-based service such as Dropbox is all that is needed).

-

While EVT has been under development for 3–4 years, it was thanks to the work and focus - required by the Digital Vercelli Book release at end of 2013 that we now have a solid - foundation on which to build new features and refine the existing ones. Some of the future - expansions also pose important research questions: this is the case with the critical - edition support, which touches a sensitive area of the very recent digital philology - discipline.

Digital philology makes use of ICT methods and tools, such as text - encoding, in the context of textual criticism and philological study of documents to - produce digital editions of texts. While many of the first examples of such editions - were well received (see for instance Kiernan - 2013; also see Siemens 2012 for an - example of the new theoretical possibilities allowed by the digital medium), serious - doubts concerning not only their durability and maintainability, but also their - methodological effectiveness, have been raised by some scholars. The debate is still - ongoing, see Gabler 2010, Robinson 2005 and 2013, Rosselli Del Turco, - forthcoming.

The collaborative work features of the Digital - Lightbox are also critical to the way modern scholars interact and share their research - findings. Finally, designing a user interface capable of hosting all the new features, - while remaining effective and user-friendly, will itself be very challenging.

+

While + EVT has been under development for 3–4 years, it was + thanks to the work and focus required by the Digital Vercelli Book release at the end of 2013 + that we now have a solid foundation on which to build new features and refine the existing + ones. Some of the future expansions also pose important research questions: this is the + case with the critical edition support, which touches a sensitive area of the very recent + digital philology discipline.

Digital philology makes use of ICT methods and + tools, such as text encoding, in the context of textual criticism and philological + study of documents to produce digital editions of texts. While many of the first + examples of such editions were well received (see for instance Kiernan 2013; also see Siemens 2012 for an example of the new theoretical + possibilities allowed by the digital medium), serious doubts concerning not only their + durability and maintainability, but also their methodological effectiveness, have been + raised by some scholars. The debate is still ongoing, see Gabler 2010, Robinson + 2005 and 2013, Rosselli Del Turco, forthcoming.

The + collaborative work features of the + Digital Lightbox are also critical to the way modern + scholars interact and share their research findings. Finally, designing a user interface + capable of hosting all the new features, while remaining effective and user-friendly, will + itself be very challenging.

diff --git a/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml b/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml index 5fae6897..6f7d0f71 100644 --- a/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml +++ b/data/JTEI/9_2016-17/jtei-9-armaselu-source.xml @@ -1121,7 +1121,7 @@ Commonwealth Office, Western Organisations Department: Registered Files (W and WD Series). Western European Union (WEU). Future of Standing Armaments Committee of Western European Union. 01/01/1975–31/12/1975, FCO 41/1749 (Former Reference Dep: WDU 11/1 PART B). The interpretation of the less predictable results is not straightforward, since they may have been determined by an under- or overrepresentation of certain elements in the discourse, diff --git a/data/JTEI/9_2016-17/jtei-9-turska-source.xml b/data/JTEI/9_2016-17/jtei-9-turska-source.xml index d3ca9a92..50f7b614 100644 --- a/data/JTEI/9_2016-17/jtei-9-turska-source.xml +++ b/data/JTEI/9_2016-17/jtei-9-turska-source.xml @@ -303,7 +303,7 @@ SARIT or Buddhist Stonesutras or experiments with EEBO-TCPEarly English Books Online eXist-db app, accessed February 11, 2016, . are more than promising (see, for example, Wicentowski and diff --git a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml index 3d41b158..0eed9e2e 100644 --- a/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml +++ b/data/JTEI/rolling_2021/jtei-vagionakis-204-source.xml @@ -73,14 +73,14 @@

The paper presents the database Cretan Institutional Inscriptions, which was created as part of a PhD research project carried out at the University of Venice Ca’ Foscari. The database, built using the EpiDoc Front-End Services (EFES) platform, collects the EpiDoc editions of six hundred inscriptions that shed light on the institutions of the political entities of Crete from the seventh to the first century BCE. The aim of the paper is to outline the main issues addressed during the creation of the database and the encoding of the inscriptions and to illustrate the core features of the database, with an emphasis on the advantages deriving from the combined use of the TEI-EpiDoc standard and of the EFES platform.

@@ -124,15 +124,15 @@ document. The editions of these inscriptions, along with a collection of the most relevant literary sources, have been collected in the database Cretan Institutional Inscriptions, which I created using the EpiDoc Front-End Services (EFES) platform. To facilitate consulting the epigraphic records, the database also includes, in addition to the ancient sources, two catalogs providing information about the Cretan political entities and the institutional elements considered.

The aim of this paper is to illustrate the main issues tackled during the creation of the database and to examine the choices made, focusing on the advantages offered by - the use of EpiDoc and EFES.

+ the use of EpiDoc and EFES.

Cretan Epigraphy and Cretan Institutions @@ -246,7 +246,7 @@
Towards the Creation of a Born-Digital Epigraphic Collection with EFES

Once the relevant material had been defined, another major issue that I had to face was to decide how to deal efficiently with it. While I was in the process of starting @@ -269,20 +269,21 @@ collection of editions of the previously selected six hundred inscriptions to creating it as a born-digital epigraphic collection because of another event that also happened in 2017: the appearance of a powerful new tool for digital epigraphy, - EpiDoc Front-End Services (EFES).

GitHub repository, accessed July - 21, 2021, .

Although I - was already aware of the many benefits deriving from a semantic markup of the - inscriptions,

On which see - and .

what - really persuaded me to adopt a TEI-based approach for the creation of my epigraphic - editions was actually the great facilitation that EFES offered in using - TEI-EpiDoc, which I will discuss in the following section.

+ EpiDoc Front-End Services (EFES).

GitHub repository, accessed July + 21, 2021, .

Although I was already aware of the many benefits + deriving from a semantic markup of the inscriptions,

On which see and .

what really persuaded me to + adopt a TEI-based approach for the creation of my epigraphic editions was actually + the great facilitation that EFES offered in using TEI-EpiDoc, which I will + discuss in the following section.

- The Benefits of Using EpiDoc and EFES + The Benefits of Using EpiDoc and EFES

I was already familiar with the epigraphic subset of the TEI standard, EpiDoc,

EpiDoc: Epigraphic Documents in TEI XML, accessed July 21, 2021,

This is particularly true for the creation of publishable output of the encoded - inscriptions. The EpiDoc Reference XSLT Stylesheets, created for transformation of - EpiDoc XML files into HTML,

Accessed July 21, 2021, .

require - relatively advanced knowledge of XSLT to use them to produce a satisfying HTML - edition for online publication or to generate a printable PDF. Not to mention the - creation of a complete searchable database to be published online, equipped with - indexes and appropriate search filters: this is far beyond the IT skills of the - average epigraphist.

+ inscriptions. The EpiDoc Reference XSLT Stylesheets, created for + transformation of EpiDoc XML files into HTML,

Accessed July 21, 2021, + .

require relatively advanced knowledge of XSLT to use + them to produce a satisfying HTML edition for online publication or to generate a + printable PDF. Not to mention the creation of a complete searchable database to be + published online, equipped with indexes and appropriate search filters: this is far + beyond the IT skills of the average epigraphist.

The situation is a little better for those who use EpiDoc as a tool for simplifying their research work on a collection of ancient documents, without aiming at the publication of the encoded inscriptions. The querying of a set of EpiDoc inscriptions is possible to some extent even without technical support: in some advanced XML - editors, particularly Oxygen, it is possible to perform XPath queries that allow the - identification of all the occurrences of specific features in the epigraphic - collection according to their markup. The XPath queries in an advanced XML editor - also allow the creation of lists of specific elements mentioned in the inscriptions, - but to my knowledge the creation of proper indexes—before EFES—was - almost impossible to achieve without the help of an IT expert.

+ editors, particularly + Oxygen, it is possible to perform XPath queries + that allow the identification of all the occurrences of specific features in the + epigraphic collection according to their markup. The XPath queries in an advanced XML + editor also allow the creation of lists of specific elements mentioned in the + inscriptions, but to my knowledge the creation of proper indexes—before EFES—was almost impossible to achieve without the help of an IT expert.

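For example, with the tei prefix bound to the TEI namespace, one can evaluate XPath 2.0 expressions in Oxygen such as the following (the @type value institution follows this project's conventions and is an assumption here, not an EpiDoc requirement):

  //tei:rs[@type='institution']
  distinct-values(//tei:rs[@type='institution']/normalize-space(.))

The first returns every tagged institutional term in the collection; the second gives a rough de-duplicated list of their surface forms, which is still a long way from a proper index.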
Thus, despite the many benefits that EpiDoc encoding potentially offers, epigraphists might often be discouraged from adopting it by the amount of time that such an approach requires, combined with the fact that in many cases these benefits become tangible only at the end of the work, and only if one has IT support.

In light of these limitations, it is easy to understand how deeply the release of - EFES has transformed the field of digital epigraphy. EFES, developed at the Institute of Classical Studies of the School of - Advanced Study of the University of London as the epigraphic specialization of the - Kiln platform,

New Digital Publishing Tool: EpiDoc Front-End - Services, September 1, 2017, EFES has transformed the field of digital epigraphy. EFES, developed + at the Institute of Classical Studies of the School of Advanced Study of the + University of London as the epigraphic specialization of the + Kiln platform ,

New + Digital Publishing Tool: EpiDoc Front-End Services, September 1, + 2017, ; see also the Kiln GitHub repository, accessed July 21, 2021,.

is a platform that - simplifies the creation and management of databases of inscriptions encoded following - the EpiDoc Guidelines. More specifically, EFES was developed to make - it easy for EpiDoc users to view a publishable form of their inscriptions, and to - publish them online in a full-featured searchable database, by easily ingesting - EpiDoc texts and providing formatting for their display and indexing through the - EpiDoc reference XSLT stylesheets. The ease of configuration of the XSLT + />; see also the Kiln GitHub repository, accessed July 21, 2021, + .

is a + platform that simplifies the creation and management of databases of inscriptions + encoded following the EpiDoc Guidelines. More specifically, + EFES was developed to make it easy for EpiDoc + users to view a publishable form of their inscriptions, and to publish them online in + a full-featured searchable database, by easily ingesting EpiDoc texts and providing + formatting for their display and indexing through the EpiDoc + reference XSLT stylesheets. The ease of configuration of the XSLT transformations, and the possibility of already having, during construction, an immediate front-end visualization of the desired final outcome of the TEI-EpiDoc marked-up documents, allow smooth creation of an epigraphic database even without a - large team or in-depth IT skills. Beyond this, EFES is also remarkable for + large team or in-depth IT skills. Beyond this, EFES is also remarkable for the ease of creation and display of the indexes of the various categories of marked-up terms, which significantly simplifies comparative analysis of the data - under consideration. EFES is thus proving to be an extremely useful + under consideration. EFES is thus proving to be an extremely useful tool not only for publishing inscriptions online, but also for studying them before their publication or even without the intention of publishing them, especially when dealing with large collections of documents and data sets.

See Bodard and Yordanova (2020).

-

Some of these useful features of EFES are common to other existing tools, - such as TEI Publisher,

Accessed July 21, 2021, .

- TAPAS,

Accessed July 21, 2021, Some of these useful features of EFES are common to other existing tools, + such as TEI Publisher,

Accessed July 21, 2021, + .

+ TAPAS,

Accessed July 21, 2021, .

or Kiln itself, which is - EFES’s direct ancestor. What makes EFES unique, however, is the + fact that it is the only one of those tools to have been designed specifically for + epigraphic purposes and to be deeply integrated with the EpiDoc Schema/Guidelines and + with its reference stylesheets. Not only does it use, by default, the EpiDoc + reference stylesheets for transforming the inscriptions and for indexing, it also + comes with a set of default search facets and indexes that are specifically meant for + epigraphic documents. The default facets include the findspot of the inscription, its + place of origin, its current location, its support material, its object type, its + document type, and the type of evidence of its date. The search/browse page, + moreover, also includes a slider for filtering the inscriptions by date and a box for + textual searches, which can be limited to the indexed forms of the terms. The default + indexes include places, personal names (onomastics), identifiable persons + (prosopography), divinities, institutions, words, lemmata, symbols, numerals, + abbreviations, and uninterpreted text fragments. New facets and indexes can easily be + added even without mastering XSLT, along the lines of the existing ones and by + following the detailed instructions provided in the EFES Wiki + documentation.

Accessed July 21, 2021, EFES’s direct ancestor. What makes EFES unique, however, is the + fact that it is the only one of those tools to have be designed specifically for + epigraphic purposes and to be deeply integrated with the EpiDoc Schema/Guidelines and + with its reference stylesheets. Not only does it use, by default, the EpiDoc + reference stylesheets for transforming the inscriptions and for indexing, it also + comes with a set of default search facets and indexes that are specifically meant for + epigraphic documents. The default facets include the findspot of the inscription, its + place of origin, its current location, its support material, its object type, its + document type, and the type of evidence of its date. The search/browse page, + moreover, also includes a slider for filtering the inscriptions by date and a box for + textual searches, which can be limited to the indexed forms of the terms. The default + indexes include places, personal names (onomastics), identifiable persons + (prosopography), divinities, institutions, words, lemmata, symbols, numerals, + abbreviations, and uninterpreted text fragments. New facets and indexes can easily be + added even without mastering XSLT, along the lines of the existing ones and by + following the detailed instructions provided in the EFES Wiki + documentation.

Accessed July 21, 2021, . Creation of new facets, last updated April 11, 2018: . Creation of new indexes, last updated May 27, 2020: .

- Furthermore, EFES makes it possible to create an epigraphic concordance of the + Furthermore, EFES makes it possible to create an epigraphic concordance of the various editions of each inscription and to add information pages as TEI XML files (suitable for displaying both information on the database itself and potential additional accompanying information).

Against this background, the combined use of the EpiDoc encoding and of the EFES tool seemed to be a promising approach for the purposes of my research - project, and so it was.

+ type="software" xml:id="R26" target="#EFES"/> + EFES tool seemed to be a promising approach for + the purposes of my research project, and so it was.

I initially aimed to create updated digital editions of the inscriptions mentioning Cretan institutional elements that could be used to facilitate a comparative analysis of the latter. The ability to generate and view the indexes of the mentioned @@ -425,18 +435,18 @@ inscriptions in EpiDoc, totally met my needs, and helped me very much in the identification of recurring patterns. As I was expected to submit my doctoral thesis in PDF format, I also needed to convert the epigraphic editions into PDF, and by - running EFES locally I have been able to view their transformed HTML + running EFES locally I have been able to view their transformed HTML versions on a browser and to naively copy and paste them into a Microsoft Word file.

I am very grateful to Pietro Maria Liuzzo for teaching me how to avoid this conversion step by using XSL-FO, which can be used to generate a PDF directly from the raw XML files. The use of XSL-FO, however, requires some additional skills that are not needed in the copy-and-paste-from-the-browser process.

Although I had not planned it from the beginning, EFES also proved to be useful in the (online) publication of the results of - my research. The ease with which EFES allows the creation of a searchable + my research. The ease with which EFES allows the creation of a searchable epigraphic database, in fact, spontaneously led me to decide to publish it online once completed, making available not only the HTML editions—which can also be downloaded as printable PDFs—but also the raw XML files for reuse. The aim of the @@ -447,8 +457,8 @@
Cretan Institutional Inscriptions: An Overview of the Database -

The core of the EFES-based database Cretan + <p>The core of the <ptr type="software" xml:id="R30" target="#EFES"/><rs + type="soft.name" ref="#R30">EFES</rs>-based database <title level="m">Cretan Institutional Inscriptions consists of the EpiDoc editions of the previously selected six hundred inscriptions, which can be exported both in PDF and in their original XML format. Each edition is composed of an essential descriptive @@ -486,8 +496,8 @@ >Political entities, Institutions, Literary sources, and Bibliographic references, have been added to the database as pages generated from TEI - XML files, which could be natively included in EFES.

+ XML files, which could be natively included in EFES.

As mentioned above, the database also includes several thematic indexes listing the marked-up terms along with the references to the inscriptions in which they occur, divided into institutions, toponyms and ethnic adjectives, lemmata (both of @@ -663,8 +673,8 @@ type="crossref"/> (I.Cret. II 23 5). -

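The encoded excerpt referred to above did not survive extraction; a hypothetical reconstruction of what such an institutional mention could look like in EpiDoc (key value and attribute choice are illustrative, not necessarily the database's actual encoding):

  <rs type="institution" key="kosmos">κόσμοι</rs>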
Given the markup described above, EFES was able to generate detailed indexes +

Given the markup described above, EFES was able to generate detailed indexes having the appearance of rich tables, where each piece of information is displayed in a dedicated column and can easily be combined with the other ones at a glance.

In the most complex case, that of the institutions, the index displays for each @@ -702,8 +712,8 @@ An excerpt from the prosopographical index.

In addition to the more tabular institutional and - prosopographical indexes, EFES facilitated the creation of other more + prosopographical indexes, EFES facilitated the creation of other more traditional indexes, including the indexed terms and the references to the inscriptions that mention them. The encoding of the most significant words with w lemma="" led to the creation of a word index of relevant @@ -721,10 +731,10 @@

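Concretely, each indexed word points back to its dictionary form, along the lines of (an illustrative token, not quoted from the corpus):

  <w lemma="θεός">θεοῖς</w>

so that every inflected occurrence is filed in the word index under its lemma.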
Conclusions

In conclusion, I would like to emphasize how particularly efficient the combined use - of EpiDoc and EFES has proven to be for the creation of a thematic database - like Cretan Institutional Inscriptions. By collecting in a searchable database all - the inscriptions pertaining to the Cretan institutions, records that were hitherto + of EpiDoc and EFES has proven to be for the creation of a thematic database like + Cretan Institutional Inscriptions. By collecting in a searchable database all the + inscriptions pertaining to the Cretan institutions, records that were hitherto accessible only in a scattered way, Cretan Institutional Inscriptions is a new resource that can facilitate the finding, consultation, and reuse of these very heterogeneous documents, many of which offer further points of reflection only when diff --git a/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml index fdcaa0fc..f6423955 100644 --- a/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml +++ b/data/JTEI/rolling_2022/jtei-mitiku-212-source.xml @@ -132,7 +132,7 @@ from manuscripts, to be published alongside the catalogue description of the manuscript itself, we have investigated a series of options, among which we have chosen to use the Transkribus software by READ Coop.Accessed February 2, 2022, .

@@ -151,7 +151,7 @@ an historical catalogue that involves copying from the former cataloguer's transcription. Having a new transcription, based on autopsy or at least on the images of the manuscript, would be preferable, and technology such as Transkribus allows one to obtain this transcription in an almost entirely automated way. Additionally, most of the internal referencing within a manuscript is done with the indication of the ranges of folios, and in TEI with @@ -202,7 +202,7 @@

The following steps have been taken to carry out an investigation of the possibilities for the automated production of text transcriptions based on images of manuscripts, before we opted for Transkribus + target="#transkribus"/>Transkribus and its integration in the workflow to make texts available in the Beta maṣāḥǝft research environment.

@@ -291,7 +291,7 @@ one script.

- Transkribus

This software is freely accessible and has a subscription model based on credits. The platform was created within the framework of the EU projects @@ -304,7 +304,7 @@ platform. The Pattern Recognition and Human Language Technology (PRHLT) group of the Universitat Politècnica de València and the CITlab group of the University of Rostock should be mentioned in particular.

-

Transkribus comes as an expert tool in its downloadable version and its online version,Accessed February 2, 2022,

Thus, the first stage for developing a model was gathering the data and preparing an initial dataset. Also for this aspect, Transkribus + target="#transkribus"/>Transkribus proved superior to all other options, offering support for this step as well. Colleagues whom we asked to contribute could be added to a collection, share their images without publishing them, and add their transcriptions in the tool with a very mild learning curve.

-

Within Transkribus we have trained a model called Manuscripts from Ethiopia and Eritrea in Classical Ethiopic (Gǝʿǝz).See, accessed February 2, 2022,

Training a model in Transkribus

Gathering data to train an HTR model in Transkribus + target="#transkribus"/>Transkribus was not easy. Researchers were directly asked to contribute images for which they had already produced a correct transcription. Sets of images with their corresponding transcriptions were thus obtained thanks to the generosity of contributors listed @@ -437,7 +437,7 @@ for the available time of the colleagues to fix the work of the machine, since we intended to train the model again. After three months with a full-time dedicated person, we had more than 50k words in the Transkribus + target="#transkribus"/>Transkribus expert tool, and we could train a model which could be made public, since this is the unofficial threshold to make a model available to everyone.

The features of the final model can be seen in

Adding transcriptions to Beta maṣāḥǝft from Transkribus

 Even if a user has already worked through each page of a manuscript to produce a transcription, doing it again with Transkribus and checking it has many advantages, chiefly the alignment of the text regions and lines on the base image to the transcription. Guidelines for these steps are provided to users in the project Guidelines,
@@ -470,7 +470,7 @@

 Once the images have been transcribed, either by hand with the help of the tool or using the HTR model, the export functionalities of the Transkribus tool allow one to download a TEI-encoded version of the transcription, where we encourage users to use line breaks (lb) instead of l and to preserve the coordinates of the boxes.
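As a rough sketch, not taken from this diff, of what such an export looks like once line breaks and box coordinates are kept: each recognized line is anchored via facs to a zone on the page image (all coordinates and file names here are invented):

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- teiHeader omitted for brevity -->
  <facsimile>
    <surface>
      <graphic url="page001.jpg"/>
      <!-- One zone per recognized text line, with invented pixel coordinates. -->
      <zone xml:id="r1l1" ulx="210" uly="340" lrx="1780" lry="410"/>
      <zone xml:id="r1l2" ulx="212" uly="420" lrx="1775" lry="490"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <p>
        <!-- lb rather than l, each line anchored to its zone via facs. -->
        <lb facs="#r1l1"/>First transcribed line
        <lb facs="#r1l2"/>Second transcribed line
      </p>
    </body>
  </text>
</TEI>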

@@ -487,7 +487,7 @@

 We have then prepared a bespoke XSLT transformation, called transkribus2Beta maṣāḥǝft.xsl, which can be used to transform the rich TEI from Transkribus. This transformation, given a few parameters,
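The body of that stylesheet does not appear in this diff; purely as a sketch of the general shape such a transformation might take, an identity transform with a project-specific hook (the parameter name and the root-element handling are assumptions, not the project's actual code):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei" version="2.0">

  <!-- Hypothetical parameter: the Beta maṣāḥǝft ID of the manuscript. -->
  <xsl:param name="msid"/>

  <!-- Copy the Transkribus export unchanged by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Example adjustment: record the manuscript ID on the root element. -->
  <xsl:template match="tei:TEI">
    <TEI xml:id="{$msid}">
      <xsl:apply-templates select="@* except @xml:id | node()"/>
    </TEI>
  </xsl:template>
</xsl:stylesheet>

@@ -510,7 +510,7 @@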

Conclusions

 Working with Transkribus for the Beta maṣāḥǝft project gives the community of users a way to support the process of transcribing the text of source manuscripts without typing it out. This is not intended to substitute the
@@ -612,7 +612,7 @@
 Weidemann, Herbert Wurster, and Konstantinos Zagoris. 2019. Transforming scholarship in the archives through handwritten text recognition: <ptr type="software"
-  xml:id="Transkribus" target="#Transkribus"/><rs type="soft.name"
+  xml:id="Transkribus" target="#transkribus"/><rs type="soft.name"
   ref="#Transkribus">Transkribus</rs> as a case study. Journal of Documentation, 75 (5)
diff --git a/schema/tei_jtei_annotated.odd b/schema/tei_jtei_annotated.odd
index 59ac324c..666508b5 100644
--- a/schema/tei_jtei_annotated.odd
+++ b/schema/tei_jtei_annotated.odd
@@ -2210,225 +2210,238 @@
@@ -3488,59 +3501,13 @@
-Bibliography entry for software (Bib.Soft): The bibliography contains an entry for the software itself. This entry may include the name of the software, the names of responsible agents, a URL, a PID, version information, etc.
-Bibliography entry for a reference publication (Bib.Ref): The bibliography contains an entry for a publication about the software.
-Software named only (Name.Only): The software is mentioned by name only.
-Responsible agents named (Agent): Persons, groups, or institutions responsible for developing the software are mentioned by name.
-URL: The citation contains a URL that points to the software itself (e.g., to a web page about the software, a code repository, a metadata record, or an executable version). URLs pointing to publications about the software (e.g., journal articles, books, or user manuals) are not counted.
-Persistent identifier (PID): The citation contains a persistent identifier (PID), e.g., a DOI, that points to the software itself (e.g., to a web page about the software, a code repository, a metadata record, or an executable version). PIDs pointing to publications about the software (e.g., journal articles, books, or user manuals) are not counted.
-Version (Ver): The citation specifies a particular software version or revision and, where necessary, further specifications (e.g., a version for a specific operating system, a particular software package, or a date).
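In the ODD, entries of this kind are conventionally encoded as valItem elements inside a valList such as the one matched by the Schematron rule below; purely as orientation, a sketch of the first removed entry (the original markup is stripped above and may have differed):

<valList type="closed">
  <!-- Sketch only: reconstructed shape, not the original markup. -->
  <valItem ident="Bib.Soft">
    <gloss>Bibliography entry for software (Bib.Soft)</gloss>
    <desc>The bibliography contains an entry for the software itself; it may
      include the software's name, responsible agents, a URL, a PID, or
      version information.</desc>
  </valItem>
</valList>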
diff --git a/schema/tei_jtei_annotated.rng b/schema/tei_jtei_annotated.rng
index 94e0eeaf..52920f4f 100644
--- a/schema/tei_jtei_annotated.rng
+++ b/schema/tei_jtei_annotated.rng
@@ -5,7 +5,7 @@
 xmlns="http://relaxng.org/ns/structure/1.0"
 datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
 ns="http://www.tei-c.org/ns/1.0">
@@ -18,9 +19,21 @@
 match="//tei:TEI/tei:text/tei:back/tei:div/tei:schemaSpec/tei:dataSpec[@ident='software.mention.target']/tei:content[1]/tei:alternate[1]/tei:valList[1]">
diff --git a/utilities/ids2lowercase.xsl b/utilities/ids2lowercase.xsl
new file mode 100644
index 00000000..c09a02f1
--- /dev/null
+++ b/utilities/ids2lowercase.xsl
@@ -0,0 +1,22 @@
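The 22 added lines of utilities/ids2lowercase.xsl are stripped above. Going only by the file name and by the #Transkribus to #transkribus retargeting earlier in this diff, a stylesheet of this kind might look roughly like the following identity transform (a sketch under those assumptions, not the actual file):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Sketch: copy the document unchanged by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Lowercase every xml:id value. -->
  <xsl:template match="@xml:id">
    <xsl:attribute name="xml:id" select="lower-case(.)"/>
  </xsl:template>

  <!-- Lowercase same-document pointers so they keep matching their targets,
       e.g. target="#Transkribus" becomes target="#transkribus". -->
  <xsl:template match="@target[starts-with(., '#')] | @ref[starts-with(., '#')]">
    <xsl:attribute name="{name()}" select="lower-case(.)"/>
  </xsl:template>
</xsl:stylesheet>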