diff --git a/.github/workflows/schema.yml b/.github/workflows/schema.yml
index 627a11fc..a911337f 100644
--- a/.github/workflows/schema.yml
+++ b/.github/workflows/schema.yml
@@ -29,6 +29,8 @@ jobs:
git config --local user.name "GitHub Action"
git add schema/tei_software_annotation.xml
git add schema/tei_software_annotation.rng
+ git add schema/tei_jtei_annotated.odd
+ git add schema/tei_jtei_annotated.rng
git commit -m "Add updated odd and generated rng"
- name: Push changes
uses: ad-m/github-push-action@master
diff --git a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml
index 9ec8142f..c62b4c2b 100644
--- a/data/JTEI/10_2016-19/jtei-10-haaf-source.xml
+++ b/data/JTEI/10_2016-19/jtei-10-haaf-source.xml
@@ -211,9 +211,10 @@
target="http://www.deutschestextarchiv.de/doku/software#cab"/>. as
well as collaborative text
correction and annotation
The
@corresp="#animal"
).
- In horse
is the
+
In horse
is the
value for the attribute horse
since 1517 at maximum (corresponding to the first Spanish
@@ -2485,8 +2489,8 @@
For the issues regarded as the most fundamentally important to creating a dynamic and
sustainable model for both etymology and general lexicographic markup in TEI, we have
- submitted formal requests for changes to the TEI
Although this paper is TEI-centered, other XML technologies will be mentioned.
The idea of basing corpus texts directly on
+ theCorpus Resource Database (CoRD) (): Placing the Helsinki Corpus Middle English Section Introduction into
+ Context ,The idea of basing corpus texts directly on manuscript sources has been presented more recently.
The principles of preparing manuscript texts for print have undergone changes during the history of editing
Either way, in order to avoid this potential maintenance nightmare,
Remember, the TEI Guidelines are written in TEI. The
source to all of P5 is a single TEI document, although for convenience it is
split into well over 850 separate files.
The
The For example, the
@@ -902,10 +907,13 @@
target="http://www.tei-c.org/Vault/P5/3.3.0/xml/tei/custom/schema/relaxng/tei_customization.rnc"
/>.
Furthermore, the current version of
Because P5 does not use the ID/IDREF mechanism, the only one of
the three added constraints that is useful is (2), that the value of
@@ -1157,7 +1166,8 @@
The goal of the CORLI consortium is to make it easier to deposit, share, and reuse data. With this goal in mind, CORLI has always promoted the use of open public repositories and open formats. Our policy is to advocate for the use of a common single
@@ -213,14 +231,18 @@
Many software packages dedicated to editing spoken language transcription contain
- utilities that can convert many formats: for example,
The list of tools that are considered in the two projects is nearly the same. The only
- tools missing in the TEICORPO approach are
There are, however, differences between the two projects that make them nonredundant but complementary, each project having specificities that can be useful or damaging
- depending on the user’s needs. One minor difference is that the TEICORPO project is not
- a functionality of an editing tool, but is a standalone tool for converting data between
- one format and another. This had certain effects on the user interface and explains some
- of the choices made in the development of the two tools.
-There are two major differences between TEICORPO and Schmidt’s approach, which affected
+ depending on the user’s needs. One minor difference is that the
There are two major differences between
The second major difference is that the TEICORPO initiative does not target only spoken
+ developing
The second major difference is that the
Data in PRAAT and
Data in
Processing of the microstructure, with the exception of information already available
- in the tools themselves (for example silence in Transcriber), is not done during the
- conversion to TEI. The division into words or other elements such as morphemes or
- phonemes is not systematically done in any of the tools used by researchers in CORLI.
- When it exists, it is not included in the main transcription line but most often in
- dependent lines, as it represents an annotation with its own rules and guidelines.
- Division into words or other elements is part of the linguistic analysis rather than a
- simple storage operation.
-TEICORPO therefore preserves as long-term storage data both the original information - that was created in the original software—the full unprocessed transcription—and the - other linguistically processed transcriptions and annotations. For TEICORPO, - microstructure processing, such as division into words, or text standardization when - necessary, belongs to the linguistic analysis of the corpora. Hence, the TEI data file - can be used both for data exploration and for scientific purposes. For example, when a - researcher needs to parse the data, or to explore the data with textometric tools, - then it is necessary to decide which type of preprocessing is necessary. As this - decision often depends on the initial project as well as on linguistic choices, it is - difficult to standardize this task.
+ in the tools themselves (for example silence in
The TEICORPO project contains two different sets of tools. One set focuses on conversion
- between various software packages used for spoken language coding and TEI. The other set
- focuses on using the TEI format for linguistic analyses (textometric or grammatical
- analyses).
+ The
The
Some common practices have been identified in our community but other uses of the same software are of course possible:
It should be pointed out here that whereas Transcriber and CLAN files nearly always
contain
The list of tools reflects the uses and practices in the CORLI network, and is very
similar to the list suggested by Schmidt (2011) with the exception of
Alignment applications deal with two main types of data presentation and organization. The presentation of the data has direct consequences for how the data are exploited, and therefore on the design of the tools that are used.
@@ -391,35 +459,42 @@
chronologically but is sorted by the names of the tiers (or any other order), with all the production within the same tier sorted by timeline.
-No tool offers both types of presentation.
No tool offers both types of presentation.
Each presentation format has its own pros and cons. Because of the possibilities offered by the presentation formats, and because the same software, even within the same presentation models, rarely provides a solution for all the needs of all users, researchers often have to use two or more pieces of software.
-The use of multiple tools is quite common. For example, Praat and Transcriber cannot be
- used when working on video recordings because these programs are limited to audio
- formats. But if researchers need to conduct spectral analysis for some purpose, they
- will have to use the Praat software and convert not only the transcription, but also the
- media. In the field of language acquisition, where the CLAN software is generally used
- to describe both the child productions and the adult productions, when researchers are
- interested in gestures, they use the
The use of multiple tools is quite common. For example,
Converting the metadata is straightforward, as the four tools (CLAN,
Converting the metadata is straightforward, as the four tools (
Moreover, some tools, such as Transcriber, include information about silences,
- pauses, and events in their XML format. This information is also processed within
- TEICORPO, once again following the recommendations of the ISO/TEI standard.
+ pauses, and events in their XML format. This information is also processed within
Conversion of the main data, the transcription and the annotations, cannot always be
done solely on the basis of the description provided in the ISO/TEI guidelines. These
- guidelines do, however, suffice to fully describe the content of the CLAN and
- Transcriber software. We took advantage of the new
The
In the TEICORPO approach, no modification is made to the original format and conversion
- remains as lossless as possible. This allows for all types of corpora to be stored for
- long-term preservation purposes. It also allows the corpora to be used with other
- editing tools, some of which are suited to specific processing: for example, Praat for
- phonetics/phonology; Transcriber/CLAN for raw transcription; and
In the
However, a large proportion of scientific research and applications done using corpora requires further processing of the data. For example, although querying or using raw language forms is possible, many research investigations and tools use words, parts of
@@ -754,8 +867,9 @@
structure. This microstructure is integrated in Schmidt’s approach, in which the TEI file can contain standardized information about words, specific spoken language information, and sometimes even POS information.
-This approach was not adopted in TEICORPO for several reasons. First, we had to deal
- with a large variety of coding approaches, which makes it difficult to conduct work
+
This approach was not adopted in
The command-line interface (see
In addition to the conversion to and from the alignment software, the online version of
- TEICORPO offers import and export in common spreadsheet formats (.xlsx and .csv) and
- word processing formats (.docx and .txt). Importing data is useful to create new data,
- and exporting is used to make reports or examples for a publication and for end users
- not familiar with transcription tasks or computer software (see
The environment variable TREE_TAGGER can be used to locate the model and the program.
- If no -program
option is used, the default name for the TreeTagger
- program is used.
-program
option is used, the default name for the The -model
parameter is mandatory.
The resulting filename ends with .tei_corpo_ttg.tei_corpo.xml
or a
specific name provided by the user (option -o
).
The Stanford Core Natural Language Processing Accessed March 11, 2021, Accessed March 11, 2021, Accessed May 5, 2021,
The The results from the grammatical analysis can be used in transcription files such as
- those used by Praat and An example is presented below in the An example is presented below in the Export can be done from TEI into a format used by textometric software (see See the Textométrie website, last updated June 29, 2020, See the Textométrie website, last updated June 29, 2020, The additional functionalities available in the TEICORPO suite are close to those
- available in the Weblicht web services (Hinrichs, Hinrichs, and Zastrow 2010). To a certain extent, the two suites of
- tools (Weblicht and TEICORPO) have the same purpose and functionalities. They can import
- data from various formats, run similar processes on the data, and export the data for
- scientific uses. In some cases, the services could complement each other or TEICORPO
- could be integrated in the Weblicht services. This is the case, for example, for
- handling the CHILDES format, which at the time of writing is more functional in TEICORPO
- than in Weblicht. The additional functionalities available in the A major difference between the two suites is in the way they can be used and in the
- type of data they target. TEICORPO is intended to be used not as an independent tool,
- but as a utility tool that helps researchers to go from one type of data to another. For
- example, the syntactic analysis is intended to be used as a first step before being used
- in tools such as Praat, Our approach is somewhat similar to what is suggested in the conclusion of Schmidt,
Hedeland, and Jettka (2017), who describe a
- mechanism that makes it possible to use the power of Weblicht to process their files
- that are in the ISO/TEI format. A similar mechanism could be used within TEICORPO to
- take advantage of the tools that are implemented in Weblicht. However, Schmidt,
- Hedeland, and Jettka (2017) suggest in
- their conclusion that it would be more interesting to work directly on ISO/TEI files
- because they contain a richer format. This is exactly what we did in TEICORPO. Our
- suggestion would be to use the tools created by Schmidt, Hedeland, and Jettka (2017) directly with the TEICORPO files, so
- that their work would complement ours. Moreover, in this way, the two projects would be
- compatible and provide either new functionalities when the projects have clearly
- different goals, or data variants when the goals are closer.
java -cp "teicorpo.jar:directory_for_SNLP/*" fr.ortolang.teicorpo.TeiSNLP
@@ -1136,7 +1307,7 @@
-cp
to indicate that 5 GB of memory will be used
for a full English analysis).-a
option of TEICORPO, or limiting the display with the annotation
- tool.-a
option of
TEICORPO is a functional tool, created by the CORLI network and ORTOLANG, that converts
- files created by software specializing in editing spoken-language data into TEI format.
- The result is fully compatible with the most recent developments in TEI, especially those
- that concern spoken-language material.
+The TEI files can also be converted back to the original formats or to other formats used
in spoken-language editing to take advantage of their functionalities. This makes TEI a
- useful pivot format. Moreover, TEICORPO allows conversion to formats used by tools
+ useful pivot format. Moreover,
TEICORPO exists as a command-line interface as well as a web service. It can thus be used
- by novice as well as advanced users, or by developers of linguistic software. The tool is
- free and open source so it can be further used and developed in other projects.
-TEICORPO is intended to be part of a large set of tools using TEI for linguistic corpus
- research. It can be used in parallel with or as a complement to other tools such as
- Weblicht or the
Potential further developments could provide wider coverage of different formats such as
- CMDI or linked data for editing or data exploration purposes; allow TEICORPO to work with
- other external tools such as grammatical analyzers; or enable the visualization of
- multilevel annotations.
+ CMDI or linked data for editing or data exploration purposes; allow