Detailed syntax of information elements in the SIS

Piotr Banski edited this page Jun 13, 2024 · 9 revisions

(*** See an error, omission, obsolete information below? Let us know by opening a new issue report with one click. Thanks! ***)

This page provides more details on editing information elements in the SIS.

1. The easy way

First of all, if you choose to clone or fork the entire standards repository, editing XML information will be made easier thanks to the associated document grammars that provide some content completion or warn you about errors. That should work out of the box for any reasonably modern XML editor that recognises XML Schema and Schematron associations.

2. The target

Data deposition format recommendations are stored in the directory /SIS/clarin/data/recommendations/.

3. Digression: content

In the process of preparing format recommendations, some information is completely predefined: the data domain names and the recommendation levels. In a schema-aware editor, the XML Schema offers these as drop-down selections; otherwise you are down to copy and paste, and in the crucial places the SIS makes that easier by providing buttons that copy names to the clipboard. This applies to domain names and also to the data deposition formats that have been described in the SIS.

Sometimes, the format that a centre recommends (or discourages, etc.) will not (yet) be described by the SIS. A list of such formats, which have no information pages of their own but are nevertheless mentioned in recommendations, can be found at the top of our Sanity Checker.

If that still doesn't help, please make up a sensible ID, use it in your recommendations, and kindly let us know, e.g. when submitting a pull request.
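
As a rough sketch, a recommendation that relies on such a made-up ID might look as follows. The ID fMyNewFormat is purely hypothetical; the domain name and the level are borrowed from the examples on this page.

      <format id="fMyNewFormat">
         <domain>Textual Source Language Data</domain>
         <level>acceptable</level>
         <comment>Not yet described in the SIS; ID coined by the centre.</comment>
      </format>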

4. General centre/repository information

Use the element <info> for that. Note that this element may bear the @xml:lang attribute to indicate the language of the content. It is expected that, for example, Text+ centres are going to present at least some of their information in German (xml:lang="de"). Where the attribute is not present, its value defaults to "en" (English).

The content of <info> is a mix of <p> (paragraph), <ul> (unordered list), and <ol> (ordered list) elements. The latter two contain one or more <li> elements, which are basically equivalent to paragraphs.

Inline elements that can be used inside <p> or <li> are:

  • <a> for links (with the obligatory @href attribute)
  • <code> for quoted code or labels, etc. (monospace font)
  • <formatRef> for links to other formats (with the obligatory @ref attribute that takes unprefixed format IDs as values)
  • <i> for highlighted passages (italic font)
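
By way of illustration, an <info> element that combines these pieces could look roughly like this. The wording and the URL are invented for the example; the format IDs are the ones used elsewhere on this page.

      <info xml:lang="en">
         <p>Our repository accepts language resources in the formats listed below. See the <a href="https://example.org/deposit">deposition guidelines</a> for details.</p>
         <ul>
            <li>Audio recordings should be delivered as <formatRef ref="fWave"/>.</li>
            <li>Plain text (<formatRef ref="fTextPlain"/>) must use the <code>UTF-8</code> encoding; markup is <i>not</i> permitted.</li>
         </ul>
      </info>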

5. Comments on the individual recommendations

The role of the comments is either to provide more information or to provide finer granularity. The latter role is illustrated below:

      <format id="fWave">
         <domain>Audiovisual Source Language Data</domain>
         <level>recommended</level>
         <comment>PCM-WAV, 48 kHz, 16 bit</comment>
      </format>
      <format id="fWave">
         <domain>Audiovisual Source Language Data</domain>
         <level>acceptable</level>
         <comment>PCM-WAV with non-recommended parameters (not 48 kHz, 16 bit)</comment>
      </format>

(Same format ID, same domain, but different recommendation levels depending on the subcategorisation provided in the comments.)

Comments can also be language-tagged:

      <format id="fTextPlain">
         <domain>Textual Source Language Data</domain>
         <level>recommended</level>
         <comment>without markup</comment>
         <comment xml:lang="de">ohne Mark-up</comment>
      </format>

They can also reference other formats:

      <format id="fCHAT-XML">
         <domain>Audiovisual Annotation</domain>
         <level>discouraged</level>
         <comment>Consider using <formatRef ref="fTEISpoken"/> instead.</comment>
      </format>

Inline elements that can be used inside <comment> are:

  • <a> for links (with the obligatory @href attribute)
  • <code> for quoted code or labels, etc. (monospace font)
  • <formatRef> for links to other formats (with the obligatory @ref attribute that takes unprefixed format IDs as values)
  • <i> for highlighted passages (italic font)
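
A comment that uses some of these inline elements might be sketched as follows; the URL is made up for the example, and the parameter values repeat those from the fWave example above.

      <format id="fWave">
         <domain>Audiovisual Source Language Data</domain>
         <level>recommended</level>
         <comment>Use <code>48 kHz, 16 bit</code>; see the <a href="https://example.org/audio">audio guidelines</a> for <i>details</i>.</comment>
      </format>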

6. Language tags

It is possible to tag the <info> and <comment> elements with language tags, by means of the xml:lang attribute. So far, this has been used systematically when switching environments, in particular for Text+, which uses both German and English. As of April 2024, we have not yet had the occasion to test the effectiveness of language tags in recommendations of the individual CLARIN centres (CLARIN's official language is English and that has so far sufficed). But we're open to testing this -- just let us know, please, if a centre wants to do that (and especially if that fails!).

7. Whodunnit

Each template now contains an element that has a serious effect on the centre information page: it switches off the warning that says, in red, "The recommendations have not been curated yet" and instead provides information on who may be contacted if something seems off. For now, let's just have a look at the documented fragment that is present in each template (before the template gets customised):

        <respStmt>
            <name><!--for multiple curators, use multiple respStmt elements, each with a resp--></name>
            <resp><!--optionally specify the range of responsibility; useful if multiple respStmt are used--></resp>
            <link><!--URI; GitHub page or any other way to id/contact the curator--></link>
            <reviewDate>0001-01-01</reviewDate>
        </respStmt>
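
Once customised, the fragment might look like the sketch below, here with two curators, as suggested by the embedded comments. The names, the links, the ranges of responsibility, and the dates are all hypothetical placeholders.

        <respStmt>
            <name>Jane Doe</name>
            <resp>audio and video formats</resp>
            <link>https://example.org/contact/jane-doe</link>
            <reviewDate>2024-06-13</reviewDate>
        </respStmt>
        <respStmt>
            <name>John Roe</name>
            <resp>text formats</resp>
            <link>https://example.org/contact/john-roe</link>
            <reviewDate>2024-06-13</reviewDate>
        </respStmt>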