Inputhons

Piotr Banski edited this page Jul 11, 2023 · 26 revisions

"Inputhon" is our super-fancy name for a type of a hackathon where the persons responsible for a centre's recommendations for data deposition formats meet for (say) an hour in order to prepare or update their centre's content for the SIS.

Please note: the content of this document is still taking shape. A pilot inputhon is going to be held at the IDS in July 2023, and feedback from it will be incorporated here. Feedback on the contents is very welcome at any point: open a new GitHub issue to let us know what's wrong or what needs improving.

0. TL;DR

The goal is (ideally) to end the event by submitting a pull request against one of the files in https://github.com/clarin-eric/standards/tree/formats/SIS/clarin/data/recommendations (note that this is the "formats" branch, not master).

Post-event, the centre can either

  • point its users to the SIS (recommended, because of the data aggregation that happens there), or else
  • re-use the same data (note: you don't want to maintain two copies of recommendations, do you?) by pulling them out of the SIS via its API (an example is supplied; essentially, you just need to style the data to match your site's look).

1. Motivation

For CLARIN B-centres which need to undergo (re-)certification,

  • storing format recommendations in the SIS satisfies the relevant CoreTrustSeal requirement (see section 8 (R08, "Deposit & Appraisal") of the Extended Guidance), which checks, among other things, whether the repository offers a list of preferred formats.
  • Incidentally, two bullets down, R08 asks about info on "the approach towards digital objects that are deposited in non-preferred formats" -- that information can also be provided by the SIS, in the general section describing the centre, in comments on formats (especially those labelled as "discouraged", i.e. "non-preferred" in CTS lingo), or in both.

For other centres/repositories, storing the information is a way to:

  • get that done in a uniform format, and along a tested route;
  • be able to use a clean template and/or examples provided in the recommendations by other centres;
  • obtain statistics based on the aggregated information;
  • not bother about displaying the information...
    • at all, if the centre/repository points to the SIS for that purpose, or
    • much, if the centre takes the data via the SIS API and applies its own (or provided) CSS styling to it.

2. Preparation

These steps are optional but advisable. If they seem like too much of a time investment, skip them. But we would appreciate it if you could go via pull requests, also for the sake of keeping track of the project's history.

2.1. Give us a heads-up

Tell us that you intend to hold an inputhon, so that we can make sure that the centre is represented in the system, and that at least a skeletal recommendations file for it exists. We can then also at least try to make ourselves available for consultation over Zoom, etc.

2.2. Get the SIS

  • fork the SIS, clone your fork, install eXist and the SIS
  • optionally, you might want to integrate that new DB instance with your oXygen editor (yes, there are a lot of assumptions here), because then you will be able to visualise your changes just by dragging the recommendations file from oXygen's project panel to the DB connection panel (and refreshing the local SIS instance in the browser). Please do not worry if this paragraph is not clear to you.

2.3. Locate the XML document describing the recommendations for your centre

The native GitHub way, if you've forked/cloned

The recommended way is to look at the SIS/clarin/data/recommendations/ directory and locate your centre's data. For example, for the IDS, the document is IDS-recommendation.xml. Please bear in mind that the same centre may use different names across different RIs, so search for the alternative names as well. We're not yet sure how to handle that kind of variation, and your opinion on this matter may help.

If you can't locate your centre, please let us know, either by e-mail (see the "About" page of the SIS) or by posting an issue.

The workaround by exporting centre data

If you don't want to bother with cloning the SIS repository (oh please, do bother...), then locate your centre in the list of centres supported by the SIS. If you can't locate the centre, post a GitHub issue, as described above.

Once you have located your centre, click on "download template" (if the page is empty) or "export table to XML" (if the table has already been populated). In the latter case, please note that, as a centre representative, you should not feel obliged to keep the existing recommendations if you see a red notice saying "Warning: The recommendations have not been curated yet". In most cases, that notice means that we populated the recommendations ourselves at the testing stage. Either one of the Standards Committee members obtained the information from the centre directly, or we (superficially and quickly) interpreted the recommendations posted by your centre, squeezing them into, and smearing them across, the functional domain system that the SIS uses, and more or less straightforwardly taking the recommendation levels (recommended, acceptable, discouraged) from your centre's documentation. You may want to thoroughly re-examine our choices -- we were only seeding the system.

2.4. Mind the functional domains

Reserve a few minutes to take a look at the data domains, and see which of them correspond to the functions of the data that your centre is ready to receive. Please read through the descriptions of the particular domains. Treat the domains, together with the three levels of recommendation, as a scaffolding upon which your centre's recommendations will be placed.

3. Execution

3.1. General procedure

  • In case you haven't done that in the previous step, have a look at the data domains, and see which of them correspond to the functions of the data that your centre is ready to receive.
  • For each of the selected domains, decide which formats to list and at which level:
    • if the centre wishes to receive data in that format, because it is going to be easy to curate, archive, etc., choose "recommended";
    • if it's an "if you really must" format, choose "acceptable";
    • if you want to discourage submissions in some format, choose "discouraged", and do consider providing a short explanation of what the preferred alternative is (if there is one), or of why the centre discourages submissions in that format. The place for that explanation is the <comment> element (see below for some examples).

We suggest that you go domain by domain, and that you work either in a fork of the SIS or in a branch created from the local "formats" branch -- and then make your pull requests against the "formats" branch, please.

If you take the path of editing the source with an XML editor, you will benefit from XML Schema and Schematron -- both are used to constrain the XML you're going to produce, and often provide suggestions on the valid values and structures. You will then also be able to use the template provided in each empty recommendations document.
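
If you edit outside a schema-aware editor, a rough local pre-check can catch the most obvious slips before you open a pull request. The Python sketch below is only an illustration: it is not part of the SIS and is no substitute for the XML Schema and Schematron validation. It assumes entries shaped like the examples in section 3.2 and no XML namespace; the <formats> wrapper, the fExample ID and the function name check_entries are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# The three recommendation levels named on this page.
LEVELS = {"recommended", "acceptable", "discouraged"}

# Made-up sample: <formats> is a hypothetical wrapper, fExample a hypothetical ID.
SAMPLE = """
<formats>
  <format id="fWave">
    <domain>Audiovisual Source Language Data</domain>
    <level>recommended</level>
    <comment>PCM-WAV, 48 kHz, 16 bit</comment>
  </format>
  <format id="fTextPlain">
    <domain>Text Annotation</domain>
    <level>discouraged</level>
  </format>
  <format id="fExample">
    <domain>Documentation</domain>
    <level>preferred</level>
  </format>
</formats>
"""


def check_entries(xml_text):
    """Return a list of human-readable remarks about the <format> entries."""
    remarks = []
    for fmt in ET.fromstring(xml_text).iter("format"):
        fid = fmt.get("id")
        domain = fmt.findtext("domain")
        level = fmt.findtext("level")
        if not domain:
            remarks.append(f"{fid}: missing <domain>")
        if level not in LEVELS:
            remarks.append(f"{fid}: unknown level {level!r}")
        elif level == "discouraged" and fmt.find("comment") is None:
            # Advisory only: this page suggests, but does not require, an explanation.
            remarks.append(f"{fid} ({domain}): discouraged, consider adding a <comment>")
    return remarks


remarks = check_entries(SAMPLE)
for remark in remarks:
    print(remark)
```

The "discouraged without a comment" remark is deliberately advisory, since the explanation is something to consider rather than an obligation.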

3.2. Comments on the individual recommendations

You may want to comment on the recommendations, e.g. to restrict the range of acceptable options (for instance, to mention the kind of a/v encoding that you would be most happy with) or to point users to alternative formats if you choose to label some format as "discouraged". Use the <comment> element for that.

You can also use language tags (in the optional xml:lang attribute) to provide information in the native language of your users. Comments without a language tag are going to be treated as comments in English, and the system will fall back to English whenever in doubt. Note that if your RI is Text+, German text is going to be prioritised over English.

If you want to reference another format, use the <formatRef> element with the ref attribute containing the ID of the format as defined in the SIS. You can copy the SIS IDs from the page listing formats, by clicking on the button next to the format name. If you don't see your chosen format in that list, please make the ID up and let us know about that.

A few examples follow:

      <format id="fCHAT-XML">
         <domain>Audiovisual Annotation</domain>
         <level>discouraged</level>
         <comment>Consider using <formatRef ref="fTEISpoken"/> instead.</comment>
      </format>

Note: below, we use the same ID twice, and the <comment> element for fine-grained distinctions.

      <format id="fWave">
         <domain>Audiovisual Source Language Data</domain>
         <level>recommended</level>
         <comment>PCM-WAV, 48 kHz, 16 bit</comment>
      </format>
      <format id="fWave">
         <domain>Audiovisual Source Language Data</domain>
         <level>acceptable</level>
         <comment>PCM-WAV with non-recommended parameters (not 48 kHz, 16 bit)</comment>
      </format>

Below, note the differing data domains as well as language-tagged comments:

      <format id="fTextPlain">
         <domain>Audiovisual Annotation</domain>
         <level>discouraged</level>
      </format>
      <format id="fTextPlain">
         <domain>Documentation</domain>
         <level>recommended</level>
      </format>
      <format id="fTextPlain">
         <domain>Text Annotation</domain>
         <level>discouraged</level>
      </format>
      <format id="fTextPlain">
         <domain>Textual Source Language Data</domain>
         <level>recommended</level>
         <comment>without markup</comment>
         <comment xml:lang="de">ohne Mark-up</comment>
      </format>
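
To see how such entries might be consumed, here is a minimal Python sketch that groups them by domain and picks a comment in the requested language, treating untagged comments as English and falling back to English otherwise. It only mirrors the behaviour described above and is not the SIS implementation (in particular, it does not model the Text+ preference for German). The <formats> wrapper and the function names are assumptions made for the example, and the real files may use an XML namespace that this sketch ignores.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Entries copied from the examples above; <formats> is a hypothetical wrapper.
SAMPLE = """
<formats>
  <format id="fTextPlain">
    <domain>Documentation</domain>
    <level>recommended</level>
  </format>
  <format id="fTextPlain">
    <domain>Textual Source Language Data</domain>
    <level>recommended</level>
    <comment>without markup</comment>
    <comment xml:lang="de">ohne Mark-up</comment>
  </format>
  <format id="fCHAT-XML">
    <domain>Audiovisual Annotation</domain>
    <level>discouraged</level>
    <comment>Consider using <formatRef ref="fTEISpoken"/> instead.</comment>
  </format>
</formats>
"""


def comment_text(comment):
    """Flatten a <comment>, showing each <formatRef> as the ID it points to."""
    parts = [comment.text or ""]
    for child in comment:
        if child.tag == "formatRef":
            parts.append(child.get("ref", ""))
        parts.append(child.tail or "")
    return "".join(parts).strip()


def pick_comment(fmt, lang="en"):
    """Return the comment in `lang`, else the English one, else None."""
    by_lang = {c.get(XML_LANG, "en"): comment_text(c) for c in fmt.findall("comment")}
    return by_lang.get(lang, by_lang.get("en"))


def by_domain(root, lang="en"):
    """Map each domain to a list of (format ID, level, comment) tuples."""
    table = defaultdict(list)
    for fmt in root.iter("format"):
        entry = (fmt.get("id"), fmt.findtext("level"), pick_comment(fmt, lang))
        table[fmt.findtext("domain")].append(entry)
    return dict(table)


table_de = by_domain(ET.fromstring(SAMPLE), lang="de")
for domain, entries in table_de.items():
    print(domain, entries)
```

Requesting German returns "ohne Mark-up" for the entry that has a German comment, and the English text for the entry that has none.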

3.3. Overall information about the centre

At the top of the recommendations document is the <info> element, which you can use to provide information on the centre, but also information about "the approach towards digital objects that are deposited in non-preferred formats", to quote the CTS requirements. Put the text into HTML <p> elements; you can also use the HTML list elements (ol, ul, li), as well as <a> for links.
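
As an illustration, the Python sketch below holds a made-up <info> block and checks that it is well-formed and uses only the elements mentioned above. The centre name, the wording and the example.org link are invented, and the allowed set is simply read off this page: the actual schema may permit more, or may place these elements in a namespace, in which case the check would need adjusting.

```python
import xml.etree.ElementTree as ET

# Elements this page mentions for use inside <info> (plus <info> itself).
ALLOWED = {"info", "p", "ol", "ul", "li", "a"}

# Entirely made-up content, for illustration only.
SAMPLE_INFO = """
<info>
  <p>Example Centre accepts deposits of written and spoken corpora.</p>
  <p>Data deposited in non-preferred formats:</p>
  <ul>
    <li>is reviewed case by case;</li>
    <li>may need conversion, see <a href="https://example.org/deposit">our deposit guide</a>.</li>
  </ul>
</info>
"""


def unexpected_tags(xml_text):
    """Parse the fragment and return, sorted, any element names outside ALLOWED."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return sorted({el.tag for el in root.iter()} - ALLOWED)


print(unexpected_tags(SAMPLE_INFO))
```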

4. Using the data

The SIS is set up in such a way that you shouldn't need to maintain two instances of data, one for the local pages, and one for the SIS. Note that such an approach would increase the maintenance costs (person-hours). The idea is: you do it once, use the data you've input, and revisit only if the centre/repository policy changes or when, as a B-centre, you need to get re-certified.

Pointing to the data in the SIS

You can simply point your users to your centre's data by using a direct link. For the IDS, you would use https://clarin.ids-mannheim.de/standards/views/view-centre.xq?id=IDS (note the final ID).

Using the data input into SIS to populate the centre's local pages

You can also retrieve your data via the REST API offered by the SIS. Again, for the IDS, you would use, e.g.:

      curl 'https://clarin.ids-mannheim.de/standards/rest/views/recommended-formats-with-search.xq?centre=IDS&domain=1&level=recommended&export=yes'

Have a look at the API documentation to see what parameters are possible, etc.
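
The same request can be assembled programmatically. Below is a minimal Python sketch using only the standard library: the endpoint, parameter names and values are copied from the curl example, while the function names are our own, chosen for illustration. What the export contains, and which other parameters exist, is for the API documentation to say, so the fetch function simply returns the raw bytes.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example on this page.
BASE = "https://clarin.ids-mannheim.de/standards/rest/views/recommended-formats-with-search.xq"


def export_url(centre, domain, level):
    """Build the export URL for one centre, domain and recommendation level."""
    query = urlencode({"centre": centre, "domain": domain, "level": level, "export": "yes"})
    return f"{BASE}?{query}"


def fetch_export(centre, domain, level, timeout=30):
    """Download the export as raw bytes (needs network access; not called here)."""
    with urlopen(export_url(centre, domain, level), timeout=timeout) as response:
        return response.read()


url = export_url("IDS", 1, "recommended")
print(url)
```

Keeping URL construction separate from the download makes it easy to check the request before hitting the server.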

You can see an example way of querying the data with jQuery at https://github.com/IDS-Mannheim/IDS-Mannheim.github.io , and the corresponding simple webpage is available for viewing at https://ids-mannheim.github.io/standards/ (many browsers will let you view the source by pressing Ctrl+U). If you would like to contribute a CSS (or XSL) stylesheet to render the info in a nicer way, please feel welcome to contact us and we will set up a directory for such contributions.
