xml:ids in medieval catalogue #192

Open
holfordm opened this issue Nov 16, 2018 · 5 comments
Comments

@holfordm
Collaborator

It is a good idea to have xml:ids for msItem and msPart in the medieval catalogue (and the other TEI catalogues), as it makes our data more useful for others (e.g. MMM) and could enable linking to particular elements of a description in future.

However, the current format of xml:ids arising from James Cumming's conversion script, which attempts to reflect the order of items in the manuscript, takes a lot of effort for the editor to create and maintain, and makes records difficult to update when, for example, a new item has to be added.

I propose (1) that the existing xml:ids be replaced at some point by randomly generated alphanumeric ids (I'm assuming this could be done relatively easily with XSLT); (2) that an XSLT transformation be created to add such ids to records that do not have them, to save editors time when creating / updating records; (3) that similar changes be suggested to the other TEI catalogues.
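A minimal sketch of step (2), written in Python rather than the XSLT suggested above, purely for illustration. The 8-character lowercase alphanumeric format, the function names and the sample record are all assumptions, not an agreed scheme. Elements that already carry an xml:id are left untouched.

```python
import secrets
import string
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ALPHABET = string.ascii_lowercase + string.digits


def new_id(existing, length=8):
    """Return a random id not in `existing`; the first character is always a letter."""
    while True:
        candidate = secrets.choice(string.ascii_lowercase) + "".join(
            secrets.choice(ALPHABET) for _ in range(length - 1)
        )
        if candidate not in existing:
            return candidate


def add_missing_ids(root, tags=("msItem", "msPart")):
    """Assign an xml:id to each msItem/msPart that lacks one; never alter existing ids."""
    wanted = {f"{{{TEI}}}{t}" for t in tags}
    existing = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    added = []
    for el in root.iter():
        if el.tag in wanted and el.get(XML_ID) is None:
            ident = new_id(existing)
            el.set(XML_ID, ident)
            existing.add(ident)
            added.append(ident)
    return added


# Hypothetical record: one item already has an id, one does not.
sample = f'<msDesc xmlns="{TEI}"><msContents><msItem xml:id="a1"/><msItem/></msContents></msDesc>'
root = ET.fromstring(sample)
added = add_missing_ids(root)
print(len(added))  # 1: only the second msItem received a new id
```

Checking each candidate against the ids already in the record avoids collisions within one file; it does nothing about uniqueness across the whole catalogue.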

@andrew-morrison
Contributor

Replacing existing IDs is risky. James Cumming's conversion script did that for Fihrist, changing msItem/@xml:id values from "a1", "a2", etc, but not changing the ref/@target attributes which pointed to them.
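To illustrate the kind of breakage described here, a small Python sketch that lists local `ref/@target` values pointing at an xml:id that no longer exists in the record. The function name and the sample record are hypothetical; only same-document targets of the form `#id` are checked.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_targets(root):
    """Return local ref/@target values ("#id") that match no xml:id in the record."""
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    missing = []
    for ref in root.iter(f"{{{TEI}}}ref"):
        # @target may hold several whitespace-separated pointers.
        for target in ref.get("target", "").split():
            if target.startswith("#") and target[1:] not in ids:
                missing.append(target)
    return missing


# Hypothetical record after ids "a1", "a2" were replaced but the ref was not updated.
record = ET.fromstring(
    f'<msDesc xmlns="{TEI}">'
    '<msItem xml:id="x7f3k2"><ref target="#a2">see item 2</ref></msItem>'
    '<msItem xml:id="q9d1m4"/>'
    "</msDesc>"
)
print(dangling_targets(record))  # ['#a2']
```

Running a check like this before and after any bulk replacement would show whether internal references have been left behind.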

The existing IDs in Medieval have been included in the RDF generated for MMM, but I don't know if their systems retain these, or if changing them all and then regenerating the RDF would be OK.

Adding IDs after creation/editing would involve either training people to do it before committing, or doing it centrally afterwards. The former could be more arduous than just adding them manually, and would sometimes be forgotten. The latter would lead to Git conflicts when, inevitably, people don't remember to pull before editing.

But if you want a script to add IDs for Medieval, for one-off or ad-hoc use by yourself, that would be very easy.

@ahankinson
Contributor

Beyond simply being link targets, these XML IDs should be considered part of the system of persistent identifiers; that is, the ID xml:id="c422dfa" on manuscript_1234 (or ark:/29792/nnnAAA123) would be maintained in perpetuity once assigned.

It is certainly handy that browsers understand how to link to them and display them as fragment identifiers, but that should be a secondary consideration. Behaviours of the catalogues will change (in 5, 10, 20 years), so it's never guaranteed that a new interface will know how to properly resolve the links, but the identity of the part/item as a URI should never change.

(The value of IDs in projects like MMM is in their permanence, and not necessarily their resolution. It's enough to make an assertion about a particular ID in a graph without actually ever needing to retrieve the content. But if that ID changes and is no longer guaranteed to have the same identity, the identifier loses all value).

So to be clear (I'm assuming 1, 2, & 3 are obvious, but just to state it):

  1. The IDs would need to persist beyond updates to the HTML, so should be derived from the TEI ids.
  2. The IDs would need to be generated at the time of cataloguing, assigned by the cataloguer, and persist across updates to the record. They must not ever be re-generated and replaced once assigned, if we are making a guarantee of persistence.
  3. Retroconversion of existing records will break any existing links; it may be better to do this now rather than later.
  4. In order to be valid xml:ids they must start with a letter, not a number.
  5. We should specify the format of these identifiers; a UUID would probably be overkill, but many other 'random' identifiers have the risk of creating collisions if not properly designed. Given (3) we will probably need to use a system like NOID to try and guarantee format as well as 'randomness'.
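Points 4 and 5 can be checked mechanically. The Python sketch below uses a deliberately simplified, ASCII-only pattern (the XML rules for xml:id values also allow many non-ASCII characters, which this pattern rejects), and the function names are hypothetical.

```python
import re
from collections import Counter

# Simplified ASCII-only approximation of a valid xml:id:
# starts with a letter or underscore, then letters, digits, ".", "_" or "-".
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_id(value):
    """True if `value` matches the simplified pattern in full."""
    return ID_PATTERN.fullmatch(value) is not None


def duplicates(ids):
    """Return the ids that occur more than once, sorted."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


print(is_valid_id("c422dfa"))          # True
print(is_valid_id("422dfa"))           # False: starts with a digit
print(duplicates(["a1", "a2", "a1"]))  # ['a1']
```

This only validates and detects collisions after the fact; it is not a substitute for a minting scheme such as NOID.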

@holfordm
Collaborator Author

I suppose that existing ids don't need to be replaced? How important is consistency in their format?

Future cataloguing may lead to changes in how parts and items are identified, e.g. one item is split into two or more items. This is guaranteed to happen in cases like https://medieval.bodleian.ox.ac.uk/catalog/manuscript_9003, where 8 items have been catalogued under a general subject heading.

@ahankinson
Contributor

ahankinson commented Nov 16, 2018

Consistency is only important for humans; uniqueness and validity are important for computers. So the consistency question is more "how annoyed will you be if they're not consistent"? 😄

Adding new unique identifiers isn't a problem. It's only a problem if you change what they identify, or remove them. So creating 8 new identifiers for that record will be fine, but how we handle the old identifier is where the problem will be. I can see we either:

  1. ignore the problem and just delete it, recognizing that we're piling up technical debt if we need to keep these identifiers in perpetuity.
  2. create a stub record that says "Yes, this was an item, but it has been updated to be these other items". We could keep this in the TEI and not show it in the catalogue record, which shouldn't affect identity, but it will affect resolution (breaking a link).
  3. keep the original item in place and just augment the record.

I'm guessing the appropriate action is somewhere between 1 & 2.
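Whichever option is chosen, removed identifiers can at least be detected automatically. A rough Python sketch, assuming the old and new versions of a record are available as strings (for example from two Git revisions); the function names and sample records are hypothetical. It only catches ids that disappear, not ids that stay but come to identify something different.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def ids_in(xml_text):
    """Collect every xml:id value in a record."""
    root = ET.fromstring(xml_text)
    return {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}


def removed_ids(old_xml, new_xml):
    """Return ids present in the old version but missing from the new one."""
    return sorted(ids_in(old_xml) - ids_in(new_xml))


# Hypothetical edit: item "a1" is split into two new items.
old = '<msContents><msItem xml:id="a1"/></msContents>'
split_and_deleted = '<msContents><msItem xml:id="b7k2"/><msItem xml:id="c9m4"/></msContents>'
# Same split, but the old id is kept on an enclosing element (one possible encoding).
split_and_kept = (
    '<msContents><msItem xml:id="a1">'
    '<msItem xml:id="b7k2"/><msItem xml:id="c9m4"/>'
    "</msItem></msContents>"
)

print(removed_ids(old, split_and_deleted))  # ['a1']
print(removed_ids(old, split_and_kept))     # []
```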

@holfordm
Collaborator Author

In that case (and perhaps in others) the existing title could be moved to the head or summary element, and the existing xml:id could go with it.

I don't think that we will have a Fihrist-type problem of internal references within records to their xml:ids, since that was a Fihrist-only method of encoding, and I don't think we have anything similar.

As to whether we should change the existing xml:ids, perhaps it depends on MMM and whether their RDF can/will be updated or not? I can't imagine that anyone else is referring to our xml:ids at the moment.


3 participants