Parsing EPUB contributors #126

Closed
JayPanoz opened this issue Mar 23, 2020 · 21 comments


@JayPanoz
Contributor

As a side-effect of readium/r2-streamer-kotlin#125, I’m now wondering whether the contributors’ section of the EPUB parsing doc should not be modified in more depth.

For reference: contributors’ section in RWPM doc.

Per EPUB spec:

The [DC11] contributor element is used to represent the name of a person, organization, etc. that played a secondary role in the creation of the content of an EPUB Publication.
The requirements for the contributor element are identical to those for the creator element in all other respects.

And then

The role property can be attached to the element to indicate the function the creator played in the creation of the content

So my interpretation is that creator vs. contributor is vaguer than the mapping tables we have right now suggest. For example, an illustrator could be a creator, but this is not covered. As another example, a contributor could have the role of aut (author): should it be promoted to author, or kept as a contributor so as not to lose this nuance?

Any thoughts?

@llemeurfr
Contributor

Refinements of Dublin Core properties have always been a complex matter. Even without refinements, creator vs contributor is often an artificial differentiation. creator has been defined for co-authors, even if some authors are a bit more prominent than others.
By mapping both creator AND contributor + 'opf:role=aut' to author, I don't think we lose an interesting nuance for what we intend to do, i.e. displaying book info to users.

@JayPanoz
Contributor Author

That makes sense. I’m now personally leaning towards the option @qnga has proposed in his PR: giving more weight to the role than the dc:element, and then using contributor for unknown roles.

So that would mean removing the first column (element) in the mappings tables and having a global rule saying it can be creator or contributor.

Of course other opinions are warmly welcome but as an early proposal starting from Quentin’s PR…


Contributors

The contributor’s key depends on the role of the creator or contributor. It is an object that contains a name, a sortAs and an identifier key.

The name of each contributor is an object where each key is a BCP 47 language tag and each value of the key is a string.

The contributor object may also contain a sortAs string, used to sort the contributor as well, and an identifier string that must be a valid URI.

When parsing an EPUB, we need to establish:

  • the key of the contributor;
  • a language map for the name of this contributor;
  • the string used to sort the name of the contributor.
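
As an illustration of the object described above, here is a minimal Python sketch. The helper name `make_contributor`, the sample names and the identifier URI are hypothetical and only serve the example; the keys (`name`, `sortAs`, `identifier`) come from the text above.

```python
import json

def make_contributor(names, sort_as=None, identifier=None):
    """Build a contributor object.

    `names` maps BCP 47 language tags to display strings;
    `sort_as` and `identifier` are optional, as in the description above.
    """
    contributor = {"name": dict(names)}
    if sort_as is not None:
        contributor["sortAs"] = sort_as
    if identifier is not None:
        contributor["identifier"] = identifier  # expected to be a valid URI
    return contributor

# Hypothetical sample data, for illustration only.
metadata = {
    "author": [
        make_contributor(
            {"en": "Jane Doe", "fr": "Jeanne Dupont"},
            sort_as="Doe, Jane",
            identifier="https://example.org/contributors/jane-doe",
        )
    ]
}

print(json.dumps(metadata, ensure_ascii=False, indent=2))
```

The key under which the object is stored (`author` here) is the part that the mapping tables below are meant to determine.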

EPUB 2.x

The contributor element can either be <dc:creator> or <dc:contributor>.

The following mapping should be used to determine the key of the contributor’s object:

opf:role | key
aut or <empty> or <unknown> | author
trl | translator
edt | editor
ill | illustrator
art | artist
clr | colorist
nrt | narrator
<empty> or <unknown> | contributor

Where opf:role is the value of the attribute of the contributor element.

Parse the carrying element as a localized string to compute a language map for the contributor’s name.

Finally, the string used to sort the name of the contributor is the value of the opf:file-as attribute of this element.

EPUB 3.x

The contributor element can either be <dc:creator> or <dc:contributor>.

The following mapping should be used to determine the key of the contributor’s object:

role | key
aut or <empty> or <unknown> | author
trl | translator
edt | editor
ill | illustrator
art | artist
clr | colorist
nrt | narrator
<empty> or <unknown> | contributor

Where role is the value of the role refine whose scheme is marc:relators.

In addition, <media:narrator> should be mapped to narrator.

Parse the contributor element as a localized string to compute a language map for the contributor’s name.

Finally, the string used to sort the name of the contributor is the value of a refine with a file-as property.

Publisher

The publisher of a publication is an object that contains a name, a sortAs and an identifier key.

The name of each publisher is an object where each key is a BCP 47 language tag and each value of the key is a string.

The publisher object may also contain a sortAs string, used to sort the publisher as well, and an identifier string that must be a valid URI.

Parse the <dc:publisher> element as a localized string to compute a language map for the publisher’s name.

Finally, the string used to sort the name of the publisher is the value of a refine with a file-as property.


Although I’m not completely happy with this rewording – I think publisher is best if left as its own section, as it is in the RWPM doc, but it could probably refer to contributor given the duplicated text.

@qnga
Contributor

qnga commented Mar 23, 2020

What about a preface writer? I think he could legitimately be declared as a contributor with 'opf:role=aut', and we definitely don't want him to appear side by side with the author of the main content.

Although I’m not completely happy with this rewording – I think publisher is best if left as its own section, as it is in the RWPM doc, but it could probably refer to contributor given the duplicated text.

What about <dc:creator> or <dc:contributor> with role="pbl"? Currently, we map them to publishers.

@mickael-menu
Member

The contributor element can either be dc:creator or dc:contributor.

It's not clear from the table how these elements are mapped when the opf:role attribute is missing. Is <dc:creator> ending up in authors and <dc:contributor> in contributors, like before?

What about a preface writer? I think he could legitimately be declared as a contributor with 'opf:role=aut', and we definitely don't want him to appear side by side with the author of the main content.

Yes, I wonder how it's used in the wild... Also, wouldn't it be weird to have a Contributor with a role=aut in contributors instead of authors?

@JayPanoz
Contributor Author

Indeed, we will probably have to reach a consensus on an opinionated parsing for some metadata, especially as contributor is a minefield for corner cases.

For preface writer, for instance, there’s wpr in MARC relators so either we stick to that, or try to accommodate vaguer metadata.

I wonder how it's used in the wild...

TBH, I’ve always been interested in a comparison between the metadata of EPUB files and their ONIX record, as lots of producers treat EPUB metadata as something that is displayed in the UI – and you could consequently find conflicts. As the most simple example I’m aware of, instead of:

<dc:creator>Author 1</dc:creator>
<dc:creator>Author 2</dc:creator>

Some will do:

<dc:creator>Author 1 &amp; Author 2</dc:creator>

so that both names are displayed in the page header of say iBooks, despite what the spec is saying: “If an EPUB Publication has more than one creator, each SHOULD be included in a separate creator element.”

What about <dc:creator> or <dc:contributor> with role="pbl"? Currently, we map them to publishers.

That’s a good point.

It's not clear from the table how these elements are mapped when the opf:role attribute is missing. Is <dc:creator> ending up in authors and <dc:contributor> in contributors, like before?

That’s another one.

@hadrien commented the PR a few minutes ago:

<dc:creator> and <dc:contributor> are unfortunately very vague, creator is not as specific as author for instance. I think these two elements are vague enough to ignore the element and override them with role.

I don’t have a better answer right now but looking at the EPUB spec, I think it’s fair to add something to make “<dc:creator> ending up in authors and <dc:contributor> in contributors” explicit because this is what is implied.

Give me a few minutes to change the earlier proposal :-)

@JayPanoz
Contributor Author

JayPanoz commented Mar 23, 2020

Updated proposal


Contributors

The contributor’s key depends on the role of the creator or contributor. It is an object that contains a name, a sortAs and an identifier key.

The name of each contributor is an object where each key is a BCP 47 language tag and each value of the key is a string.

The contributor object may also contain a sortAs string, used to sort the contributor as well, and an identifier string that must be a valid URI.

When parsing an EPUB, we need to establish:

  • the key of the contributor;
  • a language map for the name of this contributor;
  • the string used to sort the name of the contributor.

EPUB 2.x

The following mapping should be used to determine the key of the contributor’s object:

element | opf:role | key
dc:creator | <empty> or <unknown> | author
dc:creator or dc:contributor | aut | author
dc:contributor | <empty> or <unknown> | contributor
dc:publisher | <any> | publisher
dc:creator or dc:contributor | pbl | publisher
dc:creator or dc:contributor | trl | translator
dc:creator or dc:contributor | edt | editor
dc:creator or dc:contributor | ill | illustrator
dc:creator or dc:contributor | art | artist
dc:creator or dc:contributor | clr | colorist
dc:creator or dc:contributor | nrt | narrator

Where opf:role is the value of the attribute of the contributor element.

Parse the carrying element as a localized string to compute a language map for the contributor’s name.

Finally, the string used to sort the name of the contributor is the value of the opf:file-as attribute of this element.
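
To make the EPUB 2.x rules above concrete, here is a rough Python sketch using the standard `xml.etree.ElementTree` module. It is not the Kotlin or Swift implementation: the function names, the sample OPF and the output shape are assumptions made for the example, and the name is kept as a plain string (the language map computation is left out).

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"

# MARC relator code -> key, as listed in the mapping table.
ROLE_KEYS = {
    "aut": "author", "pbl": "publisher", "trl": "translator",
    "edt": "editor", "ill": "illustrator", "art": "artist",
    "clr": "colorist", "nrt": "narrator",
}

def contributor_key(element_name, role):
    """Apply the table: element_name is 'creator', 'contributor' or 'publisher'."""
    if element_name == "publisher":
        return "publisher"  # <any> role
    if role in ROLE_KEYS:
        return ROLE_KEYS[role]
    # <empty> or <unknown> role: fall back on the element itself.
    return "author" if element_name == "creator" else "contributor"

def parse_epub2_contributors(opf_xml):
    """Group contributor entries by key (simplified: name is a plain string)."""
    result = {}
    for el in ET.fromstring(opf_xml).iter():
        if not el.tag.startswith("{" + DC + "}"):
            continue
        local = el.tag.split("}", 1)[1]
        if local not in ("creator", "contributor", "publisher"):
            continue
        entry = {"name": (el.text or "").strip()}
        file_as = el.get("{" + OPF + "}file-as")
        if file_as:
            entry["sortAs"] = file_as
        key = contributor_key(local, el.get("{" + OPF + "}role"))
        result.setdefault(key, []).append(entry)
    return result

# Hypothetical sample package document, for illustration only.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:creator opf:file-as="One, Author">Author One</dc:creator>
    <dc:contributor opf:role="aut">Author Two</dc:contributor>
    <dc:contributor opf:role="ill">Illustrator</dc:contributor>
    <dc:contributor>Someone</dc:contributor>
    <dc:creator opf:role="pbl">Publisher A</dc:creator>
  </metadata>
</package>"""

parsed = parse_epub2_contributors(SAMPLE)
print(parsed)
```

With this sample, both "Author One" and "Author Two" end up under `author`, while the role-less `dc:contributor` stays under `contributor`.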

EPUB 3.x

The following mapping should be used to determine the key of the contributor’s object:

element | role | key
dc:creator | <empty> or <unknown> | author
dc:creator or dc:contributor | aut | author
dc:contributor | <empty> or <unknown> | contributor
dc:publisher | <any> | publisher
dc:creator or dc:contributor | pbl | publisher
dc:creator or dc:contributor | trl | translator
dc:creator or dc:contributor | edt | editor
dc:creator or dc:contributor | ill | illustrator
dc:creator or dc:contributor | art | artist
dc:creator or dc:contributor | clr | colorist
dc:creator or dc:contributor | nrt | narrator
media:narrator | <any> | narrator

Where role is the value of the role refine whose scheme is marc:relators.

Parse the contributor element as a localized string to compute a language map for the contributor’s name.

Finally, the string used to sort the name of the contributor is the value of a refine with a file-as property.
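
For EPUB 3.x, the role and the sort string live in `<meta refines="#id">` elements rather than in attributes. The following Python sketch shows one way to collect them with `xml.etree.ElementTree`; the function names and the sample OPF are hypothetical. It returns every `marc:relators` role it finds and deliberately leaves the policy for several roles to the caller.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def refines_for(metadata, element_id):
    """Map property -> list of (scheme, value) for <meta refines="#element_id">."""
    found = {}
    if element_id is None:
        return found  # an element without an id cannot be refined
    for meta in metadata.findall("opf:meta", NS):
        if meta.get("refines") == "#" + element_id:
            value = (meta.text or "").strip()
            found.setdefault(meta.get("property"), []).append((meta.get("scheme"), value))
    return found

def roles_and_sort_as(metadata, element):
    """Return (roles, sort_as) for a dc:creator or dc:contributor element."""
    refines = refines_for(metadata, element.get("id"))
    # Only role refines whose scheme is marc:relators are considered.
    roles = [value for scheme, value in refines.get("role", []) if scheme == "marc:relators"]
    file_as = refines.get("file-as", [])
    sort_as = file_as[0][1] if file_as else None
    return roles, sort_as

# Hypothetical sample package document, for illustration only.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:creator id="c1">Author One</dc:creator>
    <meta refines="#c1" property="role" scheme="marc:relators">aut</meta>
    <meta refines="#c1" property="file-as">One, Author</meta>
    <dc:contributor id="c2">Narrator</dc:contributor>
    <meta refines="#c2" property="role" scheme="marc:relators">nrt</meta>
    <dc:contributor>Someone</dc:contributor>
  </metadata>
</package>"""

metadata = ET.fromstring(SAMPLE).find("opf:metadata", NS)
creator = metadata.find("dc:creator", NS)
contributors = metadata.findall("dc:contributor", NS)

print(roles_and_sort_as(metadata, creator))
print(roles_and_sort_as(metadata, contributors[0]))
print(roles_and_sort_as(metadata, contributors[1]))
```

The resulting role can then be looked up in the same element/role table as for EPUB 2.x.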


So this rolls back to the text of the PR with updated mapping tables. Does that sound reasonable?

@llemeurfr
Contributor

I agree except on
dc:publisher | <empty> or <unknown>
should be IMO
dc:publisher | any value

same for narrator

@qnga
Contributor

qnga commented Mar 23, 2020

I agree with the content, but a different presentation might be clearer:

dc:creator | aut or <empty> or <unknown> | author
dc:contributor | aut | author

What about
dc:creator| <empty> or <unknown> | author
dc:creator or dc:contributor | aut | author

@JayPanoz
Contributor Author

JayPanoz commented Mar 23, 2020

Ah yes thanks, that will make it more consistent.

On a related note, would <any> fit better than “any value” so that it’s more consistent with <empty> and <unknown>?

@llemeurfr
Contributor

yes, <any> is a good representation IMO.

qnga added a commit to qnga/architecture that referenced this issue Mar 24, 2020
@qnga
Contributor

qnga commented Mar 26, 2020

Multiple roles have not been mentioned. In Kotlin so far a <dc:creator> or <dc:contributor> can be put both in authors and illustrators if it is refined with these two roles. I think that's ok. But what about a <dc:publisher> with role aut? I think we should put it in both publishers and authors.

@JayPanoz
Contributor Author

role only extends contributor and creator so it should be ignored for publisher (cf. role section in EPUB 3.2 but it’s the same in EPUB 2.0.1).

@qnga
Contributor

qnga commented Mar 26, 2020

Oh, yes, you have already said that, sorry.
In addition, only one role seems to be allowed. So I guess Kotlin implementation could be simplified.

@JayPanoz
Contributor Author

In addition, only one role seems to be allowed. So I guess Kotlin implementation could be simplified.

Which also begs the question of making a rule of thumb that “if cardinality is zero or one in the EPUB Spec, only consider the first occurrence” or something like that? Would that help or not?

And maybe also adding links to the EPUB OPF docs in the header of the doc (with RWPM link) given we use them quite extensively.

@qnga
Contributor

qnga commented Mar 26, 2020

Which also begs the question of making a rule of thumb that “if cardinality is zero or one in the EPUB Spec, only consider the first occurrence” or something like that? Would that help or not?

Yes, we could. But are you sure multiple roles are never used in real life despite the specification? Using multiple <dc:contributor> elements just to define multiple roles seems quite strange to me.

@JayPanoz
Contributor Author

JayPanoz commented Mar 26, 2020

I’d say one good way to look at some issues when we have doubts is probably EPUBCheck – that also helped open bug issues in the epubcheck repo in the past.

So the bad news is that epubcheck is fine with multiple roles.

Messages: 0 fatals / 0 errors / 0 warnings / 0 infos

So:

  1. probably a bug to report to epubcheck – I’ll check with Romain and open the issue myself with a test file if needed;
  2. now, if a content producer has been doing that, it exists in the wild;
  3. if this is a bug in epubcheck, on principle I’d personally still be leaning towards aligning with epubcheck rather than trying to accommodate what would clearly be a non spec-compliant usage, but I’m obviously open to diverging points of view, esp. as we’re trying to take the authors’ intent into account;
  4. lack of EPUB data is becoming kinda painful – even for epubcheck maintainers – as it would inform such decisions.

Anyways, on an interesting note, the editor I used to make this quick test file, Sigil, has another take on this:

[Screenshot: Sigil metadata editor (sigil-bookeditor)]

Its metadata editor enforces this cardinality (only one role), even if you try to add multiple values in the GUI. Where it differs is that the last value for role will override all preceding values.

Reminds me of the “multiple dc:language + what is the primary language? the first occurrence, or the last occurrence?” issue I reported for the EPUB 3.2 spec. Most Reading Systems used the first occurrence, a handful of others the last one. So they chose the first occurrence and, on principle, I’m using first occurrence myself in case of doubt, but some cases may well have to be handled differently.

@JayPanoz
Contributor Author

So Matt Garrish just confirmed to me that it’s an epubcheck bug here. So at least this point is clarified: one role = one dc:element (and you have to repeat that for each role), to be consistent with opf:role in EPUB 2.x. I think this detail might be important in terms of consistency.

Then what do people want to do about that?

@qnga
Contributor

qnga commented Mar 26, 2020

Personally, I'm ok with considering only the first role. This should lead to simpler code.

@mickael-menu
Member

That's similar to this issue I reported yesterday. Even though it's not spec compliant, IMHO it's better to handle cases like these, that are definitely found in the wild.

epubcheck needs to be spec compliant because that's its purpose, but I guess that Readium should strive to render a publication as the publisher intended. Now, I have no idea if multiple role attributes are something that exists out there...

I checked on Swift, and we're parsing only one role by the way, so we'll have to align one way or the other. Maybe we should cast a vote?

@JayPanoz
Contributor Author

JayPanoz commented Mar 27, 2020

Maybe we should cast a vote?

Seems to me this is the best we can do.

So as a recap, the important pieces of information:

  1. cardinality is one or zero in the EPUB spec, and it was applied to role so that EPUB 2.x and EPUB 3.x are consistent;
  2. but epubcheck has not been reporting multiple role refines as an error so far;
  3. so theoretically, multiple role refines for a single dc:element may exist in the wild;
  4. also, not sure that was mentioned or noteworthy but the spec is using a MAY for cardinality.

Who’s in favour of handling multiple role refines for the same single contributor element i.e. dc:creator and dc:contributor?

Use 👍 or 👎 emojis to cast your vote.

@llemeurfr
Contributor

llemeurfr commented Apr 1, 2020

The group has chosen to handle only one role per contributor.

In case the source format allows a contributor with several roles, the corresponding Readium Web Publication (serialized or not) will have one contributor entry per role.
This rule is extended to other source properties allowing several "roles" (or equivalent).
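
A small Python sketch of what "one contributor entry per role" could look like in practice. The function names and sample data are hypothetical, and dropping duplicates when two roles map to the same key is an assumption of this sketch, not something the group stated.

```python
# MARC relator code -> key, following the mapping tables discussed above.
ROLE_KEYS = {
    "aut": "author", "pbl": "publisher", "trl": "translator",
    "edt": "editor", "ill": "illustrator", "art": "artist",
    "clr": "colorist", "nrt": "narrator",
}

def keys_for(element_name, roles):
    """Return the distinct keys for one source element, in source order."""
    fallback = "author" if element_name == "creator" else "contributor"
    keys = []
    for role in roles or [None]:  # no role at all still yields one entry
        key = ROLE_KEYS.get(role, fallback)
        if key not in keys:  # assumption: avoid duplicate entries per key
            keys.append(key)
    return keys

def add_contributor(metadata, element_name, name, roles):
    """Append one entry per role to `metadata`, each carrying a single role."""
    for key in keys_for(element_name, roles):
        metadata.setdefault(key, []).append({"name": name})
    return metadata

# Hypothetical sample data, for illustration only.
metadata = {}
add_contributor(metadata, "creator", "Jane Doe", ["aut", "ill"])
add_contributor(metadata, "contributor", "John Roe", [])
print(metadata)
```

Here "Jane Doe" appears once under `author` and once under `illustrator`, and "John Roe", who has no role, lands under `contributor`.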
