What exactly is "Attribution" for, and what Classes need one? #192

Merged
merged 4 commits into from
Sep 25, 2012


stoicflame
Member

This issue arises specifically from #144, where I proposed that having an Attribution property in SourceReference is redundant because Conclusion, which will carry the bulk of the SourceReference objects, already has one.

The original intention of Attribution was to carry a proof argument, but the purpose seems to have morphed a bit with the major architectural changes proposed in #182. Thad (@thomast73) has described it as more of a version-control structure, but also said that it needs its own issue.

This is that issue.

@jralls
Contributor Author

jralls commented Jul 19, 2012

In particular, the proof argument role has been taken over by Conclusion::Document::AnalysisDocument.

So Attribution now looks like

element attribution {
   attribute modified {integer},
   attribute contributor {foaf:person},
   attribute confidence-level,
   element justification {text}
}

Contributor and confidence-level are self-explanatory (though how confidence-level values like "probably" or "possibly" make sense in any context other than some subclasses of conclusion is questionable). "Justification" replaces "proofArgument" and is what Thad has likened to a commit note. I'd like FS's thinking on that expanded upon a bit.
"Modified" is a bit baffling. Is it a version number?

I'm in favor of tagging everything with a contributor and a date. I've no objection to versioning some things, and for those things having a short string explaining the change is useful. The list of what gets versioned, though, and how to transfer history, is highly debatable. I also wonder whether it makes sense to include a versioning scheme inside of GedcomX or if exposing an uncompressed serialization to an external version control system would make more sense.

@nilsbrummond

If it really is like revision control, does there need to be a history of attribution objects? Right now it appears that only the latest change is kept.

If a person edits an object attributed to someone else, is it now attributed to them? Perhaps contributor needs to be a list that can grow to track everyone who edits the object?

@jralls
Contributor Author

jralls commented Jul 20, 2012

If it really is like revision control, does there need to be a history of attribution objects? Right now it appears that only the latest change is kept.

Not keep, transfer: It's easy to imagine a use case where a group of researchers are passing GedcomX files around to keep each other in sync, and each researcher's local program is responsible for taking a record with a changed modified value, checking that it's a single increment of the one it has stored, making a diff against the previous one and creating a commit from the diff with the new contributor and justification. That permits any GedcomX to stand on its own for someone importing it without a previous version, but that user wouldn't get any history.
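As a rough illustration of that import step, here is a minimal Python sketch. It assumes `modified` is a revision counter, and the `Record`, `Commit`, and `import_record` names are hypothetical, not part of any GedcomX specification or library.

```python
from dataclasses import dataclass
import difflib


@dataclass
class Record:
    """Hypothetical top-level record; fields mirror the attribution sketch."""
    record_id: str
    modified: int        # assumed here to be a revision counter
    contributor: str
    justification: str
    body: str


@dataclass
class Commit:
    """Hypothetical local commit built from an incoming record."""
    record_id: str
    contributor: str
    message: str
    diff: list


def import_record(stored: Record, incoming: Record) -> Commit:
    """Create a commit if the incoming record is exactly one revision ahead."""
    if incoming.record_id != stored.record_id:
        raise ValueError("records do not describe the same object")
    if incoming.modified != stored.modified + 1:
        raise ValueError("incoming record is not a single increment of the stored one")
    # Diff the stored body against the incoming one, line by line.
    diff = list(difflib.unified_diff(
        stored.body.splitlines(), incoming.body.splitlines(), lineterm=""))
    return Commit(incoming.record_id, incoming.contributor,
                  incoming.justification, diff)
```

A record that skips a revision is rejected rather than guessed at, which is one possible policy; a real program might instead ask the sender for the missing versions.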

If a person edits an object attributed to someone else, is it now attributed to them? Perhaps contributor needs to be a list that can grow to track everyone who edits the object?

I don't like the idea of having a list of authors/revisers without also tracking who made what revisions. Suppose I took a conclusion you had written and re-wrote it in a way that you disagree with. You wouldn't want your name to still be on the conclusion, would you?

@jralls
Contributor Author

jralls commented Jul 20, 2012

I'm in favor of tagging everything with a contributor and a date.

Clarification: I'm in favor of tagging every top-level object, i.e., everything that gets its own MIME part or ZIP pseudo-file. Tagging every element of every object would be tiresome and create bloat.

@EssyGreen

I'm in favor of tagging everything with a contributor and a date

I'm not ... I believe (maybe wrongly) that the vast majority of GEDCOM files are owned/used/created by a single researcher. They know who they are and don't need to keep recording that fact. If they reference other researchers' work then this is just like any other source which (rightly) has its own attribution. The GEDCOMX file is in itself a source and therefore will have its own Attribution.

@jralls
Contributor Author

jralls commented Aug 12, 2012

I believe (maybe wrongly) that the vast majority of GEDCOM files are owned/used/created by a single researcher.

We're discussing GedcomX here, not Gedcom, and GedcomX has a web services mission as well as a single-researcher file import/export one. In the online family tree use case, the whole point is collaboration with multiple researchers, and who did what when is important metadata.

@nilsbrummond

I can see a few use cases in the single-user world where time-stamps on everything would be useful.

Example: You find a mistake in the analysis of a source. You fix the mistake and the source analysis time-stamp is updated. Now everything dependent on that source analysis with a time-stamp earlier than the source analysis time-stamp is invalid until reviewed by the researcher. Re-approval by the researcher would re-time-stamp the dependent resource.
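A small Python sketch of that check, under the assumption that each dependent resource carries a single "last reviewed" time-stamp. The function name and the sample data are made up for illustration.

```python
from datetime import datetime


def stale_dependents(analysis_stamp, dependents):
    """Return names of dependents last reviewed before the analysis changed."""
    return [name for name, reviewed in dependents.items()
            if reviewed < analysis_stamp]


# Hypothetical data: the source analysis was corrected on 12 Aug 2012.
analysis_stamp = datetime(2012, 8, 12, 9, 0)
dependents = {
    "birth-conclusion": datetime(2012, 8, 1, 12, 0),      # reviewed before the fix
    "marriage-conclusion": datetime(2012, 8, 13, 8, 30),  # reviewed after the fix
}

needs_review = stale_dependents(analysis_stamp, dependents)
```

Re-approving a resource would then amount to setting its time-stamp to the current time, which takes it out of the stale list.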

@EssyGreen

GedcomX has a web services mission as well as a single-researcher file import/export one. In the online family tree use case, the whole point is collaboration with multiple researchers, and who did what when is important metadata.

I can see a few use cases in the single-user world where time-stamps on everything would be useful.

What we seem to be coming down to here and in #198 is that for you guys every object and property can optionally have an id, a datestamp, and "Attribution" (ie ownership) - and possibly other meta-data as yet to be identified. The trouble is that the more granular you make everything the more complex it becomes. I can see something like a "Change" object coming around (datestamp and Attribution) ... and guess what .. might we then need an ID for it? Where do we stop?

I don't believe we are trying to re-create a source/change control system here. That might well be wanted but would be better served by other existing software.

@jralls
Contributor Author

jralls commented Aug 13, 2012

I don't believe we are trying to re-create a source/change control system here. That might well be wanted but would be better served by other existing software.

Not necessarily create, just facilitate. But maybe a better approach would be to have another spec for version-controlled GedcomX where the Attribution is part of the changeset metadata rather than part of GedcomX Object (meta)data.

@EssyGreen

have another spec for version-controlled GedcomX

Why re-invent the wheel?

@jralls
Contributor Author

jralls commented Aug 13, 2012

have another spec for version-controlled GedcomX

Why re-invent the wheel?

What reinvention? Is it already specified somewhere how GedcomX would interact with a version control system?

@EssyGreen

Is it already specified somewhere how GedcomX would interact with a version control system?

Why does it need to be?

@nilsbrummond

@jralls

If a person edits an object attributed to someone else, is it now attributed to them? Perhaps contributor needs to be a list that can grow to track everyone who edits the object?

I don't like the idea of having a list of authors/revisers without also tracking who made what revisions. Suppose I took a conclusion you had written and re-wrote it in a way that you disagree with. You wouldn't want your name to still be on the conclusion, would you?

I agree with you in this case. But what if someone just spell-checked, grammatically corrected, or formatted my conclusion? Or just added another supporting point...

Semantically, is "Suppose I took a conclusion you had written and re-wrote it in a way that you disagree with" still the same conclusion, or should it be an entirely new conclusion? In that case, should you mark my conclusion incorrect and add your own new conclusion?

Semantically, someone else performing minor cleanups really leaves the same conclusion. Does it make sense to remove the primary contributor and replace them with the person just doing the cleanups?

Do the lifetime and identity semantics of a conclusion need to be defined in the specification?

Perhaps:

element attribution {
   attribute modified {integer},
   list of: attribute contributors {foaf:person},
   attribute last-contributor {foaf:person},
   attribute confidence-level,
   element justification {text}
}

@EssyGreen

I'm going to ask again ... why do we need to create a version control system rather than allowing those who want one to overlay it onto whatever system they are using?

@nilsbrummond

I'm going to ask again ... why do we need to create a version control system rather than allowing those who want one to overlay it onto whatever system they are using?

+1

Version control should be outside the scope of GEDCOM X, except maybe for making sure the file format works well with general version control systems.

@jralls
Contributor Author

jralls commented Aug 15, 2012

I'm going to ask again ... why do we need to create a version control system rather than allowing those who want one to overlay it onto whatever system they are using?

Because existing VCS are designed for diffing human-edited text files. They mark reordering as a change. They mark whitespace changes. How many times have you spent n times longer reviewing a changeset because while making a 10-line code change the submitter did something like untabbify or delete-trailing-whitespace?

Those systems do not work well with machine-generated serializations because output order is usually not deterministic unless the underlying objects make it so: A linked list or an array will always come out in the same order because it's easy and natural to start at the beginning and go to the end. Hashmaps and trees can be iterated a bunch of different ways, and a single insertion or deletion can make a big difference in iteration order for most of them.
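To make the problem concrete, the Python sketch below serializes two records that hold identical data but were built in a different key order, then runs a line-based diff over the output. JSON stands in for whatever serialization GedcomX ends up using, and the record contents are invented.

```python
import difflib
import json

# Two in-memory records holding identical data, built in a different key order.
first = {"name": "John Smith", "gender": "male", "birth": "1850"}
second = {"birth": "1850", "name": "John Smith", "gender": "male"}


def text_diff(left, right):
    """Line-based diff of the default (insertion-order) JSON serializations."""
    a = json.dumps(left, indent=1).splitlines()
    b = json.dumps(right, indent=1).splitlines()
    return list(difflib.unified_diff(a, b, lineterm=""))


changes = text_diff(first, second)
print(first == second)   # the two objects compare equal as data
print(len(changes) > 0)  # whether the text diff nevertheless reports changes
```

The diff here contains only reordering noise, which is the kind of changeset a reviewer would have to wade through.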

If anyone knows about an existing semantically-aware VCS, I'd love to hear about it. I agree that building one is outside the scope of GedcomX, but not that it isn't a requirement for collaborative genealogy.

@jralls
Contributor Author

jralls commented Aug 15, 2012

Perhaps:

element attribution {
   attribute modified {integer},
   list of: attribute contributors {foaf:person},
   attribute last-contributor {foaf:person},
   attribute confidence-level,
   element justification {text}
}

Is putting "lipstick on a pig".

Your list of "what abouts" that preceded it are all handled nicely by a VCS that shows who changed what and when. Only the formatting & spelling is handled correctly by a list of contributors. However,

Semantically, is "Suppose I took a conclusion you had written and re-wrote it in a way that you disagree with" still the same conclusion, or should it be an entirely new conclusion? In that case, should you mark my conclusion incorrect and add your own new conclusion?

Offers an interesting alternative: Immutability. Add a supersedes: Optional Conclusion reference to Conclusion and the current Attribution structure. Not very space-efficient, but it fixes the other objections except "we don't need no version control".
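A minimal Python sketch of the immutability idea, assuming a conclusion is never edited in place and a revision is a new object pointing at the one it replaces. The `Conclusion` class, its fields, and the `history` helper are hypothetical and only loosely follow the attribution sketch above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Conclusion:
    """Hypothetical immutable conclusion; a revision is a new object."""
    text: str
    contributor: str
    justification: str
    supersedes: Optional["Conclusion"] = None


def history(conclusion):
    """Walk the supersedes chain from newest to oldest."""
    chain = []
    while conclusion is not None:
        chain.append(conclusion)
        conclusion = conclusion.supersedes
    return chain


original = Conclusion("Born 1850 in Leeds", "nils", "parish register")
revised = Conclusion("Born 1851 in Leeds", "john", "census disagrees",
                     supersedes=original)
```

Each author stays attached only to the text they actually wrote, at the cost of storing every version in full.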

@nilsbrummond

name | description | data type
id | A local, transient identifier for the resource being described. Note that as a local, transient identifier, the id may only be used to resolve references to the resource within a well-defined scope (such as a single web service request or a single file). | string

  • Make sure the ID field is on every resource.
  • Remove the 'transient' attribute.
  • Require output be sorted by ID.

The file order should be consistent then at least.
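A sketch of what sorting by ID could buy, in Python with JSON standing in for the real serialization. The `serialize` function and the sample resources are assumptions made for illustration only.

```python
import json


def serialize(resources):
    """Emit resources ordered by id, with keys sorted inside each resource."""
    ordered = sorted(resources, key=lambda r: r["id"])
    return json.dumps(ordered, indent=1, sort_keys=True)


# The same two resources, held in a different order by two programs.
batch_one = [{"id": "p2", "name": "Mary"}, {"id": "p1", "name": "John"}]
batch_two = [{"name": "John", "id": "p1"}, {"name": "Mary", "id": "p2"}]

same_output = serialize(batch_one) == serialize(batch_two)
```

This only works if IDs are stable across exports, which is why the 'transient' wording would have to go.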

@nilsbrummond

Is putting "lipstick on a pig".

Your list of "what abouts" that preceded it are all handled nicely by a VCS that shows who changed what and when.

Agreed

Remove the contributor attribute from the attribution? Push that out of scope to a revision control system?

If anyone knows about an existing semantically-aware VCS, I'd love to hear about it. I agree that building one is outside the scope of GedcomX, but not that it isn't a requirement for collaborative genealogy.

I have used systems with application-specific, integrated VCS (e.g. team-based UML design tools), but never a generic semantically-aware VCS.

Offers an interesting alternative: Immutability. Add a supersedes: Optional Conclusion reference to Conclusion and the current Attribution structure. Not very space-efficient, but it fixes the other objections except "we don't need no version control".

This is basically adding revision control to GedcomX.

I think we really need the requirements / use cases for collaborative genealogy support in GEDCOM X. What information does GedcomX need to communicate, assuming the Attribution object is primarily there to support this?

  1. Support for anyone editing and committing? (normal text file / code based VCS.)
  2. Support patch and submit to project manager? (à la GitHub, Linux kernel, etc...)
  3. Others?
  4. Also, these collaborative features need to specify how they are set in the single-user case.

To support option 2 there could be a defined patch/diff format that includes just the changed records with external references to the records changed in the original gedcomX.
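One way such a patch could look, sketched in Python. It assumes records are addressable by a stable id and represents a patch as a plain dictionary; `make_patch`, `apply_patch`, and the patch layout are hypothetical, not a proposed format.

```python
def make_patch(base, edited):
    """Return only the records that differ, keyed by id, plus removed ids."""
    changed = {rid: rec for rid, rec in edited.items() if base.get(rid) != rec}
    removed = sorted(rid for rid in base if rid not in edited)
    return {"changed": changed, "removed": removed}


def apply_patch(base, patch):
    """Apply a patch produced by make_patch to a copy of the base records."""
    result = {rid: rec for rid, rec in base.items()
              if rid not in patch["removed"]}
    result.update(patch["changed"])
    return result


# Hypothetical usage: one record edited, one removed, one added.
base = {"p1": {"name": "John"}, "p2": {"name": "Mary"}, "p3": {"name": "Ann"}}
edited = {"p1": {"name": "John"}, "p2": {"name": "Mary Smith"}, "p4": {"name": "Tom"}}
patch = make_patch(base, edited)
```

The project manager would review `patch` and call `apply_patch` on the original file only if the changes are accepted.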

@jralls
Contributor Author

jralls commented Aug 15, 2012

I think we really need the requirements / use cases for collaborative genealogy support in GEDCOM X.

+1

Thad?

@EssyGreen

I think we really need the requirements / use cases for collaborative genealogy support in GEDCOM X

+1

My take on this (which I know John will disagree with) is that GEDCOMX (or rather the cohesive set of conclusions in a GEDCOM X file) is in itself a Source. Therefore it must have the same form of ownership/attribution or whatever as a source has (which should include author(s), editor(s), publisher(s), publication/edit date etc etc). If a group of people are working collaboratively then they must collaborate and hence form a cohesive conclusion which they agree upon. If not, then they are just a bunch of people over-writing each other's data. If I were reading a research text by a team of people I would not expect to have each guy pop up on each paragraph to put their own stamp on it. It would make the whole thing extremely difficult to read and understand if they constantly contradicted or corrected one another. Hence I vote against having attribution objects scattered around willy-nilly.

My assumption here is that GEDCOM X is (or should be) still primarily focused on import/export (see #141) and not on prescribing the be-all-and-end-all data-store for all applications to use.

Hence it is up to each application to work out how it manages version control. John says the order doesn't matter .... well in some applications it might matter. Ditto other 'trivial' changes - any editor will tell you that small changes can change the context and meaning of what is written. It is not for GEDCOM-X to say which of these is or is not important. It is for GEDCOM X to provide a base standard of data which 'good' genealogical applications should be able to import/export.

Leave change control to the application and ensure that the GEDCOM-X file is in itself a Source with all the bells and whistles we demand of any other SourceDescription.

@jralls
Contributor Author

jralls commented Aug 16, 2012

John says the order doesn't matter

I said nothing of the sort. I said that standard serialization routines for certain types of containers don't produce output in a way that lends itself to minimal (and therefore meaningful) diffs with standard programming-language-centric version control systems, and so if GedcomX is to be version-controlled then we need to specify how to make that work.

It is for GEDCOM X to provide a base standard of data which 'good' genealogical applications should be able to import/export.

Which is why I suggested that it should be a separate spec.

@jralls
Contributor Author

jralls commented Aug 16, 2012

My take on this (which I know John will disagree with) is that GEDCOMX (or rather the cohesive set of conclusions in a GEDCOM X file) is in itself a Source. Therefore it must have the same form of ownership/attribution or whatever as a source has (which should include author(s), editor(s), publisher(s), publication/edit date etc etc).

You're not advocating GedcomX as a source there, you're advocating it as a publishing medium. That's certainly a use to which Gedcom5 has been put, but I don't think that it was a primary design use-case for Gedcom and I don't think that it should be for GedcomX.

@EssyGreen

John says the order doesn't matter

I said nothing of the sort.

Apologies - I evidently misinterpreted your negative comment about change control systems "They mark reordering as a change" - I assumed you were implying that reordering was not a significant change ergo I assumed order was not important to you.

You're not advocating GedcomX as a source there, you're advocating it as a publishing medium

Both ... being published whilst being a work-in-progress ... isn't that in line with your collaborative working requirement?
When I create a new project/tree (say for a client) to me this is a new "work-in-progress" source. As such I ensure that it is always cohesive and it is authored by me ... if someone else contributes then the stuff they produce is a source too. If there are multiple possible scenarios then I create branches (sources) to follow their trails until they are resolved. I don't see why this is such a problem.

@jralls
Contributor Author

jralls commented Aug 18, 2012

I don't see why this is such a problem.

Because you're using "source" to mean different things. In the context of the GPS and GedcomX, a "source" is something that can be cited and contributes evidence towards a conclusion. The usual term for "creating sources" is "making stuff up", and I don't think that's what you mean when you say "I create branches (sources)". I think you mean that you create different sets of conclusions and write proof arguments for each, and use those proof arguments as guidance for further research, recursing down each until you run out of new evidence and have one hypothesis (conclusion set) that fits that evidence better than the others.

@EssyGreen

Because you're using "source" to mean different things

A source is just something which can be used in evidence ... it could be a book, an archive document, an email, a letter, a photograph etc etc etc .... A source is by definition "different things" ... a GEDCOM-X file is just another type of thing/source.

@EssyGreen

The usual term for "creating sources" is "making stuff up"

That's just the credibility of the source ... an old baptism record could have been "just made up" by the priest who forgot to record it on the day and had to try to reconstruct it to send it to his superior. This is why we always seek multiple sources to prove/add evidence.

@EssyGreen

"I create branches (sources)" ... I think you mean that you create different sets of conclusions and write proof arguments for each, and use those proof arguments as guidance for further research

That too but what do I put them in? A source - that way I can use the source in other projects/conclusions etc as per normal without having to make it something else to cite from it/use it as evidence.

@stoicflame
Member

You guys are awesome. Very good stuff here.

"Modified" is a bit baffling. Is it a version number?

It's a timestamp.

If a person edits an object attributed to someone else, is it now attributed to them?

It kind of depends on how smart the application is. If the application can tell (or have the user say) that the change is just a whitespace edit or something of insignificant semantic value, then I'd probably not change the contributor. Otherwise, I probably would.
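As a sketch of the simplest such rule, the Python below treats an edit as insignificant when only whitespace differs. The function names are hypothetical, and a real application would likely need a richer test or a user prompt.

```python
def is_significant(old_text, new_text):
    """Treat edits that only change whitespace as insignificant."""
    return old_text.split() != new_text.split()


def next_contributor(current, editor, old_text, new_text):
    """Keep the current contributor unless more than whitespace changed."""
    return editor if is_significant(old_text, new_text) else current


# Hypothetical usage: a reformatting edit and a substantive edit.
after_reformat = next_contributor("nils", "john", "Born  1850\n", "Born 1850")
after_rewrite = next_contributor("nils", "john", "Born 1850", "Born 1851")
```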

What we seem to be coming down to here and in #198 is that for you guys every object and property can optionally have an id, a datestamp, and "Attribution" (ie ownership) - and possibly other meta-data as yet to be identified. The trouble is that the more granular you make everything the more complex it becomes.

Indeed. What if we removed attribution from everything except the "top-level" entities like Person, Relationship, Event, and SourceDescription? We would remove it as an explicit property of the lower-level resources like Name, Gender, SourceReference, etc.

If we wanted, we could add a separate section that describes how to handle the attribution element if it shows up as an extension element somewhere. This might be useful because (as much as I hate it personally), FamilySearch product management has decided to go way overboard on attribution and is adding it at a much more fine-grained level than is sane. This is why attribution is currently on SourceReference, for example.

I don't believe we are trying to re-create a source/change control system here.

Agreed. But I agree with @jralls that it might be appropriate as a separate specification or as an extension project. I know that a "change history" will be part of the FamilySearch Platform APIs, so it might be useful to have such a notion for import/export, too.

If anyone knows about an existing semantically-aware VCS, I'd love to hear about it.

Me too! Sounds like a fascinating research project, if you ask me.

I think we really need the requirements / use cases for collaborative genealogy support in GEDCOM X.

I think we need to hash those out here. As I mentioned earlier, FamilySearch has their own ideas about the level of granularity for attribution, but that doesn't mean we have to explicitly provide for those ideas in the spec, especially if it means a significant reduction in complexity.

I guess I thought that the requirements could start at just being able to track who edited and committed each top-level entity.

Anyway, thanks again for your comments. I'm going to put together a set of changes and attach them to this issue as we hash through this. I'll start with the changes I proposed above.

@stoicflame
Member

Changes have been attached to this issue, awaiting your review, summarized as follows:

  • Move confidence from Attribution to Conclusion
  • Clarify the issue of granularity of attribution in the conceptual model, recognize attribution as an extension property.
  • Recognize attribution explicitly for Person, Relationship, Document, Event, Note, and SourceDescription. Remove it from Name, Gender, Fact, and SourceReference.

@jralls
Contributor Author

jralls commented Sep 23, 2012

Hmm. It's easy to understand Attribution on Document, Note, and SourceDescription, but what does it mean on the container-like objects Person, Relationship, and Event? Is it the last person who touched one of the contained conclusions (Name, etc.)?

@stoicflame
Member

Same definition applies: attributed to the agent who made the latest significant change to the person. I would presume that modifying a name conclusion on a person would be considered significant, yes.

Of course if the application wanted to keep track of changes at the level of the Name conclusions, it could provide an attribution on the Name as an extension element. But as defined here, the level of granularity is at the Person. That's open for discussion, though. Should we just put it back on the Conclusion base type?

@jralls
Contributor Author

jralls commented Sep 24, 2012

should we just put it back on the Conclusion base type?

I still don't see a use-case for it. I think at this point I agree with Sarah: Take it out.

@stoicflame stoicflame merged commit c7265b0 into master Sep 25, 2012
4 participants