clarify again support for "n-tiered" implementation #149

Closed
wants to merge 8 commits into from

stoicflame
Member

We started on this with #72, but things have changed a bit. What we need are some recipes.

See #138 for discussion on this point.

@jralls
Contributor

jralls commented Feb 23, 2012

I'll step out on a limb a bit and summarize Tom Wetmore's view: He sees the progression from individual source records to a final conclusion as a series of combinations of "Personas" (the notional persons reflected in a single document) into a set of conclusion "Persons" (each represented by the biographical sketch that is the output of Family History research). In order to model this series he argues that the conclusion model's Person class should support recursive aggregation. In other words, each Person (except one derived directly from a source document) should have a 1->1..* aggregation of Person objects, each supported by a proof argument.

@jralls
Contributor

jralls commented Feb 23, 2012

I find this model somewhat attractive if a bit mechanistic. The principal advantage is the ease of dissociation: If one decides later that a particular upstream person is actually not the person one thinks he is, it's very easy to remove that upstream person and his whole set of upstream persons from the present conclusion without disrupting the other parts of one's conclusion and without having to rewrite from scratch a proof argument.

The problem with that approach is that according to the Genealogical Proof Standard a good proof argument requires a synthesis of all of the collected evidence taken together, including conflicting evidence. If one decides that one set of John Smiths in the evidence of Fooville is not the John Smith that one is writing about, one must explain that, and why one believes it, in the proof argument. Simply dropping that set of records without explanation would give the impression that one hadn't conducted the required "reasonably exhaustive search".

That said, there may be other benefits to an N-tiered recursive construct, and it isn't difficult to support -- a PrecursorPerson class with attributes Person, Researcher, and ArgumentString, with class Person having a 1->0..* association with PrecursorPerson will do the trick. The more interesting problem will be for applications which don't choose to support N-tiered to import GedcomX datasets which do. If we decide to support N-tiered we should at least outline a recommendation for how to go about that.
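A minimal Python sketch of the construct described above, for illustration only. The class and attribute names PrecursorPerson, Person, Researcher, and ArgumentString come from the comment; the snake_case spellings, the `name` field, the `dissociate` helper, and the sample data are assumptions and are not part of any GEDCOM X specification.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PrecursorPerson:
    """Link from a conclusion Person to one upstream Person."""
    person: Person          # the upstream Person being folded in
    researcher: str         # who made the decision
    argument_string: str    # proof argument supporting the link


@dataclass
class Person:
    name: str  # hypothetical field, only here to label the examples
    # 1 -> 0..* association; empty for a Person taken directly from a source
    precursors: list[PrecursorPerson] = field(default_factory=list)

    def dissociate(self, upstream: Person) -> None:
        """Drop one upstream Person; its own precursors go with it."""
        self.precursors = [p for p in self.precursors if p.person is not upstream]


# Hypothetical data: two upstream persons combined into one conclusion.
census = Person("John Smith (1850 census)")
deed = Person("John Smith (1852 deed)")
john = Person("John Smith of Fooville")
john.precursors.append(PrecursorPerson(census, "researcher-1", "Same household and age."))
john.precursors.append(PrecursorPerson(deed, "researcher-1", "Same wife named."))

# Later decision: the deed belongs to a different John Smith.
john.dissociate(deed)
print([p.person.name for p in john.precursors])
```

Dissociation is a one-line list operation here, which is the ease described above. Note that the sketch silently drops the argument attached to the removed link, which is exactly the Genealogical Proof Standard concern raised in the same comment.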

@EssyGreen

The problem with that approach is that according to the Genealogical Proof Standard a good proof argument requires a synthesis of all of the collected evidence taken together, including conflicting evidence.

+1

The more interesting problem will be for applications which don't choose to support N-tiered to import GedcomX datasets which do. If we decide to support N-tiered we should at least outline a recommendation for how to go about that.

+1

I believe that an alternative approach is for a Person (Conclusion Model) to support a collection of Evidence, each Evidence containing a single link to the Persona (Record Model) which the researcher believes is (or is not) the Person being researched.
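One way to picture this alternative, as a hedged Python sketch. Only the ideas of Person, Evidence, and Persona come from the comment; every field name (`is_same_person`, `reasoning`, `source`) and the helper methods are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Persona:
    """Record Model: a person as described by a single source."""
    source: str
    name: str


@dataclass
class Evidence:
    """One researcher judgment about one Persona."""
    persona: Persona        # the single link to the Record Model
    is_same_person: bool    # True: believed to be the Person; False: believed not to be
    reasoning: str


@dataclass
class Person:
    """Conclusion Model: the individual being researched."""
    name: str
    evidence: list[Evidence] = field(default_factory=list)

    def accepted(self) -> list[Persona]:
        return [e.persona for e in self.evidence if e.is_same_person]

    def rejected(self) -> list[Persona]:
        return [e.persona for e in self.evidence if not e.is_same_person]


# Hypothetical data: one accepted and one rejected persona.
john = Person("John Smith")
john.evidence.append(
    Evidence(Persona("1850 census", "John Smith"), True, "Age and household match."))
john.evidence.append(
    Evidence(Persona("1851 tax roll", "John Smith"), False, "Resident of another county."))
print(len(john.accepted()), len(john.rejected()))
```

Because rejected personas stay attached with their reasoning, conflicting evidence remains visible to the proof argument instead of disappearing from the record.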

@stoicflame
Member Author

Alright, @ttwetmore, I've started a wiki page to describe the n-tiered architecture and how it's supported. Would you be willing to fill in the relevant todo: sections for me?

https://github.com/FamilySearch/gedcomx/wiki/The-N-Tiered-Genealogical-Architecture

Once you fill that in, we can start hacking out how to support those concepts in GEDCOM-X until we get this right.

@stoicflame
Member Author

@jralls and @EssyGreen, I think you've got some good points. Let's give @ttwetmore a chance to fill in his ideas so we can all make our own evaluations on how to best accommodate those ideas in a standard.

@ttwetmore

@jralls summarizes my views quite well; thanks. I will write some on my own about places where there are some other shades of meaning when I get a little time. I'll try this afternoon.

@ttwetmore

Sources have locations (where they are), descriptions (called metadata), and content (the evidence therein). We store the location and metadata in source records and source references. Where do we store the evidence?

There are standard answers to this question, and there are variations:

  1. Nowhere. When we want to consult the evidence we look at the source records to find out where the evidence is and we go wherever the source is and look at it.
  2. As paper copies in folders or notebooks independent of our computers.
  3. As image files, spreadsheet files, etc, on a computer, not associated with a genealogy program.
  4. As computer files that are also known to a genealogy program so the program can help manage them.
  5. Digitized into text and included in the source records (now holding location, metadata and content).
  6. Digitized into text in evidence records, the most universal being the persona record. A persona record holds all the information inferable about a person that can be extracted from a single item of evidence. Persona records have references to their sources. Although they are derived from a source, some people call them conclusions because of the interpretations made during the extraction.
  7. Any combination of the above and any other way you can think of.

For the remainder I assume that GEDCOMX goes with option 6 and includes evidence records, and more specifically, persona records. This is not a done deal, but for any discussion of the N-tier idea to make sense personas must exist.

A model with personas also requires conclusion persons. A conclusion person is the classic person record of today’s genealogy programs. It is where all the information believed to be true about a real individual accumulates. It is the collection of facts that pertains to that individual, and each fact can have its own source reference to provide provenance.

When using persona records and conclusion persons, genealogical research proceeds by uncovering evidence, extracting personas from the evidence, concluding which sets of personas refer to which individuals, and then building a conclusion person for each individual from those sets of personas.

What happens to the personas during this process? There are two obvious answers. First, the personas can be consumed by the process, yielding up their facts to be added to the growing conclusion persons. Since each fact has the same source reference as its persona, the chain of evidence can be maintained at the fact level. After the facts are copied to the conclusion persons the personas are removed.

There is a second answer, however, where the personas are not removed, and each conclusion person is linked to its own set of personas. In this scheme the conclusion persons inherit their facts from their personas, and in most cases the facts don’t need to be copied into the conclusion persons at all.

There are advantages and disadvantages to both. The first approach can be implemented in every genealogy program around today. Each new persona is added to the program as a new person record, and when the user decides it describes the same individual as another person in the database, the user merges them together. The disadvantage of this approach is that after the records are merged it is hard to correct errors, undo merges, and keep track of the decision history that led to the building of that conclusion person.
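The first approach can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular genealogy program works: `Fact`, `PersonRecord`, `merge`, and the sample data are all hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    text: str
    source: str   # each fact carries the source reference of its persona


@dataclass
class PersonRecord:
    name: str
    facts: list[Fact] = field(default_factory=list)


def merge(target: PersonRecord, persona: PersonRecord) -> PersonRecord:
    """Copy the persona's facts into the target; the persona is then discarded."""
    target.facts.extend(persona.facts)
    return target


conclusion = PersonRecord("John Smith", [Fact("born about 1820", "1850 census")])
new_persona = PersonRecord("Jno. Smith", [Fact("bought land in Fooville", "1852 deed")])
merge(conclusion, new_persona)

# Provenance survives at the fact level ...
print([f.source for f in conclusion.facts])
# ... but nothing records that "Jno. Smith" was ever a separate record,
# or why the two were judged to be the same individual.
```

The sketch makes the stated disadvantage concrete: after the merge there is no structure left from which the decision could be undone.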

Things are reversed for the approach that keeps the personas. One disadvantage is that little software today supports the approach. It also has the disadvantages of requiring more records in the database and more complexity in its implementation. Its advantage is in keeping a persistent, complete record of all the evidence, a clear history of the research process, and a nearly painless solution to the undoing of decisions when errors or new evidence are discovered. I strongly believe that the advantages of the persistent persona approach outweigh both its own disadvantages and the advantages of the first approach.

A system that supports grouping persistent personas into sets that are bound together by conclusion persons to represent individuals is a 2-tier system. The New Family Search application is a 2-tier system. The current GEDCOMX model, with its Record Model and Conclusion Model anticipates a 2-tier system.

From a 2-tier model it is a short step to an N-tier model. In an N-tier model, person records are joined together into tree structures, with the persons at the leaves being persona records extracted from evidence, the persons at the roots being the user’s current set of conclusion persons, and the interior persons representing intermediate decisions and conclusions made at different times during the research process.

In a 2-tier system the possibly long history of decisions a user makes while bringing a set of personas together gets lost in the single proof statement for the conclusion person. In the N-tier system, the root person and all interior persons represent clean, direct and easily describable decision points during the research process. The structure of the tree is therefore isomorphic to and fully captures the research and decision processes. The overall conclusion at the root person level is the recursive recapitulation of the decisions made at each node during the construction of the person tree.

The N-tier approach also makes the conclusion making process reversible, and because the decision making in a tree is only partially ordered (i.e., the same tree can be built by making certain conclusions in different orders), the tree can be decomposed in an order other than opposite the order of construction. Therefore decomposing N-tier person structures can lead to a different and more likely set of conclusion persons than existed at any time before in the research process. You can actually advance in your research process when undoing incorrect decisions.

Ironically the recursive N-tier system is simpler than a 2-tier system. A single person record suffices all the way up, from the persona level to the root conclusion level. Simple recursion suffices throughout.
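The recursive structure described above can be sketched as a single Python class. This is a minimal illustration under assumptions: the field names (`label`, `facts`, `sub_persons`, `rationale`), the helper methods, and the sample records are hypothetical and do not come from the GEDCOM X model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Person:
    """One record type at every tier: leaf = persona, root = conclusion person."""
    label: str
    facts: list[str] = field(default_factory=list)         # facts stated at this node
    sub_persons: list[Person] = field(default_factory=list)
    rationale: str = ""   # why the sub-persons are believed to be one individual

    def is_persona(self) -> bool:
        return not self.sub_persons

    def all_facts(self) -> list[str]:
        """Own facts plus those inherited recursively from sub-persons."""
        collected = list(self.facts)
        for sub in self.sub_persons:
            collected.extend(sub.all_facts())
        return collected

    def tiers(self) -> int:
        return 1 + max((sub.tiers() for sub in self.sub_persons), default=0)

    def detach(self, sub: Person) -> Person:
        """Undo one decision; the detached subtree stays intact for reuse."""
        self.sub_persons = [s for s in self.sub_persons if s is not sub]
        return sub


census = Person("1850 census", ["born about 1820"])
deed = Person("1852 deed", ["bought land in Fooville"])
will = Person("1870 will", ["died 1870"])
fooville_john = Person("John of Fooville", sub_persons=[census, deed],
                       rationale="Same farm described in both records.")
john = Person("John Smith", sub_persons=[fooville_john, will],
              rationale="Will names the Fooville farm.")

print(john.tiers(), john.all_facts())
john.detach(will)   # later judged to be a different John Smith
print(john.all_facts())
```

Detaching `will` leaves the intermediate conclusion `fooville_john` untouched, which is the sense in which decisions can be undone in an order other than the reverse of construction.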

@EssyGreen

Just a quick point on terminology ... I think @ttwetmore's use of source and evidence may be different to mine.... @ttwetmore seems to indicate (forgive me if I'm wrong) that the data in the source is the evidence (a bit like Evidence=Source.Data). For me the data in a source is just data unless used in the context of trying to prove something else (a bit like Evidence=Source.Data+Research.Subject). I'm not saying either is right or wrong but it may help to have a clear Glossary of definitions documented with the model so we all refer to things in the same way.

@EssyGreen

@ttwetmore - the purpose of this thread seems to be to focus on:

how to best accommodate those ideas in a standard

Whilst I get the strength of your beliefs in your post, I am not clear on what your requirements are. I assume that the only need is the ability to link multiple Persona records together in some sort of object together with a ProofStatement? Is that what is being proposed/required? Or is there something else/different/more? Could you clarify for me in terms of the model rather than in terms of your particular application of the model?

@ttwetmore

@EssyGreen
I define evidence as any information found in a source that provides genealogical facts of interest to a genealogist and has been extracted in some way to make it available. I believe your definition goes further and says that information from sources doesn't become evidence until it is actually used to make conclusions. For me, the fact that someone was interested enough to gather the information is qualification enough to call it evidence. Or put it this way -- the fact that the information could just potentially be used as evidence (in your sense of the word) in some future argument or conclusion is enough for me to call it evidence.

@ttwetmore

@EssyGreen
My requirements are full support for the genealogical research process, which now gets called the five step GPS. For me this includes:

  1. Persistently recording all sources and all evidence extracted from those sources, represented in my view by source records and evidence records.
  2. Persistently recording and documenting all decisions and conclusions made during the process. There are a few types of decisions that have to be made, but by far the most important one, that has to be made time after time after time, is what real human being (I use the term individual for this concept) each new item of evidence refers to.
  3. Representing the decision making directly in the database in a way that fully records and documents and maintains the history of the process, and that allows the process to handle errors and new evidence with ease.

I believe the best way to meet these requirements is a system in which the collected evidence about persons is recorded in persistent persona records, and in which nearly all conclusions are tied directly and persistently to the decisions the user makes when deciding which persona records refer to which individuals. When those decisions are made I believe they should be fully recorded, and the method I have outlined for linking the persona records into a person tree structure is the one I believe best meets that need. The person tree structure is nothing more than an indelible representation and history of the decisions made during the research process.

@jralls
Contributor

jralls commented Feb 24, 2012

For me this includes:

I'd like to add:

  1. Persistently recording all searches for relevant evidence in sources and for potentially relevant sources and the outcomes of those searches.

@ttwetmore

@jralls
I think the GDM calls that the administration model. It's important, but does it belong in the overall GEDCOMX model, or should it be independent? I don't know.

@EssyGreen

@ttwetmore

I believe your definition goes further and says that information from sources doesn't become evidence until it is actually used to make conclusions

Yes that is true and I need the distinction in my model because the Evidence is an object in its own right ... but I digress ... my point was just to highlight the need for a common vocabulary.

a system in which the collected evidence about persons is recorded in persistent persona records, and in which nearly all conclusions are tied directly and persistently to the decisions the user makes when deciding which persona records refer to which individuals.

So, what exactly is lacking in the current model which means that you cannot apply N-tier with it?

@EssyGreen

@jralls

Persistently recording all searches for relevant evidence in sources and for potentially relevant sources and the outcomes of those searches.

I agree re the need to record searches and results (and also the goals - see #141 ) but the how and what details I think are worthy of a separate post.

@ttwetmore

So, what exactly is lacking in the current model which means that you cannot apply N-tier with it?

Almost nothing. We simply need a way for a person record to be able to refer to sub-person records. There are similar arguments to be made about evidence events, but I'd rather not muddy the waters.

@jralls
Contributor

jralls commented Feb 24, 2012

I think the GDM calls that the administration model. It's important, but does it belong in the overall GEDCOMX model, or should it be independent? I don't know.

It's part of the administrative section. The GDM doesn't break the parts down into separate models, but the ERD does have some lines that rather arbitrarily divide it into "administrative", "evidence", and "conclusion" sections. At your request I've summarized the GDM in #138.

Obviously I think that GedcomX needs to record searches. It's an element of the GPS, so it needs to be in the model. Sarah's right, though, I shouldn't be muddying up this issue with it.

@EssyGreen

We simply need a way for a person record to be able to refer to sub-person records

In this we agree. N-tier or non-N-tier can use this according to the needs of the user/application.

@EssyGreen

Actually scratch that ... I should have read it more carefully .... I thought you meant a way for Person records to refer to Persona records (which I would agree with). If you mean Person->Person then I think the context needs to be specified or it is ambiguous (e.g. like old ALIAses).

@ttwetmore

Actually scratch that ... I should have read it more carefully .... I thought you meant a way for Person records to refer to Persona records (which I would agree with). If you mean Person->Person then I think the context needs to be specified or it is ambiguous (e.g. like old ALIAses).

I do mean Person->Person*, because in an N-tier there are no separate persona records -- the person records at the leaves are logical personas and that's it. But this is a specific 1-to-n relationship so it has its own semantics and label. I don't have any clever label for it -- I've been calling it "subPerson" until someone thinks of something better. If anyone thought there should be another kind of 1-n relationship between persons, that relationship would have to have its own name and semantics.

@EssyGreen

I do mean Person->Person

Yes I got that (albeit the second time round hehe). It strikes me that this is the same thing as the old GEDCOM ALIAs (if not can you explain how it differs?)

If so, then personally I would not use it because I think it adds unnecessary complexity (I have never yet seen an application that handles ALIAses well - and I can understand why) and it can be more clearly exemplified by creating new trees/files for the different possibilities which can then be used/referenced as sources if/when a conclusion is reached in the original file. Since with GEDCOMX we will now be able to handle recursive sources I don't see why we should add the complexity into the base model.

If it is included as the standard then applications will either be forced to implement N-tier in the way you have specified or to reject any data formatted in this way (resulting in loss of data).

If on the other hand it is not included in the base model, then applications wishing to implement N-tier in the way you specify can simply merge the Persons in the different trees/files using the ALIAs or equivalent. (ie upon finding a Person with a Persona which comes from a Source which is a GEDCOMX file then you can lookup the Persons which this Person represents in the Source file, add them as Persons to your file and link via the ALIAs)

For that reason I would prefer it not to be implemented in the model.

@ttwetmore

@EssyGreen
From the GEDCOM standard

ALIA An indicator to link different record descriptions of a person who may be the same person.

The ALIA tag could be used to implement the N-tier structure. Of course GEDCOM is supposed to only hold conclusion persons, and the semantics of ALIA are supposed to mean "may be the same conclusion person." So though ALIA does provide a person->person* mapping, it 1) has entirely different semantics; and 2) has never been treated meaningfully in current software, so it has never been implemented well. So to use the existence of the alias concept as a reason to reject the N-tier approach is a non-starter.

However, to argue about what requirements the N-tier approach would place on a genealogy software program does make sense. If a program that can't handle the N-tier approach were to import GEDCOMX data with N-tier structures, that software would be in a quandary on how to proceed. One thing for sure, the program couldn't claim to be GEDCOMX compatible. How big a concern must this be for GEDCOMX in defining a new model?

For me the purpose of the N-tier structure is to allow full support of the research process (minus, as has been pointed out by @jralls, the administrative part). My main criticism of the current generation of desktop systems is that they don't provide the features that allow this support. I have always characterized the current set of desktop systems as conclusion only without providing enough research support. So my concern over whether the current set of programs could handle N-tier data is not high. I don't think any could support it without adding features.

In the development of new standards how much concern should be placed on trying to model an area of human discourse in a relative vacuum of what's gone on before, versus how much concern should be placed on the precedents that have been set by the current set of standards, or on how difficult it will be for vendors to implement a new standard? These are tough questions that GEDCOMX will have to answer. My answers to questions like these are always to do what from the technical or scientific or idealistic viewpoints seems the best. I would add the N-tier structure with the attitude that vendors adjust or fade away; that it's good for them even if they kick and scream. Maybe an extreme view that should be rejected out of hand.

@EssyGreen

If a program that can't handle the N-tier approach were to import GEDCOMX data with N-tier structures, that software would be in a quandary on how to proceed. One thing for sure, the program couldn't claim to be GEDCOMX

This is only true if GEDCOMX implements N-tier in the way you have specified.

the purpose of the N-tier structure is to allow full support of the research process

Maybe, but it is not the only way that the research process can be fully supported - indeed by your own admission you do not deem the goal/search areas to be important (which others might).

My concern is that N-tier (as you have specified it) is an overly complex model which many applications (and users) may have difficulty with.

@EssyGreen

I would add the N-tier structure with the attitude that vendors adjust or fade away; that it's good for them even if they kick and scream

That depends on who holds the majority vote. If some "non-standard" application(s) are seen to support the research process better than those who simply uphold GEDCOMX then it could be GEDCOMX that fades away. In the current market, the applications which are in my opinion more useful only partially support the existing GEDCOM. Conversely, I know of at least one who rigidly stands by GEDCOM 5.5.0 and is (in my opinion) stifling itself by so doing.

GEDCOMX must be flexible enough to support a multitude of applications not just a single way of doing things.

@ttwetmore

@EssyGreen
I agree with you on most of these points. I would be very interested in other ways to implement N-tier other than through a person->person* approach.

I view the addition of the goal/search stuff as such an easy thing to do that I don't worry about it -- someone who cares about it should decide what to add to GEDCOMX to support it and GEDCOMX should do it.

I have freely admitted (it was among the disadvantages highlighted in my missive) that the N-tier model is complex (I would not go so far as to say overly complex as you have). It will be a challenge to implement for developers, but not so hard for experienced developers. A good implementation would go a long way to making it accessible for users. A few excellent user interface metaphors (e.g., moving index cards around on a desk top) could go a long way in helping a user -- simply provide them with a user interface that mimics the paper and pencil approach they now use.

All that said, I agree with you fully that it is a complex idea with difficulties in its implementation and use. Do its advantages outweigh these disadvantages? Many say no, no, no, no, no. That's life in the high technology fast lane.

@ttwetmore

@EssyGreen

That depends on who holds the majority vote. If some "non-standard" application(s) are seen to support the research process better than those who simply uphold GEDCOMX then it could be GEDCOMX that fades away.

This is a dirty little secret that standards writers don’t talk about. If a killer app shows up that sweeps the stakes, its underlying model will become the de facto standard no matter what. My best approach, if GEDCOMX decides that the N-tier approach is too complex, would be to write that killer app ;)

GEDCOMX must be flexible enough to support a multitude of applications not just a single way of doing things.

Of course, but there are some issues lurking. The big one, of course, is what does an application do when a GEDCOMX import file uses features that the application does not support? I’m sure you could come up with examples. Mine would center around the N-tier stuff, since that is a feature that would probably be implemented last if at all by some applications.
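To make the import question concrete, here is one possible fallback for an application that keeps only flat person records. It is a sketch under assumptions, not a recommendation made anywhere in this thread: the dict layout, the `facts` and `rationale` keys, and the `flatten` function are hypothetical, and `subPersons` simply borrows the working label "subPerson" used earlier.

```python
from __future__ import annotations


def flatten(person: dict) -> dict:
    """Collapse an N-tier person tree into one flat, conclusion-only record.

    Facts from every tier are kept; the tree structure is lost and the
    per-node rationales survive only as free-text notes.
    """
    facts: list[str] = []
    notes: list[str] = []

    def walk(node: dict) -> None:
        facts.extend(node.get("facts", []))
        if node.get("rationale"):
            notes.append(node["rationale"])
        for sub in node.get("subPersons", []):
            walk(sub)

    walk(person)
    return {"name": person["name"], "facts": facts, "notes": notes}


# Hypothetical imported data with three tiers.
n_tier = {
    "name": "John Smith",
    "rationale": "Will names the Fooville farm.",
    "subPersons": [
        {
            "name": "John of Fooville",
            "rationale": "Same farm described in both records.",
            "subPersons": [
                {"name": "Jno. Smith", "facts": ["born about 1820"]},
                {"name": "John Smith", "facts": ["bought land in Fooville"]},
            ],
        },
        {"name": "John Smith Sr.", "facts": ["died 1870"]},
    ],
}

flat = flatten(n_tier)
print(flat["facts"])
print(flat["notes"])
```

The result imports cleanly into a 1-tier program, but it cannot be exported back as the original tree, which is the data loss discussed above.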

@jralls
Contributor

jralls commented Feb 25, 2012

I would add the N-tier structure with the attitude that vendors adjust or fade away; that it's good for them even if they kick and scream.

They won't scream, they'll ignore, and GedcomX will be stillborn.

@ttwetmore

@jralls

They won't scream, they'll ignore, and GedcomX will be stillborn.

You may be right.

@EssyGreen

@ttwetmore

I'm sorry if you took my comments as criticism. They were not intended as criticisms of you or your model but simply my opinions of the implications of applying your particular implementation of N-tier in the context of GEDCOMX. Re your specific issues:

Data loss -- criticism unfair -- it has nothing to do with the model, only its acceptance.

Agreed. I should have phrased it as "I believe because of the issues below that there will be a significant number of applications not willing/able to apply your particular implementation of N-tier and in migrating away from an N-tier to a non-N-tier application there will therefore be data loss."

Ambiguity -- there is never ambiguity -- the sub person relationship always means believed to be the same person because of ..., where the because of ... is supplied by a conclusion statement or a proof statement.

OK so you now have an ALIAs object rather than just a link - presumably with evidence links as well as a proof statement ... this was a misunderstanding on my part and is an improvement from my perspective.

However, I still maintain that the directional constraints (based on the order of discovery) which you outlined earlier ("the order of matching/linking/combining/merging") will need to be clarified in order to ascertain how/if the links should be validated:

you argued elsewhere that the order of discovery was important and hence Person A = Person B in your model is not the same thing as Person B = Person A, so does the link really mean "Person A (who was discovered first) is thought to be the same as Person B (who was discovered later)"? If a user then attaches the reverse link then this statement no longer makes any sense. Should the application allow this or not?
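If an application chose to reject the reverse link, the check could look like the Python sketch below. It assumes links are directed from a higher-tier person to a sub-person and stored as a plain mapping of identifiers; `links`, `add_sub_person`, and `CircularLinkError` are hypothetical names, and whether rejection is the right policy is the open question here.

```python
from __future__ import annotations


class CircularLinkError(ValueError):
    """Raised when a new link would make a person its own ancestor."""


def reachable(links: dict[str, set[str]], start: str, goal: str) -> bool:
    """Depth-first search: can `goal` be reached from `start` via sub-person links?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(links.get(node, ()))
    return False


def add_sub_person(links: dict[str, set[str]], parent: str, child: str) -> None:
    """Record parent -> child unless the parent is already below the child."""
    if reachable(links, child, parent):
        raise CircularLinkError(f"{child} already contains {parent}")
    links.setdefault(parent, set()).add(child)


links: dict[str, set[str]] = {}
add_sub_person(links, "A", "B")      # A is thought to be the same as B
try:
    add_sub_person(links, "B", "A")  # the reverse link
    reverse_allowed = True
except CircularLinkError:
    reverse_allowed = False
print(reverse_allowed)
```

The same check also rejects a person linked to itself and longer loops such as A -> B -> C -> A.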

Re:

Confusion -- the top level person in a cluster always represents the conclusion person. In 99.9% of the cases data will be 1 and 2 tier so exactly as today. The users of NFS have no trouble with 2 tier, because the UI makes it seamless.

I don't know what NFS is so can't follow your argument here (except by guesswork) but I agree that the UI is the key to making it clear .... hence my next point about the difficulty you are imposing on the developers in creating a good UI.

Complexity -- unfair -- your criticism revolves around the assumption that developers are incompetent and some odd misconceptions that the model requires repeated activities and reconstructions and that circular relationships are difficult to prevent. Merging names? Never happens.

I think your comments here are the ones that are unfair! My comments do not assume anything of the sort!!!

I have always found that the best way to develop a model is to base it on the real world. This makes it easier to develop and more intuitive when explaining concepts to users. I don't believe I am alone in this "misconception".

To reiterate my main point:

I firmly believe that genealogists are trying to re-create the real-world. So the primary objects should be modelled on the real world (ie Persons/People). Your model is a model of the interconnectivity of references to people in sources. That is not the same thing.

I can see that your model would fit very neatly in the world of what I call social-networking genealogy (e.g. MyHeritage, GenesReUnited). You may also be right in that some genealogical research applications may be able to utilise your model in exactly the way you want it implemented. However, I personally think that it is a specific implementation which will not fit well in many other genealogical research applications and hence should not be considered core to the Conclusion Model.

@ttwetmore

I think I prefer higher-tier or higher-level to upper.

At ZoomInfo we called the very top level "individuals" and everything else persons or personas.

Sorry I can't be much more help than that.

@stoicflame
Member Author

Hi all. RootsTech is over. Back to work.

Your comments are invited on the latest provisions to this issue. I'd like you to especially comment on a113ea8 which extracted out the n-tier architecture provisions into a separate specification. As I added some clarification to the documentation and to the examples, it became pretty clear that we needed a distinct place for this concept to be specified. This will allow us to specify more details about the n-tier architecture (such as how to resolve conflicts) in a distinct place where these principles can evolve separately from the underlying conceptual model.

You missed "persons ... its" in "Ick". Either "Upper-tier persons ... their" or (my preference) "An upper-tier person ... its".

See 9b8f2ba.

A bit inconsistent here ... It would be helpful to have an example for an Evidence identifier.

See 3a67f4f.

I think extractedConclusions are still useful for the non-n-tier model. It came from one of the source discussions IIRC.

I'm listening. I couldn't think of any use case that would justify the cost of explaining what those things are.

@jralls
Contributor

jralls commented Mar 28, 2013

The n-tier spec depends on the conceptual model, and not the other way around.

Hmm. OK. I guess.

There's not much to like about the element name, and ISTM 2-tier will cause most software to choke, too. But see below.

The conceptual model supports those models, but I still don't think it needs to explicitly define them. I think we can use "non-normative" documentation (e.g. user documentation, recipes, example data, etc.) to help clarify how applications can exchange data from those models.

OK, rereading in context instead of from the changeset makes it a little clearer (formatting helps ;-) ). Maybe the extensive discussion of "Persona Constraints" belongs in the n-tier document. The description in the conceptual spec could be something like "flag for layered-person evidence architectures" and mention the n-tier spec as the place to look for more information. A weak reference rather than a hard dependency.

I'm starting to see 2-tier as a special case of n-tier rather than an extension of 0-tier, because I think that most current software isn't going to be able to handle any layering. That suggests that your "complianceRequirement" element should apply to both -- but complianceRequirement isn't a very good name. Why not "evidenceArchitecture" with values "traditional", "2-tier", or "n-tier"? Even a program capable of handling all three would benefit from being told up front which kind to expect, particularly if it doesn't want to construct the DOM tree in memory.
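A sketch of how an exporting program might compute such a flag, written in Python for illustration. The three values are the ones suggested above; the enum, the dict layout with a `subPersons` key, and the depth-to-value mapping (1 tier is "traditional", 2 is "2-tier", 3 or more is "n-tier") are assumptions.

```python
from __future__ import annotations

from enum import Enum


class EvidenceArchitecture(Enum):
    TRADITIONAL = "traditional"
    TWO_TIER = "2-tier"
    N_TIER = "n-tier"


def depth(person: dict) -> int:
    """Number of tiers in one person tree (a lone person counts as 1)."""
    return 1 + max((depth(sub) for sub in person.get("subPersons", [])), default=0)


def declare_architecture(persons: list[dict]) -> EvidenceArchitecture:
    """Pick the value a dataset would announce up front, from its deepest tree."""
    deepest = max((depth(p) for p in persons), default=1)
    if deepest <= 1:
        return EvidenceArchitecture.TRADITIONAL
    if deepest == 2:
        return EvidenceArchitecture.TWO_TIER
    return EvidenceArchitecture.N_TIER


dataset = [
    {"name": "Mary Jones"},
    {"name": "John Smith", "subPersons": [{"name": "Jno. Smith"}]},
]
print(declare_architecture(dataset).value)
```

An importer could read the declared value first and decide whether to proceed, flatten, or refuse before parsing any person records.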

@stoicflame
Member Author

Maybe the extensive discussion of "Persona Constraints" belongs in the n-tier document.

I disagree because I think the notion of a "persona" is universal and distinct from evidence architectures. For example, there are providers who are only interested in providing personas (think exchanging data from the 1940 U.S. Census Project) but aren't concerned how those personas are used to build up an evidence architecture.

I'm starting to see 2-tier as a special case of n-tier rather than an extension of 0-tier

Agreed.

That suggests that your "complianceRequirement" element should apply to both -- but complianceRequirement isn't a very good name.

Okay. How about just "constraint"?

Why not "evidenceArchitecture" with values "traditional", "2-tier", or "n-tier"?

But I wanted to keep the notion of compliance requirements flexible enough to handle more than just evidence architectures. I can imagine other data profiles that applications may want to force compliance with, too.

@mikkelee

I'm not too fond of the spec allowing too many variations. I mentioned it briefly as we discussed PlaceDescriptions, but in general I think a spec should be very strict regarding the ability to express the same data in multiple ways. The fewer ways possible, the better.

To me, 2-tier seems an artificial limitation of n-tier that will only annoy me in daily use. If possible, I'd get rid of traditional entirely, but I recognize that it might make migrations easier.

The descriptions of n-tier in the new document look reasonable.

@ttwetmore

I don't wish to try to read Ryan's mind, but I do wish to say something.

I see the developing GEDCOM-X standard as being able to support different models/philosophies for genealogical data. First, conventional conclusion-based genealogy where we only record summaries of information and their sources, where the person records are bags of facts, each fact taken from a possibly different source, and each fact, if the user is not too lazy, having a reference to its particular source record. Call this 1-tier. Or call this the underlying model of 99% of all current desktop and online genealogical applications.

But Ryan also wants the model to be useful in systems where there are distinct conclusion and evidence layers, where the conclusion persons are not bags of facts, but instead are bags of references to a lower tier of person records that contain facts; and each lower level person is restricted to containing only the facts extracted from exactly one source. Call this 2-tier. New Family Search trees had this to some extent.

And what the heck then. Also allow GEDCOM-X to be used in more complex models. Thus N-tier. Having N-tiers allows each node to represent a decision (that all lower persons are the same real person), which in turn allows complex decision trees to be easily managed. I love it. But how many years will go by before any system truly supports it? I am just wonderfully gratified to know that GEDCOM-X is willing to anticipate that use.

GEDCOM-X does not say anything about the models that should be used. GEDCOM-X simply enables developers to choose whatever model they believe is best for their application. I think of GEDCOM-X as specifying capabilities, but not the requirements on how to use the capabilities.

What is so wonderful about the whole thing, in my opinion, is that in order to support 2-tier and N-tier, all a data model has to do is have a way for any person record to be able to refer to a list of other person records. There is some discussion that you also need a tag in order to specify the kind of person record. Personally I don't think this is that important, but it's such an insignificant thing, it's hard to argue about it.

If GEDCOM-X, because it supports N-tier, tried to specify that systems using it had to support N-tier, GEDCOM-X would go down in dust. GEDCOM-X's job is simply to be there for whatever genealogical model a developer wants to implement. It's a meta-model at the capability level.

This all begs the question of how easy or difficult it will be to transfer data between different genealogical systems using GEDCOM-X as the transport mechanism. Think about the problems of transferring data from a 1-tier system to an N-tier system, and vice versa. There are some real issues in here. Are those issues big enough to declare that the goals of GEDCOM-X being capability based rather than requirements based are ill conceived? Speaking only for myself, I don't think so.

@jralls
Contributor

jralls commented Mar 28, 2013

Maybe the extensive discussion of "Persona Constraints" belongs in the n-tier document.

I disagree because I think the notion of a "persona" is universal and distinct from evidence architectures. For example, there are providers who are only interested in providing personas (think exchanging data from the 1940 U.S. Census Project) but aren't concerned how those personas are used to build up an evidence architecture.

Ah, the "record model" rears its head again. ;-)
What would motivate those providers to set the "persona" flag? It's not significant unless there are also Person instances which don't have it set. As soon as you introduce both types of Person, it becomes instantly necessary to connect them somehow.

Anyway, I said "extensive discussion", not the flag itself. The discussion is largely about connecting persona-constrained Persons into not-persona-constrained Persons. That action creates a 2+-tier Persona architecture. The conceptual spec can just provide a definition of "persona" and say that the persona constraint tells readers that this Person record fits the description.

That suggests that your "complianceRequirement" element should apply to both -- but complianceRequirement isn't a very good name.

Okay. How about just "constraint"?

It's not really a constraint on the whole GedcomX document, is it? Even the fact that a document contains persona-constrained Person instances shouldn't be a big deal to a 0-tier application: That's what such an application is going to expect from an item of evidence, even if the devs don't know to call it that. What's going to cause trouble is reading a GedcomX document that links (or, as Tom points out, doesn't) Person instances into a hierarchy.

But I wanted to keep the notion of compliance requirements flexible enough to handle more than just evidence architectures. I can imagine other data profiles that applications may want to force compliance with, too.

Can you articulate any of them? What if instead of calling it an "evidence architecture" you call it a "person-conclusion architecture", which more accurately describes what the introduction of personas does?

@jralls
Contributor

jralls commented Mar 28, 2013

I think a spec should be very strict regarding the ability to express the same data in multiple ways. The fewer ways possible, the better.

Unfortunately that would mean that n-tier must be excluded. Otherwise the only application to use it will be FS FamilyTree (assuming that FSFT is moving towards n-tier).

@jralls
Contributor

jralls commented Mar 28, 2013

Tom,
Nice summary.

This all begs the question of how easy or difficult it will be to transfer data between different genealogical systems using GEDCOM-X as the transport mechanism. Think about the problems of transferring data from a 0-tier system to an N-tier system, and vice versa. There are some real issues in here.

That's the rub, isn't it? At least a 0-tier program could flatten the tiers into a single Person instance. Teasing a flat person back out into tiers would be pretty difficult.
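
A minimal Python sketch of that flattening, assuming a hypothetical `Person` class whose `facts` and `evidence` fields are illustrative names and not taken from the GEDCOM X spec:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """Hypothetical stand-in for a Person record (not the GEDCOM X class)."""
    id: str
    facts: List[str] = field(default_factory=list)
    # Lower-tier Persons this Person is built from (assumed field name).
    evidence: List["Person"] = field(default_factory=list)


def flatten(person: Person) -> Person:
    """Collapse a tiered Person into one flat Person holding every fact below it."""
    facts: List[str] = []
    seen = set()
    stack = [person]
    while stack:
        node = stack.pop()
        if id(node) in seen:  # skip a persona reachable by more than one path
            continue
        seen.add(id(node))
        facts.extend(node.facts)
        # Push children reversed so they are visited in their listed order.
        stack.extend(reversed(node.evidence))
    return Person(person.id, facts)


a = Person("A", ["age 64 in 1840"])
b = Person("B", ["age 59 in 1834"])
ab = Person("A/B", evidence=[a, b])
flat = flatten(ab)
```

The flat result keeps every fact but no longer records which persona contributed which one, which is why the reverse direction is hard.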

Are those issues big enough to declare that the goals of GEDCOM-X being capability based rather than requirements based are ill conceived? Speaking only for myself, I don't think so.

I don't understand what you mean. "goals of GEDCOM-X" seems rather broad.

@thomast73
Contributor

But I wanted to keep the notion of compliance requirements flexible enough to handle more than just evidence architectures. I can imagine other data profiles that applications may want to force compliance with, too.

If this is about "profiles", can we just call it a profile?

<gedcomx>
  <profile resource="http://gedcomx.org/n-tier-evidence-architecture/v1"/>
  <!-- rest of the data goes here -->
</gedcomx>

@jralls
Contributor

jralls commented Mar 28, 2013

"Profile" is at least nicer than "complianceRequirement"!

But what's a "profile"?

@ttwetmore

Tom,
Nice summary.

Thanks.

This all begs the question of how easy or difficult it will be to transfer data between different genealogical systems using GEDCOM-X as the transport mechanism. Think about the problems of transferring data from a 0-tier system to an N-tier system, and vice versa. There are some real issues in here.

That's the rub, isn't it? At least a 0-tier program could flatten the tiers into a single Person instance. Teasing a flat person back out into tiers would be pretty difficult.

A similar discussion took place recently at rootsdev! 1-tier systems are called C-systems and 2-tier systems are called E-systems (you can probably figure out what the letters mean). The actual discussion is about how to collaborate between systems of the two types, meaning that the users want to share their data back and forth many times.

Are those issues big enough to declare that the goals of GEDCOM-X being capability based rather than requirements based are ill conceived? Speaking only for myself, I don't think so.

I don't understand what you mean. "goals of GEDCOM-X" seems rather broad.

What I would like the goals of GEDCOM-X to be -- the archival and transport format for genealogical data that includes all source, evidence, and conclusion information, where the data is digitally transcribed and structured and not simply narrative; attached images are fine.

@jralls
Contributor

jralls commented Mar 29, 2013

[summarizing] capability based rather than requirements based [specification for] genealogical data that includes all source, evidence, and conclusion information, where the data is digitally transcribed and structured and not simply narrative...

Seems rather more ambitious than what the users keep telling us at conferences that they want, which is the ability to losslessly transfer their data from program a to program b. I'm not unsympathetic to the ideal, but until someone actually writes a program that accomplishes the goal, writing a spec for interchange of that sort of data seems to be putting the cart before the horse. I don't think that it's yet possible to "digitally transcribe and structure" genealogical (or any other complex) reasoning. Well, transcribing is easy enough, but what does "transcribing without narrative" mean?

As Thad points out, an n-tier Person/Persona architecture doesn't come close to covering the complexity of high-quality genealogical reasoning: It actually gets in the way because it emphasizes direct evidence over indirect evidence.

@stoicflame
Member Author

@ttwetmore I think you did a great job of reading my mind. And you articulated it beautifully. Thanks.

What would motivate those providers to set the "persona" flag? It's not significant unless there are also Person instances which don't have it set.

Indeed. Good point.

The conceptual spec can just provide a definition of "persona" and say that the persona constraint tells readers that this Person record fits the description.

Uh... yeah. That's what the attached changes are proposing, no?

Can you articulate any of them?

Sure. But I'm afraid the ones I can come up with in the time I'm allotting myself might sound a bit contrived:

  • Some application requires that for a relationship of type Couple, person1 must be male and person2 must be female, so a new identifier for a data profile is defined and the import is rejected if the profile isn't declared.
  • Some application requires all images being referenced to be in JPEG format, so a new identifier for a data profile is defined and the import is rejected if the profile isn't declared.
  • Some application requires that all persons have at least a birth, death, and name, so a new identifier for a data profile is defined and the import is rejected if the profile isn't declared.

If this is about "profiles", can we just call it a profile?

I like it.

@jralls
Contributor

jralls commented Mar 29, 2013

Uh... yeah. That's what the attached changes are proposing, no?

Not when they go on about linking persona-constrained-Persons into not-constrained-Persons. Then they're about n-tier and they belong in the n-tier document.

Sure. But I'm afraid the ones I can come up with in the time I'm allotting myself might sound a bit contrived:

Contrived indeed. I don't think that's the sort of thing that we want to encourage: The paradigm for data exchange protocols is to write strictly and read liberally; this encourages the opposite. Client applications should be encouraged to read all GedcomX files, accepting what data they can use and issuing warnings for items they can't.

@stoicflame
Member Author

Not when they go on about linking persona-constrained-Persons into not-constrained-Persons. Then they're about n-tier and they belong in the n-tier document.

Agreed. But I'm not sure what you're referring to in the conceptual model document. The only place that "goes on about linking persona-constrained-Persons into not-constrained-Persons" is in the identifier examples, and I don't think that's specific to n-tier at all; I think that's generally applicable.

I don't think that's the sort of thing that we want to encourage

Agreed, but I think the right place to encourage/discourage these kinds of things is in the marketplace. I think this sort of feature provides a means whereby new and innovative ideas, constraints, and profiles (like n-tier) can be proven out and either thrive or die based on their success.

@mikkelee

I don't like the idea of profiles at all. That seems to open a can of worms where everyone can define incompatible profiles that could potentially completely break interchange.

@stoicflame
Member Author

I don't like the idea of profiles at all. That seems to open a can of worms where everyone can define incompatible profiles that could potentially completely break interchange.

So guys, I understand the visceral reaction to the proposal. I really do. But I'd like you to set aside the "gut reaction" for just a minute and consider the following:

  • Data providers aren't going to just arbitrarily define a bunch of incompatible profiles, because they have a real incentive to ensure the data they provide is as compatible as possible. The whole purpose of this project is to provide a means to exchange data between providers who actually want to exchange data. Providers that don't want to exchange data aren't interested in GEDCOM X.
  • We know there are competing modelling philosophies. Many (most?) of them haven't been proven out yet because we haven't been able to establish a "marketplace of ideas" in which these philosophies can compete. A common data exchange format is, in my opinion, the first step toward catalyzing this marketplace.
  • The n-tier architecture is an example of one of these philosophies that undeniably has merit, but hasn't had the opportunity to prove itself in a marketplace. The n-tier architecture can certainly be serialized using GEDCOM X, but without a way to stipulate constraints by declaring a "profile", the data gets corrupted when shared with other non-n-tier architecture applications.
  • In sum, modelling philosophies that have a real potential for benefit to the community get rejected, not because they don't have merit or value but because they don't have a means to make their case.

I guess I feel that providing a means to declare data profiles doesn't end up harming the marketplace at all, but in fact enables it to grow and develop and mature by providing a way to propose and vet new and innovative ideas.

@jralls
Contributor

jralls commented Apr 1, 2013

Data providers aren't going to just arbitrarily define a bunch of incompatible profiles, because they have a real incentive to ensure the data they provide is as compatible as possible. The whole purpose of this project is to provide a means to exchange data between providers who actually want to exchange data. Providers that don't want to exchange data aren't interested in GEDCOM X.

Yeah, the experience with GEDCOM really supports that claim. Right.

Providers sell software based on feature lists. They know their users want them to be able to exchange data, but they also want to lock users in to their upgrade cycle (or their website's ads). The solution? They provide the feature that users demand but they cripple it so that it doesn't actually do what the users want.

@jralls
Contributor

jralls commented Apr 1, 2013

The n-tier architecture is an example of one of these philosophies that undeniably has merit, but hasn't had the opportunity to prove itself in a marketplace. The n-tier architecture can certainly be serialized using GEDCOM X, but without a way to stipulate constraints by declaring a "profile", the data gets corrupted when shared with other non-n-tier architecture applications

Rubbish. A GedcomX parser can easily figure out from the data whether it's getting a 0-tier or an n-tier file and deal with it or bail out. Having a flag at the beginning might make it easier. It doesn't really contribute anything, but if restricted to that one thing it is relatively harmless.
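
As a rough sketch of that detection in Python: assume the parser has already reduced the document to a mapping from each Person id to the ids of the Persons it cites as evidence. The mapping shape and the function names `evidence_depth` and `classify` are assumptions for illustration, not anything defined by GedcomX; the labels reuse the "traditional", "2-tier", and "n-tier" values suggested earlier in this thread.

```python
from typing import Dict, List


def evidence_depth(refs: Dict[str, List[str]]) -> int:
    """Length of the longest chain of evidence references (0 means flat)."""
    cache: Dict[str, int] = {}

    def depth(pid: str, path: frozenset) -> int:
        if pid in path:
            raise ValueError(f"cyclic evidence reference at {pid}")
        if pid not in cache:
            children = refs.get(pid, [])  # ids absent from the map count as leaves
            cache[pid] = max(
                (1 + depth(child, path | {pid}) for child in children),
                default=0,
            )
        return cache[pid]

    return max((depth(pid, frozenset()) for pid in refs), default=0)


def classify(refs: Dict[str, List[str]]) -> str:
    """Label a document by how deeply its Persons are layered."""
    d = evidence_depth(refs)
    if d == 0:
        return "traditional"
    return "2-tier" if d == 1 else "n-tier"


flat_doc = {"A": [], "B": []}
two_tier_doc = {"X": ["A", "B"], "A": [], "B": []}
n_tier_doc = {"AB/CD/E": ["A/B", "C/D", "E"], "A/B": ["A", "B"], "C/D": ["C", "D"]}
```

A reader that only handles flat data could call `classify` first and then either flatten the input or bail out with a warning.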

The problem is that you want to generalize it and allow anyone to invent "constraint" flags. That's just giving the vendors another opportunity to have GedcomX on their bullet lists but make it non-functional for exchanging datasets.

@jralls
Contributor

jralls commented Apr 1, 2013

In sum, modelling philosophies that have a real potential for benefit to the community get rejected, not because they don't have merit or value but because they don't have a means to make their case.

Rubbish again. There's no barrier to using whatever data model one likes when writing a Genealogy program.

There is a barrier to writing an importer for a format that supports more than one data model: One must have an importer for every supported data model. If the format supports any arbitrary data model the task becomes impossible and the format is useless.

@stoicflame
Member Author

Picking up this thread where it last got left, I will not be pursuing my suggestion to support the notion of "data profiles". Even so, I want to make it clear that this is not because I share @jralls's sardonic analysis of genealogical data providers. I refuse to attribute to malice what can rightfully be attributed to toolset inadequacies or even just ignorance.

@stoicflame
Member Author

With the close of issues #242, #244, and #246 I believe the n-tier model to be supported. To illustrate, I'm going to lift @mikkelee's example at #232. Note that I'm not mentioning things like source references, analysis documents, etc. for the sake of simplicity.

  • A: Danish census 1840 mentions a man Johan Gottholph, aged 64, married to Ane Cathrine Carlsdatter, aged 61.
  • B: Danish census 1834 mentions a man Johan Gothilf, aged 59, married to an Ane Cathrine Carlsdatter, 55.
  • C: Danish census 1787 mentions a boy Johan Friderich, aged 12, son of the widow Mariane Diller, 49
  • D: Church records mention the baptism of a Johan Frideric, son of Gothilf Gierman and Mariane Winge, on March 27, 1776.

I am quite certain that Johan G. in A + B is the same person. A Person A/B is created with two EvidenceReferences: one referencing Person A and another referencing Person B.

I am also quite certain that Johan F. in C + D is the same person. A Person C/D is created with two EvidenceReferences: one referencing Person C and another referencing Person D.

I come across another (hypothetical) source that mentions person E, and I use it to conclude that A/B is the same person as C/D because he used his father's first name as a patronymic.

Two-Tier Implementations

The two-tier implementation would end up with a Person A/B/C/D/E with five EvidenceReferences referencing Persons A, B, C, D, and E. Note that presumably neither Person A/B nor Person C/D exists any longer.

N-Tier Implementations

The n-tier implementation would end up with a Person AB/CD/E with three EvidenceReferences referencing Persons A/B, C/D and E.
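
The two outcomes can be sketched in Python as follows. This is only an illustration: the `Person` class and its `evidence` list are hypothetical stand-ins for a GEDCOM X Person and its EvidenceReferences, not the actual serialization.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    """Hypothetical stand-in; `evidence` models the EvidenceReferences."""
    id: str
    evidence: List["Person"] = field(default_factory=list)


# Personas extracted from the individual sources A through E.
a, b, c, d, e = (Person(name) for name in "ABCDE")

# Intermediate conclusions from the example above.
ab = Person("A/B", [a, b])
cd = Person("C/D", [c, d])

# Two-tier: a single conclusion referencing every persona directly.
two_tier = Person("A/B/C/D/E", [a, b, c, d, e])

# N-tier: the conclusion references the intermediate conclusions plus E.
n_tier = Person("AB/CD/E", [ab, cd, e])


def evidence_ids(person: Person) -> List[str]:
    """Ids of the Persons directly referenced as evidence."""
    return [p.id for p in person.evidence]
```

In the n-tier form the decisions "A is B" and "C is D" survive as their own nodes, so either can be withdrawn later without rebuilding the rest.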

Your comments are welcome. I expect to close out this issue early next week. Thank you for all of your contributions.

@jralls
Contributor

jralls commented May 13, 2013

With the close of issues #242, #244, and #246 I believe the n-tier model to be supported.

+1

With the caveat that we may want to revisit this after publication of Robert Charles Anderson's The Elements of Genealogical Analysis. I attended his preview lecture on Saturday and immediately recognized that it's about n-tier. Not surprising, since that grew out of the Gentech model of which RCA was a principal author.

@mikkelee

I worry that data gets lost when aggregating persons from n-tier to 2-tier. Though I suppose that's up to the implementer to ensure won't happen.

+1 on N-tier, from what I can tell the current spec will knead my suits :)

@jralls
Contributor

jralls commented May 13, 2013

I worry that data gets lost when aggregating persons from n-tier to 2-tier. Though I suppose that's up to the implementer to ensure won't happen.

No worries. There aren't any implementations of either. ;-)
Seriously, though, there absolutely would be a huge impedance mismatch between an n-tier and narrative implementation. There would be an even bigger one between either and a GEDCOM-based implementation, which at the moment is pretty much everything. Yeah, the developers will claim otherwise, but all of the current applications are built on relational databases with schema that follow the GEDCOM data model with varying fidelity. That's going to be a problem for the richer programs we're hoping to stimulate with GedcomX, but if one or two gain traction it will turn into a problem for the older programs. Creative Destruction, to borrow from Tom Peters.

@stoicflame closed this May 14, 2013