Merge Record and Conclusion Models #138

Closed
joshhansen opened this issue Feb 14, 2012 · 87 comments
@joshhansen

Executive Summary

The record and conclusion models actually model the same domain but have been artificially separated. Thus, GedcomX isn't even interchangeable with itself in spite of trying to be a widely useful interchange format. For almost all classes in the record model there is a corresponding, parallel class in the conclusion model. Merging these parallel classes together into a single model will make GedcomX easier to understand, easier to implement, and more powerful as a representation of genealogical data. In the remainder of this issue, I explain what's wrong with the current situation, why it's so harmful, and how the problem can be resolved.

What's Wrong

First, some quotes for context:

"The GEDCOM X record model provides the data structures that are used for defining the content of an artifact."[link]

and

"The GEDCOM X conclusion model provides the data structures that are used for making genealogical conclusions."[link]

As it stands, the GedcomX record model does not model the "content of an artifact." A record model that actually modeled "the content of an artifact" would do things like specify its dimensions, textual transcription, identifying marks, etc. Instead, the record model represents conclusions drawn from the content of an artifact. For example, the claim "John Smith was born 1 Jan 1930", though supported by the contents of Mr. Smith's birth certificate, is a conclusion a researcher drew based on that certificate. Conclusions such as this that are made on the basis of artifact contents are just another kind of conclusion.

In its current form, GedcomX tries to model one aspect of conclusion metadata (whether or not a fact was concluded based on the contents of a document or artifact) not by allowing for this to be represented in the metadata classes themselves, but rather by duplicating the entire set of data classes and metadata classes and declaring that metadata dealing with this new set represent conclusions drawn directly from a document. As a result, the record and conclusion data models are two separate but almost exactly parallel models of the same domain. The distinction upon which this duplication is justified is essentially arbitrary, treating a special kind of conclusion as if it were so distinctive that it must be modeled as an entirely different domain.

Why It's Harmful

The model duplication that exists in the current GedcomX specification adds to user confusion ("What's the difference between a person and a persona?"), complicates the task of implementing the standard (twice as many entities to represent), and reduces the utility of data represented using GedcomX (a persona transcribed from a record is not necessarily comparable to a corresponding person in a pedigree, even if they actually represent the same individual).

Resolution

Instead of making a complete copy of the data and metadata classes, this distinction can be much more parsimoniously modeled by simply enriching the metadata model. I propose modeling the genealogy domain as a set of core entity types (person, place, event, date/time, document, etc.) and a vocabulary for making statements about such entities (e.g. person X was born in place Y), combined with a metadata vocabulary for justifying these statements, recording the reasoning behind them, and showing who exactly is making the claims (e.g. researcher A claims/asserts/believes that person X was born in place Y because of evidence found in document Z). This lends itself to a two-part model, one for making statements about the core entities (data), another for making statements about those statements (metadata).

Rather than embedding Facts within the entities they are about, a general Fact class should be created that can represent claims of fact about any entity type. For example, in Turtle syntax:

# Standard prefix declarations; ":" is this example's own namespace.
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix :     <#> .

# Subclassing rdf:Statement gives us subject, predicate, and object
# properties by which any RDF statement can be represented.
:Fact rdf:type owl:Class ;
    rdfs:subClassOf rdf:Statement .

:assertedBy rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range :Person .

# This property can point to anything -- a document, or a literal
# string with the researcher's explanation.
:supportedBy rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range rdfs:Resource .

:subFactOf rdf:type rdf:Property ;
    rdfs:domain :Fact ;
    rdfs:range :Fact .

(A similar approach is described in my RootsTech Family History Technology Workshop paper.)
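For illustration, here is how the example claim from the "What's Wrong" section ("John Smith was born 1 Jan 1930") might be stated with this vocabulary. This is only a sketch: the instance names and the :birthDate predicate are invented for the example, not proposed terms.

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Researcher A asserts, on the evidence of a birth certificate,
# that John Smith was born 1 Jan 1930. The reification properties
# (rdf:subject/predicate/object) carry the statement itself; the
# other properties say who makes the claim and on what basis.
:fact1 rdf:type :Fact ;
    rdf:subject :JohnSmith ;
    rdf:predicate :birthDate ;
    rdf:object "1930-01-01"^^xsd:date ;
    :assertedBy :ResearcherA ;
    :supportedBy :SmithBirthCertificate .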

An appropriate resolution to this issue would involve either 1) merging the record and conclusion models and perhaps refactoring the result into data and metadata models, or 2) giving a convincing argument for why the current state of affairs is necessary, including specific use cases that could not be modeled using a single merged model. The burden of evidence that would justify the duplication of a substantial subset of the GedcomX vocabulary is, in my opinion, fairly high, given the cost in user confusion, implementation difficulty, and data format utility mentioned above.

See Also

Issue #131 "A Persona IS a Person"
Conclusion Record Distinction

@jralls
Contributor

jralls commented Feb 14, 2012

+1 No, make that +10.
However, #7 rather thoroughly explains an alternate view. I think that the two can be reconciled; there's no reason that carpentermp's dual use (personas and persons, e.g.) can't be accomplished with the same data structures.

@EssyGreen

++++++1
Although I would reserve judgement on the details of the resolution, which seems (though maybe that's my misinterpretation) to be totally dependent on RDF metadata .... as I've said elsewhere, I believe that RDF/FOAF/DC are simply the data interchange formats, not the object definitions, which should remain uncluttered with this level of detail.

@stoicflame
Member

Great job, @joshhansen. Thanks for writing this up. This will provide a great point of discussion about this issue.

We need a wider audience and the alternate point of view clearly articulated before we can make a clear decision. I'll work on that.

@stoicflame
Member

Marking this issue as priority 1. We need to resolve this sooner rather than later.

@lkessler

Personally, I don't think it is bad if the Record Model and Conclusion Model are separate.

The Record Model should be made up of "just the facts" and information that repositories put together about the source material that they have. They can then use this Record Model for posting and transmitting their information.

The Conclusion Model should be where interpretation goes.

By keeping these separate and using the facts/interpretation distinction, it is easy to define which parts go where.

@ttwetmore

I have found that I generally disagree with Louis, and this case is no exception. Except for one thing: Louis is the most anti-persona person I have yet come across in discussions, and if he is now embracing the separation of the record and conclusion models, he is tacitly accepting the need for the persona concept. For this I do applaud.

It is clear that in a multi-tiered system of the type I have advocated for nearly twenty years, the lower tiers hold data from the records (the evidence) and the higher tiers hold the conclusions. Thus a single model encompasses both the record level and the conclusion level in a seamless manner. As I have explained in excruciating detail a number of times, an N-tiered model is both considerably SIMPLER and considerably MORE POWERFUL than a dual record and conclusion model. And the real fact is that 95% or more of the time the N-tier system would be used with 2 tiers, resulting in the basic record/evidence and conclusion level model anyway.

It is not too surprising to me that Family Search would begin with a black and white, record and conclusion vision for a model, because it is so much the nature of their environment. They deal in billions of records, and they deal in large constructed pedigrees. At these levels genealogical data does seem starkly divisible into two extremes.

The rest of us deal with a more complex world, in which much of our data about people comes from sources that cannot be cataloged as either low-level records or as high-level conclusions. And when we have gone far enough back, when we can no longer easily follow clear ancestral lines back in time, we must grow up and deal with a world consisting of a perplexing array of records and evidence that must be fit together, and refit, and refit again, to find the best matches between that evidence and an assumed, but never to be fully known, world of possible real human beings whose traces we perceive in the records. The N-tier model is the only way to hold our ideas, our pending decisions, our conclusions, in a way that traces our thoughts, and in a way that we can extricate ourselves from later as we either discover errors, or change our minds, or find new evidence that turns old conclusions on their ears.

@lkessler

Tom and I actually agree on most things. But our disagreement regarding Personas and an N-tiered system is well-documented on the BetterGEDCOM wiki, so there's no need to rehash it in the GEDCOM-X community.

I went to RootsTech and expounded on my ideas for source-based data entry and evidence/conclusion modeling. No program today does that properly, and there was a lot of interest in the ideas I had and in my plans to implement this in my program Behold. This will be done by taking the raw source details, which I hope can be taken from repositories that store their data in a "just-the-facts" Record Model, and giving the user tools to find related events, people, places and dates that are relevant to their research. The user will treat the source details that are found as evidence and will use them to add their assumptions/conclusions to their own research data in the Conclusion Model.

That's my thinking and why I believe it is fine to leave the two models separate, if desired. I can handle them separately or together, but I think the repositories need that factual Record Model for them to store their data.

@EssyGreen

I think what you are saying, @lkessler, is that we need to be able to distinguish the "original" sources (raw source details) from "interpretations", and I would agree with you on that ... the problem is that the "just-the-facts" records are often (if not usually) just someone else's interpretation and hence are not really "originals" at all but derivations. As these records are built on and interpreted we get layers and layers of interpretations, and as genealogists we need to peel back the layers as far as we can. If the layers don't exist then we can't.

Like you (I think), when I am downloading/accessing records from Repositories I would want the derivative which is closest to the original (e.g. I want the digital image from Ancestry, not their transcription or their interpretation into "Facts" and/or "Roles" etc) but there may be times when a derivative is useful (e.g. a translation), provided that the information about the source it was derived from is also given, and since that source may itself be a derivative it is simplest to model this as a multi-layered structure.

The important thing for me is that the provenance trail is kept intact.

From a commercial point of view I suspect that most on-line record providers will stick to a single layer where the digital image(s), transcription and fragmentation into searchable fields is presented as a single "Record" with a pre-formatted bibliographic citation as the only link to the original. Similarly, research software suppliers will use a model tailored to their own USP and will tend to do as little as possible to comply with whatever de facto standard is out there (necessarily focusing on import/export).

Where GEDCOMX can/must add value is in providing a "best practice" genealogical standard which will encourage and enable quality research. Best practice depends on reaching plausible conclusions by making and investigating hypotheses based on interpretation(s) of information from a wide range of sources. Since we can never be 100% sure of the past, any "conclusion" is in itself a source for further research. Ergo we must have an N-tier approach to support the recursion.

@ttwetmore

@lkessler:

but I think the repositories need that factual Record Model for them to store their data.

Combining the Record and Conclusion Models into an N-tiered, seamless model, provides the same lower, persona-based layer that the Record Model provides. It requires no changes to repository data. Why choose a more complex model when a simpler model with more power is available? Parsimony is the best policy.

@lkessler

Essy:

I feel that sources that are derivations of other sources should not be treated as layers. They can be much more simply handled by having a source link to its source. In GEDCOMish this would be:

0 @s1@ SOUR
1 TITL Derivation from original
1 SOUR @S2@

0 @S2@ SOUR
1 TITL Original

This can be chained as deep as necessary. This will do what you want, and do it simply.

If you want to call that an N-tier approach, then okay.

But if you are referring to the hypothesis/conclusions being N-tier, then I have a different but also simple model for that. In GEDCOMish this might be:

1 BIRT
2 DATE 5 JUL 1910
2 NOTE Birth date from 1st source believed to be true. 2nd source stated August. Believed wrong.
3 SOUR @s10@
3 SOUR @s11@
1 CHAN 22 NOV 2011 15:05:00

Let's say some new information in another source comes along. You find it supports the 2nd source and now you change your conclusion:

1 BIRT
2 DATE 5 AUG 1910
2 NOTE Birth date from first 2 sources believed to be true. 3rd is believed to be wrong.
3 SOUR @s11@
3 SOUR @s20@
3 SOUR @s10@
1 CHAN 17 FEB 2012 17:08:30

The N-tier, if you want to call it that, is simply the Change (or Undo) history of that Event. It documents the complete history of your assumption/conclusion over time, and what additional sources you added to come to the conclusion at each step.

This is what I feel needs to be implemented; it is as simple as you can get, and it handles every case.

Everything here is currently possible in GEDCOM 5.5.1 except the Source of the source.

Louis

@EssyGreen

@lkessler

sources that are derivations of other sources should not be treated as layers. They can be much more simply handled by having a source link to its source

There needs to be an indicator in the source that it is a derivation so that the user and application understand that e.g.:

0 @s1@ SOUR
1 TITL Derivation from original
1 _DERIVEDFROM @S2@

That way it is also easy to find the "master" by recursively going back through the _DERIVEDFROM pointers until there isn't one. A generic SOUR pointer gives no indication of the context - it could mean derived from, a component of, supplied by, referenced in, etc., with respect to the other source.

if you are referring to the hypothesis/conclusions being N-tier, then I have a different but also simple model for that

Your simplification is similar to what GEDCOM does now. And the same problem as above occurs in that there is no context for the source reference - is it positive or negative evidence? do all the sources refer to all the fact fields or just some of them? how does the application know what the NOTE is? Is it a conclusion/proof which references the sources? or just a working hypothesis? or just a descriptive narrative of the fact?

Without context neither the application nor the reader can be sure what was intended.

@lkessler

Essy:

I don't understand what you mean. Whether you use the SOUR tag or a _DERIVEDFROM tag makes no difference. They mean the same thing. They simply mean that the source of this source was that source. If you are going to start pigeonholing where something came from into such fine divisions as "derived from", "component of", "supplied by", "referenced in" and who knows how many more, then you're going to make the task of cataloging their materials much more onerous for researchers and repositories, mainly because it is going to be extremely difficult to define those terms clearly enough that everyone will use them in exactly the same way. You'll only introduce inconsistency and confusion. It is enough simply to have any derived piece of data point to where it was derived from, because the ability to go to that original source is what is needed.

My simplification was simply to indicate how the evidence for the hypothesis/conclusions can be referenced. The NOTE tag could be a _CNCL (conclusion) tag or a _ASMP (assumption) tag if you wish.

For simplicity and to make a point, I left out the detail that goes under a source (that yes is currently in GEDCOM as well), e.g. to reference Source Detail (specific information in a source) and to indicate positive/negative evidence plus anything else you want, which is basically the misnamed SOURCE_CITATION entity in GEDCOM e.g.:

3 SOUR @s11@
4 PAGE <WHERE_WITHIN_SOURCE>
4 EVEN <EVENT_TYPE_CITED_FROM>
5 ROLE <ROLE_IN_EVENT>
4 <<NOTE_STRUCTURE>>
4 QUAY <CERTAINTY_ASSESSMENT>

Now I'm not saying that the above has everything needed, but it is an excellent starting point.

Louis

@EssyGreen

@lkessler

Whether you use the SOUR tag or a _DERIVEDFROM tag makes no difference. They mean the same thing.

No they don't ... one (_DERIVEDFROM) has context which gives it meaning; the other (SOUR) just says the type of object which is being referenced.

you're going to make the task of cataloging their materials much more onerous for researchers and repositories

And why is that a bad thing?

it is going to be extremely difficult to define those terms clearly enough that everyone will use them in the exactly the same way

We seem to be using DC, which has already done this (although I have some reservations about its wholesale adoption)

You'll only introduce inconsistency and confusion.

Why is it more confusing to have defined a specific context/meaning than not to have defined it (and hence left the meaning as ambiguous)?

the ability to go to that original source is what is needed.

Indeed but the source will just tell me about itself. It cannot possibly know about the context in which it was referenced.

@ttwetmore

The awkwardness in @lkessler 's solution is based on his rejection of the persona concept, so he is forced to try to make a strict Conclusion Model approach (e.g., GEDCOM) seem to handle evidence, sources and conclusions in a reasonable way. Because he doesn't use personas, all the evidence (the actual facts derived from the sources) that would be in the personas must either be placed in source records or in general notes or not be in the database at all.

The idea of putting content (e.g., actual evidence) inside the source records is the only way that persons who reject the persona idea can get Record Model information into their databases.

We might need another issue here entitled "Where Do We Store Our Evidence?" I asked this question on soc.genealogy.computing last year and it generated a long and interesting thread. In the GEDCOMX model that evidence is stored primarily in persona records which then refer to source records, very proper source records that do what source records are supposed to do, refer to where the evidence can be found. There are many other ways to answer the question. @lkessler 's approach is one of those alternatives, placing the evidence inside the source records and forcing all person records to be conclusions only. Other approaches are simply to leave the evidence out of the database, that is, to only place conclusions in the database, and depend upon anyone using the database to look at the source records and go get the evidence on their own. Others suggested a "dual program" approach, using a commercial genealogy desktop system to store their conclusions, and another more general purpose database to store their evidence, then finding some way to link from their genealogical database to their evidence database.

I believe the perfect solution to the "where do we store our evidence" question is in the persona record. And I am very happy to see that the industry as a whole has also chosen that concept as the necessary core concept.

I have faith that the GEDCOMX model will avoid the folly of removing the Record Model level of data. The fact that seems obvious to nearly everyone is that the persona concept has become the lingua franca core concept of Family Search, Ancestry.com and all the other major service providers. Personas are the currency of modern genealogy. Personas are what will flow, and actually already do flow, from genealogical service providers to genealogical clients as the result of queries and searches. Modern genealogical client programs must be able to accept personas if they wish to provide their users with access to the modern service providers. @lkessler 's model requires a client program to accept persona records from a service provider, and then immediately disembowel them, artificially placing some of their information into source records that will be a nightmare to maintain, and placing other parts of the information into conclusion facts in conclusion persons, along with little notes that attempt to explain what was done and are also a nightmare to maintain.

My only desire beyond the current GEDCOMX model is for the model to unify the Record and Conclusion models into a more integrated whole that can handle N-tier structured person clusters made up of evidence personas and conclusion persons.

@EssyGreen

My preference is that we keep two models but the entities in both should inherit from common objects (i.e. a Person and Persona should derive from a Common Person and have the same properties; a Fact, whether in the Record Model or the Conclusion Model, should have the same attributes; ditto Relationships, Roles etc) so that when a researcher publishes or uploads or shares their research with someone, it can be used at the other end as a (secondary) source.
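A minimal sketch of that inheritance in the Turtle style used earlier in this issue (all names here are illustrative, not proposed vocabulary):

:CommonPerson rdf:type owl:Class .

:Person rdf:type owl:Class ;
    rdfs:subClassOf :CommonPerson .

:Persona rdf:type owl:Class ;
    rdfs:subClassOf :CommonPerson .

# A property declared once on the common class applies to both,
# so a Persona received in shared research carries exactly the
# same attributes a Person does.
:name rdf:type rdf:Property ;
    rdfs:domain :CommonPerson .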

@lkessler

Tom is fine with his opinion, but that is all it is: his opinion. I happen to have a different one: I believe the best place to store our evidence is in the source details attached to the source record. I do not want the data ripped apart into multiple derivatives attached to countless personas that we poor developers will have to attempt to reassemble and present to users in an understandable way.

So let it be said that there are two viewpoints, and please don't let Tom's bitter attacks on my way of thinking sway you to think that multiple levels of persona are the only solution.

Essy: You said:
"Indeed but the source will just tell me about itself. It cannot possibly know about the context in which it was referenced."

That is correct. The source should only tell you about itself. All the subjective information about how it was accessed, the context in which it was referenced, and how it was used as evidence to come to your assumption or conclusion should be part of the Conclusion Model, not the Record Model.

Louis

@jralls
Contributor

jralls commented Feb 18, 2012

I happen to have a different one: I believe the best place to store our evidence is in the source details attached to the source record.

That is correct. The source should only tell you about itself. All the subjective information about how it was accessed, the context in which it was referenced, and how it was used as evidence to come to your assumption or conclusion should be part of the Conclusion Model, not the Record Model.

Louis

Those assertions seem to be contradictory.

The information about how a source was accessed and the context in which it is found (along with its provenance) are important attributes of the source itself. A careful researcher will note that information along with how to find the source again (the reference information), and those notes should be kept together with the reference information as part of the source object.

I agree that the extraction of evidence from the source is conclusional and belongs in the conclusion model -- but I also think that FamilySearch has a different view: They are, after all, in the business of providing source records, and in order to index the sources so that we can find them they have to put at least some of the source's evidence into machine-readable form. None of them (that I've found yet, anyway) has explicitly said so, but I suspect that the Record model is designed for that purpose.

@lkessler

John:

How a person accesses a source for their conclusions is important to how those conclusions came to be. It has no bearing whatsoever on the source itself.

If you have a source derivative, then it should point to the source it comes from (as I gave an example of earlier in this issue), and then yes, it should along with that link to its source, give the information about how the source derivative was derived. But there should be no subjective information in it.

When I access a source, I want to know only how the source was derived - again, just the facts. I don't care about how other people accessed the source. I only care how they accessed it if I am looking at their conclusions in the Conclusion Model, so that I can evaluate whether they accessed it in a manner that let them properly get the data, and thus assess the validity of the conclusion.

So I'm saying that "how a source was accessed" should not be with the source or source details in the Record Model. It should be with the the Conclusion Model where the source is used as evidence. This is why I somewhat like the separation of the Record and Conclusion Models. It perfectly delinates the difference, being that Records are "just the facts" and the Conclusions are the conjecture and assemblage of conclusions.

I hope that FamilySearch originally separated these two models because of this idea, and so that the Record Model could be handed to repositories to standardize and make their data globally available. I could see genealogy programs using this Record Model to go forth (with an API or whatever) and access online data from repositories to download the relevant source details as evidence that will be stored locally (in the Record Model format) for inclusion into their database.

Louis

@ttwetmore

Louis, Your opinion is to not use the persona concept for record data, but to store record data in source records. If an item of evidence includes data on five persons and an event, then your source record for that evidence will have to hold the facts about the five persons and the event. Is that what you are suggesting as an alternative to the persona idea?

However, when searching for data on those persons on genealogical data servers, the data is going to be returned as persona records. That is the type of input that modern client programs are going to have to deal with. If they do not support persona-type records, client programs are going to have to immediately patch those personas into source records and conclusion persons already in the client database. Do you think that is the right thing for client programs to do? Is that what Behold is going to do? Do you think it proper for client programs to modify source records, either already in the client's database or imported along with the persona records, as the result of importing personas? Do you think that will be an easy thing to do? Will the users have to get involved?

Please try to explain your comment that things get ripped apart into multiple derivatives when using personas. That seems meaningless, and I interpret it to be an example of FUD. The data comes in the form of personas, and should stay in the form of personas. What gets ripped apart?

we poor developers will have to attempt to reassemble and present to users in an understandable way.

Would it be unfair of me to interpret this statement to mean that a big reason that you object to personas is that you think personas and the concept of "person record clusters" would be hard for a developer to deal with? If this is a concern I think it can be allayed. Think of a persona record, and a person conclusion record, and a cluster of person records (2-tier, 3-tier, n-tier) as specializations of an abstract Person class. In displaying information about these three specializations, the methods have to be different, but there is no real difficulty in the implementation. Yes the person cluster requires extra software when the user wants to view it in its dynamic research-based context, but don't you think that this is a necessary thing for software that supports the research process to do?

@EssyGreen

@lkessler

I share your concerns about how the developer will have to make sense of it for a user and I think this is a disadvantage of the N-tier approach - it's easy to model but much less easy to make something useful out of it. However, I think the benefits will outweigh the problems in the long run.

All the subjective information about how it was accessed, the context in which it was referenced, and how it was used as evidence to come to your assumption or conclusion should be part of the Conclusion Model, not the Record Model.

In a purist sense I would agree with you but if we allow the source to be broken into fragments in the Record Model then this in itself is subjective interpretation albeit within the scope of the one source. Indeed the source itself may be a secondary one anyway and/or it may reference other sources, so what is the "original" source and what is "subjective interpretation"? My requirement is that each source (however it is represented) has properties to indicate what it was derived from and what it is a component of (this latter being for fragmented interpretations which are a part of something bigger). These could be shoe-horned into a single Source (as was the case with old GEDCOM) but I believe it will be easier and simpler to allow sources within sources within sources ad infinitum. This also allows the user to structure their sources in the same way that the originals are structured in reality (e.g. a transcription of a census entry is a derivative of a real census entry which is a component of the district entries which are components of the whole year census etc).
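As a rough sketch, those two source-to-source relationships could be expressed in the Turtle style used earlier in this issue (property and instance names are illustrative only):

:derivedFrom rdf:type rdf:Property ;
    rdfs:domain :Source ;
    rdfs:range :Source .

:componentOf rdf:type rdf:Property ;
    rdfs:domain :Source ;
    rdfs:range :Source .

# e.g. a census transcription, chained back toward the original:
:transcription1 :derivedFrom :censusEntry1 .
:censusEntry1 :componentOf :districtSchedule7 .
:districtSchedule7 :componentOf :census1881 .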

An interpretation of a source by breaking it down into Facts, Persons etc (in the Record Model) is just one step further than a transcription.

The Persons and Facts that are re-constructed by a researcher are no different from these - the data model is the same but the context is different. Imagine a shop (=source) with lots of furniture (=facts, personas) .... You buy a selection of furniture from different show-rooms and put them together in your house. Does it stop being furniture just because it's now in your house instead of the shop?

@EssyGreen

@ttwetmore

If an item of evidence includes data on five persons and an event, then your source record for that evidence will have to hold the facts about the five persons and the event.

I don't have a problem with this ... I now have an extra source (CreatedBy me) which contains references to the other sources (just like a book might do) and which contains my interpretations of the relationships, facts etc between the 5 people ... the only difference is that I call my source a "Family Tree" (aka a GEDCOM file)

when searching for data on those persons on genealogical data servers, the data is going to be returned as persona records

Hmmm methinks my local archives won't be doing this in a hurry. They have their own established formats already. OK so the likes of Ancestry might jump on the band wagon ... but even if they did, I personally would throw away their "interpretation" and do my own based on the image copy they have. The level of errors would soar if I were to trust what the web-servers interpret!!

@lkessler

Tom:

Yes, the record data is in the source records. It could be transcribed information as text fields. It could be OCR'd info as a PDF document. It could be links to multimedia files. They could even index the names, places, dates and events in the source record so that the information is discoverable. But the raw data in its entirety should be as complete as possible and as close to its raw form as possible.

I envision that when searching for the data on genealogical data servers, you could do it in a Google-like fashion that searches through the text fields and finds the most relevant source data for you, or you could do it in a Steve Morse One Step Search type of fashion to go through the discoverable indexed fields from the source records, encompassing smart searches using Soundex or distance between cities, etc. What would be returned would be the source records that are most pertinent to your search - not persona records.

Client programs will never modify source records. They are the facts and can stay in their own Record Model structure. All the conclusions go in the Conclusion Model. Now what FamilySearch will want to build is the compendium of everyone's Conclusion Models. What will be discoverable from that will be what conclusions in the combined world family tree were based on a specific record from the Record Model. That will allow you to find other people who used the same record who you may be able to share information with. I think it's all a wonderful idea.

My example of where this works and where the data gets ripped apart when using personas is for something like a ship's record or census data, which I presented to you a few weeks ago on the BetterGEDCOM wiki in my Flintstone example: http://bettergedcom.wikispaces.com/message/view/Data/48419278?o=40

p.s. I have no Fear, Uncertainty, or Doubt about this.

Louis

@ttwetmore

Louis,

You have said that keeping the Record Model and the Conclusion Model separate makes sense. However, the GEDCOMX Record Model places the record data about persons into persona records. Does this mean that you would envision a substantial change in the GEDCOMX Record Model?

@ttwetmore

Louis,

Thanks for your clarifications. I have said my piece supporting the persona record as one of the key data constructs for the next generation of genealogical servers and clients. And you are making your points also about how you think record level evidence should be recorded. For now that is enough. I have no concerns over the direction that must and will be taken.

@lkessler

Tom,

Thanks for being open-minded enough to let me say my piece.

In answer to your question about the GEDCOMX Record Model, I did see the inclusion of persona and relationship entities in the model. I have no problem with that as long as all they are doing is trying to disaggregate the facts into fields to make searching (a la Steve Morse) easier. However, no assumptions or conclusions should be included in the Record Model. Reading the model details, I believe their intention is to separate out the assumptions and conclusions into the Conclusion Model. So I'm fine with this.

Louis

@ttwetmore

Louis,

A most interesting response. It implies you will add a GEDCOMX import function for Behold that accepts persona records! Will you also add a GEDCOMX export function that writes persona records? This is encouraging news since you will definitely come to appreciate the value of persona records!

@EssyGreen

I am very interested in approaches that would support N-tiers better than I have been able to imagine it. Would you mind describing these alternative N-tier approaches you envisage?

I am not pretending to envisage it better than you - just differently. I already gave an explanation of how this could be achieved in #149:

by creating new trees/files for the different possibilities which can then be used/referenced as sources if/when a conclusion is reached in the original file. Since with GEDCOMX we will now be able to handle recursive sources I don't see why we should add the complexity into the base model.

The only requirement needed for this is that Personas can be linked to Persons - which I'm assuming is coming anyway (although I agree it hasn't emerged in the code yet) or the Personas in the Record Model become somewhat useless.

@ttwetmore

Concerning the view that my N-tier approach is complex, its entire impact on a model is the addition of a single 1-to-many relationship to the person record. Could there possibly be a simpler approach? Is there a concern about this? Or does the concern relate to the user interface? What am I missing?

@ttwetmore

@EssyGreen

The only requirement needed for this is that Personas can be linked to Persons - which I'm assuming is coming anyway (although I agree it hasn't emerged in the code yet) or the Personas in the Record Model become somewhat useless.

My solution has always been to link Persons to multiple Persons (I leave Personas out as an extraneous concept). This supports 1-tier, 2-tier, 3-tier, ..., N-tier structures. We seem to be in near agreement. But now I'm wondering what in the world you thought I was proposing!

@joshhansen
Author

Well, it's both gratifying and frightening to see the issue I filed take on such a life of its own! Great discussion, interesting viewpoints, lots of enthusiasm to get things right.

I'm sure there are a million thoughts I could share, but here are the main ones:

  1. My understanding is that, as a result of this issue and subsequent discussion, @stoicflame now plans to take the record model out of GedcomX, but to continue developing it internally at FamilySearch and with FamilySearch's partners. (Correct me if that's not what you said.) I feel that this would be a serious mistake. All of the arguments I made for why the Record/Conclusion model duplication is a Bad Thing still apply, except that now half of the model is submerged from the public. That means there will be parts of the genealogical research process that the GedcomX model is unable to model effectively (document transcription). There will still be duplication of effort, the need to maintain two toolsets, the creation of data in two incompatible formats and the need to convert between them.
  2. If we really want GedcomX to be able to model genealogical research and conclusions, we must have a mechanism for modeling assertions. Though the GDM seems to be much-vilified around here, there was a reason the professional genealogists wanted to model assertions. There must be an entity representing an assertion of fact in order for the reasoning that led to a particular conclusion to be meaningfully modeled. Assertions also provide a unified mechanism for citing sources and giving attribution for any statement. Right now GedcomX lets Researcher A make an assertion and say "I'm fairly confident about this." But can Researcher B chime in and say, "Actually, I disagree with Researcher A's assertion"? Can he make an assertion and indicate that it derives from the assertion of somebody else? Unless I'm misunderstanding the current model, this sort of back-and-forth between researchers isn't possible. That's unfortunate, because I think it could unlock a social dynamic by which assertions are made, then evaluated, then revised, etc., until consensus emerges. I'm certainly not saying GedcomX needs to reproduce the Assertion stuff from the GDM. But we need some sort of assertion, and there is a general and (in my opinion) elegant way to do this if we were to commit to RDF as a data model and use "reification" to make statements about statements (see the sketch after this list). It's also vital that the evaluation of assertions be separate from the assertions themselves, so that more than one person can render judgment on the merits of a particular argument.
  3. @ttwetmore and @jralls both complain of the GDM's "over-normalization". I agree that the GDM is a nasty mess, but let's not make normalization into the villain in our data modeling approach. Normalization isn't just a way of avoiding redundancy or database update anomalies. It also facilitates extensibility. For example, if GedcomX always records names as strings then there's no possibility of richer name representations being introduced without shoehorning them into the string format and getting people to understand your new encoding. But if Name is factored out as its own class (as happily it is in GedcomX), then other types of Name can be introduced as subclasses (or as additional properties of Name) in a way that keeps the usual semantics of Name intact, but also provides additional information. If something isn't modeled as a first-class entity, it becomes much harder to make statements about it specifically.
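To make the researcher back-and-forth of point 2 concrete, here is a sketch in the Turtle style of my original post. The :disputes property and all instance names are invented for the example (:fact1 is the example fact given earlier), not existing GedcomX vocabulary:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Researcher B records a competing assertion that explicitly
# disputes Researcher A's fact, without editing or deleting it.
:fact2 rdf:type :Fact ;
    rdf:subject :JohnSmith ;
    rdf:predicate :birthDate ;
    rdf:object "1930-08-05"^^xsd:date ;
    :assertedBy :ResearcherB ;
    :disputes :fact1 .

Both assertions then survive side by side, so anyone can evaluate each argument on its merits.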

@ttwetmore

  1. I don’t believe there are two models to separate. I believe they should simply be merged. Some person records, if they hold the evidence taken from a single item in a source, play the role of persona. Other person records, if they hold conclusions made from many items of evidence, are the person records intended by the Conclusion Model. The GEDCOMX model must be able to hold personas. If removing the Record Model implies losing the persona concept, it’s a mistake.
  2. A persona-level person record contains a reference to its source. That reference need not be a pure pointer; it can have attributes for surety, attributes for location in source, etc. Therefore the triple made up of a persona, its source reference, and its source record completely specifies its assertion and its citation (sketched after this list). There is no need for a separate concept object called an assertion. We have it covered. Stuff about RDF and reification is, IMHO, overkill for a genealogical data model. If anyone thinks that the N-tier concept is too esoteric for application developers and users, try to get them to understand and use assertions about assertions.
  3. The data model, the databases used for backing stores, the external file formats used for external archives and transport, and the software objects used by running software, are the four major ways that genealogical entities show up in computer representations. None of these require normalization. I don’t believe normalization facilitates extensibility. However, an unnormalized document-based database (e.g., MongoDB) does facilitate extensibility because it is formally schema-less, though it still has all the advantages of a database with schemas. Such a database can support indexing and querying just as effectively and possibly with better performance than a classic normalized RDBMS.
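For concreteness, here is what the triple of point 2 might look like, borrowing the Turtle notation already used in this thread purely as a sketch (any notation would do; all names are illustrative):

# A person record playing the persona role, its source reference
# (an object with its own attributes, not a bare pointer), and
# the source record it points to.
:persona1 rdf:type :Person ;
    :sourceReference :ref1 .

:ref1 rdf:type :SourceReference ;
    :source :census1881 ;
    :locationInSource "folio 23, line 7" ;
    :surety "high" .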

@jralls
Contributor

jralls commented Feb 26, 2012

However, an unnormalized document-based database (e.g., MongoDB) does facilitate extensibility

+1

Genealogy is document-based. It does not lend itself to being chopped up into little pieces like accounting data.

Stuff about RDF and reification is, IMHO, overkill for a genealogical data model.

I don't know about overkill, but RDF is an implementation detail. We're still working on what to implement, and bringing in things like RDF now is just confusing.

Therefore the triple made up of a persona, its source reference, and its source record completely specifies its assertion and its citation

This I don't agree with. I think hanging all assertions/conclusions on the person object loses most of the context that drives genealogical analysis. For example, it's not terribly interesting that a person was enumerated in the census (guess what I was doing this afternoon). What's interesting is who else was enumerated in the household and who are the neighbors.

I don’t believe there are two models to separate. I believe they should simply be merged.

+10. I think that's the original proposal, eh?

@jralls
Contributor

jralls commented Feb 26, 2012

If we really want GedcomX to be able to model genealogical research and conclusions, we must have a mechanism for modeling assertions. Though the GDM seems to be much-vilified around here, there was a reason the professional genealogists wanted to model assertions. There must be an entity representing an assertion of fact in order for the reasoning that led to a particular conclusion to be meaningfully modeled. Assertions also provide a unified mechanism for citing sources and giving attribution for any statement.

+1

Right now GedcomX lets Researcher A make an assertion and say "I'm fairly confident about this." But can Researcher B chime in and say, "Actually, I disagree with Researcher A's assertion"?

That's really hard, and probably beyond the scope of GedcomX. Even if you were to use a GedcomX file here on Github as your medium (coincidentally, Dick Eastman was motivated by the recent Wired article to comment on using Github for collaborative genealogy), you will still have to deal with the edit war problem.

Github offers a solution, of course, and we're using it now -- but that won't be captured in the GedcomX file itself, it will be in the issue discussion referenced in the Git change message.

Perhaps worthy of a separate issue.

@lkessler

John said:

I think hanging all assertions/conclusions on the person object loses most of the context that drives genealogical analysis. For example, it's not terribly interesting that a person was enumerated in the census.

+1

@EssyGreen

@ttwetmore

Concerning the view that my N-tier approach is complex, its entire impact on a model is the addition of a single 1-to-many relationship to the person record. Could there possibly be a simpler approach?

Infinite recursion is easy to model but difficult to make sense of. For example, as a user/researcher I would not want my name index of people I'm researching to be a list of every name in every source document - if I want that view I can simply list all Personas. To prevent this the application needs to reduce your ALIAses down to their roots to separate the key ones from the duplicates and that is virtually impossible to do given the infinite recursion. Similarly, it makes it extremely difficult to validate since your conclusions are spread amongst a multitude of Person fragments.

As a researcher I am trying to model the real world. To that end my Persons represent real people who I am researching and trying to come to conclusions on. They do not/should not represent fragmented bits of source data. The place for that is within the interpretation of each Source.

@joshhansen

@stoicflame now plans to take the record model out of GedcomX, but to continue developing it internally at FamilySearch and with FamilySearch's partners. (Correct me if that's not what you said.) I feel that this would be a serious mistake.

That is my impression too but I can understand their need and some things we just have to accept. If the target audience for the Record Model is the web-publishers then we can't really complain but, as consumers of the model, we can work out how it will impact and can be used within research applications.

@ttwetmore

I don’t believe there are two models to separate. I believe they should simply be merged.

We have the Record Model. Signed, sealed, done thing (bar some tweaks). If an application doesn't find it useful then just treat it like any other media file that might be used as a source. (Personally I think that the Record objects are useful because they allow the researcher to interpret a source document into a "mini-tree" solely within the context of that source.)

What we should be fighting for (or not) here, is the retention of the Conclusion Model as a separate entity. Does the Record Model enable researchers to publish and/or exchange their research data? No. OK, since we can't influence the Record Model then we need a Conclusion Model (which may use/reference/include the Record Model).

@EssyGreen

Right now GedcomX lets Researcher A make an assertion and say "I'm fairly confident about this." But can Researcher B chime in and say, "Actually, I disagree with Researcher A's assertion"?

That's really hard, and probably beyond the scope of GedcomX.

Actually if you allow for derivative sources (see #136) then it can be done fairly easily but the "edit war" is a problem (see #151)

@ttwetmore

Infinite recursion is easy to model but difficult to make sense of.

A real database would be 1-tier or 2-tier 99% of the time and would be 3-tier essentially the rest of the time. Infinite recursion is nowhere near infinite, and is easy to make sense of.

For example, as a user/researcher I would not want my name index of people I'm researching to be a list of every name in every source document - if I want that view I can simply list all Personas.

An index is something you search when you need to find something; why would you not put the things you need to search for in your index? Wouldn’t you want to search for every name form a person might have been documented under so you can go immediately to the evidence with the names in that form?

To prevent this the application needs to reduce your ALIAses down to their roots to separate the key ones from the duplicates and that is virtually impossible to do given the infinite recursion.

There is no need to reduce anything to their roots, and infinite recursion has nothing to do with this. What you call virtually impossible are simple matching algorithms I’ve been writing for a decade. What you may not realize is that it is the recursion that makes this so simple to do. It makes the user interface easy, it makes the algorithms easy, it makes the conception of the model easy.

Similarly, it makes it extremely difficult to validate since your conclusions are spread amongst a multitude of Person fragments.

They are not spread out amongst a multitude of fragments. They are organized into a tight tree structure that exactly matches the decisions and conclusions you made in deciding which of your source records refer to each of your persons. Your conclusions are organized for you in the best possible manner. Writing proof statements in an N-tier system is a dream come true.

As a researcher I am trying to model the real world. To that end my Persons represent real people who I am researching and trying to come to conclusions on. They do not/should not represent fragmented bits of source data. The place for that is within the interpretation of each Source.

This is the argument that every person record in a database should represent a real individual. I’ve called this the conclusion-only argument for twenty years. This is also @lkessler’s anti-persona argument. He too wants to put the evidence in the source records. Conclusion-only desktop programs essentially stopped all advancement in genealogical software for twenty-five years. There are essentially no differences between desktop systems today. They all compete on the slickness of how you enter data, how they claim to support citations, whether or not you can add photos, whether or not you can tweet your relatives, and whether or not you can inspect data from on-line services. Whoop-tee-doo. There has been little advancement in the support of the processes of actually doing genealogy during this time. Maybe my N-tier approach is not that answer, but continuing to stick to a conclusion-only model has a decades-old history of not being the answer.

@EssyGreen

@ttwetmore

OK so if it's so easy then why, in the "20 years" you've been obsessed with your version of N-tier, haven't you yet written that killer app using existing GEDCOM - since it nigh as dammit supports exactly the type of thing you require with its ALIAs links?

@ttwetmore

@EssyGreen
Good question. I have written this application in a non-genealogical domain, where billions of person records are extracted via NLP from the world wide web, and then combined based on many properties, but primarily on their names, the companies they work for, the positions they hold at those companies, and their locations. Since there are billions of records in this application, I wrote matching algorithms that build up the N-tier structures automatically by finding efficient ways of dealing with the O(n-squared) comparison issues by using many different comparison and combination phases. As the billions of original records (personas in the genealogical terminology) are combined down to a few million business professional profiles (called conclusion persons in genealogical terminology), the N-tier structures that build up show the combination history. The properties of the final individuals, called the person's profile in this application, are automatically computed from the personas in the N-tier structure. In this application the N-tier structures can grow to over ten tiers, and celebrity persons (e.g., Bill Gates, who is mentioned throughout the web) may have over 100,000 "personas" in their final N-tier structure.

In the genealogical application one would not use algorithms to automatically combine the persona records into N-tier structures; instead, the combination algorithms would be converted to make high-likelihood suggestions of which personas match other personas or person-trees (shaky leaf algorithms), leaving it up to the user to accept the suggestions or not.

During the five years I worked on this problem, I implemented the solution three times, each time refining ideas. The first implementation was 2-tiered, written in C++, and used a highly normalized relational database. The 2-tiers lost all history of the combination, which made tuning the combination algorithms nearly impossible. The final implementation was N-tiered, written in Java, and used a document database with full text indexing. I wrote software to visualize the N-tier structures. The main purpose of the visualization was to aid me in tuning the combination phases. In a genealogical application the visualization would be used to help users manipulate their data (i.e., proceed with the genealogical research process).

To see the results of these algorithms see the website ZoomInfo and search for a few names of people you know in industry. Every profile you see is automatically generated on the spot from an N-tier structure of person records that the combination algorithms described above have built. This application is fully automatic. No human being ever creates or modifies these profiles.

I took the job at this company because I had been interested in the genealogical application of these ideas for a long time, and working for this company seemed the best way to get access to a bulk of data sufficient to truly test out the algorithmic ideas, and to experiment and refine those ideas (and get paid). I am now semi-retired and able to spend some time working on the purely genealogical applications of these ideas, which I call DeadEnds.

You can argue whether the ZoomInfo application is sufficiently similar to any problem in the genealogical domain, that even talking about it makes any sense. I see that application as analogous to the genealogical research problem. Others may see no resemblance at all. But I would like to counter some of the concerns that an N-tier approach is conceptually or practically difficult to work with. If it can be made to work effectively in a world where there are billions of records, it can certainly be made to work in applications that use orders of magnitude fewer persona records.

@lkessler

Tom,

Thank you for providing the background that forms the foundation of your thinking behind your N-tier persona-based system.

Let me say I'm very impressed, and I can see many applications for it, especially in artificial intelligence (which is another of my interests).

I can see it being used as an excellent way to get smart matches for people in large online databases, like Ancestry's "shaky leaf".

But in real life genealogy, I don't believe people want to follow chains of conclusions through persona to persona to get back to the source data. Doing that would properly document each step in a conclusion, but to understand the reasoning, every step must be followed and thought through individually.

I think instead, every conclusion needs linkage to all the source data (both supporting and contrary) that was used to come to that conclusion. This way, to interpret the conclusion, one need only do a single evaluation of all the source data it references together (i.e., the source data that is used as evidence to derive the conclusion).

Should a new item of source data come about, it could simply be added to the already linked source data and the conclusion revised if needed. If each "snapshot" of the conclusion is kept in a history file, then the history of how the current conclusion came about can be easily accessed.
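Something like the following sketch is what I have in mind (all names here are hypothetical, just to show the shape of the idea): a conclusion linked directly to all of its evidence, with each superseded version kept as a snapshot:

```java
import java.util.ArrayList;
import java.util.List;

// A conclusion that references all of its source data directly,
// keeping each superseded statement as a snapshot in its history.
class Conclusion {
    private String statement;                  // e.g. "born 1 Jan 1930"
    final List<String> supportingSources = new ArrayList<>();
    final List<String> contrarySources = new ArrayList<>();
    private final List<String> history = new ArrayList<>();

    Conclusion(String statement) { this.statement = statement; }

    // Revising keeps the old statement, so the history of how the
    // current conclusion came about stays accessible.
    void revise(String newStatement) {
        history.add(statement);
        statement = newStatement;
    }
}
```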

The other part I don't agree with in your model is your making everything N-tier at the persona level. Not all conclusions occur at the persona level. They also occur at the individual event and fact level, at the family event and fact level, and at the relationship level between parents and children, husbands and wives, and events and their witnesses. Your system would work if all we were trying to do was identify conclusion people, but genealogical research does more than that.

Thank you for telling us about ZoomInfo. It definitely shows the sort of system in which your N-tier persona-based methodology can work, and work well. Maybe FamilySearch might want to implement it for the smart matching in their New Family Tree.

But I don't think the place for it is a new GEDCOM standard.

Louis

@ttwetmore

@lkessler
Thanks for your very kind words. I can certainly understand how the algorithms developed for the automatic business application might seem to have little application to non-automatic, user-driven genealogy, and if I am wrong about all this stuff, then they don't.

Note, however, that the only support required from GEDCOMX to allow the possibility of handling these N-tier person structures is a single person->person* relationship in the person record. Small cost for future potential. As @EssyGreen pointed out, the ALIA tag of GEDCOM is sufficient for this, when used in a strict way.

And if GEDCOMX does not support the idea, it is trivial to add in an updated version, if future brains deem it worthwhile.
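For illustration only (this is legacy GEDCOM 5.5 syntax, not a GEDCOMX proposal): ALIA pointers already express such a person->person* link, with @I1@ here playing the role of the conclusion person over personas @I2@ and @I3@:

```
0 @I1@ INDI
1 NAME John /Smith/
1 ALIA @I2@
1 ALIA @I3@
0 @I2@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1930
0 @I3@ INDI
1 NAME J. /Smith/
```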

I feel honored that other people deeply concerned about genealogical data models have been willing to read my ideas and comment cogently upon them.

@lkessler

Tom,

I agree. One tag, like ALIA would handle the connections.

But all programs today assume the people being transferred are conclusion people. There would also need to be some indication that the personas are not conclusion people. Otherwise they may all be included in reports or indexes, and showing 40 people with the same name but all with slightly different information would be quite confusing.

If GEDCOMX wants to support this structure, then they'd have to make sure that programs not implementing it could still input data sets containing it, process the rest of the data their way, and then export their modified data along with the non-processed persona data so that the persona linkages are still valid.

I don't know how that can be guaranteed. What if a conclusion person is deleted? They'll lose the linkages to the 1st level personas, and those will all become top level.

And if a new person is added, they'll have no persona linkages, so the data will become incomplete.

And this is a tremendous example of the challenges GEDCOMX faces. Any developer who includes some new data structure in his program will be challenged, no matter what standard is developed, to have other programs pass their data through properly.

Louis

@EssyGreen

I think @lkessler said it all :) An impressive application but not what I personally would want to use as my genealogical research software.

the only support required from GEDCOMX to allow the possibility of handling these N-tier person structures is a single person->person* relationship in the person record. Small cost for future potential.

It might seem small but it is an unnecessary complication which will result in data loss, ambiguity and confusion. I maintain my point that the same could be done with the existing model by traversing the Person-Persona links rather than taking a short cut and omitting the (in my opinion) important Persona records.

@ttwetmore

@EssyGreen

It might seem small but it is an unnecessary complication which will result in data loss, ambiguity and confusion. I maintain my point that the same could be done with the existing model by traversing the Person-Persona links rather than taking a short cut and omitting the (in my opinion) important Persona records.

You speak of data loss, ambiguity and confusion as if you understand how the N-tier approach causes them. Since major goals of the N-tier approach are specifically to prevent data loss, and to control ambiguity and confusion, all of which occur in a conclusion-only system, we are on different wavelengths. If you could explain how you see the N-tier approach causing these shameful things I would be interested in learning it.

I don't understand your comments about traversing person-persona links, short cuts or omitting important persona records. Can you explain the shortcuts you think I am proposing, and the important persona records I am proposing to ignore? My approach is usually criticized for keeping too many persona records, not for ignoring them!

I welcome criticisms of my proposals, since I learn so much from others' ideas, but it would be helpful if I could understand the criticisms well enough to reply. These comments seem so non-germane that I can't figure out what you are trying to say.

@ttwetmore

@lkessler

But all programs today assume the people being transferred are conclusion people. There would also need to be some indication that the personas are not conclusion people. Otherwise they may all be included in reports or indexes, and showing 40 people with the same name but all with slightly different information would be quite confusing.

This is absolutely correct! And the criterion for deciding is very simple. Any person record that is pointed to by a person record higher up in an N-tier structure is not, by definition, a conclusion person. Every person record that is not pointed to by a person higher up in an N-tier structure is, by definition, a conclusion person. These are fluid definitions that change as the user fiddles with the structures.

There is an interesting implication of this. Every newly added persona record is a conclusion person, even though we hope that it will eventually get placed into a growing structure. But this gives the user interface exactly what it needs to see -- all the structure roots and all the stand-alone person records represent the current “state of your research,” the proper set of persons to be visualizing.
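As a sketch of how simple that test is (the representation is invented for illustration: each person id maps to the list of sub-person ids it points down at):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ConclusionPersons {
    // The conclusion persons are exactly the records no other record
    // points down at: structure roots plus stand-alone persons.
    static Set<String> of(Map<String, List<String>> subPersonsById) {
        Set<String> pointedTo = new HashSet<>();
        for (List<String> subs : subPersonsById.values()) {
            pointedTo.addAll(subs);
        }
        Set<String> roots = new HashSet<>(subPersonsById.keySet());
        roots.removeAll(pointedTo);
        return roots;
    }
}
```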

Note that the user interface must also give easy access to seeing the contents of the N-tier structures, since the user must be able to reckon with the information at this level.

If GEDCOMX wants to support this structure, then they'd have to make sure that programs not implementing it could still input data sets containing it, process the rest of the data their way, and then export their modified data along with the non-processed persona data so that the persona linkages are still valid.

Certainly the GEDCOMX standard will have to explain this.

I don't know how that can be guaranteed. What if a conclusion person is deleted? They'll lose the linkages to the 1st level personas, and those will all become top level.

When a conclusion person is deleted, it was a root of an N-tier structure. All the person records one level down in that tier are suddenly transformed into conclusion persons. Isn’t this precisely what it means to remove a conclusion person? It means that you have decided that your earlier decision to bring together the data “below that person” into an individual was wrong. You want those persons below you to now re-enter into the research dance once more, to be combined in other ways that better represent your corrected conclusions.
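In the same sketch representation as above, deletion is nothing more than dropping the root's entry; its sub-persons are then no longer pointed to, so by the rule above they become conclusion persons automatically:

```java
import java.util.List;
import java.util.Map;

class DeletePerson {
    // Removing a root leaves its sub-persons in place; since nothing
    // points down at them any longer, they re-enter the research pool
    // as conclusion persons.
    static void delete(Map<String, List<String>> subPersonsById, String rootId) {
        subPersonsById.remove(rootId);
    }
}
```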

And if a new person is added, they'll have no persona linkages, so the data will become incomplete.

Exactly! But they are not incomplete. They are simply stand-alone records. If they are legitimate conclusion persons they can remain in that state forever (they are simply 1-tier persons, perfectly legitimate in an N-tier system). If they are personas in the traditional sense they will sooner or later be placed into a structure under a conclusion person.

And this is a tremendous example of the challenges GEDCOMX faces. Any developer who includes some new data structure in his program will be challenged, no matter what standard is developed, to have other programs pass their data through properly.

Exactly. Change forces change. If the change is ultimately good then the pain caused by the change will be worth it. If not, not. But this is how progress progresses.

@EssyGreen

@ttwetmore

This could go on and on and on endlessly. Can we just agree to disagree?

We have both made our arguments and ultimately it will be up to Ryan to decide.

I suspect you will get your Person->Person links simply because it is similar to GEDCOM ALIAses and because it is easy to see the benefits for the social-networking aspects of genealogy.

If so, you will be able to realise your dream and actually utilise DeadEnds.

Personally, I will not be using it (either as a developer or as a genealogist).

@ttwetmore

@EssyGreen
I was hoping you would try to explain your latest criticisms since they make no sense to me, but I'm fine with just ending the discussion here.

@EssyGreen

@ttwetmore

OK, I'd hate for you to think I couldn't explain, so here goes, back into the fray:

Data loss - will occur when importing into a system which does not adhere to your specific implementation and yet needs/wants to ensure data integrity.

Ambiguity - will occur because it is not clear what the link implies. Does it mean that Person A is proven to be Person B (in which case, where is the proof/evidence, and why are they not condensed into a single Person representing the real person in the real world)? Or does it mean Person A looks like it might be Person B but needs further research (in which case, do I put the next bit of research against Person A or Person B)? Also, you argued elsewhere that the order of discovery was important, and hence Person A = Person B is not the same thing in your model as Person B = Person A, so does the link really mean "Person A (who was discovered first) is thought to be the same as Person B (who was discovered later)"? If a user then attaches the reverse link, this statement no longer makes any sense. Should the application allow this or not? (Rhetorical question - I'm just trying to explain the problem - you won't be there to give the answers when the developer has to make the decision.)

Confusion - most (if not all) genealogists think of people in their tree as representing real-world people whose lives they are trying to re-construct. Your model has no such thing as a real-world person because the things representing that person's life are fragmented. I think most users would desperately miss being able to see their Persons as whole people.

Complexity - this comes from the confusion above, since it would be the responsibility of the developer to pull your fragments together into a model which resembles the real world again. This would mean repeatedly iterating through all your Persons to try to establish which ones were the real/base ones whilst avoiding circular relationships. It's tricky but it can be done; yet then we still haven't got to the end of it, because we then need to merge, say, the Names (after all, a user would want to see that Freda Bloggs' maiden name was Smith). Again it can be done, but the application would be focusing all its energy on re-constructing. The re-construction should be the job of the researcher, not the application.

I firmly believe that genealogists are trying to re-create the real-world. So the primary objects should be modelled on the real world (ie Persons/People). Your model is a model of the interconnectivity of references to people in sources. That is not the same thing.

And this leads nicely back to the subject of this post ... I personally would use the Record Model to show representations of people as they were recorded in particular sources, and I would use the Conclusion Model to model the 'real-world' people that genealogists are trying to re-construct. The Person and Persona are the same objects (they are both representations of people) in different contexts (with different functional needs), but both are needed.

@ttwetmore

@EssyGreen
Thanks for taking the time. It is about time to let this drop, but since all your concerns are misplaced I'll make very quick responses to them:

Data loss -- criticism unfair -- it has nothing to do with the model, only its acceptance.

Ambiguity -- there is never ambiguity -- the sub-person relationship always means "believed to be the same person because of ...", where the "because of ..." is supplied by a conclusion statement or a proof statement.

Confusion -- the top-level person in a cluster always represents the conclusion person. In 99.9% of cases the data will be 1- and 2-tier, so exactly as today. The users of NFS have no trouble with 2-tier, because the UI makes it seamless.

Complexity -- unfair -- your criticism revolves around the assumption that developers are incompetent, and on some odd misconceptions: that the model requires repeated activities and reconstructions, and that circular relationships are difficult to prevent. Merging names? Never happens.

The N-tier model merges the record and conclusion models with the best features of both. I am sorry I have made that difficult to see.

@EssyGreen

@ttwetmore - I have replied in #149 since I think this thread is getting swamped with N-tier when it is actually attempting to address a completely different issue. You already have 3 threads on N-tier so let's try to keep our debate in those rather than letting it bleed so profusely elsewhere.

@stoicflame
Member

@joshhansen

My understanding is that, as a result of this issue and subsequent discussion, @stoicflame now plans to take the record model out of GedcomX, but to continue developing it internally at FamilySearch and with FamilySearch's partners. (Correct me if that's not what you said.)

Actually, the plan was to put it in a separate--but public--project where its initial scope would be limited to bulk exchange of field-based record image extraction. I don't deny that--at FamilySearch--it might become the primary means of publishing derivative source information, but we don't have the resources to promote it as a broad industry standard right now.

So we'd like to focus first on getting the "core" project right and promoting it as a standard. The goal for this "core" project is to define a model and serialization format for exchanging the components of the proof standard as specified by the genealogical research process (see #141) in a standard way.

A lot of this is based on resource constraints. We've got hard requirements to meet some specific deadlines for the sharing of this field-based record data. And we have a limited amount of resources for getting it done. Because of these limitations, we don't have as much room to accommodate a broad community influence on it. So we'd rather not pretend it's a community standard if we don't have the means to treat it as such. Unfortunate, yes, but those are the realities.

It's different for this "core" project. We're committed to seeing it through as a real community-supported, broadly-adopted standard.

@EssyGreen

@stoicflame - Many thanks for that clarification. I think that's actually great news :)

@stoicflame
Member

I'd just like to say thanks to everybody who contributed to this thread to help us understand and articulate the goals, scope, and context of the different models (conclusion, record) we were proposing.

I hope things are much more clear now:

http://familysearch.github.com/gedcomx/2012/03/23/gedcomx-identity.html

With the projects now separated, we're going to close this issue and move on to the (many) other high-priority issues.
