Merge Record and Conclusion Models #138
Comments
+1 No, make that +10. |
++++++1 |
Great job, @joshhansen. Thanks for writing this up. This will provide a great point of discussion about this issue. We need a wider audience and the alternate point of view clearly articulated before we can make a clear decision. I'll work on that. |
Marking this issue as priority 1. We need to resolve this sooner rather than later. |
Personally, I don't think it is bad if the Record Model and Conclusion Model are separate. The Record Model should be made up of "just the facts" and information that repositories put together about the source material that they have. They can then use this Record Model for posting and transmitting their information. The Conclusion Model should be where the interpretation goes. By keeping these separate, and using the facts/interpretation distinction, it is easy to define which parts go where. |
I have found that I generally disagree with Louis, and in this case there is no exception. Except for the fact that Louis is the most anti-persona person I have yet come across in discussions, and if he is now embracing the separation of the record and conclusion models he is tacitly accepting the need for the persona concept. For this I do applaud. It is clear that in a multi-tiered system of the type I have advocated for nearly twenty years, the lower tiers hold data from the records (the evidence) and the higher tiers hold the conclusions. Thus a single model encompasses both the record level and the conclusion level in a seamless manner. As I have explained in excruciating detail a number of times, an N-tiered model is both considerably SIMPLER and considerably MORE POWERFUL than a dual record and conclusion model. And the real fact is that 95% or more of the time the N-tier system would be used with 2 tiers, resulting in the basic record/evidence and conclusion level model anyway. It is not too surprising to me that Family Search would begin with a black and white, record and conclusion vision for a model, because it is so much the nature of their environment. They deal in billions of records, and they deal in large constructed pedigrees. At these levels genealogical data does seem starkly divisible into two extremes. The rest of us deal with a more complex world, in which much of our data about people come from sources that cannot be cataloged as either low level records or as high level conclusions. And when we have gone far enough back, when we can no longer easily follow clear ancestral lines back in time, we must grow up and deal with a world consisting of a perplexing array of records and evidence, that must be fit together, and refit, and refit again, to find the best matches between that evidence and an assumed, but never to be fully known world of possible real human beings whose traces we perceive in the records. The N-tier model is the only way to hold our ideas, our pending decisions, our conclusions, in a way that traces our thoughts, and in a way that we can extricate ourselves from later as we either discover errors, or change our minds, or find new evidence that turns old conclusions on their ears. |
Tom and I actually agree on most things. But our disagreement regarding Personas and an N-tiered system is well documented on the BetterGEDCOM wiki, so there's no need to rehash that again in the GEDCOM-X community. I went to RootsTech and expounded on my ideas for source-based data entry and evidence/conclusion modeling. No program today does that properly, and there was a lot of interest in the ideas I had and in my plans to implement this in my program Behold. This will be done by taking the raw source details, which will hopefully be able to be taken from repositories that store their data in a "just-the-facts" Record Model, and giving the user tools to find related events, people, places and dates that are relevant to their research. The user will use these source details as evidence, adding their assumptions/conclusions to their own research data in the Conclusion Model. That's my thinking and why I believe it is fine to leave the two models separate, if desired. I can handle them separately or together, but I think the repositories need that factual Record Model for them to store their data. |
I think what you are saying @lkessler is that we need to be able to distinguish the "original" sources (raw source details) from "interpretations" and I would agree with you on that ... the problem is that the "just-the-facts" records are often (if not usually) just someone else's interpretation and hence are not really "originals" at all but derivations. As these records are built on and interpreted we get layers and layers of interpretations and as genealogists we need to peel back the layers as far as we can. If the layers don't exist then we can't. Like you (I think), when I am downloading/accessing records from Repositories I would want the derivative which is closest to the original (e.g. I want the digital image from Ancestry not their transcription or their interpretation into "Facts" and/or "Roles" etc) but there may be times when a derivative is useful (e.g. a translation) providing that the information about the source it was derived from is also provided and since that source may itself be a derivative it is simplest to model this as a multi-layered structure. The important thing for me is that the provenance trail is kept intact. From a commercial point of view I suspect that most on-line record providers will stick to a single layer where the digital image(s), transcription and fragmentation into searchable fields is presented as a single "Record" with a pre-formatted Bibliographic citation as the only link to the original. Similarly, research software suppliers will use a model tailored to their own USP and will tend to do as little as possible to comply with whatever de facto standard is out there (necessarily focusing on import/export). Where GEDCOMX can/must add value is in providing a "best practice" genealogical standard which will encourage and enable quality research. Best practice depends on reaching plausible conclusions by making and investigating hypotheses based on interpretation(s) of information from a wide range of sources. Since we can never be 100% sure of the past, any "conclusion" is in itself a source for further research. Ergo we must have an N-tier approach to support the recursion. |
but I think the repositories need that factual Record Model for them to store their data. Combining the Record and Conclusion Models into an N-tiered, seamless model provides the same lower, persona-based layer that the Record Model provides. It requires no changes to repository data. Why choose a more complex model when a simpler model with more power is available? Parsimony is the best policy. |
Essy: I feel that sources that are derivations of other sources should not be treated as layers. They can be much more simply handled by having a source link to its source. In GEDCOMish this would be: 0 @s1@ SOUR 0 @S2@ SOUR This can be chained as deep as necessary. This will do what you want, and do it simply. If you want to call that an N-tier approach, then okay. But if you are referring to the hypothesis/conclusions being N-tier, then I have a different but also simple model for that. In GEDCOMish this might be: 1 BIRT Let's say some new information in another source comes along. You find it supports the 2nd source and now you change your conclusion: 1 BIRT The N-tier, if you want to call it that, is simply the Change (or Undo) history of that Event. It documents the complete history of your assumption/conclusion over time, and what additional sources you added to come to the conclusion at each step. This is what I feel needs to be implemented, is as simple as you can get it, and handles every case. Everything here is currently possible in GEDCOM 5.5.1 except the Source of the source. Louis |
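Louis's GEDCOMish fragments above lost their sub-lines in transit. A minimal sketch of what he appears to be describing follows; the identifiers, titles and dates are illustrative assumptions rather than part of his proposal, and the SOUR pointer inside a SOUR record is the non-standard "source of the source" he mentions:

```
0 @S1@ SOUR
1 TITL Indexed transcription of a baptism register entry
1 SOUR @S2@
0 @S2@ SOUR
1 TITL Parish baptism register (the source this source was derived from)
0 @S3@ SOUR
1 TITL Civil birth registration index entry
0 @I1@ INDI
1 BIRT
2 DATE ABT 1871
2 SOUR @S1@
2 SOUR @S3@
2 NOTE Conclusion revised after source @S3@ was found; the earlier
3 CONT version of this event is retained as its change (undo) history.
```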
There needs to be an indicator in the source that it is a derivation so that the user and application understand that e.g.: 0 @s1@ SOUR That way it is also easy to find the "master" by recursively going back through the _DERIVEDFROM pointers until there isn't one. A generic pointer of SOUR gives no indication of the context - it could mean it is derived from, a component of, supplied by, referenced in etc etc the other source
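A sketch of the shape EssyGreen seems to have in mind; the _DERIVEDFROM tag is her own non-standard suggestion and the titles are made up for illustration:

```
0 @S1@ SOUR
1 TITL Transcription of an 1881 census household entry
1 _DERIVEDFROM @S2@
0 @S2@ SOUR
1 TITL Digital image of the 1881 census page
```

An application can then walk the _DERIVEDFROM pointers upward until none remains to reach the "master" source.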
Your simplification is similar to what GEDCOM does now. And the same problem as above occurs in that there is no context for the source reference - is it positive or negative evidence? do all the sources refer to all the fact fields or just some of them? how does the application know what the NOTE is? Is it a conclusion/proof which references the sources? or just a working hypothesis? or just a descriptive narrative of the fact? Without context neither the application nor the reader can be sure what was intended. |
Essy: I don't understand what you mean. Whether you use the SOUR tag or a _DERIVEDFROM tag makes no difference. They mean the same thing. They simply mean that the source of this source was that source. If you are going to start pigeonholing where something came from into such fine divisions as "derived from", "component of", "supplied by", "referenced in" and who knows how many more, then you're going to make the task researchers and repositories face in cataloging their materials much more onerous, mainly because it is going to be extremely difficult to define those terms clearly enough that everyone will use them in exactly the same way. You'll only introduce inconsistency and confusion. It is enough simply to have any derived piece of data point to where it was derived from, because the ability to go to that original source is what is needed. My simplification was simply to indicate how the evidence for the hypothesis/conclusions can be referenced. The NOTE tag could be a _CNCL (conclusion) tag or an _ASMP (assumption) tag if you wish. For simplicity and to make a point, I left out the detail that goes under a source (that yes is currently in GEDCOM as well), e.g. to reference Source Detail (specific information in a source) and to indicate positive/negative evidence plus anything else you want, which is basically the misnamed SOURCE_CITATION entity in GEDCOM e.g.: 3 SOUR @s11@ Now I'm not saying that the above has everything needed, but it is an excellent starting point. Louis |
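A hedged expansion of Louis's "3 SOUR @s11@" fragment, using the standard GEDCOM 5.5.1 citation sub-structure he alludes to; the specific values, and the use of QUAY and NOTE to mark evidence quality and negative evidence, are illustrative assumptions:

```
3 SOUR @s11@
4 PAGE Line 14, entry for "Jno. Smith"
4 QUAY 2
4 NOTE Negative evidence: the entry gives no birthplace,
5 CONT contrary to the family account.
```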
No they don't ... one (_DERIVEDFROM) has context which gives it meaning; the other (SOUR) just says the type of object which is being referenced.
And why is that a bad thing?
We seem to be using DC which has already done this (although I have some reservations about its wholesale adoption)
Why is it more confusing to have defined a specific context/meaning than not to have defined it (and hence left the meaning as ambiguous)?
Indeed but the source will just tell me about itself. It cannot possibly know about the context in which it was referenced. |
The awkwardness in @lkessler's solution is based on his rejection of the persona concept, so he is forced to try to make a strict Conclusion Model approach (e.g., GEDCOM) seem to handle evidence, sources and conclusions in a reasonable way. Because he doesn't use personas, all the evidence (the actual facts derived from the sources) that would be in the personas must either be placed in source records or in general notes or not be in the database at all. The idea of putting content (e.g., actual evidence) inside the source records is the only way that persons who reject the persona idea can get Record Model information into their databases. We might need another issue here entitled "Where Do We Store Our Evidence?" I asked this question on soc.genealogy.computing last year and it generated a long and interesting thread. In the GEDCOMX model that evidence is stored primarily in persona records which then refer to source records, very proper source records that do what source records are supposed to do, refer to where the evidence can be found. There are many other ways to answer the question. @lkessler's approach is one of those alternatives, placing the evidence inside the source records and forcing all person records to be conclusions only. Other approaches are simply to leave the evidence out of the database, that is, to only place conclusions in the database, and depend upon anyone using the database to look at the source records and go get the evidence on their own. Others suggested a "dual program" approach, using a commercial genealogy desktop system to store their conclusions, and another more general purpose database to store their evidence, then finding some way to link from their genealogical database to their evidence database. I believe the perfect solution to the "where do we store our evidence" question is in the persona record. And I am very happy to see that the industry as a whole has also chosen that concept as the necessary core concept. I have faith that the GEDCOMX model will avoid the folly of removing the Record Model level of data. The fact that seems obvious to nearly everyone is that the persona concept has become the lingua franca core concept of Family Search, Ancestry.com and all the other major service providers. Personas are the currency of modern genealogy. Personas are what will flow and actually already do flow from genealogical service providers to genealogical clients as the result of queries and searches. Modern genealogical client programs must be able to accept personas if they wish to provide their users with access to the modern service providers. @lkessler's model requires a client program to accept persona records from a service provider, and then immediately disembowel them, artificially placing some of their information into source records that will be a nightmare to maintain, and placing other parts of the information into conclusion facts in conclusion persons, along with little notes that attempt to explain what was done and are also a nightmare to maintain. My only desire beyond the current GEDCOMX model is for the model to unify the Record and Conclusion models into a more integrated whole that can handle N-tier structured person clusters made up of evidence personas and conclusion persons. |
My preference is that we keep two models but the entities in both should inherit from common objects (i.e. a Person and Persona should derive from a Common Person and have the same properties; a Fact whether in the Record Model or the Conclusion Model should have the same attributes; ditto Relationships, Roles etc) so that when a researcher publishes or uploads or shares with someone their research, it can be used the other end as a (secondary) source. |
Tom is fine with his opinion, but that is all it is, his opinion. I happen to have a different one: I believe that the best place to store our evidence is in the source details attached to the source record. I do not want to have the data ripped apart into multiple derivatives to attach them to countless personas that we poor developers will have to attempt to reassemble and present to users in an understandable way. So let it be said that there are two viewpoints, and please don't let Tom's bitter attacks on my way of thinking sway you to think that multiple levels of persona are the only solution. Essy: You said: That is correct. The source should only tell you about itself. All the subjective information about how it was accessed, the context in which it was referenced, and how it was used as evidence to come to your assumption or conclusion should be part of the Conclusion Model, not the Record Model. Louis |
Louis Those assertions seem to be contradictory. The information about how a source was accessed and the context in which it is found (along with its provenance) are important attributes of the source itself. A careful researcher will note that information along with how to find the source again (the reference information), and those notes should be kept together with the reference information as part of the source object. I agree that the extraction of evidence from the source is conclusional and belongs in the conclusion model -- but I also think that FamilySearch has a different view: They are, after all, in the business of providing source records, and in order to index the sources so that we can find them they have to put at least some of the source's evidence into machine-readable form. None of them (that I've found yet, anyway) has explicitly said so, but I suspect that the Record model is designed for that purpose. |
John: How a person accesses a source for his/her conclusions is important in how their conclusions came to be. It has no bearing whatsoever on the source itself. If you have a source derivative, then it should point to the source it comes from (as I gave an example of earlier in this issue), and then yes, it should, along with that link to its source, give the information about how the source derivative was derived. But there should be no subjective information in it. When I access a source, I want to know only how the source was derived - again, just the facts. I don't care about how other people accessed the source. I only care how they accessed it if I am looking at their conclusions in the Conclusion Model, so that I can evaluate whether they accessed it in a manner that allowed them to properly get the data, and thus assess the validity of the conclusion. So I'm saying that "how a source was accessed" should not be with the source or source details in the Record Model. It should be with the Conclusion Model where the source is used as evidence. This is why I somewhat like the separation of the Record and Conclusion Models. It perfectly delineates the difference, being that Records are "just the facts" and the Conclusions are the conjecture and assemblage of conclusions. I hope that FamilySearch originally separated these two models because of this idea, and so that the Record Model could be handed to repositories to standardize and make their data globally available. I could see genealogy programs using this Record Model to go forth (with an API or whatever) and access online data from repositories to download the relevant source details as evidence that will be stored locally (in the Record Model format) for inclusion into their database. Louis |
Louis, Your opinion is to not use the persona concept for record data, but to store record data in source records. If an item of evidence includes data on five persons and an event, then your source record for that evidence will have to hold the facts about the five persons and the event. Is that what you are suggesting as an alternative to the persona idea? However, when searching for data on those persons on genealogical data servers, the data is going to be returned as persona records. That is the type of input that modern client programs are going to have to deal with. If the client programs do not support persona type records, client programs are going to have to immediately patch those personas into source records and conclusion persons already in the client database. Do you think that is the right thing for client programs to do? Is that what Behold is going to do? Do you think it proper for client programs to modify source records, either already in the client's database, or imported along with the persona records, as the result of importing personas? Do you think that will be an easy thing to do? Will the users have to get involved? Please try to explain your comment that things get ripped apart into multiple derivatives when using personas. That seems meaningless, and I interpret it to be an example of FUD. The data comes in the form of personas, and should stay in the form of personas. What gets ripped apart? we poor developers will have to attempt to reassemble and present to users in an understandable way. Would it be unfair of me to interpret this statement to mean that a big reason that you object to personas is that you think personas and the concept of "person record clusters" would be hard for a developer to deal with? If this is a concern I think it can be allayed. Think of a persona record, and a person conclusion record, and a cluster of person records (2-tier, 3-tier, n-tier) as specializations of an abstract Person class. In displaying information about these three specializations, the methods have to be different, but there is no real difficulty in the implementation. Yes the person cluster requires extra software when the user wants to view it in its dynamic research-based context, but don't you think that this is a necessary thing for software that supports the research process to do? |
I share your concerns about how the developer will have to make sense of it for a user and I think this is a disadvantage of the N-tier approach - it's easy to model but much less easy to make something useful out of it. However, I think the benefits will outweigh the problems in the long run.
In a purist sense I would agree with you but if we allow the source to be broken into fragments in the Record Model then this in itself is subjective interpretation albeit within the scope of the one source. Indeed the source itself may be a secondary one anyway and/or it may reference other sources, so what is the "original" source and what is "subjective interpretation"? My requirement is that each source (however it is represented) has properties to indicate what it was derived from and what it is a component of (this latter being for fragmented interpretations which are a part of something bigger). These could be shoe-horned into a single Source (as was the case with old GEDCOM) but I believe it will be easier and simpler to allow sources within sources within sources ad infinitum. This also allows the user to structure their sources in the same way that the originals are structured in reality (e.g. a transcription of a census entry is a derivative of a real census entry which is a component of the district entries which are components of the whole year census etc). An interpretation of a source by breaking it down into Facts, Persons etc (in the Record Model) is just one step further than a transcription. The Persons and Facts that are re-constructed by a researcher are no different from these - the data model is the same but the context is different. Imagine a shop (=source) with lots of furniture (=facts, personas) .... You buy a selection of furniture from different show-rooms and put them together in your house. Does it stop being furniture just because it's now in your house instead of the shop? |
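A sketch of the layering EssyGreen describes, following her census example; the _COMPONENTOF and _DERIVEDFROM tags are hypothetical, non-standard markers used only to make the structure visible:

```
0 @S1@ SOUR
1 TITL 1881 Census of England and Wales
0 @S2@ SOUR
1 TITL 1881 Census, one registration district
1 _COMPONENTOF @S1@
0 @S3@ SOUR
1 TITL Transcription of a single household entry in that district
1 _DERIVEDFROM @S2@
```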
I don't have a problem with this ... I now have an extra source (CreatedBy me) which contains references to the other sources (just like a book might do) and which contains my interpretations of the relationships, facts etc between the 5 people ... the only difference is that I call my source a "Family Tree" (aka a GEDCOM file)
Hmmm methinks my local archives won't be doing this in a hurry. They have their own established formats already. OK so the likes of Ancestry might jump on the band wagon ... but even if they did, I personally would throw away their "interpretation" and do my own based on the image copy they have. The level of errors would soar if I were to trust what the web-servers interpret!! |
Tom: Yes, the record data is in the source records. It could be transcribed information as text fields. It could be OCR'd info as a PDF document. It could be links to multimedia files. They could even index the names, places, dates and events in the source record so that the information is discoverable. But the raw data in its entirety should be as complete as possible and as close to its raw form as possible. I envision that when searching for the data on genealogical data servers, you could do it in a Google-like fashion - that can search through the text fields and find the most relevant source data for you, or you could do it in a Steve Morse One Step Search type of fashion to go through the discoverable indexed fields from the source records encompassing smart searches using Soundex or distance between cities, etc. What would be returned would be the source records that are most pertinent to your search - not persona records. Client programs will never modify source records. They are the facts and can stay in their own Record Model structure. All the conclusions go in the Conclusion Model. Now what FamilySearch will want to build is the compendium of everyone's Conclusion Models. What will be discoverable from that will be what conclusions in the combined world family tree were based on a specific record from the Record Model. That will allow you to find other people who used the same record who you may be able to share information with. I think it's all a wonderful idea. My example of where this works and where the data gets ripped apart when using personas is for something like a ship's record or census data, which I presented to you a few weeks ago on the BetterGEDCOM wiki in my Flintstone example: http://bettergedcom.wikispaces.com/message/view/Data/48419278?o=40 p.s. I have no Fear, Uncertainty, or Doubt about this. Louis |
Louis, You have said that keeping the Record Model and the Conclusion Model separate makes sense. However, the GEDCOMX Record Model places the record data about persons into persona records. Does this mean that you would envision a substantial change in the GEDCOMX Record Model? |
Louis, Thanks for your clarifications. I have said my piece supporting the persona record as one of the key data constructs for the next generation of genealogical servers and clients. And you are making your points also about how you think record level evidence should be recorded. For now that is enough. I have no concerns over the direction that must and will be taken. |
Tom, Thanks for being open-minded enough to let me say my piece. In answer to your question about the GEDCOMX Record Model, I did see the inclusion of persona and relationship entities in the model. I have no problem with that as long as all they are doing is trying to disaggregate the facts into fields to make searching (a la Steve Morse) easier. However, no assumptions or conclusions should be included in the Record Model. Reading the model details, I believe their intention is to separate out the assumptions and conclusions into the Conclusion model. So I'm fine with this. Louis |
Louis, A most interesting response. It implies you will add a GEDCOMX import function for Behold that accepts persona records! Will you also add a GEDCOMX export function that writes persona records? This is encouraging news since you will definitely come to appreciate the value of persona records! |
I am not pretending to envisage it better than you - just differently. I already gave an explanation of how this could be achieved in #149:
The only requirement needed for this is that Personas can be linked to Persons - which I'm assuming is coming anyway (although I agree it hasn't emerged in the code yet) or the Personas in the Record Model become somewhat useless. |
Concerning the view that my N-tier approach is complex, its entire impact on a model is the addition of a single 1-to-many relationship to the person record. Could there possibly be a simpler approach? Is there a concern about this? Or does the concern relate to the user interface? What am I missing? |
The only requirement needed for this is that Personas can be linked to Persons - which I'm assuming is coming anyway (although I agree it hasn't emerged in the code yet) or the Personas in the Record Model become somewhat useless. My solution has always been to link Persons to multiple Persons (I leave Personas out as an extraneous concept). This supports 1-tier, 2-tier, 3-tier, ..., N-tier structures. We seem to be in near agreement. But now I'm wondering what in the world you thought I was proposing! |
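As a sketch of the single person-to-person link Tom describes, re-using the existing GEDCOM ALIA pointer in a strictly downward sense; the individuals, and the convention that only the un-pointed-to record is the conclusion person, are illustrative assumptions consistent with his later comments:

```
0 @I1@ INDI
1 NAME John /Smith/
1 NOTE Conclusion person: the root of the cluster, pointed to by nothing above it.
1 ALIA @I2@
1 ALIA @I3@
0 @I2@ INDI
1 NAME Jno. /Smith/
1 NOTE Persona extracted from an 1881 census entry.
0 @I3@ INDI
1 NAME John /Smyth/
1 NOTE Persona extracted from a baptism register entry.
```

A 3-tier structure would simply let @I2@ or @I3@ carry ALIA pointers of its own.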
Well, it's both gratifying and frightening to see the issue I filed take on such a life of its own! Great discussion, interesting viewpoints, lots of enthusiasm to get things right. I'm sure there are a million thoughts I could share, but here are the main ones:
+1 Genealogy is document-based. It does not lend itself to being chopped up into little pieces like accounting data.
I don't know about overkill, but RDF is an implementation detail. We're still working on what to implement, and bringing in things like RDF now is just confusing.
This I don't agree with. I think hanging all assertions/conclusions on the person object loses most of the context that drives genealogical analysis. For example, it's not terribly interesting that a person was enumerated in the census (guess what I was doing this afternoon). What's interesting is who else was enumerated in the household and who are the neighbors.
+10. I think that's the original proposal, eh? |
+1
That's really hard, and probably beyond the scope of GedcomX. Even if you were to use a GedcomX file here on Github as your medium (coincidentally, Dick Eastman was motivated by the recent Wired article to comment on using Github for collaborative genealogy ) you will still have to deal with the edit war problem. Github offers a solution, of course, and we're using it now -- but that won't be captured in the GedcomX file itself, it will be in the issue discussion referenced in the Git change message. Perhaps worthy of a separate issue. |
John said:
+1 |
Infinite recursion is easy to model but difficult to make sense of. For example, as a user/researcher I would not want my name index of people I'm researching to be a list of every name in every source document - if I want that view I can simply list all Personas. To prevent this the application needs to reduce your ALIAses down to their roots to separate the key ones from the duplicates and that is virtually impossible to do given the infinite recursion. Similarly, it makes it extremely difficult to validate since your conclusions are spread amongst a multitude of Person fragments. As a researcher I am trying to model the real world. To that end my Persons represent real people who I am researching and trying to come to conclusions on. They do not/should not represent fragmented bits of source data. The place for that is within the interpretation of each Source.
That is my impression too but I can understand their need and some things we just have to accept. If the target audience for the Record Model is the web-publishers then we can't really complain but, as consumers of the model, we can work out how it will impact and can be used within research applications.
We have the Record Model. Signed, sealed, done thing (bar some tweaks). If an application doesn't find it useful then just treat it like any other media file that might be used as a source. (Personally I think that the Record objects are useful because they allow the researcher to interpret a source document into a "mini-tree" solely within the context of that source.) What we should be fighting for (or not) here, is the retention of the Conclusion Model as a separate entity. Does the Record Model enable researchers to publish and/or exchange their research data? No. OK, since we can't influence the Record Model then we need a Conclusion Model (which may use/reference/include the Record Model). |
Actually if you allow for derivative sources (see #136) then it can be done fairly easily but the "edit war" is a problem (see #151) |
Infinite recursion is easy to model but difficult to make sense of. A real database would be 1-tier and 2-tier 99% of the time and would be 3-tier essentially the rest of the time. Infinite recursion is nowhere near infinite, and is easy to make sense of. For example, as a user/researcher I would not want my name index of people I'm researching to be a list of every name in every source document - if I want that view I can simply list all Personas. An index is something you search when you need to find something; why would you not put the things you need to search for in your index? Wouldn’t you want to search for every name form a person might have been documented under so you can go immediately to the evidence with the names in that form? To prevent this the application needs to reduce your ALIAses down to their roots to separate the key ones from the duplicates and that is virtually impossible to do given the infinite recursion. There is no need to reduce anything to its roots, and infinite recursion has nothing to do with this. What you call virtually impossible are simple matching algorithms I’ve been writing for a decade. What you may not realize is that it is the recursion that makes this so simple to do. It makes the user interface easy, it makes the algorithms easy, it makes the conception of the model easy. Similarly, it makes it extremely difficult to validate since your conclusions are spread amongst a multitude of Person fragments. They are not spread out amongst a multitude of fragments. They are organized into a tight tree structure that exactly matches the decisions and conclusions you made in deciding which of your source records refer to each of your persons. Your conclusions are organized for you in the best possible manner. Writing proof statements in an N-tier system is a dream come true. As a researcher I am trying to model the real world. To that end my Persons represent real people who I am researching and trying to come to conclusions on. They do not/should not represent fragmented bits of source data. The place for that is within the interpretation of each Source. This is the argument that every person record in a database should represent a real individual. I’ve called this the conclusion-only argument for twenty years. This is also @lkessler’s anti-persona argument. He too wants to put the evidence in the source records. Conclusion-only desktop programs essentially stopped all advancement in genealogical software for twenty-five years. There are essentially no differences between desktop systems today. They all compete by the slickness of how you enter data, how they claim to support citations, whether or not you can add photos, whether or not you can tweet your relatives and whether or not you can inspect data from on-line services. Whoop-tee-doo. There has been little advancement in the support of the processes of actually doing genealogy during this time. Maybe my N-tier approach is not the answer, but continuing to stick to a conclusion-only model has a decades-old history of not being the answer. |
OK so if it's so easy then why, in the "20 years" you've been obsessed with your version of N-tier, haven't you yet written that killer app using existing GEDCOM - since it nigh as dammit supports exactly the type of thing you require with its ALIAs links? |
@EssyGreen In the genealogical application one would not use algorithms to automatically combine the persona records into N-tier structures, but the combination algorithms would be converted to make high-likelihood suggestions of which personas match other personas or person-trees (shaking leaf algorithms), leaving it up to the user to accept the suggestions or not. During the five years I worked on this problem, I implemented the solution three times, each time refining ideas. The first implementation was 2-tiered, written in C++, and used a highly normalized relational database. The 2-tier structure lost all history of the combination, which made tuning the combination algorithms nearly impossible. The final implementation was N-tiered, written in Java, and used a document database with full text indexing. I wrote software to visualize the N-tier structures. The main purpose of the visualization was to aid me in tuning the combination phases. In a genealogical application the visualization would be used to help users manipulate their data (i.e., proceed with the genealogical research process). To see the results of these algorithms see the website ZoomInfo and search for a few names of people you know in industry. Every profile you see is automatically generated on the spot from an N-tier structure of person records that the combination algorithms described above have built. This application is fully automatic. No human being ever creates or modifies these profiles. I took the job at this company because I had been interested in the genealogical application of these ideas for a long time, and working for this company seemed the best way to get access to a bulk of data sufficient to truly test out the algorithmic ideas, and to experiment and refine those ideas (and get paid). I am now semi-retired and able to spend some time working on the purely genealogical applications of these ideas, which I call DeadEnds. You can argue whether the ZoomInfo application is similar enough to any problem in the genealogical domain that even talking about it makes sense. I see that application as analogous to the genealogical research problem. Others may see no resemblance at all. But I would like to counter some of the concerns that an N-tier approach is conceptually or practically difficult to work with. If it can be made to work effectively in a world where there are billions of records, it can certainly be made to work in applications that use orders of magnitude fewer persona records. |
Tom, Thank you for providing the background that underlies your thinking behind your N-tier persona-based system. Let me say I'm very impressed, and I can see many applications for it, especially in artificial intelligence (which is another of my interests). I can see it being used as an excellent way to get smart matches for people in large online databases, like Ancestry's "shaky leaf". But in real life genealogy, I don't believe people want to follow chains of conclusions through persona to persona to get back to the source data. Doing that would properly document each step in a conclusion, but to understand the reasoning, every step must be followed and thought through individually. I think instead, every conclusion needs linkage to all the source data (both supporting and contrary) that is used to come to that conclusion. This way, to interpret the conclusion, one need only do one evaluation of all the source data together that it references (i.e. the source data that is used as evidence to derive the conclusion). Should a new item of source data come about, it could be simply added to the already linked source data and the conclusion can be revised if needed. If each "snapshot" of the conclusion is kept in a history file, then the history of how the current conclusion came about can be easily accessed. The other part I don't agree with in your model is your making everything N-tier at the persona level. All conclusions don't occur at the persona level. They also occur at the individual event and fact level, at the family event and fact level, at the relationship level between parents and children and husbands and wives and events and their witnesses. Your system would work if all we were trying to do was to identify conclusion people, but genealogical research does more than that. Thank you for telling us about Zoominfo. It definitely shows the sort of system in which your N-tier persona-based methodology can work, and work well. FamilySearch might want to implement it for their smart matching for their New Family Tree. But I don't think the place for it is a new GEDCOM standard. Louis |
@lkessler Note, however, that the only support required from GEDCOMX to allow the possibility of handling these N-tier person structures is a single person->person* relationship in the person record. Small cost for future potential. As @EssyGreen pointed out, the ALIA tag of GEDCOM, is sufficient, when used in a strict way, for this. And if GEDCOMX does not support the idea, it is trivial to add in an update version, if future brains deem it worthwhile. I feel honored that other people deeply concerned about genealogical data models have been willing to read my ideas and comment cogently upon them. |
Tom, I agree. One tag, like ALIA, would handle the connections. But all programs today assume the people being transferred are conclusion people. There would also need to be some indication that the personas are not conclusion people. Otherwise they may all be included in reports or indexes, and showing 40 people with the same name but all with slightly different information would be quite confusing. If GEDCOMX wants to support this structure, then they'd have to make sure that programs not implementing it could still input data sets containing it, process the rest of the data their way, and then export their modified data along with the non-processed persona data so that the persona linkages are still valid. I don't know how that can be guaranteed. What if a conclusion person is deleted? They'll lose the linkages to the 1st level personas, and those will all become top level. And if a new person is added, they'll have no persona linkages, so the data will become incomplete. And this is a tremendous example of the challenges GEDCOMX has. Any developer who includes some new data structure in his program will be challenged, no matter what standard is developed, to have other programs pass their data through properly. Louis |
I think @lkessler said it all :) An impressive application but not what I personally would want to use as my genealogical research software.
It might seem small but it is an unnecessary complication which will result in data loss, ambiguity and confusion. I maintain my point that the same could be done with the existing model by traversing the Person-Persona links rather than taking a short cut and omitting the (in my opinion) important Persona records. |
It might seem small but it is an unnecessary complication which will result in data loss, ambiguity and confusion. I maintain my point that the same could be done with the existing model by traversing the Person-Persona links rather than taking a short cut and omitting the (in my opinion) important Persona records. You speak of data loss, ambiguity and confusion as if you understand how the N-tier approach causes them. Since major goals of the N-tier approach are specifically to prevent data loss, and to control ambiguity and confusion, all of which occur in a conclusion-only system, we are on different wavelengths. If you could explain how you see the N-tier approach causing these shameful things I would be interested in learning it. I don't understand your comments about traversing person-persona links, short cuts or omitting important persona records. Can you explain the shortcuts you think I am proposing, and the important persona records I am proposing to ignore? My approach is usually criticized for keeping too many persona records, not for ignoring them! I welcome criticisms of my proposals, since I learn so much from others' ideas, but it would be helpful if I could understand the criticisms well enough to reply. These comments seem so non-germane that I can't figure out what you are trying to say. |
But all programs today assume the people being transferred are conclusion people. There would also need to be some indication that the persona are not conclusion people. Otherwise they may all be included in reports or indexes, and showing 40 people with the same name but all with slightly different information would be quite confusing. This is absolutely correct! And the criteria to decide is very simple. Any person record that is pointed to by a person record higher up in an N-tier structure is not, by definition, a conclusion person. Every person record that is not pointed to by a person higher up in an N-tier structure is, by definition, a conclusion person. These are fluid definitions that change as the user fiddles with the structures. There is an interesting implication of this. Every newly added persona record is a conclusion person, even though we hope that it will eventually get placed into a growing structure. But this gives the user interface exactly what it needs to see -- all the structure roots and all the stand-alone person records represent the current “state of your research,” the proper set of persons to be visualizing. Note that the user interface must also give easy access to seeing the contents of the N-tier structures, since the user must be able to reckon with the information at this level. If GEDCOMX wants to support this structure, then they'd have to make sure that programs not implementing it could still input data sets containing it, process the rest of the data their way, and then export their modified data along with the non-processed persona data so that the persona linkages are still valid. Certainly the GEDCOMX standard will have to explain this. I don't know how that can be guaranteed. What if a conclusion person is deleted? They'll lose the linkages to the 1st level personas, and those will all become top level. When a conclusion person is deleted, it was a root of an N-tier structure. All the person records one level down in that tier are suddenly transformed into conclusion persons. Isn’t this precisely what it means to remove a conclusion person? It means that you have decided that your earlier decision to bring together the data “below that person” into an individual was wrong. You want those persons below you to now re-enter into the research dance once more, to be combined in other ways that better represent your corrected conclusions. And if a new person is added, they'll have no persona linkages, so the data will become incomplete. Exactly! But they are not incomplete. They are simply stand alone records. If they are legitimate conclusion persons they can remain in that state forever (they are simply 1-tier persons, perfectly legitimate in an N-tier system). If they are personas in the traditional sense they will soon or eventually be placed into a structure under a conclusion person. And this is a tremendous example of what the challenges GEDCOMX has. Any developer who includes some new data structure in his program will be challenged, no matter what standard is developed, to have other programs pass their data through properly. Exactly. Change forces change. If the change is ultimately good then the pain caused by the change will be worth it. If not not. But this is how progress progresses. |
This could go on and on and on endlessly. Can we just agree to disagree? We have both made our arguments and ultimately it will be up to Ryan to decide. I suspect you will get your Person->Person links simply because it is similar to GEDCOM ALIAses and because it is easy to see the benefits for the social-networking aspects of genealogy. If so, you will be able to finalise your dream and actually utilise DeadEnds. Personally, I will not be using it (either as a developer or as a genealogist). |
@EssyGreen |
OK, I'd hate you to think I couldn't explain, so here goes, back into the fray: Data loss - will occur when importing into a system which does not adhere to your specific implementation and yet needs/wants to ensure data integrity. Ambiguity - will occur because it is not clear what the link implies - does it mean that Person A is proven to be Person B (in which case where is the proof/evidence and why are they not condensed into a single Person representing the real person in the real world) or does it mean Person A looks like it might be Person B but needs further research (in which case do I put the next bit of research against Person A or Person B). Also, you argued elsewhere that the order of discovery was important and hence Person A = Person B in your model is not the same thing as Person B = Person A so does the link really mean "Person A (who was discovered first) is thought to be the same as Person B (who was discovered later)". If a user then attaches the reverse link then this statement no longer makes any sense. Should the application allow this or not? (Rhetorical question - I'm just trying to explain the problem - you won't be there to give the answers when the developer has to make the decision) Confusion - most (if not all) genealogists think of people in their tree as representing real-world people whose lives they are trying to re-construct. Your model has no such thing as a real-world person because the things representing that person's life are fragmented. I think most users would desperately miss being able to see their Persons as whole people. Complexity - this comes from the confusion above since it would be the responsibility of the developer to pull your fragments together into a model which resembles the real world again. This would mean repeatedly iterating through all your Persons to try to establish which ones were the real/base whilst avoiding circular relationships. It's tricky but it can be done but then we still haven't got to the end of it because we need to then merge, say, the Names (after all a user would want to see that Freda Bloggs' maiden name was Smith). Again it can be done but the application would be focusing all its energy on re-constructing. The re-construction should be the job of the researcher not the application. I firmly believe that genealogists are trying to re-create the real world. So the primary objects should be modelled on the real world (ie Persons/People). Your model is a model of the interconnectivity of references to people in sources. That is not the same thing. And this leads nicely back to the subject of this post ... I personally would use the Record Model to show representations of people as they were recorded in particular sources and I would use the Conclusion Model to model the 'real-world' people that genealogists are trying to re-construct. The Person and Persona are the same objects (they are both representations of people) in different contexts (with different functional needs) but both are needed. |
@EssyGreen Data loss -- criticism unfair -- it has nothing to do with the model, only its acceptance. Ambiguity -- there is never ambiguity -- the sub person relationship always means believed to be the same person because of ..., where the because of ... is supplied by a conclusion statement or a proof statement. Confusion -- the top level person in a cluster always represents the conclusion person. In 99.9% of the cases data will be 1 and 2 tier so exactly as today. The users of NFS have no trouble with 2 tier, because the UI makes it seamless. Complexity -- unfair -- your criticism revolves around the assumption that developers are incompetent and some odd misconceptions that the model requires repeated activities and reconstructions and that circular relationships are difficult to prevent. Merging names? Never happens. The N-tier model merges the record and conclusion models with the best features of both. I am sorry I have made that difficult to see. |
@ttwetmore - I have replied in #149 since I think this thread is getting swamped with N-tier when it is actually attempting to address a completely different issue. You already have 3 threads on N-tier so let's try to keep our debate in those rather than letting it bleed so profusely elsewhere. |
Actually, the plan was to put it in a separate--but public--project where its initial scope would be limited to bulk exchange of field-based record image extraction. I don't deny that--at FamilySearch--it might become the primary means of publishing derivative source information, but we don't have the resources to promote it as a broad industry standard right now. So we'd like to focus first on getting the "core" project right and promoting it as a standard. The goal for this "core" project is to define a model and serialization format for exchanging the components of the proof standard as specified by the genealogical research process (see #141) in a standard way. A lot of this is based on resource constraints. We've got hard requirements to meet some specific deadlines for the sharing of this field-based record data. And we have a limited amount of resources for getting it done. Because of these limitations, we don't have as much room to accommodate a broad community influence on it. So we'd rather not pretend it's a community standard if we don't have the means to treat it as such. Unfortunate, yes, but those are the realities. It's different for this "core" project. We're committed to seeing it through as a real community-supported, broadly-adopted standard. |
@stoicflame - Many thanks for that clarification. I think that's actually great news :) |
I'd just like to say thanks to everybody who has contributed to this thread to help us understand and articulate the goals, scope, and context of the different models (conclusion, record) we were proposing. I hope things are much clearer now: http://familysearch.github.com/gedcomx/2012/03/23/gedcomx-identity.html With the projects now separated, we're going to close this issue and move on to the (many) other high-priority issues. |
Executive Summary
The record and conclusion models actually model the same domain but have been artificially separated. Thus, GedcomX isn't even interchangeable with itself in spite of trying to be a widely useful interchange format. For almost all classes in the record model there is a corresponding, parallel class in the conclusion model. Merging these parallel classes together into a single model will make GedcomX easier to understand, easier to implement, and more powerful as a representation of genealogical data. In the remainder of this issue, I explain what's wrong with the current situation, why it's so harmful, and how the problem can be resolved.
What's Wrong
First, some quotes for context:
and
As it stands, the GedcomX record model does not model the "content of an artifact." A record model that actually modeled "the content of an artifact" would do things like specify its dimensions, textual transcription, identifying marks, etc. Instead, the record model represents conclusions drawn from the content of an artifact. For example, the claim "John Smith was born 1 Jan 1930", though supported by the contents of Mr. Smith's birth certificate, is a conclusion a researcher drew based on that certificate. Conclusions such as this that are made on the basis of artifact contents are just another kind of conclusion.
In its current form, GedcomX tries to model one aspect of conclusion metadata (whether or not a fact was concluded based on the contents of a document or artifact) not by allowing for this to be represented in the metadata classes themselves, but rather by duplicating the entire set of data classes and metadata classes and declaring that metadata dealing with this new set represent conclusions drawn directly from a document. As a result, the record and conclusion data models are two separate but almost exactly parallel models of the same domain. The distinction upon which this duplication is justified is essentially arbitrary, treating a special kind of conclusion as if it were so distinctive that it must be modeled as an entirely different domain.
Why It's Harmful
The model duplication that exists in the current GedcomX specification adds to user confusion ("What's the difference between a person and a persona?"), complicates the task of implementing the standard (twice as many entities to represent), and reduces the utility of data represented using GedcomX (a persona transcribed from a record is not necessarily comparable to a corresponding person in a pedigree, even if they actually represent the same individual).
Resolution
Instead of making a complete copy of the data and metadata classes, this distinction can be much more parsimoniously modeled by simply enriching the metadata model. I propose modeling the genealogy domain as a set of core entity types (person, place, event, date/time, document, etc.) and a vocabulary for making statements about such entities (e.g. person X was born in place Y), combined with a metadata vocabulary for justifying these statements, recording the reasoning behind them, and showing who exactly is making the claims (e.g. researcher A claims/asserts/believes that person X was born in place Y because of evidence found in document Z). This lends itself to a two-part model, one for making statements about the core entities (data), another for making statements about those statements (metadata).
Rather than embedding Facts within the entities they are about, a general Fact class should be created that can represent claims of fact about any entity type. For example, in Turtle syntax:
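The Turtle snippet itself did not survive here; the following is a plausible reconstruction of the kind of standalone Fact being proposed (the prefixes, property names and identifiers are assumptions for illustration, not GedcomX vocabulary):

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix gx:  <http://example.org/gedcomx-sketch/> .
@prefix ex:  <http://example.org/data/> .

# Data: a claim of fact about an entity, not embedded inside it.
ex:fact1 a gx:Fact ;
    gx:about    ex:personX ;
    gx:factType gx:Birth ;
    gx:place    ex:placeY ;
    gx:date     "1930-01-01"^^xsd:date .

# Metadata: who asserts the claim, and on what evidence.
ex:assertion1 a gx:Assertion ;
    gx:statement    ex:fact1 ;
    gx:attributedTo ex:researcherA ;
    gx:evidence     ex:documentZ .
```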
(A similar approach is described in my RootsTech Family History Technology Workshop paper.)
An appropriate resolution to this issue would involve either 1) merging the record and conclusion models and perhaps refactoring the result into data and metadata models, or 2) giving a convincing argument for why the current state of affairs is necessary, including specific use cases that could not be modeled using a single merged model. The burden of evidence that would justify the duplication of a substantial subset of the GedcomX vocabulary is, in my opinion, fairly high, given the cost in user confusion, implementation difficulty, and data format utility mentioned above.
See Also
Issue #131 "A Persona IS a Person"
Conclusion Record Distinction