Tips in phylogenies/*/{summary,posterior}.trees should be renamed into glottocodes #227

Closed
SimonGreenhill opened this issue Jan 16, 2019 · 30 comments

@SimonGreenhill
Collaborator

The trees processed into {summary,posterior}.trees should have glottocodes as tips. The current website offers downloads with the raw taxon labels, which makes the trees harder for researchers to use.

This could be done in the import stage or at the initial commit stage. The latter is probably better.

@xrotwang
Collaborator

xrotwang commented Jan 16, 2019 via email

@SimonGreenhill
Collaborator Author

Yeah, but having the data in the repository inconsistent with the website is a bit annoying. We have the original trees & names in each phylogeny's ./originals/ directory, so I think having the two imported files {summary, posterior} glottotipped makes more sense.

@xrotwang
Collaborator

xrotwang commented Jan 16, 2019 via email

@SimonGreenhill
Collaborator Author

Hmm, true. Perhaps pydplace's check command could check that all tips in the tree are listed in the taxa.csv and vice versa?

I've realised why we haven't renamed in the past: what do we do with non-dplace tips? We should probably drop them now to improve ease of use. This will require a bit of processing, but we can do this with the ete3 library. I'll work up a pull request on one language and see how it looks.
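A minimal sketch of the tips-versus-taxa.csv check suggested above, using only the standard library. It is not pydplace's actual check command. The column name `taxon`, the taxon labels and the glottocodes are made up for illustration, and the parser only handles a plain Newick string (no quoted labels, comments or NEXUS translate blocks).

```python
import csv
import io
import re


def newick_tips(newick):
    """Return the set of leaf labels in a plain Newick string.

    A leaf label directly follows '(' or ',', whereas an internal node
    label follows ')'. Quoted labels and comments are not handled.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def check_tips(newick, taxa_csv_text, column="taxon"):
    """Return (tips missing from taxa.csv, taxa missing from the tree)."""
    tips = newick_tips(newick)
    reader = csv.DictReader(io.StringIO(taxa_csv_text))
    taxa = {row[column] for row in reader}
    return sorted(tips - taxa), sorted(taxa - tips)


# Hypothetical tree and taxa table, for illustration only.
tree = "((TaxonA:0.1,TaxonB:0.2):0.3,(TaxonC:0.1,TaxonD:0.2):0.4);"
taxa_csv = (
    "taxon,glottocode\n"
    "TaxonA,aaaa1234\n"
    "TaxonB,bbbb1234\n"
    "TaxonC,cccc1234\n"
    "TaxonE,eeee1234\n"
)

not_in_taxa, not_in_tree = check_tips(tree, taxa_csv)
print("tips missing from taxa.csv:", not_in_taxa)
print("taxa missing from the tree:", not_in_tree)
```

Running the check in both directions catches tips that would be left unrenamed as well as taxa.csv rows that no longer correspond to any tip.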

@SimonGreenhill SimonGreenhill self-assigned this Jan 17, 2019
@xrotwang
Collaborator

xrotwang commented Jan 17, 2019 via email

@xrotwang
Collaborator

xrotwang commented Jan 17, 2019 via email

@xrotwang
Collaborator

xrotwang commented Jan 17, 2019 via email

@SimonGreenhill
Collaborator Author

SimonGreenhill commented Jan 22, 2019

Ok, thinking more about implementing this. I think that the first step is to make the released data versions more visible (see D-PLACE/dplace2#10); these release versions and the versions on the website should have the tips renamed to glottocodes.

In general, I like the idea of having renamed tips in the data repository for simplicity, but can see that it adds a maintenance headache (i.e. if I edit taxa.csv then these changes need to propagate into the tree files).

Solutions:

  1. Update the Makefiles to rename tips to glottocode.

    • Benefits: consistency of data between curation repository and releases, and people can easily clone the repository and use it.
    • Drawbacks: possibility of taxa.csv:summary.trees mismatches on revisions, but this could be alleviated by pydplace check?
  2. Update the Makefiles to rename tips to glottocode and revise the release/website-update procedure to run all the Makefiles.

    • Benefits: consistency of data between curation repository and releases, and people can easily clone the repository and use it.
    • Drawbacks: complexity of build process to solve the change propagation problem (this could just be something like find . -name Makefile | xargs -n 1 make clean && make, with a step to cd into the correct working dir). Slower release process.
  3. Keep the trees in the repository as-is, and rename on release/website-update.

    • Benefits: easy
    • Drawbacks: mismatch between dplace-data and dplace website, which may cause problems for users.

I think I like 1 > 2 >> 3. Anything I'm missing? thoughts?
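For reference, the renaming step that options 1 and 2 would run from the Makefiles could look roughly like the sketch below. It assumes a plain Newick string and a taxon-to-glottocode mapping already read from taxa.csv; the labels and glottocodes are made up. Tips without a mapping are left unchanged here, since dropping them needs real tree pruning (e.g. with ete3), which this sketch does not attempt.

```python
import re


def rename_tips(newick, mapping):
    """Replace each leaf label found in `mapping`; leave other labels as-is.

    Leaf labels are recognised as the token that directly follows '(' or ','.
    Quoted labels, comments and NEXUS translate blocks are not handled.
    """
    def swap(match):
        prefix, label = match.group(1), match.group(2)
        return prefix + mapping.get(label, label)

    return re.sub(r"([(,]\s*)([^\s(),:;]+)", swap, newick)


# Hypothetical mapping, as it might be derived from taxa.csv.
mapping = {"TaxonA": "aaaa1234", "TaxonB": "bbbb1234"}
tree = "((TaxonA:0.1,TaxonB:0.2):0.3,TaxonC:0.4);"

renamed = rename_tips(tree, mapping)
print(renamed)
```

Because the output is derived entirely from taxa.csv and the original tree, re-running it after every taxa.csv edit is what keeps the two from drifting apart.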

@SimonGreenhill
Collaborator Author

..ok, that's weird. Your last few comments weren't showing before despite being 5 days old?! This is a problem though. Do we rename tips in all trees (incl. glottotrees) to xd_ids?

@xrotwang
Collaborator

I think the whole xd_id thing in D-PLACE is a bit half-cooked. I think D-PLACE should either commit to identifying societies across society sets, or not doing this. Since the decision will probably go towards not doing it, I'd say - to be consistent (and transparent) - renaming tree tips would have to be done per society set. So since this will introduce at least three renamed versions of each phylogeny, I'd be much more in favor of 3 above, because the distinction between released versions and in-curation data would be clearly visible.

@xrotwang
Collaborator

Oh and if we go for renaming per-society set, we should probably rename to society IDs, not glottocodes.

@kirbykat
Collaborator

@xrotwang @SimonGreenhill - societies with the same xd_id always have the same glottocode

@kirbykat
Collaborator

kirbykat commented Jan 22, 2019

@xrotwang @SimonGreenhill The xd_id thing is not really half-cooked; it was the solution I came to after struggling for a long time with how to deal with one-to-many relationships between societies in different society sets, but I'm open to other solutions. How do we link one society in the EA to 2 societies in Binford, if all three have the same glottocode, and that glottocode is ALSO shared with a fourth EA society, which does NOT map to the 2 Binford societies?

[Figure: xd_ids - one to many example]
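The one-to-many situation described in this comment can also be written down as a small table. The sketch below is only an illustration: the society IDs, the glottocode and the xd_id labels are made up. It shows that grouping by glottocode lumps all four societies together, while grouping by xd_id keeps the fourth EA society apart.

```python
from collections import defaultdict

# Hypothetical societies: one EA society linked to two Binford societies,
# plus a fourth EA society that shares the glottocode but not the link.
societies = [
    {"id": "EA_1", "set": "EA", "glottocode": "aaaa1234", "xd_id": "xdA"},
    {"id": "B_1", "set": "Binford", "glottocode": "aaaa1234", "xd_id": "xdA"},
    {"id": "B_2", "set": "Binford", "glottocode": "aaaa1234", "xd_id": "xdA"},
    {"id": "EA_2", "set": "EA", "glottocode": "aaaa1234", "xd_id": "xdB"},
]


def group_by(rows, key):
    """Group society IDs by the value of `key`, preserving input order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["id"])
    return dict(groups)


by_glottocode = group_by(societies, "glottocode")
by_xd_id = group_by(societies, "xd_id")
print(by_glottocode)
print(by_xd_id)
```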

@kirbykat
Collaborator

@xrotwang @SimonGreenhill - I just edited the comment/figure for clarity (not sure if you get notified about that).

@kirbykat
Collaborator

@xrotwang @SimonGreenhill - In a perfect world, this would be possible using glottolog_id + sets of geographic polygons (so, in the above example, points or polygons for Binford's xd_id=A societies would be located within the geographic boundaries of a polygon of Murdock's xd_id=A), but the reality is that our geolocation data are not good enough for this.

@kirbykat
Collaborator

kirbykat commented Jan 22, 2019

@xrotwang @SimonGreenhill - I know you are not asking me, but I think I would vote for glottocodes at tips. Originally I had tried to link xd_ids, which is technically more accurate in cases where a tree tip is labeled with a language, but the word lists linked to that tip were actually collected from a dialect/sub-group of the language that is a better match for one D-PLACE society over another. However, I now think this is too hard to maintain: society-language links are constantly being updated, so xd_id to tip matches would similarly need to be updated.

Ideally, tree tips should be the highest-resolution glottocode possible (so, dialect rather than language, if appropriate).

Then, for each tip on a tree, I envision a tip ---> society matching procedure that would go something like:

  1. Given the glottocode with which this tip is labeled, is there one or more direct D-PLACE society matches? If yes, great. Match(es) made.
    If NO, then EITHER
    2a. If the answer to (1) is no, AND if that tip glottocode was a dialect-level glottocode, AND if there is no other tip in the tree that is labeled with a dialect of the same parent language that can be matched to a D-PLACE society, THEN the node above the dialect-level split becomes the "new tip". (This node would receive the parent glottocode of the dialect-labeled tip, retrieved using a look-up table if not already embedded in the tree). Repeat Step 1.
    OR
    2b. If the answer to (1) is no, AND if the tip's glottocode was already a language-level glottocode, then it's possible that there are D-PLACE societies that have dialect-level glottocodes that should be matched to the language-level tip (i.e., the language of the tip is their parent). SO, after retrieving the language-level glottocode of all societies in D-PLACE that are linked to a dialect-level glottocode, repeat Step 1 within those language-level glottocodes.
  3. If Step 2 does not result in a D-PLACE society match for a given tip, there is no match.
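The steps above can be sketched in code as follows. This is one reading of the procedure, not an existing D-PLACE implementation. It assumes a hypothetical look-up table `parent` from dialect-level glottocodes to their language-level glottocode (a glottocode absent from the table is treated as language-level), and a dict of society IDs to glottocodes. All identifiers and data are made up.

```python
def direct_matches(glottocode, societies):
    """Step 1: societies whose glottocode equals the given one."""
    return sorted(s for s, g in societies.items() if g == glottocode)


def match_tip(tip, all_tips, societies, parent):
    """Return the society IDs matched to `tip`, or [] if there is no match."""
    found = direct_matches(tip, societies)
    if found:
        return found

    if tip in parent:
        # Step 2a: the tip is a dialect. Move up to the parent language only
        # if no other dialect tip of that language matches a society itself.
        language = parent[tip]
        siblings = [t for t in all_tips
                    if t != tip and parent.get(t) == language]
        if any(direct_matches(t, societies) for t in siblings):
            return []  # Step 3: no match for this tip
        return direct_matches(language, societies)

    # Step 2b: the tip is language-level. Match societies that are linked
    # to a dialect whose parent language is this tip.
    return sorted(s for s, g in societies.items() if parent.get(g) == tip)


# Hypothetical data.
parent = {"dial1111": "lang1111", "dial2222": "lang1111",
          "dial3333": "lang2222"}
societies = {"S1": "lang1111", "S2": "dial3333", "S3": "tipx1234"}
tips = ["tipx1234", "dial1111", "lang2222", "none1234"]

matches = {tip: match_tip(tip, tips, societies, parent) for tip in tips}
print(matches)
```

A tip can come back with more than one society, which is exactly the open question raised in the considerations that follow; the sketch returns all candidates and leaves that choice to the analysis.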

Some considerations:

  • What to do when there is more than one D-PLACE society that could be matched to a tip? Randomly select one? Select the society with non-missing value(s) for the trait(s) you are interested in? (I'm sure there is already a system in place for this -- I know Cara had a system for dealing with this in her analyses).

@xrotwang
Collaborator

@kirbykat sorry for the wording. I guess what I struggle with is the public role xd_id should play. I.e. if it is well thought through we should probably make its existence more visible. Presumably, I'm to blame for hiding it in the web app, because I didn't fully wrap my head around it yet - or thought that there are already enough identifiers for societies.

@xrotwang
Collaborator

As far as I can tell, most of this complexity only comes into the picture when trying to compare societies across society/data sets, right? And I think what @kirbykat describes is totally appropriate:

I know Cara had a system for dealing with this in her analyses

I.e. D-PLACE - the dataset - should stay away from decisions that may only make full sense within a particular analysis.

So, since most (all?) phylogenies in D-PLACE come from linguistic data, mapping tree tips to the most specific Glottocode would make sense. I'm not sure, though, that this would fulfil the requirement of making the phylogenies as easy to use as possible, because there's still an analysis-specific mapping process needed.

@xrotwang
Collaborator

So I guess what I meant with "half-cooked" is that xd_id seems somewhat analysis-specific to me.

@kirbykat
Collaborator

@xrotwang No worries re: wording! Here is my visualized procedure for tip mapping. Maybe obvious from above, and maybe there are better ways. But in case not clear, hopefully this illustrates what I mean.
[Figure: tree_tip_matching_procedure]

@xrotwang
Collaborator

@kirbykat I think I understand what you mean (and I think I have seen code implementing this algorithm in the old app :) ). This algorithm already assumes a chosen set of D-PLACE societies to pick from, though. But whether this set is all D-PLACE societies or one D-PLACE society set is the question where xd_id comes in. And I would argue that this question cannot be answered without the context of a particular analysis/research question.

So while I think it would be a good idea to implement this algorithm as an example in the D-PLACE cookbook, I don't think it should be used to create any "easy-to-use" representation of a phylogeny.

@kirbykat
Collaborator

kirbykat commented Jan 22, 2019

@xrotwang Yes, I agree. Actually I don't even think xd_id is relevant in the case you mention above (if I understand right), but I agree society set is critical. I would say xd_id is most useful (1) for updating language matches (I use it for this all the time: I can apply one update to all matched societies, instead of updating the language match of EACH corresponding society in each society set) and (2) for joining (merging?) data across sets. For example, I have a subsistence dataset I have been working with, for which I chose [all EA societies] + [any Binford societies with an xd_id not in the EA] + [any WNAI societies with an xd_id not in the EA or Binford]. (The SCCS is entirely contained within the EA, so not relevant in this example).
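The selection rule in the subsistence example (all of the first set, then only societies from later sets whose xd_id has not appeared yet) could be sketched like this. The society IDs and xd_ids are made up, and the function name is hypothetical.

```python
def combine_by_xd_id(society_sets):
    """Combine society sets given in priority order.

    Each set is a list of (society_id, xd_id) pairs. A society is kept if
    its xd_id does not occur in any EARLIER set; societies sharing an xd_id
    within the same set are all kept.
    """
    seen = set()
    chosen = []
    for societies in society_sets:
        new_xd_ids = set()
        for soc_id, xd_id in societies:
            if xd_id not in seen:
                chosen.append(soc_id)
                new_xd_ids.add(xd_id)
        # Only update after the whole set is processed, so that a set never
        # excludes its own members.
        seen |= new_xd_ids
    return chosen


# Hypothetical data, in priority order EA > Binford > WNAI.
ea = [("EA1", "xdA"), ("EA2", "xdB")]
binford = [("B1", "xdA"), ("B2", "xdC"), ("B3", "xdC")]
wnai = [("W1", "xdC"), ("W2", "xdD")]

combined = combine_by_xd_id([ea, binford, wnai])
print(combined)
```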

@kirbykat
Collaborator

@xrotwang re: deciding on which society set to link to the tree, I agree this is question dependent!

@xrotwang
Collaborator

So basically xd_id provides a more fine-grained partition of the set of all D-PLACE societies than glottocode does; but whether that can be exploited depends on the context. E.g. in some contexts the same xd_id can mean "pretty much the same", in others not so much.

@xrotwang
Collaborator

If so, then our current treatment of xd_id seems fine:

  • Use it in dplace-data for curation tasks - e.g. updating language matches across society sets.
  • Use it in the web app to create rather generic "see also" links to societies with the same xd_id.

@kirbykat
Collaborator

@xrotwang Well, yes to "a more fine-grained partition of the set than glottocode", but xd_id is our best estimate of "equivalent societies" among society sets, so as long as YEAR is the same for both data points, a single xd_id should refer to "pretty much the same" society.

The tricky cases (which are rare, fortunately!) are the "one-to-many equivalent societies" across society sets. This is a result of how different authors "lumped" or "split" cultural units.

In these rare "one-to-many equivalent societies" cases, someone interested in joining data from different society sets has to decide how they will combine the multiple estimates. For example, in the case of the blue "xd_id=A" circles in the figure above, if I get subsistence data from Murdock's EA, and want to join that with information on hunting group size from Binford, I will have to decide which of two possible estimates of hunting group size I will use. Best case scenario the two estimates are the same (both have hunting groups of size "5" persons), in which case no need to think more about it.

Also, as mentioned earlier, the other key consideration when joining data for a single xd_id is YEAR of observation. Typically you want the years for different data points to be as similar as possible.
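As an illustration of the two considerations above (several candidate estimates for one xd_id, and year of observation), a join might look like the sketch below. Picking the estimate closest in year is just one possible rule, not something D-PLACE prescribes, and all values are made up.

```python
def join_on_xd_id(left, right):
    """Join two datasets on xd_id.

    left:  {xd_id: (year, value)}, one data point per xd_id
    right: {xd_id: [(year, value), ...]}, possibly several estimates
    For each xd_id present in both, pick the right-hand estimate whose year
    is closest to the left-hand year, and flag disagreeing estimates.
    """
    joined = {}
    for xd_id, (year, value) in left.items():
        candidates = right.get(xd_id, [])
        if not candidates:
            continue
        best_year, best_value = min(candidates,
                                    key=lambda c: abs(c[0] - year))
        joined[xd_id] = {
            "left": value,
            "right": best_value,
            "year_gap": abs(best_year - year),
            # True when the candidate estimates do not all agree.
            "ambiguous": len({v for _, v in candidates}) > 1,
        }
    return joined


# Hypothetical data: one subsistence record, two hunting-group-size estimates.
subsistence = {"xdA": (1900, "hunting")}
group_size = {"xdA": [(1880, 5), (1930, 8)]}

result = join_on_xd_id(subsistence, group_size)
print(result)
```

Keeping the `year_gap` and `ambiguous` fields in the output leaves the final decision visible to whoever runs the analysis.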

@kirbykat
Collaborator

@xrotwang - Yes, I agree with your last comment!

@xrotwang
Collaborator

Yes, the cases where multiple societies in the same society set have the same xd_id are the reason I wouldn't want to call the xd_id relation "equivalence". Equivalence without transitivity is a bit weird, and D-PLACE didn't want to go as far as actually identifying these societies - and adding, maybe, a "settlements" attribute to the resulting society, listing what was considered different by the source.

@kirbykat
Collaborator

kirbykat commented Jan 22, 2019

Yes. And the problem with listing what was considered different by different sources is that often we have no idea why one chose to group units while another chose to split.

@SimonGreenhill
Collaborator Author

@kirbykat -- there's lots of useful stuff in here. Can you write this up in a document somewhere please?

I'm going to close this issue for now as I think it's easy to mess up and no-one is really asking for this right now.


3 participants