Tips in phylogenies/*/{summary,posterior}.trees should be renamed into glottocodes #227

SimonGreenhill · 2019-01-16T09:01:13Z

The trees processed into {summary,posterior}.trees should have tips as glottocodes. The current website has downloads with the raw taxa labels, which is a barrier for researchers who want to use them.

This could be done in the import stage or at the initial commit stage. The latter is probably better.

xrotwang · 2019-01-16T09:18:25Z

Yes. We can certainly create some easy-to-use versions of the data, at least upon release. Maybe not upon commit. This would also raise awareness that the released versions of the data are what we want people to use. Alternatively, or as a first step, we can include renaming taxa via phyltr, e.g., in the cookbook.

SimonGreenhill · 2019-01-16T11:33:07Z

Yeah, but having the data in the repository inconsistent with the website is a bit annoying. We have the original trees & names in each phylogeny's ./originals/ directory, so I think having the two imported files {summary, posterior} glottotipped makes more sense.
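For illustration, renaming tips could look roughly like the sketch below. It is not the D-PLACE import code: `rename_tips` is a hypothetical helper, the glottocodes are made up, and it assumes a plain Newick string with unquoted labels and no bracketed annotations. Tips without a mapping are left unchanged.

```python
import re

# A tip label directly follows "(" or ","; internal node labels follow ")".
TIP_LABEL = re.compile(r"(?<=[(,])([^\s(),:;\[\]]+)")


def rename_tips(newick, glottocode_by_taxon):
    """Replace every tip label that has an entry in the mapping."""
    def replace(match):
        label = match.group(1)
        return glottocode_by_taxon.get(label, label)

    return TIP_LABEL.sub(replace, newick)


# Made-up taxa and glottocodes, for illustration only.
renamed = rename_tips(
    "((Taxon_A:0.1,Taxon_B:0.2):0.3,Taxon_C:0.4);",
    {"Taxon_A": "abcd1234", "Taxon_B": "efgh5678"},
)
print(renamed)
```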

xrotwang · 2019-01-16T11:46:53Z

We'd have to make sure the mapping in taxa.csv is injective, though. Otherwise, it may not be possible to apply Glottolog updates correctly.
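A minimal sketch of such an injectivity check, assuming taxa.csv has `taxon` and `glottocode` columns (the column names, the helper name and the example rows are assumptions for illustration):

```python
import csv
import io
from collections import defaultdict


def non_injective_glottocodes(rows):
    """Return {glottocode: [taxa]} for glottocodes assigned to more than one taxon."""
    taxa_by_glottocode = defaultdict(list)
    for row in rows:
        glottocode = (row.get("glottocode") or "").strip()
        if glottocode:  # taxa without a glottocode cannot break injectivity
            taxa_by_glottocode[glottocode].append(row["taxon"])
    return {g: taxa for g, taxa in taxa_by_glottocode.items() if len(taxa) > 1}


# Made-up rows standing in for one phylogeny's taxa.csv.
EXAMPLE_TAXA_CSV = """taxon,glottocode
Taxon_A,abcd1234
Taxon_A_2,abcd1234
Taxon_B,efgh5678
Unmapped_tip,
"""

duplicates = non_injective_glottocodes(csv.DictReader(io.StringIO(EXAMPLE_TAXA_CSV)))
print(duplicates)
```

An empty result would mean that every glottocode maps back to exactly one tip, so a renamed tree could still be traced back to the original labels.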

SimonGreenhill · 2019-01-17T08:29:17Z

Hmm, true. Perhaps pydplace's check command could check that all tips in the tree are listed in taxa.csv and vice versa?

I've realised why we haven't renamed in the past: what do we do with non-dplace tips? We should probably drop them now to improve ease of use. This will require a bit of processing, but we can do this with the ete3 library. I'll work up a pull request on one language and see how it looks.
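The two-way check suggested above could be sketched as follows. This is not the actual pydplace check: `tip_labels` and `check_tips_against_taxa` are hypothetical helpers, and the parsing assumes a plain Newick string with unquoted labels (bracketed annotations are stripped first). Pruning is not covered here.

```python
import re


def tip_labels(newick):
    """Extract the set of tip labels from a Newick string (unquoted labels only)."""
    without_comments = re.sub(r"\[[^\]]*\]", "", newick)
    # A tip label directly follows "(" or ","; internal node labels follow ")".
    return set(re.findall(r"(?<=[(,])\s*([^\s(),:;]+)", without_comments))


def check_tips_against_taxa(newick, taxa):
    """Return (tips missing from taxa.csv, taxa missing from the tree)."""
    tips = tip_labels(newick)
    taxa = set(taxa)
    return sorted(tips - taxa), sorted(taxa - tips)


# Made-up tree and taxon list, for illustration only.
missing_from_taxa, missing_from_tree = check_tips_against_taxa(
    "((A:1,B:1):1,(C:1,D_x:1):1);",
    ["A", "B", "C", "E"],
)
print(missing_from_taxa, missing_from_tree)
```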

xrotwang · 2019-01-17T08:42:48Z

I can do the pruning, too. The web app does this as well, so I already have code for it.

xrotwang · 2019-01-17T08:45:43Z

This probably means we should keep two variants of all tree files: original and mapped, or processed, or whatever is a good name.

xrotwang · 2019-01-17T08:52:00Z

There's another issue with renaming. If we rename to glottocodes, this is potentially still not what we want, because societies with the same xd_id may have different glottocodes. So the renamed tree tips would only be compatible with one society set, unless the user resolves the xd_id mapping.

SimonGreenhill · 2019-01-22T10:41:08Z

Ok, thinking more about implementing this. I think the first step is to make the released data versions more visible (see D-PLACE/dplace2#10); these release versions and the versions on the website should have the tips renamed to glottocodes.

In general, I like the idea of having renamed tips in the data repository for simplicity, but can see that it adds a maintenance headache (i.e. if I edit taxa.csv then these changes need to propagate into the tree files).

Solutions:

1. Update the Makefiles to rename tips to glottocode.
   - Benefits: consistency of data between curation repository and releases, and people can easily clone the repository and use it.
   - Drawbacks: possibility of taxa.csv:summary.trees mismatches on revisions, but this could be alleviated by pydplace check?
2. Update the Makefiles to rename tips to glottocode and revise the release/website-update procedure to run all the Makefiles.
   - Benefits: consistency of data between curation repository and releases, and people can easily clone the repository and use it.
   - Drawbacks: complexity of build process to solve the change propagation problem (this could just be something like find . -name Makefile | xargs -n 1 make clean && make, with a step to cd into the correct working dir). Slower release process.
3. Keep the trees in the repository as-is, and rename on release/website-update.
   - Benefits: easy
   - Drawbacks: mismatch between dplace-data and dplace website, which may cause problems for users.
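The "run all the Makefiles" step in option 2 needs to change into each Makefile's directory before calling make. A rough sketch of that step, with hypothetical helper names and a dry-run mode so the plan can be inspected without building anything:

```python
import subprocess
import tempfile
from pathlib import Path


def makefile_dirs(root):
    """Return the sorted directories below root that contain a Makefile."""
    return sorted(path.parent for path in Path(root).rglob("Makefile"))


def rebuild_all(root, dry_run=False):
    """Run `make clean` and then `make` inside each directory with a Makefile."""
    commands = []
    for directory in makefile_dirs(root):
        for command in (["make", "clean"], ["make"]):
            commands.append((directory, command))
            if not dry_run:
                # cwd plays the role of the "cd into the correct working dir" step.
                subprocess.run(command, cwd=directory, check=True)
    return commands


# Dry run against a throwaway directory layout (made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("phylo_a", "phylo_b"):
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "Makefile").write_text("all:\n\t@true\n")
    planned = [(d.name, cmd) for d, cmd in rebuild_all(tmp, dry_run=True)]

print(planned)
```

With `check=True` the loop stops at the first failing build, which seems preferable for a release process, though that is a design choice rather than something decided in this thread.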

I think I like 1 > 2 >> 3. Anything I'm missing? thoughts?

SimonGreenhill · 2019-01-22T10:43:13Z

..ok, that's weird. Your last few comments weren't showing before despite being 5 days old?! This is a problem though. Do we rename tips in all trees (incl. glottotrees) to xd_ids?

xrotwang · 2019-01-22T10:49:38Z

I think the whole xd_id thing in D-PLACE is a bit half-cooked. I think D-PLACE should either commit to identifying societies across society sets, or not doing this. Since the decision will probably go towards not doing it, I'd say - to be consistent (and transparent) - renaming tree tips would have to be done per society set. So since this will introduce at least three renamed versions of each phylogeny, I'd be much more in favor of 3 above, because the distinction between released versions and in-curation data would be clearly visible.

xrotwang · 2019-01-22T10:51:31Z

Oh and if we go for renaming per-society set, we should probably rename to society IDs, not glottocodes.

kirbykat · 2019-01-22T11:03:13Z

@xrotwang @SimonGreenhill - societies with the same xd_id always have the same glottocode

kirbykat · 2019-01-22T11:13:46Z

@xrotwang @SimonGreenhill The xd_id thing is not really half-cooked; it was the solution I came to after struggling for a long time with how to deal with one-to-many relationships between societies in different society sets, but I'm open to other solutions. How do we link one society in the EA to 2 societies in Binford, if all three have the same glottocode, and that glottocode is ALSO shared with a fourth EA society, which does NOT map to the 2 Binford societies?

kirbykat · 2019-01-22T11:19:28Z

@xrotwang @SimonGreenhill - I just edited the comment/figure for clarity (not sure if you get notified about that).

kirbykat · 2019-01-22T11:22:42Z

@xrotwang @SimonGreenhill - In a perfect world, this would be possible using glottolog_id + sets of geographic polygons (so, in the above example, points or polygons for Binford's xd_id=A societies would be located within the geographic boundaries of a polygon of Murdock's xd_id=A), but the reality is that our geolocation data are not good enough for this.

kirbykat · 2019-01-22T11:50:06Z

@xrotwang @SimonGreenhill - I know you are not asking me, but I think I would vote for glottocodes at tips. Originally I had tried to link xd_ids, which is technically more accurate in cases where a tree tip is labeled with a language but the word lists linked to that tip were actually collected from a dialect/sub-group of the language that is a better match for one D-PLACE society than for another. However, I now think this is too hard to maintain: society-language links are constantly being updated, etc., so xd_id-to-tip matches would similarly need to be updated.

Ideally, tree tips should be the highest-resolution glottocode possible (so, dialect rather than language, if appropriate).

Then, for each tip on a tree, I envision a tip ---> society matching procedure that would go something like:

1. Given the glottocode with which this tip is labeled, is there one or more direct D-PLACE society matches? If yes, great. Match(es) made.
2. If NO, then EITHER
   2a. If the answer to (1) is no, AND if that tip glottocode was a dialect-level glottocode, AND if there is no other tip in the tree that is labeled with a dialect of the same parent language that can be matched to a D-PLACE society, THEN the node above the dialect-level split becomes the "new tip". (This node would receive the parent glottocode of the dialect-labeled tip, retrieved using a look-up table if not already embedded in the tree). Repeat step 1.
   OR
   2b. If the answer to (1) is no, AND if the tip's glottocode was already a language-level glottocode, then it's possible that there are D-PLACE societies that have dialect-level glottocodes that should be matched to the language-level tip (i.e., the language of the tip is their parent). SO, after retrieving the language-level glottocode of all societies in D-PLACE that are linked to a dialect-level glottocode, repeat Step 1 within those language-level glottocodes.
3. If Step 2 does not result in a D-PLACE society match for a given tip, there is no match.
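One possible reading of the steps above, sketched in Python. This is an interpretation, not existing D-PLACE code: `match_tip`, the data structures and the glottocodes are all hypothetical, and "can be matched" in step 2a is taken to mean a direct match of the sibling dialect tip.

```python
def match_tip(tip, tree_tips, societies_by_glottocode, parent_language):
    """Return the D-PLACE societies matched to one tip (empty list: no match).

    societies_by_glottocode: glottocode -> list of society IDs
    parent_language: dialect-level glottocode -> language-level glottocode
    """
    # Step 1: direct match on the tip's own glottocode.
    direct = societies_by_glottocode.get(tip, [])
    if direct:
        return list(direct)

    parent = parent_language.get(tip)
    if parent is not None:
        # Step 2a: dialect-level tip. Move up to the parent language only if
        # no sibling dialect tip in the tree already has a society match.
        sibling_matched = any(
            other != tip
            and parent_language.get(other) == parent
            and societies_by_glottocode.get(other)
            for other in tree_tips
        )
        if sibling_matched:
            return []
        return list(societies_by_glottocode.get(parent, []))

    # Step 2b: language-level tip. Collect societies linked to its dialects.
    matches = []
    for glottocode, societies in societies_by_glottocode.items():
        if parent_language.get(glottocode) == tip:
            matches.extend(societies)
    return matches  # Step 3: an empty list means there is no match.


# Made-up glottocodes and society IDs, for illustration only.
parent_language = {"dial1111": "lang1111", "dial2222": "lang1111", "dial3333": "lang3333"}
societies_by_glottocode = {"lang1111": ["SocA"], "dial3333": ["SocB"], "lang5555": ["SocC"]}
tree_tips = ["dial1111", "lang3333", "lang5555", "lang9999"]

matches = {
    tip: match_tip(tip, tree_tips, societies_by_glottocode, parent_language)
    for tip in tree_tips
}
print(matches)
```

The function returns all candidate societies for a tip; choosing among several candidates is left to the caller, in line with the considerations that follow.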

Some considerations:

What to do when there is more than one D-PLACE society that could be matched to a tip? Randomly select one? Select the society with non-missing value(s) for the trait(s) you are interested in? (I'm sure there is already a system in place for this -- I know Cara had a system for dealing with this in her analyses).

xrotwang · 2019-01-22T12:21:17Z

@kirbykat sorry for the wording. I guess what I struggle with is the public role xd_id should play. I.e. if it is well thought through we should probably make its existence more visible. Presumably, I'm to blame for hiding it in the web app, because I didn't fully wrap my head around it yet - or thought that there are already enough identifiers for societies.

xrotwang · 2019-01-22T12:29:39Z

As far as I can tell, most of this complexity comes only into the picture, when trying to compare societies across society/data sets, right? And I think what @kirbykat describes is totally appropriate:

> I know Cara had a system for dealing with this in her analyses

I.e. D-PLACE - the dataset - should stay away from decisions that may only make full sense within a particular analysis.

So, since most (all?) phylogenies in D-PLACE come from linguistic data, mapping tree tips to most specific Glottocode would make sense. I'm not sure, though, this would fulfil the requirement of making the phylogenies as easy to use as possible, because there's still an analysis-specific mapping process needed.

xrotwang · 2019-01-22T12:38:58Z

So I guess what I meant with "half-cooked" is that xd_id seems somewhat analysis-specific to me.

kirbykat · 2019-01-22T12:58:56Z

@xrotwang No worries re: wording! Here is my visualized procedure for tip mapping. Maybe obvious from above, and maybe there are better ways. But in case not clear, hopefully this illustrates what I mean.

xrotwang · 2019-01-22T13:09:55Z

@kirbykat I think I understand what you mean (and I think I have seen code implementing this algorithm in the old app :) ). This algorithm already assumes a chosen set of D-PLACE societies to pick from, though. But whether this set is all D-PLACE societies or one D-PLACE society set is the question where xd_id comes in. And I would argue, that this question cannot be answered without the context of a particular analysis/research question.

So while I think it would be a good idea, to implement this algorithm as an example in the D-PLACE cookbook, I don't think it should be used to create any "easy-to-use" representation of a phylogeny.

kirbykat · 2019-01-22T13:14:59Z

@xrotwang Yes, I agree. Actually I don't even think xd_id is relevant in the case you mention above (if I understand right), but I agree society set is critical. I would say xd_id is most useful (1) for updating language matches (I use it for this all the time: I can apply one update to all matched societies, instead of updating the language match of EACH corresponding society in each society set) and (2) for joining (merging?) data across sets. For example, I have a subsistence dataset I have been working with, for which I chose [all EA societies] + [any Binford societies with an xd_id not in the EA] + [any WNAI societies with an xd_id not in the EA or Binford]. (The SCCS is entirely contained within the EA, so not relevant in this example).
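The selection described in that example (all of the first set, then only societies whose xd_id has not appeared in an earlier set) can be sketched as below. The function name and all IDs are made up; societies are represented as (society ID, xd_id) pairs for brevity.

```python
def combine_society_sets(society_sets):
    """Combine society sets in priority order, skipping xd_ids seen in earlier sets.

    society_sets: list of lists of (society_id, xd_id) pairs, highest priority first.
    """
    seen_xd_ids = set()
    combined = []
    for societies in society_sets:
        new_xd_ids = set()
        for society_id, xd_id in societies:
            if xd_id not in seen_xd_ids:
                combined.append(society_id)
                new_xd_ids.add(xd_id)
        # Update only after the whole set is processed, so that several
        # societies sharing an xd_id within one set are all kept.
        seen_xd_ids |= new_xd_ids
    return combined


# Made-up society IDs and xd_ids, for illustration only.
ea = [("EA1", "xd1"), ("EA2", "xd2")]
binford = [("B1", "xd1"), ("B2", "xd3"), ("B3", "xd3")]
wnai = [("W1", "xd3"), ("W2", "xd4")]

combined = combine_society_sets([ea, binford, wnai])
print(combined)
```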

kirbykat · 2019-01-22T13:17:22Z

@xrotwang re: deciding on which society set to link to the tree, I agree this is question dependent!

xrotwang · 2019-01-22T13:30:22Z

So basically xd_id provides a more fine-grained partition of the set of all D-PLACE societies than glottocode does; but whether that can be exploited depends on the context. E.g. in some context same xd_id can mean "pretty much the same", in others not so much.

xrotwang · 2019-01-22T13:32:56Z

If so, then our current treatment of xd_id seems fine:

Use it in dplace-data for curation tasks - e.g. updating language matches across society sets.
Use it in the web app to create rather generic "see also" links to societies with the same xd_id.

kirbykat · 2019-01-22T13:53:12Z

@xrotwang Well, yes to "a more fine-grained partition of the set than glottocode", but xd_id is our best estimate of "equivalent societies" among society sets, so as long as YEAR is the same for both data points, a single xd_id should refer to "pretty much the same" society.

The tricky cases (which are rare, fortunately!) are the "one-to-many equivalent societies" across society sets. This is a result of how different authors "lumped" or "split" cultural units.

In these rare "one-to-many equivalent societies" cases, someone interested in joining data from different society sets has to decide how they will combine the multiple estimates. For example, in the case of the blue "xd_id=A" circles in the figure above, if I get subsistence data from Murdock's EA, and want to join that with information on hunting group size from Binford, I will have to decide which of two possible estimates of hunting group size I will use. Best case scenario the two estimates are the same (both have hunting groups of size "5" persons), in which case no need to think more about it.

Also, as mentioned earlier, the other key consideration when joining data for a single xd_id is YEAR of observation. Typically you want the years for different data points to be as similar as possible.
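As a sketch of what such a join might look like in practice, the snippet below pairs records from two society sets by xd_id and reports the gap in observation years, without picking a winner in one-to-many cases. Field names, the helper name and the example records are assumptions for illustration.

```python
def join_by_xd_id(left, right):
    """Pair records sharing an xd_id; return (left society, right society, year gap)."""
    right_by_xd_id = {}
    for row in right:
        right_by_xd_id.setdefault(row["xd_id"], []).append(row)

    joined = []
    for left_row in left:
        for right_row in right_by_xd_id.get(left_row["xd_id"], []):
            gap = abs(left_row["year"] - right_row["year"])
            joined.append((left_row["society"], right_row["society"], gap))
    # Closest years first within each left-hand society.
    return sorted(joined, key=lambda item: (item[0], item[2]))


# Made-up records: one EA society with two Binford candidates for the same xd_id.
ea_rows = [{"xd_id": "xdA", "society": "EA_1", "year": 1900}]
binford_rows = [
    {"xd_id": "xdA", "society": "B_1", "year": 1880},
    {"xd_id": "xdA", "society": "B_2", "year": 1905},
]

candidates = join_by_xd_id(ea_rows, binford_rows)
print(candidates)
```

Listing all candidates, rather than selecting one, keeps the analysis-specific decision with the person doing the analysis.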

kirbykat · 2019-01-22T13:54:57Z

@xrotwang - Yes, I agree with your last comment!

xrotwang · 2019-01-22T14:27:49Z

Yes, the cases where multiple societies in the same society set have the same xd_id are the reason I wouldn't want to call the xd_id relation "equivalence". Equivalence without transitivity is a bit weird, and D-PLACE didn't want to go as far as actually identifying these societies - and adding, maybe, a "settlements" attribute to the resulting society, listing what was considered different by the source.

kirbykat · 2019-01-22T14:30:29Z

Yes. And the problem with listing what was considered different by different sources is that often we have no idea why one chose to group units while another chose to split.

SimonGreenhill · 2020-05-26T12:39:38Z

@kirbykat -- there's lots of useful stuff in here. Can you write this up in a document somewhere please?

I'm going to close this issue for now as I think it's easy to mess up and no-one is really asking for this right now.

SimonGreenhill self-assigned this Jan 17, 2019

SimonGreenhill closed this as completed May 26, 2020
