Tips in phylogenies/*/{summary,posterior}.trees should be renamed into glottocodes #227
Comments
Yes. We can certainly create some easy-to-use versions of the data upon
release at least. Maybe not upon commit. This would also raise
awareness that released versions of the data are what we want people to use.
Alternatively or as a first step we can include renaming taxa via phyltr,
e.g., in the cookbook.
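As a rough illustration of what a cookbook renaming step could look like, here is a minimal sketch in plain Python rather than phyltr. It assumes the tree is available as a single unquoted Newick string and that a taxon-to-glottocode mapping has already been read from taxa.csv; the function name `rename_tips` and the glottocodes shown are made up for the example.

```python
import re


def rename_tips(newick, mapping):
    """Replace leaf labels in an unquoted Newick string using `mapping`.

    Labels that are not in `mapping` are left unchanged.
    """
    def swap(match):
        label = match.group(2)
        return match.group(1) + mapping.get(label, label)

    # A leaf label directly follows "(" or ","; internal node labels follow ")"
    # and are therefore not touched by this pattern.
    return re.sub(r"([(,]\s*)([^():,;\s]+)", swap, newick)


# Hypothetical taxa labels and glottocodes, for illustration only.
tree = "((LangA:0.1,LangB:0.2):0.3,LangC:0.4);"
mapping = {"LangA": "aaaa1234", "LangB": "bbbb1234", "LangC": "cccc1234"}
renamed = rename_tips(tree, mapping)
print(renamed)
```

Quoted labels and Nexus translate blocks are not handled here; a real implementation would more likely rely on a tree library.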
On Wed, 16 Jan 2019, 10:01, Simon J Greenhill wrote:
… The trees processed into {summary,posterior}.trees should have tips as
glottocodes. The current website has downloads with the raw taxa labels
which adds a barrier for researchers to use them.
This could be done in the import stage or at the initial commit stage. The
latter is probably better.
|
yeah, but having the data in the repository inconsistent with the website is a bit annoying. We have the original trees & names in each phylogeny's ./originals/ directory, so I think having the two imported files {summary, posterior} glottotipped makes more sense. |
We'd have to make sure the mapping in taxa.csv is injective, though.
Otherwise, updates to Glottolog may not be possible to be applied
correctly.
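A minimal sketch of such an injectivity check, assuming taxa.csv has `taxon` and `glottocode` columns (the column names and sample rows are assumptions for the example):

```python
import csv
import io
from collections import defaultdict


def non_injective(rows, key="taxon", value="glottocode"):
    """Return {glottocode: [taxa]} for glottocodes assigned to more than one taxon."""
    taxa_by_code = defaultdict(list)
    for row in rows:
        if row[value]:  # skip taxa without a glottocode
            taxa_by_code[row[value]].append(row[key])
    return {code: taxa for code, taxa in taxa_by_code.items() if len(taxa) > 1}


# Hypothetical file content; a real check would open the taxa.csv of a phylogeny.
sample = io.StringIO(
    "taxon,glottocode\n"
    "LangA,aaaa1234\n"
    "LangB,bbbb1234\n"
    "LangC,aaaa1234\n"
)
clashes = non_injective(csv.DictReader(sample))
print(clashes)
```

An empty result would mean that every glottocode is used by at most one taxon, so renamed tips stay distinct.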
On Wed, 16 Jan 2019, 12:33, Simon J Greenhill wrote:
… yeah, but having the data in the repository inconsistent to the website is
a bit annoying. We have the original trees & names in each phylogenies
./originals/ directory, so I think having the two files imported {summary,
posterior} glottotipped makes more sense.
|
Hmm. true. Perhaps pydplace's check command could check that all tips in the tree are listed in the taxa.csv and vice versa? I've realised why we haven't renamed in the past: what do we do with non-dplace tips? We should probably drop them now to make 'ease of use' better. This will require a bit of processing, but we can do this with the ete3 library. I'll work up a pull request on one language and see how it looks. |
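The two-way check between tree tips and taxa.csv could be sketched roughly as follows. This is not pydplace code: the function names are hypothetical, the tree is assumed to be an unquoted Newick string, and the taxa list is assumed to have been read from taxa.csv already.

```python
import re


def tip_labels(newick):
    """Collect leaf labels from an unquoted Newick string."""
    # Leaf labels directly follow "(" or ","; internal labels follow ")".
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def check_tips(newick, taxa):
    """Report labels present on only one side of the tree/taxa comparison."""
    tips = tip_labels(newick)
    taxa = set(taxa)
    return {
        "missing_from_taxa": sorted(tips - taxa),  # tips that would need dropping
        "missing_from_tree": sorted(taxa - tips),  # taxa rows without a tip
    }


# Hypothetical labels, for illustration only.
report = check_tips("((LangA:0.1,LangB:0.2):0.3,LangC:0.4);", ["LangA", "LangB", "LangD"])
print(report)
```

The tips listed under `missing_from_taxa` are the candidates for pruning; the pruning itself is better left to a tree library such as ete3.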
I can do the pruning, too. The web app does this as well, so I already
have code for it.
On Thu, 17 Jan 2019, 09:29, Simon J Greenhill wrote:
… Hmm. true. Perhaps a pydplace's check command could check that all tips
in the tree are listed in the taxa.csv and vice versa?
I've realised why we haven't renamed in the past, as what do we do with
non-dplace tips? we should probably drop them now to make 'ease of use'
better. This will require a bit of processing, but we can do this with the
ete3 library. I'll work up a pull request on one language and see how it
looks.
|
Probably means we should keep two variants of all tree files - original and
mapped, or processed, or whatever is a good name.
On Thu, 17 Jan 2019, 09:42, Robert Forkel wrote:
… I can do the pruning, too. The web app does this as well, so I already
have code for it.
|
There's another issue with renaming. If we rename to glottocodes, this is
potentially still not what we want, because societies with the same xd_id
may have different glottocodes. So the renamed tree tips would only be
compatible with one society set - unless the user does resolve the xd_id
mapping.
On Thu, 17 Jan 2019, 09:43, Robert Forkel wrote:
… Probably means we should keep two variants of all tree files - original
and mapped, or processed, or whatever is a good name.
|
Ok, thinking more about implementing this. I think the first step is to make the released data versions more visible (see D-PLACE/dplace2#10); these release versions and the versions on the website should have the tips renamed to glottocodes. In general, I like the idea of having renamed tips in the data repository for simplicity, but can see that it adds a maintenance headache (i.e. if I edit taxa.csv then these changes need to propagate into the tree files). Solutions:
I think I like 1 > 2 >> 3. Anything I'm missing? thoughts? |
..ok, that's weird. Your last few comments weren't showing before despite being 5 days old?! This is a problem though. Do we rename tips in all trees (incl. glottotrees) to xd_ids? |
I think the whole |
Oh and if we go for renaming per-society set, we should probably rename to society IDs, not glottocodes. |
@xrotwang @SimonGreenhill - societies with the same xd_id always have the same glottocode |
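The claim that one xd_id always maps to one glottocode is easy to verify mechanically. A minimal sketch, assuming each society is a record with `xd_id` and `glottocode` fields (the field names, IDs and glottocodes below are invented for the example):

```python
from collections import defaultdict


def xd_ids_with_multiple_glottocodes(societies):
    """Return {xd_id: [glottocodes]} for xd_ids linked to more than one glottocode."""
    codes = defaultdict(set)
    for soc in societies:
        codes[soc["xd_id"]].add(soc["glottocode"])
    return {xd: sorted(found) for xd, found in codes.items() if len(found) > 1}


# Hypothetical societies from two society sets.
societies = [
    {"id": "EA1", "xd_id": "xd1", "glottocode": "aaaa1234"},
    {"id": "B1", "xd_id": "xd1", "glottocode": "aaaa1234"},
    {"id": "EA2", "xd_id": "xd2", "glottocode": "bbbb1234"},
    {"id": "B2", "xd_id": "xd2", "glottocode": "cccc1234"},
]
violations = xd_ids_with_multiple_glottocodes(societies)
print(violations)
```

If the invariant holds for the real data, the result is empty and renamed tips are compatible with every society set.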
@xrotwang @SimonGreenhill The xd_id thing is not really half-cooked, it was the solution I came to after struggling for a long time with how to deal with one-to-many relationships between societies in different society sets, but I'm open to other solutions. How do we link one society in the EA to 2 societies in Binford, if all three have the same glottocode, and that glottocode is ALSO shared with a fourth EA society, which does NOT map to the 2 Binford societies? |
@xrotwang @SimonGreenhill - I just edited the comment/figure for clarity (not sure if you get notified about that). |
@xrotwang @SimonGreenhill - In a perfect world, this would be possible using glottolog_id + sets of geographic polygons (so, in the above example, points or polygons for Binford's xd_id=A societies would be located within the geographic boundaries of a polygon of Murdock's xd_id=A), but the reality is that our geolocation data are not good enough for this. |
@xrotwang @SimonGreenhill - I know you are not asking me, but I think I would vote for glottocodes at tips. Originally I had tried to link xd_ids, which is technically more accurate in cases where a tree tip is labeled with a language, but the word lists linked to that tip were actually collected from a dialect/sub-group of the language that is a better match for one D-PLACE society over another. However, I now think this is too hard to maintain - society-language links are constantly being updated, etc., so xd_id to tip matches would similarly need to be updated. Ideally, tree tips should be the highest-resolution glottocode possible (so, dialect rather than language, if appropriate). Then, for each tip on a tree, I envision a tip ---> society matching procedure that would go something like:
Some considerations:
|
@kirbykat sorry for the wording. I guess what I struggle with is the public role |
As far as I can tell, most of this complexity comes into the picture only when trying to compare societies across society/data sets, right? And I think what @kirbykat describes is totally appropriate:
I.e. D-PLACE - the dataset - should stay away from decisions that may only make full sense within a particular analysis. So, since most (all?) phylogenies in D-PLACE come from linguistic data, mapping tree tips to most specific Glottocode would make sense. I'm not sure, though, this would fulfil the requirement of making the phylogenies as easy to use as possible, because there's still an analysis-specific mapping process needed. |
So I guess what I meant with "half-cooked" is that |
@xrotwang No worries re: wording! Here is my visualized procedure for tip mapping. Maybe obvious from above, and maybe there are better ways. But in case not clear, hopefully this illustrates what I mean. |
@kirbykat I think I understand what you mean (and I think I have seen code implementing this algorithm in the old app :) ). This algorithm already assumes a chosen set of D-PLACE societies to pick from, though. But whether this set is all D-PLACE societies or one D-PLACE society set is the question where So while I think it would be a good idea to implement this algorithm as an example in the D-PLACE cookbook, I don't think it should be used to create any "easy-to-use" representation of a phylogeny. |
@xrotwang Yes, I agree. Actually I don't even think xd_id is relevant in the case you mention above (if I understand right), but I agree society set is critical. I would say xd_id is most useful (1) for updating language matches (I use it for this all the time - I can apply one update to all matched societies instead of updating the language match of EACH corresponding society in each society set) and (2) for joining (merging?) data across sets. For example, I have a subsistence dataset I have been working with, for which I chose [all EA societies] + [any Binford societies with an xd_id not in the EA] + [any WNAI societies with an xd_id not in the EA or Binford]. (The SCCS is entirely contained within the EA, so not relevant in this example). |
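The selection rule in that example (all EA, then Binford societies whose xd_id is not in the EA, then WNAI societies whose xd_id is in neither) can be sketched as a priority-ordered union. The function name, record layout and IDs are assumptions made for illustration:

```python
def combine_by_priority(society_sets):
    """Combine society sets given in priority order.

    A society is kept only if its xd_id does not occur in any
    higher-priority set; societies sharing an xd_id within the
    same set are all kept.
    """
    seen = set()
    combined = []
    for societies in society_sets:
        combined.extend(s for s in societies if s["xd_id"] not in seen)
        # Update only after filtering, so same-set duplicates survive.
        seen.update(s["xd_id"] for s in societies)
    return combined


# Hypothetical societies, for illustration only.
ea = [{"id": "EA1", "xd_id": "xd1"}, {"id": "EA2", "xd_id": "xd2"}]
binford = [{"id": "B1", "xd_id": "xd1"}, {"id": "B2", "xd_id": "xd3"}, {"id": "B3", "xd_id": "xd3"}]
wnai = [{"id": "W1", "xd_id": "xd3"}, {"id": "W2", "xd_id": "xd4"}]

chosen = [s["id"] for s in combine_by_priority([ea, binford, wnai])]
print(chosen)
```

Note that the one-to-many case (B2 and B3 sharing an xd_id) is passed through unchanged; deciding between such societies remains an analysis-specific choice.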
@xrotwang re: deciding on which society set to link to the tree, I agree this is question dependent! |
So basically |
If so, then our current treatment of
|
@xrotwang Well, yes to "a more fine-grained partition of the set than glottocode", but xd_id is our best estimate of "equivalent societies" among society sets, so as long as YEAR is the same for both data points, a single xd_id should refer to "pretty much the same" society. The tricky cases (which are rare, fortunately!) are the "one-to-many equivalent societies" across society sets. This is a result of how different authors "lumped" or "split" cultural units. In these rare "one-to-many equivalent societies" cases, someone interested in joining data from different society sets has to decide how they will combine the multiple estimates. For example, in the case of the blue "xd_id=A" circles in the figure above, if I get subsistence data from Murdock's EA, and want to join that with information on hunting group size from Binford, I will have to decide which of two possible estimates of hunting group size I will use. Best case scenario the two estimates are the same (both have hunting groups of size "5" persons), in which case no need to think more about it. Also, as mentioned earlier, the other key consideration when joining data for a single xd_id is YEAR of observation. Typically you want the years for different data points to be as similar as possible. |
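One possible way to handle the one-to-many case when joining on xd_id is to prefer the candidate whose year of observation is closest. This is only a sketch of one analysis-specific choice, not a recommended default; the function name, field names and values are invented for the example:

```python
def closest_in_year(target_year, candidates):
    """Pick the candidate whose observation year is nearest to target_year.

    On ties, the first candidate in the list wins.
    """
    return min(candidates, key=lambda c: abs(c["year"] - target_year))


# Hypothetical data: one EA observation and two Binford societies with the same xd_id.
ea_observation = {"soc": "EA1", "xd_id": "xdA", "year": 1900}
binford_candidates = [
    {"soc": "B1", "xd_id": "xdA", "year": 1860, "hunting_group_size": 5},
    {"soc": "B2", "xd_id": "xdA", "year": 1890, "hunting_group_size": 8},
]

match = closest_in_year(ea_observation["year"], binford_candidates)
print(match["soc"], match["hunting_group_size"])
```

Other analyses might instead average the estimates or keep both; the point is only that the choice has to be made explicitly by whoever joins the data.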
@xrotwang - Yes, I agree with your last comment! |
Yes, the cases where multiple societies in the same society set have the same |
Yes. And the problem with listing what was considered different by different sources is that often we have no idea why one chose to group units while another chose to split. |
@kirbykat -- there's lots of useful stuff in here. Can you write this up in a document somewhere please? I'm going to close this issue for now as I think it's easy to mess up and no-one is really asking for this right now. |
The trees processed into {summary,posterior}.trees should have tips as glottocodes. The current website has downloads with the raw taxa labels which adds a barrier for researchers to use them.
This could be done in the import stage or at the initial commit stage. The latter is probably better.