Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UX: Extra entries in curie_map of written TSV #514

Open
joeflack4 opened this issue Apr 7, 2024 · 4 comments
Open

UX: Extra entries in curie_map of written TSV #514

joeflack4 opened this issue Apr 7, 2024 · 4 comments

Comments

@joeflack4
Copy link
Collaborator

joeflack4 commented Apr 7, 2024

Overview

When I write SSSOM to TSV, I'd like it to only include entries in curie_map where the prefixes are actually used in 1 or more places in the mapping set. However, extra entries are appearing.

Case 1: When passing a Converter and metadata

I expected 4 entries, but got 10.

I also mentioned this in: #513. I don't necessarily mind these extra entries in there, as they are some popular and relevant namespaces. The extra ones I got were: owl, sssom, and oboInOwl, rdf, rdfs, and semapv. I'd suggest that we could possibly add some parameterization for this. Stick with the default of either including these important namespaces or not, and then a parameter to allow for the opposite. Also, IDK if this is really a sssom-py issue or a curies issue.

Case 2: When not passing metadata, but no Converter

I expected 4 entries, but got 1,547.

Possible solutions

Nico wrote:

...if the massive OBO context was used to infer prefixes, we should automatically call clean_prefix_map().

Additional details

FYI:

Results, based on various means:

  1. n=4. This is the amount of entries I want, and it is the amount that I get when using some custom code I wrote. ordo-icd11.sssom - joes ad hoc way.tsv.zip
  2. n=10. This is the amount of entries I get when first instantiating a Converter and passing that to MappingSetDataFrame. ordo-icd11.sssom - with converter.tsv.zip
  3. n=1,547. This is the amount of entries I get when instantiating a MappingSetDataFrame passing in my metadata but no Converter. ordo-icd11.sssom - no converter.tsv.zip
@matentzn
Copy link
Collaborator

matentzn commented Apr 7, 2024

  1. owl, sssom, and oboInOwl, rdf, rdfs, and semapv are all built in prefixes and are therefore added by default (especially sssom and semapv which are nearly always needed.
  2. There is a method on MappingSetPrefixMap called clean_prefix_map() which removes all prefixes from the curie map that are not used. Try to see if it gets rid of some of the buildin ones?

@joeflack4
Copy link
Collaborator Author

joeflack4 commented Apr 7, 2024

For (2), right--I forgot to do msdf.clean_prefix_map().

Do we really want the default behavior to be, that when no converter is passed that we include all of these 1,547 entries? Not to mention the related sub-issues of #513: (2) Incorrect curie_map (it leaves out prefixes that are in metadata), and (3) UX: Should automatically instantiate Converter.

My preference would be that clean_prefix_map() should be automatic, and if we want to have a parameter that adds tons of namespaces, we can add that.

@matentzn
Copy link
Collaborator

matentzn commented Apr 8, 2024

Do we really want the default behavior to be, that when no converter is passed that we include all of these 1,547 entries?

i. I think you are right. Can you update the OC to request that, if the massive OBO context was used to infer prefixes, we should automatically call clean_prefix_map()?

Incorrect curie_map (it leaves out prefixes that are in metadata),

ii. If this is true is a bug, add as action item to OC.

Should automatically instantiate Converter.

iii. This must already be the case..

@joeflack4
Copy link
Collaborator Author

joeflack4 commented Apr 8, 2024

(i) Done! (ii) Done!

(iii) It probably is, but let me rephrase: Should correctly instantiate Converter.
This is related to #513 so let me add what I mean there (this comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants