Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(website): Create a standardized sequence name with the format: {country}/{AccessionVersion}/{date} #2246

Merged
merged 20 commits into from
Jul 9, 2024

Conversation

anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Jul 4, 2024

resolves #1487

preview URL: https://standardize-sequence-name.loculus.org/

Summary

This adds a concatenation function to the preprocessing pod which can be used to generate the sequence display name.

Screenshot

image

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jul 4, 2024
@anna-parker anna-parker changed the title Add concatenate function to preprocessing. feat(website): Create a standardized sequence name with the format: {country}/{AccessionVersion}/{date} Jul 4, 2024
@anna-parker anna-parker marked this pull request as ready for review July 5, 2024 13:31
Copy link
Member

@theosanderson theosanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice

@emmahodcroft
Copy link
Member

Just to make public my comment to Anna in a DM, I asked that this be updated so that if there isn't a date or location, the leading/trailing / is excluded, just in case that could be something that could cause problems for other processing steps (for users, not us) downstream -- some alignment/phylo programs are stupidly picky.

So, to avoid: /PP_1010101/2023 or USA/PP_1010384/

@corneliusroemer
Copy link
Contributor

Great stuff!

Because we discussed that the sequence name would not be unchangeable and is more of a display name, we should warn users somehow that they must not use the sequence names for things that require stability (like annotations, or referring to it in publications)

Any thoughts on how we should add that if at all?

Should we prefix the display name with "Display name:" for example in the UI?

1 similar comment
@corneliusroemer
Copy link
Contributor

Great stuff!

Because we discussed that the sequence name would not be unchangeable and is more of a display name, we should warn users somehow that they must not use the sequence names for things that require stability (like annotations, or referring to it in publications)

Any thoughts on how we should add that if at all?

Should we prefix the display name with "Display name:" for example in the UI?

@theosanderson
Copy link
Member

IMO we should aim for our plans on the display name to settle down fairly quickly and therefore we don't need to explicitly do anything in the UI, but could include something in the docs (outside this PR)

@emmahodcroft
Copy link
Member

Should we prefix the display name with "Display name:" for example in the UI?

Would second this

@corneliusroemer
Copy link
Contributor

IMO we should aim for our plans on the display name to settle down fairly quickly and therefore we don't need to explicitly do anything in the UI, but could include something in the docs (outside this PR)

Even if we don't change the format of the display name in the future, the content can still change, as we might curate dates etc. Maybe we'll change country display names, from Turkey to Türkiye etc. To be able to do so easily without users' workflows breaking, we should tell them explicitly that display names are not stable.

I think a middle ground is tricky here. If we don't tell people from the start that they should not use the display name as a key, they will do so.

One of the painful learnings of the Nextstrain team has been to not use sequence/strain names as primary keys. The problem is that it's the intuitive thing to do, but it causes inevitable issues down the line - because display names don't come with uniqueness guarantees for example (the way ours would be structured now they would be unique, but that is just the current implementation and not something to rely on.

@theosanderson
Copy link
Member

My understanding is that when we submit to NCBI we will provide the display name as the name of the sequence, so it will persist there until we have a new version. I think we should probably aim to get to a place where the display name is fixed per version (to provide consistency with NCBI). I see changeable display names as a stop gap until we sort out things like a lab-prefix and the incorporation of sample IDs.

Personally I would like to ultimately see our sequence pages focus on the display name, which I see as the human-readable sequence title.
image

To me the logic of saying that we need to label display name here would imply:

image

@corneliusroemer
Copy link
Contributor

(we plan to submit to ENA not NCBI directly)

My understanding is that when we submit to NCBI we will provide the display name as the name of the sequence, so it will persist there until we have a new version.

  • If we submit it (I'm not sure we've decided whether we will, as we haven't gone into mappings from our metadata to INSDC equivalents yet) the display name would be submitted as part of the biosample.
  • (Bio)Samples are unversioned AFAICT
  • I quickly checked what we could map a display name to, if we wanted to. It seems we would map to sample_name or sample_title.
Google Chrome 2024-07-08 19 42 08

I think we should probably aim to get to a place where the display name is fixed per version (to provide consistency with NCBI). I see changeable display names as a stop gap until we sort out things like a lab-prefix and the incorporation of sample IDs.

  • Consistency is impossible because biosample has no version
  • Display name is generated by preprocessing. We can't promise it won't change within a version because we explicitly decided to allow changing of metadata without requiring a version bump.
  • If we change displayName format by changing prepro algorithm, it will change all displayNames for all versions.

Personally I would like to ultimately see our sequence pages focus on the display name, which I see as the human-readable sequence title.

I agree, that adding the Display Name: prefix isn't pretty and that we should focus on the display name on seqdetails rather than the accession.

What I want to avoid is that we don't say up front that the displayName in particular isn't something that's stable. We should heavily discourage it's use for any programmatic purposes.

I'm also not suggesting we never stabilize the display name format. But right now it isn't, and so we need to make that clear. As on the seqDetails page we don't display the display name as normal metadata, i.e. there's no label in contrast to other metadata field.

I think in this regard there's a big difference between Nextstrain and Loculus/Pathoplexus. We don't need to show "displayName" in Auspice, because Auspice is an end result, it's not used to look up sample metadata and it's output is not ingested by downstream analysis (usually).

It's easy to stabilize something, but it's hard to destabilize a property others rely on. I'm not saying we must prefix with "Display Name:" - as long as we state clearly to downstream users that the strain name is not stable and can change over time.

Regarding consistency with INSDC: this is broken from the start, or are you suggesting we will not generate display names for ingested data and use the biosample title instead? This would look like that: Isolate 723 from Bangladeshi#2 RWS 2011-10-10.

As we can't change the title/sample name of ingested sequences, the only consistency we could ever achieve is for original, non-ingested data.

@anna-parker
Copy link
Contributor Author

anna-parker commented Jul 8, 2024

For some strange reason this PR increases the number of west nile sequences by ~100. I have no clue why but I don't want to merge until I find out why. For future:
image
vs
image

Update: MT125057.1 is an example of a sequence which has not aligned but is still on this branch.

@anna-parker
Copy link
Contributor Author

Update: the last commit fixed the issue - there was an issue when I merged in main.

@anna-parker anna-parker merged commit 0a96d23 into main Jul 9, 2024
12 checks passed
@anna-parker anna-parker deleted the standardize_sequence_names branch July 9, 2024 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Triggers a deployment to argocd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generate standardized sequence names
4 participants