feat(website): Create a standardized sequence name with the format: {country}/{AccessionVersion}/{date} #2246

anna-parker · 2024-07-04T15:08:41Z

resolves #1487

preview URL: https://standardize-sequence-name.loculus.org/

Summary

This adds a concatenation function to the preprocessing pod which can be used to generate the sequence display name.

theosanderson

Very nice

emmahodcroft · 2024-07-08T07:47:32Z

Just to make public my comment to Anna in a DM, I asked that this be updated so that if there isn't a date or location, the leading/trailing / is excluded, just in case that could be something that could cause problems for other processing steps (for users, not us) downstream -- some alignment/phylo programs are stupidly picky.

So, to avoid: /PP_1010101/2023 or USA/PP_1010384/
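A minimal sketch of the behaviour requested above, not the function added in this PR: `build_display_name` is a hypothetical helper that drops missing or blank components before joining, so a leading or trailing `/` cannot appear.

```python
from typing import Optional, Sequence


def build_display_name(parts: Sequence[Optional[str]], separator: str = "/") -> str:
    """Join the non-empty parts with `separator`, skipping None and blank strings.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    cleaned = [p.strip() for p in parts if p is not None and p.strip()]
    return separator.join(cleaned)


# Country missing: no leading separator.
print(build_display_name([None, "PP_1010101", "2023"]))
# Date missing: no trailing separator.
print(build_display_name(["USA", "PP_1010384", ""]))
```

Filtering before joining, rather than stripping separators afterwards, also avoids a doubled `//` if a middle component is ever empty.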

corneliusroemer · 2024-07-08T07:53:34Z

Great stuff!

Because we agreed that the sequence name is not immutable and is more of a display name, we should warn users that they must not use sequence names for anything that requires stability (such as annotations or references in publications).

Any thoughts on how we should add that if at all?

Should we prefix the display name with "Display name:" for example in the UI?

theosanderson · 2024-07-08T08:10:09Z

IMO we should aim for our plans on the display name to settle down fairly quickly and therefore we don't need to explicitly do anything in the UI, but could include something in the docs (outside this PR)

emmahodcroft · 2024-07-08T08:44:30Z

> Should we prefix the display name with "Display name:" for example in the UI?

Would second this

corneliusroemer · 2024-07-08T15:37:38Z

> IMO we should aim for our plans on the display name to settle down fairly quickly and therefore we don't need to explicitly do anything in the UI, but could include something in the docs (outside this PR)

Even if we don't change the format of the display name in the future, the content can still change, as we might curate dates etc. Maybe we'll change country display names, from Turkey to Türkiye etc. To be able to do so easily without users' workflows breaking, we should tell them explicitly that display names are not stable.

I think a middle ground is tricky here. If we don't tell people from the start that they should not use the display name as a key, they will do so.

One of the painful lessons the Nextstrain team learned is not to use sequence/strain names as primary keys. It is the intuitive thing to do, but it inevitably causes issues down the line, for example because display names don't come with uniqueness guarantees (as currently structured, ours would be unique, but that is just the current implementation and not something to rely on).
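To illustrate the point about keys, here is a small sketch under assumed names: `SequenceRecord` and `index_by_accession` are hypothetical and not part of Loculus. Records are indexed by accession version, and the display name is carried along only as a label.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class SequenceRecord:
    accession_version: str  # stable identifier, used as the key
    display_name: str       # human-readable label, may change over time


def index_by_accession(records: Iterable[SequenceRecord]) -> Dict[str, SequenceRecord]:
    """Build a lookup keyed by accession version, never by display name."""
    index: Dict[str, SequenceRecord] = {}
    for record in records:
        if record.accession_version in index:
            raise ValueError(f"duplicate accession version: {record.accession_version}")
        index[record.accession_version] = record
    return index


# Illustrative values only: the same record before and after a country rename.
before = index_by_accession([SequenceRecord("PP_1010101.1", "Turkey/PP_1010101.1/2023")])
after = index_by_accession([SequenceRecord("PP_1010101.1", "Türkiye/PP_1010101.1/2023")])
print(before.keys() == after.keys())
```

A downstream join on `accession_version` keeps working after the rename, whereas a join on `display_name` would silently lose the record.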

theosanderson · 2024-07-08T15:56:35Z

My understanding is that when we submit to NCBI we will provide the display name as the name of the sequence, so it will persist there until we have a new version. I think we should probably aim to get to a place where the display name is fixed per version (to provide consistency with NCBI). I see changeable display names as a stop gap until we sort out things like a lab-prefix and the incorporation of sample IDs.

Personally I would like to ultimately see our sequence pages focus on the display name, which I see as the human-readable sequence title.

To me the logic of saying that we need to label display name here would imply:

corneliusroemer · 2024-07-08T18:01:04Z

(we plan to submit to ENA not NCBI directly)

> My understanding is that when we submit to NCBI we will provide the display name as the name of the sequence, so it will persist there until we have a new version.

If we submit it (I'm not sure we've decided whether we will, as we haven't gone into mappings from our metadata to INSDC equivalents yet) the display name would be submitted as part of the biosample.
(Bio)Samples are unversioned AFAICT
I quickly checked what we could map a display name to, if we wanted to. It seems we would map to sample_name or sample_title.

> I think we should probably aim to get to a place where the display name is fixed per version (to provide consistency with NCBI). I see changeable display names as a stop gap until we sort out things like a lab-prefix and the incorporation of sample IDs.

Consistency is impossible because biosample has no version
Display name is generated by preprocessing. We can't promise it won't change within a version because we explicitly decided to allow changing of metadata without requiring a version bump.
If we change displayName format by changing prepro algorithm, it will change all displayNames for all versions.

> Personally I would like to ultimately see our sequence pages focus on the display name, which I see as the human-readable sequence title.

I agree that adding the "Display Name:" prefix isn't pretty and that we should focus on the display name on seqDetails rather than the accession.

What I want to avoid is failing to say up front that the displayName in particular isn't stable. We should heavily discourage its use for any programmatic purposes.

I'm also not suggesting we never stabilize the display name format. But right now it isn't stable, and so we need to make that clear, since on the seqDetails page we don't display the display name as normal metadata, i.e. it has no label, in contrast to other metadata fields.

I think in this regard there's a big difference between Nextstrain and Loculus/Pathoplexus. We don't need to show "displayName" in Auspice, because Auspice is an end result: it's not used to look up sample metadata and its output is not (usually) ingested by downstream analysis.

It's easy to stabilize something, but it's hard to destabilize a property others rely on. I'm not saying we must prefix with "Display Name:" - as long as we state clearly to downstream users that the strain name is not stable and can change over time.

Regarding consistency with INSDC: this is broken from the start, or are you suggesting we will not generate display names for ingested data and use the biosample title instead? It would look like this: Isolate 723 from Bangladeshi#2 RWS 2011-10-10.

As we can't change the title/sample name of ingested sequences, the only consistency we could ever achieve is for original, non-ingested data.

anna-parker · 2024-07-08T20:17:56Z

For some strange reason this PR increases the number of West Nile sequences by ~100. I have no clue why, but I don't want to merge until I find out. For future:

vs

Update: MT125057.1 is an example of a sequence which has not aligned but is still on this branch.

anna-parker · 2024-07-09T08:21:58Z

Update: the last commit fixed the issue - something went wrong when I merged in main.

Add concatenate function to preprocessing.

a637799

anna-parker added the preview label (Triggers a deployment to argocd) Jul 4, 2024

anna-parker added 5 commits July 4, 2024 17:13

Update values.yaml

378ba28

Add display_name to INSDC header just to see what it looks like

Update processing_functions.py

3410d73

Fix little bug

Add displayNames below loculus accession.

87df918

If displayName input is None use empty string instead.

8655d9f

Make displayName italics

14a3de6

anna-parker changed the title ~~Add concatenate function to preprocessing.~~ feat(website): Create a standardized sequence name with the format: {country}/{AccessionVersion}/{date} Jul 4, 2024

anna-parker commented Jul 4, 2024

preprocessing/nextclade/src/loculus_preprocessing/prepro.py (outdated)

anna-parker added 4 commits July 5, 2024 14:46

Little config updates

b6b7e44

Function clean up

10ff373

Make values.yaml more logical: allow setting args in values.yaml to allow specification of concatenation order.

3a844aa

Add documentation.

e9f6edf

anna-parker marked this pull request as ready for review July 5, 2024 13:31

anna-parker added 3 commits July 5, 2024 15:34

Fix little bug.

1d580bd

Fix little config bug

ce01807

Fix required issue.

1024f12

anna-parker requested review from emmahodcroft, corneliusroemer and chaoran-chen July 5, 2024 15:08

theosanderson reviewed Jul 5, 2024

kubernetes/loculus/templates/_preprocessingFromValues.tpl (outdated)

theosanderson reviewed Jul 5, 2024

kubernetes/loculus/values.yaml (outdated)

theosanderson reviewed Jul 5, 2024

kubernetes/loculus/values.yaml (outdated)

theosanderson approved these changes Jul 5, 2024

theosanderson reviewed Jul 5, 2024

website/src/components/SequenceDetailsPage/getDataTableData.ts (outdated)

anna-parker added 2 commits July 8, 2024 19:44

Add suggestions

61fe732

Let concatenate function take multiple input values of the same type by specifying type in a separate argument.

7fb8f54

anna-parker added 4 commits July 8, 2024 20:11

Merge branch 'main' into standardize_sequence_names

20c1b10

Add changes in prepro I forgot to commit.

ca6904d

Fix weird else if bug (I thought I fixed this before - odd)

497654b

Add display name to header

70f9e76

Fix None vs not error.

cd1c9d8

anna-parker merged commit 0a96d23 into main Jul 9, 2024
12 checks passed

anna-parker deleted the standardize_sequence_names branch July 9, 2024 08:23
