Add insdc_submission status table to the backend #2078

Closed
anna-parker opened this issue May 31, 2024 · 3 comments · Fixed by #2146

@anna-parker
Contributor

anna-parker commented May 31, 2024

For INSDC/ENA we want to have a pod (similar to the prepro pod) that handles the submission to ENA and any issues that arise. The backend will need endpoints that give this pod the data to be submitted and that store the submission status and the new ENA accessions. This means adding 1-2 tables and 2+ endpoints to the backend.

Note that in order to submit to ENA we require:

  • a unique project (this will be linked to the group metadata) - with its own accession value (actually ENA produces 2 accessions but they can be used interchangeably)
  • a unique sample (this will be linked to the sequence metadata) - with its own accession value

Finally, we will need to submit the actual sequence data (an analysis in ENA with its own unique accession) using these two(+) accession values.

Ideally, we will create 2 tables in our postgres DB mapping:

  1. each Loculus group accession to its insdc_submission_status and its project/study accession in INSDC
  2. each Loculus sequence accession to the insdc_submission_status and sample accession for its metadata, and to the insdc_submission_status and analysis accession for its sequence data.

Probably we would like to have the status fields: PENDING, PROCESSING, COMPLETED and FAILED.
We might also want to store the number of attempts.
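
As a rough illustration of the two mappings and the status fields described above, the rows could be modelled as below. This is only a sketch: the class and field names (`GroupSubmission`, `SequenceSubmission`, `attempts`, and so on) and the placeholder accession strings are assumptions, not an agreed schema.

```kotlin
// Hypothetical model of the two proposed tables; all names are illustrative.
enum class InsdcSubmissionStatus { PENDING, PROCESSING, COMPLETED, FAILED }

// Table 1: one row per Loculus group, holding the ENA project/study accession.
data class GroupSubmission(
    val groupAccession: String,
    val status: InsdcSubmissionStatus = InsdcSubmissionStatus.PENDING,
    val projectAccession: String? = null, // set once ENA has returned it
    val attempts: Int = 0,
)

// Table 2: one row per Loculus sequence, tracking the sample (metadata) and
// the analysis (sequence data) submissions separately.
data class SequenceSubmission(
    val sequenceAccession: String,
    val sampleStatus: InsdcSubmissionStatus = InsdcSubmissionStatus.PENDING,
    val sampleAccession: String? = null,
    val analysisStatus: InsdcSubmissionStatus = InsdcSubmissionStatus.PENDING,
    val analysisAccession: String? = null,
    val attempts: Int = 0,
)

// The analysis is submitted using the project and the sample accession, so it
// can only go ahead once both earlier submissions have completed.
fun readyForAnalysis(group: GroupSubmission, seq: SequenceSubmission): Boolean =
    group.status == InsdcSubmissionStatus.COMPLETED && group.projectAccession != null &&
        seq.sampleStatus == InsdcSubmissionStatus.COMPLETED && seq.sampleAccession != null

fun main() {
    val group = GroupSubmission("group-1")
    val seq = SequenceSubmission("seq-1")
    println(readyForAnalysis(group, seq)) // false: nothing has been submitted yet

    // Placeholder accession values, not real ENA identifiers.
    val doneGroup = group.copy(status = InsdcSubmissionStatus.COMPLETED, projectAccession = "PROJECT-PLACEHOLDER")
    val doneSample = seq.copy(sampleStatus = InsdcSubmissionStatus.COMPLETED, sampleAccession = "SAMPLE-PLACEHOLDER")
    println(readyForAnalysis(doneGroup, doneSample)) // true
}
```

Keeping separate status fields for the sample and the analysis lets a failed sequence upload be retried without resubmitting the sample metadata.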

The Kotlin code should:

  • Send values in parallel (coroutines are an option for this) to the submit pod, which will upload to ENA (most likely a Python wrapper that forwards values to ENA). Use concurrency control, either database transactions (Postgres should handle this) or row-level locking, to ensure that no two processes or threads pick up the same item from the table at the same time:
```kotlin
// await() is the kotlinx.coroutines.future extension; it suspends until the CompletableFuture completes.
val response = client.sendAsync(request, HttpResponse.BodyHandlers.ofString()).await()
```
  • Make sure processing is idempotent: if the same request is sent multiple times, it should only appear in ENA once. We will probably need to handle this in the submission wrapper (i.e. the submission pod). A submission is marked as failed after a certain amount of time; before retrying a failed submission, we first check that it has not actually reached ENA (with only the passing back of the accession having failed).
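  • Add retry logic: this is where keeping track of attempts can be useful; we could implement exponential backoff.
```kotlin
import kotlin.math.pow
import kotlinx.coroutines.delay

// getRetryAttempts, processItem, markAsFailed and MAX_ATTEMPTS are placeholders
// for backend helpers that do not exist yet.
suspend fun handleRetry(id: Int, value: String) { // suspend is required because delay() suspends
    val attempts = getRetryAttempts(id)
    if (attempts < MAX_ATTEMPTS) {
        // Exponential backoff: wait 1 s, 2 s, 4 s, ... before the next attempt.
        delay((2.0.pow(attempts) * 1000).toLong())
        processItem(id, value)
    } else {
        // Out of attempts: record the failure instead of retrying again.
        markAsFailed(id)
    }
}
```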
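
To make the concurrency-control and idempotency points above more concrete, here is a minimal in-memory sketch. It is not the backend implementation: `SubmissionQueue`, `claimNext`, `retryFailed` and the accession strings are hypothetical, and a `ConcurrentHashMap` stands in for the Postgres table. In Postgres itself, the same "only one worker claims a row" guarantee could be obtained with row-level locking, for example `SELECT ... FOR UPDATE SKIP LOCKED` inside a transaction.

```kotlin
import java.util.concurrent.ConcurrentHashMap

enum class Status { PENDING, PROCESSING, COMPLETED, FAILED }

// In-memory stand-in for the status table (hypothetical; the real store would be Postgres).
class SubmissionQueue(ids: Collection<String>) {
    private val status = ConcurrentHashMap<String, Status>()

    init {
        ids.forEach { status[it] = Status.PENDING }
    }

    // Atomically move one PENDING item to PROCESSING. replace(key, old, new)
    // succeeds for exactly one caller, so two workers never claim the same item.
    fun claimNext(): String? {
        for (id in status.keys) {
            if (status.replace(id, Status.PENDING, Status.PROCESSING)) return id
        }
        return null
    }

    fun mark(id: String, newStatus: Status) {
        status[id] = newStatus
    }

    fun statusOf(id: String): Status? = status[id]
}

// Idempotent retry: findExisting is a hypothetical lookup that returns the
// accession if ENA already holds this item, or null if it does not.
fun retryFailed(
    id: String,
    queue: SubmissionQueue,
    findExisting: (String) -> String?,
    submit: (String) -> String,
): String {
    // If the earlier attempt reached ENA and only the response was lost,
    // reuse the existing accession instead of submitting a duplicate.
    val accession = findExisting(id) ?: submit(id)
    queue.mark(id, Status.COMPLETED)
    return accession
}

fun main() {
    val queue = SubmissionQueue(listOf("seq-1", "seq-2"))
    val first = queue.claimNext()
    val second = queue.claimNext()
    println(first != second)   // true: each item is claimed once
    println(queue.claimNext()) // null: nothing left to claim

    queue.mark("seq-1", Status.FAILED)
    var submissions = 0
    val accession = retryFailed(
        "seq-1",
        queue,
        findExisting = { "EXISTING-ACCESSION" },
        submit = { submissions++; "NEW-ACCESSION" },
    )
    println("$accession, resubmitted $submissions times") // EXISTING-ACCESSION, resubmitted 0 times
}
```
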
@anna-parker anna-parker self-assigned this May 31, 2024
@chaoran-chen
Member

Should the backend send data to ENA or would we again have a separate script for that? @corneliusroemer and I were thinking about the latter where a dedicated ENA upload service (see #331) would fetch data from the backend that haven't been submitted and pass the INSDC submission status and accession information back to the backend.

With this approach, instead of having dedicated INSDC accession and status columns in the database, we could more generally introduce "managed metadata": fields that are associated with each sequence and can be set by the INSDC submission service (but not directly by the submitters). The advantages of that concept are that (1) the submission service can be developed more independently and can decide what it wants to store without modifying the database schema, and (2) it can be reused for other things in the future (e.g., if we want to submit the sequences not only to INSDC but also to another service).

@anna-parker
Contributor Author

Ah, sorry if this is unclear. The idea is to have a pod (most likely running a snakemake pipeline) that wraps submissions to ENA; this issue is just a preliminary list of requirements for the backend endpoints that such a pipeline would need. I will make that clearer.

@anna-parker
Contributor Author

@chaoran-chen I like the idea of having less structured submission metadata fields to enable upload to multiple databases. But I still think it might be good to have two tables (one for sequence submission status and one for group submission status) as I think this is a common structure across databases.

Maybe I could create tables that have a submission metadata column containing a dictionary to which we can add any type of information? I do think keeping the submission status in a table (in the same way as for preprocessing) is a good design idea. Also, after submission to ENA we want to add the GenBank accession to the sequence view page, so we will still have to structure the metadata in a specific manner so that we can retrieve this value.
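
A small sketch of what such a row might look like, with a fixed status column next to a free-form metadata dictionary. The names `SubmissionRecord`, `GENBANK_ACCESSION_KEY` and the key string itself are assumptions made for illustration; the only real constraint is that the sequence view page needs one agreed key to read from.

```kotlin
enum class SubmissionStatus { PENDING, PROCESSING, COMPLETED, FAILED }

// Hypothetical row: a structured status column plus a free-form dictionary
// that the submission service can fill with whatever it needs to store.
data class SubmissionRecord(
    val accession: String,
    val status: SubmissionStatus,
    val metadata: Map<String, String> = emptyMap(),
)

// Assumed key name; any agreed key would work.
const val GENBANK_ACCESSION_KEY = "genbankAccession"

// Only expose the accession once the submission has completed.
fun genbankAccession(record: SubmissionRecord): String? =
    if (record.status == SubmissionStatus.COMPLETED) record.metadata[GENBANK_ACCESSION_KEY] else null

fun main() {
    val pending = SubmissionRecord("seq-1", SubmissionStatus.PENDING)
    // "EXAMPLE-0001" is a placeholder, not a real GenBank accession.
    val completed = pending.copy(
        status = SubmissionStatus.COMPLETED,
        metadata = mapOf(GENBANK_ACCESSION_KEY to "EXAMPLE-0001"),
    )
    println(genbankAccession(pending))   // null
    println(genbankAccession(completed)) // EXAMPLE-0001
}
```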

2 participants