Add insdc_submission status table to the backend #2078

anna-parker · 2024-05-31T15:58:21Z

For INSDC/ENA we want a pod (similar to the prepro pod) that handles the submission to ENA and any issues that arise. The backend will need endpoints that give this pod the data to be submitted and that store the submission status and the new ENA accessions. This means adding 1-2 tables and 2+ endpoints to the backend.

Note that in order to submit to ENA we require:

a unique project (this will be linked to the group metadata) - with its own accession value (ENA actually produces 2 accessions, but they can be used interchangeably)
a unique sample (this will be linked to the sequence metadata) - with its own accession value
and then finally we will need to submit the actual sequence data (an analysis in ENA with its own unique accession) using these two(+) accession values.

Ideally, we will create 2 tables in our postgres DB mapping:

each loculus group accession to its insdc_submission_status and project/study accession value in insdc
each loculus sequence accession to its insdc_submission_status for the metadata and the sample accession, and to its insdc_submission_status for the sequence and the analysis accession value.

Probably we would like to have the status fields: PENDING, PROCESSING, COMPLETED and FAILED.
We might also want to store the number of attempts.
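
A minimal Kotlin sketch of the two mappings and the proposed status values, just to make the shape concrete. All type and field names (`InsdcSubmissionStatus`, `GroupSubmissionRow`, `SequenceSubmissionRow`, `canTransition`) are hypothetical and not taken from the backend, and the allowed transitions are an assumption about how the statuses might be used:

```kotlin
// Hypothetical status values, as proposed above.
enum class InsdcSubmissionStatus { PENDING, PROCESSING, COMPLETED, FAILED }

// One row per Loculus group: submission status plus the INSDC project/study accession.
data class GroupSubmissionRow(
    val groupAccession: String,
    val status: InsdcSubmissionStatus = InsdcSubmissionStatus.PENDING,
    val projectAccession: String? = null, // null until ENA returns it
    val attempts: Int = 0
)

// One row per Loculus sequence: separate statuses for the sample (metadata)
// and for the analysis (sequence data).
data class SequenceSubmissionRow(
    val sequenceAccession: String,
    val sampleStatus: InsdcSubmissionStatus = InsdcSubmissionStatus.PENDING,
    val sampleAccession: String? = null,
    val analysisStatus: InsdcSubmissionStatus = InsdcSubmissionStatus.PENDING,
    val analysisAccession: String? = null,
    val attempts: Int = 0
)

// Assumed rules: PENDING -> PROCESSING -> COMPLETED or FAILED,
// and FAILED -> PROCESSING when a submission is retried.
fun canTransition(from: InsdcSubmissionStatus, to: InsdcSubmissionStatus): Boolean =
    when (from) {
        InsdcSubmissionStatus.PENDING -> to == InsdcSubmissionStatus.PROCESSING
        InsdcSubmissionStatus.PROCESSING ->
            to == InsdcSubmissionStatus.COMPLETED || to == InsdcSubmissionStatus.FAILED
        InsdcSubmissionStatus.FAILED -> to == InsdcSubmissionStatus.PROCESSING
        InsdcSubmissionStatus.COMPLETED -> false
    }
```

Keeping the sample and analysis statuses separate would let a sequence whose sample was accepted but whose analysis failed be retried without resubmitting the sample.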

The Kotlin code

Send values in parallel (coroutines are an option for this) to the submit pod, which will upload to ENA (most likely a Python wrapper that forwards values to ENA). Use concurrency control: database transactions (Postgres should handle this) or row-level locking, to ensure that no two processes or threads pick up the same item from the table at the same time:

val response = client.sendAsync(request, HttpResponse.BodyHandlers.ofString()).await()
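
One way the row-level locking could look, as a sketch only: Postgres supports `FOR UPDATE SKIP LOCKED`, which lets concurrent workers skip rows that another transaction has already locked. The table name `insdc_sequence_submission`, its columns, and the function `claimPending` are assumptions made for illustration, not the actual schema:

```kotlin
import java.sql.Connection

// Hypothetical table and column names. The inner SELECT locks up to `?` pending rows,
// skipping any rows that a concurrent transaction has already locked.
val claimPendingSql = """
    UPDATE insdc_sequence_submission
    SET status = 'PROCESSING', attempts = attempts + 1
    WHERE sequence_accession IN (
        SELECT sequence_accession
        FROM insdc_sequence_submission
        WHERE status = 'PENDING'
        ORDER BY sequence_accession
        LIMIT ?
        FOR UPDATE SKIP LOCKED
    )
    RETURNING sequence_accession
"""

// Claims up to batchSize pending rows in a single transaction and returns their accessions.
fun claimPending(connection: Connection, batchSize: Int): List<String> {
    val previousAutoCommit = connection.autoCommit
    connection.autoCommit = false
    try {
        val claimed = mutableListOf<String>()
        connection.prepareStatement(claimPendingSql).use { statement ->
            statement.setInt(1, batchSize)
            statement.executeQuery().use { rows ->
                while (rows.next()) claimed += rows.getString("sequence_accession")
            }
        }
        connection.commit()
        return claimed
    } catch (e: Exception) {
        connection.rollback() // leave the rows as PENDING if anything goes wrong
        throw e
    } finally {
        connection.autoCommit = previousAutoCommit
    }
}
```

Because the status change happens in the same statement that takes the lock, two workers calling `claimPending` at the same time should receive disjoint sets of accessions.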

Make sure processing is idempotent: if the same request is sent multiple times, it should only appear in ENA once. We will probably need to handle this in the submission wrapper (i.e. the submission pod). We will fail a submission after a certain amount of time; if it has status FAILED, we will first check that it has not actually been submitted to ENA (with only the passing back of the accession having failed) before trying again.
Add retry logic: this is where keeping track of attempts can be useful; we could implement exponential backoff.

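import kotlin.math.pow
import kotlinx.coroutines.delay

// Retries an item with exponential backoff (1 s, 2 s, 4 s, ...) until MAX_ATTEMPTS is reached.
// delay() is a suspending function, so handleRetry has to be declared suspend.
suspend fun handleRetry(id: Int, value: String) {
    val attempts = getRetryAttempts(id)
    if (attempts < MAX_ATTEMPTS) {
        delay((2.0.pow(attempts) * 1000).toLong())
        processItem(id, value)
    } else {
        markAsFailed(id)
    }
}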
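
The check-before-retry idea from the idempotency point could be combined with the backoff roughly as follows. This is a sketch under assumptions: `lookupInEna` stands in for whatever query the submission pod would run against ENA, and `RetryDecision`, `decideRetry` and `backoffMillis` are hypothetical names:

```kotlin
// Possible outcomes when a failed submission is picked up again.
sealed interface RetryDecision
data class AlreadySubmitted(val enaAccession: String) : RetryDecision
data class RetryAfter(val delayMillis: Long) : RetryDecision
object GiveUp : RetryDecision

// Exponential backoff: 1 s, 2 s, 4 s, ... (exponent capped to avoid overflow).
fun backoffMillis(attempts: Int, baseMillis: Long = 1_000): Long =
    baseMillis * (1L shl attempts.coerceIn(0, 20))

fun decideRetry(
    sequenceAccession: String,
    attempts: Int,
    maxAttempts: Int,
    lookupInEna: (String) -> String? // hypothetical: returns the ENA accession if one exists
): RetryDecision {
    // First check whether an earlier attempt reached ENA and only the response was lost.
    val existing = lookupInEna(sequenceAccession)
    if (existing != null) return AlreadySubmitted(existing)
    return if (attempts < maxAttempts) RetryAfter(backoffMillis(attempts)) else GiveUp
}
```

Returning a decision rather than sleeping inside the function keeps the logic easy to test and leaves it to the caller (for example a coroutine using `delay`) to wait before resubmitting.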

chaoran-chen · 2024-06-01T18:37:13Z

Should the backend send data to ENA or would we again have a separate script for that? @corneliusroemer and I were thinking about the latter where a dedicated ENA upload service (see #331) would fetch data from the backend that haven't been submitted and pass the INSDC submission status and accession information back to the backend.

With that approach, instead of having dedicated INSDC accession and status columns in the database, we could, more generally, introduce "managed metadata" which are associated with each sequence and can be set by the INSDC submission service (but not directly by the submitters). The advantage of that concept is that (1) the submission service can be developed more independently and decide what it wants to store without modifying the database schema and (2) it can be re-used for other things in the future (e.g., if we want to submit the sequences not only to INSDC but also to another service).
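
To illustrate the "managed metadata" idea, here is a small in-memory sketch in Kotlin. The names `Caller` and `ManagedMetadataStore`, as well as the key `insdc_status`, are made up for this example; a real implementation would sit behind authenticated backend endpoints and a database table:

```kotlin
// Who is calling: a regular submitter or a trusted submission service.
enum class Caller { SUBMITTER, SUBMISSION_SERVICE }

// Free-form key/value metadata per sequence; the store does not know which keys exist.
class ManagedMetadataStore {
    private val bySequence = mutableMapOf<String, MutableMap<String, String>>()

    // Only a submission service may write managed metadata.
    fun set(caller: Caller, sequenceAccession: String, key: String, value: String) {
        require(caller == Caller.SUBMISSION_SERVICE) {
            "managed metadata can only be set by a submission service"
        }
        val entries = bySequence.getOrPut(sequenceAccession) { mutableMapOf() }
        entries[key] = value
    }

    // Anyone may read; a copy is returned so callers cannot modify the stored map.
    fun get(sequenceAccession: String): Map<String, String> =
        bySequence[sequenceAccession]?.toMap() ?: emptyMap()
}
```

Since the keys are not fixed, a service for another target database could store its own entries without a schema change, which is the re-use point (2) above.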

anna-parker · 2024-06-02T13:27:58Z

Ah sorry if this is unclear - this is all with the idea to have a pod (most likely with a snakemake pipeline) that will wrap submissions to ENA - this is just a preliminary list of requirements for the backend endpoints that such a pipeline would require - I will make that clearer

anna-parker · 2024-06-02T13:41:25Z

@chaoran-chen I like the idea of having less structured submission metadata fields to enable upload to multiple databases. But I still think it might be good to have two tables (one for sequence submission status and one for group submission status) as I think this is a common structure across databases.

Maybe I could create tables which have a submission metadata column which contains a dictionary that we can add any type of information to? I do think keeping the submission status in a table (in the same way as for preprocessing) is a good design idea. Also, after submission to ENA we want to add the genbank accession to the sequence view page - so we will still have to structure the metadata in a specific manner so that we can retrieve this value.

anna-parker self-assigned this May 31, 2024

anna-parker mentioned this issue Jun 26, 2024

feat(backend): Add an end-point and metadata table for results of ENA Submission #2146

Merged

7 tasks

anna-parker closed this as completed in #2146 Jul 3, 2024
