Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop a new data ingest / ETL pipeline for indexing eQTL data into the new mongo database #3

Open
9 of 16 tasks
karatugo opened this issue Jul 10, 2024 · 37 comments
Open
9 of 16 tasks
Assignees

Comments

@karatugo
Copy link
Member

karatugo commented Jul 10, 2024

We need to develop a robust and scalable data ingest/ETL (Extract, Transform, Load) pipeline to facilitate the reading of eQTL (expression Quantitative Trait Loci) data from FTP sources, indexing it into a MongoDB database, and serving it via an API. This pipeline will ensure efficient data extraction, transformation, and retrieval to support downstream analysis and querying through a web service.

  • Scalable data ingest/ETL pipeline
  • Read from FTP sources
  • Ingest with the correct schema
  • Save to MongoDB
  • Index MongoDB - is it automatic? discuss with DBA team
  • Deploy to Sandbox
  • Fix the Sandbox bugs
  • Make ETL pipeline more efficient
  • Add some estimates for mongodb size and job requirements
  • Check writeconcern majority flag
  • Increase wallclock for .all.tsv.gz to 1w and reduce it for .cc.tsv.gz to a few hours
  • Update Confluence docs
  • Discuss prod deployment with DBA team
  • Deploy to Prod
  • Plan for API implementation
  • Implement API
@karatugo
Copy link
Member Author

Files to Index

  • QTD0000*.all.tsv.gz: Contains comprehensive eQTL data. This should be the primary source for indexing.
  • QTD0000*.cc.tsv.gz: Contains specific eQTL data (likely condition-specific or subset). Also useful for indexing.
  • QTD0000*.permuted.tsv.gz: Contains permuted eQTL data for significance testing. Useful for specific analyses but not primary indexing.

Suggested MongoDB Schema

Here's a refined schema to capture the necessary details from these files:

  1. Study Information:

    • study_id: QTD000021
    • study_name: "Sample eQTL Study"
  2. Sample Information:

    • sample_id: Auto-generated or derived from context if available?
  3. eQTL Information:

    • molecular_trait_id: Corresponding trait ID.
    • molecular_trait_object_id: Object ID for the molecular trait.
    • chromosome: Chromosome number.
    • position: Position on the chromosome.
    • ref: Reference allele.
    • alt: Alternative allele.
    • variant: Variant identifier.
    • ma_samples: Minor allele sample count.
    • maf: Minor allele frequency.
    • pvalue: P-value of the association.
    • beta: Effect size.
    • se: Standard error.
    • type: Variant type (e.g., SNP).
    • aan: Additional annotation number.
    • r2: R-squared value.
    • gene_id: Gene identifier.
    • median_tpm: Median TPM (Transcripts Per Million).
    • rsid: Reference SNP ID.
  4. Permuted eQTL Information:

    • p_perm: Permuted p-value.
    • p_beta: Permuted beta value.

Example MongoDB Document Structure

{
  "study_id": "QTD000021",
  "study_name": "Sample eQTL Study",
  "samples": [
    {
      "sample_id": "sample001",
      "eqtls": [
        {
          "molecular_trait_id": "ENSG00000187583",
          "molecular_trait_object_id": "ENSG00000187583",
          "chromosome": "1",
          "position": 14464,
          "ref": "A",
          "alt": "T",
          "variant": "chr1_14464_A_T",
          "ma_samples": 41,
          "maf": 0.109948,
          "pvalue": 0.15144,
          "beta": 0.25567,
          "se": 0.17746,
          "type": "SNP",
          "aan": 42,
          "r2": 382,
          "gene_id": "ENSG00000187583",
          "median_tpm": 0.985,
          "rsid": "rs546169444",
          "permuted": {
            "p_perm": 0.000999001,
            "p_beta": 3.3243e-12
          }
        }
      ]
    }
  ]
}

Steps to Implement

  1. Extract Data:

    • Parse QTD0000*.all.tsv.gz and QTD0000*.cc.tsv.gz to extract eQTL data.
    • Parse QTD0000*.permuted.tsv.gz to extract permuted data and merge with the main eQTL data.
  2. Transform Data:

    • Normalize data fields and structure according to the MongoDB schema.
  3. Load Data:

    • Insert the structured documents into MongoDB.
    • Ensure appropriate indexes on fields such as gene_id, chromosome, position, and variant for efficient querying.
  4. API Development:

    • Develop endpoints for querying the eQTL data based on different parameters.

Indexing Strategy

  • Create indexes on key fields for efficient retrieval:
    • gene_id
    • chromosome
    • position
    • variant
    • rsid

@karatugo
Copy link
Member Author

@karatugo Focus on Mongo indexing, deployment and API development

@karatugo
Copy link
Member Author

Deployment to sandbox is in progress. I was able to run build step successfully. Deploy step has some errors at the moment. I'll prioritise this next week.

@karatugo
Copy link
Member Author

karatugo commented Oct 24, 2024

Sandbox deployment worked with singularity commands but while automating I got the error below.

  • Fix this error and test it in sandbox
FATAL:   could not open image /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: failed to retrieve path for /nfs/public/rw/gwas/deposition/singularity_cache/eqtl-sumstats-service_72de6563bdc84abc0be38ef294c854e3dd30f56e.sif: lstat /nfs/public: no such file or directory

@karatugo
Copy link
Member Author

Fixed the above error, now working on mongo save failed issue.

@karatugo
Copy link
Member Author

Deployment to sandbox complete.

@karatugo
Copy link
Member Author

karatugo commented Oct 31, 2024

Started a full ingestion yesterday evening. In 16h, with 2 concurrent workers only 2 studies/19 datasets were complete.

  • Need to adjust accordingly:

    • number of workers
    • sbatch wallclock time
    • sbatch memory

    I'll wait until the ingestion is complete to see what we get at the end of 2 days with 2 workers and 8G mem.

@karatugo
Copy link
Member Author

Sent an email to Kaur for the schemas of .permuted files.

@karatugo
Copy link
Member Author

karatugo commented Nov 7, 2024

  • Ignored .permuted files
  • Used local file system rather than ftp protocol
    This is due to bugfixing in sandbox. There were many ftp connection problems.

@karatugo
Copy link
Member Author

karatugo commented Nov 7, 2024

  • fixed docker pull rate error in ingest script (docker-login needed)

@karatugo
Copy link
Member Author

karatugo commented Nov 7, 2024

  • fixed an issue with file last modified date in local fs, demo ingest is running

@karatugo
Copy link
Member Author

karatugo commented Nov 7, 2024

Started a full ingestion yesterday evening. In 16h, with 2 concurrent workers only 2 studies/19 datasets were complete.

* [ ]  Need to adjust accordingly:
  
  * number of workers
  * sbatch wallclock time
  * sbatch memory
  
  I'll wait until the ingestion is complete to see what we get at the end of 2 days with 2 workers and 8G mem.
  • Run 8 concurrent workers with 64G for 2 days

@karatugo
Copy link
Member Author

karatugo commented Nov 7, 2024

* [ ]  Run 8 concurrent workers with 64G for 2 days

Running, will check on monday.

@karatugo
Copy link
Member Author

karatugo commented Nov 7, 2024

I realized that there's a typo in memory, it should be 64G rather than 6G. Restarted.

@karatugo
Copy link
Member Author

35 studies were ingested which seems very few.

@karatugo
Copy link
Member Author

I test another approach using batch sizes of 10000 in mongo.

@sprintell
Copy link
Member

@ala-ebi suggested using Mongo Bulk Operations API to improve the performance.

@karatugo
Copy link
Member Author

karatugo commented Nov 13, 2024

Some results with 10k batch size after 2 days of ingestion.
image

@karatugo
Copy link
Member Author

karatugo commented Nov 13, 2024

@ala-ebi suggested using Mongo Bulk Operations API to improve the performance.

I checked that Write to MongoDB in Batch Mode already uses bulk operations.

image

@karatugo
Copy link
Member Author

karatugo commented Nov 14, 2024

Benchmarking with repartition and coalescing. 1 day. - looks like it doubles the performance and ingests ~2b row in 1 day.

image

@karatugo
Copy link
Member Author

karatugo commented Nov 20, 2024

Started another test run in SLURM.

Update. Made a mistake with resource allocation. Will submit another one shortly.

@karatugo
Copy link
Member Author

karatugo commented Nov 20, 2024

  • Update the sleep in script as 30 min
  • Give a name to wrap command
  • Convert for loop to while
  • Increase concurrent operations to 16

@karatugo
Copy link
Member Author

Started test run but cancelled it as eqtl database is unable to respond.

@karatugo
Copy link
Member Author

The issues with the mongo instance is solved. Started a new test run.

@karatugo
Copy link
Member Author

Sharding is enabled. Started new test run.

  • Check writeconcern majority flag

@karatugo
Copy link
Member Author

Testing for sharding in progress.

@karatugo
Copy link
Member Author

Test for sharding is okay.

@karatugo
Copy link
Member Author

Requests from DBA team & points to discuss with @sprintell

  • Do the collection by collection load and while loading keep checking the chunk number (it should be same on both shards)
  • Make sure the data is not beyond the available space. Remember that you have two shards in the cluster and each shard had 1Tb storage.

@karatugo
Copy link
Member Author

After discussing with DBA team, we decided to run a test run until Monday.

@karatugo
Copy link
Member Author

karatugo commented Dec 16, 2024

For some files ingestion time (2 days) is not enough. I see them failed due to "Wallclock exceeded" error. My suggestion is increase 1w per file, and adjust it based on the file name perhaps (e.g. 1w for .all.tsv.gz and 1d for .cc.tsv.gz)

  • Update wallclock

@karatugo
Copy link
Member Author

Benchmarking Results:

File Name Size Total Ingestion Time
QTS000003/QTD000039/QTD000039.cc.tsv.gz 388M 18h (2024-12-17 09:27:13,836 -- 2024-12-18 03:36:40,222)
QTS000003/QTD000037/QTD000037.cc.tsv.gz 554M 13h (2024-12-17 02:41:44,251 -- 2024-12-17 15:28:29,781)
QTS000027/QTD000514/QTD000514.all.tsv.gz 2.8G 40h (2024-12-12 15:39:15,551 -- 2024-12-14 07:36:02,734)
QTS000039/QTD000645/QTD000645.all.tsv.gz 1.6G 41h (2024-12-12 16:33:58,573 -- 2024-12-14 09:37:51,178)
QTS000003/QTD000038/QTD000038.cc.tsv.gz 312M 14h (2024-12-17 07:33:46,249 -- 2024-12-17 21:48:17,535)

@karatugo
Copy link
Member Author

So far ~70 studies ingested, I estimate their total size as ~50G (70 .cc.tsv.gz files averaging 500M, 3 .all.tsv.gz files averaging ~5G).

Their total storage size in Mongo is close to ~500G.

@karatugo
Copy link
Member Author

In total, we have ~300 .all.tsv.gz files and ~750 .cc.tsv.gz files. Estimated size is 1.875T.

@karatugo
Copy link
Member Author

Increased memory of each job to 4G.
Now we have 24 jobs with 4G mem and time estimate of 1w if file is .all.tsv.gz else 2d.

@karatugo
Copy link
Member Author

Fixed an issue with Spark UI ports. Restarting again.

@karatugo
Copy link
Member Author

karatugo commented Jan 3, 2025

160 total files ingested.
55/300 .all.tsv.gz files ingested.
105/750 .cc.tsv.gz files ingested.

estimated ~330G disk space (5G per .all.tsv.gz file, 500M per .cc.tsv.gz file)
~1.2 TB on MongoDB

@karatugo
Copy link
Member Author

karatugo commented Jan 3, 2025

In 1 week, we ingested 52 .all.tsv.gz and 35 .cc.tsv.gz files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants