Develop a new data ingest / ETL pipeline for indexing eQTL data into the new mongo database #3
Files to Index
Suggested MongoDB Schema
Here's a refined schema to capture the necessary details from these files:
Example MongoDB Document Structure
{
"study_id": "QTD000021",
"study_name": "Sample eQTL Study",
"samples": [
{
"sample_id": "sample001",
"eqtls": [
{
"molecular_trait_id": "ENSG00000187583",
"molecular_trait_object_id": "ENSG00000187583",
"chromosome": "1",
"position": 14464,
"ref": "A",
"alt": "T",
"variant": "chr1_14464_A_T",
"ma_samples": 41,
"maf": 0.109948,
"pvalue": 0.15144,
"beta": 0.25567,
"se": 0.17746,
"type": "SNP",
"aan": 42,
"r2": 382,
"gene_id": "ENSG00000187583",
"median_tpm": 0.985,
"rsid": "rs546169444",
"permuted": {
"p_perm": 0.000999001,
"p_beta": 3.3243e-12
}
}
]
}
]
}
Steps to Implement
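A minimal sketch of one possible first step: turning a row of a tab-separated summary statistics file into a typed eQTL sub-document shaped like the example above. The field names come from the example document; the column subset in `sample`, the type sets, and the helper names `row_to_eqtl` and `parse_tsv` are assumptions for illustration, and the real file headers may differ.

```python
import csv
import io

# Assumed typing of fields; names are taken from the example document.
INT_FIELDS = {"position", "ma_samples"}
FLOAT_FIELDS = {"maf", "pvalue", "beta", "se", "median_tpm"}


def row_to_eqtl(row):
    """Convert one TSV row (a dict of strings) into a typed sub-document."""
    doc = {}
    for key, value in row.items():
        if key in INT_FIELDS:
            doc[key] = int(value)
        elif key in FLOAT_FIELDS:
            doc[key] = float(value)
        else:
            doc[key] = value  # identifiers, alleles, chromosome stay strings
    return doc


def parse_tsv(text):
    """Parse tab-separated text with a header line into a list of documents."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row_to_eqtl(row) for row in reader]


# Hypothetical two-line input using values from the example document.
sample = (
    "molecular_trait_id\tchromosome\tposition\tref\talt\tvariant\tpvalue\tbeta\n"
    "ENSG00000187583\t1\t14464\tA\tT\tchr1_14464_A_T\t0.15144\t0.25567\n"
)
docs = parse_tsv(sample)
```

Keeping `chromosome` as a string matches the example document and avoids special-casing non-numeric chromosomes such as X.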
Indexing Strategy
@karatugo Focus on Mongo indexing, deployment and API development
Deployment to sandbox is in progress. I was able to run the build step successfully. The deploy step has some errors at the moment. I'll prioritise this next week.
Sandbox deployment worked with singularity commands, but while automating it I got the error below.
Fixed the above error; now working on the mongo save failed issue.
Deployment to sandbox complete.
Started a full ingestion yesterday evening. In 16h, with 2 concurrent workers, only 2 studies/19 datasets were complete.
Sent an email to Kaur asking for the schemas of the .permuted files.
Running, will check on Monday.
I realized that there's a typo in the memory setting: it should be 64G rather than 6G. Restarted.
35 studies were ingested, which seems very few.
I am testing another approach using batch sizes of 10000 in mongo.
@ala-ebi suggested using the Mongo Bulk Operations API to improve performance.
I checked that Write to MongoDB in Batch Mode already uses bulk operations.
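For reference, a minimal sketch of what batching in groups of 10000 looks like outside Spark. It assumes a pymongo-style collection object exposing `insert_many(documents, ordered=...)`; the pipeline itself writes through the Spark batch writer, so the helper names `batched` and `insert_in_batches` are hypothetical and only illustrate the idea.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_batches(collection, documents, batch_size=10000):
    """Insert documents in fixed-size batches; return the number sent.

    `collection` is assumed to behave like a pymongo Collection.
    ordered=False lets the server continue past individual failures.
    """
    inserted = 0
    for chunk in batched(documents, batch_size):
        collection.insert_many(chunk, ordered=False)
        inserted += len(chunk)
    return inserted
```

Because `documents` can be a generator, only one batch is held in memory at a time, which matters for the larger .all.tsv.gz files.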
Started another test run in SLURM. Update: I made a mistake with resource allocation. Will submit another one shortly.
Started a test run but cancelled it as the eqtl database was unable to respond.
The issue with the mongo instance is solved. Started a new test run.
Sharding is enabled. Started a new test run.
Testing for sharding in progress.
Test for sharding is okay.
Requests from DBA team & points to discuss with @sprintell
After discussing with the DBA team, we decided to run a test run until Monday.
For some files, the ingestion time limit (2 days) is not enough. I see them fail with a "Wallclock exceeded" error. My suggestion is to increase the limit to 1w per file, and perhaps adjust it based on the file name (e.g. 1w for .all.tsv.gz and 1d for .cc.tsv.gz).
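A small sketch of how the per-file limit could be chosen from the file name. The function name `slurm_time_limit` and the example file names are hypothetical; the strings use SLURM's `days-hours:minutes:seconds` time format, and the 2-day fallback is just the current limit mentioned above.

```python
def slurm_time_limit(filename):
    """Pick a wallclock limit from the file suffix.

    1 week for .all.tsv.gz, 1 day for .cc.tsv.gz, otherwise the
    current 2-day limit.
    """
    if filename.endswith(".all.tsv.gz"):
        return "7-00:00:00"
    if filename.endswith(".cc.tsv.gz"):
        return "1-00:00:00"
    return "2-00:00:00"


# Hypothetical file names, for illustration only.
limits = {
    name: slurm_time_limit(name)
    for name in ("QTD000021.all.tsv.gz", "QTD000021.cc.tsv.gz")
}
```

The returned string could then be passed to the job submission as the value of `--time`.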
Benchmarking Results:
So far ~70 studies have been ingested. I estimate their total size as ~50G (70 .cc.tsv.gz files averaging 500M, 3 .all.tsv.gz files averaging ~5G). Their total storage size in Mongo is close to 500G.
In total, we have ~300 .all.tsv.gz files and ~750 .cc.tsv.gz files. Their estimated total file size is 1.875T (300 × 5G + 750 × 500M).
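The arithmetic behind these estimates, as a short sketch. The per-file averages are the ones quoted above. The projected Mongo footprint at the end is not a figure from this thread: it is an extrapolation that assumes the observed ~50G to ~500G ratio holds for the remaining files.

```python
ALL_GB = 5.0  # average size of an .all.tsv.gz file, from the estimate above
CC_GB = 0.5   # average size of a .cc.tsv.gz file, from the estimate above


def estimated_size_gb(n_all, n_cc):
    """Estimated total file size in GB for the given file counts."""
    return n_all * ALL_GB + n_cc * CC_GB


ingested_gb = estimated_size_gb(3, 70)   # files ingested so far
total_gb = estimated_size_gb(300, 750)   # all files to ingest

# Observed ratio of Mongo storage (~500G) to input size.
expansion = 500 / ingested_gb

# Assumption: the same ratio applies to the full dataset (1T = 1000G).
projected_mongo_tb = total_gb * expansion / 1000

print(ingested_gb, total_gb, expansion, projected_mongo_tb)
```

If the ratio does hold, storage in Mongo would be roughly ten times the input size, so it may be worth confirming the available capacity with the DBA team.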
Increased the memory of each job to 4G.
Fixed an issue with Spark UI ports. Restarting again.
160 files ingested in total. Estimated ~330G of disk space (5G per .all.tsv.gz file, 500M per .cc.tsv.gz file).
In 1 week, we ingested 52 .all.tsv.gz and 35 .cc.tsv.gz files.
We need to develop a robust and scalable data ingest/ETL (Extract, Transform, Load) pipeline to facilitate the reading of eQTL (expression Quantitative Trait Loci) data from FTP sources, indexing it into a MongoDB database, and serving it via an API. This pipeline will ensure efficient data extraction, transformation, and retrieval to support downstream analysis and querying through a web service.