Data Load For Computational Bio & Pharma - Working Group for Rare Disease

Jacob-Barhak/RareDisease

This script loads the data used by the mocomakers meetup Computational Bio & Pharma - Working Group for Rare Disease.

The data files should be placed in the data directory. The sources are:

  • approved_compounds.txt - approved compounds for different diseases - text copied from the NIH/NCI web site
  • family_target_ligand_all_gtopdb.csv - information on compounds - downloaded from Hugging Face
  • OmicsSomaticMutationsMatrixDamaging.csv - the mutation file, which holds a medication and target pivot table indicating the level of mutation - for access to the input data files, please contact Matt Zamora via [email protected].
  • secondary-screen-dose-response-curve-parameters.csv - the response file, which holds the effect of each medication on disease and target - for access to the input data files, please contact Matt Zamora via [email protected].
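Before running the loader, it can help to confirm that all four input files are in place. The following is a minimal sketch, not part of the repository's script: the function name `missing_data_files` is hypothetical, and it assumes the `data` directory sits next to the script.

```python
from pathlib import Path

# File names as listed above. The location of the "data" directory
# relative to the working directory is an assumption.
EXPECTED_FILES = [
    "approved_compounds.txt",
    "family_target_ligand_all_gtopdb.csv",
    "OmicsSomaticMutationsMatrixDamaging.csv",
    "secondary-screen-dose-response-curve-parameters.csv",
]


def missing_data_files(data_dir="data"):
    """Return the expected input files that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```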

This Python script loads the two data tables into a SQLite database called rare_disease.db. Please note that the database calculation may need about 2 TB of free disk space. An SSD is preferred.
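As an illustration of the loading step, here is a minimal sketch of streaming a CSV file into a SQLite table using only the standard library. It is not the repository's actual code: the helper name `load_csv_to_sqlite` is hypothetical, every column is stored as TEXT for simplicity, and the file is assumed to have a header row with non-empty, unique column names. A very wide matrix file may exceed SQLite's default per-table column limit, in which case it would need reshaping before loading.

```python
import csv
import sqlite3


def quote_identifier(name):
    """Quote a table or column name for use in a SQLite statement."""
    return '"{}"'.format(name.replace('"', '""'))


def load_csv_to_sqlite(conn, csv_path, table_name):
    """Load a CSV file with a header row into a new table (all columns TEXT)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Identifiers are quoted because real column names may contain
        # spaces or punctuation.
        columns = ", ".join(quote_identifier(name) + " TEXT" for name in header)
        placeholders = ", ".join("?" for _ in header)
        table = quote_identifier(table_name)
        conn.execute("DROP TABLE IF EXISTS {}".format(table))
        conn.execute("CREATE TABLE {} ({})".format(table, columns))
        # executemany pulls rows from the reader one at a time, so the
        # whole file is never held in memory.
        conn.executemany(
            "INSERT INTO {} VALUES ({})".format(table, placeholders), reader
        )
    conn.commit()
```

Under these assumptions, opening a connection with `sqlite3.connect("rare_disease.db")` and calling `load_csv_to_sqlite(conn, "data/secondary-screen-dose-response-curve-parameters.csv", "response")` would create a `response` table from the response file.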

It also creates tables that merge and process the data from both tables. The resulting tables are:

  • response - the response data file loaded.
  • mutation_matrix_full - the mutation file loaded.
  • select_genes2 - a list of select genes that have a mutation of level 2.
  • select_mutation_matrix2 - a subset of the mutation matrix containing only genes that have a mutation of level 2.
  • mutated2 - a join of the response file and the subset of the mutation matrix containing only select genes.
  • aggregated_screen_id_gene_tissue_moa_2 - an aggregation of mutated2 that is grouped by screen_id, gene, tissue, moa and calculates some statistics of the response results.
  • select_aggregated_screen_id_gene_tissue_moa_2_0 - a subset of aggregated_screen_id_gene_tissue_moa_2 where mutation value is 0.
  • select_aggregated_screen_id_gene_tissue_moa_2_1 - a subset of aggregated_screen_id_gene_tissue_moa_2 where mutation value is 1.
  • select_aggregated_screen_id_gene_tissue_moa_2_2 - a subset of aggregated_screen_id_gene_tissue_moa_2 where mutation value is 2.
  • diff_aggregated_screen_id_gene_tissue_moa_name_family_types_family_names_approved_2_approved - a joined table containing the difference between select_aggregated_screen_id_gene_tissue_moa_2_2 and select_aggregated_screen_id_gene_tissue_moa_2_0 for each group of screen_id, gene, tissue, moa, name, family_types, family_names, and approved.
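The chain of tables above (select, join, aggregate, difference) can be sketched on toy data with an in-memory SQLite database. This is an illustration under stated assumptions, not the script's actual SQL: the long-format `mutation_long` table, the `cell_line` and `auc` columns, the shortened table names `aggregated` and `diff`, and all data values are hypothetical, and the join with compound names, families, and approval status is omitted.

```python
import sqlite3


def build_demo_pipeline(conn):
    """Build toy versions of the derived tables (hypothetical schema and data)."""
    conn.executescript("""
        CREATE TABLE response (
            screen_id TEXT, cell_line TEXT, tissue TEXT, moa TEXT, auc REAL);
        CREATE TABLE mutation_long (
            cell_line TEXT, gene TEXT, mutation INTEGER);

        INSERT INTO response VALUES
            ('S1', 'c1', 'lung', 'kinase inhibitor', 0.40),
            ('S1', 'c2', 'lung', 'kinase inhibitor', 0.60),
            ('S1', 'c3', 'lung', 'kinase inhibitor', 0.90),
            ('S1', 'c4', 'lung', 'kinase inhibitor', 1.10);
        INSERT INTO mutation_long VALUES
            ('c1', 'GENE_A', 2), ('c2', 'GENE_A', 2),
            ('c3', 'GENE_A', 0), ('c4', 'GENE_A', 0),
            ('c1', 'GENE_B', 0), ('c2', 'GENE_B', 1),
            ('c3', 'GENE_B', 0), ('c4', 'GENE_B', 0);

        -- Genes that have at least one mutation of level 2.
        CREATE TABLE select_genes2 AS
            SELECT DISTINCT gene FROM mutation_long WHERE mutation = 2;

        -- Response rows joined to the mutation level of each selected gene.
        CREATE TABLE mutated2 AS
            SELECT r.screen_id, m.gene, r.tissue, r.moa, m.mutation, r.auc
            FROM response AS r
            JOIN mutation_long AS m ON m.cell_line = r.cell_line
            WHERE m.gene IN (SELECT gene FROM select_genes2);

        -- Response statistics per group and mutation level.
        CREATE TABLE aggregated AS
            SELECT screen_id, gene, tissue, moa, mutation,
                   COUNT(*) AS n, AVG(auc) AS mean_auc
            FROM mutated2
            GROUP BY screen_id, gene, tissue, moa, mutation;

        -- Difference between mutation level 2 and level 0 within each group.
        CREATE TABLE diff AS
            SELECT a2.screen_id, a2.gene, a2.tissue, a2.moa,
                   a2.mean_auc - a0.mean_auc AS diff_mean_auc
            FROM aggregated AS a2
            JOIN aggregated AS a0
              ON a0.screen_id = a2.screen_id AND a0.gene = a2.gene
             AND a0.tissue = a2.tissue AND a0.moa = a2.moa
            WHERE a2.mutation = 2 AND a0.mutation = 0;
    """)


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    build_demo_pipeline(demo)
    for row in demo.execute("SELECT * FROM diff"):
        print(row)
    demo.close()
```

In this toy data only GENE_A has a level-2 mutation, so GENE_B is dropped at the selection step and the final table holds a single group comparing the level-2 and level-0 cell lines.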

The differences in diff_aggregated_screen_id_gene_tissue_moa_2 can be used for further analysis.

LICENSE

CC0

To the extent possible under law, Jacob Barhak has waived all copyright and related or neighboring rights to Data Load For Computational Bio & Pharma - Working Group for Rare Disease. This work is published from: Israel.
