fjosefdz/iisgm-bioinfo-test

 
 


iisgm-bioinfo-test

Aptitude test for IiSGM Microbial Genomics candidates

Instructions

You are required to calculate the SNP distance between the 26 M. tuberculosis isolates provided 🦠.

The input files are in VCF 4.2 format and were obtained with GATK. All information about the parameters is included within each file.

Submission

  • You can send the assignment directly to [email protected] in whichever compressed format suits you

  • Or you can create a pull request if you are familiar with GitHub:

    • Fork this repo

    • Clone your fork

    • Upon completion, run the following commands:

      git add .
      git commit -m "done"
      git push origin master
      
    • Create Pull Request

Iteration 0 | Download repository

You can download this repository, with the data included, using the terminal (CLI) or as a .zip file

Iteration 1 | Parse VCF files to table/dataframe

All VCF files are in the data folder

Create a function to read a single VCF
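As a starting point, a reader for a single VCF could look like the minimal sketch below. It uses only the standard library and assumes a plain-text, tab-separated file; the function name `read_vcf` and the tiny example file are hypothetical and are not part of the repository.

```python
import os
import tempfile


def read_vcf(path):
    """Read one VCF file into a list of dicts, one dict per variant line."""
    columns = None
    records = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if not line or line.startswith("##"):
                continue  # skip blank lines and meta-information lines
            if line.startswith("#CHROM"):
                columns = line[1:].split("\t")  # column names without the '#'
                continue
            if columns is None:
                raise ValueError("data line found before the #CHROM header")
            record = dict(zip(columns, line.split("\t")))
            record["POS"] = int(record["POS"])  # positions are easier to sort as ints
            records.append(record)
    return records


# Tiny made-up VCF, only to show the call; it is not data from the repository.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT\t1/1\n"
    "chr1\t250\t.\tC\tT\t60\t.\t.\tGT\t0/1\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".vcf", delete=False) as tmp:
    tmp.write(example)
records = read_vcf(tmp.name)
os.remove(tmp.name)
print(records[0]["POS"], records[0]["REF"], records[0]["ALT"])
```

If a dataframe is preferred, the returned list of dicts can be passed to `pandas.DataFrame(records)`.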

Iteration 2 | Extract relevant information from parsed VCF

M. tuberculosis is a haploid organism, but these samples were called (in the variant calling step) as diploid, so you will see the usual diploid genotypes (0/0, 0/1, 1/1).

Using the relevant information, filter for the SNPs actually present in each sample. This can be a separate function.
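One possible filter is sketched below. It assumes each record is a dict with `REF`, `ALT`, `FORMAT` and a per-sample column, and it assumes that only homozygous-alternate calls (1/1) count as present. That threshold is an illustrative choice, not a rule stated in this test: deciding how to treat 0/1 calls in a haploid organism is part of the exercise. All function names and example records are hypothetical.

```python
def genotype(record, sample):
    """Return the GT value of `sample` in one parsed VCF record."""
    keys = record["FORMAT"].split(":")
    values = record[sample].split(":")
    return dict(zip(keys, values)).get("GT", "./.")


def is_snp(record):
    """True for a single-base substitution with exactly one alternate allele."""
    return len(record["REF"]) == 1 and record["ALT"] in ("A", "C", "G", "T")


def present_snps(records, sample, accepted=("1/1", "1|1")):
    """Keep SNPs whose genotype is in `accepted` (assumption: hom-alt only)."""
    return [r for r in records if is_snp(r) and genotype(r, sample) in accepted]


# Made-up records for illustration only.
records = [
    {"POS": 100, "REF": "A", "ALT": "G", "FORMAT": "GT:DP", "s1": "1/1:40"},
    {"POS": 250, "REF": "C", "ALT": "T", "FORMAT": "GT:DP", "s1": "0/1:35"},
    {"POS": 300, "REF": "G", "ALT": "A", "FORMAT": "GT:DP", "s1": "0/0:38"},
    {"POS": 410, "REF": "AT", "ALT": "A", "FORMAT": "GT:DP", "s1": "1/1:42"},
]
kept = present_snps(records, "s1")
print([r["POS"] for r in kept])
```

Passing a different `accepted` tuple makes it easy to compare a strict filter with a more permissive one.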

Iteration 3 | Combine present SNPs into a presence matrix

Merge extracted information into a matrix to keep track of relevant info such as:

  • Sample name
  • Position
  • Mutation (Reference allele and Alternate allele)

The preferred format is a binary Presence/Absence matrix
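A binary presence/absence matrix can be sketched as follows, assuming each sample has already been reduced to a set of `(position, reference allele, alternate allele)` tuples. The layout (rows are samples, columns are variants) and the name `presence_matrix` are assumptions made for this example.

```python
def presence_matrix(snps_by_sample):
    """Build a 0/1 matrix: one row per sample, one column per variant."""
    samples = sorted(snps_by_sample)
    # Union of every variant seen in any sample, sorted by position.
    variants = sorted(set().union(*snps_by_sample.values()))
    matrix = [
        [1 if variant in snps_by_sample[sample] else 0 for variant in variants]
        for sample in samples
    ]
    return samples, variants, matrix


# Made-up variant sets for three hypothetical samples.
snps_by_sample = {
    "A": {(100, "A", "G"), (250, "C", "T")},
    "B": {(100, "A", "G")},
    "C": {(400, "G", "A")},
}
samples, variants, matrix = presence_matrix(snps_by_sample)
for sample, row in zip(samples, matrix):
    print(sample, row)
```

Keeping `variants` alongside the matrix preserves the position and mutation for every column, so the 0/1 values stay traceable to the original calls.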

Iteration 4 | Calculate the SNP distance between all samples

Determine the pairwise distance between all samples
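With a presence/absence matrix, one simple definition of SNP distance is the number of variant columns in which two samples differ (a Hamming distance). The sketch below assumes that definition and a matrix with one row per sample; the function names and the example values are hypothetical.

```python
from itertools import combinations


def snp_distance(row_a, row_b):
    """Count the columns where two presence/absence rows differ."""
    return sum(a != b for a, b in zip(row_a, row_b))


def pairwise_distances(matrix):
    """Return a symmetric square matrix of distances between all rows."""
    n = len(matrix)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = snp_distance(matrix[i], matrix[j])
    return dist


# Made-up matrix: rows are samples A, B, C; columns are three variants.
matrix = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
dist = pairwise_distances(matrix)
for row in dist:
    print(row)
```

Note that if columns are keyed by position and alleles, two samples carrying different alternate alleles at the same position contribute two differences under this definition; whether that is desirable is a design decision.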

Iteration 5 | BONUS - Include INDELS

We have been using the term SNP distance, but INDELS are also useful as phylogenetic markers

Make small changes to the functions to include INDELS in the distance calculation
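One way to keep the change small is to classify each variant from its REF and ALT alleles and add a flag to the filter. The sketch below is an assumption about how that could look: the names `variant_type` and `keep_variant` are hypothetical, and treating `*`, `.` and symbolic alleles such as `<NON_REF>` as "OTHER" is an illustrative choice.

```python
def variant_type(ref, alt):
    """Classify one REF/ALT pair as SNP, INDEL, MNP or OTHER."""
    if alt in ("*", ".") or alt.startswith("<"):
        return "OTHER"  # spanning deletion, missing or symbolic allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"  # insertion or deletion relative to the reference
    return "MNP"  # same length, longer than one base


def keep_variant(ref, alt, include_indels=False):
    """Decide whether a record is kept; ALT may list several alleles."""
    wanted = {"SNP", "INDEL"} if include_indels else {"SNP"}
    # Assumption: a multi-allelic record is kept only if every allele qualifies.
    return all(variant_type(ref, a) in wanted for a in alt.split(","))


print(variant_type("A", "G"), variant_type("AT", "A"), variant_type("A", "ATT"))
print(keep_variant("AT", "A"), keep_variant("AT", "A", include_indels=True))
```

With a flag like `include_indels`, the same presence matrix and distance functions can be reused unchanged for both the SNP-only and the SNP plus INDELS calculation.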

Iteration 6 | BONUS - Represent distance in a phylogenetic tree

You can represent this distance in a dendrogram, using any method you find suitable.

To stay consistent with the matrix format and the use of Python, you are encouraged to use the SciPy library, specifically linkage and dendrogram
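A minimal sketch of that approach is shown below, assuming NumPy and SciPy are installed and that the distances are already in a symmetric square matrix. The sample names and distances are made up, and average linkage (UPGMA) is only one of several methods that could be chosen. A dendrogram built this way is a hierarchical clustering of the distances, not a full phylogenetic inference.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Made-up symmetric distance matrix for three hypothetical samples.
samples = ["A", "B", "C"]
dist = np.array(
    [
        [0, 1, 3],
        [1, 0, 2],
        [3, 2, 0],
    ],
    dtype=float,
)

# linkage expects a condensed (upper-triangle) distance vector.
condensed = squareform(dist)
Z = linkage(condensed, method="average")  # assumption: UPGMA clustering

# no_plot=True returns the tree layout without needing matplotlib.
tree = dendrogram(Z, labels=samples, no_plot=True)
print(tree["ivl"])  # leaf labels in the order they appear in the dendrogram
```

To draw the figure, drop `no_plot=True` and call `matplotlib.pyplot.show()` after `dendrogram`.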

💪
