Integrating KinFin Proteome Cluster analyses into Genome Browsing environments

genomehubs/kinfin-integration

Google Summer of Code at the Tree of Life at the Wellcome Sanger Institute

Accepting proposals for Google Summer of Code 2024

KinFin Integration

Integrating KinFin Proteome Cluster analyses into Genome Browsing environments

Analysing how gene families evolve is key to many large-scale projects in phylogenetics and genomics. Most eukaryotic species have between 15,000 and 25,000 protein-coding genes, and toolkits exist to cluster these into protein families based on sequence similarity. However, interrogating these protein families across hundreds of species (and many millions of proteins) for patterns that reveal biological processes requires new tool development. KinFin is one such tool: it takes protein clusterings derived from variants of the MCL algorithm and supports detailed interrogation of the resulting clusters. This project will update KinFin to efficiently take data from the latest versions of protein clustering toolkits such as OrthoFinder, and deliver an analysis interface to web-available genome browsing or analysis systems such as Ensembl and GenomeHubs. Much of the power of KinFin lies in its innovative visualisation approaches, and development of additional visualisations and analytic outputs will be part of the project.

KinFin plot

KinFin plot of frequencies of protein clusters from 19 species. The peak at a cluster size of ~19 identifies the likely set of one-to-one orthologues in the analysed data.
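
To illustrate what this plot summarises, here is a minimal sketch (not KinFin's own code) that counts how many clusters exist at each cluster size. It assumes a plain-text clustering file in which each line has the form `cluster_id: protein protein ...`, as in OrthoFinder's `Orthogroups.txt`; the function names, protein identifiers and toy data are hypothetical.

```python
from collections import Counter


def parse_clusters(lines):
    """Parse 'cluster_id: protein protein ...' lines into {cluster_id: [proteins]}."""
    clusters = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cluster_id, _, members = line.partition(":")
        clusters[cluster_id.strip()] = members.split()
    return clusters


def cluster_size_frequencies(clusters):
    """Map each cluster size to the number of clusters of that size."""
    return Counter(len(members) for members in clusters.values())


# Hypothetical toy input with three species (spA, spB, spC).
example_lines = [
    "OG0000000: spA|p1 spB|p1 spC|p1",
    "OG0000001: spA|p2 spB|p2 spC|p2",
    "OG0000002: spA|p3 spA|p4 spB|p3 spC|p3 spC|p4",
    "OG0000003: spB|p9",
]
frequencies = cluster_size_frequencies(parse_clusters(example_lines))
```

In this toy data the most frequent cluster size is 3, the number of species, which mirrors the peak at ~19 in the plot above. Cluster size alone does not prove one-to-one orthology; checking that each species contributes exactly one protein would be a natural next step.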

GenomeHubs integration

KinFin will be integrated into the GenomeHubs toolkit. GenomeHubs is a search-oriented collection of tools for interactive genomic data exploration, including Genomes on a Tree (GoaT), an Elasticsearch-based datastore, search engine, and reporting platform, with directly-measured or estimated values for a suite of attributes across all known species. We have extended this approach to include data on assembly features in BoaT, which has an emphasis on using BUSCO loci to allow comparison across assemblies. As we are currently working to include synteny, orthology and gene tree data, integrating KinFin will add powerful analysis and visualisation features to a system that already holds much of the data required by the existing standalone tool.

Refactoring KinFin

This project will develop a fork of KinFin with the aim of refactoring the output code, in particular to make it more compatible with web-based systems. A single KinFin run currently generates a large number of static images and files. A goal of this project is to refactor the code to produce JSON output files that can be rendered in a web-based system. This will also provide an opportunity to extend the code to parse additional clustering outputs. A pull request will be submitted to the main repository once the refactoring is complete.
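
As an illustration of the kind of JSON output described above, the following sketch serialises a cluster size distribution into a structure that a web front end could render. The field names (`title`, `x_label`, `y_label`, `data`) and the function name are assumptions made for this example; they are not an existing KinFin or GenomeHubs schema.

```python
import json


def histogram_to_json(size_counts, title):
    """Serialise {cluster_size: cluster_count} as JSON for a web-based plot.

    The schema here is hypothetical and only illustrates replacing a
    static image with renderable data.
    """
    payload = {
        "title": title,
        "x_label": "cluster size",
        "y_label": "cluster count",
        "data": [
            {"cluster_size": size, "cluster_count": count}
            for size, count in sorted(size_counts.items())
        ],
    }
    return json.dumps(payload, indent=2)


# Hypothetical counts: two clusters of size 3, one each of sizes 1 and 5.
example_json = histogram_to_json({3: 2, 1: 1, 5: 1}, "Cluster size distribution")
```

Emitting data rather than images leaves choices such as axis scaling and interactivity to the rendering layer.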

KinFin as a Service

A second goal of this project is to develop a KinFin as a Service (KaaS) system. This will allow users to upload their own data to be processed by KinFin, and then to download the results. This will allow users to run KinFin on their own data without having to install the software and will provide the backend to support the GenomeHubs integration. In practice, GenomeHubs integration only requires a subset of the full functionality to be made available in this way so KaaS features will be prioritised accordingly.
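
The upload, process and download flow described above could be modelled along the following lines. This is an in-memory sketch under stated assumptions, not a proposed design: `JobStore`, `Job`, the status names and `count_clusters` (a stand-in for a real KinFin analysis) are all hypothetical, and a real service would need persistent storage, a task queue and an HTTP layer.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    status: str = "queued"  # queued -> running -> complete | failed
    result: Optional[dict] = None
    error: Optional[str] = None


class JobStore:
    """Tracks uploaded inputs and analysis results by job id (in memory only)."""

    def __init__(self, analyse):
        self._analyse = analyse  # callable: clustering text -> result dict
        self._jobs = {}
        self._inputs = {}

    def submit(self, clustering_text):
        """Register an upload and return an id the user can poll."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job(job_id)
        self._inputs[job_id] = clustering_text
        return job_id

    def run(self, job_id):
        """Run the analysis; a real service would do this in a worker."""
        job = self._jobs[job_id]
        job.status = "running"
        try:
            job.result = self._analyse(self._inputs.pop(job_id))
            job.status = "complete"
        except Exception as exc:  # record the failure instead of crashing
            job.status = "failed"
            job.error = str(exc)

    def status(self, job_id):
        return self._jobs[job_id].status

    def download(self, job_id):
        job = self._jobs[job_id]
        if job.status != "complete":
            raise RuntimeError(f"job {job_id} is {job.status}, not complete")
        return job.result


def count_clusters(clustering_text):
    """Hypothetical stand-in for a KinFin run: counts non-empty lines."""
    lines = [line for line in clustering_text.splitlines() if line.strip()]
    return {"n_clusters": len(lines)}
```

A caller would `submit` the data, poll `status` until it reports `complete`, then call `download`. Separating submission from execution is what would let GenomeHubs request only the subset of analyses it needs.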

API integration

Core functionality will be made accessible to GenomeHubs sites via the GenomeHubs API.

Visualisation

KinFin plots will be implemented as interactive React components within the GenomeHubs UI. This will allow KinFin analyses of taxon sets to be run based on queries and selections in GenomeHubs sites such as BoaT, and the results to be linked to individual features and collections of features in other data displays. This aspect of the project offers scope for development of novel visualisations to help users gain further insight into the presented data.

Contributing

We are proposing the KinFin Integration project as a Google Summer of Code project for 2024. If you are interested in contributing to KinFin Integration, please read the information provided in the ToL+PaM GSoC 2024 Google Doc and use the information in that document to get in touch with any questions you may have.

Proposals

We will assess applications from potential GSoC contributors on the basis of the proposal. Again, see the ToL+PaM GSoC 2024 Google Doc for more, but broadly, we want to know:

  • how would you approach this project?
  • which technologies would you use and why?
  • what would be the key milestones and when would you reach them?
  • how would you ensure the sustainability of your code beyond the end of the GSoC term?

You should follow the GSoC contributor guidelines to help structure your proposal. We'd be happy to see a diagram of your suggested implementation. While we have no fixed length limit, we value the ability to identify and focus on the core elements of your proposal and to write concisely.

Resources
