malariagen/malariagen-data-python

malariagen_data - access MalariaGEN public data from Python

This Python package provides convenience methods for accessing public data from MalariaGEN.

Installation

The malariagen_data Python package is available from the Python Package Index (PyPI) and can be installed via pip, e.g.:

$ pip install malariagen-data
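
To check which release ended up installed, and so which of the release notes below apply, the standard library can report the version. This is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is hypothetical and is not part of malariagen_data.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="malariagen-data"):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


# Prints a version string, or None if the package is not installed.
print(installed_version())
```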

Documentation

Documentation of classes and methods in the public API is available from the following locations:

Release notes

7.0.0

  • Adds the Af1 class providing access to the Anopheles funestus Af1.0 SNP data release.
  • Adds the AnophelesDataResource superclass for the Af1 and Ag3 subclasses.
  • Adds the Pv4 and Pf7 classes for accessing data in the Plasmodium releases.
  • Adds Pv4.sample_metadata() and Pf7.sample_metadata() functions for accessing metadata.
  • Adds Pv4.variant_calls() and Pf7.variant_calls() functions for accessing variant calls.
  • Adds Pv4.genome_features() and Pf7.genome_features() functions for accessing the genome feature annotations.
  • Adds Pv4.genome_sequence() and Pf7.genome_sequence() functions for accessing the reference genome sequence.

6.1.1

@@TODO release notes

6.1.0

  • Adds cohort_size parameter to Ag3.haplotypes() function.

  • Adds Ag3.h12_calibration() and Ag3.plot_h12_calibration() functions to generate and plot h12 calibration data for different window sizes.

  • Adds Ag3.h12_gwss(), Ag3.plot_h12_gwss_track() and Ag3.plot_h12_gwss() to generate and plot h12 analyses. (GH272)

  • Fixes bug where in Google Colab notebooks, debug logging couldn't be turned off. (GH274)
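
For readers unfamiliar with the statistic behind the h12 functions above: H12 pools the frequencies of the two most common haplotypes in a window, squares that sum, and adds the squared frequencies of all remaining haplotypes. The sketch below computes it for a single window in plain Python. It is an illustration of the statistic only, not the implementation used by Ag3.h12_gwss(); the function name h12 and the example haplotypes are made up for this example.

```python
from collections import Counter


def h12(haplotypes):
    """Compute H12 from a list of hashable haplotypes (e.g. tuples of alleles)."""
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    # Haplotype frequencies, most common first.
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    # Pool the two most common haplotypes, then add squared frequencies of the rest.
    top_two = sum(freqs[:2])
    return top_two ** 2 + sum(f ** 2 for f in freqs[2:])


# Four haplotypes over three SNPs: frequencies are 0.5, 0.25 and 0.25.
window = [(0, 1, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
print(h12(window))
```

Values close to 1 indicate that one or two haplotypes dominate the window. How many SNPs go into each window affects the result, which is presumably why a calibration step over window sizes is provided.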

6.0.0

@@TODO release notes

5.1.0

  • Upgrade the default cohorts analysis in the Ag3 class to 20220608 (GH256).

5.0.1

  • Handle sample sets with missing species calls (GH251).

5.0.0

  • Upgrade the default species analysis in the Ag3 class to use a new set of improved ancestry-informative markers (AIMs). Note that this includes a small change to the values in the aim_species column of the sample metadata: the value "intermediate_arabiensis_gambiae" has been replaced by "intermediate_gambcolu_arabiensis" for consistency with AIM naming conventions used elsewhere.

  • Add Ag3.aim_calls() and Ag3.plot_aim_heatmap() functions for accessing and plotting ancestry-informative marker (AIM) genotypes (GH236).

  • Add site filters tracks to IGV via Ag3.view_alignments() (GH246).
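
Code written against earlier releases may still compare aim_species against the old value. A minimal sketch of translating such values, assuming they are held in a plain list; the helper update_aim_species and the other example values are hypothetical, and only the renamed pair comes from the note above.

```python
# Old value -> new value, as described in the 5.0.0 release note.
AIM_SPECIES_RENAMES = {
    "intermediate_arabiensis_gambiae": "intermediate_gambcolu_arabiensis",
}


def update_aim_species(values):
    """Map pre-5.0.0 aim_species values to their 5.0.0 names."""
    return [AIM_SPECIES_RENAMES.get(v, v) for v in values]


# "species_a" and "species_b" are placeholders for unaffected values.
old = ["species_a", "intermediate_arabiensis_gambiae", "species_b"]
print(update_aim_species(old))
```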

4.4.0

  • Add Ag3.aim_variants() function (GH233).

  • Enable plotting high coverage variance samples with Ag3.plot_cnv_hmm_coverage() (GH240).

4.3.1

  • Handle sample sets with missing cohorts metadata (GH235).

4.3.0

  • Add Ag3.plot_snps() to plot segregating and non-segregating SNPs and visualise site filters (GH226).

4.2.0

  • Add progress bars for longer-running computations using tqdm (GH217).

  • Add SNP genotypes to Ag3.view_alignments() (GH216).

4.1.0

  • Use PyPI for igv-notebook dependency (GH209).

  • Add sample_query parameter to Ag3.haplotypes() (GH210).

  • Add sample_query and max_coverage_variance parameters to Ag3.cnv_hmm() and Ag3.gene_cnv() (GH213).

4.0.1

  • Make igv-notebook an optional dependency while it is still only available from GitHub.

4.0.0

  • Ag3: A new pca() function has been added for performing principal components analysis (GH187).

  • Ag3: Functions plot_pca_variance() and plot_pca_coords() have been added for plotting PCA results (GH197).

  • Ag3: A new snp_allele_counts() function has been added for computing SNP allele counts, which is required for various analyses such as PCA (locating segregating variants).

  • Ag3: A new view_alignments() function has been added which creates an IGV browser in the notebook and adds a track with the sequence read alignments from a given sample (GH202). There is also an igv() function for initialising an IGV browser with just reference genome and gene tracks.

  • Ag3: The way that analysis version parameters like cohorts_analysis, species_analysis and site_filters_analysis are exposed in the API has been simplified (GH203). Now these parameters are set when the Ag3 class is instantiated, rather than at each method.

  • Ag3: A check has been added for the location of the machine from which requests are being made, and in particular to raise a warning in the case where colab allocates a VM outside the US region, which results in poor data retrieval performance (GH201).

  • Ag3: By default, bokeh is now automatically configured to output plots to the notebook (GH193).
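
The link between allele counts and segregating variants mentioned for snp_allele_counts() can be sketched in plain Python. This is an illustration of the idea only, not the library's code. It assumes genotype calls are encoded as integers with 0 for the reference allele, 1 to 3 for alternate alleles and negative values for missing calls; the function names are hypothetical.

```python
def allele_counts(genotypes, n_alleles=4):
    """Count alleles at one site; genotypes is a list of per-sample allele calls."""
    counts = [0] * n_alleles
    for call in genotypes:
        for allele in call:
            if allele >= 0:  # negative values are treated as missing calls
                counts[allele] += 1
    return counts


def is_segregating(counts):
    """A site is segregating if more than one allele is observed."""
    return sum(1 for c in counts if c > 0) > 1


site_a = [(0, 0), (0, 1), (-1, -1)]  # one alternate allele, one missing call
site_b = [(0, 0), (0, 0), (0, 0)]    # reference allele only
print(allele_counts(site_a), is_segregating(allele_counts(site_a)))
print(allele_counts(site_b), is_segregating(allele_counts(site_b)))
```

Only segregating sites carry information for an analysis such as PCA, so counting alleles first is a cheap way to discard the rest.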

3.1.0

  • Ag3: Limit docstring widths for better wrapping in colab help tabs (GH186).

  • Ag3: Return a copy of cached DataFrames so that any subsequent user modifications do not affect the cached data (GH184).

  • Ag3: Improve zooming behaviour of bokeh genome plots (GH189).

  • Ag3: Add sample identifiers to CNV HMM heatmap plots (GH191).

  • Ag3: Exclude high coverage variance samples by default in CNV HMM heatmap plots (GH178).

  • Ag3: Standardise default width of bokeh genome plots (GH174).

  • Ag3: Consistently capitalise plot labels (GH176).

  • Ag3: Tidy title for CNV HMM heatmap plots when using multiple sample sets (GH175).

  • Ag3: Fix a bug in loading of gene CNV frequencies where intermediate species samples are missing (GH183).

3.0.0

  • Added a new function Ag3.plot_cnv_hmm_coverage() which generates a bokeh plot showing normalised coverage and HMM copy number for an individual sample.

  • Added a new function Ag3.plot_cnv_hmm_heatmap() which generates a bokeh plot showing the HMM copy number for multiple samples as a heatmap.

  • Added support for accessing genome regions to the CNV data access functions Ag3.cnv_hmm(), Ag3.gene_cnv(), Ag3.gene_cnv_frequencies() and Ag3.cnv_coverage_calls() (GH113). Please use the region parameter to specify a contig or contig region. The previous contig parameter is no longer supported.

  • Added support for a region parameter to the Ag3.geneset() function.

  • Added docstrings for Ag3.plot_genes() and Ag3.plot_transcript() (GH170).

  • Set plot width and height automatically in Ag3.plot_frequencies_heatmap() based on the number of rows and columns.

2.2.0

  • Added a new function Ag3.plot_genes() which generates a bokeh plot of gene annotations (GH154).

  • Added a new function Ag3.plot_transcript() which generates a bokeh plot of a gene model (GH155).

  • Fixed a bug in the Ag3.gene_cnv_frequencies() function (GH166).

  • CI improvements (GH150).

2.1.0

  • Ag3: Add support for giving a list of contigs to the contig parameter in gene_cnv() and gene_cnv_frequencies() (GH162).

  • Ag3: Miscellaneous optimisations and documentation fixes (GH153, GH158, GH159, GH161).

2.0.0

New features and API changes

  • Ag3: New functions have been added for space-time analysis of SNP allele frequencies and gene CNV frequencies (GH143).

    • The new function plot_frequencies_time_series() creates faceted time series plots of frequencies using plotly.

    • The new function plot_frequencies_interactive_map() creates an ipyleaflet map with coloured markers representing frequencies in different cohorts, with widgets to select the variant, taxon and time period of interest.

    • The new function plot_frequencies_map_markers() supports plotting frequency markers on an existing ipyleaflet map.

    • The new function snp_allele_frequencies_advanced() computes SNP allele frequencies in a transcript of interest and returns an xarray dataset which can be used as input to space and time plotting functions.

    • The new function aa_allele_frequencies_advanced() computes amino acid substitution frequencies in a transcript of interest and returns an xarray dataset which can be used as input to space and time plotting functions.

    • The new function gene_cnv_frequencies_advanced() computes gene CNV frequencies for a given contig and returns an xarray dataset which can be used as input to space and time plotting functions.

    • The function aa_allele_frequencies() has been modified to better handle the case where SNPs at different genome positions cause the same amino acid change.

  • Ag3: The function gene_cnv_frequencies() has been modified so that each row now represents a gene and variant (amplification or deletion), and columns are cohorts (GH139). Also a new parameter drop_invariant has been added, which is True by default, meaning that only records with some evidence of copy number variation in the given cohorts are returned.

  • Ag3: Samples with high coverage variance are now removed by default when running gene_cnv_frequencies(), and this can be controlled via a new max_coverage_variance parameter (GH141). To support this, the sample_coverage_variance variable has been added to the output of the gene_cnv() function (GH128).

  • Ag3: All functions accepting a sample_sets parameter now check for the same sample set being selected more than once (GH144).

  • Ag3: The functions which plot frequencies, including plot_frequencies_heatmap(), plot_frequencies_time_series(), and plot_frequencies_interactive_map(), have been modified to use consistent labels for variants (GH145).

  • Ag3: The frequencies plotting functions now automatically set a title based on metadata from the input dataframe or dataset (GH146). The cohorts axis labels have also been moved to the bottom to make room for a title.

  • Ag3: All column names in sample metadata dataframes are now lower case, and columns starting "adm" have been renamed to start with "admin" (e.g., "adm1_ISO" has been renamed to "admin1_iso") to have consistent naming of columns and parameter values relating to administrative units (GH142).

  • Ag3: Functions cnv_hmm(), cnv_coverage_calls() and cnv_discordant_read_calls() support multiple contigs for the contig parameter and automatically concatenate datasets (GH90).
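
Code that refers to sample metadata columns by their old names needs updating for the renaming described above (GH142). A minimal sketch of the naming rule, assuming it is exactly "lower-case everything, then expand a leading adm to admin"; the helper new_column_name and the example names other than "adm1_ISO" are hypothetical.

```python
def new_column_name(old):
    """Translate a pre-2.0.0 sample metadata column name to the 2.0.0 style."""
    name = old.lower()
    # Only expand the "adm" prefix; names already starting with "admin" are left alone.
    if name.startswith("adm") and not name.startswith("admin"):
        name = "admin" + name[len("adm"):]
    return name


for old in ["adm1_ISO", "adm2_name", "Country"]:
    print(old, "->", new_column_name(old))
```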

Bug fixes, maintenance and documentation

  • Ag3: Function docstrings have been improved to document return values (GH84).

  • Ag3: Improve repr methods (GH138).

1.0.1

  • Ag3: Expose more plotting parameters through the plot_frequencies_heatmap() method (GH133).

1.0.0

New features and API changes

  • Ag3: Added support for genome regions when accessing data (GH14). N.B., the contig parameter is no longer supported; instead, use the region parameter, which can be a contig ID (e.g., "3L"), a contig region (e.g., "3L:1000000-2000000"), a gene ID (e.g., "AGAP004070"), or a list of any of the above. This affects methods including snp_sites(), site_filters(), snp_genotypes() and snp_dataset(). Contributed by Nace Kranjc.

  • Ag3: The parameters for specifying which species analysis version is used have changed (GH55). This affects species_calls(), sample_metadata(), snp_allele_frequencies() and gene_cnv_frequencies(). In most cases the default values for these parameters should be appropriate and so no changes to your code should be needed.

  • Ag3: The names of the columns in dataframes containing data related to species calling have changed to make it clearer which species calling method has been used. This affects dataframes returned by species_calls() and sample_metadata(). See GH93 for further details.

  • Ag3: The latest cohorts metadata are now automatically loaded and joined in with the sample metadata when calling sample_metadata(). See GH94 for further details.

  • Ag3: SNP effects are now automatically included in the output dataframe from snp_allele_frequencies() (GH95).

  • Ag3: Added a new sample_query parameter to methods returning frequencies to allow for making a sub-selection of samples (GH96).

  • Ag3: Added a new method aa_allele_frequencies() to return a dataframe of amino acid substitution allele frequencies (GH101).

  • Ag3: Added a new method plot_frequencies_heatmap() for creating a heatmap plot of allele frequencies (GH102).

  • Ag3: The Google Cloud Storage URL ("gs://vo_agam_release") is now the default value when instantiating the Ag3 class (GH103). So now you don't need to provide it if you are accessing data from GCS. I.e., you can just do:

import malariagen_data
ag3 = malariagen_data.Ag3()

  • Ag3: The identifiers used for data releases have been changed to use "3.0" instead of "v3", "3.1" instead of "v3.1", etc. (GH104)

  • The Ag3 and Amin1 classes have a better repr (GH111).

  • Ag3: All dataframe columns containing allele frequency values are now prefixed with "frq_" to allow for easier selection of frequency columns (GH116).

  • Ag3: When computing frequencies, automatically drop columns for cohorts below the minimum cohort size (GH118).

  • Amin1: Added support for region parameter instead of contig (GH119).

  • Ag3: The snp_sites() method no longer returns a tuple of arrays if the field parameter is not provided. Please provide an explicit field parameter or use the snp_calls() method instead (recommended).
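
The region strings introduced in this release (GH14) follow a simple "contig" or "contig:start-end" pattern. The sketch below shows one way such strings could be parsed. It is not the library's parser: the names Region, parse_region and parse_regions are hypothetical, and gene IDs such as "AGAP004070" are not resolved here because that requires the genome annotations, so this sketch would treat a gene ID as a contig name.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    contig: str
    start: Optional[int] = None
    end: Optional[int] = None


_REGION_PATTERN = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(text):
    """Parse "3L" or "3L:1000000-2000000" into a Region tuple."""
    match = _REGION_PATTERN.match(text)
    if match is None:
        # No coordinates given: treat the whole string as a contig ID.
        return Region(text)
    contig, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"region start is after end: {text!r}")
    return Region(contig, start, end)


def parse_regions(region):
    """Accept a single region string or a list of them."""
    items = [region] if isinstance(region, str) else list(region)
    return [parse_region(item) for item in items]


print(parse_regions(["3L", "3L:1000000-2000000"]))
```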

Bug fixes, maintenance and documentation

  • Ag3: Move default values for analysis parameters to constants (GH70).

  • Ag3: Check for manifest.tsv when discovering a release (GH74).

  • Ag3: Decode sample IDs when building snp_calls() dataset (GH82).

  • Ag3: Fix bug where snp_calls() could not take multiple releases for the sample_set parameter (GH85).

  • Ag3: Fix bug where the chunks parameter appeared to be ignored (GH86).

  • Support Python 3.9 (GH91).

  • Ag3: Fix pandas performance warnings (GH108).

  • Ag3: Fix bug involving inconsistent array lengths before and after computation (GH114).

  • Ag3: Fix compatibility with zarr 2.11.0 (GH129).

  • Some optimisations to speed up the test suite a bit (GH122).

0.15.0

  • Ag3: Update default cohort parameter to latest analysis (20211101).

0.14.1

  • Amin1: Bug fix to snp_calls() handling of site_mask parameter.

0.14.0

  • Adds the Amin1 class providing access to the Anopheles minimus Amin1 SNP data release.

0.12.1

  • Ag3: Bug fix to sample_cohorts().

0.12.0

  • Ag3: Update default cohort parameter to latest analysis (20210927).

  • Ag3: Reduce dataframe fragmentation and memory footprint in gene_cnv_frequencies().

0.11.0

  • Ag3: Add support for standard cohorts in the functions snp_allele_frequencies() and gene_cnv_frequencies().

0.10.0

  • Ag3: Add sample_cohorts().

0.9.0

  • Ag3: Add haplotypes() and supporting functions open_haplotypes() and open_haplotype_sites().

0.8.0

  • Ag3: Add site filter columns to dataframes returned by snp_effects() and snp_allele_frequencies().

0.7.0

  • Ag3: Rename parameter "populations" to "cohorts" to be consistent with sgkit terminology.

0.6.0

  • Ag3: Add gene_cnv() and gene_cnv_frequencies().

  • Ag3: Improvements and maintenance to snp_effects() and snp_allele_frequencies().

0.5.0

  • Ag3: Add snp_allele_frequencies().

  • Ag3: Add snp_effects().

  • Ag3: Add cnv_hmm(), cnv_coverage_calls() and cnv_discordant_read_calls().

  • Speed up test suite via caching.

  • Add configuration for pre-commit hooks.

0.4.3

  • Performance improvements for faster reading and indexing of zarr arrays.

0.4.2

  • Ag3: Bug fix and minor improvements to snp_calls().

0.4.1

  • Ag3: Explore workarounds to xarray memory issues in the snp_calls() method.

0.4.0

  • Ag3: Make public the open_genome(), open_snp_sites(), open_site_filters() and open_snp_genotypes() methods.

  • Ag3: Add the cross_metadata() method.

  • Ag3: Add site_annotations() and open_site_annotations() methods.

  • Ag3: Add the snp_calls() method.

  • Improve unit tests.

  • Improve memory usage.

0.3.1

  • Fix compatibility issue in recent fsspec/gcsfs release.

0.3.0

First release with basic functionality in the Ag3 class for accessing Ag1000G phase 3 data.

Developer setup

To get set up for development, see this video and the instructions below.

Fork and clone this repo:

$ git clone git@github.com:[username]/malariagen-data-python.git

Install poetry >=1.2.0b3, e.g.:

$ python3.7 -m pip install poetry==1.2.0b3

Create development environment:

$ cd malariagen-data-python
$ python3.7 -m poetry install

Activate development environment:

$ python3.7 -m poetry shell

Install pre-commit hooks:

$ pre-commit install

Run pre-commit checks (isort, black, blackdoc, flake8, ...) manually:

$ pre-commit run --all-files

Run tests:

$ python3.7 -m poetry run pytest -v

Release process

To create a new release...

  1. From master, create a new local branch named "vX.X.X-prep" replacing "X.X.X" with the new version number.

  2. Open pyproject.toml in a text editor and change the version property to the new version number.

  3. Open README.md in a text editor and add a new subsection in the release notes section with some information about the changes in the new release.

  4. Commit the changes to pyproject.toml and README.md, open a pull request with the title "vX.X.X release prep", and request a review.

  5. Once the PR is approved, merge to master, then create a GitHub release, using the new version number as the release tag.

  6. Back on your local system, run git pull to update your master branch and fetch remote tags, then run git checkout vX.X.X to check out the release tag locally.

  7. Run:

$ python3.7 -m poetry build
$ python3.7 -m poetry publish

You will need a PyPI username and password, and will need to be added as a maintainer of the malariagen_data package on PyPI.

N.B., release numbers should follow semantic versioning.
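
Under semantic versioning, the major number changes for incompatible API changes, the minor number for backwards-compatible new features, and the patch number for backwards-compatible bug fixes. A minimal sketch of choosing the next version number for step 1 of the release process; the helper bump is hypothetical, is not part of this repository's tooling, and handles only plain X.Y.Z versions (no pre-release suffixes).

```python
def bump(version, part):
    """Return the next version after bumping "major", "minor" or "patch"."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("6.1.1", "major"))  # incompatible API change
print(bump("5.0.1", "minor"))  # backwards-compatible feature
print(bump("6.1.0", "patch"))  # bug fix only
```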