This Python package provides convenience methods for accessing public data from MalariaGEN.
The malariagen_data Python package is available from the Python Package Index (PyPI) and can be installed via pip, e.g.:
$ pip install malariagen-data
Documentation of classes and methods in the public API is available from the following locations:
- Adds the Af1 class providing access to the Anopheles funestus Af1.0 SNP data release.
- Adds the AnophelesDataResource superclass for the Af1 and Ag3 subclasses.
- Adds the Pv4 and Pf7 classes for accessing data in the Plasmodium releases.
- Add Pv4.sample_metadata() and Pf7.sample_metadata() functions for accessing metadata.
- Add Pv4.variant_calls() and Pf7.variant_calls() functions for accessing variant calls.
- Add Pv4.genome_features() and Pf7.genome_features() functions for accessing the genome feature annotations.
- Add Pv4.genome_sequence() and Pf7.genome_sequence() functions for accessing the reference genome sequence.
@@TODO release notes
- Adds cohort_size parameter to Ag3.haplotypes() function.
- Adds Ag3.h12_calibration() and Ag3.plot_h12_calibration() functions to generate and plot h12 calibration data for different window sizes.
- Adds Ag3.h12_gwss(), Ag3.plot_h12_gwss_track() and Ag3.plot_h12_gwss() to generate and plot h12 analyses. (GH272)
- Fixes bug where in Google Colab notebooks, debug logging couldn't be turned off. (GH274)
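The notes above do not define h12. As a rough illustration only, the sketch below assumes the standard H12 haplotype homozygosity statistic, in which the frequencies of the two most common haplotypes in a window are pooled before squaring. The function name `h12` and the toy window are hypothetical, and this is not the implementation behind Ag3.h12_gwss().

```python
from collections import Counter


def h12(haplotypes):
    """Compute H12 for the haplotypes observed in one genomic window.

    Each haplotype can be any hashable value, e.g. a tuple of alleles.
    Hypothetical helper for illustration only.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    # Frequencies of the distinct haplotypes, most common first.
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    if len(freqs) == 1:
        return freqs[0] ** 2
    # Pool the two most common haplotypes, then add the squared remainder.
    return (freqs[0] + freqs[1]) ** 2 + sum(f ** 2 for f in freqs[2:])


# Toy window: 10 haplotypes over 3 sites, with frequencies 0.5, 0.3 and 0.2.
window = [(0, 1, 0)] * 5 + [(1, 1, 0)] * 3 + [(0, 0, 1)] * 2
print(h12(window))
```

A genome-wide scan would repeat this calculation over successive windows of SNPs, which is why the choice of window size needs calibrating.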
@@TODO release notes
- Upgrade the default cohorts analysis in the Ag3 class to 20220608 (GH256).
- Handle sample sets with missing species calls (GH251).
- Upgrade the default species analysis in the Ag3 class to use a new set of improved ancestry-informative markers (AIMs). Note that this includes a small change to the values in the aim_species column of the sample metadata: the value "intermediate_arabiensis_gambiae" has been replaced by "intermediate_gambcolu_arabiensis" for consistency with AIM naming conventions used elsewhere.
- Add Ag3.aim_calls() and Ag3.plot_aim_heatmap() functions for accessing and plotting ancestry-informative marker (AIM) genotypes (GH236).
- Add site filters tracks to IGV via Ag3.view_alignments() (GH246).
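If you have stored results or filters that still use the old aim_species value, a small mapping may be enough to bring them up to date. This is only a sketch: the helper name `update_aim_species` is hypothetical, and the other values in the example list are placeholders rather than a statement of what the column contains.

```python
# Old aim_species value -> new value, as described in the note above.
AIM_SPECIES_RENAMES = {
    "intermediate_arabiensis_gambiae": "intermediate_gambcolu_arabiensis",
}


def update_aim_species(value):
    """Return the current name for an aim_species value (hypothetical helper)."""
    # Values that were not renamed are passed through unchanged.
    return AIM_SPECIES_RENAMES.get(value, value)


# Placeholder values standing in for a column saved before the upgrade.
old_values = ["gambiae", "intermediate_arabiensis_gambiae", "arabiensis"]
new_values = [update_aim_species(v) for v in old_values]
print(new_values)
```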
- Add Ag3.aim_variants() function (GH233).
- Enable plotting high coverage variance samples with Ag3.plot_cnv_hmm_coverage() (GH240).
- Handle sample sets with missing cohorts metadata (GH235).
- Add Ag3.plot_snps() to plot segregating and non-segregating SNPs and visualise site filters (GH226).
- Add progress bars for longer-running computations using tqdm (GH217).
- Add SNP genotypes to Ag3.view_alignments() (GH216).
- Use PyPI for igv-notebook dependency (GH209).
- Add sample_query parameter to Ag3.haplotypes() (GH210).
- Add sample_query and max_coverage_variance parameters to Ag3.cnv_hmm() and Ag3.gene_cnv() (GH213).
- Make igv-notebook an optional dependency while it is still only available from GitHub.
- Ag3: A new pca() function has been added for performing principal components analysis (GH187).
- Ag3: Functions plot_pca_variance() and plot_pca_coords() have been added for plotting PCA results (GH197).
- Ag3: A new snp_allele_counts() function has been added for computing SNP allele counts, which is required for various analyses such as PCA (locating segregating variants).
- Ag3: A new view_alignments() function has been added which creates an IGV browser in the notebook and adds a track with the sequence read alignments from a given sample (GH202). There is also an igv() function for initialising an IGV browser with just reference genome and gene tracks.
- Ag3: The way that analysis version parameters like cohorts_analysis, species_analysis and site_filters_analysis are exposed in the API has been simplified (GH203). Now these parameters are set when the Ag3 class is instantiated, rather than at each method.
- Ag3: A check has been added for the location of the machine from which requests are being made, and in particular to raise a warning in the case where colab allocates a VM outside the US region, which results in poor data retrieval performance (GH201).
- Ag3: By default, bokeh is now automatically configured to output plots to the notebook (GH193).
- Ag3: Limit docstring widths for better wrapping in colab help tabs (GH186).
- Ag3: Return a copy of cached DataFrames so any subsequent user modifications do not affect the cached data (GH184).
- Ag3: Improve zooming behaviour of bokeh genome plots (GH189).
- Ag3: Add sample identifiers to CNV HMM heatmap plots (GH191).
- Ag3: Exclude high coverage variance samples by default in CNV HMM heatmap plots (GH178).
- Ag3: Standardise default width of bokeh genome plots (GH174).
- Ag3: Consistently capitalise plot labels (GH176).
- Ag3: Tidy title for CNV HMM heatmap plots when using multiple sample sets (GH175).
- Ag3: Fix a bug in loading of gene CNV frequencies where intermediate species samples are missing (GH183).
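The change to return copies of cached DataFrames (GH184) follows a common pattern: the cache keeps its own object and hands each caller a separate copy. The sketch below illustrates that pattern with the standard library only. The class `MetadataCache` and the loader are hypothetical, a list of dicts stands in for a DataFrame, and this is not the package's own code.

```python
import copy


class MetadataCache:
    """Toy cache that hands out copies rather than the cached object itself."""

    def __init__(self, loader):
        self._loader = loader  # function that builds the table for a key
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._loader(key)
        # Return a copy so callers cannot mutate the cached table.
        # With a pandas DataFrame this would be `df.copy()`.
        return copy.deepcopy(self._cache[key])


def load_table(key):
    # Hypothetical loader; a list of dicts stands in for a DataFrame.
    return [{"sample_id": "S1", "sample_set": key}]


cache = MetadataCache(load_table)
first = cache.get("demo")
first[0]["sample_id"] = "modified by user"  # user edits their own copy
second = cache.get("demo")
print(second[0]["sample_id"])  # the cached value is unaffected
```

The trade-off is an extra copy on every access, which is usually small compared with the cost of reloading the data.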
- Added a new function Ag3.plot_cnv_hmm_coverage() which generates a bokeh plot showing normalised coverage and HMM copy number for an individual sample.
- Added a new function Ag3.plot_cnv_hmm_heatmap() which generates a bokeh plot showing the HMM copy number for multiple samples as a heatmap.
- Added support for accessing genome regions to the CNV data access functions Ag3.cnv_hmm(), Ag3.gene_cnv(), Ag3.gene_cnv_frequencies() and Ag3.cnv_coverage_calls() (GH113). Please use the region parameter to specify a contig or contig region. The previous contig parameter is no longer supported.
- Added support for a region parameter to the Ag3.geneset() function.
- Added docstrings for Ag3.plot_genes() and Ag3.plot_transcript() (GH170).
- Set plot width and height automatically in Ag3.plot_frequencies_heatmap() based on the number of rows and columns.
- Added a new function Ag3.plot_genes() which generates a bokeh plot of gene annotations (GH154).
- Added a new function Ag3.plot_transcript() which generates a bokeh plot of a gene model (GH155).
- Fixed a bug in the Ag3.gene_cnv_frequencies() function (GH166).
- CI improvements (GH150).
- Ag3: Add support for giving a list of contigs to the contig parameter in gene_cnv() and gene_cnv_frequencies() (GH162).
- Ag3: Miscellaneous optimisations and documentation fixes (GH153, GH158, GH159, GH161).
- Ag3: New functions have been added for space-time analysis of SNP allele frequencies and gene CNV frequencies (GH143).
  - The new function plot_frequencies_time_series() creates faceted time series plots of frequencies using plotly.
  - The new function plot_frequencies_interactive_map() creates an ipyleaflet map with coloured markers representing frequencies in different cohorts, with widgets to select the variant, taxon and time period of interest.
  - The new function plot_frequencies_map_markers() supports plotting frequency markers on an existing ipyleaflet map.
  - The new function snp_allele_frequencies_advanced() computes SNP allele frequencies in a transcript of interest and returns an xarray dataset which can be used as input to space and time plotting functions.
  - The new function aa_allele_frequencies_advanced() computes amino acid substitution frequencies in a transcript of interest and returns an xarray dataset which can be used as input to space and time plotting functions.
  - The new function gene_cnv_frequencies_advanced() computes gene CNV frequencies for a given contig and returns an xarray dataset which can be used as input to space and time plotting functions.
  - The function aa_allele_frequencies() has been modified to better handle the case where SNPs at different genome positions cause the same amino acid change.
- Ag3: The function gene_cnv_frequencies() has been modified so that each row now represents a gene and variant (amplification or deletion), and columns are cohorts (GH139). Also a new parameter drop_invariant has been added, which is True by default, meaning that only records with some evidence of copy number variation in the given cohorts are returned.
- Ag3: Samples with high coverage variance are now removed by default when running gene_cnv_frequencies(), and this can be controlled via a new max_coverage_variance parameter (GH141). To support this, the sample_coverage_variance variable has been added to the output of the gene_cnv() function (GH128).
- Ag3: All functions accepting a sample_sets parameter now check for the same sample set being selected more than once (GH144).
- Ag3: The functions which plot frequencies, including plot_frequencies_heatmap(), plot_frequencies_time_series(), and plot_frequencies_interactive_map(), have been modified to use consistent labels for variants (GH145).
- Ag3: The frequencies plotting functions now automatically set a title based on metadata from the input dataframe or dataset (GH146). The cohorts axis labels have also been moved to the bottom to make room for a title.
- Ag3: All column names in sample metadata dataframes are now lower case, and columns starting "adm" have been renamed to start with "admin" (e.g., "adm1_ISO" has been renamed to "admin1_iso") to have consistent naming of columns and parameter values relating to administrative units (GH142).
- Ag3: Functions cnv_hmm(), cnv_coverage_calls() and cnv_discordant_read_calls() support multiple contigs for the contig parameter and automatically concatenate datasets (GH90).
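The metadata column renaming described above (GH142) can be expressed as a simple rule, which may help when updating older code that refers to the previous column names. The sketch below is an assumption about how the rule could be written, not the package's own code. The helper `normalise_column_name` is hypothetical, and apart from "adm1_ISO" the column names are made up for the example.

```python
def normalise_column_name(name):
    """Apply the renaming rule from the note above (hypothetical helper)."""
    # All column names become lower case.
    name = name.lower()
    # Columns starting "adm" are renamed to start with "admin".
    if name.startswith("adm") and not name.startswith("admin"):
        name = "admin" + name[len("adm"):]
    return name


# Only "adm1_ISO" comes from the release note; the rest are placeholders.
old_columns = ["sample_id", "adm1_ISO", "adm2_name", "Country"]
rename_map = {old: normalise_column_name(old) for old in old_columns}
print(rename_map)
```

A mapping like `rename_map` could be passed to pandas `DataFrame.rename(columns=...)` to update a dataframe saved under the old naming.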
- Ag3: Function docstrings have been improved to document return values (GH84).
- Ag3: Improve repr methods (GH138).
Ag3: Expose more plotting parameters through the plot_frequencies_heatmap() method (GH133).
- Ag3: Added support for genome regions when accessing data (GH14). N.B., the contig parameter is no longer supported; instead use the region parameter, which can be a contig ID (e.g., "3L"), a contig region (e.g., "3L:1000000-2000000"), a gene ID ("AGAP004070"), or a list of any of the above. This affects methods including snp_sites(), site_filters(), snp_genotypes() and snp_dataset(). Contributed by Nace Kranjc.
- Ag3: The parameters for specifying which species analysis version is used have changed (GH55). This affects species_calls(), sample_metadata(), snp_allele_frequencies() and gene_cnv_frequencies(). In most cases the default values for these parameters should be appropriate and so no changes to your code should be needed.
- Ag3: The names of the columns in dataframes containing data related to species calling have changed to make it clearer which species calling method has been used. This affects dataframes returned by species_calls() and sample_metadata(). See GH93 for further details.
- Ag3: The latest cohorts metadata are now automatically loaded and joined in with the sample metadata when calling sample_metadata(). See GH94 for further details.
- Ag3: SNP effects are now automatically included in the output dataframe from snp_allele_frequencies() (GH95).
- Ag3: Added a new sample_query parameter to methods returning frequencies to allow for making a sub-selection of samples (GH96).
- Ag3: Added a new method aa_allele_frequencies() to return a dataframe of amino acid substitution allele frequencies (GH101).
- Ag3: Added a new method plot_frequencies_heatmap() for creating a heatmap plot of allele frequencies (GH102).
- Ag3: The Google Cloud Storage URL ("gs://vo_agam_release") is now the default value when instantiating the Ag3 class (GH103). So now you don't need to provide it if you are accessing data from GCS. I.e., you can just do:
import malariagen_data
ag3 = malariagen_data.Ag3()
- Ag3: The identifiers used for data releases have been changed to use "3.0" instead of "v3", "3.1" instead of "v3.1", etc. (GH104)
- The Ag3 and Amin1 classes have a better repr (GH111).
- Ag3: All dataframe columns containing allele frequency values are now prefixed with "frq_" to allow for easier selection of frequency columns (GH116).
- Ag3: When computing frequencies, automatically drop columns for cohorts below the minimum cohort size (GH118).
- Amin1: Added support for region parameter instead of contig (GH119).
- Ag3: The snp_sites() method no longer returns a tuple of arrays if the field parameter is not provided; please provide an explicit field parameter or use the snp_calls() method instead (recommended).
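The region parameter described in these notes accepts several string forms. To make them concrete, here is a minimal sketch of how such strings could be split into a name and optional coordinates. The `Region` type and both helpers are hypothetical and are not the package's parser. In particular, the sketch does not tell a contig ID from a gene ID, since resolving a gene to coordinates would need the genome annotations.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str             # contig ID or gene ID
    start: Optional[int]  # None when no coordinates were given
    end: Optional[int]


def parse_region(region):
    """Parse "3L", "AGAP004070" or "3L:1000000-2000000" (hypothetical helper)."""
    if ":" not in region:
        # A bare contig ID or gene ID, with no coordinates.
        return Region(region, None, None)
    name, _, span = region.partition(":")
    start_text, sep, end_text = span.partition("-")
    if not sep:
        raise ValueError(f"expected 'contig:start-end', got {region!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"region start is after region end in {region!r}")
    return Region(name, start, end)


def parse_regions(region):
    """Accept a single region string or a list of them."""
    if isinstance(region, str):
        return [parse_region(region)]
    return [parse_region(r) for r in region]


print(parse_regions(["3L", "3L:1000000-2000000", "AGAP004070"]))
```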
- Ag3: Move default values for analysis parameters to constants (GH70).
- Ag3: Check for manifest.tsv when discovering a release (GH74).
- Ag3: Decode sample IDs when building snp_calls() dataset (GH82).
- Ag3: Fix snp_calls() cannot take multiple releases for sample_set parameter (GH85).
- Ag3: Fix chunks parameter appears to be ignored (GH86).
- Support Python 3.9 (GH91).
- Ag3: Fix pandas performance warnings (GH108).
- Ag3: Fix bug involving inconsistent array lengths before and after computation (GH114).
- Ag3: Fix compatibility with zarr 2.11.0 (GH129).
- Some optimisations to speed up the test suite a bit (GH122).
Ag3: Update default cohort parameter to latest analysis (20211101).
Amin1: Bug fix to snp_calls() handling of site_mask parameter.
- Adds the Amin1 class providing access to the Anopheles minimus Amin1 SNP data release.
Ag3: Bug fix to sample_cohorts().
- Ag3: Update default cohort parameter to latest analysis (20210927).
- Ag3: Reduce dataframe fragmentation and memory footprint in gene_cnv_frequencies().
Ag3: Add support for standard cohorts in the functions snp_allele_frequencies() and gene_cnv_frequencies().
Ag3: Add sample_cohorts().
Ag3: Add haplotypes() and supporting functions open_haplotypes() and open_haplotype_sites().
Ag3: Add site filter columns to dataframes returned by snp_effects() and snp_allele_frequencies().
Ag3: Rename parameter "populations" to "cohorts" to be consistent with sgkit terminology.
- Ag3: Add gene_cnv() and gene_cnv_frequencies().
- Ag3: Improvements and maintenance to snp_effects() and snp_allele_frequencies().
- Ag3: Add snp_allele_frequencies().
- Ag3: Add snp_effects().
- Ag3: Add cnv_hmm(), cnv_coverage_calls() and cnv_discordant_read_calls().
- Speed up test suite via caching.
- Add configuration for pre-commit hooks.
- Performance improvements for faster reading and indexing of zarr arrays.
Ag3: Bug fix and minor improvements to snp_calls().
Ag3: Explore workarounds to xarray memory issues in the snp_calls() method.
- Ag3: Make public the open_genome(), open_snp_sites(), open_site_filters() and open_snp_genotypes() methods.
- Ag3: Add the cross_metadata() method.
- Ag3: Add site_annotations() and open_site_annotations() methods.
- Ag3: Add the snp_calls() method.
- Improve unit tests.
- Improve memory usage.
- Fix compatibility issue in recent fsspec/gcsfs release.
First release with basic functionality in the Ag3 class for accessing Ag1000G phase 3 data.
To get set up for development, see this video and the instructions below.
Fork and clone this repo:
$ git clone git@github.com:[username]/malariagen-data-python.git
Install poetry >=1.2.0b3 somehow, e.g.:
$ python3.7 -m pip install poetry==1.2.0b3
Create development environment:
$ cd malariagen-data-python
$ python3.7 -m poetry install
Activate development environment:
$ python3.7 -m poetry shell
Install pre-commit hooks:
$ pre-commit install
Run pre-commit checks (isort, black, blackdoc, flake8, ...) manually:
$ pre-commit run --all-files
Run tests:
$ python3.7 -m poetry run pytest -v
To create a new release...
- From master, create a new local branch named "vX.X.X-prep" replacing "X.X.X" with the new version number.
- Open pyproject.toml in a text editor and change the version property to the new version number.
- Open README.md in a text editor and add a new subsection in the release notes section with some information about the changes in the new release.
- Commit the changes to pyproject.toml and README.md, open a pull request with the title "vX.X.X release prep", and request a review.
- Once the PR is approved, merge to master, then create a GitHub release, using the new version number as the release tag.
- Back on your local system, run git pull to update your master branch and fetch remote tags, then run git checkout vX.X.X to check out the release tag locally.
- Run:
$ python3.7 -m poetry build
$ python3.7 -m poetry publish
You will need a PyPI username and password, and will need to be added as a maintainer of the malariagen_data package on PyPI.
N.B., release numbers should follow semantic versioning.
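As a reminder of how semantic versioning increments work, the sketch below bumps a plain "MAJOR.MINOR.PATCH" version string and derives the release prep branch name used in the steps above. Both helpers are hypothetical and are not part of this project's tooling. Pre-release suffixes are deliberately not handled.

```python
def bump_version(version, part):
    """Return the next version after bumping "major", "minor" or "patch"."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        # Incompatible API changes: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards-compatible new functionality: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards-compatible bug fixes.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


def prep_branch_name(version):
    """Name of the local branch used to prepare a release (hypothetical helper)."""
    return f"v{version}-prep"


# Example version numbers are illustrative only.
next_version = bump_version("1.4.2", "minor")
print(next_version, prep_branch_name(next_version))
```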