This repository describes the conda package download data provided by Anaconda, Inc. It includes package download counts starting from Jan. 2017 for the following download sources:
- Anaconda Distribution: The default channels hosted on `repo.anaconda.com` (and historically on `repo.continuum.io`)
- Select Anaconda.org channels: Currently this includes `conda-forge` and `bioconda`.
Check out an example notebook using this data on Binder:
The download data is provided as one record for every unique combination of:
- `data_source`: `anaconda` for Anaconda Distribution, `conda-forge` for the conda-forge channel on Anaconda.org, and `bioconda` for the bioconda channel on Anaconda.org.
- `time`: UTC time, binned by hour
- `pkg_name`: Package name (Ex: `pandas`)
- `pkg_version`: Package version (Ex: `0.23.0`)
- `pkg_platform`: One of `linux-32`, `linux-64`, `osx-64`, `win-32`, `win-64`, `linux-armv7`, `linux-ppcle64`, `linux-aarch64`, or `noarch`
- `pkg_python`: Python version required by the package, if any (Ex: `3.7`)
- `counts`: Number of downloads for this combination of attributes
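Because each record covers a single hour, version, platform, and Python version, a per-package total means summing `counts` over all the other fields. The following is a minimal sketch of that aggregation using only the standard library. The sample records and the helper name `total_downloads_by_package` are made up for illustration and are not part of the dataset or its tooling.

```python
from collections import Counter

# Made-up sample records that follow the schema described above.
# The values are illustrative only and do not come from the real dataset.
SAMPLE_RECORDS = [
    {"data_source": "anaconda", "time": "2019-12-01 00:00:00",
     "pkg_name": "pandas", "pkg_version": "0.23.0",
     "pkg_platform": "linux-64", "pkg_python": "3.7", "counts": 12},
    {"data_source": "conda-forge", "time": "2019-12-01 01:00:00",
     "pkg_name": "pandas", "pkg_version": "0.23.0",
     "pkg_platform": "win-64", "pkg_python": "3.7", "counts": 5},
    {"data_source": "anaconda", "time": "2019-12-01 00:00:00",
     "pkg_name": "numpy", "pkg_version": "1.17.4",
     "pkg_platform": "osx-64", "pkg_python": "3.7", "counts": 3},
]


def total_downloads_by_package(records):
    """Sum `counts` per `pkg_name` across all other attributes (hypothetical helper)."""
    totals = Counter()
    for record in records:
        totals[record["pkg_name"]] += record["counts"]
    return dict(totals)


if __name__ == "__main__":
    print(total_downloads_by_package(SAMPLE_RECORDS))
```

The same grouping can be done with a dataframe `groupby` once the real data is loaded; the plain-Python version is shown only to make the record structure concrete.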
The storage format is Parquet, one file per day, with SNAPPY compression. Files are hosted on S3, with the naming convention:
s3://anaconda-package-data/conda/hourly/[year]/[month]/[year]-[month]-[day].parquet
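As a rough sketch, the daily file locations can be generated from calendar dates by following the naming convention above. This assumes that `[month]` and `[day]` are zero-padded to two digits, which the convention does not state explicitly. The helper names `daily_file_url` and `daily_file_urls` are hypothetical.

```python
from datetime import date, timedelta

# Prefix taken from the naming convention above.
BUCKET_PREFIX = "s3://anaconda-package-data/conda/hourly"


def daily_file_url(day: date) -> str:
    """Build the Parquet file URL for one day.

    Assumption: month and day are zero-padded (e.g. 2019/12/2019-12-01).
    """
    return f"{BUCKET_PREFIX}/{day:%Y}/{day:%m}/{day:%Y-%m-%d}.parquet"


def daily_file_urls(start: date, end: date) -> list:
    """Build URLs for every day from `start` to `end`, inclusive."""
    n_days = (end - start).days
    return [daily_file_url(start + timedelta(days=i)) for i in range(n_days + 1)]


if __name__ == "__main__":
    for url in daily_file_urls(date(2019, 12, 30), date(2020, 1, 2)):
        print(url)
```

Note that a generated URL is not guaranteed to exist, since the dataset has known gaps.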
To simplify using the dataset, we have also created an Intake catalog file, which you can load directly from the repository if you have the `intake`, `intake-parquet`, and `python-snappy` packages installed:
import intake
cat = intake.Catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
monthly = cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()
Or you can install the data package directly with conda, which will also fetch the required dependencies:
conda install -c intake anaconda-package-data
And then the data source will appear in the global catalog of your conda environment:
import intake
monthly = intake.cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()
To minimize bandwidth usage, these catalogs are configured so that Intake will cache data locally to your system on first use.
There are some known gaps in the dataset, and Anaconda.org data doesn't appear in the dataset until April 2017. See KNOWN_ISSUES.md for more details.
This data will be updated approximately monthly. Note that we may revise historical data if processing issues are discovered, or to add additional data (like new Anaconda.org channels). We will update the change log when new or revised data is posted.
This dataset is licensed under a Creative Commons Attribution 4.0 International License. We are offering this data to help the community understand the usage of conda packages, but with no warranty. If you use this data, please acknowledge Anaconda as the source and link back to this GitHub repository.
If you have questions or find problems in the data, please open an issue on this repository. Thanks!