Keep track of datasets used in a project.
Provide a simple and straightforward way to keep track of datasets downloaded from the web.
Currently Datasets supports download from a set of URLs suited for repositories like PANGEA or ZENODO, as well as git-based repositories such as github. Support for more remote repositories will be added along the way as necessary.
It provides declarative functions to register and download datasets, as well as a way to write to and read from an equivalent (and optional) toml
config file.
This package is not registerd, so you need to install it from URL:
using Pkg
Pkg.add(url="https://github.com/awi-esc/Datasets.jl")
Here is the most straightforward use, e.g. in a datasets.toml
file:
[herzschuh2023]
downloads = ["https://doi.pangaea.de/10.1594/PANGAEA.930512?format=zip"]
doi = "10.1594/PANGAEA.930512"
[jonkers2024]
downloads = ["https://download.pangaea.de/dataset/962852/files/LGM_foraminifera_assemblages_20240110.csv"]
doi = "10.1594/PANGAEA.962852"
[tierney2020]
remote = "[email protected]:jesstierney/lgmDA.git"
And read via the Datasets.read
function, download via download_dataset
or download_datasets
using DataFrames
using Datasets
db = Datasets.read("datasets.yml"; datasets_path=expanduser("~/datasets"))
folder = download_dataset(db, "jonkers2024") # will download only if not present
df = CSV.read(joinpath(folder, "LGM_foraminifera_assemblages_20240110.csv"), DataFrame)
Examples of the declarative syntax
using Datasets
db = Database(datasets_path="datasets") # the default
register_dataset(db, "herzschuh2023"; doi="10.1594/PANGAEA.930512",
downloads=["https://doi.pangaea.de/10.1594/PANGAEA.930512?format=zip"],
)
register_dataset(db, "jonkers2024"; doi="10.1594/PANGAEA.962852",
downloads=["https://download.pangaea.de/dataset/962852/files/LGM_foraminifera_assemblages_20240110.csv"],
)
register_repository(db, "[email protected]:jesstierney/lgmDA.git"; name="tierney2020")
println(db)
yields:
Database:
- herzschuh2023 => 10.1594/PANGAEA.930512
- jonkers2024 => 10.1594/PANGAEA.962852
- tierney2020 => [email protected]:jesstierney/lgmDA.git
datasets_path: datasets
To be completed. But basically
println(repr(db))
yields
Database(
datasets=Dict(
herzschuh2023 => DatasetEntry(doi="10.1594/PANGAEA.930512"...),
jonkers2024 => DatasetEntry(doi="10.1594/PANGAEA.962852"...),
tierney2020 => RepositoryEntry(remote="[email protected]:jesstierney/lgmDA.git"...),
),
datasets_path="datasets"
)
It seems there are quite a few tools to help project and data management. What I stumbled upon includes Dr Watson, DataToolKit.jl and RemoteFiles.jl. RemoteFiles.jl does not provide enough documentation for me to judge at this stage. Dr Watson aims at assisting with all aspects of how to organize files in a scientific project, including running simulations etc, and as such it has a broader scope than Datasets.jl. DataToolKit.jl is the only package I actually tried. What I can say is it is impressive because it merges apparent simplicity of use depth of functionality. I'd say if Datasets.jl ever attempts to get past the download and on-disk management of datasets, with things like actual data loaders including lazy loading of web ressources, it should probably stop right there and use DataToolKits instead.
What made me publish this package instead of just relying on DataTookKit.jl is the KISS principle (Keep It Simple & Stupid). I dislike the idea of having data loaders included as this massively overburdens the core functionality (keep track of things), and examples provided in DataToolKit to clean-up datasets were not convincing to me: too much is kept hidden with ugly meta @syntax
mixed in the config files, were normal functions could do the job. Also I found it not straightforward to use the files as they are downloaded (thinking about a zip file that contained CSV data in need of custom loading) and it was not immediately clear to me how to store files on disk (it might be possible though!). Anyway, the DataToolKit.jl project is very good and has a dedicated main developer giving talks and it will evolve and you should check it out! For now though, Datasets.jl is so simple and tiny that it can be useful for whoever wants to follow the KISS principle.