Motivation • Installation • File Types • Key-value pairs • Workflow • License
The computational chemistry and biology communities often fail to openly provide the raw and/or processed data used to draw their scientific conclusions.
For large projects, frameworks such as QCArchive, Materials Project, Pitt Quantum Repository, ioChem-BD and many others provide great storage solutions. However, these frameworks are not practical for fluid data pipelines and small-scale projects such as a single manuscript.
Alternatively, you could use individual files in formats such as JSON, XML, YAML, npz, etc. These are great options for customizable data storage with their own advantages and disadvantages. However, you often must choose between (1) a standardized parser that might not support your workflow or (2) writing your own.
Reptar is designed for easy data storage and analysis for individual projects. Customizable parsers provide a simple way to extract new data without submitting issues and pull requests (although this is highly encouraged). While files are the heart of reptar, it strives to be file-type agnostic by providing the same interface for all supported file types. The result is a user-specified file streamlined for analysis in Python and archival on places such as GitHub and Zenodo.
You can install reptar from PyPI by using `pip install reptar`.
Or, the latest development version can be installed directly from the GitHub repository or from TestPyPI.
git clone https://github.com/oasci/reptar
cd reptar
pip install .
Reptar supports four file types with a single interface: exdir, zarr, JSON, and npz.
JSON is a text file for storing key-value pairs with few dimensions (i.e., no large arrays).
NumPy's npz format is useful for arrays; however, no nesting is possible and loading data often requires postprocessing for 0D arrays (e.g., `np.array('data')`).
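The 0D-array postprocessing mentioned above can be seen with NumPy alone. The following is a minimal sketch, not reptar code; the variable names and values are illustrative assumptions.

```python
import io

import numpy as np

# Write a string, a float, and a regular array into an in-memory npz archive.
buffer = io.BytesIO()
np.savez(buffer, label="data", energy=-76.4, coords=np.zeros((3, 3)))
buffer.seek(0)

with np.load(buffer) as npz:
    label_raw = npz["label"]    # comes back as a 0D array, not a str
    energy_raw = npz["energy"]  # comes back as a 0D array, not a float
    coords = npz["coords"]      # regular arrays need no extra handling

# .item() converts a 0D array back into a plain Python object.
label = label_raw.item()
energy = energy_raw.item()
```

Scalars and strings round-trip through npz as 0D arrays, so a step such as `.item()` is needed before they behave like ordinary Python values again.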
Exdir is a simple, yet powerful open file format that mimics the HDF5 format with metadata and data stored in directories with YAML and npy files instead of a single binary file. For more detailed information, please read this Front. Neuroinform. article about exdir. Zarr is a similar hierarchical data format for chunked and compressed NumPy-like arrays and JSON attributes. Both of these file types provide several advantages such as mixing human-readable and binary files, being easier for version control, and only loading requested portions of arrays into memory.
All data is stored under a `key`-value pair within the reptar framework.
The `key` tells reptar where the data is stored and is conceptually related to standard file paths (without file extensions).
Nested data is specified by separating the nested keys with a `/`.
For example, `energy_pot`, `md_run/geometry`, and `entity_ids` are all valid keys.
Note that `gradients` and `/gradients` would translate to the same value (`/` specifies the "root" of the file).
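Because keys behave like file paths, the equivalence of relative and root-anchored keys can be illustrated with Python's `posixpath` module. This is only a sketch of the idea; `normalize_key` and `split_key` are hypothetical helpers and are not part of reptar.

```python
import posixpath


def normalize_key(key: str) -> str:
    """Return the key anchored at the root "/" (hypothetical helper)."""
    return posixpath.normpath(posixpath.join("/", key))


def split_key(key: str) -> tuple:
    """Split a key into its parent group and the final data name."""
    return posixpath.split(normalize_key(key))


same_root = normalize_key("gradients") == normalize_key("/gradients")
parent, name = split_key("md_run/geometry")
```

Here `same_root` is `True`, and `md_run/geometry` splits into the parent group `/md_run` and the name `geometry`.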
We refer to a "reptar file" as any file that can be used with the `reptar.File` class.
Creating a reptar file starts by having a set of data files generated from some calculation.
Paths to these data files are passed into `reptar.Creator.from_calc`, which extracts information using a `reptar.parser` class.
Information parsed from these files, `parsed_info`, is then used to populate a `reptar.File` object.
Data can also be manually added by using `File.put(key, data)`, where `key` is a string specifying where to store the data.
Data can be added or retrieved using the same interface regardless of the underlying file format (e.g., exdir, JSON, and npz).
The only thing required is the respective `key` specifying where it is stored.
Then, `File.get(key)` can retrieve the data.
When working with JSON and npz files, `File.save()` must be explicitly called after any modification.
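To illustrate why an explicit save is needed for single-file formats such as JSON, here is a toy stand-in that mimics the put/get/save pattern described above. `ToyJSONFile` is a hypothetical class written for this example; it is not `reptar.File` and makes no claim about how reptar is implemented.

```python
import json
import os
import tempfile


class ToyJSONFile:
    """Hypothetical stand-in for a key-based file; NOT reptar.File."""

    def __init__(self, path):
        self.path = path
        self._data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._data = json.load(f)

    @staticmethod
    def _parts(key):
        # "md_run/geometry" and "/md_run/geometry" give the same parts.
        return [part for part in key.split("/") if part]

    def put(self, key, data):
        *groups, name = self._parts(key)
        node = self._data
        for group in groups:
            node = node.setdefault(group, {})
        node[name] = data  # only the in-memory copy changes here

    def get(self, key):
        node = self._data
        for part in self._parts(key):
            node = node[part]
        return node

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._data, f, indent=2)


path = os.path.join(tempfile.mkdtemp(), "toy.json")
toy = ToyJSONFile(path)
toy.put("md_run/temperature", 300.0)
toy.put("energy_pot", -76.4)

exists_before_save = os.path.exists(path)  # nothing on disk yet
toy.save()
reloaded = ToyJSONFile(path)
```

In this sketch, `put` only modifies the in-memory dictionary, so the data reaches disk only once `save()` is called. Directory-based formats such as exdir and zarr do not share this limitation in the same way, because each dataset lives in its own file.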
Other packages often require data to be formatted in their own specific way.
Reptar provides ways to extract data from reptar files using `File.get(key)` and passing it into the desired `reptar.writer` function.
Reptar currently automates the creation of:
- Atomic simulation environment (ASE) databases,
- Gaussian approximate potentials (GAP) extended XYZ files,
- Protein data bank (PDB) files,
- Schnetpack databases,
- XYZ files.
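As a rough picture of what a writer does, the sketch below formats element symbols and Cartesian coordinates as XYZ text. `format_xyz` is a hypothetical function written for this example and is not the `reptar.writer` API; in practice the inputs would come from `File.get(key)` calls with keys that depend on your file.

```python
def format_xyz(symbols, coords, comment=""):
    """Return XYZ-formatted text for one structure (hypothetical helper)."""
    if len(symbols) != len(coords):
        raise ValueError("symbols and coords must have the same length")
    # XYZ layout: atom count, comment line, then one line per atom.
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, coords):
        lines.append(f"{symbol} {x:.8f} {y:.8f} {z:.8f}")
    return "\n".join(lines) + "\n"


# Illustrative water geometry in Angstrom (values are assumptions).
water = format_xyz(
    ["O", "H", "H"],
    [(0.0, 0.0, 0.1173), (0.0, 0.7572, -0.4692), (0.0, -0.7572, -0.4692)],
    comment="water",
)
```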
Distributed under the MIT License. See `LICENSE` for more information.