Apply suggestions from code review
Co-authored-by: Tuomas Rossi <[email protected]>
vnmabus and trossi authored Sep 7, 2024
1 parent 85d5672 commit e074b3f
Showing 2 changed files with 36 additions and 16 deletions.
21 changes: 17 additions & 4 deletions paper/paper.bib
@@ -8,12 +8,12 @@ @misc{diaz-vico+ramos-carreno_2022_scikitdatasets
copyright = {MIT}
}

@misc{fajardo_2018_pyreadr,
@misc{fajardo_2024_pyreadr,
title = {Pyreadr},
author = {Fajardo, Otto},
year = {2018},
month = dec,
doi = {10.5281/zenodo.7110170},
year = {2024},
month = jul,
doi = {10.5281/zenodo.13132498},
url = {https://github.com/ofajardo/pyreadr}
}

@@ -50,4 +50,17 @@ @article{ramos-carreno+_2024_scikitfda
abstract = {The library scikit-fda is a Python package for functional data analysis (FDA). It provides a comprehensive set of tools for representation, preprocessing, and exploratory analysis of functional data. The library is built upon and integrated in Python's scientific ecosystem. In particular, it conforms to the scikit-learn application programming interface so as to take advantage of the functionality for machine learning provided by this package: Pipelines, model selection, and hyperparameter tuning, among others. The scikit-fda package has been released as free and open-source software under a 3-clause BSD license and is open to contributions from the FDA community. The library's extensive documentation includes step-by-step tutorials and detailed examples of use.},
copyright = {Copyright (c) 2024 Carlos Ramos-Carre{\~n}o, Jos{\'e} Luis Torrecilla, Miguel Carbajo-Berrocal, Pablo Marcos, Alberto Su{\'a}rez},
langid = {english}
}

@article{rahman+_2024_hmschpc,
title = {Accelerating joint species distribution modelling with {Hmsc-HPC} by {GPU} porting},
author = {Rahman, Anis Ur and Tikhonov, Gleb and Oksanen, Jari and Rossi, Tuomas and Ovaskainen, Otso},
year = {2024},
month = sep,
journal = {PLOS Computational Biology},
volume = {20},
number = {9},
pages = {e1011914},
doi = {10.1371/journal.pcbi.1011914},
abstract = {Joint species distribution modelling (JSDM) is a widely used statistical method that analyzes combined patterns of all species in a community, linking empirical data to ecological theory and enhancing community-wide prediction tasks. However, fitting JSDMs to large datasets is often computationally demanding and time-consuming. Recent studies have introduced new statistical and machine learning techniques to provide more scalable fitting algorithms, but extending these to complex JSDM structures that account for spatial dependencies or multi-level sampling designs remains challenging. In this study, we aim to enhance JSDM scalability by leveraging high-performance computing (HPC) resources for an existing fitting method. Our work focuses on the Hmsc R-package, a widely used JSDM framework that supports the integration of various dataset types into a single comprehensive model. We developed a GPU-compatible implementation of its model-fitting algorithm using Python and the TensorFlow library. Despite these changes, our enhanced framework retains the original user interface of the Hmsc R-package. We evaluated the performance of the proposed implementation across various model configurations and dataset sizes. Our results show a significant increase in model fitting speed for most models compared to the baseline Hmsc R-package. For the largest datasets, we achieved speed-ups of over 1000 times, demonstrating the substantial potential of GPU porting for previously CPU-bound JSDM software. This advancement opens promising opportunities for better utilizing the rapidly accumulating new biodiversity data resources for inference and prediction.},
}
31 changes: 19 additions & 12 deletions paper/paper.md
@@ -1,5 +1,5 @@
---
title: 'rdata: Read R datasets from Python'
title: 'rdata: A Python library for R datasets'
tags:
- Python
- R
@@ -14,22 +14,23 @@ authors:
orcid: 0000-0002-8713-4559
affiliation: 2
affiliations:
- name: Universidad Autónoma de Madrid, Spain
index: 1
- name: CSC - IT Center for Science Ltd, Finland
index: 2
date: 31 August 2024
- name: Universidad Autónoma de Madrid, Spain
index: 1
- name: CSC IT Center for Science Ltd., Finland
index: 2
date: 4 September 2024
bibliography: paper.bib

---

# Summary

Research work usually requires the analysis and processing of data from different sources.
Traditionally, statisticians and other research professionals have used R for this task, and have compiled a huge number of datasets in the Rda and Rds formats, native to this programming language.
Traditionally in statistical computing, the R language has been widely used for this task, and a huge number of datasets have been compiled in the Rda and Rds formats, native to this programming language.
As these formats internally contain the representation of R objects, they cannot be used directly from Python, another widely used language for data analysis and processing.
The library `rdata` allows loading these datasets and converting them to Python objects, without the need to export them to other intermediate formats that may not keep all the original information.
This library has minimal dependencies, ensuring that it can be used in contexts where an R installation is not available.
The capability to write data in Rda and Rds formats is also under development.
Thus, the library `rdata` facilitates data interchange, enabling the usage of the same datasets in both languages (e.g. for reproducibility, comparisons of results against methods in both languages, or migration of processing pipelines to Python).

# Statement of need
@@ -44,15 +45,15 @@ In the first place, the package requires an R installation, as it relies in laun
Secondly, launching R just to load data is inefficient, both in time and memory.
Finally, this package inherits the GPL license from the R language, which is not compatible with most Python packages, typically released under more permissive licenses.

The recent package `pyreadr` [@fajardo_2018_pyreadr] also provides functionality to read some R datasets.
The package `pyreadr` [@fajardo_2024_pyreadr] also provides functionality to read and write some R datasets.
It relies on the C library `librdata` to parse the RData format.
This adds a dependency on C build tools, and requires that the package be compiled for all the desired operating systems.
Moreover, this package is limited by the functionalities available in `librdata`, which at the moment of writing
does not include the parsing of common objects such as R lists and S4 objects.
The license can also be a problem, as it is part of the GPL family and does not allow commercial use.

As existing solutions were unsuitable for our needs, the package `rdata` was developed to parse data in the RData format.
This is a small, extensible and very complete implementation in pure Python of an RData parser that is able to read and convert most datasets in the CRAN repository to equivalent Python objects.
This is a small, extensible, efficient, and very complete implementation in pure Python of an RData parser that is able to read and convert most datasets in the CRAN repository to equivalent Python objects.
It has a permissive license and can be extended to support additional conversions from custom R classes.

The package `rdata` has been designed as a pure Python package with minimal dependencies, so that it can be easily integrated inside other libraries and applications.
@@ -120,11 +121,17 @@ Several utility functions, such as the routines `convert_char()` and `convert_li

# Ongoing work


To broaden the utility of the `rdata` library to data processing pipelines with steps in both R and Python, we are currently extending the library with the capability to write compatible Python objects to RData files.
As an example, such a pipeline is present in the Hmsc-HPC code [@rahman+_2024_hmschpc], the continuous development of which has been driving the ongoing work on the writing functionality in the `rdata` library.
The writing of RData files is implemented as a two-step process similar to reading: first, the Python object is converted to the tree-like intermediate representation used in parsing, and then this intermediate representation is written to an RData file.
Currently, the writing functionality supporting common types is available in the development branch of the `rdata` library.

# Acknowledgements

The authors acknowledge financial support from the Spanish Ministry of Education and Innovation, projects PID2019-106827GB-I00 / AEI / 10.13039/501100011033 and PID2019-109387GB-I00.
This work was also supported by an FPU grant (Formación de Profesorado Universitario) from the Spanish Ministry of Science, Innovation and Universities (MICIU) with reference FPU18/00047.
This work has received funding
from the Spanish Ministry of Education and Innovation, projects PID2019-106827GB-I00 / AEI / 10.13039/501100011033 and PID2019-109387GB-I00,
from an FPU grant (Formación de Profesorado Universitario) from the Spanish Ministry of Science, Innovation and Universities (MICIU) with reference FPU18/00047,
and from the European Union's Horizon Europe research and innovation programme under grant agreement No 101057437 (BioDT project, [https://doi.org/10.3030/101057437](https://doi.org/10.3030/101057437)).
Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.

# References
