disdrodb review #156

Open · 20 of 31 tasks
ghiggi opened this issue Jan 18, 2024 · 12 comments

@ghiggi

ghiggi commented Jan 18, 2024

Submitting Author: Gionata Ghiggi (@ghiggi)
All current maintainers: (@ghiggi)
Package Name: disdrodb
One-Line Description of Package: disdrodb - a Python package for the decentralized archiving and standardization of global disdrometer data
Repository Link: https://github.com/ltelab/disdrodb
Version submitted: v.0.0.21
EIC: @isabelizimm
Editor: @Zeitsperre
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

The raindrop size distribution (DSD) describes the concentration and size distributions of raindrops in a volume of air. It is a crucial piece of information to model the propagation of microwave signals through the atmosphere (key for telecommunication and weather radar remote sensing calibration), to improve microphysical schemes in numerical weather prediction models, and to understand land surface processes (rainfall interception, soil erosion).

Recognizing the importance of understanding DSD's spatial and temporal variability, scientists worldwide have initiated efforts to "count the drops" by deploying disdrometers—specialized instruments designed to record DSD. Numerous measurement campaigns have been conducted by meteorological services, national agencies (e.g., NASA, ARM, NCAR), and university research groups. Despite these efforts, a significant portion of the collected data remains difficult to access. These data are often stored in diverse formats with inadequate documentation and metadata, posing challenges in sharing, analyzing, comparing, and reusing the data.

In response to these challenges, the disdrodb Python package offers:

  1. A Decentralized Data Archive Infrastructure: The disdrodb package establishes a decentralized data archive, fostering the exchange and retrieval of raw disdrometer data within the scientific community. This infrastructure addresses the issues of data accessibility and documentation, and promotes collaborative research.

  2. Standardization of Raw Data: The disdrodb package provides tools to convert heterogeneous raw data into a uniform netCDF4 format, known as the DISDRODB L0 product. This standardization is a significant step forward, ensuring that data from different sources become compatible and easier to analyze, compare, and share, thereby enhancing the overall utility and reusability of the data.

Scope

  • Data retrieval
  • Data extraction
  • Data processing/munging
  • Data deposition
  • Data validation and testing
  • Data visualization
  • Workflow automation
  • Citation management and bibliometrics
  • Scientific software wrappers
  • Database interoperability

Domain Specific & Community Partnerships

  • Geospatial
  • Education
  • Pangeo

How and why the package falls under the categories you indicated above

Data Retrieval

The disdrodb package facilitates the retrieval of raw measurements acquired by the disdrometer stations included in the DISDRODB Decentralized Data Archive. This remote archive comprises public cloud repositories such as Zenodo. The disdrodb package tracks the available stations through the DISDRODB Metadata Archive, which is hosted on GitHub.
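For illustration, fetching the data of a single station might look like the following sketch (the function name and argument values are hypothetical, not the package's exact API):

```python
# Hypothetical sketch of retrieving one station's raw data; the function
# name and argument values are illustrative, not disdrodb's exact API.
import disdrodb

disdrodb.download_station(
    data_source="EPFL",            # data provider identifier (assumed)
    campaign_name="EXAMPLE_2024",  # measurement campaign (assumed)
    station_name="STATION_1",      # station identifier (assumed)
)
```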

Data Munging

After downloading the desired data, users can use disdrodb to convert the heterogeneous raw data into a uniform netCDF4 format (DISDRODB L0) with a single command. This conversion facilitates subsequent scientific analysis and product generation. For each disdrometer station, the disdrodb Python package provides a specialized reader that accurately parses the raw sensor data.
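The conversion itself might look like the following sketch (again, the function name and argument values are hypothetical):

```python
# Hypothetical sketch of the "single command" L0 conversion; the function
# name and argument values are illustrative, not disdrodb's exact API.
import disdrodb

disdrodb.run_l0(
    data_source="EPFL",            # same identifiers as for the download (assumed)
    campaign_name="EXAMPLE_2024",
    station_name="STATION_1",
)
```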

Data Deposition

The disdrodb package offers a workflow for users who wish to contribute their disdrometer measurements to the DISDRODB community. This workflow ensures the long-term documentation of the data and simplifies the data upload process to the DISDRODB Decentralized Data Archive. Users must perform three main tasks:

  • Create a reader that reads the raw data into a dataframe, adhering to the DISDRODB guidelines (a minimal sketch is shown after this list).

  • Provide the metadata of the disdrometer station, which will be included in the DISDRODB Metadata Archive.

  • Upload the station's raw data to a remote repository and insert the station data URL into the DISDRODB Metadata Repository. The disdrodb package can automate this final step if the chosen remote repository is Zenodo.
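
To illustrate the first task: a reader is conceptually just a function that parses one raw data file into a dataframe. Below is a minimal, hypothetical sketch; the actual reader interface, field separator, and column names are defined by the DISDRODB guidelines, not by this example:

```python
# Minimal, hypothetical reader sketch; the separator, column names, and
# exact reader interface are illustrative, not the DISDRODB specification.
import pandas as pd

def reader(filepath: str) -> pd.DataFrame:
    """Parse one raw data file of a station into a dataframe."""
    df = pd.read_csv(
        filepath,
        sep=";",                             # assumed field separator
        names=["time", "raw_drop_number"],   # assumed raw columns
    )
    df["time"] = pd.to_datetime(df["time"])  # standardize the time column
    return df
```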

Who is the target audience and what are scientific applications of this package?

The primary audience for this package includes researchers and students in remote sensing and atmospheric science, specifically those focused on precipitation. Typical scientific applications include the calibration of weather radar remote sensing, the improvement of microphysical schemes in numerical weather prediction models, and the study of land surface processes such as rainfall interception and soil erosion.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

To our knowledge, there are no other packages that offer an integrated infrastructure for retrieving, sharing, archiving, reading, and standardizing disdrometer data.

However, the pyDSD package exists for studying the DSD. It provides methods for high-level scientific analysis of disdrometer raw data, such as computing DSD parameters and simulating weather radar reflectivities.

The DISDRODB Working Group plans to leverage and adapt pyDSD codes in the future to generate uniform, high-level scientific products for all stations within the DISDRODB Global Archive.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • uses an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a tutorial with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser, text-based review. It will also allow you to demonstrate addressing the issues via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • I have read the author guide.
  • I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

@lwasser
Member

lwasser commented Feb 6, 2024

hi @ghiggi 👋 I wanted to welcome you to pyOpenSci! We have seen this submission and an editor will get back to you with some initial checks soon. In the meantime if you have any questions you can ask here or in our discourse.

@isabelizimm
Contributor

isabelizimm commented Feb 7, 2024

Hello and welcome to pyOpenSci!!!

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci
review. Below are the basic checks that your package needs to pass
to begin our review. If some of these are missing, we will ask you
to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements
below.

  • Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
    • The package imports properly into a standard Python environment import package.
  • Fit The package meets criteria for fit and overlap.
  • Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
    • User-facing documentation that overviews how to install and start using the package.
    • Short tutorials that help a user understand how to use the package and what it can do for them.
    • API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
  • Core GitHub repository Files
    • README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
    • Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
    • Code of Conduct The package has a CODE_OF_CONDUCT.md file.
    • License The package has an OSI approved license.
      NOTE: We prefer that you have development instructions in your documentation too.
  • Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
  • Automated tests Package has a testing suite and is tested via a Continuous Integration service.
  • Repository The repository link resolves correctly.
  • Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

  • Initial onboarding survey was filled out
    We appreciate each maintainer of the package filling out this survey individually. 🙌
    Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌


Editor comments

A few small nits as I am going through the repository. These nits are non-blocking for continuing on with your review, but I thought I would call them out.

An overall impression I've received from going through your packages is how much effort you all have put into documenting the tricky steps that people might not be super familiar with. This is not trivial work, so I just wanted to call out how impressed I was!

The one piece that I would like to open up for discussion is

The package imports properly into a standard Python environment import package.

I do agree that disdrodb itself imports easily from PyPI. However, it does not seem like the PyPI package itself is usable without cloning https://github.com/ltelab/disdrodb-data, which cannot be installed via PyPI. Are you able to give some more context on why this is a separate repository rather than something generated from the package? It does feel a bit awkward currently, but I'm sure there were tradeoffs that impacted the decision to use this structure, so I'm curious to hear why this infrastructure was chosen!

@ghiggi
Author

ghiggi commented Feb 8, 2024

Hi @isabelizimm,

Thanks a lot for going through the software and thoroughly reviewing the documentation!
Below I address the points you raised.

Documentation Glitches

I've addressed the glitches you encountered in the CONTRIBUTING.rst and the README.md. For the CONTRIBUTING.rst, I replaced the references with URLs of the online documentation to allow compilation by Sphinx while ensuring the file remains readable on GitHub. As an alternative, we could consider moving CONTRIBUTING.rst to the docs/source directory and adding a symlink to the CONTRIBUTING file at the base of the repository. What is your opinion?

Reasons Behind the Creation of the disdrodb-data Repository

The decision to separate the disdrodb-data repository was made taking into consideration the project's structure and future development.

The disdrodb-data repository, or the DISDRODB Metadata Archive, serves as an online platform to track DISDRODB stations, with each station described by a metadata YAML file. This file includes a standard set of metadata fields detailing the instrument specifics, location, and the URL to the online data storage repository.
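For illustration, such a metadata YAML file might look like the following sketch (the field names and values are illustrative, not the exact DISDRODB metadata schema):

```yaml
# Hypothetical station metadata file; field names and values are
# illustrative, not the exact DISDRODB metadata schema.
data_source: EPFL
campaign_name: EXAMPLE_2024
station_name: STATION_1
sensor_name: OTT_Parsivel2
latitude: 46.52
longitude: 6.57
altitude: 400
disdrodb_data_url: https://example.org/record/0000000  # placeholder URL
```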

By cloning the disdrodb-data repository to a local machine, users essentially create a storage space for DISDRODB data downloads. This setup allows for easy updates to station metadata or the addition of new stations with a simple git pull, ensuring the local data remain up to date and accurate. The community's ability to report specific timestamps or periods of sensor malfunction in the DISDRODB Metadata Archive also enables iterative improvement of data quality, while maintaining the transparency and reproducibility of the DISDRODB product chain.

Given the potentially high frequency of updates to the disdrodb-data metadata repository compared to the less frequent updates of the disdrodb Python package routines, separating these components was a natural decision. This separation also accounts for GitHub's file size limits and the anticipated expansion of our archive.

It's worth noting that the disdrodb python package will, in the future, host processing algorithms for the DSD that do not necessarily require the disdrodb-data. This flexibility allows for broader application and utility of the software independent of the metadata repository.

We believe this structure best supports the project's current needs and future growth, balancing ease of use with the complexity of data management and software development.

Thank you once again for your constructive feedback and for opening up this discussion.

I am looking forward to any additional questions and suggestions you might have :)

@isabelizimm
Contributor

Hey there-- I've been thinking about this a bunch, and also have solicited some advice from the awesome pyOpenSci community on their experiences with storing data outside a package! Feel free to chime in on that thread as well if you would like (I couldn't find you on Slack to @)!

From your experience, it seems like the two repo solution has come from:

  1. having data that is too big to live inside the package in a reasonable way
  2. having data that has a much higher velocity than the package itself
  3. users having a comfort level with the tools (or at least providing documentation to help users past the sticking points)

I do think that this setup feels a bit awkward, and the complexity might turn people away from the package. BUT I think these problems are super solvable and there's some great tools out there to help!

A few options that might make sense:

  • adding functions in disdrodb that grab the data from the disdrodb-data
  • making disdrodb-data into its own package, and then disdrodb depending on that

Option 1: adding functions in disdrodb that grab the data from the disdrodb-data

This is a pretty classic "faking data in a package" move. The most common way to import data is something like from package_name.data import dataset_name, but it might make sense to implement a sort of load_dataset('dataset_name') functionality, similar to HuggingFace's datasets, which is a front end to fetch data from their hub. That way, maybe you could even have other arguments for date, or something (just brainstorming here 🧠 🎉)!
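Something along these lines, maybe (a totally hypothetical sketch: every name and URL below is made up!):

```python
# Totally hypothetical sketch of a load_dataset front end; the URL,
# names, and naive caching strategy are all made up for illustration.
import urllib.request
import xarray as xr

def load_dataset(station_name: str, cache_dir: str = ".") -> xr.Dataset:
    """Fetch a station's standardized netCDF file and open it."""
    url = f"https://example.org/disdrodb/{station_name}.nc"  # placeholder URL
    local_path = f"{cache_dir}/{station_name}.nc"
    urllib.request.urlretrieve(url, local_path)              # naive download/cache
    return xr.open_dataset(local_path)
```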

There is more information and more options in a section that is being added to pyOpenSci's packaging guide: https://github.com/pyOpenSci/python-package-guide/pull/110/files (it's currently in draft form, but contains lots of great info).

Tool ideas for loading data:

  • pooch: grab data from any URL! You could even host the data in Zenodo, Dropbox, or keep it right on GitHub. This one is probably the lowest lift, as its prime focus is fetching data from a URL (see the short sketch after this list).
  • pins: able to read "pins" (where a pin == your data) from "boards" such as a GitHub repo, DropBox, etc. There are less locations, but it does offer metadata on each pin that you can customize with information that you're currently hosting in .yml files (disclaimer: I'm also a maintainer 😄)
  • DVC: data version control. This is a bigger, more robust system, but it is especially made for versioning/storing/retrieving data.
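
To give a sense of how low-lift pooch is, fetching a file boils down to something like this (the URL is a placeholder):

```python
# Minimal pooch usage sketch; the URL is a placeholder.
import pooch

path = pooch.retrieve(
    url="https://example.org/files/station_data.zip",  # placeholder URL
    known_hash=None,  # in practice, pin a sha256 hash to verify integrity
)
```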

Option 2: making disdrodb-data into its own package, and then disdrodb depending on that

This is another classic move for data-- a good example of this is vega-altair, which has an optional dependency on vega-datasets which includes files in it, but also fetches datasets. There are definitely options out there if you don't want to host all the data on GitHub and are concerned about size limits; this solution might end up implementing a little bit of option 1 anyway, if you so desire.


Overall, the data can definitely stay in its current repository to help manage the velocity/size/overall organization of files, but I would strongly encourage taking one of these routes to allow users easier access to the data. I understand that it could be a big lift to make this change, but I do think it would be a great investment in the usability of disdrodb. Do you have any initial thoughts/feelings/questions on either or both of these options?

@ghiggi
Author

ghiggi commented Feb 21, 2024

Hey Isabel,

I tried joining the pyOpenSci Slack workspace but was unable to because of some access restrictions: I need an invitation, I guess 😉 Could you sort that out? You could send the invitation to [email protected]. Then we could also chat more smoothly if something is not clear.

Regarding your message, I noticed there might be some confusion around the disdrodb-data repository.
To clarify, it primarily serves as a catalog for metadata, not for the actual meteorological measurements!

Maybe we could rename the disdrodb-data repository as disdrodb-metadata to avoid misinterpretation.
Anyway, to ease the discussion, below I will refer to this repository as disdrodb-metadata.

The actual station data are stored in cloud repositories such as Zenodo (see here for some examples).
Using the URL information included in the station metadata YAML files of the disdrodb-metadata repository, the disdrodb package downloads the station data from the cloud (i.e. Zenodo) using pooch 😉
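Conceptually, the mechanism boils down to something like this sketch (the file path and the YAML field name are illustrative):

```python
# Illustrative sketch of how a station metadata file drives the data
# download; the file path and the YAML field name are assumptions.
import pooch
import yaml

with open("metadata/STATION_1.yml") as f:  # illustrative path
    metadata = yaml.safe_load(f)

# Fetch the station data from the URL recorded in the metadata file
path = pooch.retrieve(url=metadata["disdrodb_data_url"], known_hash=None)
```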

We chose GitHub for the metadata repository for several reasons:

  • It allows for direct, online modifications to the metadata without needing to clone or download/upload files.
  • It provides a robust mechanism for tracking changes over time.
  • It enables us to run specialized CI checks to ensure metadata modifications meet our standards (see for example this dummy PR).

When starting the project, we thought about using DVC for managing the metadata archive, but decided against it due to the added complexity, which might deter individuals with limited IT experience. Additionally, DVC seemed excessive for our needs.

Opting for the two-repo solution was a deliberate choice to facilitate frequent updates to the metadata without necessitating constant releases of the disdrodb Python package each time a metadata file is edited or a new station is added to the metadata archive.

This approach might look unusual, but as far as I know, it's also the first attempt in the landscape of meteorological (and non-meteorological?) data management to create a decentralized online data archive infrastructure backed by a GitHub-powered metadata repository 😊

Regarding the options you proposed, they seem to stem from a misunderstanding of the disdrodb-metadata repository's role, but I will try to address both points:

Option 2: making disdrodb-metadata into its own package, and then disdrodb depending on that

The suggestion to package disdrodb-metadata seems to misunderstand its purpose.

If I understand the suggestion correctly, you assumed disdrodb-data contained the actual meteorological data, right?

The disdrodb-metadata repository does not contain any code or any meteorological measurements.
It holds station metadata files in a structured directory. By cloning this repository, we want the user to have this structure on disk so that the disdrodb package can exploit the metadata files (and the station URLs they specify) to download the actual station data, which are instead stored in online scientific data repositories.

We want this (cloned) directory to be kept in synchronization with the main repository so that any repository update is immediately available to the user, without requiring new package releases. A packaged solution conflicts with the potentially high frequency of repository updates.

Option 1: adding functions in disdrodb that grab the data from the disdrodb-metadata

The disdrodb-metadata repository contains YAML metadata files that list and describe weather stations, including URLs for actual data stored online, not the data measurements themselves.

Users interested in downloading disdrometer data simply:

  1. Install disdrodb with: pip install disdrodb or conda install disdrodb
  2. Clone disdrodb-metadata via a terminal command: git clone [email protected]:ltelab/disdrodb-metadata.git
  3. Indicate the local path to the /disdrodb-metadata/DISDRODB directory in DISDRODB_BASE_DIR.
  4. The disdrodb software then fetches and downloads data from online sources like Zenodo into the cloned local disdrodb-metadata directory structure (that's why we have called the repo disdrodb-data instead of disdrodb-metadata: the repo name also determines the name of the directory where the data archive is then downloaded locally)
  5. Updates, including new stations, are easily integrated with a simple git pull, minimizing the need for deep knowledge of Git beyond initial setup.

I recognize that using (and installing) git could introduce some complexity, but we only require the user to type one git command in the terminal, nothing more. Knowledge of the git workflow is not required.

To eliminate the need for git for the average user, I see the following options:

  1. We could suggest downloading a zip file of the repository (i.e. with this link https://github.com/ltelab/disdrodb-data/archive/refs/heads/main.zip), then unzipping and moving it to the desired location of the local DISDRODB archive (defined by DISDRODB_BASE_DIR).
    Alternatively, we could integrate a utility function in the disdrodb Python package to download and unzip the disdrodb-metadata repository directly to the specified DISDRODB_BASE_DIR on the user's local machine (a minimal sketch of such a utility is given below).
    However, both options would prevent the local metadata archive from being updated, thus preventing the download of new stations. Additionally, this approach risks overwriting and losing data if the DISDRODB_BASE_DIR subdirectories have been modified by the user.

  2. We could enhance the disdrodb Python package with utilities to create the necessary local DISDRODB directory structure and then connect to (and download/update metadata from) the GitHub disdrodb-metadata repository upon user request, albeit with the risk of overwriting any local changes made by the user.

Implementing this second option would demand significant effort for functionality that git already offers efficiently and reliably.
Moreover, it could complicate the process for contributors wanting to add new stations to the DISDRODB data archive in several ways.
For example, under option 2, a new data contributor would have to prepare the metadata in the software-created directories (which are not synchronized with disdrodb-metadata and have no version control) and then copy-paste it into the dedicated directory in a branch of their fork of disdrodb-metadata. If edits are needed, they would have to copy files back to the other directory, check that they work, copy them over again, and so on.
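
For completeness, the zip-download utility mentioned in the first option above could be as simple as this minimal sketch (the function and variable names are illustrative):

```python
# Minimal sketch of a git-free utility that fetches and unpacks the
# metadata repository; function and variable names are illustrative.
import io
import os
import urllib.request
import zipfile

def download_metadata_archive(base_dir: str) -> None:
    """Download the disdrodb-data repository as a zip and extract it."""
    url = "https://github.com/ltelab/disdrodb-data/archive/refs/heads/main.zip"
    with urllib.request.urlopen(url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    os.makedirs(base_dir, exist_ok=True)
    # Note: the zip extracts into a 'disdrodb-data-main/' subfolder
    archive.extractall(base_dir)
```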


I hope you don't feel overwhelmed by all that I've just written down 😅

For me, all this discussion is also a great way to understand how we can further improve the documentation of our software 😃

@isabelizimm
Contributor

Oh! It looks like I have misunderstood the disdrodb-data repository. I won't have the bandwidth today to go through all the info you've given here, but will get back to you shortly. BUT--I just sent you an invite (I think!) to the Slack since it was a quick task 😄 please do let me know if you did not receive it!

@isabelizimm
Contributor

Okay, I am back! Thank you so much for your detailed response, that was super helpful to re-orient my brain to understand this setup.

My opinion is still that the workflow for new users is most ergonomic if they can pip install all the necessary artifacts, especially given the close-knit dependency between these two pieces of software. I do think that having git commands (even though I know it is just one very small one!) does raise the barrier to accessing the data. I like your suggested fix to fetch the .zip file; it seems like a reasonable helper function that should not create an overwhelming amount of work for you.

BUT since the intent of the disdrodb-data repository is more to hold metadata for data that might be updated and modified by a user/community, it does make sense that it is maintained separately from the package itself. From the information you provided, I think we are okay to continue on with the review with the setup you have right now. 👍

Let's keep this conversation going with an editor who has domain expertise in this specific type of data to help determine what would be most useful to the community, and have the reviewers (who will be closer to the codebase) give an extra eye to usability and the "getting started" feel.


So, tldr, no need to make any changes to the setup yet. I am actively looking for an editor, and will let you know when we have someone lined up for you!

@Batalex
Contributor

Batalex commented May 7, 2024

Hey there! Here's a long-awaited update!
@Zeitsperre will lead the review as the editor, guiding you through the process. I have updated the editor field as well.
Trevor will be your point of contact for the rest of the review, though you are welcome to ask anyone if you have any questions.
Happy review y'all 🤝

@ghiggi
Author

ghiggi commented May 8, 2024

Thanks! Looking forward to your feedback ;)

@Zeitsperre

Hi @ghiggi, nice to meet you!

I'll be looking into finding some reviewers this week to get the ball rolling here. This is my first time editing for PyOS, so please bear with me!

@Batalex assigned @Batalex and @Zeitsperre and unassigned @Batalex on May 20, 2024
@ghiggi
Author

ghiggi commented Jun 27, 2024

Hi @Zeitsperre. I hope you are doing well. Is there any news regarding the review of the software?
I don't want to put pressure on you, but when we submitted the software to pyOpenSci for review, we were planning to have the software manuscript published in JOSS by the end of the summer and to discuss the next steps with the community at a conference in September :). Do you think that this expectation is still realistic?

@Zeitsperre

Hi @ghiggi,

Apologies for the delay!

I have a few reviewers in mind that will be contacted this week. In terms of the review itself, it can sometimes take more than a month to get all review comments back. Depending on the amount of changes requested, the remaining changes could take some time and effort. I have seen some review turnarounds of a few months at most.

At the very least, for the purposes of your presentation you should be able to use either the Zenodo DOI, or perhaps there's a way to park a PyOS DOI? I'll look into that after contacting the reviewers.
