harmonize-wq #157

Open
18 of 30 tasks
jbousquin opened this issue Feb 8, 2024 · 38 comments

@jbousquin

jbousquin commented Feb 8, 2024

Submitting Author: Justin Bousquin (@jbousquin)
All current maintainers: (@jbousquin)
Package Name: harmonize-wq
One-Line Description of Package: Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats
Repository Link: https://github.com/USEPA/harmonize-wq
Version submitted: 0.4.0
Editor: @Batalex
Reviewer 1: @rcaneill
Reviewer 2: @Jacqui-123
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): 08/10/2024


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:
    The US EPA's Water Quality Portal (WQP) is a data warehouse that facilitates access to data stored in large water quality databases in a common format. There are tools to facilitate both publishing data to and retrieving data from WQP; harmonize-wq focuses on retrieved data: (1) cleaning it to ensure it meets the required quality standards, and (2) wrangling it into a more analytic-ready format. Although there are many examples where this has been done, standardized tools to perform this task could make it less time-intensive, more standardized, and more reproducible.

Scope

  • Please indicate which category or categories.
    Check out our package scope page to learn more about our
    scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization (see footnote 1)
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo

Community Partnerships

If your package is associated with an
existing community please check below:

  • For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

    • Who is the target audience and what are scientific applications of this package?
      Water quality domain experts trying to synthesize available data in a stream, bay, estuary, etc. More standardized data cleansing and wrangling allows outputs to be integrated into other tools in the water quality data pipeline, e.g., for integration into dashboards for visualization (Beck et al., 2021) or decision support tools (Booth et al., 2011).

    • Are there other Python packages that accomplish the same thing? If so, how does yours differ?
      No Python packages to my knowledge; there is one in R: USEPA/TADA

    • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted: Presubmission: harmonize-wq #132

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • uses an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a tutorial with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • I have read the author guide.
  • I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Footnotes

  1. Please fill out a pre-submission inquiry before submitting a data visualization package.

@isabelizimm
Contributor

Hello there @jbousquin, thank you for submitting this issue--welcome to the pyOpenSci community! Just wanted to let you know we've seen your issue. The next step is for us to run some initial checks, we will give that first feedback soon.

In the meantime, if you have any questions you can ask here or in our discourse.

@isabelizimm
Contributor

isabelizimm commented Feb 13, 2024

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci
review. Below are the basic checks that your package needs to pass
to begin our review. If some of these are missing, we will ask you
to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements
below.

  • Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
    • The package imports properly into a standard Python environment import package.
  • Fit The package meets criteria for fit and overlap.
  • Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
    • User-facing documentation that overviews how to install and start using the package.
    • Short tutorials that help a user understand how to use the package and what it can do for them.
    • API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
  • Core GitHub repository Files
    • README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
    • Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
    • Code of Conduct The package has a CODE_OF_CONDUCT.md file.
    • License The package has an OSI approved license.
      NOTE: We prefer that you have development instructions in your documentation too.
  • Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
  • Automated tests Package has a testing suite and is tested via a Continuous Integration service.
  • Repository The repository link resolves correctly.
  • Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

  • Initial onboarding survey was filled out
    We appreciate each maintainer of the package filling out this survey individually. 🙌
    Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌


Editor comments

As a Floridian, I do appreciate your tutorial locations 🐊

A few quick fixes:

  1. For the CODE_OF_CONDUCT file, it is optimal to have it at the root of the repository. Right now, it looks like yours is in docs/source/Code of Conduct.rst. I'd recommend moving that file, since that is the typical place people look for a CoC. Also, if it is in the root, it will show up as a "tab" next to your README, sort of how the MIT License is shown here 🎉
  2. Second, pending some sort of tool that requires it, you shouldn't need a separate [metadata] section in your pyproject.toml.

In the meantime, I'll start hunting for an editor to facilitate a review for you!

@jbousquin
Author

Thanks @isabelizimm - made those suggested changes on pyOpenSci-review branch. Let me know if there is anything else while we wait.

@isabelizimm
Contributor

No other tasks yet! That should be good to start. I think I've got an editor just about figured out, I will let you know for sure mid-next week.

@isabelizimm
Contributor

Update: @Batalex will be the editor for harmonize-wq, guiding you through the review process. He will be the point of contact for things from here on out (although I am still happy to answer any questions if you need me!), and I've updated the Editor field in the initial comment on this issue.

@Batalex
Contributor

Batalex commented Mar 3, 2024

Hey @jbousquin,
I am Alex, and I am delighted to be the editor for harmonize-wq!
During the coming week(s), I'll be looking into harmonize-wq's codebase and reaching out to potential reviewers. Meanwhile, feel free to ask me any questions you might have.

@jbousquin
Author

Thanks @Batalex. No questions so far, let me know if anything comes up.

@Batalex
Contributor

Batalex commented Mar 16, 2024

👋 Hi @rcaneill and @Jacqui-123! Thank you for volunteering to review for pyOpenSci!

Please don't hesitate to introduce yourselves. @jbousquin, I am pleased to announce that we found our A-team to proceed with the review.

Please fill out our pre-review survey

Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.

The following resources will help you complete your review:

  1. Here is the reviewers guide. This guide contains all the steps and information needed to complete your review.
  2. Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: April 8th

Reviewers: @rcaneill, @Jacqui-123
Due date: 2024/04/08

@rcaneill

@rcaneill survey completed.

I just filled the survey

@rcaneill

rcaneill commented Mar 18, 2024

Hi @jbousquin I am happy to review this package and will start soon :)

@jbousquin
Author

Thanks @rcaneill! Let me know as things come up :)

@rcaneill

rcaneill commented Mar 22, 2024

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work.

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README.
  • Installation instructions: for the development version of the package and any non-standard dependencies in README.
  • Vignette(s) demonstrating major functionality that runs successfully locally.
  • Function Documentation: for all user-facing functions.
  • Examples for all user-facing functions.
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
    • The package name is located after the badges, I guess that it is not an issue
  • Badges for:
    • Continuous integration and test coverage,
    • Docs building (if you have a documentation website),
    • A repostatus.org badge,
    • Python versions supported,
    • Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

  • Short description of package goals.
  • Package installation instructions
  • Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
    • Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  • Link to your documentation website.
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
  • Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider whether:

  • Package documentation is clear and easy to find and use.
  • The need for the package is clear
  • All functions have documentation and associated examples for use
  • The package is easy to install

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests:
    • All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
      • branch new_release_0_4_0
      • branch main at commit 81448a9
    • Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
    A few notable highlights to look at:

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 8-10


Review Comments

@Batalex
Contributor

Batalex commented Mar 31, 2024

Please find below a list of comments, with my own format (editor's privilege 🐈‍⬛ )
I tried to rank them so that you can prioritize your work. I'll complete this list as I revisit the package.

Praises

  • praise (general): The code and the docs are extra clean.
  • praise (general): Whenever I see pint, I'm happy!

Typos

  • typo (readme.md): l7 on package name
  • typo (readme.md, contributing.rst): double spaces

Nitpicks

  • nitpick (general): I recommend adding a new line at each full stop in a markdown or rst paragraph. This way, we keep the lines short in git (easier to spot diffs in PR, easier to pinpoint a line with an issue). No worries, a single new line is not rendered.
  • nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL

Discussions

  • discussion (convert.py): About the TODO - both points of view (regrouping constants in a single place, or having them defined near their place of use to avoid jumping around the code base) are valid. I am usually in favor of the former.

Suggestions

  • suggestion (domain.py): In harmonize_TADA_dict, we could use a groupby operation to avoid looping through the dataframe using python. TOCHECK
  • suggestion (domain.py): We could replace the following pattern for x in list(set(pandas_series)) by using the .unique method
  • suggestion (domain.py, basic.py): out_col_lookup does not need to be a function. Same for all other functions returning a dict. If we make those simple module-level dicts, we can still list the sources in the module docstring.
  • suggestion (convert.py): We could add "references" sections in the docstrings so that the sources are present in the website and not only in the source code.
  • suggestion (basis.py, general): By using pandas' methods, we could streamline some operations a little. The choice is ultimately yours; I prefer using existing methods over rolling my own implementations, even if that means that other folks need to go to the documentation website to understand what is going on.
    For instance, here is my proposition for set_basis
import numpy as np

def set_basis(df, mask, basis, basis_col):
    return df.assign(**{basis_col: np.where(mask, basis, np.nan)})

I find this implementation easier to read (though I understand that this is debatable), and it is also more efficient. I have noticed that you use this pattern quite a few times throughout the code base, so I figured this might interest you.
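
As a rough illustration of two of the suggestions above, the sketch below uses `Series.unique()` in place of `list(set(...))` and applies the `df.assign` pattern to a toy frame. It swaps in pandas' `Series.where` for `np.where`, so that a string basis and missing values sit in one object column; that substitution, the column names, and the values are assumptions made for the example and are not taken from harmonize-wq. Like the proposal, this version writes NaN wherever the mask is False, so whether it matches the current behaviour of `set_basis` would need checking.

```python
import pandas as pd


def set_basis(df, mask, basis, basis_col):
    """Return a copy of df with basis_col set to basis where mask is True.

    Rows where mask is False get NaN in basis_col.
    """
    # Build a constant column, then blank out the rows outside the mask.
    new_col = pd.Series(basis, index=df.index, dtype="object").where(mask)
    return df.assign(**{basis_col: new_col})


# Toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "CharacteristicName": ["Nitrogen", "Phosphorus", "Nitrogen"],
        "ResultMeasure": [1.2, 0.3, 0.9],
    }
)

# .unique() replaces list(set(series)) and keeps order of first appearance.
names = df["CharacteristicName"].unique()

mask = df["CharacteristicName"] == "Nitrogen"
out = set_basis(df, mask, "as N", "Speciation")
```

Because `assign` returns a new frame, the input `df` is left unchanged, which makes the function easier to test in isolation.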

Todos

  • todo (pyproject.toml): We should remove the metadata section.
  • todo (__init__.py): importlib.metadata was added in python 3.8, which is the minimal version supported by the package according to its pyproject.toml. The try .. except block should not be needed, even more so considering that importlib_metadata is not listed in the project requirements.
  • todo (basis.py): We could regroup the conditions branches in update_result_basis
  • todo (contributing.rst): To lower the cost of entry for potential contributors, let's make sure that we provide all the information they need. Consider adding a section describing how to set up their development environment (e.g. installing the test and docs dependencies).

Issues

  • issue (general): code quality (see below)
  • issue (domain.py): requests should be listed in the project's dependencies. The rationale is as follows: we should not import in our code any transitive dependency, because we have no guarantee that the primary dependency will not drop the former in a future update. As far as we know, dataretrieval could replace requests by httpx without notice in a patch release, which would break new harmonize-wq installations. The same can be said about pandas, though I agree it is unlikely that geopandas will change its backend dataframe lib.
  • issue (domain.py): We should specify what kind of exception we are expecting in re_case. Making a try except block too wide can lead to hard-to-debug issues.
  • issue (general): It seems that there are circular dependencies: harmonize -> visualize -> wrangle -> harmonize or clean -> wrangle -> clean as well. They do not raise an exception for now, but they will if any imported object is used at the module level. I strongly advise that we rework the project structure so that the files get imported in an acyclic fashion. It is also way easier to get familiar with the code base as a new contributor if the structure is predictable and linear.
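
To illustrate the point about over-wide `try`/`except` blocks, here is a hypothetical helper (not the actual `re_case` implementation, which is not shown in this thread) that catches only the exception it expects.

```python
def upper_or_original(value):
    """Upper-case value if it has an .upper() method, else return it unchanged."""
    try:
        return value.upper()
    except AttributeError:
        # Only inputs without an .upper() method (e.g. a float) land here;
        # any other exception still propagates and stays visible to the caller.
        return value


print(upper_or_original("mg/l"))
print(upper_or_original(1.5))
```

A bare `except:` in the same place would also swallow unrelated failures, such as a `KeyboardInterrupt` or a typo-induced `NameError`, which is what makes such bugs hard to trace.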

General recommendations

Code quality is important in a public package.
It is obvious that a great amount of care went into making harmonize_wq, but what I mean by code quality is having tools enforcing conventions across the code base.
Such conventions usually cover code format, and catching simple anti-patterns.

To do so, I would advise you to use both a linter and a formatter.
I usually recommend:

  • black for formatting the code
  • ruff to validate that the code follows good practices, and do quick fixes.

This is up for debate of course, some people might prefer one tool over another, but the point is that a project using such tools:

  • is more welcoming to external contributors
  • needs less time dedicated to low-value maintenance.

If you are ok with everything I said so far, I'd be happy to propose a PR to help you set up everything.

@jbousquin
Author

I'll start addressing these on a pyOpenSciReview branch (I'll try to be better about merging to main so other reviewers aren't running into the same things). Will generate an issue task list w/ any that are more involved. Let me know if there is anything else that I should be doing for review/edit tracking.

Would love a PR for black & ruff setup - have been running a linter and code analysis locally and definitely see the value for contributors/maintenance. Only concern is being able to easily ignore certain conventions when appropriate.
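
On ignoring conventions case by case: both tools support inline opt-outs, so a rule can be silenced for one line or one block without changing the project-wide configuration. A small sketch follows; the constant names and the URL are placeholders, not code from harmonize-wq.

```python
# ruff skips the listed rule on a line that ends with `# noqa: <code>`.
EXAMPLE_URL = "https://example.com/a/deliberately/long/path/that/would/otherwise/trip/the/line/length/check"  # noqa: E501

# black leaves everything between `# fmt: off` and `# fmt: on` untouched,
# which helps keep hand-aligned tables readable.
# fmt: off
IDENTITY = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
# fmt: on
```

Ruff also supports per-file ignores in its configuration, which may suit rules that never apply to, say, tutorial notebooks or test modules.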

@jbousquin
Author

@Batalex fixing issue (general): circular dependencies - will be a breaking change. To resolve I moved functions from harmonize, df_checks()/add_qa_flag() to clean, convert_unit_series() to convert and units_dimension() to wq_data (to become a method). These seemed as logical a place to find them as harmonize. Now importing specific functions from other modules where practical. This breaks docs - before addressing that I wanted to confirm this is what you had in mind?

@Batalex
Contributor

Batalex commented Apr 3, 2024

@jbousquin Based on a quick look through the PR, yes that's exactly what I had in mind

@Jacqui-123

Jacqui-123 commented Apr 17, 2024

Great package! I hope these comments are helpful. This was my first package review so please let me know if there is anything I missed or if I was misguided with any of my comments.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README.
  • Installation instructions: for the development version of the package and any non-standard dependencies in README.
  • Vignette(s) demonstrating major functionality that runs successfully locally.
  • Function Documentation: for all user-facing functions.
  • Examples for all user-facing functions.
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for:
    • Continuous integration and test coverage,
    • Docs building (if you have a documentation website),
    • A repostatus.org badge,
    • Python versions supported,
    • Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

  • Short description of package goals.
  • Package installation instructions
  • Any additional setup required to use the package (authentication tokens, etc.)
  • Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
    • Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  • Link to your documentation website.
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
  • Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider whether:

  • Package documentation is clear and easy to find and use.
  • The need for the package is clear
  • All functions have documentation and associated examples for use
  • The package is easy to install

Functionality (Skipped this)

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests:
    • All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
    • Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
    A few notable highlights to look at:
    • Package supports modern versions of Python and not End of life versions.
    • Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

approximately 8

Review Comments

  1. Harmonize_Pensacola.Rmd:
    -Small language changes suggested to make the installation process more user-friendly and clear:
    -make it clear when something is an option to run and when it's step-by-step instruction, as it switches back and forth in this demo. For example, could add "# Install the harmonize-wq package... [#option 1] package install... [#option 2] development version..."
    -Clearer separation of code chunks by task, so each code chunk focuses on a specific task. This makes debugging/error message interpretation easier. That is, a new code chunk after options(reticulate.conda_binary = "..."), new code chunks after conda_install() section (lines 72, 81). (For good examples see the .ipynb demo files for this package).
    -I think use_condaenv("wq_harmonize") should be use_condaenv("wq-reticulate") (line 90)

  2. Comments for Harmonize_CapeCod_Simple.ipynb
    -easy to follow and clearly documented
    -attribute errors for harmonize_all(df, errors='ignore'): AttributeError: 'float' object has no attribute 'upper' (these attribute errors happened a few times in the other demos, too.)

  3. usability:
    -"All functions have documentation and associated examples for use" -> I wasn't completely clear on exactly what each function did, particularly some of the cleaning/tidying ones and how they changed the resulting dataframe. For example, what are all the flag options in the QA_flag column and what does each of them mean? The overall package was really clear though in terms of what it was doing and how, but some of the nuances were less clear to me.

  4. I am curious to know if the package looks at or flags the different method detection limits (mdl) that different analytical laboratories often use, or if that is an issue with this dataset? I tend to run into this issue in my work but I don't typically work with EPA datasets.
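
The `AttributeError: 'float' object has no attribute 'upper'` reported in item 2 is the kind of error Python raises when a string method is called on a missing value, since pandas stores missing entries in object columns as float NaN. Whether that is the cause inside `harmonize_all` is not established in this thread; the sketch below only reproduces the general pattern on made-up data and shows the vectorised `.str` accessor, which passes NaN through.

```python
import numpy as np
import pandas as pd

# Made-up unit strings with one missing entry.
units = pd.Series(["mg/l", np.nan, "ug/l"], dtype="object")

# Calling .upper() element by element fails on the float NaN.
try:
    [u.upper() for u in units]
    raised = False
except AttributeError:
    raised = True

# The .str accessor applies the method to strings and leaves NaN as NaN.
upper_units = units.str.upper()
```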

@Batalex
Contributor

Batalex commented Apr 27, 2024

Hey @jbousquin,
I just want to give you a brief update. @rcaneill privately reached out to me, and needs some more time to proceed with the review due to personal reasons.
Meanwhile, you can proceed with the two reviews you have here so that we avoid stalling this issue for @Jacqui-123.
Does this arrangement work for you?

@jbousquin
Author

Hey @Batalex - that works for me. I've already been working through issues/suggestions as received/as I can.

@rcaneill

Hi @Batalex and @jbousquin, I finished my review (cf #157 (comment))
I have 0 knowledge about the water quality field, but I found the doc quite clear :)

@Batalex
Contributor

Batalex commented May 31, 2024

Hey @jbousquin,
I noticed that this review has been quite stale lately, and so has harmonize_wq's codebase.

Would you mind giving us a rough rundown on how and when you plan to address the reviews?
My goal here is to set the proper expectations for everyone and manage our reviewers' time effectively.

@jbousquin
Author

Hey @Batalex - yes a couple PRs in the pipeline I need to check tutorials on but had to back-burner with the holiday and field season coming up. Hoping to get those merged this week and that should resolve most of the major changes. I've been sitting on the ruff PR to see if I can work it out as a pre-commit, trying to avoid contributors having to have the dev depends where possible.

@Batalex
Contributor

Batalex commented Jul 1, 2024

Hello @jbousquin,

I sent you a reminder a month ago about the review going stale, and I have not seen any public activity on the repository ever since.

As I said before, the deadlines in our process are more like guidelines as to when we expect things to move forward. However, as the editor for this submission, I have a responsibility to the volunteers who gave their personal time to do an in-depth review of harmonize-wq, and even submitted PRs. It is okay to be late, but I expect you to be transparent and committed to moving forward with the review.
Per our review policy, I am putting this submission on hold, and will close it one month from now on if I do not see any change.
Thank you for your understanding.

@Batalex Batalex added the on-hold A tag to represent packages on review hold until we figure out a bigger issue associate with review label Jul 1, 2024
@jbousquin
Author

Hello @Batalex,

Apologies - I was hoping to have gotten the tutorials checked against changes and summaries of changes/responses copied over here before getting buried in field work in June. My intent is not to be non-transparent, just hopeful I would have had a chance to do those small tasks by now. Should see some movement this week. Thank you for your understanding.

@jbousquin
Author

Hey @Jacqui-123,
Thanks again for your review. Several changes over on the package repo I wanted to draw your attention to/responses to comments:

1. Harmonize_Pensacola.Rmd:
-Small language changes suggested to make the installation process more user-friendly and clear:
-make it clear when something is an option to run and when it's step-by-step instruction, as it switches back and forth in this demo. For example, could add "# Install the harmonize-wq package... [#option 1] package install... [#option 2] development version..."
-Clearer separation of code chunks by task, so each code chunk focuses on a specific task. This makes debugging/error message interpretation easier. Ie a new code chunk after options(reticulate.conda_binary = "..."), new code chunks after conda_install() section (lines 72, 81). (For good examples see the .ipynb demo files for this package).
-I think use_condaenv("wq_harmonize") should be use_condaenv("wq-reticulate") (line 90)

Two PRs (67 & 78 from branch 62) were used to make these suggested updates to the example for setting up and running the python package in R. The second PR focused on CI/CD tests via git actions that will render the rmd to help ensure there are not errors. One of the runners generates an artifact for easier inspection (e.g., https://github.com/USEPA/harmonize-wq/actions/runs/9811884248/artifacts/1672085755). Hopefully that will help make it easier to identify any further text edit suggestions you have.

2. Harmonize_CapeCod_Simple.ipynb
-attribute errors for harmonize_all(df, errors='ignore'): AttributeError: 'float' object has no attribute 'upper' (these attribute errors happened a few times in the other demos, too.)

Generated an issue for this, hard to reproduce but I have a feeling it has to do with dependency management and how you installed the package. I'm hopeful changes to the pyproject file will fix it (USEPA/harmonize-wq@b125f65), but if not we can try to dig into this error a bit more.

3. Usability:
-"All functions have documentation and associated examples for use" -> I wasn't completely clear on exactly each function did, particularly some of the cleaning/tidying ones and how they changed the resulting dataframe. For example, what are all the flag options in the QA_flag column and what do each of them mean? The overall package was really clear though in terms of what it was doing and how, but some of the nuances were less clear to me.

I suspect the issue here is having something that goes deeper than the function documentation, which is what the tutorials are meant to do. There are a lot of functions (50+), each should currently be documented in numpy style (with input/return parameter types/descriptions and examples). clean.add_qa_flag() is meant to be used by higher level functions to add QA_flags as the data are cleaned and harmonized, i.e., to make changes/assumptions that might have quality issues more transparent to the user and allow them to filter/remove on them if it doesn't meet their QA standards. Those higher-level functions should document the specific flag string used, e.g., basis.basis_from_unit() provides an example where speciation was updated by conflicting meta-data. However, the add_qa_flag() function is exposed to the user because we can't anticipate all data they may want to flag. The example shows a custom mask and flag text (it's a bit of a spam and eggs type example, simplified to show how it works) whereas examples in the tutorials are more 'real-world'. In e.g., Harmonize_Pensacola_Detailed.ipynb, code-block 11 we show how the docstring for harmonize_locations can be displayed, and that references how it implements 'QA_flag' to identify 'any row that has location based problems like limited decimal precision or an unknown input CRS'. In code-block 15 of the same demo we examine what QA_flags were assigned. In code-block 27-29 we look to this flag to help explain why ResultMeasure/MeasureUnitCode is NaN. There are several additional examples of this in that notebook and all of the detailed notebooks should follow a similar structure. Please let me know if any of the functions are missing documentation or examples in the docs, if you have suggestions for improving any of those descriptions/examples, if you have suggestions for improving the detailed tutorials to make the use of QA_flags clearer, etc.
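As a rough illustration of the mask-plus-flag idea described above, here is a minimal pandas sketch. It is not the package's clean.add_qa_flag() implementation: the helper name add_flag, the "; " separator, and the example flag strings are assumptions made only for this example.

```python
import pandas as pd


def add_flag(df, mask, flag, col="QA_flag"):
    """Append `flag` to `col` where `mask` is True (hypothetical helper)."""
    out = df.copy()
    if col not in out.columns:
        out[col] = pd.Series([None] * len(out), index=out.index, dtype=object)
    existing = out.loc[mask, col]
    # Keep earlier flags and join new text with "; " so nothing is overwritten.
    out.loc[mask, col] = [
        flag if pd.isna(old) else f"{old}; {flag}" for old in existing
    ]
    return out


# Made-up rows using WQP-style column names.
df = pd.DataFrame(
    {
        "ResultMeasureValue": [1.2, -5.0, 3.4],
        "ResultMeasure/MeasureUnitCode": ["mg/l", "mg/l", None],
    }
)
df = add_flag(df, df["ResultMeasureValue"] < 0, "Negative result value")
df = add_flag(df, df["ResultMeasure/MeasureUnitCode"].isna(), "Missing unit")

# Rows without any flag pass this (illustrative) QA filter.
clean = df[df["QA_flag"].isna()]
print(df["QA_flag"].tolist(), len(clean))
```

Because flags are plain strings in one column, a user can filter on them or inspect them row by row, which is the transparency goal described in the response above.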

4. I am curious to know if the package looks at or flags the different method detection limits (mdl) that different analytical laboratories often use, or if that is an issue with this dataset? I tend to run into this issue in my work but I don't typically work with EPA datasets.

This is 100% the direction of some future feature adds. Specifically, 17 plans to address detection limits. It is a multi-part problem though. The existing function will pull in detection limits from that specific meta-data table, but then it needs to be compared against results to determine if the result value was under it and a QA_flag needs to be assigned. If the result value is under the limit there are several alternatives to estimate values statistically (the user would have to actively choose to alter results in this way, but we could port the functionality from USEPA/EPATADA). However, as you've also identified, the data provider may have specified a method with a standard MDL, in which case the detection limit might not be in the meta-data table and might have to be inferred from those methods. Methods filtering 37 is the first step for that, where we start to develop a table/dict of standard methods and try to recognize them (a lot of differences in how they are entered). MDL could be associated with each as a col in that table/lookup or in a related table/lookup.
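The comparison step described above (result versus detection limit, then a flag) could look roughly like the plain-Python sketch below. The function name, the flag text, and the values are hypothetical; this is not an existing harmonize-wq feature.

```python
import math


def flag_below_detection_limit(result, limit, flag="Result below detection limit"):
    """Return `flag` when result < limit, else None (hypothetical helper)."""
    if result is None or limit is None:
        return None
    if math.isnan(result) or math.isnan(limit):
        return None
    return flag if result < limit else None


# (result value, detection limit) pairs in the same units; values are made up.
rows = [(0.05, 0.10), (0.50, 0.10), (0.20, None), (float("nan"), 0.10)]
flags = [flag_below_detection_limit(value, limit) for value, limit in rows]
print(flags)
```

Rows with no known limit are left unflagged here; inferring a limit from the analytical method, as discussed above, would be a separate lookup step.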

@jbousquin
Author

jbousquin commented Jul 19, 2024

@Batalex - weird I commented your responses a couple weeks ago, but just came back to make sure I hadn't missed anything from you and don't see that comment here... I'll try to re-create, mainly just copying over month old status from the repo (there is also follow-up on your draft PR that I'd written after as follow-up in case you didn't see it here)

@jbousquin
Author

@Batalex If you would like additional links/line numbers just let me know:

Typos
should be resolved as suggested

Nitpicks
nitpick (general)
should be resolved as suggested

nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL
This url is only used once at the moment, but is currently a raw string (1) to allow it to be easily integrated into feature adds (i.e., intend to use it more places, especially w/ WQX 2->3), and (2) for easier maintenance given the repo is still under development (e.g., like when the url recently changed).

Discussions
Kept it in convert module because fewer module references made ensuring no circular references easier. Already importing registry_adds_list from domains so there isn't a strong reason not to move it there if the need arises in the future.

Suggestions
suggestion (domain.py): In harmonize_TADA_dict, we could use a groupby operation to avoid looping through the dataframe using python. TOCHECK
should be resolved as suggested, was there more to the TOCHECK?
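For readers following along, the groupby suggestion can be sketched as below. The column names and data are made up, and this is not the actual harmonize_TADA_dict code; it only contrasts a per-key Python loop with a single groupby pass.

```python
import pandas as pd

# Made-up lookup table; column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "Characteristic": ["Nitrogen", "Nitrogen", "Secchi", "Nitrogen"],
        "Unit": ["mg/l", "ug/l", "m", "mg/l"],
    }
)

# Loop version: filter the frame once per key.
looped = {}
for name in df["Characteristic"].unique():
    looped[name] = df.loc[df["Characteristic"] == name, "Unit"].unique().tolist()

# groupby version: one pass over the frame, same result.
grouped = df.groupby("Characteristic")["Unit"].unique().apply(list).to_dict()

print(grouped)
```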

suggestion (domain.py): We could replace the following pattern for x in list(set(pandas_series)) by using the .unique method
should be resolved as suggested
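A small sketch of the difference between the two patterns, using made-up unit strings: a set has no guaranteed order, while .unique() returns values in order of first appearance.

```python
import pandas as pd

series = pd.Series(["mg/l", "ug/l", "mg/l", None, "ug/l"])

# Pattern named in the review: the order of a set is arbitrary.
via_set = list(set(series))

# pandas alternative: unique values in order of first appearance.
via_unique = series.unique().tolist()

# Dropping missing values first is often what a loop over units wants.
via_dropna = series.dropna().unique().tolist()

print(via_dropna)
```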

suggestion (domain.py, basic.py): out_col_lookup does not need to be a function. Same for all other functions returning a dict. If we make those simple module-level dicts, we can still list the sources in the module docstring.
These have been updated to be module-level dicts, but I'm not sure on how you are proposing the docstrings could be included. Hate to lose all the examples etc. on these, have you seen this in documentation for other projects you could point me to?

suggestion (convert.py): We could add "references" sections in the docstrings so that the sources are present in the website and not only in the source code.
When a conversion function has equation or methods references the documentation has a reference section for that (e.g., conductivity_to_PSU). However, if the information is for code/checks then it goes in as a comment in the code (e.g., the url in DO_concentration points to a converter written in JS). In those cases is it adequate/suggested to add contextual comments, e.g., # To check compare against:

suggestion (basis.py, general): By using pandas' methods, we could streamline a little some operations. The choice is ultimately yours; I prefer using existing methods over rolling my own implementations, even if that means that other folks need to go to the documentation website to understand what is going on.

I agree on using existing methods, I really tried to implement this suggestion but ran into issues. In the provided example if there are existing values in columns those need to be preserved. That can be done with an if/else. Additionally, numpy.where will coerce the other values (y) to the dtype which is problematic for nan. Do-able, but more complex than the current solution.
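One pandas-only way to keep existing values while filling the missing ones is Series.where, sketched below under assumptions: the column names, the regular expression, and the data are invented for illustration, and this is not the code in basis.py. Whether it is simpler than the current solution is a judgment call.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "unit": ["mg/l as N", "mg/l", "mg/l as P", "mg/l"],
        "basis": [None, "as NO3", None, None],  # row 1 already has a value
    },
    dtype=object,
)

# Hypothetical rule: read a basis from units that end in "as <something>".
from_unit = df["unit"].str.extract(r"(as \w+)$")[0]

# Keep basis where it is already set; otherwise take the extracted value.
# Rows with nothing to extract stay missing instead of becoming "nan" text.
df["basis"] = df["basis"].where(df["basis"].notna(), from_unit)

print(df["basis"].tolist())
```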

Todos
pyproject.toml & init
should be resolved as suggested
basis.py: regroup conditions in update_result_basis
Admittedly these additional basis columns haven't received much attention yet (not frequently leveraged by those entering data), and it was coded this way to make it easy to come back to and write additional specific handling. For now we combined weight/time, left particuleSize as is with added notes specific to its handling.

contributing.rst
Added dev section

Issues
domain.py: dependencies
Added the suggested dependencies (stop short of pandas but did include numpy). pyproject.toml should populate depends from requirements now - decreasing maintenance/risk of differences.
domain.py: specify exception expected by re_case
Resolved as suggested
Circular dependencies
should be resolved as suggested

General recommendations

To summarize, working on implementing black. All the code changes are sitting on the pyOpenSci-review branch. It runs locally as suggested in your PR. I'm trying to get my head around pre-commits so that contributors will have style/format checks without having to run it locally.

@jbousquin
Author

@rcaneill - Really appreciate your doing issues/PRs over on the repo (saves steps!). I think we resolved everything over there (leaving the citation issue open so it gets resolved after), but let me know if I missed anything from your review here.

@Batalex
Contributor

Batalex commented Jul 29, 2024

@jbousquin, here is some quick feedback.

nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL
This url is only used once at the moment, but is currently a raw string (1) to allow it to be easily integrated into feature adds (i.e., intend to use it more places, especially w/ WQX 2->3), and (2) for easier maintenance given the repo is still under development (e.g., like when the url recently changed).

I am not sure how using a raw string is relevant to the reasons you mentioned. Maybe we are not talking about the same thing: I am speaking about the r prefix in r"http://url.com". Raw strings are usually used in regular expressions.
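A short sketch of what the r prefix does and does not change. The URL is the placeholder from the sentence above, not the package's actual TADA_DATA_URL.

```python
import re

# Placeholder URL from the comment above; not the real TADA_DATA_URL.
raw_url = r"http://url.com"
plain_url = "http://url.com"

# With no backslashes in the literal, the r prefix changes nothing.
same_without_backslash = raw_url == plain_url

# The prefix only matters when backslashes are present, e.g. in a regex.
raw_pattern = r"\d+"
escaped_pattern = "\\d+"  # same string, written with an escaped backslash
same_pattern = raw_pattern == escaped_pattern

digits = re.findall(raw_pattern, "WQX 2 -> 3")

print(same_without_backslash, same_pattern, digits)
```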

suggestion (domain.py, basic.py): out_col_lookup does not need to be a function. Same for all other functions returning a dict. If we make those simple module-level dicts, we can still list the sources in the module docstring.
These have been updated to be module-level dicts, but I'm not sure on how you are proposing the docstrings could be included. Hate to lose all the examples etc. on these, have you seen this in documentation for other projects you could point me to

The idea would be to add the sources and any relevant information in the module docstring:

constants.py

"""
Constants submodule.


References
-----------

Planck:
The NIST Reference on Constants, Units, and Uncertainty. [NIST](https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology). 20 May 2019.
"""

planck = 6.62607015e-34

Then you can access the source using help on the submodule, just like you would on a function. python -c "import constants; help(constants)"

Help on module constants:

NAME
    constants - Constants submodule.

DESCRIPTION

    References
    -----------

    Planck:
    The NIST Reference on Constants, Units, and Uncertainty. NIST. 20 May 2019.

DATA
    planck = 6.62607015e-34

As for the rest of my original points, I am okay with the changes / reasons not to change. Nice job!

@Batalex
Contributor

Batalex commented Jul 29, 2024

@Jacqui-123, @rcaneill Were your concerns addressed?

jbousquin added a commit to USEPA/harmonize-wq that referenced this issue Jul 29, 2024
Doesn't need to be raw string (see Batalex pyOpenSci/software-submission#157 (comment))
@jbousquin
Author

@Batalex RE quick feedback:

Ah! You really did mean it being raw string not it being a constant, resolved on branch (passing, will merge with the linting).

docstrings for dict constants - what I was stuck on was what to document it as if module level (''Attributes'' for sphinx). I'm not sure how to do the child level of an attribute, e.g., Examples, but I'll play around with it. docstring at the variable I wasn't sure how to associate it (still not sure of that, but looking at the sphinx doc helped me understand it needed to be after), documented that way the child level works, but I see where it doesn't seem to be part of the module level help, and I'm not sure how you would get help to retrieve the variable level doc-string (will look into that if module level doesn't work out).
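The behaviour described above can be reproduced with the standard library alone. In this sketch the module name constants_demo and the dict contents are hypothetical; only the name out_col_lookup comes from the thread. It shows that a module docstring survives at runtime, while a string literal placed after an assignment does not attach to the variable, which is why help() does not show it even though Sphinx autodoc can read it from the source.

```python
import types

# Hypothetical module source: a module docstring plus a dict constant
# followed by a string literal (the attribute docstring style Sphinx reads).
source = '''
"""Constants submodule.

Attributes
----------
out_col_lookup : dict
    Example lookup table (illustrative only).
"""

out_col_lookup = {"Secchi": "Secchi"}
"""Variable-level docstring, placed after the assignment."""
'''

module = types.ModuleType("constants_demo")
exec(source, module.__dict__)

# The module docstring is kept at runtime, so help(module) can display it.
module_doc = module.__doc__

# The string after the assignment is evaluated and discarded; the dict
# only has the generic docstring of the built-in dict type.
dict_doc_is_builtin = module.out_col_lookup.__doc__ == dict.__doc__

print("Attributes" in module_doc, dict_doc_is_builtin)
```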

@Jacqui-123

@jbousquin Thanks so much for the detailed response to my review/comments. The changes look great, and I appreciate your explanations. @Batalex I don't have anything further to add but let me know if you need anything else.

@Batalex
Contributor

Batalex commented Jul 30, 2024

@jbousquin Thanks so much for the detailed response to my review/comments. The changes look great, and I appreciate your explanations. @Batalex I don't have anything further to add but let me know if you need anything else.

Perfect, I just need you to check the approval box in your review above. Thank you so much for contributing to this review!

@jbousquin
Author

@Batalex RE:RE quick feedback: module level doc-strings are passing for both help() and docs.

pre-commits are very close to working, just need ruff to see settings in pyproject.toml like it does when local. Tried a few things based on pre-commit issues but haven't solved it yet. Close to just writing them out in the config - but reluctant since that duplicates what is in the toml (more maintenance making sure they always match)

@rcaneill

@Batalex I am happy with the changes made / the answers when the authors disagreed with me

jbousquin added a commit to USEPA/harmonize-wq that referenced this issue Aug 3, 2024
* Implementing suggested ruff rules

* isort

* Fix whitespace (many of these were copied from docs example execution - need to confirm it passes docs tests)

* Run test.yml on push to this branch

* Whitespace

* F401 (redundant alias)

* Missed whitespace

* First attempt w/ pre-commit

* Fix indent

* indent/drop name

* Rename .pre-commit-config.yml to .pre-commit-config.yaml

yAml

* Update .pre-commit-config.yaml

fix file structure

* Reduce .pre-commit-config.yaml

Reduce what files it is run on

* Update domains.py

Doesn't need to be raw string (see Batalex pyOpenSci/software-submission#157 (comment))

* Dict doc strings as module level attributes

* Update to main (#88)

* Update domains.py

'Field' -> 'Field***'

* 62   r test ci (#86)

Update test_r.yaml to install conda outside r, specifically miniforge, then run on env from setup with current package (vs pip installing main)

* Update .pre-commit-config.yaml

From issue: pass_filenames: false in the pre-commit config so that the file discovery is done by Ruff taking into account the includes and excludes configured by the user in their pyproject.toml

* Update .pre-commit-config.yaml

Try updating to patch version and specify config in args.

* Update pyproject.toml

try without 'docstring-code-format = true' as this may override other settings.

* Update pyproject.toml

Try to get pre-commit to see config

* Update pyproject.toml

Warning message, so it is getting these settings from the toml?

* Update conf.py

E501

* Update basis.py

E501

* Update basis.py

Moved constant doc-string to module level

* Update clean.py

E501

* Update convert.py

E501

* Update conf.py

lint/format edits

* Update pyproject.toml

Without single checking if double is default

* Update pyproject.toml

Will move to one or the other (likely default double for ease), but trying to post-pone to work through diff

* lint/formatting

* linted

* W293

* black format/lint

* W605 - try pulling r str out of test doc-string and instead as a comment. Comment shouldn't cause problems but this one has in the past.

* I001 (all whitespace except test_harmonize_WQP.py)

* lint conf file

* lint

* Add white space between module doc-string and imports

* Format: add whitespace after mod doc-string

* Add assert for actual2 - where the characteristics specific function is used instead of the generic.

* Resolved some E501

* Check if new line fails doctest

* Revert to get doc-test passing

* Spread out example df entry

* Spread out dict read out to reduce line length. White space is already normalized for doc-test so this may pass.

* Revert

* Spread out building df for wq_data.WQCharData example.

* spread out example df for wq_data.measure_mask()

* Shorten len of dict for wq_data.replace_unit_str() & wq_data.apply_conversion() examples

* Attempt to skip E501 on this line

* skip rule on line

* Last attempt to ignore line too long in docstrings (3)

* Update pyproject.toml

Drop single quote for lint

* '' -> ""

* Update test.yml

Revert back to testing on main only
@jbousquin
Author

@Batalex - resolved ruff checks with pre-commits on PR 89, please let me know if there is anything unresolved from your review. Really happy getting lint/formatting as part of this workflow and thank you as the edits to the pyproject.toml in your draft PR helped immensely!

@Batalex
Contributor

Batalex commented Aug 10, 2024

🎉 harmonize-wq has been approved by pyOpenSci! Thank you @jbousquin for submitting harmonize-wq and many thanks to @rcaneill and @Jacqui-123 for reviewing this package! 😸

Author Wrap Up Tasks

There are a few things left to do to wrap up this submission:

  • Activate Zenodo watching the repo if you haven't already done so.
  • Tag and create a release to create a Zenodo version and DOI.
  • Add the badge for pyOpenSci peer-review to the README.md of harmonize-wq. The badge should be [![pyOpenSci Peer-Reviewed](https://pyopensci.org/badges/peer-reviewed.svg)](https://github.com/pyOpenSci/software-review/issues/157).
  • Please fill out the post-review survey. All maintainers and reviewers should fill this out.

It looks like you would like to submit this package to JOSS. Here are the next steps:

  • Login to the JOSS website and fill out the JOSS submission form using your Zenodo DOI. When you fill out the form, be sure to mention and link to the approved pyOpenSci review. JOSS will tag your package for expedited review if it is already pyOpenSci approved.
  • Wait for a JOSS editor to approve the presubmission (which includes a scope check).
  • Once the package is approved by JOSS, you will be given instructions by JOSS about updating the citation information in your README file.
  • When the JOSS review is complete, add a comment to your review in the pyOpenSci software-review repo here that it has been approved by JOSS. An editor will then add the JOSS-approved label to this issue.

🎉 Congratulations! You are now published with both JOSS and pyOpenSci! 🎉

Editor Final Checks

  • Make sure that the maintainers filled out the post-review survey
  • Invite the maintainers to submit a blog post highlighting their package. Feel free to use / adapt language found in this comment to help guide the author.
  • Change the status tag of the issue to 6/pyOS-approved 🚀🚀🚀.
  • Invite the package maintainer(s) and both reviewers to slack if they wish to join.
  • If the author submits to JOSS, please continue to update the labels for JOSS on this issue until the author is accepted (do not remove the 6/pyOS-approved label). Once accepted add the label 9/joss-approved to the issue. Skip this check if the package is not submitted to JOSS.
  • If the package is JOSS-accepted please add the JOSS doi to the YAML at the top of the issue.

If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.

@Batalex Batalex added 6/pyOS-approved and removed 4/reviews-in-awaiting-changes on-hold A tag to represent packages on review hold until we figure out a bigger issue associate with review labels Aug 10, 2024
@jbousquin
Author

jbousquin commented Aug 21, 2024

Author Wrap Up Tasks

Will update as tasks to wrap up this submission are completed:

  • Activate Zenodo watching the repo if you haven't already done so.
  • Tag and create a release to create a Zenodo version and DOI.
  • Add the badge for pyOpenSci peer-review to the README.md of harmonize-wq. The badge should be pyOpenSci Peer-Reviewed.
  • Please fill out the post-review survey. All maintainers and reviewers should fill this out.

It looks like you would like to submit this package to JOSS. Here are the next steps:

  • Login to the JOSS website and fill out the JOSS submission form using your Zenodo DOI. When you fill out the form, be sure to mention and link to the approved pyOpenSci review. JOSS will tag your package for expedited review if it is already pyOpenSci approved.
  • Wait for a JOSS editor to approve the presubmission (which includes a scope check).
  • Once the package is approved by JOSS, you will be given instructions by JOSS about updating the citation information in your README file.
  • When the JOSS review is complete, add a comment to your review in the pyOpenSci software-review repo here that it has been approved by JOSS. An editor will then add the JOSS-approved label to this issue.

Projects
Status: pyos-accepted