Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling kwargs to be passed to pandas.ExcelFile #877

Merged

Conversation

korsbakken
Copy link
Collaborator

@korsbakken korsbakken commented Aug 26, 2024

Detects kwargs passed to IamDataFrame.__init__ and to read_pandas that need to be passed to pandas.ExcelFile.__init__ to be handled properly.

This PR closes #876

Please confirm that this PR has done the following:

  • [ ] Tests Added
  • [ ] Documentation Added
  • [ ] Name of contributors Added to AUTHORS.rst
  • Description in RELEASE_NOTES.md Added

Tests: I have run existing tests, and tested locally that the new code reads Excel files correctly with engine="calamine" as a kwarg to pyam.IamDataFrame. But a new test under /tests/ that tests the added/fixed functionality (rather than just regression testing) would require that probably aren't intended to be part of the core requirements. If the lead developers would like me to do this, I would appreciate guidance on how to add a test that requires extra optional packages, without breaking the existing test suite.

Documentation: As far as I can tell, this is a bug fix, and shouldn't require new documentation. The one new helper function, utils.get_excel_file_with_kwargs, has a docstring that explains how it's used.

Name in AUTHORS.rst: This is a fairly trivial addition, and I don't need my name in the AUTHORS file for it. But I can add it in a update commit if you do want it there.

Description of PR

The PR modifies lines in pyam.core.IamDataFrame.__init__ and pyam.utils.read_pandas that create a pandas.ExcelFile instance to read an Excel file. The modifications ensure that keyword arguments that are recognized by pandas.ExcelFile.__init__ are passed on to the pandas.ExcelFile instance and removed from calls to later methods (which would cause, e.g., pandas.read_excel to raise an exception).

The modifications add a new helper function get_excel_file_with_kwargs to pyam.utils. The function takes an excel file path and arbitrary kwargs, extracts the keyword arguments that are accepted by pandas.ExcelFile.__init__, creates a new ExcelFile instance with those keyword arguments, and returns it along with a dict of the other keyword arguments that were not used.

The new functionality makes it possible to use the keyword arguments engine, storage_options and engine_kwargs with IamDataFrame.__init__ and utils.read_pandas when reading Excel files, which could cause an exception to be raised in the previous version.

Detects kwargs passed to `IamDataFrame.__init__` and to `read_pandas` that
need to be passed to `pandas.ExcelFile.__init__` to be handled properly.
@danielhuppmann
Copy link
Member

Thanks!

One suggestion for adding a test: add a few more keyword arguments, like header, usecols, skiprows, nrows, then make a copy of

def test_read_xls(test_df_year):
with

import_df = IamDataFrame(TEST_DATA_DIR / "test_df.xls", nrows=2)
assert_iamframe_equal(test_df_year.filter(scenario="scen_a"), import_df)

Copy link

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

Project coverage is 95.0%. Comparing base (ddbb88e) to head (6a300c2).
Report is 32 commits behind head on main.

Files Patch % Lines
pyam/utils.py 91.6% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main    #877   +/-   ##
=====================================
  Coverage   95.0%   95.0%           
=====================================
  Files         64      64           
  Lines       6134    6216   +82     
=====================================
+ Hits        5828    5910   +82     
  Misses       306     306           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Added a test that kwargs to `IamDataFrame.__init__` when reading from an Excel
file are passed on to `pandas.read_excel` and `pandas.ExcelFile`.
Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but you need to add the pandas-arguments to make that test strategy work.

tests/test_io.py Show resolved Hide resolved
pyam/utils.py Show resolved Hide resolved
@danielhuppmann
Copy link
Member

I guess that the pytest-legacy test (with pandas 2.1.2) is failing because engine_kwargs={"data_only": False} is not supported there? Can you add a pytest-skip for versions lower than 2.2 for that test? Then this PR is good to be merged!

@korsbakken
Copy link
Collaborator Author

I guess that the pytest-legacy test (with pandas 2.1.2) is failing because engine_kwargs={"data_only": False} is not supported there? Can you add a pytest-skip for versions lower than 2.2 for that test? Then this PR is good to be merged!

Yes. In fact, both the added tests fail with pandas < 2.2 for different reasons. test_read_xlsx_calamine fails because calamine wasn't added as an engine before 2.2.0. And test_read_xlsx_kwargs fails because pandas included hard-coded engine_kwargs that couldn't be overridden until the same version. I will add a warning about the latter if anyone tries to use engine_kwargs with pandas < 2.2 in the main code, and skip the tests in that case.

Pandas < 2.2.0 does not support calamine as an engine, and has hard-coded some
engine_kwargs and therefore does not fully support specifying a custom
engine_kwargs keyword argument to `pandas.ExcelFile`.
Updated workflows to include python-calamine in the testing environment.
@korsbakken
Copy link
Collaborator Author

I pushed a new version now, with a warning if anyone tries to use engine_kwargs with pandas version lower than 2.2.0, and skipping the tests if pandas is lower than that version.

In order to actually test the new functionality in CI, I also added a new optional dependency group calamine to pyproject.toml and updated the pytest workflows to include it. In order to make that work, I also had to update poetry.lock. If you would rather not include that, you can either reject the last two commits if that's possible, or just let me know and I will push a revert commit.

pyam/utils.py Outdated Show resolved Hide resolved
Only warn about `engine_kwargs` parameter not being fully supported at pandas < 2.2.0 for users who both have a lower pandas versions *and* actually attempt to use the engine_kwargs keyword.

Co-authored-by: Daniel Huppmann <[email protected]>
Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much, very nice improvement!

@danielhuppmann danielhuppmann merged commit e27e90e into IAMconsortium:main Aug 28, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kwargs are not passed to pandas.ExcelFile when reading Excel files with IamDataFrame
2 participants