Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve error(s) in looping exercise solution. #670

Closed
wants to merge 1 commit into from

Conversation

davidwilby
Copy link
Contributor

Hi, I think there is an error in the last exercise's solution in episode 14: Looping Datasets.

I don't think that this is associated with an open issue, sorry if I'm mistaken.

The last exercise "Comparing Data" has the following solution:

import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`), 
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4] 
    dataframe.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()

However, this results in an error at in the call to dataframe.mean(), eg.:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 dataframe.mean()

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11666, in DataFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11658 @doc(make_doc("mean", ndim=2))
  11659 def mean(
  11660     self,
   (...)
  11664     **kwargs,
  11665 ):
> 11666     result = super().mean(axis, skipna, numeric_only, **kwargs)
  11667     if isinstance(result, Series):
  11668         result = result.__finalize__(self, method="mean")

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/generic.py:12413, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12406 def mean(
  12407     self,
  12408     axis: Axis | None = 0,
   (...)
  12411     **kwargs,
  12412 ) -> Series | float:
> 12413     return self._stat_function(
  12414         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12415     )

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/generic.py:12370, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12366 nv.validate_func(name, (), kwargs)
  12368 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12370 return self._reduce(
  12371     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12372 )

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11535, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11531     df = df.T
  11533 # After possibly _get_data and transposing, we are now in the
  11534 #  simple case where we can use BlockManager.reduce
> 11535 res = df._mgr.reduce(blk_func)
  11536 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11537 if out_dtype is not None and out.dtype != "boolean":

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/internals/managers.py:1501, in BlockManager.reduce(self, func)
   1499 res_blocks: list[Block] = []
   1500 for blk in self.blocks:
-> 1501     nbs = blk.reduce(func)
   1502     res_blocks.extend(nbs)
   1504 index = Index([None])  # placeholder

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/internals/blocks.py:404, in Block.reduce(self, func)
    398 @final
    399 def reduce(self, func) -> list[Block]:
    400     # We will apply the function and reshape the result into a single-row
    401     #  Block with the same mgr_locs; squeezing will be done at a higher level
    402     assert self.ndim == 2
--> 404     result = func(self.values)
    406     if self.values.ndim == 1:
    407         res_values = result

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11454, in DataFrame._reduce.<locals>.blk_func(values, axis)
  11452         return np.array([result])
  11453 else:
> 11454     return op(values, axis=axis, skipna=skipna, **kwds)

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):
    723     count = cast(np.ndarray, count)

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:1686, in _ensure_numeric(x)
   1683 inferred = lib.infer_dtype(x)
   1684 if inferred in ["string", "mixed"]:
   1685     # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1686     raise TypeError(f"Could not convert {x} to numeric")
   1687 try:
   1688     x = x.astype(np.complex128)

TypeError: Could not convert ['AustraliaNew Zealand'] to numeric

This should be corrected by using index_col="country" when reading the csv.

Secondly, there is another error caused due to the continent column being present in the gapminder_gdp_americas.csv (but not any of the others).

I appreciate that these may be deliberate mistakes, though I suspect not since this is within the solution.

Copy link

github-actions bot commented Feb 6, 2024

🆗 Pre-flight checks passed 😃

This pull request has been checked and contains no modified workflow files, spoofing, or invalid commits.

It should be safe to Approve and Run the workflows that need maintainer approval.

@davidwilby davidwilby changed the title Resolve error(s) in looping exercise. Resolve error(s) in looping exercise solution. Feb 6, 2024
@vahtras
Copy link
Contributor

vahtras commented Feb 7, 2024

Thank you! This used to be perfectly fine because pandas would silently ignore numeric operations, such as df.mean(), for columns where it did not make sense. Apparently, the approach is now to transform all columns to numeric format first, and then calculate the mean. I don't think this is a good idea, because when this fails, we get the traceback with TypeError we see.

Setting the index column to 'country' works for some files but one of the files (americas) also has the continent column (All values the same). This could be an oversight in the generation of the datafile, since the regional tables now do not follow the same pattern.

So what to do? Your suggestion together with cleaning up the data files would work. On the other hand life is full of unclean data so a more realistic approach could be to filter out the 'gdp' columns.

Something like

import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    region = filename.split('_')[-1][:-4]
    gdpdata = [c for c in dataframe.columns if c.startswith('gdp')]
    dataframe[gdpdata].mean().plot(ax=ax, label=region)
plt.legend()

which gives

gdp

Whoops, the x labels do not look good here either, they write into each other. We need to handle them somehow.

To summarise: great that you found this error! I myself haven't decided which solution is the best here and appreciate comments from other maintainers.

Olav

@alee
Copy link
Member

alee commented Feb 8, 2024

Thank you for catching this @davidwilby and taking the time to document and submit a PR for this! Ditto @vahtras for the additional context.

I like the filtering approach and think it's useful. To muddy the waters even more here's two other ways we could do this:

  1. use dataframe.filter e.g.,

dataframe.filter(like="gdp").mean().plot(...)

to only compute the mean on columns with gdp in the name.

  1. set the numeric_only kwarg to True when we compute the mean, e.g.,

dataframe.mean(numeric_only=True).plot(...)

to only compute the mean on columns with numeric data

@vahtras
Copy link
Contributor

vahtras commented Feb 8, 2024

Great options @alee! I did not know about filter...like. Both are fine with me, but I guess numeric_only is closest to the original solution, so I would prefer that in this case.

@davidwilby
Copy link
Contributor Author

Glad to be of help!

Depending on how kind you want to be to learners, it may be worth including a note in the exercise description that there's a complication/gotcha/extra challenge in this one. Along with a comment in the solution pointing out the reason for this.

alee added a commit to alee/python-novice-gapminder that referenced this pull request Mar 5, 2024
pandas now raises errors when computing the mean on a dataframe with
non-numeric columns

add a note to the challenge describing the issue and provide a few
additional ways of solving it

Thanks to @davidwilby for finding this bug and outlining possible
solutions to it in swcarpentry#670

Co-authored-by: David Wilby <[email protected]>
Co-authored-by: Olav Vahtras <[email protected]>
@alee
Copy link
Member

alee commented Mar 5, 2024

closing this in favor of #674 - thanks!

@alee alee closed this Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants