-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/19 improve readability of the table format #42
Conversation
I did not offer a parameter raw=False to choose between data tables as we suggested but rather added a new attribute. This is in line with how we call the raw_data and metadata but it could be good if this was the default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR and the work on this issue!
Sorry that I had so many columns but there is a lot of room for improvement and better naming as well as more performant code.
If you have any questions, feel free to ask me anytime
src/pystatis/table.py
Outdated
@@ -41,9 +41,40 @@ def get_data(self, area: str = "all", **kwargs): | |||
self.raw_data = raw_data | |||
data_str = StringIO(raw_data) | |||
self.data = pd.read_csv(data_str, sep=";") | |||
self.nice_data = format_table(self.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should not introduce a new class instance variable inside of methods, all instance variables should already be known or set in the __init__
method.
self.data
is meant to hold the "pretty" data, because we also have self.raw_data
which contains the raw data from the endpoint. So I think you can directly use self.data
and make it more readable. We can control this via a variable in the get_data
method like prettify=True
which, by default, makes the data more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you introduce to many class instance variables you make it harder for the user to use the class because they no longer know which attribute to choose, is it raw_data
, data
or nice_data
and what are the differences?
src/pystatis/table.py
Outdated
|
||
metadata = load_data( | ||
endpoint="metadata", method="table", params=params, as_json=True | ||
) | ||
assert isinstance(metadata, dict) # nosec assert_used | ||
self.metadata = metadata | ||
|
||
def format_table(data: pd.DataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is ok to put the method outside of the class, but you could also have chosen to make it a staticmethod
inside the class so it is more clear that this method only works for the table data. From a method outside I would actually expect that it is also used somewhere else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should find a better name of the function because you are not formatting the table, it is already formatted, in a flat csv file. What we actually do is reformatting the table to a better readable format, so you could think about something like reformat_table_data
or prettify_table
which is also very close to my suggested parameter prettify=True
src/pystatis/table.py
Outdated
|
||
metadata = load_data( | ||
endpoint="metadata", method="table", params=params, as_json=True | ||
) | ||
assert isinstance(metadata, dict) # nosec assert_used | ||
self.metadata = metadata | ||
|
||
def format_table(data: pd.DataFrame, | ||
) -> pd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to run the pre-commit pipeline over this, at least run black formatter, your style formatting is a little bit off like this empty bracket in a new line. We use the tool black
to format the code properly and in the same way
src/pystatis/table.py
Outdated
|
||
def format_table(data: pd.DataFrame, | ||
) -> pd.DataFrame: | ||
"""Format the raw data into a more readable table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
careful, you are not formatting the raw data but the data, so self.data
, instead of self.raw_data
, otherwise you would get a string and not the pandas DataFrame
src/pystatis/table.py
Outdated
"""Format the raw data into a more readable table | ||
|
||
Args: | ||
data (pd.DataFrame): A pandas dataframe created with get_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of "created" it would be better to say "returned by" to make it clear that this is the return value of the mentioned method
src/pystatis/table.py
Outdated
time_name, = data["Zeit_Label"].unique() | ||
time_values = data["Zeit"] | ||
|
||
merkmal_labels = data.filter(like="Merkmal_Label").columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that in Germany this is called "Merkmal" but it reads strange in an English code base, so maybe we could use the more appropriate "attributes" here or something like this also I admit that it is strange to have the German filter text as the like
attribute...
src/pystatis/table.py
Outdated
indep_names = [data[name].unique()[0] for name in merkmal_labels] # list of column names from Merkmal_Label | ||
|
||
auspraegung_labels = data.filter(like="Auspraegung_Label").columns | ||
indep_values = [data[name] for name in auspraegung_labels] # list of data from Ausgepragung_Label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are just getting a list of columns here right? Then it should be easier to do data[auspraegung_labels
directly or even even just data.filter()
which will give you the columns directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, instead of indep_values
I think it would be nice to have something like merkmal_column_values
src/pystatis/table.py
Outdated
auspraegung_labels = data.filter(like="Auspraegung_Label").columns | ||
indep_values = [data[name] for name in auspraegung_labels] # list of data from Ausgepragung_Label | ||
|
||
dep_values = data.loc[:,auspraegung_labels[-1]:].iloc[:,1:] # get all columns after last Auspraegung column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this reads very strange, maybe you have to explain a little more why this is necessary? In any way a simple drop would probably work better and read much easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we could probably use the filter
method again scanning for double underscore?
src/pystatis/table.py
Outdated
|
||
dep_values = data.loc[:,auspraegung_labels[-1]:].iloc[:,1:] # get all columns after last Auspraegung column | ||
dep_names = [" ".join(name.split('_')[1:]) | ||
for name in dep_values.columns] # splits strings in column names for readability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your comments are a little too generic. "splits strings in column names for readability" explains for one the obvious as the reader can see that you use the split method but it is not clear WHY you split this way. So a better comment would be to give a concrete example or explain why the first part of the name is omitted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think you are keeping the unit right? I think we should not do this but store this information somewhere else. So I would actually choose `[1:-1] to exclude both the code and the unit from the text.
AS comment something like this would be a good idea I guess:
# Given a name like BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000
# we extract the readable label and omit both the code and the unit
Also I just notice that you split by _
but the separator is __
so you could simplify your code again by:
[name.split('__')[1] for name in dep_values.columns]
and you don't need to join at all
src/pystatis/table.py
Outdated
|
||
nice_dict = {time_name:time_values, | ||
**dict(zip(indep_names, indep_values)), | ||
**dict(zip(dep_names, dep_values.values.T))} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is strange that you have to call .values.T here, we should try to fix this at it seems very odd and probably can be solved more easily. It is simple to miss and hard to understand while reading. I would have to test the code to see why you did it but dep_values is a dataframe so you could just merge the two dataframes, for example, or use the dataframes to_dict
method. So first overwrite the column names with dep_names
and than export the dataframe into a dict using to_dict. You need no zip for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
4b9f51d
to
92f7b16
Compare
* Bump version to next major version #9 * Revert flake8 to ^3.0 for docstrings #9 * add a notebook that shows how to run init_config * Make dev dependencies optional, update lock and README #9 * Update workflow install --with dev, add matrix poetry version #9 * Fix python and poetry version definition #9 * Fix python and poetry version definition #9 * fix lock file * update dev dependencies and add python-dotenv to dev * improve readme * update readme * Feat/8 handle multiple databases and users (#20) * change config module to handle multiple databases * finalize work on config module to handle multiple databases; significantly reduced lines of code by getting rid of the settings.ini * add a new db module that serves as a layer between the user and the config. Can set the current active database and get the settings from the config * simplify config module * refactor code to implement new config; correct tests * fix all remaining tests * fix all text issues * update notebooks according to latest changes in config * drop support for Python 3.9 due to pipe operator for types and set supported versions to 3.10 and 3.11 * fix problem with config dir creation during setup * fix isort * Improve clear_cache output for full wipe, remove unused import * Address all non global-related pylint issues #20 * because of complexity get rid of the current support of custom config dir and always use the default config dir under user home. * fix all tests; get rid of settings.ini and functionality for user to define own config path; pystatis supports only default config path but custom data cache path * fix all tests; get rid of settings.ini and functionality for user to define own config path; pystatis supports only default config path but custom data cache path * refactor config module to work with a ConfigParser global config object instead of overwriting the config variable within the functions using global (bad style according to pylint) * address pylint issues * fix mypy issues * fix pylint issues --------- Co-authored-by: MarcoHuebner <[email protected]> * update README to the latest changes of multi database support * Added lists of all available statistics and tables * Feat/10 update and auto deploy sphinx (#27) * Updated dev-dependencies, added first version of Sphinx documentation, including built html documentation. * Added Logo, updated theme, updated GitHub workflow, fixed docstrings in cache and cube. Hosting on ReadTheDocs has to be done by Owner/ CorrelAid (but can be requested and triggered that way). * Updated urllib3 version, but everything <2.0.0 (deprecating `strict`) should be fine... * Updated poetry as recommended in cachecontrol issue report. * Fixed black formatting, fixed make docs (is now ran by poetry). * Fixed linting issue, updated packages, updated make docs. * Updated ReadMe, added developer sphinx documentation, added custom pre-commit hook and changed to hard-coded version in docs, added built documentation to artifacts, #3 * Add deployment workflow, needs Repo updates * Update depencies for Sphinx documentation #10 * Remove redundant docu information #10 Render parts of the README.md in the respective .rst files * Remove unused mdinclude, fix run-test py version, update pre-commit #10 * Fix dependency group for SPhinx workflow #10 * Fix docstring parameter rendering in Sphinx #10 * Fix image rendering by mimicking folder structure #10 * Add comment on warnings related to ext.napoleon #10 * Rename deploy-docs #10 * Fix black format issue in conf.py #10 * Update deploy key, add deploy trigger comment #10 * Update documentation deploy workflow #10 * Switch to matrix.os definition #10 * Fix pull_request target in deploy workflow #10 * Update poetry.lock #10 * Import package version to Sphinx docu #10 * Manually fix black formatting issue #10 * With auto-deploy working, decrease retention days #10 * Update readme and Sphinx header references #10 * Fix deploy to update files on the remote #10 * fix cube functionality: it seems like structure of QEI header part was changed as well as DQA no longer has information about axis so we assume that the order is preserved (#43) * add jupytext and new nb for presentation * Feat/35 implement regex matching (#44) * Implemented regex matching, initial commit * Added credentials check for cubes and removed all references to set_db() * Implemented regex matching, initial commit * Added credentials check for cubes and removed all references to set_db() * fix tests * refactoring Find and Result class to work with new database detection logic; because find does not use names like Table and Cube, use has to specify the database * fix tests --------- Co-authored-by: Michael Aydinbas <[email protected]> Co-authored-by: Michael Aydinbas <[email protected]> * add presentation nb * remove presentation nb for now * Feat/19 improve readability of the table format (#42) * Reformatting the raw data tables for readability * Adding comments * Applied suggested changes and run code formatting * add tests for Table --------- Co-authored-by: Michael Aydinbas <[email protected]> * prepare Table so it can parse data from three different sources * Added description and examples of Find * implement parse logic for prettify zensus tables * fix pylint issues * edits on Find section * fixing overwritten changes * update presentation nb * add genesis parse code for regio, too, for the moment. * Feat/34 visualization examples (#48) * Add 02_Geo_visualization_example.ipynb * changed '-' to 0 instead of nan --> reproduce Simons result * new case study in visualization notebook, integration to presentation notebook * catch NA-values in read_csv and added Auspraegung_Code to table.py to have the unique region identifiers --------- Co-authored-by: jkrause <[email protected]> * final presentation nb and shape data; omit file check in pre-commit * fixed typo and beautified plots in presentation.ipynb /.py * add a first workaround for the new Zensus zip content type * fix all tests; separate Find and Results classes into own modules * update dependencies * update README * set version to 0.2 * remove Cubes from package for now; we no longer support cubes until they are requested * fix all tests; fix all relevant nb; * fix pylint issues * fix mypy issues * add documentation key * update changelog --------- Co-authored-by: MarcoHuebner <[email protected]> Co-authored-by: Pia <[email protected]> Co-authored-by: MarcoHuebner <[email protected]> Co-authored-by: zosiaboro <[email protected]> Co-authored-by: Zosia Borowska <[email protected]> Co-authored-by: jkrause123 <[email protected]> Co-authored-by: jkrause <[email protected]>
Added a nice_data attribute to Table to display a reformatted data table
Resolves #19