Pandas simplification v2 #826

Open
wants to merge 3 commits into base: main

Conversation

CalCraven
Contributor

This PR replaces #814 and carries over the features requested there. The merge history of #814 contains commits that diverged when the project switched to ruff linting.

@CalCraven
Contributor Author

From previous PR:

This PR improves how a topology is converted to a dataframe. The conversion currently lives as a method on the topology; it is being moved to a convert_dataframe.py module. Several formats are available, each giving a useful default view of a topology:
- publication: all the parameter values you would want in a table for publication. Duplicates are removed so each parameter is listed only once.
- default: a set of default values that are useful to have.
- remove_duplicates: a smaller dataframe with duplicate rows removed.
- specific_columns: lets the user specify which columns appear in the dataframe.
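
As a rough illustration of what removing duplicate rows means here, the sketch below drops exact repeats from a list of row dictionaries while keeping the first occurrence. The row contents and the `remove_duplicate_rows` helper are hypothetical and are not taken from the PR; on an actual dataframe, pandas' `DataFrame.drop_duplicates()` does the equivalent.

```python
# Hypothetical rows standing in for per-site parameter records.
# The field names and values are illustrative only.
rows = [
    {"atom_type": "type_a", "charge": -0.18},
    {"atom_type": "type_b", "charge": 0.06},
    {"atom_type": "type_a", "charge": -0.18},  # exact repeat of the first row
]


def remove_duplicate_rows(rows):
    """Return rows with exact duplicates dropped, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        # A sorted tuple of items is hashable and independent of key order.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


unique_rows = remove_duplicate_rows(rows)
```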

A new function also generates dataframes that cover the parameters for a set of topologies.

Finally, a planned function will print the dataframes alongside the rdkit mols, which are labeled to match the dataframes.

TODO Checklist:

  • Error checking on arguments
  • Replace topology.py dataframe methods/tests
  • Doc strings
  • Handle units
  • Handle parameters that return lists
  • Handle parameters that return dictionaries
  • Return unique elements if style is publication
  • Handle parameter "all" better
  • Function to concatenate multiple topologies into one output
  • Remove replicate rows flag -> similar to the publication style, but without the atom_indices section added
  • Function to create a topology with all data from dataframe matching rdkit mol image
    • Could just also make this a format unique_types
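
For the "Handle units" item, one possible reading of the `handle_unyts="to_headers"` option is that a column's unit moves out of each cell and into the column header. The PR text does not spell this out, so the sketch below is a guess based on the option name. It uses plain `(value, unit_string)` pairs rather than unyt objects, and `units_to_headers` is a hypothetical helper.

```python
def units_to_headers(columns):
    """Move a shared unit from each cell into the column header.

    `columns` maps a column name to a list of (value, unit_string) pairs.
    This is an assumed interpretation of "to_headers", not the PR's code.
    """
    out = {}
    for name, cells in columns.items():
        units = {unit for _, unit in cells}
        if len(units) != 1:
            # A header can only carry one unit, so mixed columns are rejected.
            raise ValueError(f"column {name!r} does not have exactly one unit: {sorted(units)}")
        (unit,) = units
        out[f"{name} ({unit})"] = [value for value, _ in cells]
    return out


# Illustrative values only.
table = units_to_headers(
    {"charge": [(-0.18, "elementary_charge"), (0.06, "elementary_charge")]}
)
```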


# handle positions?
# handle connection_members
pass

Check warning

Code scanning / CodeQL

Unnecessary pass Warning library

Unnecessary 'pass' statement.
@CalCraven
Contributor Author

From a discussion with @Vtsoch: a few more general use cases should work through the arguments of the main function, to_dataframeDict.

dfDict = to_dataframeDict(ptop, parameters="sites", columns=["name", "atom_type.name", "atom_type.parameters", "charge", "molecule.name"], format="remove_duplicates")

This call should be included as an example, since the molecule and parameter info can be hard to find if you don't know how these attributes are parsed. I would even consider adding these attributes to the default format, since they're useful to know.

dfDict = to_dataframeDict(ptop, parameters=["sites", "bonds"], columns=["name"], format="specific_columns")

This currently fails. I think it should instead go through the columns and grab only the attributes that exist, skipping the others.


codecov bot commented Jun 16, 2024

Codecov Report

Attention: Patch coverage is 91.66667% with 13 lines in your changes missing coverage. Please review.

Project coverage is 93.31%. Comparing base (5a4f17d) to head (72037d5).
Report is 36 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| gmso/external/convert_dataframe.py | 91.55% | 13 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #826      +/-   ##
==========================================
- Coverage   94.07%   93.31%   -0.77%     
==========================================
  Files          65       66       +1     
  Lines        6953     7088     +135     
==========================================
+ Hits         6541     6614      +73     
- Misses        412      474      +62     


Contributor

@chrisjonesBSU chrisjonesBSU left a comment


Left a couple comments.

format: str = "default",
columns: list[str] = None,
handle_unyts: str = "to_headers",
) -> pd.DataFrame:
Contributor


This should be dict, correct?

gmso/external/convert_dataframe.py
@Zeerakkhan47 self-assigned this Oct 21, 2024
```


>>> gmso.external.convert_dataframe.to_dataframeDict(ptop, parameters='sites', columns=['charge'], handle_unyts="to_headers")
Contributor


@Zeerakkhan47 I think you could expand the information passed into the columns parameter here to include some of the things @CalCraven mentioned in his comment on June 16th.
