Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check input frames have same columns - missingness #1611

Merged
merged 8 commits into from
Nov 8, 2023

Conversation

ADBond
Copy link
Contributor

@ADBond ADBond commented Sep 15, 2023

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Closes #808.

Give a brief description for the solution you have provided

The main thing here is that the property linker._input_columns now checks columns for all input dataframes, and gives a clear error message if they are not identical sets, outputting columns that are not common to all frames.

Other changes:

  • renamed linker._get_input_columns -> linker._input_columns (clearer name for property vs method)
  • linker._input_columns returns a list of InputColumns rather than just their names, for flexibility
    • and adjusted profile_data accordingly
  • used linker._input_columns for missingness chart so we get the check for free
  • fix missingness_chart docstring
  • added simple tests

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@@ -249,18 +249,38 @@ def __init__(
self.debug_mode = False

@property
def _get_input_columns(
def _input_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't see Sam's PR from last month, but it duplicates some of the functionality within the settings validation class.

I think the linker object is a better place to house this code, but can we consolidate the two?

All I need on my end is the column_names_by_input_df variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to trim remove the quotes from our column names, there's some code in the settings validator to do this too. Though, I don't think this is a necessity.

@ThomasHepworth ThomasHepworth merged commit 68aee62 into master Nov 8, 2023
@ThomasHepworth ThomasHepworth deleted the missingness-same-columns branch November 8, 2023 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missingness chart fails if tables don't have same columns
2 participants