996 profiling upgrades #1379
Conversation
…mns. To encourage this to be modular and allow for easier adding of future graphs, I have restructured the profile_data.json file to be built iteratively by the components of all of the plots. This allows each individual plot to be turned on/off via a parameter: False for distribution plots, or None for top_n_plots and bottom_n_plots.
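As a minimal sketch of how those toggles might be used (assuming the profile_columns signature shown in the diffs further down):

linker.profile_columns(
    ["first_name", "substr(dob, 1, 4)"],
    top_n=None,               # None switches the top-n chart off
    bottom_n=None,            # None switches the bottom-n chart off
    distribution_plots=True,  # booleans toggle the distribution plot on/off
)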
Test: test_2_rounds_1k_duckdb. Percentage change: -12.8%
Test: test_2_rounds_1k_sqlite. Percentage change: -6.2%
splink/linker.py
Outdated
     self,
     column_expressions: str | list[str],
     top_n=10,
     bottom_n=10,
     kde_plots=False,
     distribution_plots=True,
 ):
-    return profile_columns(self, column_expressions, top_n=top_n, bottom_n=bottom_n)
+    return profile_columns(
+        self,
+        column_expressions,
+        top_n=top_n,
+        bottom_n=bottom_n,
+        kde_plots=kde_plots,
+        distribution_plots=distribution_plots,
+    )
Can you coordinate writing the docstring for profile_data with @afua-moj?
I think one of you should put in a PR for the current version of splink and then build on that in this PR and her upcoming PR.
I've written a docstring in this and another PR. Can update on Afua's branch once this (or the other) has been merged.
Have left as a work in progress until @afua-moj has merged in. I've scanned both branches and there'll be minimal conflicts, but will let her changes be merged in first.
…st to allow for this functionality to be reviewed.
f184686 to d33cf28
This is ready for review as-is, but will have conflicts with c8a6a26 - please review and merge this branch first, then I can rebase into this branch and ensure conflicts are resolved appropriately.
I really like the idea of Splink enabling more types of exploratory data analysis and I think the KDE option will be really useful. There are a few challenges that come to mind from an API/usability point of view that I wondered if we'd considered. (Also relevant to the profile arrays PR.) In summary, the challenge is around designing (expanding) the function signature to accommodate complex options. The input data has many columns of different types, whereas the arguments are currently booleans which are not column specific. To try and explain this a bit better, I think the difficulty arises from:
To give a couple of examples of function calls that could be confusing:
I wondered whether we might be at risk of overloading a single function with too many options that make it difficult to use, and potentially hide the useful new functionality. Could an alternative option be simply to have separate function(s) for numeric and array profiling, and leave the generic profile_columns as distribution only (could even be deprecated)? (Note: I'm not very sure about the best solution here - it's just one option and there may well be a better one.) Apologies in advance if I've misunderstood something; I haven't had time to pull the branch and actually run the code. Lastly, just for my understanding, I've not used KDE before. Is it preferable to a histogram?
I think your ideas in our previous discussion are also relevant here in terms of API design. I wonder whether it might make sense for a set of profiling functions to live outside the main Linker, so that you can pass any dataframe into them more easily (e.g. without needing Splink settings or instantiating a linker). Maybe there should be something like an exploratory module. (That's a very rough idea, I'm not suggesting those should be the precise names.) Also relevant: #1055
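As a very rough, hypothetical illustration of that idea (the module and function signature below are placeholders, not an existing Splink API):

import pandas as pd

# Hypothetical module of profiling functions that accept a plain dataframe,
# with no settings dictionary and no linker instantiation required
from splink.exploratory import profile_columns  # hypothetical import

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
profile_columns(df, ["first_name", "dob"], backend="duckdb")  # hypothetical signature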
I think I am in support of having a separate exploratory module (but haven't thought about it in great detail). I have never been a massive fan of having to instantiate a linker just to do some EDA 👍
I think I would prefer to keep a single profile_columns function.
Example
Almost certainly making this seem simpler than it would really be to implement, and possibly missing the point in the earlier discussion, so let me know if this is nonsense.
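A rough sketch of what a single entry point with column-specific options could look like (the parameter names here are hypothetical, not the reviewer's original example):

linker.profile_columns(
    ["first_name", "city", "email"],        # profiled with the default distribution/top-n charts
    numeric_columns=["substr(dob, 1, 4)"],  # hypothetical: columns to additionally profile as numeric
    kde_plots=True,
)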
My view is that we should retain profile_columns. There are several reasons:
On the idea of Altair-style names (
A lot of points and really useful discussion to dissect here, so I'll attempt to summarise. We have three options:
For me, 2) feels like we're trying to reinvent the wheel. There are a whole host of options for doing exploratory analysis of data generally, and to me profile_columns functions as a step that allows a user to easily get a view of the things we deem key. In this sense, I definitely agree with Sam that I like profile_columns being the go-to. However, I agree that adding further functionality to this becomes a bit "heavy", but we could be quite stringent about what we include. I guess the question here is: what else would we actually want to include, and are we throwing the baby out with the bathwater by abandoning this?

I really like the idea of 3), as I think it follows the Splink "feel", but I appreciate it's more difficult to engineer/maintain from our end than simple functions. From an engineering perspective though, I think the framework from this PR that allows iterative building of those graphs would allow this anyway. I'm not sure I agree with the point that this is difficult code for the user to write, and actually think it's simpler than multiple lines of a user calling graph 1 for X,Y,Z, then graph 2 for Y,I,J, etc.

The data typing issue feels like the largest issue here, but I'm not sure in practice whether it is an issue we have to solve. Within the current implementation, for example, if a kde_plot is called for a non-numeric variable, it simply renders the chart without any data, rather than erroring the function. I think it would be fairly simple to validate the entry data going into the chart (i.e. will it generate anything), and throw out an error message akin to "The [chart] for [column] appears to have invalid input data. This is usually triggered by an incorrect data type."

I will propose a fourth option to muddy the waters further. If we believe that data types are this complicated an issue, how would people feel about simply having two profile column functions: profile_columns and profile_columns_numeric (naming TBC), with the intention that one holds a set of graphs applicable to columns the user wants to treat as strings (e.g. A, B, C), and the other holds a set of graphs for columns the user wants to treat as numeric (e.g. B, C, D)? The intention is that profiling becomes a two-step process and produces two separate sets of graphs (see the sketch below). This would also allow the user to treat a single column as two different types, as Robin highlighted above, by including it in both sets of graphs.
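A sketch of the two-step profiling described in that fourth option (this mirrors the profile_numeric_columns function implemented later in this PR):

# Columns the user wants to treat as strings
linker.profile_columns(["first_name", "city", "email", "dob"])

# Columns the user wants to treat as numeric - note dob appears in both,
# treated as a string above and as a numeric year below
linker.profile_numeric_columns(["substr(dob, 1, 4)"], kde_plots=True)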
I think a separate profile_columns_numeric sounds like a good option. I'm quite focussed on data typing because in my experience it's usually harder than it looks when you're working across multiple tools (sql backends), and can easily lead to subtle bugs etc. It doesn't fully solve the 'how to profile a single df for a multi-df link job' question, but that functionality could probably be added later (to both profile_columns_numeric and profile_columns), e.g. by optionally passing in the data (or a reference to it). I definitely agree with the point that we need to keep an eye on value added, i.e. concentrate efforts on data vis that is very linkage-focussed, rather than re-implementing things that can be achieved with other tools.
Have implemented the agreed changes by creating a separate profile_numeric_columns. They are currently very similar (the only difference being a kernel density plot, which is blocked in the original). As additional features are added, it is expected they will diverge, but the base profile_columns function not associated with the linker will retain all functionality.
The path needs updating - docs/demos/02_Exploratory_analysis.ipynb -> docs/demos/tutorials/02_Exploratory_analysis.ipynb
Updated
splink/linker.py
Outdated
-    self, column_expressions: str | list[str] = None, top_n=10, bottom_n=10
+    self,
+    column_expressions: str | list[str],
+    top_n=10,
+    bottom_n=10,
+    distribution_plots=True,
 ):
-    """
-    Profiles the specified columns of the dataframe initiated with the linker.
-
-    This can be computationally expensive if the dataframe is large.
-
-    For the provided columns with column_expressions (or for all columns if
-    left empty) calculate:
-    - A distribution plot that shows the count of values at each percentile.
-    - A top n chart, that produces a chart showing the count of the top n values
-      within the column
-    - A bottom n chart, that produces a chart showing the count of the bottom
-      n values within the column
-
-    This should be used to explore the dataframe, determine if columns have
-    sufficient completeness for linking, analyse the cardinality of columns, and
-    identify the need for standardisation within a given column.
Did you mean to delete the docstring?
I did. I have added the docstring to the base profile_columns function. These functions sit as wrappers, so adding it here would involve duplicating the docstring. Happy to do so if needed for the markdown docs to function correctly, but is it needed?
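If duplication does turn out to be needed for the docs tooling, one lightweight alternative (a sketch only, not something agreed in this thread) is to point the wrapper's docstring at the base function's:

def profile_numeric_columns(self, column_expressions, **kwargs):
    return profile_columns(self, column_expressions, kde_plots=True, **kwargs)

# Reuse the base function's docstring so the wrapper shows the same help text
profile_numeric_columns.__doc__ = profile_columns.__doc__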
splink/linker.py
Outdated
 def profile_numeric_columns(
     self,
     column_expressions: str | list[str],
     top_n=10,
     bottom_n=10,
     kde_plots=False,
     distribution_plots=True,
 ):
     return profile_columns(
-        self, column_expressions=column_expressions, top_n=top_n, bottom_n=bottom_n
+        self,
+        column_expressions,
+        top_n=top_n,
+        bottom_n=bottom_n,
+        kde_plots=kde_plots,
+        distribution_plots=distribution_plots,
Needs a docstring
As above: the docstring is on the main function that this calls. Does it need repeating?
splink/linker.py
Outdated
    top_n=10,
    bottom_n=10,
    distribution_plots=True,
You might as well add some type annotations
Added
splink/linker.py
Outdated
    top_n=10,
    bottom_n=10,
    kde_plots=False,
    distribution_plots=True,
Type annotations
Added
"x": { | ||
"type": "quantitative", | ||
"field": "value", | ||
"title": "Value" | ||
}, |
Have amended. If you're interested, this is the difference between setting a variable to "quantitative" (as above) or "nominal" in the "type" field within Vega-Lite (see the changes in profile_data_kde.json).
I thought I'd fixed this, but I haven't found a way around it. Vega-Lite seems to autoformat these labels like this because the data type of "x" is quantitative. If we set this to nominal, as we do in other charts, this will fix the labels, but produce graphs like this:
Obviously, we do not want KDE plots of non-numeric data, not least because these will render as wide as there are values in the data.
It also means that columns with valid numeric data end up like this:
When they should be like this:
This behaviour won't happen if the column is formatted as a date since this bug fix, but that won't apply to anything that is pulled via SQL statements (which we often do).
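For anyone following along, a small Altair sketch (synthetic data, illustrative only) of the quantitative vs nominal trade-off being described:

import altair as alt
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.normal(size=200)})

# "Q" (quantitative) gives a continuous axis, which is what a KDE needs,
# but Vega-Lite then auto-formats the tick labels as numbers
kde = (
    alt.Chart(df)
    .transform_density("value", as_=["value", "density"])
    .mark_area()
    .encode(x="value:Q", y="density:Q")
)

# "N" (nominal) fixes the label formatting but treats every distinct value
# as its own category, so the chart grows as wide as the number of values
bars = alt.Chart(df).mark_bar().encode(x="value:N", y="count()")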
splink/profile_data.py
Outdated
_top_n_plot = load_chart_definition("profile_data_top_n.json")
_bottom_n_plot = load_chart_definition("profile_data_bottom_n.json")
_kde_plot = load_chart_definition("profile_data_kde.json")
_correlation_plot = load_chart_definition("profile_data_correlation_heatmap.json")
This isn't used in the script. The linter will fail if you try to run it.
Code left in by mistake from the larger PR; removed.
splink/profile_data.py
Outdated
    return sql


def _get_df_correlations(column_expressions):
I think this code is failing for me in postgres and sqlite. I'll post the test I am running in a second.
For postgres, you'll need Docker and to run source scripts/postgres/setup.sh.
From there, you can use:

from tests.helpers import PostgresTestHelper
from tests.basic_settings import get_settings_dict
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://splinkognito:splink123!"
    "@localhost:5432/splink_db"
)

helper = PostgresTestHelper(engine)
Linker = helper.Linker
brl = helper.brl

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
linker = Linker(df, get_settings_dict(), **helper.extra_linker_args())

linker.profile_numeric_columns(
    ["substr(dob, 1,4)"], top_n=None, bottom_n=None, kde_plots=True, distribution_plots=False
)

To generate the error.
I would probably split up your CTE expressions first and then attempt to run the code above in debug mode. That should help you narrow down what's going wrong in the background.
This will ultimately all come down to a missing feature of postgres, but I'm not familiar enough with the db to be able to give you any clues.
Thanks for diagnosing. The code was actually left in in error, as this PR was originally meant to be larger but was cut down. Have removed this code, but will save this comment so that I can refer back when doing the other PR.
splink/profile_data.py
Outdated
sql = f""" | ||
WITH column_list AS ( | ||
SELECT unnest(ARRAY{column_expressions}) AS column_name | ||
), | ||
column_data AS ( | ||
SELECT {', '.join(column_expressions)} | ||
FROM __splink__df_concat | ||
) | ||
SELECT | ||
t1.column_name AS column1, | ||
t2.column_name AS column2, | ||
CORR(d1.val::DOUBLE PRECISION, d2.val::DOUBLE PRECISION) AS correlation | ||
FROM | ||
column_list AS t1 | ||
CROSS JOIN | ||
column_list AS t2 | ||
JOIN | ||
( | ||
SELECT | ||
unnest(ARRAY{column_expressions}) AS column_name, | ||
unnest(ARRAY[{', '.join(f't.{column_name}' for column_name in column_expressions)}]) AS val | ||
FROM column_data AS t | ||
) AS d1 ON t1.column_name = d1.column_name | ||
JOIN | ||
( | ||
SELECT | ||
unnest(ARRAY{column_expressions}) AS column_name, | ||
unnest(ARRAY[{', '.join(f't.{column_name}' for column_name in column_expressions)}]) AS val | ||
FROM column_data AS t | ||
) AS d2 ON t2.column_name = d2.column_name | ||
WHERE | ||
t1.column_name < t2.column_name | ||
GROUP BY | ||
t1.column_name, | ||
t2.column_name | ||
ORDER BY | ||
t1.column_name, | ||
t2.column_name | ||
""" |
Sorry, I know it's a pain, but on this large SQL statement, can you split it up such that each CTE expression is its own SQL expression?
In Splink, we have a pipeline class that generates CTE expressions automatically when provided with raw SQL statements and table names.
This means:
- Debug mode is more effective. Each CTE expression gets run, meaning it's much easier to identify which section of code is breaking.
- The code is easier to read.
To summarise each step and give you some code examples:
- Write each individual SQL statement, adding them to a dictionary of the format {"sql": sql, "output_table_name": <str_name>}.
- Once you have these, you can queue up each SQL step using self._enqueue_sql (see the sketch after this list). This queues up a SQL statement and assigns it a CTE name.
- After step 2, you don't need to make any additional changes. However, for completeness, see our pipeline class. This processes your individual SQL statements, translating them into CTE expressions and generating your final pipeline.
As above, the code is no longer needed, but thank you for the explanation here, really helpful.
Could you add some tests to this code too? A really basic one I used to test your code was working is:

from .basic_settings import get_settings_dict
from .decorator import mark_with_dialects_excluding


@mark_with_dialects_excluding()
def test_profile_data(test_helpers, dialect):
    helper = test_helpers[dialect]
    settings = get_settings_dict()
    Linker = helper.Linker

    df = helper.load_frame_from_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
    linker = Linker(df, settings, **helper.extra_linker_args())

    linker.profile_columns(
        ["first_name", "city", "surname", "email", "substr(dob, 1,4)"], top_n=10, bottom_n=5
    )
    linker.profile_numeric_columns(
        ["substr(dob, 1,4)"], top_n=None, bottom_n=None, kde_plots=True, distribution_plots=False
    )

This acts as a very basic integration test to check the methods work on all of our backends.
Thanks Sam, I really love the new chart!! It's easy to read and I think it'll prove really valuable for evaluating numerical data at a glance.
We can easily identify outliers without expending too much effort.
The method is currently broken for postgres and sqlite. I'm not quite sure why or what's wrong, but hopefully the code snippets provided help.
It would be nice to get this merged and released by next week, but there's no pressure!
Hey. Just looked at this in a bit more detail. I think I might be misunderstanding something, but think there may be an issue for continuous numerical data, for which I think it always produces a 'square':
Example code
import altair as alt
import pandas as pd
import numpy as np
from splink.duckdb.comparison_library import (
exact_match,
levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
# Sort by first_name
df.sort_values("first_name", inplace=True)
# Generate random numbers influenced by alphabetical order
df["random_num_continuous"] = np.linspace(0, 1, len(df)) + np.random.uniform(
0, 1, len(df)
)
df["random_num_discrete"] = np.round(df["random_num_continuous"] * 2) / 2
chart = (
alt.Chart(df)
.transform_density(
"random_num_continuous", as_=["random_num_continuous", "density"], extent=[0, 2]
)
.mark_area()
.encode(x="random_num_continuous:Q", y="density:Q")
)
display(chart)
chart = (
alt.Chart(df)
.transform_density(
"random_num_discrete", as_=["random_num_discrete", "density"], extent=[0, 2]
)
.mark_area()
.encode(x="random_num_discrete:Q", y="density:Q")
)
display(chart)
settings = {
"link_type": "dedupe_only",
}
linker = DuckDBLinker(df, settings, connection=":memory:")
# linker.debug_mode=True
linker.profile_numeric_columns(
["random_num_continuous", "random_num_discrete"], kde_plots=True
)

In this example, Altair shows this for the random_num_continuous column:
Whereas Splink shows:
Is the chart actually a KDE? I've not really used them before, but they seem to perform some type of smoothing to estimate a distribution. Note that when I round the data (the
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
One of the many requests compiled within Issue 996
This PR aims to add kernel density plots for continuous variables to allow for easier profiling of columns. To add this graph, it became necessary to make the JSON files that profile_columns produces for the graphical outputs modular, allowing them to be pieced together according to the parameters the user inputs. This means that any individual graph can be turned on/off, and should make adding more graphs in future significantly easier.
PR Checklist