
996 profiling upgrades #1379

Closed · wants to merge 23 commits

Conversation

Contributor

@sama-ds commented Jun 29, 2023

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

One of the many requests compiled within Issue 996

This PR adds kernel density plots for continuous variables to allow for easier profiling of columns. To add this graph, it became necessary to make the json chart specifications produced by profile_columns modular, so that the graphical outputs can be pieced together according to the parameters the user supplies. This means any individual graph can be turned on or off, and it should make adding further graphs significantly easier in future.

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks at tutorials in splink_demos (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

sama-ds added 3 commits June 20, 2023 12:49
…mns. To keep this modular and make it easier to add future graphs, I have restructured the profile_data.json file so that it is built iteratively from the components of all of the plots. This allows each individual plot to be turned on/off via a parameter: False for distribution plots and None for top_n_plots and bottom_n_plots.
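A minimal, self-contained sketch of the on/off semantics described above (the helper structure and names here are illustrative only, not the code added in this PR):

```python
# Sketch only: how False (for distribution plots) and None (for top/bottom n)
# could switch individual chart components off before the spec is assembled.
def _assemble_chart_specs(distribution_plots=True, top_n=10, bottom_n=10,
                          kde_plots=False):
    specs = []
    if distribution_plots:          # False switches the distribution plot off
        specs.append({"chart": "distribution"})
    if top_n is not None:           # None switches the top-n plot off
        specs.append({"chart": "top_n", "n": top_n})
    if bottom_n is not None:        # None switches the bottom-n plot off
        specs.append({"chart": "bottom_n", "n": bottom_n})
    if kde_plots:
        specs.append({"chart": "kde"})
    return specs

# e.g. a KDE-only profile:
print(_assemble_chart_specs(distribution_plots=False, top_n=None,
                            bottom_n=None, kde_plots=True))
```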
@sama-ds sama-ds marked this pull request as draft June 29, 2023 14:07
Contributor

github-actions bot commented Jun 29, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -12.8%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1986 | 2023-08-30 | 15:07:58 | 1.67427 | 1.63405 | (detached head) | 732e2af | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0952 GHz | 732e2af |

Test: test_2_rounds_1k_sqlite

Percentage change: -6.2%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1988 | 2023-08-30 | 15:07:58 | 4.02793 | 3.99455 | (detached head) | 732e2af | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0952 GHz | 732e2af |


splink/linker.py Outdated
Comment on lines 2020 to 2034
self,
column_expressions: str | list[str],
top_n=10,
bottom_n=10,
kde_plots=False,
distribution_plots=True,
):
return profile_columns(self, column_expressions, top_n=top_n, bottom_n=bottom_n)
return profile_columns(
self,
column_expressions,
top_n=top_n,
bottom_n=bottom_n,
kde_plots=kde_plots,
distribution_plots=distribution_plots,
)
Contributor

Can you coordinate writing the docstring for profile_data with @afua-moj?

I think one of you should put in a PR for the current version of splink and then build on that in this PR and her upcoming PR.

Contributor Author

I've written a docstring in this PR and in another request. I can update Afua's branch once this (or the other) has been merged.

Contributor Author

sama-ds commented Jun 29, 2023

Have left this as a work in progress until @afua-moj's changes have been merged. I've scanned both branches and there will be minimal conflicts, but will let her changes go in first.

@sama-ds sama-ds force-pushed the 996_profiling_upgrades branch from f184686 to d33cf28 Compare August 30, 2023 15:06
@sama-ds sama-ds marked this pull request as ready for review August 30, 2023 15:09
Contributor Author

sama-ds commented Aug 30, 2023

This is ready for review as-is, but will have conflicts with c8a6a26 - please review and merge that branch first, then I can rebase this branch and ensure conflicts are resolved appropriately.

Member

RobinL commented Aug 31, 2023

I really like the idea of Splink enabling more types of exploratory data analysis and I think the KDE option will be really useful.

There are a few challenges that come to mind from an API/usability point of view that I wondered if we'd considered. (Also relevant to the profile arrays PR.)

In summary, the challenge is around designing (expanding) the function signature to accommodate complex options. The input data has many columns of different types, whereas the arguments are currently booleans which are not column-specific.

To try and explain this a bit better, I think the difficulty arises from:

  • Input columns of multiple types. For an input dataset with columns of multiple types, it's challenging to design the function parameters to allow different profiling to be run depending on data type. Especially because choices are not deterministic: you don't always want to explode arrays and you don't always want to treat numeric data as continuous.

  • Data typing. The existing distribution plots can be used with any data type, but kde and corr can only be used with continuous numeric data. This means it's possible to write a function call that is not 'executable' - e.g. asking for a kde on a 'first name'. At the moment, as far as I know, we don't do any data type checking in Splink, partly because it's hard to implement in a backend-agnostic way. It might therefore be hard to produce a descriptive error message.

To give a couple of examples of function calls that could be confusing:

# Example 1: Ambiguity in applying distribution and KDE plots when the data contains numeric and non-numeric columns
profile_columns(linker, ["first_name", "salary"], distribution_plots=True, kde_plots=True)

# Example 2: Suppose I want a distribution plot for age (i.e. treat it as nominal/categorical), and a kde plot for log(salary);
# it's difficult to design the function arguments to accommodate this without significant complexity
# (e.g. a list of booleans, or a dict like splink_settings)
profile_columns(linker, column_expressions=["age", "log(salary)"], distribution_plots=True, kde_plots=True)

# Example 3: Or when using the function with zero arguments
profile_columns(linker, distribution_plots=True, kde_plots=True)

I wondered whether we might be at risk of overloading a single function with too many options that make it difficult to use, and potentially hide the useful new functionality. Could an alternative option be simply to have separate function(s) for numeric and array profiling, and leave the generic profile_columns as distribution only (could even be deprecated)?

(Note: I'm not very sure about the best solution here - it's just one option and there may well be a better one.) Apologies in advance if I've misunderstood something; I haven't had time to pull the branch and actually run the code.

Lastly - just for my understanding - I've not used KDE before. Is it preferable to a histogram?

RobinL mentioned this pull request Aug 31, 2023
Member

RobinL commented Aug 31, 2023

I think your ideas in our previous discussion are also relevant here in terms of API design.

I wonder whether it might make sense for a set of profiling functions to live outside the main Linker, so that you can pass any dataframe into them more easily (e.g. without needing Splink settings or instantiating a linker).

Maybe there should be something like an exploratory.py module that offers a variety of profiling/exploratory functions, e.g. splink.exploratory.values_distribution(dataframe, column_expressions) and splink.exploratory.kde(dataframe, column_expressions).

(That's a very rough idea, I'm not suggesting those should be the precise names)

Also relevant: #1055
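
A very rough sketch of what such a module's surface could look like (the module path and function names are only illustrative, echoing the suggestion above; column expressions are treated as plain column names for simplicity):

```python
# splink/exploratory.py (hypothetical)
import pandas as pd


def values_distribution(dataframe: pd.DataFrame, column_expressions: list[str]) -> dict:
    """Value-frequency summaries, with no Linker or settings required."""
    return {col: dataframe[col].value_counts() for col in column_expressions}


def kde(dataframe: pd.DataFrame, column_expressions: list[str]) -> dict:
    """The cleaned numeric series a KDE chart would be estimated from."""
    return {
        col: pd.to_numeric(dataframe[col], errors="coerce").dropna()
        for col in column_expressions
    }
```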

Contributor

RossKen commented Sep 5, 2023

> I think your ideas in our previous discussion are also relevant here in terms of API design.
>
> I wonder whether it might make sense for a set of profiling functions to live outside the main Linker, so that you can pass any dataframe into them more easily (e.g. without needing Splink settings or instantiating a linker).
>
> Maybe there should be something like an exploratory.py module that offers a variety of profiling/exploratory functions, e.g. splink.exploratory.values_distribution(dataframe, column_expressions) and splink.exploratory.kde(dataframe, column_expressions).
>
> (That's a very rough idea, I'm not suggesting those should be the precise names)
>
> Also relevant: #1055

I think I am in support of having a separate exploratory module (but haven't thought about it in great detail). I have never been a massive fan of having to instantiate a linker just to do some EDA 👍

Contributor

samnlindsay commented Sep 5, 2023

@RobinL

> I wondered whether we might be at risk of overloading a single function with too many options that make it difficult to use, and potentially hide the useful new functionality. Could an alternative option be simply to have separate function(s) for numeric and array profiling, and leave the generic profile_columns as distribution only (could even be deprecated)?

I think I would prefer to keep a single profile_columns as a one-stop shop for summarising the data, rather than requiring separate functions for different kinds of columns and output. Could we use some kind of shorthand (like Altair's salary:Q for a numerical column and full_name:N for a categorical one), or the option to supply columns as:

  • a list (current distribution-only version) or
  • a dictionary (to specify how to profile each column).

Example

  1. Default - distributions only, as previously

cols = ["first_name", "salary"]
profile_columns(linker, cols)

  2. Equivalent, more verbose alternative

cols = {
  "first_name": {"dist": True, "kde": False},
  "salary": {"dist": True, "kde": False}
}
profile_columns(linker, cols)

  3. Per-column custom profiling as appropriate for data type

cols = {
  "first_name": {"dist": True, "kde": False},
  "salary": {"dist": False, "kde": True}
}
profile_columns(linker, cols)

Almost certainly I'm making this seem simpler than it would really be to implement, and I'm possibly missing the point of the earlier discussion, so let me know if this is nonsense.
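
A rough illustration of how the list/dict dual interface could be normalised internally (purely a sketch - the helper and option names are made up, not part of this PR):

```python
def _normalise_profile_spec(cols, default=None):
    """Expand the list shorthand into the explicit per-column dict form."""
    if default is None:
        default = {"dist": True, "kde": False}
    if isinstance(cols, dict):
        # per-column options supplied: fill in anything missing with the defaults
        return {col: {**default, **opts} for col, opts in cols.items()}
    # plain list supplied: every column gets the default (distribution-only) profile
    return {col: dict(default) for col in cols}


print(_normalise_profile_spec(["first_name", "salary"]))
print(_normalise_profile_spec({"salary": {"dist": False, "kde": True}}))
```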

Member

RobinL commented Sep 5, 2023

My view is that we should retain profile_columns in its current form as a simple way to get some basic charts (esp. the new zero argument version), but we should not add further complexity to it.

There are several reasons:

  • Code complexity: I worry that trying to do too much in profile_columns may make the code complicated and difficult to maintain. The code behind profile_columns is already pretty complicated, but a saving grace is that, because everything is (treated like) a string, at least the code doesn't have to deal with data typing issues.

  • Multiple input datasets: At the moment, profile_columns is confusing when there are multiple input datasets. It returns the profile of the concatenated dataframes, but it doesn't tell you that anywhere. If you want to profile a single input dataset you have to use a workaround (instantiate a new linker with one input dataset). Having separate functions means the 'damage' of a bug (see errors below) is contained to that single function and is probably easier to fix.

  • Extensibility: I think it's easier to add additional exploratory functionality if it doesn't all have to be rolled into a single function. We're already adding support for profiling array columns, KDE, and correlation. We might well want histograms at some future point.

  • Discoverability: If functionality is only available via specific non-default parameters, it may be hard for users to discover it (you don't get tab completion, for instance, and it's harder to present in the docs). It's easier to document several functions that do different things than one mega-function.

  • Difficulty of API design/complexity of function calls: I agree that it would be possible to design a settings={} parameter that controlled profile_columns in a similar way to how Splink settings control the Linker. But I don't think this code is particularly easy for the user to write, and from our (the maintainers') perspective there's also potentially quite a bit of complexity in error checking (do data types align with what the user has asked for, are the settings valid, etc.).

  • Naming: we're trying to be consistent about having long descriptive names.

  • Errors: At the moment, when using profile_columns it should be rare for users to experience SQL execution failures due to data typing. Adding functions that require numeric inputs makes the likelihood much higher, and so makes descriptive errors harder to raise. Arguably it should be accompanied by data type checking code (so we can raise a descriptive error message), but that's a pretty big piece of work that we probably don't want to get into right now.

On the idea of Altair-style names (full_name:N): I think there's a useful idea here, irrespective of the decision on the profile_columns API. One issue is whether it means (a) full_name should be treated as a string, or (b) full_name is a string. (Choices are not deterministic - e.g. for a 'transaction_amount' column, you might sometimes want it to be treated as a string even though it's a number, so that you notice that £9.99 is very common but £10.01 is not.)
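
For illustration, the shorthand could desugar to something like the following (a hypothetical parser, not proposed code, and it leaves the should-be/is ambiguity above untouched):

```python
def parse_column_shorthand(expr: str) -> tuple[str, str]:
    """Split 'salary:Q' into ('salary', 'quantitative'); default to nominal."""
    type_codes = {"Q": "quantitative", "N": "nominal"}
    if ":" in expr:
        column, code = expr.rsplit(":", 1)
        if code in type_codes:
            return column, type_codes[code]
    return expr, "nominal"


print(parse_column_shorthand("salary:Q"))     # ('salary', 'quantitative')
print(parse_column_shorthand("full_name:N"))  # ('full_name', 'nominal')
```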

Contributor Author

sama-ds commented Sep 11, 2023

A lot of points and really useful discussion to dissect here, so I'll attempt to summarise. We have three options:

  1. Accept this PR and the functionality within it
  2. Remove elements from this PR, and add the separated functionality to a set of exploratory tools that don't need a linker to be instantiated
  3. Alter this PR to include some elements of type casting, and allow the user to turn charts on/off for a given column

For me, 2) feels like we're trying to reinvent the wheel. There are a whole host of options for doing exploratory analysis of data generally, and to me profile_columns functions as a step that allows a user to easily get a view of the things we deem key. In this sense, I definitely agree with Sam that I like profile_columns being the go-to. However, I agree that adding further functionality to it becomes a bit "heavy", although we could be quite stringent about what we include. I guess the question here is: what else would we actually want to include, and are we throwing the baby out with the bathwater by abandoning this?

I really like the idea of 3), as I think it follows the Splink "feel", but I appreciate it's more difficult to engineer/maintain from our end than simple functions. From an engineering perspective, though, I think the framework from this PR that allows iterative building of those graphs would allow this anyway. I'm not sure I agree with the point that this is difficult code for the user to write, and actually think it's simpler than multiple lines of a user calling graph 1 for X,Y,Z, then graph 2 for Y,I,J, etc. The data typing issue feels like the largest issue here, but I'm not sure in practice whether it is an issue we have to solve. Within the current implementation, for example, if a kde_plot is called for a non-numeric variable, it simply renders the chart without any data, rather than raising an error. I think it would be fairly simple to validate the entry data going into the chart (i.e. will it generate anything), and throw out an error message akin to "The [chart] for [column] appears to have invalid input data. This is usually triggered by an incorrect data type."

I will propose a fourth option to muddy the waters further. If we believe that data types are such a complicated issue, how would people feel about simply having two profile column functions: profile_columns and profile_columns_numeric (naming TBC), with the intention that one holds a set of graphs applicable to columns the user wants to treat as strings (e.g. A,B,C), and the other holds a set of graphs for columns the user wants to treat as numeric (e.g. B,C,D)? The intention is that profiling becomes a two-step process and produces two separate sets of graphs. This would also allow the user to treat a single column as two different types, as Robin highlighted above, by including it in both sets of graphs.

Member

RobinL commented Sep 11, 2023

I think a profile_columns_numeric function seems like a good compromise - it isolates any problems with data typing and keeps code for different things more separate.

I'm quite focussed on data typing because, in my experience, it's usually harder than it looks when you're working across multiple tools (sql backends), and can easily lead to subtle bugs etc.

It doesn't fully solve the 'how to profile a single df for a multi-df link job' problem, but that functionality could probably be added later (to both profile_columns_numeric and profile_columns), e.g. by optionally passing in the data (or a reference to it).

I definitely agree with the point that we need to keep an eye on value added, i.e. concentrate efforts on data vis that is very linkage-focussed, rather than re-implementing things that can be achieved with other tools.

Contributor Author

sama-ds commented Sep 22, 2023

Have implemented the agreed changes by creating a separate profile_numeric_columns. The two methods are currently very similar (the only difference being the kernel density plot, which is blocked in the original). As additional features are added, it is expected they will diverge, but the base profile_columns function not associated with the linker will retain all functionality.

Contributor

The path needs updating - docs/demos/02_Exploratory_analysis.ipynb -> docs/demos/tutorials/02_Exploratory_analysis.ipynb

Contributor Author

Updated

splink/linker.py Outdated
Comment on lines 2083 to 2100
self, column_expressions: str | list[str] = None, top_n=10, bottom_n=10
self,
column_expressions: str | list[str],
top_n=10,
bottom_n=10,
distribution_plots=True,
):
"""
Profiles the specified columns of the dataframe initiated with the linker.

This can be computationally expensive if the dataframe is large.

For the provided columns with column_expressions (or for all columns if
left empty) calculate:
- A distribution plot that shows the count of values at each percentile.
- A top n chart, that produces a chart showing the count of the top n values
within the column
- A bottom n chart, that produces a chart showing the count of the bottom
n values within the column

This should be used to explore the dataframe, determine if columns have
sufficient completeness for linking, analyse the cardinality of columns, and
identify the need for standardisation within a given column.
Contributor

Did you mean to delete the docstring?

Contributor Author

I did. I have added the docstring to the base profile_columns function. These methods sit as wrappers around it, so adding it here would mean duplicating the docstring. Happy to do so if needed for the markdown docs to function correctly, but is it needed?
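
One possible way to avoid duplicating the text while still giving the wrappers a docstring would be to point them at the base function's docstring - a sketch only, assuming the base function lives in splink/profile_data.py as in current Splink:

```python
from splink.profile_data import profile_columns  # base function holding the docstring


def profile_columns_wrapper(linker, column_expressions=None, top_n=10, bottom_n=10):
    return profile_columns(linker, column_expressions, top_n=top_n, bottom_n=bottom_n)


# Re-use the base docstring so documentation tooling still picks it up
profile_columns_wrapper.__doc__ = profile_columns.__doc__
```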

splink/linker.py Outdated
Comment on lines 2098 to 2112
def profile_numeric_columns(
self,
column_expressions: str | list[str],
top_n=10,
bottom_n=10,
kde_plots=False,
distribution_plots=True,
):
return profile_columns(
self, column_expressions=column_expressions, top_n=top_n, bottom_n=bottom_n
self,
column_expressions,
top_n=top_n,
bottom_n=bottom_n,
kde_plots=kde_plots,
distribution_plots=distribution_plots,
Contributor

Needs a docstring

Contributor Author

As above - the docstring is on the main function that this calls. Does it need repeating?

splink/linker.py Outdated
Comment on lines 2085 to 2087
top_n=10,
bottom_n=10,
distribution_plots=True,
Contributor

You might as well add some type annotations

Contributor Author

Added

splink/linker.py Outdated
Comment on lines 2101 to 2104
top_n=10,
bottom_n=10,
kde_plots=False,
distribution_plots=True,
Contributor

Type annotations

Contributor Author

Added

Comment on lines 5 to 9
"x": {
"type": "quantitative",
"field": "value",
"title": "Value"
},
Contributor

I'm not familiar enough with vegalite and how it works under the hood, but I am assuming it is automatically adding in commas for our numerical values 👇

Screenshot 2023-10-05 at 16 42 08

Perhaps we should change this behaviour so that it only adds commas if there are more than four digits?

Contributor Author

Have amended - if you're interested, this is the difference between setting a variable to "quantitative" (as above) or "nominal" in the "type" field within vegalite (see the changes in profile_data_kde.json).

Contributor Author

I thought I'd fixed this, but I haven't found a way around it. Vegalite seems to autoformat these labels because the data type of "x" is quantitative. If we set it to nominal, as we do in other charts, this fixes the labels but produces graphs like this:
image

Obviously, we do not want KDE plots of non-numeric data, not least because these will render as wide as there are values in the data.

It also means that columns with valid numeric data end up like this:
image

When they should be like this:

image

This behaviour won't happen if the column is formatted as a date, since this bug fix, but that won't apply to anything that is pulled via SQL statements (which we often do).
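
For reference, one possible workaround would be to keep the x encoding quantitative but set an explicit d3 format on the axis - a sketch only, assuming load_chart_definition comes from splink.charts and that the x encoding sits at encoding.x in the spec; the "d" format suits integer-like values such as years:

```python
from splink.charts import load_chart_definition

chart_definition = load_chart_definition("profile_data_kde.json")
# "d" is the d3-format code for an integer with no grouping separator,
# so a year such as 2023 renders as "2023" rather than "2,023"
chart_definition["encoding"]["x"]["axis"] = {"format": "d"}
```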

_top_n_plot = load_chart_definition("profile_data_top_n.json")
_bottom_n_plot = load_chart_definition("profile_data_bottom_n.json")
_kde_plot = load_chart_definition("profile_data_kde.json")
_correlation_plot = load_chart_definition("profile_data_correlation_heatmap.json")
Contributor

This isn't used in the script. The linter will fail if you try to run it.

Contributor Author

Code left in by mistake from the larger PR - removed.

return sql


def _get_df_correlations(column_expressions):
Contributor

I think this code is failing for me in postgres and sqlite. I'll post the test I am running in a second.

Contributor

For postgres, you'll need Docker, and to run source scripts/postgres/setup.sh.

From there, you can use:

import pandas as pd
from sqlalchemy import create_engine

from tests.basic_settings import get_settings_dict
from tests.helpers import PostgresTestHelper

engine = create_engine(
    "postgresql+psycopg2://splinkognito:splink123!"
    "@localhost:5432/splink_db"
)

helper = PostgresTestHelper(engine)
Linker = helper.Linker
brl = helper.brl
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

linker = Linker(df, get_settings_dict(), **helper.extra_linker_args())

linker.profile_numeric_columns(
    ["substr(dob, 1,4)"], top_n=None, bottom_n=None, kde_plots=True, distribution_plots=False
)

To generate the error.

Contributor

I would probably split up your CTE expressions first and then attempt to run the code above in debug mode. That should help you narrow down what's going wrong in the background.

This will ultimately all come down to a missing feature of postgres, but I'm not familiar enough with the db to be able to give you any clues.

Contributor Author

Thanks for diagnosing. The code was actually left in by mistake, as this PR was originally meant to be larger but was cut down. I have removed this code, but will save this comment so that I can refer back to it when doing the other PR.

Comment on lines 173 to 211
sql = f"""
WITH column_list AS (
    SELECT unnest(ARRAY{column_expressions}) AS column_name
),
column_data AS (
    SELECT {', '.join(column_expressions)}
    FROM __splink__df_concat
)
SELECT
    t1.column_name AS column1,
    t2.column_name AS column2,
    CORR(d1.val::DOUBLE PRECISION, d2.val::DOUBLE PRECISION) AS correlation
FROM
    column_list AS t1
CROSS JOIN
    column_list AS t2
JOIN
    (
        SELECT
            unnest(ARRAY{column_expressions}) AS column_name,
            unnest(ARRAY[{', '.join(f't.{column_name}' for column_name in column_expressions)}]) AS val
        FROM column_data AS t
    ) AS d1 ON t1.column_name = d1.column_name
JOIN
    (
        SELECT
            unnest(ARRAY{column_expressions}) AS column_name,
            unnest(ARRAY[{', '.join(f't.{column_name}' for column_name in column_expressions)}]) AS val
        FROM column_data AS t
    ) AS d2 ON t2.column_name = d2.column_name
WHERE
    t1.column_name < t2.column_name
GROUP BY
    t1.column_name,
    t2.column_name
ORDER BY
    t1.column_name,
    t2.column_name
"""
Contributor

Sorry, I know it's a pain, but on this large SQL statement, can you split it up such that each CTE expression is its own SQL expression?

In Splink, we have a pipeline class that generates CTE expressions automatically when provided with raw SQL statements and table names.

This means:

  1. Debug mode is more effective. Each CTE expression gets run, meaning it's much easier to identify which section of code is breaking.
  2. The code is easier to read.

To summarise each step and give you some code examples:

  1. Write each individual SQL statement, adding them to a dictionary of the format {"sql": sql, "output_table_name": <str_name>}.
  2. Once you have these, you can queue up each SQL step using self._enqueue_sql. This queues up a SQL statement and assigns it a CTE name (see the rough sketch after this list).
  3. After step 2, you don't need to make any additional changes. However, for completeness, see our pipeline class. This processes your individual SQL statements, translating them into CTE expressions and generating your final pipeline.
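
A rough sketch of the pattern described above, reusing the linker constructed in the earlier snippet (table names and SQL are illustrative only, and it assumes the internal helpers _enqueue_sql and _execute_sql_pipeline behave as in current Splink):

```python
# Step 1: write each statement as plain SQL selecting from the previous table
sql_column_data = """
select col_a, col_b
from __splink__df_concat
"""

sql_correlations = """
select corr(cast(col_a as float), cast(col_b as float)) as correlation
from __splink__profile_column_data
"""

# Step 2: queue each statement; Splink assigns it a CTE name
linker._enqueue_sql(sql_column_data, "__splink__profile_column_data")
linker._enqueue_sql(sql_correlations, "__splink__profile_correlations")

# Step 3: the pipeline class stitches the queued statements into one CTE chain
df_correlations = linker._execute_sql_pipeline()
```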

Contributor Author

As above, the code is no longer needed, but thank you for the explanation here - really helpful.

@ThomasHepworth
Contributor

Could you add some tests to this code too?

A really basic one I used to check that your code was working is:

from .basic_settings import get_settings_dict
from .decorator import mark_with_dialects_excluding

@mark_with_dialects_excluding()
def test_profile_data(test_helpers, dialect):
    helper = test_helpers[dialect]
    settings = get_settings_dict()
    Linker = helper.Linker

    df = helper.load_frame_from_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
    linker = Linker(df, settings, **helper.extra_linker_args())

    linker.profile_columns(["first_name", "city", "surname", "email", "substr(dob, 1,4)"], top_n=10, bottom_n=5)
    linker.profile_numeric_columns(["substr(dob, 1,4)"], top_n=None, bottom_n=None, kde_plots=True, distribution_plots=False)

This acts as a very basic integration test to check the methods work on all of our backends.

Contributor

@ThomasHepworth left a comment

Thanks Sam, I really love the new chart!! It's easy to read and I think it'll prove really valuable for evaluating numerical data at a glance.

We can easily identify outliers without expending too much effort.

The method is currently broken for postgres and sqlite. I'm not quite sure why or what's wrong, but hopefully the code snippets provided help.

It would be nice to get this merged and released by next week, but there's no pressure!

Member

RobinL commented Oct 10, 2023

Hey. Just looked at this in a bit more detail. I think I might be misunderstanding something, but I think there may be an issue for continuous numerical data - for which I think it always produces a 'square':

Example code
import altair as alt
import pandas as pd
import numpy as np
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")


# Sort by first_name
df.sort_values("first_name", inplace=True)

# Generate random numbers influenced by alphabetical order
df["random_num_continuous"] = np.linspace(0, 1, len(df)) + np.random.uniform(
    0, 1, len(df)
)
df["random_num_discrete"] = np.round(df["random_num_continuous"] * 2) / 2


chart = (
    alt.Chart(df)
    .transform_density(
        "random_num_continuous", as_=["random_num_continuous", "density"], extent=[0, 2]
    )
    .mark_area()
    .encode(x="random_num_continuous:Q", y="density:Q")
)

display(chart)


chart = (
    alt.Chart(df)
    .transform_density(
        "random_num_discrete", as_=["random_num_discrete", "density"], extent=[0, 2]
    )
    .mark_area()
    .encode(x="random_num_discrete:Q", y="density:Q")
)

display(chart)


settings = {
    "link_type": "dedupe_only",
}


linker = DuckDBLinker(df, settings, connection=":memory:")

# linker.debug_mode=True

linker.profile_numeric_columns(
    ["random_num_continuous", "random_num_discrete"], kde_plots=True
)

In this example, Altair shows this for the random_num_continuous variable using its KDE (transform_density) function:

image

Whereas Splink shows:

image

Is the chart actually a KDE? I've not really used them before, but they seem to perform some type of smoothing to estimate a distribution.

Note that when I round the data (the random_num_discrete variable), the new Splink chart seems to approximate the Altair KDE more closely.

image

@RobinL RobinL closed this Jan 8, 2024
@RobinL RobinL deleted the 996_profiling_upgrades branch August 12, 2024 10:09