996 profiling upgrades #1379

sama-ds · 2023-06-29T14:07:25Z

Type of PR

BUG
FEAT
MAINT
DOC

Is your Pull Request linked to an existing Issue or Pull Request?

One of the many requests compiled within Issue 996

This PR aims to add kernel density plots for continuous variables to allow for easier profiling of columns. To add in this graph, it became necessary to make the json files produced by profile_columns to eventually produce the graphical outputs modular and allow them to be pieced together by the parameters the user inputs. This means that any individual graph can be turned on/off, and should make adding more graphs in future significantly easier.

PR Checklist

Added documentation for changes
Added feature to example notebooks at tutorials in splink_demos (if appropriate)
Added tests (if appropriate)
Made changes based off the latest version of Splink
Run the linter

…mns. To encourage this to be modular and allow for easier adding of future graphs, I have restructured the profile_data.json file to be built iteratively by the components of all of the plots. This allows each individual plot to be turned on/off by setting a parameter that allows them to be False for distribution plots and None for top_n_plots and bottom_n_plots.

github-actions · 2023-06-29T14:12:38Z

Test: test_2_rounds_1k_duckdb

Percentage change: -12.8%

	date	time	stats_mean	stats_min	commit_info_branch	commit_info_id	machine_info_cpu_brand_raw	machine_info_cpu_hz_actual_friendly	commit_hash
849	2022-07-12	18:40:05	1.89098	1.87463	splink3	`c334bb9`	Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz	2.7934 GHz	`c334bb9`
1986	2023-08-30	15:07:58	1.67427	1.63405	(detached head)	`732e2af`	Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz	2.0952 GHz	`732e2af`

Test: test_2_rounds_1k_sqlite

Percentage change: -6.2%

	date	time	stats_mean	stats_min	commit_info_branch	commit_info_id	machine_info_cpu_brand_raw	machine_info_cpu_hz_actual_friendly	commit_hash
851	2022-07-12	18:40:05	4.32179	4.25898	splink3	`c334bb9`	Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz	2.7934 GHz	`c334bb9`
1988	2023-08-30	15:07:58	4.02793	3.99455	(detached head)	`732e2af`	Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz	2.0952 GHz	`732e2af`

Click here for vega lite time series charts

ThomasHepworth · 2023-06-29T14:30:21Z

splink/linker.py

+        self,
+        column_expressions: str | list[str],
+        top_n=10,
+        bottom_n=10,
+        kde_plots=False,
+        distribution_plots=True,
    ):
-        return profile_columns(self, column_expressions, top_n=top_n, bottom_n=bottom_n)
+        return profile_columns(
+            self,
+            column_expressions,
+            top_n=top_n,
+            bottom_n=bottom_n,
+            kde_plots=kde_plots,
+            distribution_plots=distribution_plots,
+        )


Can you coordinate writing the docstring for profile_data with @afua-moj?

I think one of you should put in a PR for the current version of splink and then build on that in this PR and her upcoming PR.

I've written a docstring in this, and another request. Can update on Afua's branch once this (or the other) has been merged.

sama-ds · 2023-06-29T15:20:12Z

Have left as a work in progress until @afua-moj has merged in. I've scanned both branches and there'll be minimal conflicts- but will let her changes be merged in first.

…st to allow for this functionality to be reviewed.

sama-ds · 2023-08-30T15:13:32Z

This is ready-for-review as is, but will have conflicts with c8a6a26 - please review and merge this branch first, then I can rebase into this branch and ensure conflicts are resolved appropriately.

RobinL · 2023-08-31T17:41:59Z

I really like the idea of Splink enabling more types of exploratory data analysis and I think the KDE option will be really useful.

There are a few challenges that come to mind from an API/usability point of view that I wondered if we'd considered. (Also relevant to the profile arrays PR.)

In summary, the challenge is around designing (expanding) the function signature to accomodate complex options. The input data has many columns of different types, where as the arguments are currently booleans which are not column specific.

To try and explain this a bit better, I think the difficulty arises from:

Input columns of multiple types. For an input dataset with columns of multiple types, it's challenging to design the function parameters to allow different profiling to be run depending on data type. Especially because choices are not deterministic: you don't always want to explode arrays and you don't always want to treat numeric data as continuous.
Data typing. The existing distribution plots can be used for any data types, but kde and corr can only be used with with continuous numeric data. This means it's possible to write a function call that is not 'executable' - e.g. asking for a kde on a 'first name'. At the moment, as far as I know, we don't do any data type checking in Splink, partly because it's hard to implement in a backend-agnostic way. It might therefore be hard to produce a descriptive error message.

To give a couple of examples of function calls that could be confusing:

# Example 1: Ambiguity in applying distribution and KDE plots when data contains numeric and non numeric
profile_columns(linker, ["first_name", "salary"], distribution_plots=True, kde_plots=True)

# Example 2: Suppose I want a distribution for age (i.e. treat is a nominal/categorical), and a kde plot for log(salary), it's difficult to design
# the function arguments to accomodate without significant complexity  (e.g. list of booleans, or dict like splink_settings)
profile_columns(linker, column_expressions=["age", "log(salary)"],  distribution_plots=True kde_plots=True)

# Example 3: Or when using the function with zero arguments 
profile_columns(linker,  distribution_plots = True, kde_plots=True)

I wondered whether we might be at risk of overloading a single function with too many options that make it difficult to use, and potentially hide the useful new functionality. Could an alternative option be simply to have separate function(s) for numeric and array profiling, and leave the generic profile_columns as distribution only (could even be deprecated)?

(Note: I'm not very sure about the best solution here - it's just one option and there may well be a better one) Apologies in advance if I'm misunderstood something, I haven't had time to pull the branch and actually run the code.

Lastly - just for my understanding - I've not used KDE before. Is it preferable to a histogram?

RobinL · 2023-08-31T17:46:30Z

I think your ideas in our previous discussion are also relevant here in terms of API design.

I wonder whether it might make sense for a set of profiling functions to live outside the main Linker, so that you can pass any dataframe into them more easily (e.g. without needing Splink settings or instantiating a linker).

Maybe there should be something like an exploratory.py module that offers a variety of profiling/exploratory functions e.g. splink.exploratory.values_distribution(dataframe, column_expressions)
splink.exploratory.kde(dataframe, column_expressions)

(That's a very rough idea, i'm not suggesting those should be the precise names)

Also relevant: #1055

RossKen · 2023-09-05T08:38:23Z

I think your ideas in our previous discussion are also relevant here in terms of API design.

I wonder whether it might make sense for a set of profiling functions to live outside the main Linker, so that you can pass any dataframe into them more easily (e.g. without needing Splink settings or instantiating a linker).

Maybe there should be something like an exploratory.py module that offers a variety of profiling/exploratory functions e.g. splink.exploratory.values_distribution(dataframe, column_expressions) splink.exploratory.kde(dataframe, column_expressions)

(That's a very rough idea, i'm not suggesting those should be the precise names)

Also relevant: #1055

I think I am in support of having a separate exploratory module (but haven't thought about it in great detail). I have never been a massive fan on having to instantiate a linker just to do some EDA 👍

samnlindsay · 2023-09-05T10:29:05Z

@RobinL

I wondered whether we might be at risk of overloading a single function with too many options that make it difficult to use, and potentially hide the useful new functionality. Could an alternative option be simply to have separate function(s) for numeric and array profiling, and leave the generic profile_columns as distribution only (could even be deprecated)?

I think I would prefer to keep a single profile_columns as a one-stop shop for summarising the data, rather than requiring separate functions for different kinds of columns and output. Could we use some kind of shorthand (like altair's salary:Q for a numerical column and full_name:N for a categorical one), or the option to supply columns as:

a list (current distribution-only version) or
a dictionary (to specify how to profile each column).

Example

Default - distributions only, as previously

cols = ["first_name", "salary"]
profile_columns(linker, cols)

Equivalent, more verbose alternative

cols = {
  "first_name": {dist = True, kde = False}, 
  "salary": {dist = True, kde = False}
}
profile_columns(linker, cols)

Per-column custom profiling as appropriate for data type

cols = {
  "first_name": {dist = True, kde = False}, 
  "salary": {dist = False, kde = True}
}
profile_columns(linker, cols)

Almost certainly making this seem simpler than it would really be to implement, and possibly missing the point in the earlier discussion so let me know if this is nonsense.

RobinL · 2023-09-05T13:18:14Z

My view is that we should retain profile_columns in its current form as a simple way to get some basic charts (esp. the new zero argument version), but we should not add further complexity to it.

There are several reasons:

Code complexity I worry that trying to do too much in profile_columns may make the code complicated and difficult to maintain. The code behind profile_columns is already pretty complicated, but a saving grace is that, because everything is (treated like) a string, then at least the code doesn't have to deal with data typing issues.
At the moment, profile_columns is confusing when there are multiple input datasets. It returns the profile of the concatenated dataframes, but it doesn't tell you that anywhere. If you want to profile a single input dataset you have to use a workaround (instantiate a new linker with one input dataset). Having separate functions means the 'damage' of a bug (see errors below) is contained to that single function and is probably easier to fix
Extensibility I think it's easier to add additional exploratory functionality if it doesn't all have to be rolled into a single function. We're already adding support for profiling array columns, KDE, and correlation. Probably we might want histograms at some future point.
Discoverability If functionality is only available via specific non-default parameters, it may be hard for users to discover it (you don't get tab completion, for instance, and it's harder to present in the docs). It's easier to document several functions that do different things than one mega-function
Difficulty of API design/complexity of function calls. I agree that it would be possible to design a settings={} parameter that controlled profile_columns in a similar what to how Splink settings control the Linker. But I don't think this code is particularly easy to write for the user, and from our (the maintainers) perspective, there's also potentially quite a bit of complexity in error checking (do data types align with what the user has asked for, are the settings valid, etc.)
Trying to be consistent about having long descriptive names
Errors: At the moment, using profile_columns it should be rare for users to experience SQL execution failures due to data typing. Adding in functions that require numeric inputs makes the likelihood much higher, and so makes descriptive errors harder to raise. Arguably it should be accompanied by data type checking code (so we can raise descriptive error message), but that's a pretty big piece of work that we probably don't want to get into right now.

On the idea of Altair style names (full_name:N) I think there's a useful idea here that can be useful irrespective on the decision of the profile_columns API. One issue here is whether it means (a) full_name should be a string, or (b) full_name is a string. (Because choices are not deterministic - e.g. for a 'transaction_amount' column, you might sometimes want it to be treated as a string even though it's a number, so that you notice that £9.99 is very common, but £10.01 is not)

sama-ds · 2023-09-11T12:43:23Z

A lot of points and really useful discussion to dissect here, so I'll attempt to summarise. We have three options:

Accept this PR and functionality within
Accept the ability to remove elements from this PR, and add separate functionality to a set of exploratory tools not needing a linker to be initiated
Alter this PR to include some elements of type casting, and allow the user to turn charts on/off for a given column

For me, 2) feels like we're trying to reinvent the wheel. There are a whole host of options to do exploratory analysis of data generally, and to be profile_columns functions as a step that allows a user to easy get a view of things we deem key. In this sense, definitely agree with Sam that I like profile_colmns being the go-to. However, I agree that adding further functionality to this becomes a bit "heavy", but what we include could be quite stringent to this. I guess the question here is- what else would we actually want to include, and are we throwing the baby our with the bathwater by abandoning this?

I really like the idea of 3), as I think it follows the Splink "feel", but I appreciate it's more difficult to engineer/maintain from our end than simple functions. From an engineering perspective though, I think the framework from this PR that allows iterative building of those graphs would allow this anyway. I'm not sure I agree with the point that this is difficult code for the user to write, and actually think it's simpler than multiple lines of a user calling graph 1 for X,Y,Z, then graph 2 for Y,I,J, etc. The data typing issue feels like the largest issue here, but I'm not sure in practice whether it is an issue we have to solve. Within current implementation, for example, if a kde_plot is called for a non-numeric variable, it simply renders the chart without any data, rather than erroring the function. I think it would be fairly simple to validate the entry data going into the chart (i.e. will it generate anything), and throw out an error message akin to "The for appears to have invalid input data. This is usually triggered by an incorrect data type."

I will propose a fourth option to muddy the waters further. If we believe that data types are this complicated issue, how would people feel about simply having two profile column functions: profile_columns and profile_columns_numeric (naming TBC), with the intention that one holds a set of graphs applicable to columns the user wants to treat as strings (eg. A,B,C), and the other holds a set of functions that user wants to treat as numeric (eg. B,C,D)? The intention being that profiling becomes two-step process, and produces two separate sets of graphs. This would also allows the user to be able to treat a single column as two different types types, as Robin highlighted above, by including it in both sets of graphs.

RobinL · 2023-09-11T21:01:23Z

I think a profile_columns_numeric function seems like a good compromise - it isolates any problems with data typing and keeps code for different things more separate.

I'm quite focussed on data typing because in my experience it's usually harder than it looks when you're working across multiple tools (sql backends), and can easily leads to subtle bugs etc.

It doesn't fully solve the 'how to profile a single df for a multi-df link job', but that functionality could probably be added later (to both profile_columns_numeric and profile_columns), e.g. by optionally passing in the data (or a reference to it).

I definitely agree with the point that we need to keep an eye on value added i.e. concentrate efforts on data vis that is very linkage-focussed, rather than re-implementing things that can be achieved with other tools

sama-ds · 2023-09-22T13:48:34Z

Have implemented agreed changes by creating a seperate profile_numeric_columns. They are currently very similar (the only difference being a kernel density plot, which is blocked in the original). As additional features are added, it is expect they will diverge, but the base profile_columns function not associated with the linker will retain all functionality.

ThomasHepworth · 2023-10-02T14:57:40Z

docs/demos/02_Exploratory_analysis.ipynb

The path needs updating - docs/demos/02_Exploratory_analysis.ipynb -> docs/demos/tutorials/02_Exploratory_analysis.ipynb

ThomasHepworth · 2023-10-05T15:34:54Z

splink/linker.py

-        self, column_expressions: str | list[str] = None, top_n=10, bottom_n=10
+        self,
+        column_expressions: str | list[str],
+        top_n=10,
+        bottom_n=10,
+        distribution_plots=True,
    ):
-        """
-        Profiles the specified columns of the dataframe initiated with the linker.
-
-        This can be computationally expensive if the dataframe is large.
-
-        For the provided columns with column_expressions (or for all columns if
-         left empty) calculate:
-        - A distribution plot that shows the count of values at each percentile.
-        - A top n chart, that produces a chart showing the count of the top n values
-        within the column
-        - A bottom n chart, that produces a chart showing the count of the bottom
-        n values within the column
-
-        This should be used to explore the dataframe, determine if columns have
-        sufficient completeness for linking, analyse the cardinality of columns, and
-        identify the need for standardisation within a given column.


Did you mean to delete the docstring?

I did. I have added the docstring to the base function within the base function profile_columns. These functions sit as wrappers, so would involve duplicating the doc string. Happy to do so if needed for markdown docs to function correctly, but is it needed?

ThomasHepworth · 2023-10-05T15:35:03Z

splink/linker.py

+    def profile_numeric_columns(
+        self,
+        column_expressions: str | list[str],
+        top_n=10,
+        bottom_n=10,
+        kde_plots=False,
+        distribution_plots=True,
+    ):
        return profile_columns(
-            self, column_expressions=column_expressions, top_n=top_n, bottom_n=bottom_n
+            self,
+            column_expressions,
+            top_n=top_n,
+            bottom_n=bottom_n,
+            kde_plots=kde_plots,
+            distribution_plots=distribution_plots,


Needs a docstring

As above- docstring is at main function that this calls. Need repeating?

ThomasHepworth · 2023-10-05T15:37:39Z

splink/linker.py

+        top_n=10,
+        bottom_n=10,
+        distribution_plots=True,


You might as well add some type annotations

ThomasHepworth · 2023-10-05T15:37:49Z

splink/linker.py

+        top_n=10,
+        bottom_n=10,
+        kde_plots=False,
+        distribution_plots=True,


Type annotations

ThomasHepworth · 2023-10-05T15:47:26Z

splink/files/chart_defs/profile_data_kde.json

+      "x": {
+        "type": "quantitative",
+        "field": "value",
+        "title": "Value"
+      },


I'm not familiar enough with vegalite and how it works under the hood, but I am assuming it is automatically adding in commas for our numerical values 👇

Perhaps we should change this behaviour so that it only adds commas if there are more than four digits?

Have ammended- if you're interested this is the difference between setting a variable to "quantitative" (as above) or "nominal" in the "type" field within vegalite. (see changed in profile_data_kde_json

I thought I'd fixed this, but I haven't found a way around this. Vegalite seems to autoformat these labels like this because the data type of "x" is quantitiative. If we set this to nominal, as we do in other charts, this will fix the labels, but produce graphs like this:

Obviously, we do not want KDE plots of non-numeric data, not least for the fact that these will render as wide as there are values in the data.

It also means that columns with valid numeric data end up like this:

When they should be like this:

This behaviour won't happen if formatted as date since this bug fix but this won't apply to anything that is pulled via SQL statements (which we often do).

ThomasHepworth · 2023-10-05T15:49:10Z

splink/profile_data.py

+_top_n_plot = load_chart_definition("profile_data_top_n.json")
+_bottom_n_plot = load_chart_definition("profile_data_bottom_n.json")
+_kde_plot = load_chart_definition("profile_data_kde.json")
+_correlation_plot = load_chart_definition("profile_data_correlation_heatmap.json")


This isn't used in the script. The linter will fail if you try to run it.

Code left in by mistake from larger PR- removed

ThomasHepworth · 2023-10-05T16:12:38Z

splink/profile_data.py

+    return sql
+
+
+def _get_df_correlations(column_expressions):


I think this code is failing for me in postgres and sqlite. I'll post the test I am running in a second.

For postgres, you'll need docker and to run source scripts/postgres/setup,sh.

From there, you can use:

from tests.helpers import PostgresTestHelper import pandas as pd from tests.basic_settings import get_settings_dict from tests.decorator import mark_with_dialects_excluding from sqlalchemy import create_engine, text engine = create_engine( f"postgresql+psycopg2://splinkognito:splink123!" f"@localhost:5432/splink_db" ) helper = PostgresTestHelper(engine) Linker = helper.Linker brl = helper.brl df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv") from tests.basic_settings import get_settings_dict linker = Linker(df, get_settings_dict(), **helper.extra_linker_args()) linker.profile_numeric_columns(["substr(dob, 1,4)"], top_n=None, bottom_n=None, kde_plots=True, distribution_plots=False)

To generate the error.

I would probably split up your CTE expressions first and then attempt to run the code above in debug mode. That should help you narrow down what's going wrong in the background.

This will ultimately all come down to a missing feature of postgres, but I'm not familiar enough with the db to be able to give you any clues.

Thanks for diagnosing. Code was actually left in in error as this PR was originally meant to be larger, but was cut down. Have removed this code but will save this comment so that I can refer back when doing the other PR.

ThomasHepworth · 2023-10-05T16:23:23Z

splink/profile_data.py

+    sql = f"""
+    WITH column_list AS (
+    SELECT unnest(ARRAY{column_expressions}) AS column_name
+    ),
+    column_data AS (
+    SELECT {', '.join(column_expressions)}
+    FROM __splink__df_concat
+    )
+    SELECT
+    t1.column_name AS column1,
+    t2.column_name AS column2,
+    CORR(d1.val::DOUBLE PRECISION, d2.val::DOUBLE PRECISION) AS correlation
+    FROM
+    column_list AS t1
+    CROSS JOIN
+    column_list AS t2
+    JOIN
+    (
+        SELECT
+        unnest(ARRAY{column_expressions}) AS column_name,
+        unnest(ARRAY[{', '.join(f't.{column_name}' for column_name in column_expressions)}]) AS val
+        FROM column_data AS t
+    ) AS d1 ON t1.column_name = d1.column_name
+    JOIN
+    (
+        SELECT
+        unnest(ARRAY{column_expressions}) AS column_name,
+        unnest(ARRAY[{', '.join(f't.{column_name}' for column_name in column_expressions)}]) AS val
+        FROM column_data AS t
+    ) AS d2 ON t2.column_name = d2.column_name
+    WHERE
+    t1.column_name < t2.column_name
+    GROUP BY
+    t1.column_name,
+    t2.column_name
+    ORDER BY
+    t1.column_name,
+    t2.column_name
+    """


Sorry, I know it's a pain, but on this large SQL statement, can you split it up such that each CTE expression is its own SQL expression?

In Splink, we have a pipeline class that generates CTE expressions automatically when provided with raw SQL statements and table names.

This means:

Debug mode is more effective. Each CTE expression gets run, meaning it's much easier to identify which section of code is breaking.

The code is easier to read.

To summarise each each step and give you some code examples:

Write each individual SQL statement, adding them to a dictionary of the format {"sql": sql, "output_table_name": <str_name>}.

Once you have these, you can queue up each SQL step using self._enqueue_sql. This queues up a SQL statement and assigns it a CTE name.

After step 2, you don't need to make any additional changes. However, for completeness, see our pipeline class. This processes your individual SQL statements, translating them into CTE expressions and generating your final pipeline.

As above, code no longer needed, but thankyou for the explanation here, really helpful.

ThomasHepworth · 2023-10-05T16:28:42Z

Could you add some tests to this code too?

A really basic one I used to test your code was working is:

from .decorator import mark_with_dialects_excluding

@mark_with_dialects_excluding()
def test_profile_data(test_helpers, dialect):
    helper = test_helpers[dialect]
    settings = get_settings_dict()
    Linker = helper.Linker

    df = helper.load_frame_from_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
    linker = Linker(df, settings, **helper.extra_linker_args())

    linker.profile_columns(["first_name", "city", "surname", "email", "substr(dob, 1,4)"], top_n=10, bottom_n=5)
    linker.profile_numeric_columns(["substr(dob, 1,4)"], top_n=None, bottom_n=None, kde_plots=True, distribution_plots=False)

This acts as a very basic integration test to check the methods work on all of our backends.

ThomasHepworth

Thanks Sam, I really love the new chart!! It's easy to read and I think it'll prove really valuable for evaluating numerical data at a glance.

We can easily identify outliers without expending too much effort.

The method is currently broken for postgres and sqlite. I'm not quite sure why or what's wrong, but hopefully the code snippets provided help.

It would be nice to get this merged and released by next week, but there's no pressure!

RobinL · 2023-10-10T07:56:42Z

Hey. Just looked at this in a bit more detail. I think I might be minunderstanding something, but think there may be an issue for continuous numerical data - for which I think it always produces a 'square':

Example code

import altair as alt
import pandas as pd
import numpy as np
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")


# Sort by first_name
df.sort_values("first_name", inplace=True)

# Generate random numbers influenced by alphabetical order
df["random_num_continuous"] = np.linspace(0, 1, len(df)) + np.random.uniform(
    0, 1, len(df)
)
df["random_num_discrete"] = np.round(df["random_num_continuous"] * 2) / 2


chart = (
    alt.Chart(df)
    .transform_density(
        "random_num_continuous", as_=["random_num_continuous", "density"], extent=[0, 2]
    )
    .mark_area()
    .encode(x="random_num_continuous:Q", y="density:Q")
)

display(chart)


chart = (
    alt.Chart(df)
    .transform_density(
        "random_num_discrete", as_=["random_num_discrete", "density"], extent=[0, 2]
    )
    .mark_area()
    .encode(x="random_num_discrete:Q", y="density:Q")
)

display(chart)


settings = {
    "link_type": "dedupe_only",
}


linker = DuckDBLinker(df, settings, connection=":memory:")

# linker.debug_mode=True

linker.profile_numeric_columns(
    ["random_num_continuous", "random_num_discrete"], kde_plots=True
)

In this example, Altair shows this for the random_num_continuous variable using its KDE (transform_density) function:

Whereas Splink shows:

Is the chart actually a KDE? I've not really used them before, but I they seem to perform some type of smoothing to estimate a distribution.

Note that when i round the data (the random_num_discrete variable), the new Splink chart seems to approximate the Altair KDE more closely

… pr, but have since been cutdown

sama-ds added 3 commits June 20, 2023 12:49

Inital functional KDE graphs in profile data

9dcc08c

Minor logic change.

bc54397

sama-ds marked this pull request as draft June 29, 2023 14:07

sama-ds added 2 commits June 29, 2023 15:10

Linting changes and removing print statements used for debugging

2d8b2c8

lint with black

14fbeb9

ThomasHepworth reviewed Jun 29, 2023

View reviewed changes

sama-ds added 6 commits August 10, 2023 15:35

Functional changes - KDE yet to be finished so currently being worked on

1c4aa57

lint with black

36a43f0

Removed code around correlation plot and moved to seperate pull reque…

a4069fb

…st to allow for this functionality to be reviewed.

Added docstring

092df47

Merge remote-tracking branch 'origin/master' into 996_profiling_upgrades

83bdca0

Added example use to tutorial notebook.

d33cf28

sama-ds force-pushed the 996_profiling_upgrades branch from f184686 to d33cf28 Compare August 30, 2023 15:06

lint with black

22fd8e6

sama-ds marked this pull request as ready for review August 30, 2023 15:09

RobinL mentioned this pull request Aug 31, 2023

1064 profile array elements #1397

Closed

6 tasks

sama-ds and others added 3 commits September 22, 2023 13:49

profile_numeric_columns functionality

5038bc6

Merge branch 'master' into 996_profiling_upgrades

879b37b

lint with black

0530de3

sama-ds added 3 commits September 22, 2023 14:21

Updating docs to reflect changes

95f1b44

Minor changes to appease linter

6501773

lint with black

8d732dc

ThomasHepworth reviewed Oct 2, 2023

View reviewed changes

ThomasHepworth reviewed Oct 5, 2023

View reviewed changes

ThomasHepworth requested changes Oct 5, 2023

View reviewed changes

sama-ds added 5 commits November 8, 2023 12:14

Removing un-uncessary files and code that were originally part of the…

5b2ba37

… pr, but have since been cutdown

Making minor changes on formatting a typecasting

9b11ed5

lint with black

33e6891

Adding additional tests and error protection

e194bc1

Fixed error catching

f96fafb

RobinL closed this Jan 8, 2024

RobinL deleted the 996_profiling_upgrades branch August 12, 2024 10:09

996 profiling upgrades #1379

996 profiling upgrades #1379

Conversation

sama-ds commented Jun 29, 2023 • edited by ThomasHepworth Loading

Type of PR

Is your Pull Request linked to an existing Issue or Pull Request?

PR Checklist

github-actions bot commented Jun 29, 2023 • edited Loading

Test: test_2_rounds_1k_duckdb

Test: test_2_rounds_1k_sqlite

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sama-ds commented Jun 29, 2023

sama-ds commented Aug 30, 2023

RobinL commented Aug 31, 2023 • edited Loading

RobinL commented Aug 31, 2023 • edited Loading

RossKen commented Sep 5, 2023

samnlindsay commented Sep 5, 2023 • edited Loading

Example

RobinL commented Sep 5, 2023 • edited Loading

sama-ds commented Sep 11, 2023

RobinL commented Sep 11, 2023 • edited Loading

sama-ds commented Sep 22, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ThomasHepworth commented Oct 5, 2023

ThomasHepworth left a comment

Choose a reason for hiding this comment

RobinL commented Oct 10, 2023 • edited Loading

sama-ds commented Jun 29, 2023 •

edited by ThomasHepworth

Loading

github-actions bot commented Jun 29, 2023 •

edited

Loading

RobinL commented Aug 31, 2023 •

edited

Loading

RobinL commented Aug 31, 2023 •

edited

Loading

samnlindsay commented Sep 5, 2023 •

edited

Loading

RobinL commented Sep 5, 2023 •

edited

Loading

RobinL commented Sep 11, 2023 •

edited

Loading

RobinL commented Oct 10, 2023 •

edited

Loading