Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1064 profile array elements #1397

Closed
wants to merge 26 commits into from
Closed

Conversation

afua-moj
Copy link
Contributor

@afua-moj afua-moj commented Jul 5, 2023

Type of PR

  • FEAT

Is your Pull Request linked to an existing Issue or Pull Request?

Solves issue #1064 for DuckDB

Give a brief description for the solution you have provided

Can now profile the individual elements of an array with the DuckDB linker. This is the default behaviour. Can profile the whole array by setting cast_array_as_string= True

Have written a Backend Agnostic Test for this feature (test_profile_arrays_bat). Can edit this and remove test_profile_with_arrays_spark once the feature has been developed in Spark.

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks at tutorial in splink_demos (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -5.9%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
849 2022-07-12 18:40:05 1.89098 1.87463 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1857 2023-07-14 09:51:26 1.76617 1.76482 (detached head) cac9b53 Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz 2.0951 GHz cac9b53

Test: test_2_rounds_1k_sqlite

Percentage change: 1.6%

date time stats_mean stats_min commit_info_branch commit_info_id machine_info_cpu_brand_raw machine_info_cpu_hz_actual_friendly commit_hash
851 2022-07-12 18:40:05 4.32179 4.25898 splink3 c334bb9 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz 2.7934 GHz c334bb9
1859 2023-07-14 09:51:26 4.3294 4.32654 (detached head) cac9b53 Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz 2.0951 GHz cac9b53

Click here for vega lite time series charts

@ThomasHepworth
Copy link
Contributor

I think you can ignore the postgres test failure. I am pretty certain that's now been fixed in master.

You're welcome to perform another rebase if you'd like to check and confirm.

column_expressions: str | list[str],
top_n=10,
bottom_n=10,
cast_arrays_as_str=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is False by default too, so users are going to get spammed by the warning.

This won't be so much of an issue once this is implemented across the board, but it might be a little obnoxious if we don't apply it across the board for a while.

@afua-moj afua-moj marked this pull request as ready for review July 6, 2023 18:54
@afua-moj afua-moj requested review from RossKen and ADBond July 6, 2023 18:55
splink/profile_data.py Outdated Show resolved Hide resolved
splink/profile_data.py Outdated Show resolved Hide resolved
'{gn}' as group_name,

(select count(value) from
(select unnest ({col_or_expr}
Copy link
Member

@RobinL RobinL Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a potential issue with unnest here that the function name changes depending on the dialect e.g. in spark it's explode. If so, might consider adding something to each backend similar to how we deal with random samples (where the dialect varies between linkers). Apologies if you're already aware

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up... yes this was on my radar and was more complicated to deal with as the 'translator' didn't translate the desired function, so Tom is going to work on applying this to the other backends. But I will ask him about the solution used with random samples as it's good to know.

@afua-moj afua-moj force-pushed the 1064_profile_array_elements branch from 5cc2023 to c04d98a Compare July 14, 2023 09:49
@RobinL
Copy link
Member

RobinL commented Aug 31, 2023

EDIT: Apologies if you read this, I put it on the wrong issue, it was meant for here

@RobinL RobinL mentioned this pull request Aug 31, 2023
9 tasks
@RobinL RobinL closed this Jan 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants