Releases · moj-analytical-services/splink

19 Sep 14:16

RobinL

v3.3.0.dev02

a0c650f

v3.3.0.dev02 Pre-release

What's Changed

[DOCS] Add examples to docs by @RobinL in #774
[FIX] Fix jaro in duckdb by @RobinL in #775

Full Changelog: v3.3.0.dev01...v3.3.0.dev02

Contributors

RobinL

15 Sep 19:07

RobinL

v3.3.0.dev01

b613a08

v3.3.0.dev01 Pre-release

What's Changed

[DOCS] Add links to videos into main readme by @RobinL in #765
[FEAT] Add percentage difference to comparison level library by @RobinL in #757
[DOCS] Add examples section to docs by @RobinL in #772
[FEAT] Add jaro winkler to duckdb linker now 0.5.0 is a dependency by @RobinL in #766
[FEAT] Waterfall of false positives and false negatives from labels by @RobinL in #763
[FEAT] ROC/Precision recall/truth table from label column name by @RobinL in #773

Full Changelog: v3.2.1...v3.3.0.dev01

Contributors

RobinL

06 Sep 14:01

RobinL

v3.2.1

baced9e

v3.2.1

What's Changed

[MAINT] Make Splink compatible with duckdb v0.5.0 release by @RobinL in #750
[Fix] Make json encoder more robust by @RobinL in #751
[FIX] fix table exists in db by @RobinL in #753
[MAINT] Bump version to 3.2.1 and duckdb to 0.5.0 by @RobinL in #758

Full Changelog: v3.2.0...v3.2.1

Contributors

RobinL

30 Aug 16:43

RobinL

v3.2.0

bb468a6

v3.2.0

What's Changed

There are two minor breaking changes:

(1) Settings must now always be provided to instantiate the linker object. The most minimal settings object is {"link_type": your_link_type}.
(2) By default, EM training sessions no longer estimate probability_two_random_records_match. This can be enabled by passing an argument explicitly.
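
As a rough illustration of the minimal settings object described above, the sketch below builds and checks such a dictionary. The helper name minimal_settings is hypothetical, and the set of accepted link_type values and the import path shown in the comments are assumptions based on the Splink 3 documentation, not something stated in these release notes.

```python
# Link types accepted by Splink 3 (assumed from the Splink documentation).
VALID_LINK_TYPES = {"dedupe_only", "link_only", "link_and_dedupe"}


def minimal_settings(link_type: str) -> dict:
    """Hypothetical helper: build the smallest settings dict described above."""
    if link_type not in VALID_LINK_TYPES:
        raise ValueError(f"Unknown link_type: {link_type!r}")
    return {"link_type": link_type}


settings = minimal_settings("dedupe_only")
print(settings)

# With Splink installed, the dict would then be passed when creating a linker,
# for example (import path assumed for the 3.x DuckDB backend):
#   from splink.duckdb.duckdb_linker import DuckDBLinker
#   linker = DuckDBLinker(df, settings)
```

Code that previously created a linker without any settings would need to pass at least this dictionary after upgrading.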

Features

[FEAT] Add support for pairwise format of clusters by @ThomasHepworth in #707
[FEAT] Haversine comparison level by @ThomasHepworth in #721
[FEAT] Databricks tweaks pr by @rjc89 in #715
[FEAT] Direct estimation probability two random records match by @RobinL in #734

Other

[DOCS] Update main readme to include clustering by @RobinL in #696
add version tag by @ThomasHepworth in #695
add a custom translation for cast(<val> as double) -> <val>D by @ThomasHepworth in #697
add duckdb helper functions to a separate script by @ThomasHepworth in #700
[docs] fix minor typo in docs by @Thomas-Hirsch in #708
[Docs] Dev guide to transpilation by @RobinL in #711
[MAINT] Log SQL statements before, not after, they are executed in Spark by @RobinL in #714
[DOCS] Update binder link in readme by @RobinL in #716
Adjust input col sql logic by @ThomasHepworth in #725
lint black external contribution by @RobinL in #728
[Docs] Update dev guide to sqlglot and transpilation by @RobinL in #729
better quote unquote by @RobinL in #731
[MAINT] Don't return html by default, it crashes jupyter by @RobinL in #735
Update sqlglot v5 by @ThomasHepworth in #736
Athena fixes by @ThomasHepworth in #738
[DOCS] Add developers guide to building docs locally by @RobinL in #740
[MAINT] Improve implementation of InputColumn and remove transpile by @RobinL in #727
Document save_offline_chart and ensure it works if passed a vega lite chart by @RobinL in #742
[MAINT] Improve analyse blocking by @RobinL in #743
[MAINT] Bump version for prerelease by @RobinL in #744

New Contributors

@Thomas-Hirsch made their first contribution in #708
@rjc89 made their first contribution in #715

Full Changelog: v3.1.0...v3.2.0.dev01

Contributors

RobinL, rjc89, and 2 other contributors

30 Aug 16:08

RobinL

v3.2.0.dev01

082cb90

v3.2.0.dev01 Pre-release

What's Changed

Features

[FEAT] Add support for pairwise format of clusters by @ThomasHepworth in #707
[FEAT] Haversine comparison level by @ThomasHepworth in #721
[FEAT] Databricks tweaks pr by @rjc89 in #715
[FEAT] Direct estimation probability two random records match by @RobinL in #734

Other

[DOCS] Update main readme to include clustering by @RobinL in #696
add version tag by @ThomasHepworth in #695
add a custom translation for cast(<val> as double) -> <val>D by @ThomasHepworth in #697
add duckdb helper functions to a separate script by @ThomasHepworth in #700
[docs] fix minor typo in docs by @Thomas-Hirsch in #708
[Docs] Dev guide to transpilation by @RobinL in #711
[MAINT] Log SQL statements before, not after, they are executed in Spark by @RobinL in #714
[DOCS] Update binder link in readme by @RobinL in #716
Adjust input col sql logic by @ThomasHepworth in #725
lint black external contribution by @RobinL in #728
[Docs] Update dev guide to sqlglot and transpilation by @RobinL in #729
better quote unquote by @RobinL in #731
[MAINT] Don't return html by default, it crashes jupyter by @RobinL in #735
Update sqlglot v5 by @ThomasHepworth in #736
Athena fixes by @ThomasHepworth in #738
[DOCS] Add developers guide to building docs locally by @RobinL in #740
[MAINT] Improve implementation of InputColumn and remove transpile by @RobinL in #727
Document save_offline_chart and ensure it works if passed a vega lite chart by @RobinL in #742
[MAINT] Improve analyse blocking by @RobinL in #743
[MAINT] Bump version for prerelease by @RobinL in #744

New Contributors

@Thomas-Hirsch made their first contribution in #708
@rjc89 made their first contribution in #715

Full Changelog: v3.1.0...v3.2.0.dev01

Contributors

RobinL, rjc89, and 2 other contributors

03 Aug 15:44

RobinL

v3.1.0

eb37154

v3.1.0

What's Changed

Warning
Version 3.1.0 makes a small, backwards-incompatible API change to the SparkLinker, i.e. it is a minor violation of semver.

The changes affect the SparkLinker only:

The default break_lineage_method changes to parquet
The break_lineage_after_blocking param is renamed to repartition_after_blocking for clarity
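
One way to handle the rename when upgrading is to translate keyword arguments before they reach the SparkLinker. The sketch below is only illustrative: migrate_spark_linker_kwargs is a hypothetical helper that is not part of Splink, and it assumes the renamed parameter accepts the same value as the old one.

```python
def migrate_spark_linker_kwargs(kwargs: dict) -> dict:
    """Hypothetical helper: rename the pre-3.1.0 keyword to its 3.1.0 name."""
    old, new = "break_lineage_after_blocking", "repartition_after_blocking"
    migrated = dict(kwargs)  # copy so the caller's dict is left untouched
    if old in migrated:
        if new in migrated:
            raise ValueError(f"Pass only one of {old!r} and {new!r}")
        migrated[new] = migrated.pop(old)
    return migrated


old_kwargs = {"break_lineage_after_blocking": True}
new_kwargs = migrate_spark_linker_kwargs(old_kwargs)
print(new_kwargs)
```

Code that relied on the previous default for break_lineage_method may also want to pass that argument explicitly rather than depend on the new default.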

Features

Add the ability to use pyarrow + on-disk parquet/csv in duckdb by @ThomasHepworth in #684
Add completeness (by dataset) chart by @samnlindsay in #669
Add cumulative blocking rule comparison chart by @ThomasHepworth in #660
Allow find_matches_to_new_records to take table name as input, in addition to rows by @RobinL in #659

Bugfixes

remove duplicate column selections by @ThomasHepworth in #681
fix em training tooltip by @ThomasHepworth in #665

Maintenance

[MAINT] Clarify sql execution function names by @RobinL in #690
[MAINT] Clarify Spark Linker caching logic by @RobinL in #691
[MAINT] Bump version to 3.1.0 by @RobinL in #693
Fix code formatting on count_num_comparisons_from_blocking_rules_for_prediction by @RobinL in #661
Add salting to spark full test by @RobinL in #655

Docs

Improve customising comparisons topic guide by @RobinL in #667
[DOCS] Performance topic guide, covering blocking by @RobinL in #675
[docs] Add issue template for bug report by @RobinL in #676
[DOCS] Add topic guide for optimising spark jobs by @RobinL in #679
[DOCS] Fix problem with spark docs copy by @RobinL in #685
[Docs] Developers' guide to caching and pipelining by @RobinL in #686
[Docs] Developer guide: Understanding and debugging Splink's computations by @RobinL in #688
[DOCS] Developers' guide to spark caching and pipelining by @RobinL in #689

Full Changelog: v3.0.1...v3.1.0

Contributors

RobinL, samnlindsay, and ThomasHepworth

18 Jul 17:12

RobinL

v3.0.1

0cc6bc6

v3.0.1

What's Changed

Performance improvements

Improve the performance of our training steps by @Th368MoJ in #648
SparkLinker: Improve performance of estimate_u_using_random_sampling by @RobinL in #641

Features

Add Spark jar and UDF comparison functions to the Spark comparison library by @RobinL in #649

Other

splink_demos for Splink3 are now on the master branch by @RobinL in #637
Topic guide for salting by @RobinL in #638
Add topic guide to documentation covering different execution backends by @RobinL in #643
Fix __repr__ of EMTrainingSession by @RobinL in #645
Version and download numbers badges in readme by @samnlindsay in #650
Add acknowledgement for academic advisors by @RobinL in #652
Update README badges by @Th368MoJ in #651
Bump sqlglot version and 3.0.1 by @RobinL in #653

Full Changelog: v3.0.0...v3.0.1

Contributors

RobinL, samnlindsay, and ThomasHepworth

12 Jul 19:13

RobinL

v3.0.0

1c4e678

Splink 3.0.0

What's changed

Splink version 3.0.0 is a complete rewrite. The major new features are:

Splink no longer requires Spark. It can now run against multiple backends, including DuckDB, Spark, and AWS Athena.
Term frequency adjustments can be applied more flexibly, with more options - see here
Using the DuckDB backend, close to real time linkage of new records is possible, enabling Splink to be embedded in search services, see here
Many Splink operations are faster across all backends. The most dramatic speedups are for smaller linkages of fewer than around 1m records, where using DuckDB rather than Spark can result in runtimes that are 10x faster or better.
A more comprehensive documentation website is available here
The cluster studio and comparison viewer dashboards are now bundled with Splink rather than being separate packages, making them simpler to use and preventing issues with version incompatibilities

Upgrading

We recommend re-training models. However, we do provide a Splink 2 to 3 converter that attempts to convert a Splink 2 settings dictionary into the Splink 3 equivalent.

This works on a 'best efforts' basis, so it is not guaranteed to work for every model.

12 Jul 16:22

RobinL

v3.0.0.dev25

a1e707c

v3.0.0.dev25 Pre-release

What's Changed

adjust filepath parquet files are output along by @Th368MoJ in #628
fix filepath verification in athena by @Th368MoJ in #630
fix realtime linking by @RobinL in #629
v25 by @RobinL in #631

Full Changelog: v3.0.0.dev24...v3.0.0.dev25

Contributors

RobinL and ThomasHepworth

12 Jul 12:13

ThomasHepworth

v3.0.0.dev24

b952e4c

v3.0.0.dev24 Pre-release

What's Changed

Athena cluster studio fix for release by @Th368MoJ in #626

Full Changelog: v3.0.0.dev23...v3.0.0.dev24

Contributors

ThomasHepworth
