Releases: moj-analytical-services/splink
v3.3.0.dev02
What's Changed
Full Changelog: v3.3.0.dev01...v3.3.0.dev02
v3.3.0.dev01
What's Changed
- [DOCS] Add links to videos into main readme by @RobinL in #765
- [FEAT] Add percentage difference to comparison level library by @RobinL in #757
- [DOCS] Add examples section to docs by @RobinL in #772
- [FEAT] Add jaro winkler to duckdb linker now 0.5.0 is a dependency by @RobinL in #766
- [FEAT] Waterfall of false positives and false negatives from labels by @RobinL in #763
- [FEAT] ROC/Precision recall/truth table from label column name by @RobinL in #773
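As an illustration of the percentage difference comparison level listed above, the sketch below computes a relative difference between two numeric values and checks it against a threshold. The function names, and the choice to divide by the larger of the two values, are assumptions made for this example and are not confirmed to match Splink's implementation.

```python
def percentage_difference(left, right):
    """Relative difference between two numbers, scaled by the larger magnitude.

    Assumption of this sketch: the denominator is the larger absolute value,
    so the result always lies between 0 and 1 for same-signed inputs.
    """
    larger = max(abs(left), abs(right))
    if larger == 0:
        # Both values are zero, so there is no difference to report.
        return 0.0
    return abs(left - right) / larger


def within_percentage(left, right, threshold):
    """True when the two values differ by less than `threshold` (e.g. 0.2 for 20%)."""
    return percentage_difference(left, right) < threshold


print(percentage_difference(100, 90))
print(within_percentage(100, 90, 0.2))
```

A comparison with several levels could then apply successively looser thresholds, falling through to an "else" level when none is met.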
Full Changelog: v3.2.1...v3.3.0.dev01
v3.2.1
v3.2.0
What's Changed
There are two minor breaking changes:
(1) settings must now always be provided to instantiate the linker object. The most minimal settings object is {"link_type": your_link_type}
(2) By default, EM training sessions no longer estimate probability_two_random_records_match. This can be enabled by passing an argument explicitly.
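A minimal sketch of what the smallest settings object might look like in practice. The helper name minimal_settings is hypothetical and not part of Splink; the three link type strings are the values Splink 3 settings are understood to accept.

```python
# Link types understood to be accepted in a Splink 3 settings dictionary.
VALID_LINK_TYPES = {"dedupe_only", "link_only", "link_and_dedupe"}


def minimal_settings(link_type):
    """Build the smallest settings dictionary (hypothetical helper, not a Splink API)."""
    if link_type not in VALID_LINK_TYPES:
        raise ValueError(
            f"link_type must be one of {sorted(VALID_LINK_TYPES)}, got {link_type!r}"
        )
    return {"link_type": link_type}


settings = minimal_settings("dedupe_only")
print(settings)
```

The resulting dictionary would then be passed as the settings argument when instantiating the linker.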
Features
- [FEAT] Add support for pairwise format of clusters by @ThomasHepworth in #707
- [FEAT] Haversine comparison level by @ThomasHepworth in #721
- [FEAT] Databricks tweaks pr by @rjc89 in #715
- [FEAT] Direct estimation probability two random records match by @RobinL in #734
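For readers unfamiliar with the distance behind the Haversine comparison level, the sketch below shows the standard haversine formula in plain Python. It is not Splink's implementation; the function names and the choice of a 6371 km mean Earth radius are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(lat1, lon1, lat2, lon2, threshold_km):
    """True when two points are closer than `threshold_km` (hypothetical helper)."""
    return haversine_km(lat1, lon1, lat2, lon2) < threshold_km


# London to Paris, using approximate city-centre coordinates.
print(round(haversine_km(51.5074, -0.1278, 48.8566, 2.3522)))
```

A comparison built on this distance would typically use a few thresholds, for example treating records within 1 km as a closer match than records within 10 km.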
Other
- [DOCS] Update main readme to include clustering by @RobinL in #696
- add version tag by @ThomasHepworth in #695
- add a custom translation for cast(<val> as double) -> <val>D by @ThomasHepworth in #697
- add duckdb helper functions to a separate script by @ThomasHepworth in #700
- [docs] fix minor typo in docs by @Thomas-Hirsch in #708
- [MAINT] Log SQL statements before, not after, they are executed in Spark by @RobinL in #714
- Adjust input col sql logic by @ThomasHepworth in #725
- [Docs] Update dev guide to sqlglot and transpilation by @RobinL in #729
- [MAINT] Don't return html by default, it crashes jupyter by @RobinL in #735
- Update sqlglot v5 by @ThomasHepworth in #736
- Athena fixes by @ThomasHepworth in #738
- [DOCS] Add developers guide to building docs locally by @RobinL in #740
- [MAINT] Improve implementation of InputColumn and remove transpile by @RobinL in #727
- Document save_offline_chart and ensure it works if passed a vega lite chart by @RobinL in #742
New Contributors
- @Thomas-Hirsch made their first contribution in #708
- @rjc89 made their first contribution in #715
Full Changelog: v3.1.0...v3.2.0.dev01
v3.2.0.dev01
What's Changed
Features
- [FEAT] Add support for pairwise format of clusters by @ThomasHepworth in #707
- [FEAT] Haversine comparison level by @ThomasHepworth in #721
- [FEAT] Databricks tweaks pr by @rjc89 in #715
- [FEAT] Direct estimation probability two random records match by @RobinL in #734
Other
- [DOCS] Update main readme to include clustering by @RobinL in #696
- add version tag by @ThomasHepworth in #695
- add a custom translation for cast(<val> as double) -> <val>D by @ThomasHepworth in #697
- add duckdb helper functions to a separate script by @ThomasHepworth in #700
- [docs] fix minor typo in docs by @Thomas-Hirsch in #708
- [MAINT] Log SQL statements before, not after, they are executed in Spark by @RobinL in #714
- Adjust input col sql logic by @ThomasHepworth in #725
- [Docs] Update dev guide to sqlglot and transpilation by @RobinL in #729
- [MAINT] Don't return html by default, it crashes jupyter by @RobinL in #735
- Update sqlglot v5 by @ThomasHepworth in #736
- Athena fixes by @ThomasHepworth in #738
- [DOCS] Add developers guide to building docs locally by @RobinL in #740
- [MAINT] Improve implementation of InputColumn and remove transpile by @RobinL in #727
- Document save_offline_chart and ensure it works if passed a vega lite chart by @RobinL in #742
New Contributors
- @Thomas-Hirsch made their first contribution in #708
- @rjc89 made their first contribution in #715
Full Changelog: v3.1.0...v3.2.0.dev01
v3.1.0
What's Changed
Warning
In version 3.1.0 there's a small, backwards-incompatible API change to the SparkLinker, i.e. a minor violation of semver.
The changes affect the SparkLinker only:
- The default break_lineage_method will change to parquet
- The break_lineage_after_blocking param is renamed to repartition_after_blocking for clarity
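Code that passes the old parameter name as a keyword argument could be updated with a small shim like the one below. This is a sketch only: migrate_spark_linker_kwargs is a hypothetical helper, not part of Splink, and it handles nothing beyond the rename described above.

```python
OLD_NAME = "break_lineage_after_blocking"
NEW_NAME = "repartition_after_blocking"


def migrate_spark_linker_kwargs(kwargs):
    """Return a copy of `kwargs` with the pre-3.1.0 keyword renamed (hypothetical helper)."""
    migrated = dict(kwargs)  # copy so the caller's dictionary is left untouched
    if OLD_NAME in migrated:
        if NEW_NAME in migrated:
            raise ValueError(f"Pass only one of {OLD_NAME!r} and {NEW_NAME!r}")
        migrated[NEW_NAME] = migrated.pop(OLD_NAME)
    return migrated


old_kwargs = {"break_lineage_after_blocking": True}
print(migrate_spark_linker_kwargs(old_kwargs))
```

Callers that depend on the previous default lineage-breaking behaviour may also want to pass break_lineage_method explicitly rather than rely on the new parquet default.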
Features
- Add the ability to use pyarrow + on-disk parquet/csv in duckdb by @ThomasHepworth in #684
- Add completeness (by dataset) chart by @samnlindsay in #669
- Add cumulative blocking rule comparison chart by @ThomasHepworth in #660
- Allow find_matches_to_new_records to take table name as input, in addition to rows by @RobinL in #659
Bugfixes
- remove duplicate column selections by @ThomasHepworth in #681
- fix em training tooltip by @ThomasHepworth in #665
Maintenance
- [MAINT] Clarify sql execution function names by @RobinL in #690
- [MAINT] Clarify Spark Linker caching logic by @RobinL in #691
- [MAINT] Bump version to 3.1.0 by @RobinL in #693
- Fix code formatting on count_num_comparisons_from_blocking_rules_for_prediction by @RobinL in #661
- Add salting to spark full test by @RobinL in #655
Docs
- Improve customising comparisons topic guide by @RobinL in #667
- [DOCS] Performance topic guide, covering blocking by @RobinL in #675
- [docs] Add issue template for bug report by @RobinL in #676
- [DOCS] Add topic guide for optimising spark jobs by @RobinL in #679
- [DOCS] Fix problem with spark docs copy by @RobinL in #685
- [Docs] Developers' guide to caching and pipelining by @RobinL in #686
- [Docs] Developer guide: Understanding and debugging Splink's computations by @RobinL in #688
- [DOCS] Developers' guide to spark caching and pipelining by @RobinL in #689
Full Changelog: v3.0.1...v3.1.0
v3.0.1
What's Changed
Performance improvements
- Improve the performance of our training steps by @Th368MoJ in #648
- SparkLinker: Improve performance of estimate_u_using_random_sampling by @RobinL in #641
Features
Other
- splink_demos for Splink3 are now on the master branch by @RobinL in #637
- Add topic guide to documentation covering different execution backends by @RobinL in #643
- Version and download numbers badges in readme by @samnlindsay in #650
- Add acknowledgement for academic advisors by @RobinL in #652
- Update README badges by @Th368MoJ in #651
Full Changelog: v3.0.0...v3.0.1
Splink 3.0.0
What's changed
splink version 3.0.0 is a complete re-write. The major new features are:
- Splink no longer requires Spark. It can now run against multiple backends, including DuckDB, Spark, and AWS Athena.
- Term frequency adjustments can be applied more flexibly, with more options - see here
- Using the DuckDB backend, close to real time linkage of new records is possible, enabling Splink to be embedded in search services, see here
- Many Splink operations are faster across all backends. The most dramatic speedups are for smaller linkages of fewer than around 1m records, where using DuckDB rather than Spark can result in runtimes that are 10x faster or better.
- A more comprehensive documentation website is available here
- The cluster studio and comparison viewer dashboards are now bundled with Splink rather than being separate packages, making them simpler to use and preventing issues with version incompatibilities
Upgrading
We recommend re-training models. However, we do provide a Splink 2 to 3 converter that attempts to convert a Splink 2 settings dictionary into the Splink 3 equivalent.
This works on a 'best efforts' basis, so it is not guaranteed to work for every model.
v3.0.0.dev25
v3.0.0.dev24
What's Changed
- Athena cluster studio fix for release by @Th368MoJ in #626
Full Changelog: v3.0.0.dev23...v3.0.0.dev24