
Releases: moj-analytical-services/splink

v3.3.0.dev02

19 Sep 14:16
a0c650f
Pre-release

What's Changed

Full Changelog: v3.3.0.dev01...v3.3.0.dev02

v3.3.0.dev01

15 Sep 19:07
b613a08
Pre-release

What's Changed

  • [DOCS] Add links to videos into main readme by @RobinL in #765
  • [FEAT] Add percentage difference to comparison level library by @RobinL in #757
  • [DOCS] Add examples section to docs by @RobinL in #772
  • [FEAT] Add jaro winkler to duckdb linker now 0.5.0 is a dependency by @RobinL in #766
  • [FEAT] Waterfall of false positives and false negatives from labels by @RobinL in #763
  • [FEAT] ROC/Precision recall/truth table from label column name by @RobinL in #773

Full Changelog: v3.2.1...v3.3.0.dev01

v3.2.1

06 Sep 14:01

What's Changed

  • [MAINT] Make Splink compatible with duckdb v0.5.0 release by @RobinL in #750
  • [Fix] Make json encoder more robust by @RobinL in #751
  • [FIX] fix table exists in db by @RobinL in #753
  • [MAINT] Bump version to 3.2.1 and duckdb to 0.5.0 by @RobinL in #758

Full Changelog: v3.2.0...v3.2.1

v3.2.0

30 Aug 16:43

What's Changed

There are two minor breaking changes:

(1) Settings must now always be provided to instantiate the linker object. The most minimal settings object is {"link_type": your_link_type}.
(2) By default, EM training sessions no longer estimate probability_two_random_records_match. This can be enabled by passing an argument explicitly.
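
As a rough illustration of breaking change (1), the sketch below builds the minimal settings object. The helper name `minimal_settings` is hypothetical and not part of Splink, and the linker call shown in the comment is an assumption about typical DuckDB-backend usage rather than something stated in these notes.

```python
# Link types accepted in a Splink 3 settings dictionary.
VALID_LINK_TYPES = {"dedupe_only", "link_only", "link_and_dedupe"}


def minimal_settings(link_type):
    """Hypothetical helper: return the smallest settings dict for a linker."""
    if link_type not in VALID_LINK_TYPES:
        raise ValueError(f"Unknown link_type: {link_type!r}")
    return {"link_type": link_type}


settings = minimal_settings("dedupe_only")
print(settings)

# Assumption: with splink installed and a pandas DataFrame `df`, the linker
# would then be created along these lines:
#
#   from splink.duckdb.duckdb_linker import DuckDBLinker
#   linker = DuckDBLinker(df, settings)
```

A real model would normally add comparisons and blocking rules to this dictionary; the point here is only that some settings object must now be passed.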

Features

Other

New Contributors

Full Changelog: v3.1.0...v3.2.0.dev01

v3.2.0.dev01

30 Aug 16:08
Pre-release

What's Changed

Features

Other

New Contributors

Full Changelog: v3.1.0...v3.2.0.dev01

v3.1.0

03 Aug 15:44

What's Changed

Warning
Version 3.1.0 includes a small, backwards-incompatible API change to the SparkLinker, i.e. it is a minor violation of semver.

The changes affect the SparkLinker only:

  • The default break_lineage_method will change to parquet
  • The break_lineage_after_blocking param is renamed to repartition_after_blocking for clarity
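
For code that stores SparkLinker keyword arguments in a dictionary, the rename above could be handled with a small shim like the following. This is a sketch only: `migrate_spark_linker_kwargs` is a hypothetical helper, not part of Splink, and it deliberately leaves `break_lineage_method` untouched, so callers who relied on the previous default would need to set it explicitly.

```python
def migrate_spark_linker_kwargs(kwargs):
    """Hypothetical helper: rename the pre-3.1.0 SparkLinker parameter.

    Returns a new dict; the input dict is not modified.
    """
    migrated = dict(kwargs)
    old_name = "break_lineage_after_blocking"
    new_name = "repartition_after_blocking"
    if old_name in migrated:
        if new_name in migrated:
            raise ValueError(f"Both {old_name} and {new_name} were supplied")
        migrated[new_name] = migrated.pop(old_name)
    return migrated


old_kwargs = {"break_lineage_after_blocking": True}
print(migrate_spark_linker_kwargs(old_kwargs))
```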

Features

  • Add the ability to use pyarrow + on-disk parquet/csv in duckdb by @ThomasHepworth in #684
  • Add completeness (by dataset) chart by @samnlindsay in #669
  • Add cumulative blocking rule comparison chart by @ThomasHepworth in #660
  • Allow find_matches_to_new_records to take table name as input, in addition to rows by @RobinL in #659

Bugfixes

Maintenance

  • [MAINT] Clarify sql execution function names by @RobinL in #690
  • [MAINT] Clarify Spark Linker caching logic by @RobinL in #691
  • [MAINT] Bump version to 3.1.0 by @RobinL in #693
  • Fix code formatting on count_num_comparisons_from_blocking_rules_for_prediction by @RobinL in #661
  • Add salting to spark full test by @RobinL in #655

Docs

  • Improve customising comparisons topic guide by @RobinL in #667
  • [DOCS] Performance topic guide, covering blocking by @RobinL in #675
  • [docs] Add issue template for bug report by @RobinL in #676
  • [DOCS] Add topic guide for optimising spark jobs by @RobinL in #679
  • [DOCS] Fix problem with spark docs copy by @RobinL in #685
  • [Docs] Developers' guide to caching and pipelining by @RobinL in #686
  • [Docs] Developer guide: Understanding and debugging Splink's computations by @RobinL in #688
  • [DOCS] Developers' guide to spark caching and pipelining by @RobinL in #689

Full Changelog: v3.0.1...v3.1.0

v3.0.1

18 Jul 17:12

What's Changed

Performance improvements

  • Improve the performance of our training steps by @Th368MoJ in #648
  • SparkLinker: Improve performance of estimate_u_using_random_sampling by @RobinL in #641

Features

  • Add Spark jar and UDF comparison functions to the Spark comparison library by @RobinL in #649

Other

  • splink_demos for Splink3 are now on the master branch by @RobinL in #637
  • Topic guide for salting by @RobinL in #638
  • Add topic guide to documentation covering different execution backends by @RobinL in #643
  • Fix __repr__ of EMTrainingSession by @RobinL in #645
  • Version and download numbers badges in readme by @samnlindsay in #650
  • Add acknowledgement for academic advisors by @RobinL in #652
  • Update README badges by @Th368MoJ in #651
  • Bump sqlglot version and 3.0.1 by @RobinL in #653

Full Changelog: v3.0.0...v3.0.1

Splink 3.0.0

12 Jul 19:13

What's changed

Splink version 3.0.0 is a complete rewrite. The major new features are:

  • Splink no longer requires Spark. It can now run against multiple backends, including DuckDB, Spark, and AWS Athena.

  • Term frequency adjustments can be applied more flexibly, with more options - see here

  • Using the DuckDB backend, close to real time linkage of new records is possible, enabling Splink to be embedded in search services, see here

  • Many Splink operations are faster across all backends. The most dramatic speedups are for smaller linkages of fewer than around 1m records, where using DuckDB rather than Spark can result in runtimes that are 10x faster or better.

  • A more comprehensive documentation website is available here

  • The cluster studio and comparison viewer dashboards are now bundled with Splink rather than being separate packages, making them simpler to use and preventing issues with version incompatibilities.

Upgrading

We recommend re-training models. However, we do provide a Splink 2 to 3 converter that attempts to convert a Splink 2 settings dictionary into the Splink 3 equivalent.

This works on a 'best efforts' basis, so it is not guaranteed to work for every model.

v3.0.0.dev25

12 Jul 16:22
a1e707c
Pre-release

What's Changed

  • adjust filepath parquet files are output along by @Th368MoJ in #628
  • fix filepath verification in athena by @Th368MoJ in #630
  • fix realtime linking by @RobinL in #629
  • v25 by @RobinL in #631

Full Changelog: v3.0.0.dev24...v3.0.0.dev25

v3.0.0.dev24

12 Jul 12:13
Pre-release

What's Changed

  • Athena cluster studio fix for release by @Th368MoJ in #626

Full Changelog: v3.0.0.dev23...v3.0.0.dev24