All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.


[0.6.1] - 2021-10-19

  • n_blocks: Added "guesstimate" as the default value. With this setting, an optimal number of blocks is guessed based on empirical observation.


[0.6.0] - 2021-09-21


  • Added matrix-blocking/splitting as a performance enhancer (see the project documentation for details).
  • Added new keyword arguments force_symmetries and n_blocks (see the project documentation for details).
  • Added a dependency on the packages topn and sparse_dot_topn_for_blocks to support the matrix-blocking.
  • Added the capability to reuse a previously initialized StringGrouper: the corpus can now persist across high-level function calls such as match_strings() (see the project documentation for details).
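The matrix-blocking idea can be illustrated without any of the packages named above. The following is only a sketch, assuming that "blocking" means splitting the left-hand matrix into row blocks, multiplying each block separately, and stacking the partial results. The helper names `matmul` and `blocked_matmul` are hypothetical and are not part of string_grouper, whose actual implementation works on sparse matrices and may differ.

```python
def matmul(a, b):
    """Plain dense product of two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def blocked_matmul(a, b, n_blocks):
    """Multiply a by b one row-block of a at a time and stack the results.

    Hypothetical helper: only the left operand is split here.
    """
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    # Ceiling division, so that at most n_blocks blocks are produced.
    block_size = max(1, -(-len(a) // n_blocks))
    result = []
    for start in range(0, len(a), block_size):
        # Each block is a smaller, independent multiplication.
        result.extend(matmul(a[start:start + block_size], b))
    return result


a = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 3]]
b = [[1, 2], [3, 4]]
full = matmul(a, b)
blocked = blocked_matmul(a, b, n_blocks=2)
```

Both calls yield the same product. Blocking only changes how much intermediate data has to be held at once, which is why it can help with performance on large inputs.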


  • Improved the performance of the function match_most_similar.
  • The Series duplicates is now the left operand, while master is the right operand in the underlying left-join operation that does the string-matching.
  • Changed the default value of the keyword argument max_n_matches to the total number of strings in master. (max_n_matches is now defined as the maximum number of matches allowed per string in duplicates [or master if duplicates is not given]).
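The per-string meaning of max_n_matches can be sketched as follows. This is an illustration only, assuming that similarity scores are already available as a nested dictionary keyed by the left-hand string. The function `top_matches` and the scores are hypothetical, made-up values and do not come from string_grouper.

```python
import heapq


def top_matches(similarities, max_n_matches=None):
    """Keep at most max_n_matches best matches for each left-hand string.

    similarities maps each left-hand string to {right-hand string: score}.
    With max_n_matches=None, every right-hand string is allowed, which
    mirrors the idea of a default equal to the number of strings in master.
    """
    result = {}
    for left, scores in similarities.items():
        limit = len(scores) if max_n_matches is None else max_n_matches
        result[left] = heapq.nlargest(limit, scores.items(), key=lambda kv: kv[1])
    return result


# Made-up scores for two left-hand strings against three right-hand strings.
scores = {
    "foo": {"foo inc": 0.9, "fooo": 0.7, "bar": 0.1},
    "baz": {"foo inc": 0.2, "fooo": 0.0, "bar": 0.4},
}
best_one = top_matches(scores, max_n_matches=1)
everything = top_matches(scores)
```

The limit applies to each left-hand string independently, so lowering it trims every string's match list rather than the overall result.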

[0.5.0] - 2021-06-11


  • Added new keyword argument tfidf_matrix_dtype (the datatype for the tf-idf values of the matrix components). Allowed values are numpy.float32 and numpy.float64 (used by the required external package sparse_dot_topn version 0.3.1). Default is numpy.float32. (Note: numpy.float32 often leads to faster processing and a smaller memory footprint albeit less numerical precision than numpy.float64.)
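The trade-off between the two datatypes can be seen with a small sketch. It uses the standard-library array module as a stand-in for numpy.float32 and numpy.float64, assuming a platform where a C float occupies 4 bytes and a C double 8 bytes. It does not call string_grouper itself.

```python
from array import array

values = [0.1, 0.2, 0.3]

# "f" stores C floats (single precision), "d" stores C doubles.
as_float32 = array("f", values)
as_float64 = array("d", values)

# Memory used by the raw values: half as much for single precision.
bytes32 = as_float32.itemsize * len(as_float32)
bytes64 = as_float64.itemsize * len(as_float64)

# Rounding error introduced by storing 0.1 in single precision.
error32 = abs(as_float32[0] - 0.1)
error64 = abs(as_float64[0] - 0.1)
```

Single precision halves the storage per value but introduces a small rounding error that double precision avoids for the same input.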


  • Changed dependency on sparse_dot_topn from version 0.2.9 to 0.3.1
  • Changed the default datatype for cosine similarities from numpy.float64 to numpy.float32 to boost computational performance at the expense of numerical precision.
  • Changed the default value of the keyword argument max_n_matches from 20 to the number of strings in duplicates (or master, if duplicates is not given).
  • An exception is now raised, instead of a warning, when all of the following hold: include_zeroes=True, min_similarity ≤ 0, and max_n_matches is not high enough to capture all nonzero-similarity matches.


  • Removed the keyword argument suppress_warning

[0.4.0] - 2021-04-11


  • Added group representative functionality - by default the centroid is used. From @ParticularMiner

  • Added string_grouper_utils package with additional group-representative functionality:

    • new_group_rep_by_earliest_timestamp
    • new_group_rep_by_completeness
    • new_group_rep_by_highest_weight

    From @ParticularMiner

  • Original indices are now added by default to output of group_similar_strings, match_most_similar and match_strings. From @ParticularMiner

  • Added the compute_pairwise_similarities function. From @ParticularMiner


  • The default group representative is now the centroid. Previously it was the first string in the series belonging to a group. From @ParticularMiner
  • Output of match_most_similar and match_strings is now a pandas.DataFrame object instead of a pandas.Series by default. From @ParticularMiner
  • Fixed a bug that occurred when min_similarity=0. From @ParticularMiner
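The difference between the old and new default group representative in the 0.4.0 entries can be sketched as follows. This is a minimal illustration, assuming that "centroid" means the group member whose summed similarity to the other members is largest. The function `centroid_index`, the group, and the similarity values are hypothetical, made-up examples, not string_grouper output.

```python
def centroid_index(similarity_matrix):
    """Return the index of the member most similar, in total, to the others."""
    totals = [
        sum(score for j, score in enumerate(row) if j != i)  # skip self-similarity
        for i, row in enumerate(similarity_matrix)
    ]
    return max(range(len(totals)), key=totals.__getitem__)


group = ["ACME Corp", "ACME Corporation", "ACME Corp."]

# Made-up symmetric pairwise similarities between the group members.
sims = [
    [1.0, 0.6, 0.9],
    [0.6, 1.0, 0.7],
    [0.9, 0.7, 1.0],
]

first_string = group[0]                 # representative under the old default
centroid = group[centroid_index(sims)]  # representative under the assumed centroid rule
```

With these values the two rules pick different representatives, which is why code that relied on the first string of each group may see different output after upgrading.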