Commit
Add warning for exact match approximation in high record volumes
aditya-balachander committed Dec 18, 2024
1 parent 09980f8 commit db4f5b6
Showing 2 changed files with 9 additions and 0 deletions.
6 changes: 6 additions & 0 deletions cumulusci/tasks/bulkdata/select_utils.py
@@ -343,6 +343,12 @@ def annoy_post_process(
    threshold: T.Union[float, None],
) -> T.Tuple[T.List[dict], list]:
    """Processes the query results for the similarity selection strategy using Annoy algorithm for large number of records"""
    # Add warning when threshold is 0
    if threshold is not None and threshold == 0:
        logger.warning(
            "Warning: A threshold of 0 may miss exact matches in high volumes. Use a small value like 0.1 for better accuracy."
        )

    selected_records = []
    insertion_candidates = []

3 changes: 3 additions & 0 deletions docs/data.md
@@ -352,6 +352,9 @@ This parameter is **optional**; if not specified, no threshold will be applied a

This feature is particularly useful during version upgrades, where records that closely match can be selected, while those that do not match sufficiently can be inserted into the target org.

**Important Note:**
For high volumes of records, an approximation algorithm is used to improve performance. In that case, a threshold of `0` does not guarantee that exact matches are selected, because the algorithm can assign a small non-zero similarity score to an exact match. To select exact matches reliably, set the threshold to a small value slightly greater than `0`, such as `0.1`. This keeps the performance benefit of the approximation while still selecting exact matches.
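
The effect described in the note can be illustrated with a minimal sketch. It is not the CumulusCI implementation: `partition_by_threshold` is a hypothetical helper, the scores are made-up values, and the rule that a record is selected when its score is at or below the threshold is an assumption made for illustration.

```python
import typing as T


def partition_by_threshold(
    scores: T.List[float], threshold: T.Optional[float]
) -> T.Tuple[T.List[int], T.List[int]]:
    """Hypothetical helper: split record indices into selected vs. to-insert.

    Assumes a record is selected when its score is at or below the threshold,
    and that no threshold means every record is selected.
    """
    selected, to_insert = [], []
    for index, score in enumerate(scores):
        if threshold is None or score <= threshold:
            selected.append(index)
        else:
            to_insert.append(index)
    return selected, to_insert


# Made-up scores: record 0 is an exact match, but the approximate search
# is assumed to report a tiny non-zero score instead of exactly 0.
scores = [1e-4, 0.05, 0.8]

strict = partition_by_threshold(scores, 0)     # threshold of exactly 0
lenient = partition_by_threshold(scores, 0.1)  # small non-zero threshold
```

Under these assumptions, `strict` selects nothing, so even the exact match in record 0 would be inserted, while `lenient` selects records 0 and 1 and leaves only record 2 for insertion.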

---

#### Example