Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in lookup_contactbyname algorithm #403

Open
MarcMichalsky opened this issue Sep 14, 2023 · 1 comment
Open

Bug in lookup_contactbyname algorithm #403

MarcMichalsky opened this issue Sep 14, 2023 · 1 comment

Comments

@MarcMichalsky
Copy link
Contributor

The algorithm that suggests contacts has a weakness.

When we search for a contact such as "Max Mustermann", we get a wrong probability if there are other contacts with partially matching but shorter names, such as "Max" or "Mustermann".

The problem is that the number of matches of all single name mutations with the same contact has no influence on the probability so far.

Also, I think that squaring the $new_probability should be done before comparing it to the old $probability. Otherwise, there could be a small but unintended shift in each iteration.

I did my best to find a better algorithm and if you like it, I can do a PR.

Here is my code for testing:

<?php

/**
 * Fake api function for testing.
 */
function fake_civicrm_api3($entity, $action, $params)
{
    switch ($params['name']) {
        case "Max":
            return [
                "values" => [
                    ["id" => 1, "sort_name" => "Mustermann, Max"],
                    ["id" => 2, "sort_name" => "Mustardman, Max"],
                    ["id" => 3, "sort_name" => "Max"],
                ]
            ];
        case "Mustermann":
            return [
                "values" => [
                    ["id" => 1, "sort_name" => "Mustermann, Max"],
                    ["id" => 4, "sort_name" => "Mustermann, Marc"],
                    ["id" => 5, "sort_name" => "Mustermann"],
                ]
            ];
        case "Marie":
            return [
                "values" => [
                    ["id" => 6, "sort_name" => "Zimmermann, Marie"],
                    ["id" => 7, "sort_name" => "Meier, Marie"],
                ]
            ];
        default:
            return ["values" => []];
    }
}

/**
 * Original algorithm
 */
function _original_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, $params) {
    $contacts_found = array();
    // query quicksearch for each combination
    foreach ($name_mutations as $name_mutation) {
        $result = fake_civicrm_api3('Contact', 'getquick', array('name' => $name_mutation));
        foreach ($result['values'] as $contact) {
            // get the current maximum similarity...
            if (isset($contacts_found[$contact['id']])) {
                $probability = $contacts_found[$contact['id']];
            } else {
                $probability = 0.0;
            }

            // now, we'll have to find the maximum similarity with any of the name mutations
            $compare_name = strtolower($contact['sort_name']);
            foreach ($name_mutations as $name_mutation) {
                $new_probability = 0.0;
                similar_text(strtolower($name_mutation), $compare_name, $new_probability);
                //error_log("Compare '$name_mutation' to '".$contact['sort_name']."' => $new_probability");
                $new_probability /= 100.0;
                if ($new_probability > $probability) {
                    // square value for better distribution, multiply by 0.999 to avoid 100% match based on name
                    $probability = $new_probability * $new_probability * 0.999;
                }
            }

            $contacts_found[$contact['id']] = $probability;
        }
    }

    return $contacts_found;
}

/**
 * New and improved algorithm
 */
function _new_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, $params) {
    $contacts_found = [];
    $occurrences = [];
    // query quicksearch for each combination
    foreach ($name_mutations as $name_mutation) {
        $result = fake_civicrm_api3('Contact', 'getquick', array('name' => $name_mutation));
        foreach ($result['values'] as $contact) {
            $occurrences[$contact['id']] = isset($occurrences[$contact['id']]) ? $occurrences[$contact['id']] + 1 : 1;
            // get the current maximum similarity...
            if (isset($contacts_found[$contact['id']])) {
                $probability = $contacts_found[$contact['id']];
            } else {
                $probability = 0.0;
            }

            // now, we'll have to find the maximum similarity with any of the name mutations
            $compare_name = strtolower($contact['sort_name']);
            foreach ($name_mutations as $name_mutation) {
                $new_probability = 0.0;
                similar_text(strtolower($name_mutation), $compare_name, $new_probability);
                //error_log("Compare '$name_mutation' to '".$contact['sort_name']."' => $new_probability");
                $new_probability /= 100.0;
                // square value for better distribution, multiply by 0.999 to avoid 100% match based on name
                $new_probability = $new_probability * $new_probability * 0.999; // Do this before comparison with the old $probability
                if ($new_probability > $probability) $probability = $new_probability;
            }
            $contacts_found[$contact['id']] = $probability;
        }
    }

    $max_occurrence = max($occurrences);
    foreach ($contacts_found as $contact_id_found => $contact_probability) {
        $contacts_found[$contact_id_found] = $contact_probability * $occurrences[$contact_id_found] / $max_occurrence;
    }
    return $contacts_found;
};

$name_mutations = ["Max", "und", "Marie", "Mustermann"];

$results_original = _original_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, []);
$results_new = _new_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, []);

foreach ($results_original as $contact_id => $probability_original) {
    $probability_new = $results_new[$contact_id];
    print("[$contact_id => ['probability_original' => $probability_original], ['probability_new' => $probability_new]],\n");
}

Result:

[1 => ['probability_original' => 0.63936], ['probability_new' => 0.63936]],
[2 => ['probability_original' => 0.4091904], ['probability_new' => 0.2045952]],
[3 => ['probability_original' => 0.999], ['probability_new' => 0.4995]],
[6 => ['probability_original' => 0.26859259259259], ['probability_new' => 0.1342962962963]],
[7 => ['probability_original' => 0.20640495867769], ['probability_new' => 0.17283737024221]],
[4 => ['probability_original' => 0.59112426035503], ['probability_new' => 0.29556213017751]],
[5 => ['probability_original' => 0.999], ['probability_new' => 0.4995]],
@MarcMichalsky
Copy link
Contributor Author

Maybe that was a bit too cryptic. I'll try again here.

We are trying to find matching contacts for the account holder "Max and Marie Mustermann".

Here is the comparison of the results of the original algorithm compared to the results of the improvement.

Contact Original Probability New Probability Ranking old Ranking new
"Mustermann, Max" 0.63936 0.63936 2 1
"Mustardman, Max" 0.4091904 0.2045952 4 4
"Max" 0.999 0.4995 1 2
"Zimmermann, Marie" 0.26859259259259 0.1342962962963 6 6
"Meier, Marie" 0.20640495867769 0.17283737024221 5 5
"Mustermann, Marc" 0.59112426035503 0.29556213017751 3 3
"Mustermann" 0.999 0.4995 1 2

We see that original algorithm always lets the shorter match win. That's a problem.

I have adapted the algorithm so that it takes into account the number of hits for the same contact for all name components (name_mutations) in the weighting.

However, the search via getquick is somewhat outdated, but the alternative search via SQL unfortunately has the same problem. The only problem is that the same solution cannot be applied there, as the result of the SQL query does not return the same contacts more than once. This means that it is not possible to weight them, at least not without making further queries. In addition, the SQL search only searches for name mutations that are at the beginning of the sort_name:

$sql_clauses[] = "(`sort_name` LIKE '{$name_mutation}%')";

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant