Unique counting in grantnav #1008

Open
michaelwood opened this issue Sep 14, 2023 · 0 comments

michaelwood commented Sep 14, 2023

Counts using elasticsearch for uniqueness are prone to errors or "fuzziness" after a certain number of documents (40k):
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#CO313-1

This is one of the trade-offs of being able to calculate aggregates without requiring very large computing resources.
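To make that trade-off concrete, here is a minimal sketch of probabilistic distinct counting using a K-Minimum-Values (KMV) sketch — a simpler cousin of the HyperLogLog++ algorithm elasticsearch actually uses for cardinality aggregates. The function name and parameters are our own, not GrantNav or elasticsearch code; the point is that memory stays bounded at k hashes, at the cost of an approximate answer:

```python
import hashlib
import heapq

def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values with a K-Minimum-Values
    sketch: keep only the k smallest 64-bit hashes seen, so memory is
    bounded however many documents stream past."""
    max_hash = float(2 ** 64)
    smallest = []  # min-heap of negated hashes == the k smallest hashes
    seen = set()   # hashes currently in the heap, to skip duplicates
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        if h in seen:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            seen.add(h)
        elif h < -smallest[0]:
            # h displaces the largest of the k smallest hashes kept so far
            seen.discard(-heapq.heappop(smallest))
            heapq.heappush(smallest, -h)
            seen.add(h)
    if len(smallest) < k:
        # fewer than k distinct values seen: the count is exact
        return len(smallest)
    kth = -smallest[0] / max_hash  # k-th smallest hash, normalised to (0, 1)
    return int((k - 1) / kth)      # standard KMV estimator, error ~ 1/sqrt(k)
```

Below k distinct values the answer is exact; above it the relative error grows with the true cardinality, which mirrors the `precision_threshold` behaviour described in the elasticsearch documentation.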

We count big totals in grantnav in two ways:

  1. The total number of documents returned by a query
  2. The total number of unique values in a field using cardinality aggregates

We count big totals in grantnav in two places:

  1. The summary area table on the results page
  2. The totals on the home page

Search summary aggregates (appears in the table on search pages):

SEARCH_SUMMARY_AGGREGATES = {
    "recipient_orgs": {"cardinality": {"field": "recipientOrganization.id", "precision_threshold": 40000}},
    "funding_orgs": {"cardinality": {"field": "fundingOrganization.id", "precision_threshold": 40000}},
    "recipient_indi": {"cardinality": {"field": "recipientIndividual.id", "precision_threshold": 40000}},
    "currency_stats": {
        "terms": {"field": "currency"},
        "aggs": {
            "amount_stats": {"stats": {"field": "amountAwarded"}},
            "largest_grant": {"top_hits": {"size": 1, "sort": [{"amountAwarded": {"order": "desc"}}]}},
            "smallest_grant": {"top_hits": {"size": 1, "sort": [{"amountAwarded": {"order": "asc"}}]}},
        }
    },
    "min_date": {"min": {"field": "awardDate"}},
    "max_date": {"max": {"field": "awardDate"}},
    "earliest_grant": {"top_hits": {"size": 1, "sort": [{"awardDate": {"order": "asc"}}]}},
    "latest_grant": {"top_hits": {"size": 1, "sort": [{"awardDate": {"order": "desc"}}]}},
}

Search summary aggregates with canonical org ids:

SEARCH_SUMMARY_AGGREGATES = {
    "recipient_orgs": {"cardinality": {"field": "additional_data.GNCanonicalRecipientOrgId", "precision_threshold": 40000}},
    "funding_orgs": {"cardinality": {"field": "additional_data.GNCanonicalFundingOrgId", "precision_threshold": 40000}},
    "recipient_indi": {"cardinality": {"field": "recipientIndividual.id", "precision_threshold": 40000}},

We also count things on the home page:

def totals_query():
    query = {"query": {"match_all": {}}}

    counts = {
        'grants': get_results(query)['hits']['total'],
        'funders': get_results(query, data_type='funder')['hits']['total'],
        'recipient_orgs': get_results(query, data_type='recipient')['hits']['total'],
        'recipient_indi': get_results(
            {
                "size": 0,  # Don't return the docs just the agg
                "aggs": {
                    "recipient_indi": {
                        "cardinality": {
                            "field": "recipientIndividual.id", "precision_threshold": 40000
                        }
                    }
                }
            }
        )["aggregations"]["recipient_indi"]["value"]
    }
    return counts

In the totals count for the home page we're asking "how many documents exist in the recipients dataset" and "how many documents exist in the funder dataset", instead of asking "how many unique ids are in the results". This is because we know we exported a unique list from the datastore, so we can assume that all the documents that exist are unique. The datastore (postgresql) is very good at counting unique things, so it is a good authority on that.
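A minimal sketch of why the datastore is a good authority for this — SQLite stands in for GrantNav's actual postgresql datastore here, and the table and column names are invented, but `COUNT(DISTINCT ...)` behaves the same way in both: it is exact, not approximate:

```python
import sqlite3

# In-memory database standing in for the real datastore
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recipient (org_id TEXT)")
conn.executemany(
    "INSERT INTO recipient VALUES (?)",
    [("GB-CHC-1",), ("GB-CHC-2",), ("GB-CHC-2",), ("GB-CHC-3",)],
)

# COUNT(DISTINCT ...) is computed exactly, however many rows exist
(unique_orgs,) = conn.execute(
    "SELECT COUNT(DISTINCT org_id) FROM recipient"
).fetchone()
```

Here `unique_orgs` is exactly 3 even though there are 4 rows, which is why a unique list exported from the datastore can be counted by document total on the elasticsearch side.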

If we count the unique number of org ids instead of the number of documents, however, we get a different number from the total number of documents.

def moo():
    test = get_results(
        {
            "_source": "id",
            # "size": 0,  # Don't return the docs, just the agg
            "aggs": {
                "total_uniq": {
                    "cardinality": {
                        "field": "id", "precision_threshold": 40000
                    }
                }
            }
        },
        data_type="recipient", size="400000")

returns:
{'total_uniq': 360689, 'total_number_of_docs': 361907}

In reality we are pretty sure these two numbers should match, because there should only be one document per unique recipient, as determined by the datastore by its org-id.

We're at the maximum precision threshold available in elasticsearch, so we need to think of a different way to work out unique values. Composite aggregations are often used for this, but they aren't recommended for large datasets: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
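For reference, the composite-aggregation approach pages through every unique bucket and counts the buckets client-side, one request per page — which is the cost that makes it unattractive at this scale. A minimal sketch of how the page queries could be built (the helper, the `uniq_id` source name, and the page size are all illustrative, not GrantNav code):

```python
def composite_page_query(after_key=None, page_size=1000):
    """Build one page of a composite aggregation over unique ids.
    Pass the previous response's after_key to fetch the next page;
    every page is a separate round trip to elasticsearch."""
    composite = {
        "size": page_size,
        "sources": [{"uniq_id": {"terms": {"field": "id"}}}],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume from the previous page
    return {"size": 0, "aggs": {"all_ids": {"composite": composite}}}
```

With ~360k unique recipients and a 1000-bucket page size, that is several hundred sequential requests per count, which is why the docs discourage it for large datasets.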

Some initial options here:

  • We could re-assess whether we show the search summary box when no query has been applied.
  • We could make sure that the counts are clearly labelled as approximate.
  • We could make the elasticsearch database smaller by partitioning it and writing server-side logic to combine results.
  • We could do all the counting on the server in python rather than in elasticsearch.
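As a sketch of the last option: stream documents out of elasticsearch (e.g. via the scroll or search_after APIs — the paging itself is elided here) and count ids exactly with a set on the web server. The function and its input shape are hypothetical; the trade is server memory for an exact answer:

```python
def count_unique(pages):
    """Count distinct document ids across pages of results.
    `pages` is an iterable of pages, each a list of documents
    with an "id" key (a stand-in for paged elasticsearch hits)."""
    seen = set()
    for page in pages:
        for doc in page:
            seen.add(doc["id"])
    return len(seen)
```

Memory use grows with the number of distinct ids (hundreds of thousands of short strings is manageable), and unlike a cardinality aggregate the result is exact.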