Unique counting in grantnav #1008

Open
michaelwood opened this issue Sep 14, 2023 · 0 comments

michaelwood commented Sep 14, 2023

Counts using elasticsearch for uniqueness are prone to errors or "fuzziness" after a certain number of documents (40k):
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#CO313-1

This is one of the trade-offs of being able to calculate aggregates without requiring very large computing resources.
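To make that trade-off concrete, here is a minimal sketch of probabilistic distinct counting using a K-Minimum-Values (KMV) sketch — a simpler cousin of the HyperLogLog++ algorithm elasticsearch actually uses for cardinality aggregates. The function name and parameters are our own, not GrantNav or elasticsearch code; the point is that memory stays bounded at k hashes, at the cost of an approximate answer:

```python
import hashlib
import heapq

def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values with a K-Minimum-Values
    sketch: keep only the k smallest 64-bit hashes seen, so memory is
    bounded however many documents stream past."""
    max_hash = float(2 ** 64)
    smallest = []  # min-heap of negated hashes == the k smallest hashes
    seen = set()   # hashes currently in the heap, to skip duplicates
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        if h in seen:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            seen.add(h)
        elif h < -smallest[0]:
            # h displaces the largest of the k smallest hashes kept so far
            seen.discard(-heapq.heappop(smallest))
            heapq.heappush(smallest, -h)
            seen.add(h)
    if len(smallest) < k:
        # fewer than k distinct values seen: the count is exact
        return len(smallest)
    kth = -smallest[0] / max_hash  # k-th smallest hash, normalised to (0, 1)
    return int((k - 1) / kth)      # standard KMV estimator, error ~ 1/sqrt(k)
```

Below k distinct values the answer is exact; above it the relative error grows with the true cardinality, which mirrors the `precision_threshold` behaviour described in the elasticsearch documentation.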

We count big totals in grantnav in two ways:

  1. The total number of documents returned by a query
  2. The total number of unique values in a field using cardinality aggregates

We count big totals in grantnav in two places:

  1. The summary area table on the results page
  2. The totals on the home page

Search summary aggregates (appears in the table on search pages):

SEARCH_SUMMARY_AGGREGATES = {
    "recipient_orgs": {"cardinality": {"field": "recipientOrganization.id", "precision_threshold": 40000}},
    "funding_orgs": {"cardinality": {"field": "fundingOrganization.id", "precision_threshold": 40000}},
    "recipient_indi": {"cardinality": {"field": "recipientIndividual.id", "precision_threshold": 40000}},
    "currency_stats": {
        "terms": {"field": "currency"},
        "aggs": {
            "amount_stats": {"stats": {"field": "amountAwarded"}},
            "largest_grant": {"top_hits": {"size": 1, "sort": [{"amountAwarded": {"order": "desc"}}]}},
            "smallest_grant": {"top_hits": {"size": 1, "sort": [{"amountAwarded": {"order": "asc"}}]}},
        }
    },
    "min_date": {"min": {"field": "awardDate"}},
    "max_date": {"max": {"field": "awardDate"}},
    "earliest_grant": {"top_hits": {"size": 1, "sort": [{"awardDate": {"order": "asc"}}]}},
    "latest_grant": {"top_hits": {"size": 1, "sort": [{"awardDate": {"order": "desc"}}]}},
}

Search summary aggregates with canonical org ids:

SEARCH_SUMMARY_AGGREGATES = {
    "recipient_orgs": {"cardinality": {"field": "additional_data.GNCanonicalRecipientOrgId", "precision_threshold": 40000}},
    "funding_orgs": {"cardinality": {"field": "additional_data.GNCanonicalFundingOrgId", "precision_threshold": 40000}},
    "recipient_indi": {"cardinality": {"field": "recipientIndividual.id", "precision_threshold": 40000}},

We also count things on the home page:

def totals_query():
    query = {"query": {"match_all": {}}}

    counts = {
        'grants': get_results(query)['hits']['total'],
        'funders': get_results(query, data_type='funder')['hits']['total'],
        'recipient_orgs': get_results(query, data_type='recipient')['hits']['total'],
        'recipient_indi': get_results(
            {
                "size": 0,  # Don't return the docs just the agg
                "aggs": {
                    "recipient_indi": {
                        "cardinality": {
                            "field": "recipientIndividual.id", "precision_threshold": 40000
                        }
                    }
                }
            }
        )["aggregations"]["recipient_indi"]["value"]
    }
    return counts

In the totals count for the home page we're asking "how many documents exist in the recipients dataset" and "how many documents exist in the funder dataset", instead of asking "how many unique ids are in the results". This is because we know we exported a unique list from the datastore, so we can assume that all the documents that exist are unique. The datastore (postgresql) is very good at counting unique things, so it is a good authority on that.
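A minimal sketch of why the datastore is a good authority for this — SQLite stands in for GrantNav's actual postgresql datastore here, and the table and column names are invented, but `COUNT(DISTINCT ...)` behaves the same way in both: it is exact, not approximate:

```python
import sqlite3

# In-memory database standing in for the real datastore
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recipient (org_id TEXT)")
conn.executemany(
    "INSERT INTO recipient VALUES (?)",
    [("GB-CHC-1",), ("GB-CHC-2",), ("GB-CHC-2",), ("GB-CHC-3",)],
)

# COUNT(DISTINCT ...) is computed exactly, however many rows exist
(unique_orgs,) = conn.execute(
    "SELECT COUNT(DISTINCT org_id) FROM recipient"
).fetchone()
```

Here `unique_orgs` is exactly 3 even though there are 4 rows, which is why a unique list exported from the datastore can be counted by document total on the elasticsearch side.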

If we count the unique number of org ids instead of the number of documents, however, we get a different number from the total number of documents.

def moo():
    test = get_results(
        {
            "_source": "id",
            # "size": 0,  # Don't return the docs, just the agg
            "aggs": {
                "total_uniq": {
                    "cardinality": {
                        "field": "id", "precision_threshold": 40000
                    }
                }
            }
        },
        data_type="recipient", size="400000")

returns:
{'total_uniq': 360689, 'total_number_of_docs': 361907}

In reality we are pretty sure these two numbers should match, because there should only be one document per unique recipient, as determined by the datastore by its org-id.

We're at the maximum precision threshold available in elasticsearch, so we need to think of a different way to work out unique values. Composite aggregations are often used for this, but they aren't recommended for large datasets: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
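For reference, the composite-aggregation approach pages through every unique bucket and counts the buckets client-side, one request per page — which is the cost that makes it unattractive at this scale. A minimal sketch of how the page queries could be built (the helper, the `uniq_id` source name, and the page size are all illustrative, not GrantNav code):

```python
def composite_page_query(after_key=None, page_size=1000):
    """Build one page of a composite aggregation over unique ids.
    Pass the previous response's after_key to fetch the next page;
    every page is a separate round trip to elasticsearch."""
    composite = {
        "size": page_size,
        "sources": [{"uniq_id": {"terms": {"field": "id"}}}],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume from the previous page
    return {"size": 0, "aggs": {"all_ids": {"composite": composite}}}
```

With ~360k unique recipients and a 1000-bucket page size, that is several hundred sequential requests per count, which is why the docs discourage it for large datasets.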

Some initial options here:

  • We could re-assess whether we show the search summary box when no query has been applied.
  • We could make sure that the counts are clearly labelled as approximate.
  • We could make the elasticsearch database smaller by partitioning it and writing server-side logic to combine results.
  • We could do all the counting on the server in python rather than in elasticsearch.
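As a sketch of the last option: stream documents out of elasticsearch (e.g. via the scroll or search_after APIs — the paging itself is elided here) and count ids exactly with a set on the web server. The function and its input shape are hypothetical; the trade is server memory for an exact answer:

```python
def count_unique(pages):
    """Count distinct document ids across pages of results.
    `pages` is an iterable of pages, each a list of documents
    with an "id" key (a stand-in for paged elasticsearch hits)."""
    seen = set()
    for page in pages:
        for doc in page:
            seen.add(doc["id"])
    return len(seen)
```

Memory use grows with the number of distinct ids (hundreds of thousands of short strings is manageable), and unlike a cardinality aggregate the result is exact.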