Unique counting in grantnav #1008
michaelwood added commits that referenced this issue on Oct 13, 2023 (two commits), Oct 16, 2023 (four commits), and Oct 18, 2023 (one commit).
Unique counts computed with Elasticsearch are approximate and prone to errors or "fuzziness" once the count passes a certain size (40k, the maximum precision threshold):
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#CO313-1
This is one of the trade-offs for being able to calculate aggregates without requiring very large computing resources.
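For reference, a minimal sketch of the kind of request body involved. The field name `org_id` and the aggregation name `total_uniq` are assumptions for illustration; the actual grantnav queries are not reproduced here. `precision_threshold` is the cardinality aggregation's accuracy setting, and values above 40000 behave the same as 40000.

```python
import json

# Maximum supported precision_threshold for the cardinality aggregation;
# larger values have the same effect as 40000.
MAX_PRECISION_THRESHOLD = 40000


def unique_count_query(field, precision_threshold=MAX_PRECISION_THRESHOLD):
    """Build a search body that asks only for an approximate unique count.

    `field` is a hypothetical keyword field name, e.g. "org_id".
    """
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "total_uniq": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": min(
                        precision_threshold, MAX_PRECISION_THRESHOLD
                    ),
                }
            }
        },
    }


print(json.dumps(unique_count_query("org_id"), indent=2))
```

Below the threshold the count is expected to be close to exact; above it, the result is an estimate, which is the situation described in this issue.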
We count big totals in grantnav in two places:
Search summary aggregates (shown in the table on search pages):
Search summary aggregates with canonical org ids:
We also count things on the home page:
In the totals count for the home page we're asking "how many documents exist in the recipients dataset" and "how many documents exist in the funder dataset", instead of asking "how many unique ids are in the results". This is because we know we exported a unique list from the datastore, so we can assume that all the documents that exist are unique. The datastore (PostgreSQL) is very good at counting unique things, so it is a good authority on that.
If we count the unique number of org ids instead of the number of documents, though, we get a different number from the total number of documents. Comparing the two returns:
{'total_uniq': 360689, 'total_number_of_docs': 361907}
In reality we are pretty sure these two numbers should match, because there should be only one document per unique recipient, as determined by its org-id in the datastore.
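To put a size on the gap, here is a small sketch that works only from the two numbers reported above. The variable names are illustrative.

```python
# Figures as reported in this issue.
result = {"total_uniq": 360689, "total_number_of_docs": 361907}

# Documents the approximate unique count fails to account for.
missing = result["total_number_of_docs"] - result["total_uniq"]

# The same gap as a percentage of the document count.
relative_error_pct = 100 * missing / result["total_number_of_docs"]

print(f"undercount: {missing} documents")
print(f"relative error: {relative_error_pct:.2f}%")
```

The undercount is 1218 documents, roughly a third of one percent of the total. That is small in relative terms but clearly visible when the number is displayed next to an exact document count.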
We're at the maximum precision threshold available in Elasticsearch, so we need to think of a different way to work out unique values. Composite aggregations are often used, but these aren't recommended for large datasets: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
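For comparison, this is a rough sketch of what the composite aggregation approach looks like: page through every distinct value using `after_key` and count the buckets. It is not a proposal, given the caveat above about large datasets. The `search` callable, the field name and the aggregation name `uniq` are all assumptions; in practice `search` would wrap whatever Elasticsearch client call grantnav uses.

```python
def count_unique_exact(search, field, page_size=1000):
    """Count distinct values of `field` by paging a composite aggregation.

    `search` is a hypothetical callable that takes a request body (dict)
    and returns the Elasticsearch response as a dict.
    """
    total = 0
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"value": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            # Resume from the last bucket of the previous page.
            composite["after"] = after_key

        response = search({"size": 0, "aggs": {"uniq": {"composite": composite}}})
        agg = response["aggregations"]["uniq"]
        buckets = agg["buckets"]

        # Each bucket is one distinct value of the field.
        total += len(buckets)

        after_key = agg.get("after_key")
        if not buckets or after_key is None:
            return total
```

Each page costs one request, so a dataset with around 360k distinct ids needs several hundred round trips at this page size. That cost is the reason this approach is a poor fit for counts that are rendered on every search.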
Some initial options here: