Optimize count returns #3

jotegui · 2016-05-13T11:12:09Z

Currently, counts are calculated by retrieving the full list of records and then returning just the length of the array. This is highly inefficient (e.g. it took more than 2 hours to get the number of records mentioning mvz).
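For illustration only, here is a minimal sketch of the pattern described above next to a variant that keeps a running total instead of holding every record in memory. `fetch_page` and `make_fake_fetcher` are hypothetical stand-ins for whatever paged query the API uses, not the project's actual code. Both versions still have to walk every page, so the second one would only reduce memory use, not the number of requests.

```python
def count_by_materializing(fetch_page):
    """Collect every record into one list, then return its length."""
    records = []
    cursor = None
    while True:
        page, cursor = fetch_page(cursor)
        records.extend(page)
        if cursor is None:
            break
    return len(records)


def count_by_paging(fetch_page):
    """Walk the same pages but keep only a running total."""
    total = 0
    cursor = None
    while True:
        page, cursor = fetch_page(cursor)
        total += len(page)
        if cursor is None:
            break
    return total


def make_fake_fetcher(n_records, page_size):
    """Hypothetical in-memory data source that serves records in pages."""
    def fetch_page(cursor):
        start = cursor or 0
        end = min(start + page_size, n_records)
        page = list(range(start, end))
        # A cursor of None signals that there are no more pages.
        next_cursor = end if end < n_records else None
        return page, next_cursor
    return fetch_page
```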


tucotuco · 2016-05-23T12:54:44Z

I do not know of a way to get counts efficiently AND accurately with GAE. However, for the case in question of small record sets, I believe the estimated count is good enough and could be used to make the determination.
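A rough sketch of how an estimated count could drive that determination. It assumes the estimate comes from something like the `number_found` value on a GAE Search API result, and it uses the 1K threshold mentioned in issue #5. The function name and the route labels are hypothetical.

```python
# Threshold taken from issue #5 (small recordsets of <=1K records).
SMALL_RECORDSET_LIMIT = 1000


def choose_download_route(estimated_count, limit=SMALL_RECORDSET_LIMIT):
    """Pick a download route from an estimated record count.

    Returns "direct" for small record sets and "queued" for anything
    larger. Both labels are illustrative, not names used by the API.
    """
    if estimated_count < 0:
        raise ValueError("estimated_count must be non-negative")
    return "direct" if estimated_count <= limit else "queued"
```

Because the estimate is only used to choose between two routes, it would only need to be reliable around the threshold, not for very large result sets.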

jotegui · 2016-05-24T11:01:05Z

You are right, @tucotuco, I was not familiar with Google's Search API and I guess I was expecting a bit too much, like a count method or so... So, it seems the only way of counting records is to actually retrieve them and return the length of the array. sigh

Actually, given this difficulty and the current structure, I have been thinking of dropping this whole issue, and here is why:

There is little (if any) potential use for a method such as count from the users' perspective.
Record counts are actually only useful for direct calls to the download API, since portal downloads come after a search event, where the record count is already calculated. And direct downloads via portal-web have already been implemented.
If we enable a new parameter in the search API (like format), where users can decide whether to get records in JSON or TXT format, they will be able to download via that method. But that makes the distinction between the two methods a bit blurry...
We can use an approach such as GBIF's: put a hard limit on the number of records retrievable via a direct call to the search API, and suggest using the download API for larger searches...
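The hard-limit idea in the last point could look roughly like the sketch below. The cap value, the function name, and the response keys are all assumptions made for illustration. Nothing here is taken from the existing API or from GBIF's implementation.

```python
# Hypothetical cap; the actual value would be a project decision.
MAX_SEARCH_RECORDS = 1000


def apply_search_cap(requested, cap=MAX_SEARCH_RECORDS):
    """Clamp the number of records a direct search call may return.

    When the request exceeds the cap, the response carries a hint
    pointing the caller to the download API instead.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    truncated = requested > cap
    response = {"limit": min(requested, cap), "truncated": truncated}
    if truncated:
        response["message"] = (
            "Search results are capped at %d records; "
            "use the download API for larger searches." % cap
        )
    return response
```

For example, `apply_search_cap(50)` leaves the request untouched, while `apply_search_cap(5000)` returns a limit of 1000 plus the hint.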

Again, just thinking out loud here...

tucotuco · 2016-05-24T12:28:51Z

I agree with all of these observations.


jotegui added the critical label May 13, 2016

jotegui mentioned this issue May 13, 2016

Allow direct download of small (<=1K) recordsets #5

Open
