Reduce the cost of making a dataset instance #322

xbito · 2018-11-14T12:35:52Z

We were exploring using scrunch to produce some count information in an internal website. But we realized that loading the count was pretty slow, taking 5-10 seconds to display the results.

Then we looked at the number of requests going to Crunch and found that several calls made at the moment you create an instance slow the process significantly for relatively large datasets (tens of thousands of variables):

INFO:__main__:Running: ds = get_mutable_dataset('185264f6f5924235afbcfba1d717f0f7')
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): app.crunch.io:443
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/ HTTP/1.1" 401 168
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "POST /api/public/login/ HTTP/1.1" 204 0
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/ HTTP/1.1" 200 401
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/feature_flag/?feature_name=old_projects_order HTTP/1.1" 200 160
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/ HTTP/1.1" 200 919
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/variables/ HTTP/1.1" 200 588962
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/variables/hier/ HTTP/1.1" 200 24574
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/settings/ HTTP/1.1" 200 222
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/folders/ HTTP/1.1" 200 617
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/folders/ HTTP/1.1" 200 617
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/folders/hidden/ HTTP/1.1" 200 168
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/folders/ HTTP/1.1" 200 617
DEBUG:urllib3.connectionpool:https://app.crunch.io:443 "GET /api/datasets/185264f6f5924235afbcfba1d717f0f7/folders/trash/ HTTP/1.1" 200 166

I believe those are related to loading self.folders, self._vars and self.order at init time. Can we make those lazy loaded?
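For illustration, here is a minimal sketch of what lazy loading could look like. It is not scrunch's actual code: `LazyDataset`, its attribute names, and the `fetch` callable are hypothetical stand-ins, and the paths simply mirror the endpoints in the log above. Each attribute triggers its GET on first access only, so constructing the instance costs no requests.

```python
from functools import cached_property


class LazyDataset:
    """Hypothetical dataset wrapper; NOT scrunch's real class."""

    def __init__(self, fetch):
        # `fetch` is an assumed callable that performs one GET per path.
        # Nothing is requested here, so construction stays cheap.
        self._fetch = fetch

    @cached_property
    def folders(self):
        # Fetched on first access, then cached on the instance.
        return self._fetch("folders/")

    @cached_property
    def variables(self):
        return self._fetch("variables/")

    @cached_property
    def order(self):
        return self._fetch("variables/hier/")


requests_made = []


def fake_fetch(path):
    # Stand-in for an HTTP GET; records the path instead of calling Crunch.
    requests_made.append(path)
    return {"path": path}


ds = LazyDataset(fake_fetch)
print(requests_made)  # [] : no requests at construction time
ds.folders
ds.folders
print(requests_made)  # ['folders/'] : one request despite two accesses
```

With this shape, a caller that only needs a count never pays for the large variables/ catalog response.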

jjdelc · 2018-11-16T21:00:44Z

What counts are you trying to obtain here?

Why not just use straight pycrunch and avoid all the Scrunch magic that is not necessary here?

jjdelc · 2018-11-16T21:02:09Z

Still, those requests look extremely redundant. It's definitely the use of chained attributes such as self.folders.hidden and then self.folders.trash, each of which makes the same GET to /folders/ to resolve the .folders part.
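A rough sketch of how the repeated GET to /folders/ could be avoided, assuming the root folder response is what resolves the hidden and trash paths. The `Folders` class, the `fetch` callable, and the shape of the fake responses are all hypothetical and do not reflect scrunch's internals. The root catalog is fetched once and reused for every chained access.

```python
class Folders:
    """Hypothetical folders accessor; NOT scrunch's real implementation."""

    def __init__(self, fetch):
        # `fetch` is an assumed callable that performs one GET per path.
        self._fetch = fetch
        self._root = None

    def _get_root(self):
        # Fetch the root folder catalog once and keep it for later lookups.
        if self._root is None:
            self._root = self._fetch("folders/")
        return self._root

    @property
    def hidden(self):
        return self._fetch(self._get_root()["hidden"])

    @property
    def trash(self):
        return self._fetch(self._get_root()["trash"])


calls = []


def fake_fetch(path):
    # Stand-in for an HTTP GET; the response shape is made up for the demo.
    calls.append(path)
    if path == "folders/":
        return {"hidden": "folders/hidden/", "trash": "folders/trash/"}
    return {"path": path}


folders = Folders(fake_fetch)  # one shared instance per dataset
folders.hidden
folders.trash
print(calls)  # ['folders/', 'folders/hidden/', 'folders/trash/']
```

The important part is that the dataset holds a single `Folders` instance; building a fresh accessor on every `self.folders` access would discard the cached root and bring the duplicate requests back.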

xbito · 2018-11-17T11:58:03Z

> What counts are you trying to obtain here?
>
> Why not just use straight pycrunch and avoid all the Scrunch magic that is not necessary here?

We actually took that approach. Still, I feel we should have the option to make scrunch a bit leaner.

xbito assigned jjdelc Nov 14, 2018
