Add scraper celery task to update Durham images #113

copelco · 2014-05-08T01:15:47Z

No description provided.

tamcap · 2014-05-22T00:15:36Z

Not sure how to approach this one... I used a site scraper to resolve the address into a property_id. I don't mind sharing that code so we could incorporate it.

However:

Running it for the already processed establishments is pointless, as their addresses usually don't change. Thus the property_id stays the same, and the image won't change. We use the most recent image that Durham County has for a given property_id (we actually hotlink to their site...)
We could run it asynchronously on import (even through celery) for new establishments and the ones that moved.

vrocha · 2014-05-22T14:30:07Z

I think one of the problems we have is that some of the images are no
longer found. If we were to scrape the images on a regular basis, every
week or every other week, we could update which properties have an image
and which ones don't, and maybe even get images for establishments that did
not have one previously.


tamcap · 2014-05-23T02:00:31Z

OK, to summarize this and #112:

we need to move the picture URL logic out of the view and into the model
at update time, or when run manually:
a) for new establishments (and address updates), a site scraper tries to determine the property_id
b) if the establishment has a property_id, check whether Durham County serves an image or a 404, and populate image_url accordingly
c) if there is no property_id, image_url stays blank

Does this sound reasonable?
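
A rough sketch of steps a) to c), to make the flow concrete. Everything named here is an assumption for illustration: the `Establishment` dataclass stands in for the real model, `PHOTO_URL_TEMPLATE` is a placeholder (the actual Durham County URL pattern is not given in this thread), and `lookup_property_id` represents the site scraper, whose code is not shown here.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical placeholder; the real Durham County photo URL is not in this thread.
PHOTO_URL_TEMPLATE = "https://example.invalid/photos/{property_id}.jpg"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a HEAD request to `url`."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 404 and other HTTP error responses are raised as HTTPError.
        return exc.code


@dataclass
class Establishment:
    """Stand-in for the real model; field names are assumptions."""
    address: str
    property_id: Optional[str] = None
    image_url: str = ""


def update_image_url(
    est: Establishment,
    lookup_property_id: Callable[[str], Optional[str]],
    get_status: Callable[[str], int] = http_status,
) -> Establishment:
    # a) no property_id yet: ask the scraper to resolve it from the address
    if est.property_id is None:
        est.property_id = lookup_property_id(est.address)

    # c) still no property_id: image_url stays blank
    if est.property_id is None:
        est.image_url = ""
        return est

    # b) property_id known: keep the URL only if the server answers 200
    url = PHOTO_URL_TEMPLATE.format(property_id=est.property_id)
    est.image_url = url if get_status(url) == 200 else ""
    return est
```

Passing the scraper and the status check in as callables keeps the decision logic testable without hitting the county site. Network failures other than HTTP error responses are not handled in this sketch.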

copelco · 2014-05-23T02:17:27Z

Sounds right. I'd suggest the scraping/photo_url check be its own standalone task in eatsmart.locations.durham that's run on a regular interval (like once a night). It can just skip over establishments with valid photo_urls.
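
For illustration, the body of such a task might look roughly like the sketch below. It is written as a plain function so it runs without Celery; the dict rows and the `find_photo_url` callable are hypothetical stand-ins for the real model and the scraper/photo check, and "valid" is assumed here to mean "already populated".

```python
from typing import Callable, Iterable, MutableMapping, Optional

Row = MutableMapping[str, Optional[str]]


def refresh_photo_urls(
    establishments: Iterable[Row],
    find_photo_url: Callable[[Row], Optional[str]],
) -> int:
    """Fill in missing photo_urls and return how many rows were updated."""
    updated = 0
    for est in establishments:
        # Skip establishments that already have a photo_url.
        if est.get("photo_url"):
            continue
        url = find_photo_url(est)
        if url:
            est["photo_url"] = url
            updated += 1
    return updated
```

In the real project this would presumably be wrapped as a Celery task and scheduled to run nightly; that wiring is left out here.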

copelco added this to the Phase 3 milestone May 8, 2014

copelco added the enhancement label May 8, 2014

vrocha mentioned this issue Jul 1, 2014

[refs #112] Adds a new field to the Inspections model. #138

Open
