Add scraper celery task to update Durham images #113

Open
copelco opened this issue May 8, 2014 · 4 comments
@copelco
Member

copelco commented May 8, 2014

No description provided.

@copelco copelco added this to the Phase 3 milestone May 8, 2014
@tamcap
Contributor

tamcap commented May 22, 2014

Not sure how to approach this one... I used a site scraper to map each address back to its property_id. I don't mind sharing that code so we can incorporate it.

However:

  • Running it for the already processed establishments is pointless, as their addresses usually don't change. Thus the property_id stays the same, and the image won't change. We use the most recent image that Durham County has for a given property_id (we actually hotlink to their site...)
  • We could run it asynchronously on import (even through celery) for new establishments and for the ones that moved.

@vrocha
Contributor

vrocha commented May 22, 2014

I think one of the problems we have is that some of the images are no longer found. If we were to scrape the images on a regular basis, every week or every other week, we could update which properties have an image and which ones don't, and maybe even get images for establishments that did not have one previously.

@tamcap
Contributor

tamcap commented May 23, 2014

OK, to summarize this and #112:

  • we need to move the picture URL logic out of the view and into the model
  • at update, or when run manually:
      a) for new establishments (or address updates), a site scraper tries to determine property_id
      b) if the establishment has a property_id, check whether Durham County serves an image or returns a 404, and populate image_url accordingly
      c) if there is no property_id, image_url stays blank

Does this sound reasonable?
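
Steps (b) and (c) above could be sketched roughly as follows. This is only an illustration: `IMAGE_URL_TEMPLATE` is a placeholder (the real Durham County URL pattern isn't spelled out in this thread), and `url_exists` and `resolve_image_url` are hypothetical names, not existing project code. Step (a), the scraper, is left out.

```python
import urllib.error
import urllib.request

# Hypothetical placeholder: the real Durham County image URL pattern
# is not given in this thread.
IMAGE_URL_TEMPLATE = "https://example.invalid/photos/{property_id}.jpg"


def url_exists(url, timeout=10):
    """Return True if a HEAD request succeeds, False on an HTTP or network error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. 404) is a subclass of URLError.
        return False


def resolve_image_url(property_id, exists=url_exists):
    """Return the image URL if it is being served, otherwise an empty string."""
    if not property_id:
        return ""  # (c) no property_id, so image_url stays blank
    url = IMAGE_URL_TEMPLATE.format(property_id=property_id)
    # (b) keep the URL only if the county actually serves an image for it
    return url if exists(url) else ""
```

Passing the existence check in as an argument keeps the HTTP call out of the model logic, so it can be stubbed in tests.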

@copelco
Member Author

copelco commented May 23, 2014

Sounds right. I'd suggest the scraping/photo_url check be its own standalone task in eatsmart.locations.durham that's run on a regular interval (like once a night). It can just skip over establishments with valid photo_urls.
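
A rough sketch of what that could look like, under a few assumptions: establishments are shown as plain dicts rather than Django models, "valid" is simplified to "non-empty", and the task path and the `lookup_photo_url` callable are hypothetical. The schedule uses the Celery 3.x `CELERYBEAT_SCHEDULE` setting; the real task would be a zero-argument wrapper that loads establishments from the database.

```python
from datetime import timedelta


def update_photo_urls(establishments, lookup_photo_url):
    """Fill in photo_url where it is missing; return how many were updated.

    `establishments` is an iterable of dicts (a stand-in for model instances);
    `lookup_photo_url` is a hypothetical callable that scrapes/checks one
    establishment and returns a URL or an empty string.
    """
    updated = 0
    for establishment in establishments:
        if establishment.get("photo_url"):
            continue  # already has a photo_url, skip it
        photo_url = lookup_photo_url(establishment)
        if photo_url:
            establishment["photo_url"] = photo_url
            updated += 1
    return updated


# Celery 3.x-style beat schedule: run the task once a day.
CELERYBEAT_SCHEDULE = {
    "update-durham-photo-urls": {
        # Hypothetical dotted path to a zero-argument task wrapping the function above.
        "task": "eatsmart.locations.durham.tasks.update_photo_urls",
        "schedule": timedelta(days=1),
    },
}
```

Skipping on a non-empty photo_url is the cheapest reading of "valid"; re-checking existing URLs for 404s would need an extra request per establishment.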
