Skip to content

ambertch/php-product-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 

Repository files navigation

This is a basic page scraper I wrote for fun

Page retrieval and parsing: 
- site specific custom Handlers are kept in a subfolder and autoloaded as necessary. Each Handler is a subclass of ProductScraper. The Handler calls its parent's universal scraper functions using its own query, and processes the result.
- custom Handlers can be tested out by calling ProductScraper::getInfo('http://store.americanapparel.net/rsa642sg.html'); in index.php, as I have written a sample Handler for American Apparel.

- php 5's DOM extension allows us to parse the page. The string returned by curl is turned into a DOMDocument. Via the defaultScraper or site specific Handlers, that and a DOMXPath object along with an xpath query is passed to the universal scraper functions.

- I'm choosing to grab data with XPath instead of a simple regex/strpos(). I think there is an advantage in robustness for a production implementation, by traversing the DOM.

- having the universal scrapers return price and description as a DOMNodeList confers numerous advantages. For example it allows a custom Handler to process multiple pricings in a modular way if the item is on sale, or if multiple pricing options are available. Similarly, a product's description may often be contained in multiple nodes - returning a single string is less conducive to semantic processing.

- The default handler actually does not use the universal price scraper. This design decision was made because the default scraper should be as dumb and general as possible, so it will just preg_match() for a float matching /\$\d+\.\d\d/. The custom handlers will use the universal scraper and do this in a more robust fashion, retrieving the known node containing the price.    



What this code is missing: Error handling...

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages