-
Notifications
You must be signed in to change notification settings - Fork 1
ambertch/php-product-scraper
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is a basic page scraper I wrote for fun Page retrieval and parsing: - site specific custom Handlers are kept in a subfolder and autoloaded as necessary. Each Handler is a subclass of ProductScraper. The Handler calls its parent's universal scraper functions using its own query, and processes the result. - custom Handlers can be tested out by calling ProductScraper::getInfo('http://store.americanapparel.net/rsa642sg.html'); in index.php, as I have written a sample Handler for American Apparel. - php 5's DOM extension allows us to parse the page. The string returned by curl is turned into a DOMDocument. Via the defaultScraper or site specific Handlers, that and a DOMXPath object along with an xpath query is passed to the universal scraper functions. - I'm choosing to grab data with XPath instead of a simple regex/strpos(). I think there is an advantage in robustness for a production implementation, by traversing the DOM. - having the universal scrapers return price and description as a DOMNodeList confers numerous advantages. For example it allows a custom Handler to process multiple pricings in a modular way if the item is on sale, or if multiple pricing options are available. Similarly, a product's description may often be contained in multiple nodes - returning a single string is less conducive to semantic processing. - The default handler actually does not use the universal price scraper. This design decision was made because the default scraper should be as dumb and general as possible, so it will just preg_match() for a float matching /\$\d+\.\d\d/. The custom handlers will use the universal scraper and do this in a more robust fashion, retrieving the known node containing the price. What this code is missing: Error handling...
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published