24.12.2018

- fix interface to scrape() [done]
- add to GitHub

24.1.2019

- fix issue #3: add support for supplying a keyword file

27.1.2019

- Add functionality to block images and CSS from loading as described here:
    https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
    https://www.scrapehero.com/how-to-build-a-web-scraper-using-puppeteer-and-node-js/
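
A minimal sketch of the blocking idea, assuming Puppeteer's request interception (`page.setRequestInterception`, `request.resourceType()`, `request.abort()`, `request.continue()`). The helper names and the list of blocked types are hypothetical, not taken from this project.

```javascript
// Resource types to skip. 'image' and 'stylesheet' are values reported by
// Puppeteer's request.resourceType(); the choice of list is an assumption.
const BLOCKED_RESOURCE_TYPES = ['image', 'stylesheet'];

// Decide what to do with one intercepted request (hypothetical helper).
function handleRequest(request) {
  if (BLOCKED_RESOURCE_TYPES.includes(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Enable interception on a Puppeteer page and install the handler.
async function blockImagesAndCss(page) {
  await page.setRequestInterception(true);
  page.on('request', handleRequest);
}
```

Calling `await blockImagesAndCss(page)` before `page.goto(...)` should keep images and stylesheets from being fetched for that page.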

29.1.2019

- implement proxy support
    - implement proxy check

- implement scraping more than 1 page
    - do it for Google
    - and Bing
- implement DuckDuckGo scraping

30.1.2019

- modify all scrapers to use the generic class where it makes sense
    - Bing, Baidu, Google, DuckDuckGo

7.2.2019

- add num_requests to test cases [done]

25.2.2019

- add support for browsing with multiple browsers, use this neat library: https://github.com/thomasdondorf/puppeteer-cluster [done]
    - see also: https://antoinevastel.com/crawler/2018/09/20/parallel-crawler-puppeteer.html

28.2.2019

- write test case for multiple browsers/proxies
- write test case and example for multiple tabs with bing
- make README.md nicer, using https://github.com/thomasdondorf/puppeteer-cluster/blob/master/README.md as a template

11.6.2019

- TODO: fix Amazon scraping
- change API of remaining test cases [done]
- TODO: implement custom search engine parameters on scrape()

12.6.2019

- remove unnecessary sleep() calls and replace them with waitFor calls on selectors
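
A rough sketch of the replacement, assuming Puppeteer's `page.waitForSelector(selector, { timeout })`. The function name, the default timeout and the idea of returning a boolean are assumptions for illustration only.

```javascript
// Wait until the results container exists instead of sleeping a fixed time.
// Returns true if the selector appeared, false if waiting failed for any
// reason (typically a timeout). Hypothetical helper, not the project's code.
async function waitForResults(page, selector, timeoutMs = 8000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (err) {
    return false;
  }
}
```

Compared with a static sleep, this returns as soon as the element is present and only spends the full timeout when the page really has no results.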

16.7.2019

- resolve issues
    - fix #37 [done]
- use puppeteer stealth plugin: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
    - we will need to look at the concurrency implementation of puppeteer-cluster [no TypeScript support :( so I will not support this right now]
- use random user agents plugin: https://github.com/intoli/user-agents [done]
- add screenshot capability (take the screenshot after parsing)
    - store as b64 [done]
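
For the screenshot item, a small sketch assuming Puppeteer's `page.screenshot` with its `encoding: 'base64'` and `fullPage` options. The wrapper name is hypothetical, and capturing the full page is an assumption.

```javascript
// Take a screenshot of the current page and return it as a base64 string,
// so it can be stored next to the parsed results (hypothetical helper).
async function screenshotAsBase64(page) {
  return page.screenshot({ encoding: 'base64', fullPage: true });
}
```

Calling this after the parsing step matches the note above: the image then shows the page in the state that was parsed.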

12.8.2019

- add static test case for Bing [done]
- add options that shrink the output of the html_output flag [done]
    - clean_html_output removes all JS and CSS from the HTML
    - clean_data_images removes all data images from the HTML
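
A rough, regex-based sketch of what the two options could do to an HTML string. This is an assumption for illustration: the function names are hypothetical, the real implementation may work on the DOM instead, and regexes will miss unusual markup. The data-image variant only empties the `src` attribute rather than removing the element.

```javascript
// Strip <script> and <style> blocks plus stylesheet <link> tags.
function cleanHtmlOutput(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
    .replace(/<link\b[^>]*rel=["']?stylesheet["']?[^>]*>/gi, '');
}

// Empty the src of <img> tags that embed their content as a data: URL.
function cleanDataImages(html) {
  return html.replace(
    /(<img\b[^>]*?\bsrc=)(["'])data:image\/[^"']*\2/gi,
    '$1$2$2'
  );
}
```

Inline data images are often the largest part of a saved results page, so emptying them alone can shrink the stored HTML noticeably.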

13.8.2019

- write test case for clean HTML output [done]
- consider a better compression algorithm [done]: Brotli exists, but it is only supported in very recent versions of Node.js
- what else can we remove from the DOM? [done] Comment nodes are now removed; they are large on Bing.
- remove all whitespace, \n and \t from the HTML
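
On the compression note above: a sketch that feature-detects Brotli in Node's built-in `zlib` module and falls back to gzip on older runtimes. The function names and the `{ algorithm, data }` shape are hypothetical.

```javascript
const zlib = require('zlib');

// Brotli only exists in newer Node.js releases, so detect it at runtime.
const hasBrotli = typeof zlib.brotliCompressSync === 'function';

// Compress an HTML string; remember which algorithm was used.
function compressHtml(html) {
  const input = Buffer.from(html, 'utf8');
  return hasBrotli
    ? { algorithm: 'br', data: zlib.brotliCompressSync(input) }
    : { algorithm: 'gzip', data: zlib.gzipSync(input) };
}

// Reverse compressHtml using the recorded algorithm.
function decompressHtml({ algorithm, data }) {
  const output = algorithm === 'br'
    ? zlib.brotliDecompressSync(data)
    : zlib.gunzipSync(data);
  return output.toString('utf8');
}
```

Storing the algorithm name alongside the data keeps results readable even if they were written by a Node.js version without Brotli.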

TODO:

- fix googlenewsscraper: waiting for results and parsing; remove the static sleep [done]
- when using multiple browsers and random user agents, pass a random user agent to each perBrowserOptions entry
- don't create a new tab when opening a new scraper
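
One possible shape for the random user agent item, assuming each `perBrowserOptions` entry is a set of launch options with an optional `args` array and that Chromium's `--user-agent=` switch is used. The function name and the sample pool are placeholders; the user-agents package linked earlier could supply real values.

```javascript
// Placeholder pool for illustration only.
const USER_AGENTS = [
  'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0',
];

// Return a copy of perBrowserOptions where every entry gets its own
// randomly chosen user agent appended to its launch args.
function withRandomUserAgents(perBrowserOptions, pool = USER_AGENTS, random = Math.random) {
  return perBrowserOptions.map((options) => {
    const userAgent = pool[Math.floor(random() * pool.length)];
    return {
      ...options,
      args: [...(options.args || []), `--user-agent=${userAgent}`],
    };
  });
}
```

The input array is left untouched, and the `random` parameter exists only so the choice can be made deterministic in tests.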
