DocSearch scraper

This repository is a fork of the DocSearch Scraper. Its goal is to apply the changes needed for the crawler to run and extract all elements inside the shadow DOM.

Installation and Usage

Dependencies

  • Python 3.6
  • pip
  • A specific version of pipenv:
    pip3 install pipenv==2018.11.26

→ Installation for STATIC PAGES

(1) Clone this repository

(2) Create your config.json file describing everything the crawler should extract from your pages.
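
For reference, a minimal config.json sketch; the index name, start URL, and selectors are placeholder values to adapt to your own pages:

  {
    "index_name": "your_index_name",
    "start_urls": ["https://domain.com/"],
    "selectors": {
      "lvl0": "h1",
      "lvl1": "h2",
      "text": "p"
    }
  }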

(3) Create your .env file (see the sketch after the list) with:

  • APPLICATION_ID: the ID of the application in which you want to store the crawler extractions.
  • API_KEY: the API key for your targeted application. Make sure to use an API key with write access to your index; it needs the addObject, editSettings, and deleteIndex ACLs.
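
A minimal .env sketch with placeholder values:

  APPLICATION_ID=YOUR_APPLICATION_ID
  API_KEY=YOUR_WRITE_API_KEY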

(4) Install and create the isolated environment

pipenv install

(5) Copy the contents of the utils/webclient.py file from this repository into the webclient.py file of the scrapy package installed in the isolated environment, located at virtualenvs/created-environment/lib/python3.6/site-packages/scrapy/core/downloader/webclient.py
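
One way to do this from the repository root; the destination path is a sketch, and pipenv --venv prints the actual location of the environment created on your machine:

  cp utils/webclient.py "$(pipenv --venv)/lib/python3.6/site-packages/scrapy/core/downloader/webclient.py"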

(6) Activate the isolated environment with

pipenv shell

(7) Then, at the shell prompt, run:

./docsearch run ./path/to/your/config.json

→ Installation for CLIENT-SIDE PAGES

For client-side rendered pages, one extra configuration step is needed. First, "js_render": true is required in the config file so that the crawler knows the page is not static. The "js_wait" attribute defines how long (in seconds) the crawler should wait for the page to load.
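
For example, the relevant attributes in config.json might look like this sketch (the 2-second wait is just a placeholder):

  {
    "js_render": true,
    "js_wait": 2
  }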

Only the steps that differ from the previous case are listed below:

(3) Some websites require JavaScript to render. The crawler relies on headless Chrome emulation, so you will need to set up a ChromeDriver. Install the driver that matches your OS and your Chrome version; we recommend using the latest version.
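
To check that the driver matches your browser, something like the following can help (command names vary by OS; these assume a Linux install):

  google-chrome --version
  chromedriver --version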

(3.5) Create your .env file (see the sketch after the list) with:

  • APPLICATION_ID: the ID of the application in which you want to store the crawler extractions.
  • API_KEY: the API key for your targeted application. Make sure to use an API key with write access to your index; it needs the addObject, editSettings, and deleteIndex ACLs.
  • CHROMEDRIVER_PATH: the path to the downloaded and extracted driver.
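
A minimal .env sketch with placeholder values:

  APPLICATION_ID=YOUR_APPLICATION_ID
  API_KEY=YOUR_WRITE_API_KEY
  CHROMEDRIVER_PATH=/path/to/chromedriver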

IMPORTANT: Action configuration (scrape only a few documents without resetting the index)

When running this scraper as a GitHub Action in VTEX documentation repositories (its default use case), the following configuration is needed:

(1) In your config.json, the start_urls attribute is required and its first entry must always be the page root URL, e.g. https://domain.com/

(2) Also set the attribute "is_file_update": true in your config.json
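
Put together, the relevant part of config.json might look like this sketch (the URLs are placeholders):

  {
    "start_urls": ["https://domain.com/", "https://domain.com/docs/some-page"],
    "is_file_update": true
  }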

Useful links