`pipenv` is used to manage the whole Python environment.

- Install `pipenv` (this only needs to be done once).
- Run `pipenv install --dev` to create the Python virtual environment (if it does not exist yet) and install the scraper's dependencies. Re-run this command each time dependencies are modified.
- Run `pipenv shell` to enter the Python virtual environment. Do this each time you get back to work on the scraper.
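As a recap, a typical first-time setup looks like this (installing `pipenv` with `pip` is just one common option; see pipenv's own documentation for alternatives):

```sh
# One-time: install pipenv (one common way among several)
pip install --user pipenv

# Whenever dependencies change: create/update the virtual environment
pipenv install --dev

# Each work session: enter the virtual environment
pipenv shell
```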
WARNING! Please be aware that the scraper sends auth headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!
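As a minimal sketch, assuming `allowed_domains` is accepted as a top-level key of the crawler's JSON configuration (the index name and URLs below are placeholders):

```json
{
  "index_name": "example",
  "start_urls": ["https://docs.example.com/"],
  "allowed_domains": ["docs.example.com"]
}
```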
If you have to scrape sites protected by Cloudflare Access, you must set the appropriate HTTP headers. Values for these headers are taken from the environment variables `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET`.
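Assuming your setup reads environment variables from the `.env` file (as it does for `CHROMEDRIVER_PATH` below), this could look like the following, with placeholder values:

```
CF_ACCESS_CLIENT_ID=<your-client-id>
CF_ACCESS_CLIENT_SECRET=<your-client-secret>
```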
In case of Google Cloud Identity-Aware Proxy, please specify these env variables instead:

- `IAP_AUTH_CLIENT_ID`: pick the client ID of the application you are connecting to
- `IAP_AUTH_SERVICE_ACCOUNT_JSON`: generate in Actions -> Create key -> JSON
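A sketch with placeholder values (whether the second variable takes the JSON content inline or a path to the key file is an assumption to verify against the scraper's source):

```
IAP_AUTH_CLIENT_ID=<oauth-client-id>.apps.googleusercontent.com
IAP_AUTH_SERVICE_ACCOUNT_JSON=<service-account-key-json>
```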
Websites that need JavaScript for rendering are passed through ChromeDriver. Download the version suited to your OS, then update the `CHROMEDRIVER_PATH` in your `.env` file.
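For example (the path is illustrative; point it at wherever you placed the binary):

```
CHROMEDRIVER_PATH=/usr/local/bin/chromedriver
```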
You should be ready to go.
See the dedicated page on Algolia's documentation website.
The code is checked against linting rules by the CI, with `pylint` (which is installed by `pipenv` as a dev package). To run the linter, run the following command at the root of your clone:

```sh
pipenv run pylint scraper cli deployer
```
To run the full test suite, run `./docsearch test`.
If you are an Algolia employee and want to manage a DocSearch account, you'll need to copy and paste the required variables into your `.env` file. Ping the @DocSearch team to get access to those credentials.
Clone the configurations repository into your docsearch-scraper directory. From the scraper root, run:

```sh
git clone git@github.com:algolia/docsearch-configs.git configs/public
```

The CLI will then have more commands for you to run.
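If you are unsure which commands were added, invoking the CLI without arguments should print the available commands (behavior may vary across versions):

```sh
./docsearch
```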
To spot why a crawl failed without reading the logs, we have defined some custom exit statuses:

| Exit code | Reason |
| --- | --- |
| 3 | No records extracted from the crawl |
| 4 | Too many hits returned from the crawl |
| 5 | The configuration provided is not valid JSON |
| 6 | The endpoint to call is incorrect |
| 7 | Credentials used for the request are not set |
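For example, a wrapper script can branch on these codes instead of parsing logs (the config path and the `run` command are illustrative):

```sh
./docsearch run configs/public/example.json
case $? in
  3) echo "No records extracted from the crawl" ;;
  4) echo "Too many hits returned from the crawl" ;;
  5) echo "Configuration is not valid JSON" ;;
  6) echo "Incorrect endpoint" ;;
  7) echo "Missing credentials" ;;
esac
```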