Support the upcoming proxy API of Zyte API (#108)
Gallaecio authored Oct 19, 2023
1 parent 3761acb commit 5541784
Showing 8 changed files with 512 additions and 205 deletions.
26 changes: 16 additions & 10 deletions .github/workflows/main.yml
@@ -6,21 +6,27 @@ jobs:
strategy:
matrix:
include:
- python-version: 2.7
- python-version: 3.8
env:
TOXENV: py27,stack-scrapy-1.4,stack-scrapy-1.5
- python-version: 3.5
TOXENV: min
- python-version: 3.8
env:
TOXENV: py35,stack-scrapy-1.8-py3,stack-scrapy-2.0-py3,stack-scrapy-2.1-py3,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3
- python-version: 3.6
TOXENV: py
- python-version: 3.9
env:
TOXENV: py36,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3,stack-scrapy-2.4-py3
- python-version: 3.7
TOXENV: py
- python-version: "3.10"
env:
TOXENV: py37,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3,stack-scrapy-2.4-py3
- python-version: 3.8
TOXENV: py
- python-version: "3.11"
env:
TOXENV: py
- python-version: "3.11"
env:
TOXENV: security
- python-version: "3.11"
env:
TOXENV: py38,security,docs,stack-scrapy-2.2-py3,stack-scrapy-2.3-py3,stack-scrapy-2.4-py3
TOXENV: docs
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
42 changes: 42 additions & 0 deletions docs/headers.rst
@@ -0,0 +1,42 @@
Headers
=======

The Zyte proxy API services that you can use with this downloader middleware
each support a different set of HTTP request and response headers that give
you access to additional features. You can find more information about those
headers in the documentation of each service, `Zyte API’s <zyte-api-headers_>`_
and `Zyte Smart Proxy Manager’s <spm-headers_>`_.

.. _zyte-api-headers: https://docs.zyte.com/zyte-api/usage/proxy-api.html
.. _spm-headers: https://docs.zyte.com/smart-proxy-manager.html#request-headers

If you try to use a header for one service while using the other service, this
downloader middleware will try to translate your header into the right header
for the target service and, regardless of whether or not translation was done,
the original header will be dropped.

Also, response headers that can be translated are always translated,
without dropping the original header, so code expecting a response header from
one service can work even if a different service was used.

Translation is supported for the following headers:

========================= ===========================
Zyte API Zyte Smart Proxy Manager
========================= ===========================
``Zyte-Client`` ``X-Crawlera-Client``
``Zyte-Device`` ``X-Crawlera-Profile``
``Zyte-Error`` ``X-Crawlera-Error``
``Zyte-Geolocation`` ``X-Crawlera-Region``
``Zyte-JobId`` ``X-Crawlera-JobId``
``Zyte-Override-Headers`` ``X-Crawlera-Profile-Pass``
========================= ===========================

Also, if a request is not being proxied and includes a header for any of these
services, that header will be dropped, to prevent leaking data to external websites.
This downloader middleware assumes that a header prefixed with ``Zyte-`` is a
Zyte API header, and that a header prefixed with ``X-Crawlera-`` is a Zyte
Smart Proxy Manager header, even if they are not known headers otherwise.

When dropping a header, be it as part of header translation or to avoid leaking
data, a warning message with details will be logged.
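
The rules above can be illustrated with a minimal sketch. It is not the
middleware's actual code: the function names, the ``target`` values
(``"zyte-api"`` and ``"spm"``) and the choice to let a native header win over a
translated one are assumptions made for illustration, and header values are
passed through unchanged. Only the header names come from the table above.

```python
# Illustrative sketch only; not the middleware's actual implementation.
# Header pairs copied from the translation table above.
ZYTE_API_TO_SPM = {
    "Zyte-Client": "X-Crawlera-Client",
    "Zyte-Device": "X-Crawlera-Profile",
    "Zyte-Error": "X-Crawlera-Error",
    "Zyte-Geolocation": "X-Crawlera-Region",
    "Zyte-JobId": "X-Crawlera-JobId",
    "Zyte-Override-Headers": "X-Crawlera-Profile-Pass",
}
SPM_TO_ZYTE_API = {spm: zyte for zyte, spm in ZYTE_API_TO_SPM.items()}
PROXY_PREFIXES = ("zyte-", "x-crawlera-")


def translate_request_headers(headers, target):
    """Return (headers_to_send, dropped_names) for a proxied request.

    ``target`` is a hypothetical selector: "zyte-api" or "spm".
    """
    if target == "zyte-api":
        foreign_prefix, table = "x-crawlera-", SPM_TO_ZYTE_API
    elif target == "spm":
        foreign_prefix, table = "zyte-", ZYTE_API_TO_SPM
    else:
        raise ValueError("target must be 'zyte-api' or 'spm'")
    lookup = {name.lower(): other for name, other in table.items()}

    # First pass: keep every header that does not belong to the other service.
    kept = {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(foreign_prefix)
    }
    # Second pass: drop foreign headers, translating them when possible.
    dropped = []
    for name, value in headers.items():
        if name.lower().startswith(foreign_prefix):
            dropped.append(name)  # the original header is always dropped
            translated = lookup.get(name.lower())
            if translated is not None:
                # Assumption: an explicitly set native header wins.
                kept.setdefault(translated, value)
    return kept, dropped


def add_translated_response_headers(headers):
    """Add the counterpart of each translatable header, keeping the original."""
    lookup = {
        name.lower(): other
        for name, other in {**ZYTE_API_TO_SPM, **SPM_TO_ZYTE_API}.items()
    }
    result = dict(headers)
    for name, value in headers.items():
        counterpart = lookup.get(name.lower())
        if counterpart is not None:
            result.setdefault(counterpart, value)
    return result


def strip_proxy_headers(headers):
    """Remove every proxy-specific header from an unproxied request."""
    return {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(PROXY_PREFIXES)
    }
```

In this sketch the caller receives the list of dropped header names, which is
where a warning could be logged for each of them.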
160 changes: 97 additions & 63 deletions docs/index.rst
@@ -2,104 +2,138 @@
scrapy-zyte-smartproxy |version| documentation
==============================================

scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to interact with
`Zyte Smart Proxy Manager`_ (formerly Crawlera) automatically.
.. toctree::
:hidden:

headers
settings
news

scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to use one of
Zyte’s proxy APIs: either the proxy API of `Zyte API`_ or `Zyte Smart Proxy
Manager`_ (formerly Crawlera).

.. _Scrapy downloader middleware: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html
.. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/

Configuration
=============

.. toctree::
:caption: Configuration
#. Add the downloader middleware to your ``DOWNLOADER_MIDDLEWARES`` Scrapy
setting:

.. code-block:: python
:caption: settings.py
* Add the Zyte Smart Proxy Manager middleware including it into the ``DOWNLOADER_MIDDLEWARES`` in your ``settings.py`` file::
DOWNLOADER_MIDDLEWARES = {
...
'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610
}
DOWNLOADER_MIDDLEWARES = {
...
'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610
}
#. Enable the middleware and configure your API key, either through Scrapy
settings:

* Then there are two ways to enable it
.. code-block:: python
:caption: settings.py
* Through ``settings.py``::
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = 'apikey'
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = 'apikey'
Or through spider attributes:

* Through spider attributes::
.. code-block:: python
class MySpider:
zyte_smartproxy_enabled = True
zyte_smartproxy_apikey = 'apikey'
class MySpider(scrapy.Spider):
zyte_smartproxy_enabled = True
zyte_smartproxy_apikey = 'apikey'
.. _ZYTE_SMARTPROXY_URL:

* (optional) If you are not using the default Zyte Smart Proxy Manager proxy (``http://proxy.zyte.com:8011``),
for example if you have a dedicated or private instance,
make sure to also set ``ZYTE_SMARTPROXY_URL`` in ``settings.py``, e.g.::
#. Set the ``ZYTE_SMARTPROXY_URL`` Scrapy setting as needed:

ZYTE_SMARTPROXY_URL = 'http://myinstance.zyte.com:8011'
- To use the proxy API of Zyte API, set it to
``http://api.zyte.com:8011``:

How to use it
=============
.. code-block:: python
:caption: settings.py
.. toctree::
:caption: How to use it
:hidden:
ZYTE_SMARTPROXY_URL = "http://api.zyte.com:8011"
settings
- To use the default Zyte Smart Proxy Manager endpoint, leave it unset.

:doc:`settings`
All configurable Scrapy Settings added by the Middleware.
- To use a custom Zyte Smart Proxy Manager endpoint, in case you have a
dedicated or private instance, set it to your custom endpoint. For
example:

.. code-block:: python
:caption: settings.py
With the middleware, the usage of Zyte Smart Proxy Manager is automatic, every request will go through Zyte Smart Proxy Manager without nothing to worry about.
If you want to *disable* Zyte Smart Proxy Manager on a specific Request, you can do so by updating `meta` with `dont_proxy=True`::
ZYTE_SMARTPROXY_URL = "http://myinstance.zyte.com:8011"
scrapy.Request(
'http://example.com',
meta={
'dont_proxy': True,
...
},
)
Usage
=====

Once the downloader middleware is properly configured, every request goes
through the configured Zyte proxy API.

Remember that you are now making requests to Zyte Smart Proxy Manager, and the Zyte Smart Proxy Manager service will be the one actually making the requests to the different sites.
.. _override:

If you need to specify special `Zyte Smart Proxy Manager headers <https://docs.zyte.com/smart-proxy-manager.html#request-headers>`_, just apply them as normal `Scrapy headers <https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers>`_.
Although the plugin configuration only allows defining a single proxy API
endpoint and API key, it is possible to override them for specific requests, so
that you can use different combinations for different requests within the same
spider.

Here we have an example of specifying a Zyte Smart Proxy Manager header into a Scrapy request::
To **override** which combination of endpoint and API key is used for a given
request, set ``proxy`` in the request metadata to a URL indicating both the
target endpoint and the API key to use. For example:

scrapy.Request(
'http://example.com',
headers={
'X-Crawlera-Max-Retries': 1,
...
},
)
.. code-block:: python
Remember that you could also set which headers to use by default by all
requests with `DEFAULT_REQUEST_HEADERS <http://doc.scrapy.org/en/1.0/topics/settings.html#default-request-headers>`_
scrapy.Request(
"https://topscrape.com",
meta={
"proxy": "http://<API_KEY>@<PROXY_HOST>:8011",
...
},
)
.. note:: Zyte Smart Proxy Manager headers are removed from requests when the middleware is activated but Zyte Smart Proxy Manager
is disabled. For example, if you accidentally disable Zyte Smart Proxy Manager via ``zyte_smartproxy_enabled = False``
but keep sending ``X-Crawlera-*`` headers in your requests, those will be removed from the
request headers.
.. TODO: Check that a colon after the API key is not needed in this case.
This Middleware also adds some configurable Scrapy Settings, check :ref:`the complete list here <settings>`.
To **disable** proxying altogether for a given request, set ``dont_proxy`` to
``True`` on the request metadata:

All the rest
============
.. code-block:: python
.. toctree::
:caption: All the rest
:hidden:
scrapy.Request(
"https://topscrape.com",
meta={
"dont_proxy": True,
...
},
)
news
You can set `Zyte API proxy headers`_ or `Zyte Smart Proxy Manager headers`_ as
regular `Scrapy headers`_, e.g. using the ``headers`` parameter of ``Request``
or using the DEFAULT_REQUEST_HEADERS_ setting. For example:

.. code-block:: python
scrapy.Request(
"https://topscrape.com",
headers={
"Zyte-Geolocation": "FR",
...
},
)
.. _Zyte API proxy headers: https://docs.zyte.com/zyte-api/usage/proxy-api.html
.. _Zyte Smart Proxy Manager headers: https://docs.zyte.com/smart-proxy-manager.html#request-headers
.. _Scrapy headers: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers
.. _DEFAULT_REQUEST_HEADERS: https://doc.scrapy.org/en/latest/topics/settings.html#default-request-headers

For information about proxy-specific header processing, see :doc:`headers`.

:doc:`news`
See what has changed in recent scrapy-zyte-smartproxy versions.
See also :ref:`settings` for the complete list of settings that this downloader
middleware supports.
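
As a rough sketch of the per-request override described under Usage, the value
of ``proxy`` can be assembled from an endpoint and an API key with the standard
library. The helper name ``proxy_url_with_key`` and the ``YOUR_API_KEY``
placeholder are hypothetical, and the sketch assumes the key goes in the
user-info part of the URL with no trailing colon, which the TODO note in this
page leaves unconfirmed.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def proxy_url_with_key(endpoint, api_key):
    """Return ``endpoint`` with ``api_key`` embedded as URL user info.

    Hypothetical helper for illustration; not part of the middleware.
    """
    parts = urlsplit(endpoint)
    # Discard any credentials already present in the endpoint.
    host_and_port = parts.netloc.rpartition("@")[2]
    # Percent-encode the key so that reserved characters cannot break the URL.
    netloc = "{}@{}".format(quote(api_key, safe=""), host_and_port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Request metadata selecting the proxy API of Zyte API for a single request.
meta = {"proxy": proxy_url_with_key("http://api.zyte.com:8011", "YOUR_API_KEY")}
```

The resulting ``meta`` dictionary would then be passed as the ``meta`` argument
of ``scrapy.Request``.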
37 changes: 23 additions & 14 deletions docs/settings.rst
@@ -3,53 +3,62 @@ Settings
========

This Scrapy downloader middleware adds some settings to configure how to work
with Zyte Smart Proxy Manager.
with your Zyte proxy API.

ZYTE_SMARTPROXY_APIKEY
----------------------

Default: ``None``

Unique Zyte Smart Proxy Manager API key provided for authentication.
Default API key for your Zyte proxy API service.

Note that Zyte API and Zyte Smart Proxy Manager have different API keys.

You can :ref:`override this value on specific requests <override>`.


ZYTE_SMARTPROXY_URL
-------------------

Default: ``'http://proxy.zyte.com:8011'``

Zyte Smart Proxy Manager instance URL. It varies depending on whether you acquired a private or dedicated instance. If Zyte Smart Proxy Manager didn't provide
you with a private instance URL, you don't need to specify it.
Default endpoint for your Zyte proxy API service.

For guidelines on setting a value, see the :ref:`initial configuration
instructions <ZYTE_SMARTPROXY_URL>`.

You can :ref:`override this value on specific requests <override>`.

ZYTE_SMARTPROXY_MAXBANS
-----------------------

Default: ``400``

Number of consecutive bans from Zyte Smart Proxy Manager necessary to stop the spider.
Number of consecutive bans necessary to stop the spider.

ZYTE_SMARTPROXY_DOWNLOAD_TIMEOUT
--------------------------------

Default: ``190``

Timeout for processing Zyte Smart Proxy Manager requests. It overrides Scrapy's ``DOWNLOAD_TIMEOUT``.
Timeout for processing proxied requests. It overrides Scrapy's ``DOWNLOAD_TIMEOUT``.

ZYTE_SMARTPROXY_PRESERVE_DELAY
------------------------------

Default: ``False``

If ``False`` Sets Scrapy's ``DOWNLOAD_DELAY`` to ``0``, making the spider to crawl faster. If set to ``True``, it will
If ``False``, sets Scrapy's ``DOWNLOAD_DELAY`` to ``0``, making the spider crawl faster. If set to ``True``, it will
respect the provided ``DOWNLOAD_DELAY`` from Scrapy.

ZYTE_SMARTPROXY_DEFAULT_HEADERS
-------------------------------

Default: ``{}``

Default headers added only to Zyte Smart Proxy Manager requests. Headers defined on ``DEFAULT_REQUEST_HEADERS`` will take precedence as long as the ``ZyteSmartProxyMiddleware`` is placed after the ``DefaultHeadersMiddleware``. Headers set on the requests have precedence over the two settings.
Default headers added only to proxied requests. Headers defined on ``DEFAULT_REQUEST_HEADERS`` will take precedence as long as the ``ZyteSmartProxyMiddleware`` is placed after the ``DefaultHeadersMiddleware``. Headers set on the requests have precedence over the two settings.

* This is the default behavior, ``DefaultHeadersMiddleware`` default priority is ``400`` and we recommend ``ZyteSmartProxyMiddleware`` priority to be ``610``
* This is the default behavior: the default priority of ``DefaultHeadersMiddleware`` is ``400``, and we recommend a priority of ``610`` for ``ZyteSmartProxyMiddleware``.
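
The precedence described above can be sketched as a plain dictionary merge. The
function ``effective_headers`` is hypothetical and only models the recommended
middleware order; it also assumes that header names are spelled with the same
case in all three sources, whereas real HTTP header names are case-insensitive.

```python
def effective_headers(request_headers, default_request_headers, zyte_default_headers):
    """Merge the three header sources, lowest precedence first.

    Illustrative only; the middleware does not expose this function.
    """
    merged = dict(zyte_default_headers)     # ZYTE_SMARTPROXY_DEFAULT_HEADERS
    merged.update(default_request_headers)  # DEFAULT_REQUEST_HEADERS overrides it
    merged.update(request_headers)          # headers set on the request win
    return merged


headers = effective_headers(
    request_headers={"Zyte-Geolocation": "FR"},
    default_request_headers={"Accept-Language": "en"},
    zyte_default_headers={"Zyte-Geolocation": "US", "Accept-Language": "de"},
)
```

Here the request-level ``Zyte-Geolocation`` replaces the proxy default, and
``Accept-Language`` from ``DEFAULT_REQUEST_HEADERS`` replaces the value from
``ZYTE_SMARTPROXY_DEFAULT_HEADERS``.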

ZYTE_SMARTPROXY_BACKOFF_STEP
----------------------------
@@ -70,9 +79,9 @@ ZYTE_SMARTPROXY_FORCE_ENABLE_ON_HTTP_CODES

Default: ``[]``

List of HTTP response status codes that warrant enabling Zyte Smart Proxy Manager for the
corresponding domain.
List of HTTP response status codes that warrant enabling your Zyte proxy API
service for the corresponding domain.

When a response with one of these HTTP status codes is received after a request
that did not go through Zyte Smart Proxy Manager, the request is retried with Zyte Smart Proxy Manager, and any
new request to the same domain is also sent through Zyte Smart Proxy Manager.
When a response with one of these HTTP status codes is received after an
unproxied request, the request is retried with your Zyte proxy API service, and
any new request to the same domain is also proxied.
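
The behavior described for ``ZYTE_SMARTPROXY_FORCE_ENABLE_ON_HTTP_CODES`` can be
modeled with a small tracker. This is a sketch under assumptions, not the
middleware's code: the class ``ForceEnableTracker`` and its methods are
hypothetical, and the URL hostname stands in for "domain".

```python
from urllib.parse import urlsplit


class ForceEnableTracker:
    """Remember which domains must be proxied after a triggering status code."""

    def __init__(self, codes):
        self.codes = set(codes)       # e.g. the value of the setting
        self.enabled_domains = set()  # domains that are now always proxied

    def should_proxy(self, url):
        """Tell whether a new request to ``url`` must go through the proxy."""
        return urlsplit(url).hostname in self.enabled_domains

    def on_response(self, url, status, was_proxied):
        """Return True if the request should be retried through the proxy."""
        if was_proxied or status not in self.codes:
            return False
        self.enabled_domains.add(urlsplit(url).hostname)
        return True


tracker = ForceEnableTracker([403, 429])
retry = tracker.on_response("https://example.com/a", 403, was_proxied=False)
```

After the 403 response above, ``retry`` is true and any later request to
``example.com`` would be proxied, while other domains are unaffected.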