Support the upcoming proxy API of Zyte API (#108)
Showing 8 changed files with 512 additions and 205 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
Headers | ||
======= | ||
|
||
The Zyte proxy API services that you can use with this downloader middleware | ||
each support a different set of HTTP request and response headers that give | ||
you access to additional features. You can find more information about those | ||
headers in the documentation of each service, `Zyte API’s <zyte-api-headers_>`_ | ||
and `Zyte Smart Proxy Manager’s <spm-headers_>`_. | ||
|
||
.. _zyte-api-headers: https://docs.zyte.com/zyte-api/usage/proxy-api.html | ||
.. _spm-headers: https://docs.zyte.com/smart-proxy-manager.html#request-headers | ||
|
||
If you try to use a header for one service while using the other service, this | ||
downloader middleware will try to translate your header into the right header | ||
for the target service and, regardless of whether or not translation was done, | ||
the original header will be dropped. | ||
|
||
Also, response headers that can be translated will always be translated, | ||
without dropping the original header, so code expecting a response header from | ||
one service can work even if a different service was used. | ||
|
||
Translation is supported for the following headers: | ||
|
||
========================= =========================== | ||
Zyte API Zyte Smart Proxy Manager | ||
========================= =========================== | ||
``Zyte-Client`` ``X-Crawlera-Client`` | ||
``Zyte-Device`` ``X-Crawlera-Profile`` | ||
``Zyte-Error`` ``X-Crawlera-Error`` | ||
``Zyte-Geolocation`` ``X-Crawlera-Region`` | ||
``Zyte-JobId`` ``X-Crawlera-JobId`` | ||
``Zyte-Override-Headers`` ``X-Crawlera-Profile-Pass`` | ||
========================= =========================== | ||
|
||
Also, if a request is not being proxied and includes a header for any of these | ||
services, it will be dropped, to prevent leaking data to external websites. | ||
This downloader middleware assumes that a header prefixed with ``Zyte-`` is a | ||
Zyte API header, and that a header prefixed with ``X-Crawlera-`` is a Zyte | ||
Smart Proxy Manager header, even if they are not known headers otherwise. | ||
|
||
When dropping a header, be it as part of header translation or to avoid leaking | ||
data, a warning message with details will be logged. |
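The request-header rules described in this file can be sketched in plain Python. This is only an illustration under stated assumptions, not the middleware's actual code: the function name `process_request_headers` and the `target` values (`"zyte-api"`, `"spm"`, `None`) are hypothetical, and the sketch copies header values unchanged even though the real services may expect different values for equivalent headers.

```python
import logging

logger = logging.getLogger(__name__)

# Header pairs taken from the translation table above.
_PAIRS = [
    ("Zyte-Client", "X-Crawlera-Client"),
    ("Zyte-Device", "X-Crawlera-Profile"),
    ("Zyte-Error", "X-Crawlera-Error"),
    ("Zyte-Geolocation", "X-Crawlera-Region"),
    ("Zyte-JobId", "X-Crawlera-JobId"),
    ("Zyte-Override-Headers", "X-Crawlera-Profile-Pass"),
]
_TO_SPM = {zyte.lower(): spm for zyte, spm in _PAIRS}
_TO_ZYTE_API = {spm.lower(): zyte for zyte, spm in _PAIRS}


def process_request_headers(headers, target):
    """Return a new header dict for the given target service.

    ``target`` is a hypothetical marker: "zyte-api", "spm", or None when
    the request is not proxied at all.
    """
    result = {}
    for name, value in headers.items():
        lower = name.lower()
        # Prefix-based detection, as described above: unknown headers
        # with these prefixes are still treated as service headers.
        if lower.startswith("zyte-"):
            service = "zyte-api"
        elif lower.startswith("x-crawlera-"):
            service = "spm"
        else:
            result[name] = value  # regular header, keep as is
            continue
        if service == target:
            result[name] = value  # header matches the service in use
            continue
        # The original header is dropped whether or not it can be translated.
        logger.warning("Dropping header %r (target: %r)", name, target)
        if target is None:
            continue  # not proxied: never leak service headers
        table = _TO_SPM if target == "spm" else _TO_ZYTE_API
        translated = table.get(lower)
        if translated is not None:
            # Assumption: the value is copied unchanged.
            result.setdefault(translated, value)
    return result
```

For instance, calling `process_request_headers({"X-Crawlera-Region": "FR"}, "zyte-api")` yields a dict with a single `Zyte-Geolocation` entry, while the same call with `target=None` yields an empty dict.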
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,104 +2,138 @@ | |
scrapy-zyte-smartproxy |version| documentation | ||
============================================== | ||
|
||
scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to interact with | ||
`Zyte Smart Proxy Manager`_ (formerly Crawlera) automatically. | ||
.. toctree:: | ||
:hidden: | ||
|
||
headers | ||
settings | ||
news | ||
|
||
scrapy-zyte-smartproxy is a `Scrapy downloader middleware`_ to use one of | ||
Zyte’s proxy APIs: either the proxy API of `Zyte API`_ or `Zyte Smart Proxy | ||
Manager`_ (formerly Crawlera). | ||
|
||
.. _Scrapy downloader middleware: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html | ||
.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html | ||
.. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/ | ||
|
||
Configuration | ||
============= | ||
|
||
.. toctree:: | ||
:caption: Configuration | ||
#. Add the downloader middleware to your ``DOWNLOADER_MIDDLEWARES`` Scrapy | ||
setting: | ||
|
||
.. code-block:: python | ||
:caption: settings.py | ||
* Add the Zyte Smart Proxy Manager middleware including it into the ``DOWNLOADER_MIDDLEWARES`` in your ``settings.py`` file:: | ||
DOWNLOADER_MIDDLEWARES = { | ||
... | ||
'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610 | ||
} | ||
DOWNLOADER_MIDDLEWARES = { | ||
... | ||
'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610 | ||
} | ||
#. Enable the middleware and configure your API key, either through Scrapy | ||
settings: | ||
|
||
* Then there are two ways to enable it | ||
.. code-block:: python | ||
:caption: settings.py | ||
* Through ``settings.py``:: | ||
ZYTE_SMARTPROXY_ENABLED = True | ||
ZYTE_SMARTPROXY_APIKEY = 'apikey' | ||
ZYTE_SMARTPROXY_ENABLED = True | ||
ZYTE_SMARTPROXY_APIKEY = 'apikey' | ||
Or through spider attributes: | ||
|
||
* Through spider attributes:: | ||
.. code-block:: python | ||
class MySpider: | ||
zyte_smartproxy_enabled = True | ||
zyte_smartproxy_apikey = 'apikey' | ||
class MySpider(scrapy.Spider): | ||
zyte_smartproxy_enabled = True | ||
zyte_smartproxy_apikey = 'apikey' | ||
.. _ZYTE_SMARTPROXY_URL: | ||
|
||
* (optional) If you are not using the default Zyte Smart Proxy Manager proxy (``http://proxy.zyte.com:8011``), | ||
for example if you have a dedicated or private instance, | ||
make sure to also set ``ZYTE_SMARTPROXY_URL`` in ``settings.py``, e.g.:: | ||
#. Set the ``ZYTE_SMARTPROXY_URL`` Scrapy setting as needed: | ||
|
||
ZYTE_SMARTPROXY_URL = 'http://myinstance.zyte.com:8011' | ||
- To use the proxy API of Zyte API, set it to | ||
``http://api.zyte.com:8011``: | ||
|
||
How to use it | ||
============= | ||
.. code-block:: python | ||
:caption: settings.py | ||
.. toctree:: | ||
:caption: How to use it | ||
:hidden: | ||
ZYTE_SMARTPROXY_URL = "http://api.zyte.com:8011" | ||
settings | ||
- To use the default Zyte Smart Proxy Manager endpoint, leave it unset. | ||
|
||
:doc:`settings` | ||
All configurable Scrapy Settings added by the Middleware. | ||
- To use a custom Zyte Smart Proxy Manager endpoint, in case you have a | ||
dedicated or private instance, set it to your custom endpoint. For | ||
example: | ||
|
||
.. code-block:: python | ||
:caption: settings.py | ||
With the middleware, the usage of Zyte Smart Proxy Manager is automatic: every request goes through Zyte Smart Proxy Manager, with nothing else to worry about. | ||
If you want to *disable* Zyte Smart Proxy Manager on a specific Request, you can do so by updating `meta` with `dont_proxy=True`:: | ||
ZYTE_SMARTPROXY_URL = "http://myinstance.zyte.com:8011" | ||
scrapy.Request( | ||
'http://example.com', | ||
meta={ | ||
'dont_proxy': True, | ||
... | ||
}, | ||
) | ||
Usage | ||
===== | ||
|
||
Once the downloader middleware is properly configured, every request goes | ||
through the configured Zyte proxy API. | ||
|
||
Remember that you are now making requests to Zyte Smart Proxy Manager, and the Zyte Smart Proxy Manager service will be the one actually making the requests to the different sites. | ||
.. _override: | ||
|
||
If you need to specify special `Zyte Smart Proxy Manager headers <https://docs.zyte.com/smart-proxy-manager.html#request-headers>`_, just apply them as normal `Scrapy headers <https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers>`_. | ||
Although the plugin configuration only allows defining a single proxy API | ||
endpoint and API key, it is possible to override them for specific requests, so | ||
that you can use different combinations for different requests within the same | ||
spider. | ||
|
||
Here we have an example of specifying a Zyte Smart Proxy Manager header into a Scrapy request:: | ||
To **override** which combination of endpoint and API key is used for a given | ||
request, set ``proxy`` in the request metadata to a URL indicating both the | ||
target endpoint and the API key to use. For example: | ||
|
||
scrapy.Request( | ||
'http://example.com', | ||
headers={ | ||
'X-Crawlera-Max-Retries': 1, | ||
... | ||
}, | ||
) | ||
.. code-block:: python | ||
Remember that you could also set which headers to use by default by all | ||
requests with `DEFAULT_REQUEST_HEADERS <http://doc.scrapy.org/en/1.0/topics/settings.html#default-request-headers>`_ | ||
scrapy.Request( | ||
"https://topscrape.com", | ||
meta={ | ||
"proxy": "http://YOUR_API_KEY@api.zyte.com:8011", | ||
... | ||
}, | ||
) | ||
.. note:: Zyte Smart Proxy Manager headers are removed from requests when the middleware is activated but Zyte Smart Proxy Manager | ||
is disabled. For example, if you accidentally disable Zyte Smart Proxy Manager via ``zyte_smartproxy_enabled = False`` | ||
but keep sending ``X-Crawlera-*`` headers in your requests, those will be removed from the | ||
request headers. | ||
.. TODO: Check that a colon after the API key is not needed in this case. | ||
This Middleware also adds some configurable Scrapy Settings, check :ref:`the complete list here <settings>`. | ||
To **disable** proxying altogether for a given request, set ``dont_proxy`` to | ||
``True`` on the request metadata: | ||
|
||
All the rest | ||
============ | ||
.. code-block:: python | ||
.. toctree:: | ||
:caption: All the rest | ||
:hidden: | ||
scrapy.Request( | ||
"https://topscrape.com", | ||
meta={ | ||
"dont_proxy": True, | ||
... | ||
}, | ||
) | ||
news | ||
You can set `Zyte API proxy headers`_ or `Zyte Smart Proxy Manager headers`_ as | ||
regular `Scrapy headers`_, e.g. using the ``headers`` parameter of ``Request`` | ||
or using the DEFAULT_REQUEST_HEADERS_ setting. For example: | ||
|
||
.. code-block:: python | ||
scrapy.Request( | ||
"https://topscrape.com", | ||
headers={ | ||
"Zyte-Geolocation": "FR", | ||
... | ||
}, | ||
) | ||
.. _Zyte API proxy headers: https://docs.zyte.com/zyte-api/usage/proxy-api.html | ||
.. _Zyte Smart Proxy Manager headers: https://docs.zyte.com/smart-proxy-manager.html#request-headers | ||
.. _Scrapy headers: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.headers | ||
.. _DEFAULT_REQUEST_HEADERS: https://doc.scrapy.org/en/latest/topics/settings.html#default-request-headers | ||
|
||
For information about proxy-specific header processing, see :doc:`headers`. | ||
|
||
:doc:`news` | ||
See what has changed in recent scrapy-zyte-smartproxy versions. | ||
See also :ref:`settings` for the complete list of settings that this downloader | ||
middleware supports. |
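The per-request override described in the Usage section expects a proxy URL that carries both the endpoint and the API key. A minimal sketch of building such a URL with the standard library follows. The helper names `proxy_url_with_key` and `proxy_meta` are hypothetical, and the sketch assumes the API key goes in the user part of the URL without a trailing colon, which the TODO in the diff above flags as still unverified.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def proxy_url_with_key(endpoint, api_key):
    """Embed an API key as the user part of a proxy endpoint URL.

    Hypothetical helper, not part of scrapy-zyte-smartproxy.
    """
    parts = urlsplit(endpoint)
    # Percent-encode the key so special characters cannot break the URL.
    netloc = f"{quote(api_key, safe='')}@{parts.hostname}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


def proxy_meta(endpoint, api_key):
    """Build the request metadata for a per-request override."""
    return {"proxy": proxy_url_with_key(endpoint, api_key)}
```

The resulting dict can be passed as the `meta` argument of `scrapy.Request`, for example `scrapy.Request(url, meta=proxy_meta("http://api.zyte.com:8011", "YOUR_API_KEY"))`.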