<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Will's thoughts</title>
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:type" content="website">
<meta property="og:title" content="Will's thoughts">
<meta property="og:url" content="http://hantconny.github.io/index.html">
<meta property="og:site_name" content="Will's thoughts">
<meta property="og:locale" content="zh_CN">
<meta property="article:author" content="从前有个包子他睡着了">
<meta name="twitter:card" content="summary">
<link rel="alternate" href="/atom.xml" title="Will's thoughts" type="application/atom+xml">
<link rel="shortcut icon" href="/favicon.png">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/index.min.css">
<link rel="stylesheet" href="/css/style.css">
<link rel="stylesheet" href="/fancybox/jquery.fancybox.min.css">
<meta name="generator" content="Hexo 6.3.0"></head>
<body>
<div id="container">
<div id="wrap">
<header id="header">
<div id="banner"></div>
<div id="header-outer" class="outer">
<div id="header-title" class="inner">
<h1 id="logo-wrap">
<a href="/" id="logo">Will's thoughts</a>
</h1>
</div>
<div id="header-inner" class="inner">
<nav id="main-nav">
<a id="main-nav-toggle" class="nav-icon"><span class="fa fa-bars"></span></a>
<a class="main-nav-link" href="/">Home</a>
<a class="main-nav-link" href="/archives">Archives</a>
</nav>
<nav id="sub-nav">
<a class="nav-icon" href="/atom.xml" title="RSS 订阅"><span class="fa fa-rss"></span></a>
<a class="nav-icon nav-search-btn" title="搜索"><span class="fa fa-search"></span></a>
</nav>
<div id="search-form-wrap">
<form action="//google.com/search" method="get" accept-charset="UTF-8" class="search-form"><input type="search" name="q" class="search-form-input" placeholder="搜索"><button type="submit" class="search-form-submit"></button><input type="hidden" name="sitesearch" value="http://hantconny.github.io"></form>
</div>
</div>
</div>
</header>
<div class="outer">
<section id="main">
<article id="post-2023-10-19-scrapy-stuff" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
<div class="article-meta">
<a href="/2023/10/19/2023-10-19-scrapy-stuff/" class="article-date">
<time class="dt-published" datetime="2023-10-19T07:08:09.000Z" itemprop="datePublished">2023-10-19</time>
</a>
<div class="article-category">
<a class="article-category-link" href="/categories/onenote/">onenote</a>
</div>
</div>
<div class="article-inner">
<header class="article-header">
<h1 itemprop="name">
<a class="p-name article-title" href="/2023/10/19/2023-10-19-scrapy-stuff/">Scrapy</a>
</h1>
</header>
<div class="e-content article-entry" itemprop="articleBody">
<h3 id="安装"><a href="#安装" class="headerlink" title="安装"></a>安装</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">$ </span><span class="language-bash">pip install scrapy -i https://mirrors.ustc.edu.cn/pypi/web/simple</span></span><br></pre></td></tr></table></figure>
<p>When running pip install, you can temporarily switch to a domestic mirror by passing its address with the <code>-i</code> option.</p>
<p>Commonly used domestic mirrors:</p>
<p>USTC: <a target="_blank" rel="noopener" href="https://mirrors.ustc.edu.cn/pypi/web/simple">https://mirrors.ustc.edu.cn/pypi/web/simple</a></p>
<p>Tsinghua: <a target="_blank" rel="noopener" href="https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple/">https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple/</a></p>
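<p>If you would rather not pass <code>-i</code> on every install, pip can also persist the mirror. A minimal sketch using pip's own <code>config</code> subcommand (same USTC address as above):</p>
<figure class="highlight shell"><pre><code>$ pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple</code></pre></figure>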
<h3 id="验证"><a href="#验证" class="headerlink" title="验证"></a>验证</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line">scrapy -v</span><br><span class="line">Scrapy 2.11.0 - no active project</span><br><span class="line"></span><br><span class="line">Usage:</span><br><span class="line"> scrapy <command> [options] [args]</span><br><span class="line"></span><br><span class="line">Available commands:</span><br><span class="line"> bench Run quick benchmark test</span><br><span class="line"> fetch Fetch a URL using the Scrapy downloader</span><br><span class="line"> genspider Generate new spider using pre-defined templates</span><br><span class="line"> runspider Run a self-contained spider (without creating a project)</span><br><span class="line"> settings Get settings values</span><br><span class="line"> shell Interactive scraping console</span><br><span class="line"> startproject Create new project</span><br><span class="line"> version Print Scrapy version</span><br><span class="line"> view Open URL in browser, as seen by Scrapy</span><br><span class="line"></span><br><span class="line"> [ more ] More commands available when run from project directory</span><br><span class="line"></span><br><span class="line">Use "scrapy <command> -h" to see more info about a command</span><br></pre></td></tr></table></figure>
<p>As the output notes, more commands become available when run from inside a Scrapy project.</p>
<h3 id="创建项目"><a href="#创建项目" class="headerlink" title="创建项目"></a>创建项目</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">$ </span><span class="language-bash">scrapy startproject fz_spider</span></span><br></pre></td></tr></table></figure>
<p>The project structure:</p>
<figure class="highlight plaintext"><pre><code>fz_spider/
├── scrapy.cfg
└── fz_spider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py</code></pre></figure>
<p>Re-running <code>scrapy -h</code> inside the fz_spider directory shows more available commands:</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">$ </span><span class="language-bash">scrapy -h</span></span><br><span class="line">Scrapy 2.11.0 - active project: fz_spider</span><br><span class="line"></span><br><span class="line">Usage:</span><br><span class="line"> scrapy <command> [options] [args]</span><br><span class="line"></span><br><span class="line">Available commands:</span><br><span class="line"> bench Run quick benchmark test</span><br><span class="line"> check Check spider contracts</span><br><span class="line"> crawl Run a spider</span><br><span class="line"> edit Edit spider</span><br><span class="line"> fetch Fetch a URL using the Scrapy downloader</span><br><span class="line"> genspider Generate new spider using pre-defined templates</span><br><span class="line"> list List available spiders</span><br><span class="line"> parse Parse URL (using its spider) and print the results</span><br><span class="line"> runspider Run a self-contained spider (without creating a project)</span><br><span class="line"> settings Get settings values</span><br><span class="line"> shell Interactive scraping console</span><br><span class="line"> startproject Create new project</span><br><span class="line"> version Print Scrapy version</span><br><span class="line"> view Open URL in browser, as seen by Scrapy</span><br><span class="line"></span><br><span class="line">Use "scrapy <command> -h" to see more info about a command</span><br></pre></td></tr></table></figure>
<p>The command to focus on here is crawl, which launches an already-written spider from the command line.</p>
<h3 id="创建爬虫"><a href="#创建爬虫" class="headerlink" title="创建爬虫"></a>创建爬虫</h3><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">$ </span><span class="language-bash">scrapy genspider fzapp <span class="string">"abc.com"</span></span></span><br></pre></td></tr></table></figure>
<p>This creates a spider named fzapp whose target domain is abc.com, and generates a file named fzapp.py under the spiders directory:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> scrapy</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">FzappSpider</span>(scrapy.Spider):</span><br><span class="line"> name = <span class="string">"fzapp"</span></span><br><span class="line"> allowed_domains = [<span class="string">"abc.com"</span>]</span><br><span class="line"> start_urls = [<span class="string">"https://abc.com"</span>]</span><br><span class="line"></span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">parse</span>(<span class="params">self, response</span>):</span><br><span class="line"> <span class="keyword">pass</span></span><br></pre></td></tr></table></figure>
<p>Python does not require the file name and the class name to match, so either may be changed: the file could be renamed to fz_app.py, and the class to FzAppSpider.</p>
<p>There is nothing special about this file, so you can also skip the genspider command entirely, create the spider file by hand, subclass scrapy.Spider, and implement the parse method, as in the sketch below.</p>
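<p>A minimal hand-written version, with the file renamed to fz_app.py and the class to FzAppSpider as suggested above:</p>
<figure class="highlight python"><pre><code># fz_app.py, created by hand instead of via genspider
import scrapy


class FzAppSpider(scrapy.Spider):
    # "name" is what scrapy crawl looks up; it is independent of the file name
    name = "fz_app"
    allowed_domains = ["abc.com"]
    start_urls = ["https://abc.com"]

    def parse(self, response):
        # extraction logic for each response goes here
        pass</code></pre></figure>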
<p>The generated spider has three class attributes: name, allowed_domains, and start_urls. Since we are not crawling a fixed set of pages, only name matters for our purposes.</p>
<p>name is the spider's name. It may be changed, but the name used to run the spider must change with it: if <code>name = "fzapp"</code> becomes <code>name = "fz_app"</code>, the spider must then be run with <code>scrapy crawl fz_app</code>.</p>
<h3 id="settings-py"><a href="#settings-py" class="headerlink" title="settings.py"></a>settings.py</h3><p>可以在该文件中设置日志:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">LOG_LEVEL = <span class="string">"DEBUG"</span></span><br><span class="line">LOG_FILE = <span class="string">"./spider.log"</span></span><br></pre></td></tr></table></figure>
<p>It also controls whether the crawler obeys the target site's robots.txt, which declares which resources crawlers may fetch and which they may not. The convention is obeyed by default, but it can of course be set to False:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ROBOTSTXT_OBEY = <span class="literal">False</span></span><br></pre></td></tr></table></figure>
<p>Concurrency control; the default is 16:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">CONCURRENT_REQUESTS = <span class="number">32</span></span><br></pre></td></tr></table></figure>
<p>Whether to enable the Telnet console (it exposes a live console into the running crawler); I had no use for it, so I disabled it:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">TELNETCONSOLE_ENABLED = <span class="literal">False</span></span><br></pre></td></tr></table></figure>
<p>The default request headers, DEFAULT_REQUEST_HEADERS, are also set in this file.</p>
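<p>A sketch of what that might look like; the header values here are purely illustrative:</p>
<figure class="highlight python"><pre><code># settings.py: attached to every request unless overridden per-request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}</code></pre></figure>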
<p>Retry control. Retrying is enabled by default (RETRY_TIMES defaults to 2). To disable it, set <code>RETRY_ENABLED = False</code>. Retries can also be restricted to specific HTTP response codes:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">RETRY_ENABLED = <span class="literal">False</span></span><br><span class="line"><span class="comment"># RETRY_TIMES = 3</span></span><br><span class="line"><span class="comment"># RETRY_HTTP_CODES = [500, 502, 503, 504, 408]</span></span><br></pre></td></tr></table></figure>
<h3 id="在PyCharm中调试"><a href="#在PyCharm中调试" class="headerlink" title="在PyCharm中调试"></a>在PyCharm中调试</h3><p>在Run/Debug Configurations中新建一个Python的配置,将Script path设置为scrapy提供的cmdline.py,可以用everything直接搜索并复制路径。一般是<code>C:\Users\Administrator\AppData\Local\Programs\Python\Python38\Lib\site-packages\scrapy\cmdline.py</code>。将Parameters设置为crawl fz_app。</p>
<h3 id="start-requests方法"><a href="#start-requests方法" class="headerlink" title="start_requests方法"></a>start_requests方法</h3><p>我们要爬取的url并不是一个固定的url,而是包含在一个文本文件中的url列表,因此还需要实现start_requests方法。</p>
<p>该方法必须返回一个可迭代对象。该对象包含了spider用于爬取的Request。当spider启动爬取且未指定URL时,该方法被调用。 </p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">start_requests</span>(<span class="params">self</span>):</span><br><span class="line"> meta = {</span><br><span class="line"> <span class="comment"># 'dont_redirect': True,</span></span><br><span class="line"> <span class="comment">## handle_httpstatus_all:True会处理所有的http status,默认只会处理200-300之间的正确响应码</span></span><br><span class="line"> <span class="string">'handle_httpstatus_all'</span>: <span class="literal">True</span></span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'other_202309271402'</span>, encoding=<span class="string">'utf-8'</span>) <span class="keyword">as</span> input_data:</span><br><span class="line"> urls = input_data.readlines()</span><br><span class="line"> <span class="keyword">for</span> iurl <span class="keyword">in</span> urls:</span><br><span class="line"> <span class="keyword">yield</span> Request(url=<span class="string">'http://{url}'</span>.<span class="built_in">format</span>(url=iurl), callback=self.parse, meta=meta)</span><br></pre></td></tr></table></figure>
<p>The Request object takes a few keyword arguments worth noting:</p>
<p>The first is url, which is obviously the URL to crawl.</p>
<p>The second is callback, which designates the parse method as the callback.</p>
<p>The last is meta, which controls details of the crawl, such as whether to follow 302 redirects and whether to handle every HTTP status code. Taking the latter as an example: by default Scrapy only handles successful responses with 2xx status codes, and 4xx and 5xx responses are dropped without ever reaching parse. If we want to record the 4xx and 5xx URLs and analyse them afterwards, the corresponding entry must be added to meta (handle_httpstatus_all set to True).</p>
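<p>With that flag set, the callback can separate the failures out itself. A sketch of a parse method along those lines; the log file name is made up:</p>
<figure class="highlight python"><pre><code>def parse(self, response):
    # with handle_httpstatus_all=True, 4xx/5xx responses also arrive here
    if not (200 &lt;= response.status &lt; 300):
        with open('bad_status_urls.txt', encoding='utf-8', mode='a') as bad:
            bad.write('{}|{}\n'.format(response.url, response.status))
        return
    # normal extraction logic for successful responses goes here</code></pre></figure>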
<h3 id="如何处理异常的请求"><a href="#如何处理异常的请求" class="headerlink" title="如何处理异常的请求"></a>如何处理异常的请求</h3><p>即使在start_requests方法中指定了处理所有响应状态,也不能保证不遗漏。当网站无法响应时,会抛出异常,这时候就不会进入spider,而需要额外的中间件进行处理。</p>
<p>可以定义一个ExceptionMiddleware,并在process_exception方法中对异常的站点进行记录:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">FzRecordExceptionMiddleware</span>:</span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">process_request</span>(<span class="params">self, request, spider</span>):</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">None</span></span><br><span class="line"></span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">process_response</span>(<span class="params">self, request, response, spider</span>):</span><br><span class="line"> <span class="keyword">return</span> response</span><br><span class="line"></span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">process_exception</span>(<span class="params">self, request, exception, spider</span>):</span><br><span class="line"> cfg = ConfigParser()</span><br><span class="line"> cfg.read(<span class="string">'config.ini'</span>)</span><br><span class="line"> _storage_root = cfg.get(<span class="string">'storage'</span>, <span class="string">'root'</span>)</span><br><span class="line"> _storage_ex = cfg.get(<span class="string">'storage'</span>, <span class="string">'exception'</span>)</span><br><span class="line"> <span class="keyword">if</span> <span class="keyword">not</span> os.path.exists(_storage_ex):</span><br><span class="line"> os.makedirs(_storage_ex)</span><br><span class="line"></span><br><span class="line"> ex_summary_file = os.path.join(_storage_ex, <span class="string">'exception_summary'</span>)</span><br><span class="line"> <span class="keyword">with</span> <span class="built_in">open</span>(ex_summary_file, encoding=<span class="string">'utf-8'</span>, mode=<span class="string">'a'</span>) <span class="keyword">as</span> ex_response:</span><br><span class="line"> ex_response.write(<span class="string">'|'</span>.join([request.url, exception.MESSAGE, <span class="string">'\n'</span>]))</span><br><span class="line"> </span><br><span class="line"> <span class="keyword">return</span> <span class="literal">None</span></span><br></pre></td></tr></table></figure>
<p>For a custom middleware to take effect, its position also has to be declared in settings.py:</p>
<figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">DOWNLOADER_MIDDLEWARES = {</span><br><span class="line"> # "fz_spider.middlewares.FzSpiderDownloaderMiddleware": 543,</span><br><span class="line"> "fz_spider.middlewares.FzRecordExceptionMiddleware": 543,</span><br><span class="line">}</span><br></pre></td></tr></table></figure>
<h3 id="如何定义代理中间件"><a href="#如何定义代理中间件" class="headerlink" title="如何定义代理中间件"></a>如何定义代理中间件</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">DeepindarkHttpProxyMiddleware</span>:</span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, proxy</span>):</span><br><span class="line"> self._proxy = proxy</span><br><span class="line"></span><br><span class="line"><span class="meta"> @classmethod</span></span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">from_crawler</span>(<span class="params">cls, crawler</span>):</span><br><span class="line"> <span class="keyword">return</span> cls(proxy=crawler.settings.get(<span class="string">'PROXIES'</span>))</span><br><span class="line"></span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">process_request</span>(<span class="params">self, request, spider</span>):</span><br><span class="line"> request.meta[<span class="string">'proxy'</span>] = self._proxy</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">None</span></span><br></pre></td></tr></table></figure>
<p>Configure it in settings.py:</p>
<figure class="highlight python"><pre><code>DOWNLOADER_MIDDLEWARES = {
    # "deepindark.middlewares.DeepindarkDownloaderMiddleware": 543,
    # add a proxy middleware that sets an HTTP proxy on each request
    "deepindark.middlewares.DeepindarkHttpProxyMiddleware": 543,
    # "deepindark.middlewares.DeepindarkUserAgentMiddleware": 544,
    "deepindark.middlewares.BypassCloudflare": 400
}</code></pre></figure>
<p>Scrapy differs from requests here: requests supports proxying over the SOCKS5 protocol natively, whereas Scrapy does not yet.</p>
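<p>For comparison, a sketch of SOCKS5 proxying in requests. It needs the socks extra (<code>pip install requests[socks]</code>), and the proxy address below is a placeholder:</p>
<figure class="highlight python"><pre><code>import requests

# socks5h:// also resolves DNS through the proxy, which matters for .onion hosts
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}
rsp = requests.get('http://example.com', proxies=proxies, timeout=30)
print(rsp.status_code)</code></pre></figure>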
<h3 id="如何从settings-py中获取值"><a href="#如何从settings-py中获取值" class="headerlink" title="如何从settings.py中获取值"></a>如何从settings.py中获取值</h3><p>可以从Crawler对象和Spider对象中获取settings的值。在中间件中,process_request,process_response方法都会提供Spider对象作为参数,此时就可以从Spider对象中获取settings.py中的配置项:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">spider.settings.get(<span class="string">'PROXIES'</span>)</span><br><span class="line">spider.settings[<span class="string">'PROXIES'</span>]</span><br></pre></td></tr></table></figure>
<p>If instead a classmethod such as from_crawler is involved, the Crawler object is provided as a parameter:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">@classmethod</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">from_crawler</span>(<span class="params">cls, crawler</span>):</span><br><span class="line"> <span class="keyword">return</span> cls(proxy=crawler.settings.get(<span class="string">'PROXIES'</span>))</span><br></pre></td></tr></table></figure>
<p>and the settings.py entries can then be read from the Crawler object:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">crawler.settings.get(<span class="string">'PROXIES'</span>)</span><br><span class="line">crawler.settings[<span class="string">'PROXIES'</span>]</span><br></pre></td></tr></table></figure>
<h3 id="如何绕过Cloudflare的防封策略"><a href="#如何绕过Cloudflare的防封策略" class="headerlink" title="如何绕过Cloudflare的防封策略"></a>如何绕过Cloudflare的防封策略</h3><p>需要使用cloudscraper</p>
<figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta prompt_">$ </span><span class="language-bash">pip install cloudscraper -i https://mirrors.ustc.edu.cn/pypi/web/simple</span></span><br></pre></td></tr></table></figure>
<p>Then define a middleware in the same way:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">BypassCloudflare</span>:</span><br><span class="line"> <span class="keyword">def</span> <span class="title function_">process_response</span>(<span class="params">self, request, response, spider</span>):</span><br><span class="line"> <span class="keyword">if</span> response.status == <span class="number">403</span>:</span><br><span class="line"> <span class="keyword">if</span> spider.name == <span class="string">'onion666'</span>:</span><br><span class="line"> url = request.url</span><br><span class="line"> rsp = spider.browser.get(url, proxies={<span class="string">'http'</span>: spider.settings[<span class="string">'PROXIES'</span>],</span><br><span class="line"> <span class="string">'https'</span>: spider.settings[<span class="string">'PROXIES'</span>]}, headers={<span class="string">'referer'</span>: url})</span><br><span class="line"> <span class="keyword">return</span> HtmlResponse(url=url, body=rsp.text, encoding=<span class="string">"utf-8"</span>, request=request)</span><br><span class="line"> <span class="keyword">return</span> response</span><br></pre></td></tr></table></figure>
<p>Cloudflare answers crawlers with a plain 403, hence the status-code check. As more sites are added to the crawl, the spider.name check will need to grow accordingly.</p>
<p><code>spider.browser</code> is a variable defined on the corresponding spider; it is a CloudScraper instance.</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Onion666Spider</span>(scrapy.Spider):</span><br><span class="line"> <span class="string">"""</span></span><br><span class="line"><span class="string"> 爬取onion666列出的所有站点</span></span><br><span class="line"><span class="string"> """</span></span><br><span class="line"> name = <span class="string">"onion666"</span></span><br><span class="line"> start_urls = [<span class="string">'http://666666666tjjjeweu5iikuj7hkpke5phvdylcless7g4dn6vma2xxcad.onion/'</span>]</span><br><span class="line"> browser = cloudscraper.create_scraper()</span><br></pre></td></tr></table></figure>
<h3 id="request-meta和request-headers"><a href="#request-meta和request-headers" class="headerlink" title="request.meta和request.headers"></a>request.meta和request.headers</h3><p>meta中保存的是和Scrapy相关的内容,如:download_timeout是默认携带的,控制下载超时时间,proxy设置代理,dont_redirect控制是否允许重定向,handle_httpstatus_all控制是否处理所有的http状态码。</p>
<p>而headers保存的是和HTTP请求相关的内容,如:User-Agent设置浏览器信息从而绕过一些封堵检测,Content-Type设置允许接受什么类型的响应。</p>
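<p>Putting the two side by side in a single request; the spider and all values below are purely illustrative:</p>
<figure class="highlight python"><pre><code>import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        yield scrapy.Request(
            url='http://abc.com',
            # meta: instructions for Scrapy itself
            meta={'download_timeout': 30,
                  'proxy': 'http://127.0.0.1:8118',
                  'dont_redirect': True,
                  'handle_httpstatus_all': True},
            # headers: what the remote HTTP server actually sees
            headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
                     'Accept': 'text/html'},
            callback=self.parse,
        )

    def parse(self, response):
        pass</code></pre></figure>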
</div>
<footer class="article-footer">
<a data-url="http://hantconny.github.io/2023/10/19/2023-10-19-scrapy-stuff/" data-id="clnwufron00003ccm0zx8ewxw" data-title="Scrapy" class="article-share-link"><span class="fa fa-share">Share</span></a>
<ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/Scrapy/" rel="tag">Scrapy</a></li></ul>
</footer>
</div>
</article>
<nav id="page-nav">
<span class="page-number current">1</span><a class="page-number" href="/page/2/">2</a><a class="page-number" href="/page/3/">3</a><span class="space">…</span><a class="page-number" href="/page/18/">18</a><a class="extend next" rel="next" href="/page/2/">下一页 »</a>
</nav>
</section>
<aside id="sidebar">
<div class="widget-wrap">
<h3 class="widget-title">分类</h3>
<div class="widget">
<ul class="category-list"><li class="category-list-item"><a class="category-list-link" href="/categories/onenote/">onenote</a></li><li class="category-list-item"><a class="category-list-link" href="/categories/%E6%9D%82%E5%BF%B5/">杂念</a></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">标签</h3>
<div class="widget">
<ul class="tag-list" itemprop="keywords"><li class="tag-list-item"><a class="tag-list-link" href="/tags/ASN-1/" rel="tag">ASN.1</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Apache/" rel="tag">Apache</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Bootstrap/" rel="tag">Bootstrap</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/CMD/" rel="tag">CMD</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/DISQUS/" rel="tag">DISQUS</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Downton-Abbey/" rel="tag">Downton Abbey</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Grey-s-Anatomy/" rel="tag">Grey's Anatomy</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Jekyll/" rel="tag">Jekyll</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Liquid/" rel="tag">Liquid</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Miss-Right/" rel="tag">Miss Right?</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Oracle/" rel="tag">Oracle</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/PHP/" rel="tag">PHP</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/PHPWIND/" rel="tag">PHPWIND</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/PLSQL-Developer/" rel="tag">PLSQL Developer</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Photographic/" rel="tag">Photographic</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Scrapy/" rel="tag">Scrapy</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Tapir/" rel="tag">Tapir</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/WAMP/" rel="tag">WAMP</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/Wedding/" rel="tag">Wedding</a></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/%E6%9D%82%E5%BF%B5/" rel="tag">杂念</a></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">标签云</h3>
<div class="widget tagcloud">
<a href="/tags/ASN-1/" style="font-size: 10px;">ASN.1</a> <a href="/tags/Apache/" style="font-size: 10px;">Apache</a> <a href="/tags/Bootstrap/" style="font-size: 10px;">Bootstrap</a> <a href="/tags/CMD/" style="font-size: 10px;">CMD</a> <a href="/tags/DISQUS/" style="font-size: 10px;">DISQUS</a> <a href="/tags/Downton-Abbey/" style="font-size: 10px;">Downton Abbey</a> <a href="/tags/Grey-s-Anatomy/" style="font-size: 10px;">Grey's Anatomy</a> <a href="/tags/Jekyll/" style="font-size: 20px;">Jekyll</a> <a href="/tags/Liquid/" style="font-size: 10px;">Liquid</a> <a href="/tags/Miss-Right/" style="font-size: 10px;">Miss Right?</a> <a href="/tags/Oracle/" style="font-size: 10px;">Oracle</a> <a href="/tags/PHP/" style="font-size: 15px;">PHP</a> <a href="/tags/PHPWIND/" style="font-size: 10px;">PHPWIND</a> <a href="/tags/PLSQL-Developer/" style="font-size: 10px;">PLSQL Developer</a> <a href="/tags/Photographic/" style="font-size: 10px;">Photographic</a> <a href="/tags/Scrapy/" style="font-size: 10px;">Scrapy</a> <a href="/tags/Tapir/" style="font-size: 10px;">Tapir</a> <a href="/tags/WAMP/" style="font-size: 15px;">WAMP</a> <a href="/tags/Wedding/" style="font-size: 15px;">Wedding</a> <a href="/tags/%E6%9D%82%E5%BF%B5/" style="font-size: 20px;">杂念</a>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">归档</h3>
<div class="widget">
<ul class="archive-list"><li class="archive-list-item"><a class="archive-list-link" href="/archives/2023/10/">十月 2023</a></li><li class="archive-list-item"><a class="archive-list-link" href="/archives/2014/03/">三月 2014</a></li><li class="archive-list-item"><a class="archive-list-link" href="/archives/2013/09/">九月 2013</a></li><li class="archive-list-item"><a class="archive-list-link" href="/archives/2013/08/">八月 2013</a></li><li class="archive-list-item"><a class="archive-list-link" href="/archives/2013/07/">七月 2013</a></li><li class="archive-list-item"><a class="archive-list-link" href="/archives/2013/06/">六月 2013</a></li><li class="archive-list-item"><a class="archive-list-link" href="/archives/2013/05/">五月 2013</a></li><li class="archive-list-item"><a class="archive-list-link" href="/archives/2008/06/">六月 2008</a></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">最新文章</h3>
<div class="widget">
<ul>
<li>
<a href="/2023/10/19/2023-10-19-scrapy-stuff/">Scrapy</a>
</li>
<li>
<a href="/2023/10/19/2023-10-19-parsing-asn-dot-one-with-java/">使用Java解析ASN.1</a>
</li>
<li>
<a href="/2014/03/29/2014-03-29-plsqldeveloper-to-connect-to-remote-oracle/">PLSQL Developer安装及配置</a>
</li>
<li>
<a href="/2013/09/20/2013-09-20-wedding-blessing/">Wedding Blessing</a>
</li>
<li>
<a href="/2013/08/13/2013-08-13-new-job/">新工作</a>
</li>
</ul>
</div>
</div>
</aside>
</div>
<footer id="footer">
<div class="outer">
<div id="footer-info" class="inner">
© 2023 从前有个包子他睡着了<br>
Powered by <a href="https://hexo.io/" target="_blank">Hexo</a>
</div>
</div>
</footer>
</div>
<nav id="mobile-nav">
<a href="/" class="mobile-nav-link">Home</a>
<a href="/archives" class="mobile-nav-link">Archives</a>
</nav>
<script src="/js/jquery-3.6.4.min.js"></script>
<script src="/fancybox/jquery.fancybox.min.js"></script>
<script src="/js/script.js"></script>
</div>
</body>
</html>