<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta http-equiv="X-UA-Compatible" content="ie=edge" />
<link rel="icon" sizes="16x16 32x32" href="https://www.deepcrawl.com/wp-content/uploads/2015/03/DC-1.png">
<title>The Top 10 Things Developers Need to Know About SEO</title>
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" rel="stylesheet"
integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous" />
<meta name="title" content="The Top 10 Things Developers Need to Know About SEO">
<meta name="description"
content="A Simple API containing a curated set of metrics about the health of a webpage from the technical SEO point of view. It should act as a starting point for an engineer who likes to play and learn more about extracting insights from web pages for the purposes of SEO or testing.">
<!-- Open Graph / Facebook -->
<meta property="og:type" content="website">
<meta property="og:url" content="https://deepcrawl.github.io/top10-seo-list-for-developers/">
<meta property="og:title" content="The Top 10 Things Developers Need to Know About SEO">
<meta property="og:description"
content="A Simple API containing a curated set of metrics about the health of a webpage from the technical SEO point of view. It should act as a starting point for an engineer who likes to play and learn more about extracting insights from web pages for the purposes of SEO or testing.">
<meta property="og:image"
content="https://repository-images.githubusercontent.com/211297643/b832a600-f349-11e9-90a7-028e920736da">
<!-- Twitter -->
<meta property="twitter:card" content="summary_large_image">
<meta property="twitter:url" content="https://deepcrawl.github.io/top10-seo-list-for-developers/">
<meta property="twitter:title" content="The Top 10 Things Developers Need to Know About SEO">
<meta property="twitter:description"
content="A Simple API containing a curated set of metrics about the health of a webpage from the technical SEO point of view. It should act as a starting point for an engineer who likes to play and learn more about extracting insights from web pages for the purposes of SEO or testing.">
<meta property="twitter:image"
content="https://repository-images.githubusercontent.com/211297643/b832a600-f349-11e9-90a7-028e920736da">
<style>
.jumbotron {
background-color: #24333e;
color: white;
border-radius: 0%;
}
.jumbotron h1 {
font-weight: 100;
}
.jumbotron .brace {
position: fixed;
font-size: 52rem;
top: -32rem;
transform: rotate(20deg);
opacity: .05;
left: 5rem;
pointer-events: none;
}
body {
background: #f4f4f4;
font-size: 1.1rem;
}
.gist-file {
border-color: #f8f8ff !important;
padding: 1em !important;
background-color: #f8f8ff;
}
.prettyprint {
padding: 1em !important;
}
.btn-primary {
background: #7eac4a;
border-color: #7eac4a;
}
.btn-primary:hover,
.btn-primary:focus,
.btn-primary.focus,
.btn-primary.active,
.btn-primary:active,
.open>.dropdown-toggle.btn-primary {
background: #679f2d !important;
border-color: #679f2d !important;
box-shadow: none;
}
</style>
</head>
<body>
<div class="jumbotron">
<span class="brace">{}</span>
<div class="container">
<h1>The Top 10 Things Developers Need to Know About SEO</h1>
<p class="lead">
A simple API containing a curated set of metrics about the health of a webpage from a technical SEO
point of view. Hopefully this API will act as a starting point for any engineer who likes to play and
learn more about extracting insights from web pages for the purposes of SEO or testing.
</p>
<a href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers"
class="btn btn-primary btn-lg">GitHub Repository</a>
</div>
</div>
<div class="container">
<div class="col-12">
<h2>Express Server and Router</h2>
<p>Create an Express server instance, mount our <code>apiRouter</code> on the <code>/api</code> path, and
listen on a port.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/index.ts?footer=no"></script>
<p>
Create an instance of Express Router and add a <code>get</code> route matcher on path
<code>/page-health</code> to use our main handler function. This now makes your API route
<code>/api/page-health</code>.
</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/api/routes.ts?footer=no"></script>
<p>
Now let's create our handler function. It needs to be able to read the query string parameter for
<code>?url=</code> and validate it. If it checks out we can pass it to our render service to process it.
Otherwise we send an error response to the user to correct the faulty parameter.
</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/api/page-health.route.ts?footer=no"></script>
<p>
Here is how we validate the URL. The <code>URL</code> constructor throws a <code>TypeError</code> when the
passed string cannot be parsed into a URL instance, so wrapping it in a <code>try</code>/<code>catch</code>
is an easy way to validate a string.
</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/utils/url.utils.ts?footer=no"></script>
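<p>A minimal sketch of this kind of check, in case it helps to see it in isolation. The function name
<code>isValidHttpUrl</code> and the extra restriction to <code>http:</code>/<code>https:</code> are assumptions
made for this example and are not necessarily what the repository's utility does.</p>

```typescript
// Hypothetical helper: validates a string by attempting to construct a URL from it.
function isValidHttpUrl(candidate: string): boolean {
  try {
    const parsed = new URL(candidate);
    // Assumption for this sketch: only web URLs make sense for a page-health check.
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    // The URL constructor throws when the string cannot be parsed.
    return false;
  }
}

console.log(isValidHttpUrl("https://example.com/page?x=1")); // true
console.log(isValidHttpUrl("not a url")); // false
```

<p>Returning a boolean keeps the route handler simple: it can reply with an error response as soon as the
check fails.</p>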
<p>Now that the general scaffolding for the API side is done, let's build some software!</p>
<h2>Rendering Pages</h2>
<p>Our core service for the app is our rendering service, with a few simple duties:</p>
<ul>
<li>Take the URL passed to it from the API router</li>
<li>Create a Puppeteer browser</li>
<li>Create a page and a response object and pass them to all our metrics</li>
<li>Return the results from all metrics</li>
<li>Close the browser</li>
</ul>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/page-render.service.ts?footer=no"></script>
<p>But, as always, the devil is in the details, and there are a few gotchas involved:</p>
<h4>Chrome switches</h4>
<p>
We don't want to run too many of the APIs Chrome comes with out of the box, as they will slow down our
service. To overcome this, we will use custom <code>launchOptions</code> for Puppeteer's
<code>launch</code> function, so we can pass in a set of Chrome switches
that disable most of what we won't need.
</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/launch-options.ts?footer=no"></script>
<p>We won't go into much detail here, but you can check out the list of
<a href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/chrome-switches.ts?footer=no"
target="_blank">switches passed into chrome</a>
for disabling various features and Web APIs. You can also see the <a
href="https://peter.sh/experiments/chromium-command-line-switches/" target="_blank">full
list of
args that chrome takes</a>.</p>
<h4>Rejection tokens</h4>
<p>
We don't want to request any potentially large page assets, especially if they don't help us
calculate our metrics. So we need to abort the requests that are made to <a
href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/blocked-resource-types.ts?footer=no"
target="_blank">such resources</a>.
</p>
<p>We also don't want our test to fire any analytics events, so we need a list of <a
href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/analytics-rejections.ts?footer=no"
target="_blank">typical analytics servers</a> too.</p>
<p>And finally, we also need to block any request made to well-known <a
href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/ad-rejections.ts?footer=no"
target="_blank">advertising servers</a>, as they just slow down the whole process.</p>
<p>Now we can easily check whether a request's URL or resource type matches any of these
lists. If so, we <code>.abort</code> the request; otherwise we simply
<code>.continue</code> it.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/page-request.handler.ts?footer=no"></script>
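<p>The decision logic can be sketched without a browser. The list contents, the <code>RequestLike</code>
interface and the function names below are illustrative assumptions; the real lists live in the constants
files linked above.</p>

```typescript
// Illustrative values only (assumptions for this sketch).
const blockedResourceTypes: string[] = ["image", "media", "font"];
const rejectedUrlFragments: string[] = ["google-analytics.com", "doubleclick.net"];

// The subset of Puppeteer's request object that this sketch relies on.
interface RequestLike {
  url(): string;
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// True when the request is for a blocked resource type or a rejected server.
function shouldAbortRequest(url: string, resourceType: string): boolean {
  return (
    blockedResourceTypes.includes(resourceType) ||
    rejectedUrlFragments.some((fragment) => url.includes(fragment))
  );
}

// Aborts unwanted requests and lets everything else through.
async function handlePageRequest(request: RequestLike): Promise<void> {
  if (shouldAbortRequest(request.url(), request.resourceType())) {
    await request.abort();
  } else {
    await request.continue();
  }
}

console.log(shouldAbortRequest("https://example.com/hero.png", "image")); // true
console.log(shouldAbortRequest("https://example.com/app.js", "script")); // false
```

<p>With Puppeteer, a handler like this only takes effect after request interception has been enabled with
<code>page.setRequestInterception(true)</code> and the handler has been registered via
<code>page.on('request', ...)</code>.</p>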
<p>This covers all the groundwork that needs to be done before we can finally add some metrics to our API.
</p>
<h2>The metrics</h2>
<p>We want all of our metrics to have the same blueprint, as we simply want to pass them all as an array into
our
rendering service and get a consistent set of results from them. To do so we need to create a blueprint
abstract class which they will all extend.</p>
<p>The base metric will set a few standards for our metrics:</p>
<ul>
<li>All metrics will get their results from Puppeteer's <code>Page</code> and <code>Response</code> objects</li>
<li>Each response payload should describe the type of data the metric returns, so another program
can read that type and pick the right handling logic for the metric.</li>
<li>We also want all metrics to have a unique name so that they can be easily picked from the list if
needed.</li>
<li>All metrics should implement a public method called <code>getMetricValue</code> which provides the
name
and the final value of that metric. This is then used by the <code>getMetric</code> method from
BaseMetric to add the other information we need for all the metrics.</li>
</ul>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/base-types/base-metric.ts?footer=no"></script>
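<p>To make the blueprint idea concrete, here is a rough, browser-free sketch. Every name in it
(<code>SketchBaseMetric</code>, <code>SketchStatusMetric</code>, the <code>dataType</code> field) is an assumption
for illustration, and the page and response are generic type parameters rather than Puppeteer's own classes.</p>

```typescript
// Hypothetical payload shapes for this sketch.
interface MetricValue<T> {
  name: string;
  value: T;
}

interface MetricResult<T> extends MetricValue<T> {
  dataType: string;
}

abstract class SketchBaseMetric<TPage, TResponse, TValue> {
  protected readonly page: TPage;
  protected readonly response: TResponse;
  // Describes the value's type so a consumer can pick the right handling logic.
  protected abstract readonly dataType: string;

  constructor(page: TPage, response: TResponse) {
    this.page = page;
    this.response = response;
  }

  // Each concrete metric supplies its unique name and final value.
  public abstract getMetricValue(): Promise<MetricValue<TValue>>;

  // Adds the information shared by all metrics on top of the metric's own value.
  public async getMetric(): Promise<MetricResult<TValue>> {
    const metricValue = await this.getMetricValue();
    return { ...metricValue, dataType: this.dataType };
  }
}

// Example metric over a stand-in response object (an assumption for this sketch).
class SketchStatusMetric extends SketchBaseMetric<unknown, { status(): number }, number> {
  protected readonly dataType = "number";

  public async getMetricValue(): Promise<MetricValue<number>> {
    return { name: "status-code", value: this.response.status() };
  }
}

new SketchStatusMetric(null, { status: () => 200 })
  .getMetric()
  .then((metric) => console.log(metric));
```

<p>Because every metric exposes the same <code>getMetric</code> method, the rendering service can treat them
as a plain array and await them together.</p>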
<h3>Internal links</h3>
<p>The easiest way to determine whether a link is internal is to check that it is not external. We
do this by checking whether the href's hostname differs from the current page's hostname. See the
<code>isExternalLink</code> method below.</p>
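<p>As a standalone illustration of the hostname comparison, assuming a hypothetical function name
<code>isExternalHref</code> and plain strings instead of DOM nodes:</p>

```typescript
// Hypothetical standalone version of the hostname comparison.
function isExternalHref(href: string, pageUrl: string): boolean {
  // Resolve relative hrefs against the page URL before comparing hostnames.
  const linkHost = new URL(href, pageUrl).hostname;
  const pageHost = new URL(pageUrl).hostname;
  return linkHost !== pageHost;
}

console.log(isExternalHref("/about", "https://example.com/blog")); // false
console.log(isExternalHref("https://other.org/x", "https://example.com/")); // true
```

<p>Note that a strict hostname comparison treats subdomains such as <code>www.example.com</code> and
<code>example.com</code> as different sites, which may or may not be what you want.</p>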
<p>After this we need 4 pieces of information about our links:</p>
<ul>
<li>Easy: <code>href</code></li>
<li>Easy: Text content</li>
<li>Tricky: List of all event listeners attached</li>
<li>Tricky: Is the link healthy or not</li>
</ul>
<p>There is no "easy" way to extract all the event listeners attached to an element (inline or from
script) without doing lots of work. Not to worry, as the <a
href="https://chromedevtools.github.io/devtools-protocol/">Chrome
DevTools Protocol</a> (CDP) can come to the rescue.</p>
<p>How, you ask? Among other things, CDP gives us access to functionality that is normally only available
in Chrome's DevTools console. The console's utilities API includes a function called
<code>getEventListeners</code>. This function is available when you open the developer tools in Chrome,
but it is not part of the page's global namespace.</p>
<p>This function takes a node as an argument and returns an object containing all the event listeners attached
to
that
element, making CDP an ideal choice for extracting our link data.</p>
<p>You can create a CDP session through puppeteer's page object, which is perfect for us as we decided all
our
metrics will have access to the Page and Response object. To make this a little more encapsulated, we can
wrap
all the functionality we need from CDP into a class called <code>CDPSessionClient</code> so
we won't pollute our metric class with its implementation detail.
</p>
<p>So now our <code>InternalLinks</code> metric class can extract the first three pieces of information for us.
For health checking we will create our <code>LinkHealthChecker</code> class:</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/internal-links/internal-links.ts?footer=no"></script>
<h3>CDPSessionClient</h3>
<br>
<ul>
<li>
<p><b>Getting the <code>href</code> attribute</b>: For the attribute we will use the
<code>DOM.getAttributes</code> command in CDP. The response of this
command is a single flat array that interleaves attribute names and their values (yeah, a little
weird, but it will do). So all we have to do is find the index of <code>href</code> in that
array; the element right after that index is the href value.</p>
</li>
<li>
<p><b>Getting event listeners</b>: As discussed earlier, we will use the debugger function for this
through the
<code>DOMDebugger.getEventListeners</code> command. Notice that, unlike the regular DOM API, CDP
mainly operates on <code>nodeId</code> values (unique DOM node identifiers).</p>
</li>
<li>
<p><b>Getting text content</b>: CDP does not have a command to get the text content of a node, so we
use cheerio for that. The <code>DOM.getOuterHTML</code> command gives us the link's markup, tags
included, so we parse it with cheerio and call the <code>.text()</code> method to get the combined
text content.</p>
</li>
</ul>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/cdp/cdp-session-client.ts?footer=no"></script>
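<p>The interleaved array returned by <code>DOM.getAttributes</code> is easy to get subtly wrong, so here is a
small sketch of the lookup on its own. The function name <code>getAttributeFromFlatList</code> is made up for
this example, and it steps through name positions only, so that a value which happens to equal
<code>href</code> is not mistaken for the attribute name.</p>

```typescript
// `attributes` is the interleaved [name, value, name, value, ...] array.
function getAttributeFromFlatList(attributes: string[], attributeName: string): string | undefined {
  // Names sit at even indexes, so step by two and read the value that follows.
  for (let i = 0; i + 1 < attributes.length; i += 2) {
    if (attributes[i] === attributeName) {
      return attributes[i + 1];
    }
  }
  return undefined;
}

console.log(getAttributeFromFlatList(["class", "btn", "href", "/pricing"], "href")); // "/pricing"
console.log(getAttributeFromFlatList(["class", "btn"], "href")); // undefined
```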
<h3>LinkHealthChecker</h3>
<p>This class uses the raw information we gathered using CDP. The health checker considers a link healthy when:
</p>
<ul>
<li>It has a value for the href attribute</li>
<li>That value is not hash-based (<code>#</code>)</li>
<li>That value is not JavaScript code (<code>javascript:</code>)</li>
<li>Not having event handlers is considered better but we allow them for now</li>
</ul>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/internal-links/health-check/link-health-checker.ts?footer=no"></script>
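<p>The rules above can be sketched as a pure function. The name <code>checkLinkHealth</code>, the report shape,
and the reading of "hash-based" as "starts with <code>#</code>" are all assumptions made for this example.</p>

```typescript
// Hypothetical report shape for this sketch.
interface LinkHealthReport {
  healthy: boolean;
  hasEventListeners: boolean;
  reasons: string[];
}

function checkLinkHealth(href: string | undefined, listenerCount: number): LinkHealthReport {
  const reasons: string[] = [];
  const value = (href ?? "").trim();

  if (value === "") {
    reasons.push("missing href");
  } else if (value.startsWith("#")) {
    reasons.push("hash-based href");
  } else if (value.toLowerCase().startsWith("javascript:")) {
    reasons.push("javascript: href");
  }

  // Event listeners are reported but tolerated, so they do not affect `healthy`.
  return { healthy: reasons.length === 0, hasEventListeners: listenerCount > 0, reasons };
}

console.log(checkLinkHealth("/pricing", 0).healthy); // true
console.log(checkLinkHealth("javascript:void(0)", 1).healthy); // false
```

<p>Collecting the reasons instead of returning a bare boolean makes it easier to explain to a page owner why
a link was flagged.</p>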
<h3>No Index</h3>
<p>With simple metrics like this we don't have to go too deep into CDP, as we can simply rely on puppeteer's
high
level API.</p>
<p>Two things make a page NoIndex for bots:</p>
<ul>
<li>
Having a
<code>&lt;meta name='robots|googlebot|bingbot....' content='noindex'/&gt;</code> tag on
the
page.
</li>
<li>
Having an <code>X-Robots-Tag: noindex</code> header in the response
</li>
</ul>
<p>As you can see, the robots meta tag can go by more than one name, so the easiest way to detect a meta
noindex is to look for meta tags whose <code>content</code> attribute value is <code>noindex</code>.
We can easily do this by passing a page function into <code>page.evaluate</code>.</p>
<p>For the headers, we need to use Puppeteer's response object, look for the key
<code>X-Robots-Tag</code>, and check if its value is <code>noindex</code>.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/noindex/noindex.ts?footer=no"></script>
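<p>The header half of the check can be sketched on a plain headers object. The function name
<code>headerSignalsNoindex</code> is an assumption, and the handling of comma-separated directives goes slightly
beyond the exact-match check described above.</p>

```typescript
// Works on a plain name-to-value map of response headers.
function headerSignalsNoindex(headers: Record<string, string>): boolean {
  // Header names are case-insensitive, so normalise before looking the key up.
  const entry = Object.entries(headers).find(([key]) => key.toLowerCase() === "x-robots-tag");
  if (!entry) {
    return false;
  }
  // A value may carry several directives, e.g. "noindex, nofollow".
  return entry[1]
    .toLowerCase()
    .split(",")
    .some((directive) => directive.trim() === "noindex");
}

console.log(headerSignalsNoindex({ "x-robots-tag": "noindex, nofollow" })); // true
console.log(headerSignalsNoindex({ "x-robots-tag": "nofollow" })); // false
```

<p>Puppeteer's <code>response.headers()</code> returns header names in lower case. This sketch does not handle
directives scoped to a specific crawler, such as <code>googlebot: noindex</code>.</p>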
<h3>Performance</h3>
<p>One of the newer Web APIs is the Performance API, which provides access to much of the
performance-related information you would want about a page.
Each piece of information in this API is called an entry. Each entry has a type, which is the group the
entry belongs to.
</p>
<p>For our purposes we are focusing on page rendering performance, whose metrics belong to the entry type
called <code>paint</code>. This is the category that contains metrics such as <code>first-paint</code>
and <code>first-contentful-paint</code> (FCP).</p>
<p>As far as FCP goes, the lower the start time the better. <a
href="https://web.dev/first-contentful-paint#how-lighthouse-determines-your-fcp-score"
target="_blank">Lighthouse uses a particular scoring system for this</a>, which you may wish to adopt.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/performance/performance.ts?footer=no"></script>
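<p>Once the paint entries have been read in the page with <code>performance.getEntriesByType("paint")</code>,
picking out FCP is a simple lookup. The sketch below works on plain objects; the
<code>PaintEntryLike</code> interface and the function name are assumptions for this example.</p>

```typescript
// The two fields of a paint entry that this sketch needs.
interface PaintEntryLike {
  name: string;
  startTime: number;
}

// Returns the FCP start time in milliseconds, or undefined if no such entry exists.
function getFirstContentfulPaint(entries: PaintEntryLike[]): number | undefined {
  return entries.find((entry) => entry.name === "first-contentful-paint")?.startTime;
}

const sampleEntries: PaintEntryLike[] = [
  { name: "first-paint", startTime: 120.5 },
  { name: "first-contentful-paint", startTime: 180.2 },
];
console.log(getFirstContentfulPaint(sampleEntries)); // 180.2
```

<p>Returning <code>undefined</code> rather than <code>0</code> keeps "nothing was painted" distinguishable from
"painted immediately".</p>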
<h3>Redirect chain</h3>
<p>Puppeteer's request object contains the chain of redirected requests. However, if there are no redirects
and the request was successful, the chain will be empty. For SEO purposes it's always nice to have the
final URL in the chain included too, so we always add it manually ourselves.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/redirect-chain/redirect-chain.ts?footer=no"></script>
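<p>A small sketch of that idea, using a structural stand-in for Puppeteer's request object. The
<code>ChainRequestLike</code> interface and the function name <code>toRedirectChain</code> are assumptions for
this example.</p>

```typescript
// The subset of Puppeteer's request object that this sketch relies on.
interface ChainRequestLike {
  url(): string;
  redirectChain(): ChainRequestLike[];
}

// Lists every redirected URL in order, then appends the final URL.
function toRedirectChain(finalRequest: ChainRequestLike): string[] {
  const hops = finalRequest.redirectChain().map((request) => request.url());
  return [...hops, finalRequest.url()];
}

// Stand-in objects simulating a single http -> https redirect.
const firstHop: ChainRequestLike = { url: () => "http://example.com/", redirectChain: () => [] };
const landed: ChainRequestLike = { url: () => "https://example.com/", redirectChain: () => [firstHop] };
console.log(toRedirectChain(landed)); // [ 'http://example.com/', 'https://example.com/' ]
```

<p>With this shape, a page that was not redirected still yields a one-element array, so consumers never have
to special-case an empty chain.</p>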
<h3>Responsive</h3>
<p>What makes a page responsive is a complex debate. For simplicity, however, we propose that the minimum
requirement for being responsive is that the width of the page follows the width of the device.</p>
<p>So based on this, if a page at least has a viewport meta tag where its content attribute contains
<code>width=device-width</code> then we consider this page to be responsive.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/responsive/responsive.ts?footer=no"></script>
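<p>The check on the viewport tag's <code>content</code> value can be sketched as follows. The function name
<code>viewportFollowsDeviceWidth</code> is an assumption, and this version matches the whole directive rather
than a substring, which is a slightly stricter variation on the rule above.</p>

```typescript
// `viewportContent` is the content attribute of the viewport meta tag, or null if the tag is absent.
function viewportFollowsDeviceWidth(viewportContent: string | null): boolean {
  if (viewportContent === null) {
    return false;
  }
  // Normalise case and whitespace so "Width = device-width" also matches.
  const normalised = viewportContent.toLowerCase().replace(/\s+/g, "");
  // Compare whole directives so "min-width=device-width" does not count.
  return normalised.split(/[,;]/).includes("width=device-width");
}

console.log(viewportFollowsDeviceWidth("width=device-width, initial-scale=1.0")); // true
console.log(viewportFollowsDeviceWidth("initial-scale=1")); // false
```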
<h3>robots.txt</h3>
<p>
For robots.txt we will be using a library called <a href="https://github.com/samclarke/robots-parser"
target="_blank">robots-parser</a>. This tool is especially useful when your site has a large
robots.txt file and finding the specific rule that matches a URL is hard.
</p>
<p>You may wish to extend this so the API also lets you pass custom user agents to the parser, or even
swap the library for the newly open-sourced <a href="https://github.com/google/robotstxt"
target="_blank">Google
robots.txt
parser</a>. It is written in C++, so you would have to use a child process or a tool like ShellJS to
interact with the compiled binary. You could also use node-gyp to create a <a
href="https://nodejs.org/api/addons.html" target="_blank">Node.js C++ native add-on</a> that
calls <code>robots_main</code> directly from source. I may share some of these techniques in future
posts.
</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/robots/allowed-robots-txt.ts?footer=no"></script>
<h3>schema.org</h3>
<p>For extracting schema, we use a library called <a href="https://github.com/indix/web-auto-extractor"
target="_blank">Web
Auto Extractor</a> which can extract Microdata, RDFa-lite, JSON-LD and some other meta tags.</p>
<p>In future posts we may show how to validate this extracted data against actual schema definitions.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/schema-org/schema-org.ts?footer=no"></script>
<h3>Status code</h3>
<p>Puppeteer's response object exposes a <code>status()</code> method, which is perfect for adding to our
metrics.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/status/status.ts?footer=no"></script>
<h3>tf-idf</h3>
<p>Node's <a href="https://github.com/NaturalNode/natural">natural</a> library has a good implementation of
tf-idf where we can extract the list of all the terms from the corpus in order of their importance. For
our purposes we can limit this to the first 10 only.</p>
<script
src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/tfidf/tfidf.ts?footer=no"></script>
<p>🏁 I hope you have enjoyed this post and find it useful. Feel free to change, extend and add your own
metrics.</p>
</div>
</div>
</body>
</html>