index.html

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="X-UA-Compatible" content="ie=edge" />
    <link rel="icon" sizes="16x16 32x32" href="https://www.deepcrawl.com/wp-content/uploads/2015/03/DC-1.png">
    <title>The Top 10 Things Developers Need to Know About SEO</title>
    <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" rel="stylesheet"
        integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous" />

    <meta name="title" content="The Top 10 Things Developers Need to Know About SEO">
    <meta name="description"
        content="A Simple API containing a curated set of metrics about the health of a webpage from the technical SEO point of view. It should act as a starting point for an engineer who likes to play and learn more about extracting insights from web pages for the purposes of SEO or testing.">

    <!-- Open Graph / Facebook -->
    <meta property="og:type" content="website">
    <meta property="og:url" content="https://deepcrawl.github.io/top10-seo-list-for-developers/">
    <meta property="og:title" content="Top 10 things Developer need to know about SEO">
    <meta property="og:description"
        content="A Simple API containing a curated set of metrics about the health of a webpage from the technical SEO point of view. It should act as a starting point for an engineer who likes to play and learn more about extracting insights from web pages for the purposes of SEO or testing.">
    <meta property="og:image"
        content="https://repository-images.githubusercontent.com/211297643/b832a600-f349-11e9-90a7-028e920736da">

    <!-- Twitter -->
    <meta property="twitter:card" content="summary_large_image">
    <meta property="twitter:url" content="https://deepcrawl.github.io/top10-seo-list-for-developers/">
    <meta property="twitter:title" content="Top 10 things Developer need to know about SEO">
    <meta property="twitter:description"
        content="A Simple API containing a curated set of metrics about the health of a webpage from the technical SEO point of view. It should act as a starting point for an engineer who likes to play and learn more about extracting insights from web pages for the purposes of SEO or testing.">
    <meta property="twitter:image"
        content="https://repository-images.githubusercontent.com/211297643/b832a600-f349-11e9-90a7-028e920736da">
    <style>
        .jumbotron {
            background-color: #24333e;
            color: white;
            border-radius: 0%;
        }

        .jumbotron h1 {
            font-weight: 100;
        }

        .jumbotron .brace {
            position: fixed;
            font-size: 52rem;
            top: -32rem;
            transform: rotate(20deg);
            opacity: .05;
            left: 5rem;
            pointer-events: none;
        }

        body {
            background: #f4f4f4;
            font-size: 1.1rem;
        }

        .gist-file {
            border-color: #f8f8ff !important;
            padding: 1em !important;
            background-color: #f8f8ff;
        }

        .prettyprint {
            padding: 1em !important;
        }

        .btn-primary {
            background: #7eac4a;
            border-color: #7eac4a;
        }

        .btn-primary:hover,
        .btn-primary:focus,
        .btn-primary.focus,
        .btn-primary.active,
        .btn-primary:active,
        .open>.dropdown-toggle.btn-primary {
            background: #679f2d !important;
            border-color: #679f2d !important;
            box-shadow: none;
        }
    </style>
</head>

<body>

    <div class="jumbotron">
        <span class="brace">{}</span>
        <div class="container">
            <h1>The Top 10 Things Developers Need to Know About SEO</h1>
            <p class="lead">
                A simple API containing a curated set metrics about the health of a webpage from a technical SEO
                point of
                view. Hopefully this API will act as a starting point for any engineer who likes to play and learn more about
                extracting
                insights from web pages for the purposes of SEO or testing.
            </p>
            <a href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers"
                class="btn btn-primary btn-lg">Github
                Repository</a>
        </div>
    </div>

    <div class="container">

        <div class="col-12">
            <h2>Express Server and Router</h2>

            <p>Create an instance of Express server and set it to use our apiRouter on <code>/api</code> path and
                listen on
                a port.</p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/index.ts?footer=no"></script>

            <p>
                Create an instance of Express Router and add a <code>get</code> route matcher on path
                <code>/page-health</code> to use our main handler function. This now makes your API route
                <code>/api/page-health</code>.
            </p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/api/routes.ts?footer=no"></script>

            <p>
                Now let's create our handler function. It needs to be able to read the query string parameter for
                <code>?url=</code> and validate it. If it checks out we can pass it to our render service to process it.
                Otherwise we send an error response to the user to correct the faulty parameter.
            </p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/api/page-health.route.ts?footer=no"></script>

            <p>
                Here is how we validate the URL. URL class throws an error when the passed string can not be constructed
                into
                a
                URL instance, so it's an easy way to validate a string.
            </p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/utils/url.utils.ts?footer=no"></script>

            <p>Now we are done with our general scaffolding for the API side, let's build some software!</p>

            <h2>Rendering Pages</h2>

            <p>Our core service for the app is our rendering service, with a few simple duties:</p>

            <ul>
                <li>Take the URL passed to it from the API router</li>
                <li>Create a puppeteer browser</li>
                <li>Create a page and response object and pass it to all our metrics</li>
                <li>Return the results from all metrics</li>
                <li>Close browser</li>
            </ul>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/page-render.service.ts?footer=no"></script>

            <p>But, as always, the devil is in the details and there are few gotchas involved:</p>


            <h4>Chrome switches</h4>
            <p>
                We don't want to run too many of the APIs Chrome comes with out of the box, as they will slow down our
                service.
                To overcome this
                we will use a custom <code>launchOptions</code> for puppeteer's <code>launch</code> function, so
                we
                can pass in a set a chrome switches</a>
                that would disable most of what we won't need.
            </p>
            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/launch-options.ts?footer=no"></script>

            <p>We won't go into much detail here, but you can check out the list of
                <a href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/chrome-switches.ts?footer=no"
                    target="_blank">switches passed into chrome</a>
                for disabling various features and Web APIs. You can also see the <a
                    href="https://peter.sh/experiments/chromium-command-line-switches/" target="_blank">full
                    list of
                    args that chrome takes</a>.</p>

            <h4>Rejection tokens</h4>
            <p>
                We don't want to request any possibly large assets of the page, especially if they don't help us
                in
                calculation
                of our metrics. So we need to abort the requests that are made to <a
                    href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/blocked-resource-types.ts?footer=no"
                    target="_blank">such resources</a>.
            </p>

            <p>We also don't want our test to run any analytics events, so we would need to create a list of <a
                    href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/analytics-rejections.ts?footer=no"
                    target="_blank">typical analytics servers</a> too.</p>

            <p>And finally, we need to also block any request made to well known <a
                    href="https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/constants/ad-rejections.ts?footer=no"
                    target="_blank">advertising servers</a>, as they just slow down the whole process.</p>

            <p>Now we are able to easily check whether the request URL in ResourceType is matching any of these
                types. If so, we can <code>.abort</code> the request, otherwise we simply
                <code>.continue</code></p>.
            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/config/page-request.handler.ts?footer=no"></script>


            <p>This constitutes all of the lifting that needs to be done before we can finally add some metrics to our API.
            </p>

            <h2>The metrics</h2>

            <p>We want all of our metrics to have the same blueprint, as we simply want to pass them all as an array into
                our
                rendering service and get a consistent set of results from them. To do so we need to create a blueprint
                abstract class which they will all extend.</p>

            <p>The base metric will set a few standards for our metrics:</p>

            <ul>
                <li>All metrics will get their results from puppeteer's <code>Page</code> and <code>Response</code></li>
                <li>We need the response payloads to provide information about the type of data returned from each
                    metric.
                    We do this for better usability of our data, so another program can easily read what type the data
                    has
                    so it can pick the right type of handling logic against that metric.</li>
                <li>We also want all metrics to have a unique name so that they can be easily picked from the list if
                    needed.</li>
                <li>All metrics should implement a public method called <code>getMetricValue</code> which provides the
                    name
                    and the final value of that metric. This is then used by the <code>getMetric</code> method from
                    BaseMetric to add the other information we need for all the metrics.</li>
            </ul>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/base-types/base-metric.ts?footer=no"></script>


            <h3>Internal links</h3>

            <p>The easiest way of determining if a link is internal is to check that it is not external. We
                simply
                do
                this
                by checking if the href's hostname is different to the current page's href hostname. See
                <code>isExternalLink</code>
                method below.</p>

            <p>After this we need 4 pieces of information about our links:</p>

            <ul>
                <li>Easy: <code>href</code></li>
                <li>Easy: Text content</li>
                <li>Tricky: List of all event listeners attached</li>
                <li>Tricky: Is the link healthy or not</li>
            </ul>

            <p>There is no "easy" way to extract all the event listeners attached to an element (inline or from
                script)
                without doing lots of work. Not to worry as <a
                    href="https://chromedevtools.github.io/devtools-protocol/">Chrome
                    Developer Protocol</a> (CDP) can come to the rescue.</p>

            <p>How, you ask? CDP, amongst other tools, has access to the console API of chrome. Console API contains a
                function
                in
                the global namespace called <code>getEventListeners</code>. This function is available when you open the
                developer toolbar in chrome but normally isn't part of the global namespace.</p>

            <p>This function takes a node as an argument and returns an object containing all the event listeners attached
                to
                that
                element, making CDP an ideal choice for extracting our link data.</p>

            <p>You can create a CDP session through puppeteer's page object, which is perfect for us as we decided all
                our
                metrics will have access to the Page and Response object. To make this a little more encapsulated, we can
                wrap
                all the functionality we need from CDP into a class called <code>CDPSessionClient</code> so
                we won't pollute our metric class with its implementation detail.
            </p>

            <p>So now our <code>InternalLinks</code> metric class can extract the first three pieces of information for us.
                For
                health checking we will create our <code>LinkHealthChecker</code> class</p>:
            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/internal-links/internal-links.ts?footer=no"></script>

            <h3>CDPSessionClient</h3>
            <br>

            <ul>
                <li>
                    <p><b>Getting the <code>href</code> attribute</b>: For attribute we will use the
                        <code>DOM.getAttributes</code> command in CDP. The response of this
                        command is an array of attribute names and their values in the same array (yeah, a little weird,
                        but
                        it will do.)
                        So all we have to do is find what the index of `href` is in that response array and the value
                        after
                        that index will be
                        the href value.</p>
                </li>
                <li>
                    <p><b>Getting event listeners</b>: As discussed earlier, we will use the debugger function for this
                        through the
                        <code>DOMDebugger.getEventListeners</code> command. Notice that unlike DOM normal API, CDP
                        mainly
                        operates on nodeId (Unique DOM node identifier).</p>
                </li>
                <li>
                    <p><b>Getting text content</b>: CDP does not have a command to get the text content of a node, so we
                        use cheerio for that. Using
                        the
                        <code>DOM.getOuterHTML</code> we can get the HTML of the link, however, this would contain HTML, so
                        by
                        parsing it using cheerio we can use the <code>.text()</code> method to get the combined text
                        content.</p>
                </li>
            </ul>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/page-rendering/cdp/cdp-session-client.ts?footer=no"></script>

            <h3>LinkHealthChecker</h3>

            <p>This class uses the raw information we gathered using CDP. Health checker considers a link healthy when:
            </p>

            <ul>
                <li>It has a value for the href attribute</li>
                <li>That value is not <code>#</code> based</li>
                <li>That value is not javascript code (<code>javascript:</code>)</li>
                <li>Not having event handlers is considered better but we allow them for now</li>
            </ul>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/internal-links/health-check/link-health-checker.ts?footer=no"></script>

            <h3>No Index</h3>

            <p>With simple metrics like this we don't have to go too deep into CDP, as we can simply rely on puppeteer's
                high
                level API.</p>

            <p>Two things make a page NoIndex for bots:</p>

            <ul>
                <li>
                    Having a
                    <code>&lt;meta name=&#39;robots|googlebot|bingbot....&#39; content=&#39;noindex&#39;/&gt;</code> on
                    the
                    page.
                </li>
                <li>
                    Having a <code>X-Robots-Tag=noindex</code> in the response headers
                </li>
            </ul>

            <p>As you can see, robots meta tag is not mononymous so the easiest way to determine meta noindex is by
                looking
                for meta tags where their <code>content</code> attribute value is <code>noindex</code>. We can easily do
                this
                by
                passing a page function into <code>page.evaluate</code></p>

            <p>For the headers, we need to use the puppeteer's response object and look for the key
                <code>X-Robots-Tag</code>
                and
                check if its value is <code>noindex</code></p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/noindex/noindex.ts?footer=no"></script>

            <h3>Performance</h3>

            <p>One of the newer Web APIs introduced recently is the Performance API which provides access to many of the
                performance-related information you would want about a page.
                Each piece of information in this API is called an entry. Each entry has a type which is the group the
                entry
                belongs to.
            </p>

            <p>For our purposes we are focusing on the page rendering performance, for which its metric belongs to the entryType
                called <code>paint</code>. This is the category that contains metrics such as <code>first-paint</code>
                and
                <code>first-contentful-paint</code> (FCP).</p>

            <p>As far as FCP goes the lower the start time the better. <a
                    href="https://web.dev/first-contentful-paint#how-lighthouse-determines-your-fcp-score"
                    target="_blank">Lighthouse uses a particular scoring system for this</a>, which you may embrace.</p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/performance/performance.ts?footer=no"></script>

            <h3>Redirect chain</h3>

            <p>Puppeteer's request object contains the chain of requests, however, if there are no redirects and the
                request
                was successful, the chain will be empty. For SEO purposes it's always nice to have the final link in
                the
                chain included too, so for this reason we always manually add the final link ourselves.</p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/redirect-chain/redirect-chain.ts?footer=no"></script>

            <h3>Responsive</h3>

            <p>What makes a page responsive is a complex debate, however, we propose for simplicity that the minimum
                requirement
                for being responsive is for the width of the page to follow the width of the device.</p>

            <p>So based on this, if a page at least has a viewport meta tag where its content attribute contains
                <code>width=device-width</code> then we consider this page to be responsive.</p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/responsive/responsive.ts?footer=no"></script>


            <h3>robots.txt</h3>

            <p>
                For robots we will be using a library called <a href="https://github.com/samclarke/robots-parser"
                    target="_blank">robots-parser</a>. This tool is especially useful when your site has a large
                robots.txt
                file
                and finding the specific rule that matches a url can be hard.
            </p>

            <p>You may wish to extend this so the API also allows you to pass custom user-agents to the parser, or even
                switch them out with the newly open sourced <a href="https://github.com/google/robotstxt"
                    target="_blank">Google
                    robots.txt
                    parser</a>. It is written in C++ so you would have to use a child process or tools like shell.js to
                interact with the C++ binary. You can also do it using node-gyp by creating a <a
                    href="https://nodejs.org/api/addons.html" target="_blank">node C++ native add-on</a> that
                would call the robots_main directly from source. I may be sharing some of these techniques in future
                posts.
            </p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/robots/allowed-robots-txt.ts?footer=no"></script>

            <h3>schema.org</h3>

            <p>For extracting schema, we use a library called <a href="https://github.com/indix/web-auto-extractor"
                    target="_blank">Web
                    Auto Extractor</a> which can extract Microdata, RDFa-lite, JSON-LD and some other random meta tags.</p>

            <p>In future posts we may show how to validate this extracted data against actual schema definitions</p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/schema-org/schema-org.ts?footer=no"></script>

            <h3>Status code</h3>

            <p>Puppeteer's response object contains a utility method for status, which is perfect for adding to our
                metrics.</p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/status/status.ts?footer=no"></script>

            <h3>tf-idf</h3>

            <p>Node's <a href="https://github.com/NaturalNode/natural">natural</a> library has a good implementation of
                tf-idf where we can extract the list of all the terms from the corpus in order of their importance. For
                our purposes we can limit this to the first 10 only.</p>

            <script
                src="https://gist-it.appspot.com/https://github.com/ali-habibzadeh/top10-seo-list-for-developers/blob/master/src/metrics/metric-items/tfidf/tfidf.ts?footer=no"></script>

            <p>🏁 I hope you have enjoyed this post and find it useful. Feel free to change, extend and add your own
                metrics.</p>
        </div>

    </div>
</body>

</html>