Using prerender.io to help Single Page Application’s SEO

Single Page Applications (SPAs) are everywhere and the Javascript frameworks to build them keep proliferating, but it is still quite difficult to find precise and reliable information on how to optimize them for search engine crawlers. Google is the only search engine that claims to be able to index a Single Page Application's content, but in my direct experience, as of the beginning of 2018, the results are still not optimal, even when following their AJAX crawling scheme (which has actually been deprecated since 2015 without being replaced by an alternative solution).

The problem search engine crawlers face when trying to index SPA content is the asynchronous rendering of the DOM, executed by Javascript after page loading has finished (the DOMContentLoaded event, or window.onload in older browsers). This is especially true when AJAX calls to external resources are involved, which is the case in most SPAs: those calls can take quite a long time to return, up to several seconds. By that point Googlebot (or any other search engine bot I am aware of) has lost its 'patience' and has already handed the available, incomplete document to the search engine index. The complication of course doesn't exist with 'classic' server side rendering, where a full document is (almost) always served to the crawlers.
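To make the timing concrete, here is a minimal sketch of the pattern that trips crawlers up (the endpoint and element id are hypothetical, not taken from any real application): the markup that actually matters only appears after an asynchronous call returns, well after DOMContentLoaded has fired.

// The HTML shell the crawler downloads contains little more than <div id="app"></div>.
document.addEventListener('DOMContentLoaded', function () {
  // A bot typically snapshots the page around this point...
  fetch('/api/products') // hypothetical endpoint; can take seconds to respond
    .then(function (response) { return response.json(); })
    .then(function (products) {
      // ...while the content worth indexing only shows up here, too late.
      document.getElementById('app').innerHTML = products
        .map(function (p) { return '<h2>' + p.name + '</h2>'; })
        .join('');
    });
});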


This is a tricky issue, probably not 100% solvable. Consider applications that need to update their content continuously and at a very fast rate: for example a betting application where matches, results and odds can change within seconds. How can a search engine bot know when the DOM is fully rendered if the content is constantly updating? I am pretty sure the good engineers at Google (or Bing, Yahoo etc.) are trying to find a way to index even this type of data, but at the moment I guess it's not a solved problem, especially because there's a limit to the number of times a bot can visit the same URL.

A possible solution is a partial return to server side rendering. This is not to be interpreted as a full return to the old, pre-SPA model: it only means that the first rendering of the DOM is generated by the server, while all subsequent updates are rendered directly in the browser. In other words, the browser downloads a fully rendered initial DOM together with the Javascript code that will be responsible for future updates when the state of the application changes.

This technique can be very effective but comes with a few caveats. First of all, the amount of code to be developed is bigger. Whatever language and framework the server uses, generating a complete first version of the DOM can be quite complicated: all the asynchronous calls that the browser makes to gather data from external services now have to be duplicated in the server logic, which adds complexity to the system. You can clearly see why an isomorphic architecture is appealing at this point, because it means writing code that runs both on the server AND in the browser, and we all know that less code means fewer bugs. Another caveat is the increased workload on the server, which is now doing part of the browser's work. On top of that, serving a fully rendered DOM means a significant increase in network traffic: if the content is, for example, a long list, instead of passing a lightweight JSON array to the browser asynchronously, the server now sends a list of HTML tags that carry the content and at the same time define how it must appear to the user, which swells the amount of data considerably. So a mixed server/client rendering model is an effective fix, but it comes with the drawbacks I've just described.
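To give an idea of what this looks like in practice, here is a minimal sketch using Node and Express (the route, the data call and the markup are invented for illustration, not taken from a real project): the asynchronous data fetch now runs on the server, and the browser receives a complete list on the first response, with the client bundle attached for later updates.

// ssr-sketch.js — illustrative only
const express = require('express');
const app = express();

// Hypothetical data source; in a plain SPA this call would happen in the browser.
function fetchProducts() {
  return Promise.resolve([{ name: 'Product 1' }, { name: 'Product 2' }]);
}

app.get('/products', async function (req, res) {
  const products = await fetchProducts(); // the async work has moved to the server
  const items = products.map(function (p) { return '<li>' + p.name + '</li>'; }).join('');
  // Crawler and user both get a fully rendered document on the first response;
  // the bundled client-side code takes over for subsequent state changes.
  res.send('<html><body><ul>' + items + '</ul><script src="/app.js"></script></body></html>');
});

app.listen(8080);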

A second solution is an architecture that moves the asynchronous and problematic part of the rendering to an external service, but only for search engine bots. The service is responsible for serving the complete DOM after all AJAX requests have returned: in other words, a pre-rendered, cached version of the URL.
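The idea in a nutshell, as a hypothetical Node sketch (the bot list is shortened and the host names are placeholders): regular visitors get the SPA shell as usual, while recognised bots receive the pre-rendered HTML fetched from the external service.

// bot-routing-sketch.js — illustrative only
const http = require('http');

const BOTS = /bingbot|twitterbot|facebookexternalhit|linkedinbot|slackbot/i;

http.createServer(function (req, res) {
  if (BOTS.test(req.headers['user-agent'] || '')) {
    // Bot: return the pre-rendered, cached DOM produced by the external service.
    http.get('http://your-prerender-service:3000/render?url=https://www.your-spa.com' + req.url,
      function (prerendered) { prerendered.pipe(res); });
  } else {
    // Regular user: serve the usual SPA shell and let the browser do the rendering.
    res.end('<html><body><div id="app"></div><script src="/app.js"></script></body></html>');
  }
}).listen(8080);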


From a traditional SEO point of view, treating search engine spiders differently from ordinary users (AKA cloaking) is frowned upon, but if you think about it, there would only be a problem if the content served to the bot differed from what is served to ordinary users. In this case the content is more complete, not different: it is exactly the same content that ordinary users can see and the bot can't, because of the asynchronous rendering process. This solution is simpler than server side rendering because it doesn't directly affect either the SPA or the application server and leaves their codebases unchanged. It still adds some complexity to the system, but that complexity is totally decoupled from the server logic: pre-rendering happens with the help of an external service and only for search engine bots. For this specific reason (decoupling) I currently tend to prefer prerendering to server side rendering. Of course this might change in the future, when mature server side rendering frameworks become available for production (in the Node.js ecosystem, next.js looks very interesting, for example).

Now that we have properly framed the problem, let's see how prerender.io works. Prerender.io comes in two 'flavours': either as open source software to install and run on your own server, or as a SaaS with a free tier of up to 250 urls/pages. The SaaS service is an all-round and very interesting solution that offers an online interface listing all your cached pages, including stats that show when the search engine crawlers visited your site. Despite that, I decided to go the DIY way and installed it on a CentOS 7 server, to have more control over the configuration and to see how prerender.io works internally. In order to render our SPAs, prerender.io depends on a so-called 'headless browser', a browser without a graphical user interface: so I installed Headless Chrome on my OS following this simple guide.

With Headless Chrome up and running, installing the software with NPM is as easy as:

npm install prerender

The next step is to write a very basic Node server that will be at the core of our pre-rendering service:

// server.js
const prerender = require('prerender');

// The chromeFlags are passed straight to the Headless Chrome instance doing the rendering
const server = prerender({
  chromeFlags: ['--no-sandbox', '--headless', '--disable-gpu', '--remote-debugging-port=9222', '--hide-scrollbars'],
  logRequests: true
});

server.start();

Then start it with:

node server.js

The configuration is pretty self-explanatory; it's worth noting that the '--headless' flag is the one that ensures the browser runs without a graphical user interface, and that the logRequests property set to true is very useful when you want to debug your server. By the way, if your machine is behind a firewall, make sure to open port 3000 (the default port prerender listens on); you should then already be able to test your server with an example url:

curl http://your-prerender-server:3000/render?url=https://www.your-spa.com/products/product1

If everything is working you should see an output that shows a fully rendered DOM of your SPA in its initial state. Try it with different urls and parameters and see how the generated HTML document changes accordingly.

Good! We are now halfway through our SEO optimization process: we have a prerender.io service up and running, but we still haven't redirected the search engine bots to it. What we need now is a middleware that intercepts the bots, and prerender.io kindly comes to the rescue with a list of options for many possible server architectures. In my case the CentOS 7 server is running Apache, so a simple .htaccess directive is all I need. Place an .htaccess file with the following content in the main folder of your SPA:

<IfModule mod_rewrite.c>
    RewriteEngine On
    <IfModule mod_proxy_http.c>
        RewriteCond %{HTTP_USER_AGENT} baiduspider|facebookexternalhit|twitterbot|rogerbot|linkedinbot|embedly|quora\ link\ preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator [NC,OR]
        RewriteCond %{QUERY_STRING} _escaped_fragment_
        RewriteRule ^(?!.*?(\.js|\.css|\.xml|\.less|\.png|\.jpg|\.jpeg|\.gif|\.pdf|\.doc|\.txt|\.ico|\.rss|\.zip|\.mp3|\.rar|\.exe|\.wmv|\.avi|\.ppt|\.mpg|\.mpeg|\.tif|\.wav|\.mov|\.psd|\.ai|\.xls|\.mp4|\.m4a|\.swf|\.dat|\.dmg|\.iso|\.flv|\.m4v|\.torrent|\.ttf|\.woff))(.*) http://your-prerender-server:3000/http://your-spa.com/$2 [P,L]
    </IfModule>
</IfModule>

(If you are wondering why Googlebot is not in the list of intercepted user agents, that's because it supports the _escaped_fragment_ parameter, which is matched by the second RewriteCond of the .htaccess file: under the now-deprecated AJAX crawling scheme, the crawler requests the page with an _escaped_fragment_ query parameter, for example https://www.your-spa.com/?_escaped_fragment_=/products/product1, and that query string triggers the proxy rule.)

Restart Apache to activate the redirection and the prerendering system should now be complete and active. You can test it by faking the browser user agent with curl, for example by pretending to be Googlebot (remember that, per the rules above, Googlebot is only redirected when the _escaped_fragment_ parameter is present):

curl --user-agent "Googlebot/2.1 (+http://www.google.com/bot.html)" -v "https://www.your-spa.com/products/product1?_escaped_fragment_="

… and you should see the same output as when calling the prerender server directly. You have now implemented your own prerender service!
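One last note: if your SPA happens to be served by a Node/Express application rather than Apache, the same interception can be done with prerender.io's prerender-node middleware instead of the .htaccess rules. A minimal sketch, assuming the service built above is reachable at http://your-prerender-server:3000/ (the host name is a placeholder, as everywhere else in this article):

// express-spa-server.js — sketch of the middleware option
const express = require('express');
const prerenderNode = require('prerender-node');

const app = express();

// Point the middleware at the self-hosted prerender service instead of the hosted SaaS.
app.use(prerenderNode.set('prerenderServiceUrl', 'http://your-prerender-server:3000/'));

// ...then serve the SPA as usual.
app.use(express.static('public'));

app.listen(8080);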
