Fix google rankings of docs by using weighting / google sitemaps #79

ukd1 · 2019-08-08T22:15:56Z

For me, Google links to https://crystal-lang.org/api/0.20.1/HTTP/Client.html

I actually want https://crystal-lang.org/api/0.30.0/HTTP/Client.html, or "latest" (https://crystal-lang.org/api/latest/HTTP/Client.html).

This can be done with a sitemap (https://en.wikipedia.org/wiki/Sitemaps) using weighting, i.e. the priority element:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>https://crystal-lang.org/api/0.20.1/HTTP/Client.html</loc>
        <lastmod>2019-07-10</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.1</priority>
    </url>
    <url>
        <loc>https://crystal-lang.org/api/0.30.0/HTTP/Client.html</loc>
        <lastmod>2019-07-10</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
</urlset>

etc...

straight-shoota · 2019-08-08T22:30:12Z

This is essentially what is discussed in crystal-lang/crystal#5952.
The idea was to use canonical references to the latest URL. But that also has some issues. Most importantly, it currently doesn't cover doc pages < 0.25.0, so these still show up in the search results (but there are none between 0.25.0 and 0.30.0 because they all point to latest).

Using a sitemap seems like a smart alternative to solve this issue. 👍

bcardiff · 2019-08-09T12:53:13Z

Does someone know if there is a tool for generating the sitemap that does some recursive checks over directories? Unless it can be automated it won't happen.

Doing a pass over old docs to set the canonical seems more likely if that fixes the issue.

straight-shoota · 2019-08-09T16:58:23Z

This should be fairly simple to automate. Essentially you only need to extract all local links from each version's index.html in the API docs. Those are already the URLs for the sitemaps. All links except those from the most recent version get a lower priority.
The API docs of each release should be in a distinct sitemap file, and all of them can be combined in a sitemap index.
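
A rough sketch of the extraction step described above, not the actual generator: it assumes the links in index.html are plain href="...html" attributes, and the base URL and priority arguments are hypothetical inputs.

```shell
# Print the local (relative) .html links found in an HTML file, one per line.
# Assumes links appear as href="....html"; absolute URLs are filtered out.
extract_local_links() {
  grep -o 'href="[^"]*\.html"' "$1" \
    | sed -e 's/^href="//' -e 's/"$//' \
    | grep -v -E '^(https?:)?//' \
    | sort -u
}

# Read relative links on stdin and print a sitemap for them.
# $1: base URL of one release's API docs, $2: priority for all of its pages
links_to_sitemap() {
  local base_url=$1 priority=$2 link
  printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>'
  printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  while IFS= read -r link; do
    printf '  <url><loc>%s/%s</loc><priority>%s</priority></url>\n' \
      "$base_url" "$link" "$priority"
  done
  printf '%s\n' '</urlset>'
}
```

Usage would be something like `extract_local_links 0.30.0/index.html | links_to_sitemap https://crystal-lang.org/api/0.30.0 1.0 > 0.30.0/sitemap.xml`, with a lower priority passed for older releases.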

I can try to put this together if nobody else is interested.

ukd1 · 2019-08-12T16:53:30Z

@straight-shoota nice, didn't think to check for the issue on the main repo somehow...lol. I'd be down for helping on this, but tbh, I have no idea how the docs or old versions are built, so it might be easier if you do it. If you'd like a hand or want to pair, lmk.

RX14 · 2019-08-25T13:26:31Z

> Doing a pass over old docs to set the canonical seems more likely if that fixes the issue.

This can be done with a simple sed script to adjust the header for old versions. The doc pages don't need to be regenerated.
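
A minimal sketch of such a sed pass, assuming each old page has a single closing </head> tag and no canonical link yet. The function name and arguments are hypothetical, and the URL must not contain `|` or `&`, since both are special in the sed expression.

```shell
# Add a canonical link to one HTML file unless it already has one.
# $1: path to the HTML file, $2: canonical URL to point at
set_canonical() {
  local file=$1 url=$2
  if grep -q 'rel="canonical"' "$file"; then
    return 0
  fi
  # Write to a temp file and move it back, to avoid sed -i portability issues.
  sed "s|</head>|<link rel=\"canonical\" href=\"$url\"></head>|" "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}
```

It could be applied per page, e.g. `set_canonical 0.20.1/HTTP/Client.html https://crystal-lang.org/api/latest/HTTP/Client.html`, looping over all HTML files of an old release.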

RX14 · 2019-08-25T13:27:25Z

I'd suggest redoing these old pages before working on a sitemap

straight-shoota · 2019-09-05T10:28:48Z

@RX14 The problem with canonical links is that docs for older versions vanish from the search results because the search algorithm treats them as duplicate content. But in fact, they're not duplicate and users might have a need to find documentation for older versions as well. For example, when upgrading to a new version, you may need to read the documentation of deprecated features not available in the current version in order to find a suitable replacement.

Using a sitemap is a superior solution because it allows assigning priorities to individual pages, and outdated pages don't vanish completely; they just won't be as prominent as more recent ones. I think it should eventually replace the canonical link.

straight-shoota · 2019-09-05T12:31:19Z

I've put together a simple program to automatically generate sitemaps for https://crystal-lang.org/api

It's available at https://github.com/straight-shoota/crystal_docs_sitemap

Generated output: output.tar.gz

The output contents should be published at https://crystal-lang.org/api/ and search engines need to be informed about the sitemap (see https://www.sitemaps.org/protocol.html#informing).
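
One of the ways the sitemaps protocol describes for informing search engines is a `Sitemap:` line in robots.txt. A small sketch of adding that line idempotently. The function name is hypothetical, and the sitemap URL would be whatever location the index ends up being published at.

```shell
# Append a Sitemap line to robots.txt unless it is already present.
# $1: path to robots.txt, $2: absolute URL of the sitemap or sitemap index
add_sitemap_to_robots() {
  local robots=$1 sitemap_url=$2
  touch "$robots"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  if ! grep -q -x -F "Sitemap: $sitemap_url" "$robots"; then
    printf 'Sitemap: %s\n' "$sitemap_url" >> "$robots"
  fi
}
```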

straight-shoota · 2019-10-18T13:02:24Z

@bcardiff WDYT?

bcardiff · 2019-10-18T13:25:11Z

I agree the sitemap is worth having and is a good solution. I don't think having the sitemap checked in the repo is the right thing.

From a workflow point of view, what would make sense is a tool that changes an existing sitemap with a few operations:

1. Add dir content as it will be reached from a specific URL prefix
2. Set the priority for all routes matching a specific prefix

At least that workflow will play well with the release process, where we have a local dir with the new api documentation to upload.

I am unsure how to keep the sitemap up to date with respect to the content from jekyll itself regarding lastmod of existing pages and new posts. Maybe there is a 3rd action

3. Update lastmod for a subset of dirs as they will be reached from a specific URL prefix.

That way we can update those params without iterating the whole content.

So, in essence, is having an approach to update rather than create a sitemap.

straight-shoota · 2019-10-18T15:11:08Z

> I don't think having the sitemap checked in the repo is the right thing.

Agreed. It just needs to be generated and put into an S3 bucket. Ideally, a rebuild should be triggered after the nightly API docs have been updated from master.

> I am unsure how to keep the sitemap up to date with respect to the content from jekyll itself regarding lastmod of existing pages and new posts.

Currently, these sitemaps are only for /api, so Jekyll is not even involved. This is perfectly fine, the sitemap doesn't need to incorporate all pages on the domain. Getting the priorities right for the API versions is the main issue here, and I'd like to get that fixed before considering other parts of the website. They can be tackled individually (for example, Jekyll can simply build its own sitemap), we just need to reference all sitemaps from the sitemapindex.
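
As a sketch of what referencing several independently built sitemaps from one index could look like: the function below prints a sitemap index in the sitemaps.org format from a list of sitemap URLs. The function name, the example URLs and the single shared lastmod date are assumptions for illustration.

```shell
# Read sitemap URLs on stdin (one per line) and print a sitemap index.
# $1: lastmod date applied to every entry (simplification; it could differ per file)
render_sitemap_index() {
  local lastmod=$1 url
  printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>'
  printf '%s\n' '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  while IFS= read -r url; do
    printf '  <sitemap><loc>%s</loc><lastmod>%s</lastmod></sitemap>\n' "$url" "$lastmod"
  done
  printf '%s\n' '</sitemapindex>'
}

# Example: one sitemap per API release (hypothetical locations).
printf '%s\n' \
  'https://crystal-lang.org/api/0.30.0/sitemap.xml' \
  'https://crystal-lang.org/api/0.29.0/sitemap.xml' \
  | render_sitemap_index 2019-10-18
```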

> So, in essence, is having an approach to update rather than create a sitemap.

Sure, we can do that. I just figured it would be easier to simply run the generator and push the result to S3 without having to synchronize first.

In practice, there are two events that would require an update to the sitemaps:

- Every day the updated nightly API docs are published for master. This only needs a rebuild of the sitemap for /api/master.
- When a new Crystal version is released, we need to build the sitemap for the new release and rebuild for the last x releases in order to update the priority. x is currently 3: the last 2 versions get priorities (0.5, 0.3) and the one after that needs to be set to the default (0.1).

The priority adjustments could actually just be implemented with a simple search-and-replace (e.g. with sed). The contents don't change when a release ages, thus there is no need to actually rebuild the sitemap.
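
A sketch of that adjustment on a local copy of a sitemap, assuming priorities are stored as <priority> elements and that the latest release gets 1.0. The function names are made up for illustration.

```shell
# Map a release's age (0 = latest, 1 = previous, ...) to a sitemap priority.
priority_for_age() {
  case $1 in
    0) echo 1.0 ;;
    1) echo 0.5 ;;
    2) echo 0.3 ;;
    *) echo 0.1 ;;
  esac
}

# Rewrite every <priority> element in a sitemap file.
# $1: path to a local sitemap.xml, $2: new priority
set_priority() {
  local file=$1 priority=$2
  sed "s|<priority>[0-9.]*</priority>|<priority>$priority</priority>|g" "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}
```

For example, after a release the previous version's sitemap could be updated with `set_priority sitemap.xml "$(priority_for_age 1)"`.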

Considering all this, it might actually be the best solution to integrate the sitemap generation into the doc generator. This problem is not specific to the stdlib, and this way the API docs of all shards could benefit.
This is really trivial to implement; it just spits out another file. It won't require additional configuration either: there is already --canonical-base-url, and priority could just be 1.0 by default. Maybe a --sitemap-priority option could be useful, but it's not necessary.

To build the sitemaps for legacy releases, we can just use https://github.com/straight-shoota/crystal_docs_sitemap. That's a one-time thing.

With this, updates to master sitemap don't need any additional action because the updated sitemap is already provided by the doc generator.
When a new release is added, we need to add it to the sitemapindex and update the sitemaps for the last releases, but this could just be s|<priority>1.0</priority>|<priority>0.5</priority>| etc.

bcardiff · 2019-10-18T15:20:16Z

> Currently, these sitemaps are only for /api, so Jekyll is not even involved

Wouldn't that prevent indexing other pages?

> Every day the updated nightly API docs are published for master.

I thought we didn't want sitemap for master. Is mostly used for preview (edit: sorry you mention it at the end)

> When a new Crystal version is released,

I'm ok downloading the whole docs for a first time generation (edit: or using the proposed script), but upon a Crystal version release I don't have the whole bucket of docs locally. And I don't want to be required to download it. What I do have is the -doc.tar.gz artifact that is pushed. I was thinking of injecting the new paths there, without actually retrieving them from HTTP or the bucket. Hence the proposed transformations 1 and 2.

straight-shoota · 2019-10-18T16:44:32Z

> Wouldn't that prevent indexing other pages?

No, sitemaps are not used as an exclusive source. Search engines still employ their regular crawling. They just augment the results or help discover pages that would otherwise not be discovered. See https://webmasters.stackexchange.com/questions/114425/if-i-remove-urls-from-an-xml-sitemap-will-google-still-index-them

straight-shoota · 2019-10-18T17:12:55Z

> I thought we didn't want sitemap for master. Is mostly used for preview

I guess it's not strictly necessary, but when the doc generator puts out the sitemap anyway, this requires no extra effort at all.

> What I do have is the -doc.tar.gz artifact that is pushed.

My suggestion is that the sitemap is generated directly by the docs generator, thus it would already be included in the doc.tar.gz.
Each API version has its own sitemap (sitemap.xml), which would be located at /api/{{version}}/sitemap.xml.

When publishing a new release, you would just push the contents of doc.tar.gz and the new sitemap is online. It needs to be referenced in the sitemap index, so that's adding one line to that file:

<sitemap><loc>https://crystal-lang.org/api/{{version}}/sitemap.xml</loc><lastmod>{{`date --rfc-3339=date`}}</lastmod></sitemap>

And you would need to grab /api/{{version-1}}/sitemap.xml, /api/{{version-2}}/sitemap.xml, /api/{{version-3}}/sitemap.xml, replace the priorities and push them back up.

This could all be placed in a simple shell script which could automatically retrieve the files from S3, apply the changes and push them back up. I haven't tested this but the general idea looks like this:

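# Usage: CURRENT_VERSION PREVIOUS_VERSION... (previous releases newest first)
# Expects S3_BUCKET to be set in the environment. Requires GNU sed and GNU date.
CURRENT_VERSION=$1

# Reference the new release's sitemap in the sitemap index.
aws s3 cp "$S3_BUCKET/sitemapindex.xml" sitemapindex.xml

# Insert the entry before the last line, assumed to be the closing </sitemapindex> tag.
sed -i "\$ i\\  <sitemap><loc>https://crystal-lang.org/api/$CURRENT_VERSION/sitemap.xml</loc><lastmod>$(date --rfc-3339=date)</lastmod></sitemap>" sitemapindex.xml

aws s3 cp sitemapindex.xml "$S3_BUCKET/sitemapindex.xml"

ARGV=("$@")

# ARGV[0] is the current version; the previous releases start at index 1.
for (( i=1; i < $#; i++ )); do
  version=${ARGV[$i]}
  case $i in
    1)
      priority=0.5
      ;;
    2)
      priority=0.3
      ;;
    *)
      priority=0.1
      ;;
  esac

  aws s3 cp "$S3_BUCKET/$version/sitemap.xml" "sitemap-$version.xml"

  # Rewrite every <priority> element in this release's sitemap.
  sed -i "s|<priority>[0-9.]*</priority>|<priority>$priority</priority>|g" "sitemap-$version.xml"

  aws s3 cp "sitemap-$version.xml" "$S3_BUCKET/$version/sitemap.xml"
done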

bcardiff · 2019-10-18T17:22:10Z

Ok, let's make the doc tool generate the sitemap if instructed to do so. But it will need to know about the base URL: $ crystal docs --sitemap-base-url=https://crystal-lang.org/api/VERSION/ or something similar.

Then the maintenance of the root site map is more scriptable as proposed.

straight-shoota · 2019-10-18T18:24:11Z

We don't need another CLI option for this. That's exactly the same intent as --canonical-base-url.

bcardiff · 2019-10-18T19:03:16Z

The canonical-base-url is /latest always.
If there is no need to have a canonical-base then that setting might go away.
And they are different concerns.

straight-shoota · 2019-10-18T19:30:15Z

Oh yes, I mixed that up, sorry. It should go, because using canonical completely hides all older versions. So we can simply replace it.

straight-shoota · 2019-11-20T16:15:07Z

The compiler supports generating a sitemap now. We can proceed to get this integrated into the docs generation process.

Add DOCS_OPTIONS to distribution-scripts:
- For nightly: --sitemap-base-url=https://crystal-lang.org/api/master --sitemap-changefreq=daily --sitemap-priority=0.3
- For latest release: --sitemap-base-url=https://crystal-lang.org/api/$(version) --sitemap-changefreq=never --sitemap-priority=1.0
Pass the build type from .circle/config.yml to the distribution-script's workflow.
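
A sketch of how the build type could select these options. The function name and the build type names ("nightly", "release") are hypothetical, and how distribution-scripts would actually consume DOCS_OPTIONS is not shown here.

```shell
# Print the sitemap-related doc options for a build.
# $1: build type ("nightly" or "release"), $2: version (only used for releases)
docs_sitemap_options() {
  local build_type=$1 version=${2:-}
  case $build_type in
    nightly)
      echo "--sitemap-base-url=https://crystal-lang.org/api/master --sitemap-changefreq=daily --sitemap-priority=0.3"
      ;;
    release)
      echo "--sitemap-base-url=https://crystal-lang.org/api/$version --sitemap-changefreq=never --sitemap-priority=1.0"
      ;;
    *)
      echo "unknown build type: $build_type" >&2
      return 1
      ;;
  esac
}
```

For a release build this would be used as `DOCS_OPTIONS=$(docs_sitemap_options release 0.31.1)`.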

oprypin · 2020-04-06T22:55:50Z

I think that^ was not done yet

oprypin · 2020-04-06T23:23:27Z

I commented at crystal-lang/crystal#5952 (comment).

Should we dedupe the two issues?

oprypin · 2020-04-06T23:33:25Z

The summary of that is:
The <priority> tag, which is the crux of this suggestion, is explicitly documented as ignored by Google.

Also per another documentation page, sitemaps seem to not affect things much because the API docs site is already comprehensively linked internally (the highlighted part is a quote)

szabgab · 2021-05-16T06:36:21Z

Solving the Google ranking issue might be hard, but at least there could be a prominent warning at the top of each old page linking to the latest page, e.g. like this one: https://flask.palletsprojects.com/en/1.1.x/patterns/appfactories/

If re-generating the old docs is an issue then I think this could be done even with JavaScript. These links will be placed on the top of the page dynamically without the need to regenerate the old pages. This won't solve the ranking issue, but will help visitors reach the latest version of the documentation.

straight-shoota · 2022-12-12T18:05:35Z

It looks like this can finally be considered resolved. Google search consistently ranks search results for the latest release (https://crystal-lang.org/api/latest) highest.

ukd1 changed the title ~~Fix google rankings of docs by using weighting~~ Fix google rankings of docs by using weighting / google sitemaps Aug 8, 2019

straight-shoota mentioned this issue Aug 8, 2019

Outdated API doc links in Google search (bad SEO) crystal-lang/crystal#5952

Open

straight-shoota mentioned this issue Oct 18, 2019

[Compiler] Add sitemap to doc generator crystal-lang/crystal#8348

Merged
