Fix google rankings of docs by using weighting / google sitemaps #79
Comments
This is essentially discussed in crystal-lang/crystal#5952. Using a sitemap seems like a smart alternative to solve this issue. 👍 |
Does someone know if there is a tool for generating the sitemap that does some recursive checks over directories? Unless it can be automated it won't happen. Doing a pass over old docs to set the canonical seems more likely if that fixes the issue. |
This should be fairly simple to automate. Essentially you only need to extract all local links from each version's I can try to put this together if nobody else is interested. |
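A minimal sketch of the link-extraction step described above, assuming each version's docs are static HTML files that link to each other with relative `href` attributes. The function name `local_links` is hypothetical and not taken from any existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the unique local .html links found in one HTML file.
# Assumes links are written as href="relative/path.html".
# Links with a fragment (page.html#anchor) and absolute URLs are skipped.
local_links() {
  grep -o 'href="[^"]*\.html"' "$1" \
    | sed 's/^href="//; s/"$//' \
    | grep -v '^[a-z][a-z]*://' \
    | sort -u
}
```

Running it on a version's index page and prefixing each result with that version's base URL would give the list of pages for that version's sitemap.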
@straight-shoota nice, didn't think to check for the issue on the main repo somehow...lol. I'd be down for helping on this, but tbh, I have no idea how the docs or old versions are built - so it might be easier if you do it. If you'd like a hand / pair / lmk. |
this can be done with a simple sed script to adjust the header for old versions. The doc pages don't need to be regenerated. |
I'd suggest redoing those old pages before working on a sitemap |
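As a rough sketch of the sed approach mentioned above, assuming each old page contains a literal `</head>` closing tag and that the canonical target is the same path under `/api/latest/`. The function name `add_canonical` and both assumptions are illustrative; the real pages may need a different pattern.

```shell
# Hypothetical helper: print an HTML file with a canonical link inserted before </head>.
# $1 = HTML file, $2 = canonical URL.
# Assumes the URL contains no '|', '&' or '\' characters, which are special to sed here.
add_canonical() {
  sed "s|</head>|<link rel=\"canonical\" href=\"$2\" /></head>|" "$1"
}
```

For example, `add_canonical 0.20.1/HTTP/Client.html https://crystal-lang.org/api/latest/HTTP/Client.html` would print the page with the link added, without regenerating it.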
@RX14 The problem with canonical links is that docs for older versions vanish from the search results because the search algorithm treats them as duplicate content. But in fact, they're not duplicates, and users might need to find documentation for older versions as well. For example, when upgrading to a new version, you may need to read the documentation of deprecated features not available in the current version in order to find a suitable replacement. Using a sitemap is a superior solution because it allows assigning priorities to individual pages, and outdated pages don't vanish completely, they just won't be as prominent as more recent ones. I think it should eventually replace the canonical link. |
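For reference, the sitemaps.org protocol expresses priority as a `<priority>` element (0.0 to 1.0) inside each `<url>` entry. Below is a small sketch of emitting such entries; the function name `url_entry` and the priority values are illustrative only.

```shell
# Hypothetical helper: print one <url> entry of a sitemap.
# $1 = page URL (assumed to need no XML escaping), $2 = priority between 0.0 and 1.0.
url_entry() {
  printf '  <url>\n    <loc>%s</loc>\n    <priority>%s</priority>\n  </url>\n' "$1" "$2"
}

# Example: the recent release gets a higher priority than the old one.
url_entry "https://crystal-lang.org/api/0.30.0/HTTP/Client.html" 1.0
url_entry "https://crystal-lang.org/api/0.20.1/HTTP/Client.html" 0.1
```

The entries would still need to be wrapped in a `<urlset>` root element to form a complete sitemap file.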
I've put together a simple program to automatically generate sitemaps for https://crystal-lang.org/api It's available at https://github.com/straight-shoota/crystal_docs_sitemap Generated output: output.tar.gz The output contents should be published at https://crystal-lang.org/api/ and search engines need to be informed about the sitemap (see https://www.sitemaps.org/protocol.html#informing). |
@bcardiff WDYT? |
I agree the sitemap is worth having and is a good solution. I don't think having the sitemap checked in the repo is the right thing. From a workflow point of view, what would make sense is to have a tool to change an existing sitemap with some operations:
At least that workflow will play well with the release process, where we have a local dir with the new api documentation to upload. I am unsure how to keep the sitemap up to date with respect to the content from jekyll itself regarding lastmod of existing pages and new posts. Maybe there is a 3rd action
That way we can update those params without iterating the whole content. So, in essence, the idea is to have an approach that updates rather than creates a sitemap. |
Agreed. It just needs to be generated and put into an S3 bucket. Ideally, a rebuild should be triggered after the nightly API docs have been updated from master.
Currently, these sitemaps are only for
Sure, we can do that. I just figured it would be easier to simply run the generator and push the result to S3 without having to synchronize first. In practice, there are two events that would require an update to the sitemaps:
The priority adjustments could actually just be implemented with a simple grep. The contents don't change when a release ages, thus there is no need to actually rebuild the sitemap. Considering all this, it might actually be the best solution to integrate the sitemap generation into the doc generator. This problem is not specific to the stdlib and this way all shards API docs could benefit. To build the sitemaps for legacy releases, we can just use https://github.com/straight-shoota/crystal_docs_sitemap That's a one-time thing. With this, updates to |
Wouldn't that prevent indexing other pages?
I thought we didn't want a sitemap for master. It is mostly used for preview (edit: sorry, you mention it at the end)
I'm ok downloading the whole docs for a first time generation (edit: or using the proposed script), but upon a crystal version release I don't have the whole bucket of docs locally. And I don't want to be required to download it. What I do have is the |
No, sitemaps are not used as an exclusive source. Search engines still employ their regular crawling. They just augment the results or help discover pages that would otherwise not be discovered. See https://webmasters.stackexchange.com/questions/114425/if-i-remove-urls-from-an-xml-sitemap-will-google-still-index-them |
I guess it's not strictly necessary, but when the doc generator puts out the sitemap anyway, this requires no extra effort at all.
My suggestion is that the sitemap is generated directly by the docs generator, thus it would already be included in the When publishing a new release, you would just push the contents of
<sitemap loc="https://crystal-lang.org/api/{{version}}/sitemap.xml" lastmod="{{`date --rfc-3339=date`}}" />
And you would need to grab This could all be placed in a simple shell script which could automatically retrieve the files from S3, apply the changes and push them back up. I haven't tested this but the general idea looks like this:
CURRENT_VERSION=$1
aws s3 cp $S3_BUCKET/sitemapindex.xml sitemapindex.xml
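# Insert the new entry before the last line; double quotes let the shell expand the variables.
sed -i "\$ i\\ <sitemap loc=\"https://crystal-lang.org/api/$CURRENT_VERSION/sitemap.xml\" lastmod=\"$(date --rfc-3339=date)\" />" sitemapindex.xml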
aws s3 cp sitemapindex.xml $S3_BUCKET/sitemapindex.xml
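ARGV=("$@")
# Assumes $2, $3, ... are the previous versions, newest first (ARGV is 0-indexed, so ARGV[1] is $2).
for (( i=1; i < $#; i++ )); do
  version=${ARGV[$i]}
  case $i in
    1)
      priority=0.5
      ;;
    2)
      priority=0.3
      ;;
    *)
      priority=0.1
      ;;
  esac
  aws s3 cp "$S3_BUCKET/$version/sitemap.xml" "sitemap-$version.xml"
  # [0-9] replaces \d, which sed does not support; the closing quote is part of the match.
  sed -i "s/priority=\"[0-9]\.[0-9]\"/priority=\"$priority\"/" "sitemap-$version.xml"
  aws s3 cp "sitemap-$version.xml" "$S3_BUCKET/$version/sitemap.xml"
done |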
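Note that the sitemaps.org protocol writes sitemap index entries with `<loc>` and `<lastmod>` child elements rather than attributes. Below is a sketch of producing such an entry; the function name `index_entry` is hypothetical, and `date +%Y-%m-%d` is used as a portable way to get the date.

```shell
# Hypothetical helper: print one <sitemap> entry for a sitemap index file.
# $1 = API docs version, $2 = last modification date (defaults to today, YYYY-MM-DD).
index_entry() {
  local version=$1
  local lastmod=${2:-$(date +%Y-%m-%d)}
  printf '  <sitemap>\n    <loc>https://crystal-lang.org/api/%s/sitemap.xml</loc>\n    <lastmod>%s</lastmod>\n  </sitemap>\n' "$version" "$lastmod"
}
```

The output of `index_entry "$CURRENT_VERSION"` could then be inserted before the closing `</sitemapindex>` tag.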
Ok, let's make the doc tool generate the sitemap if instructed so. But it will need to know about the base url. Then the maintenance of the root site map is more scriptable as proposed. |
We don't need another CLI option for this. That's exactly the same intent as |
The canonical-base-url is |
Oh yes, I mixed that up, sorry. It should go, because using |
The compiler supports generating a sitemap now. We can proceed to get this integrated into the docs generation process.
|
I think that^ was not done yet |
I commented at crystal-lang/crystal#5952 (comment). Should we dedupe the two issues? |
The summary of that is: Also per another documentation page, sitemaps seem to not affect things much because the API docs site is already comprehensively linked internally (the highlighted part is a quote) |
Solving the Google ranking issue might be hard, but at least there could be a prominent warning at the top of each old page linking to the latest page, e.g. like this one: https://flask.palletsprojects.com/en/1.1.x/patterns/appfactories/ If re-generating the old docs is an issue, then I think this could be done even with JavaScript. These links would be placed at the top of the page dynamically without the need to regenerate the old pages. This won't solve the ranking issue, but will help visitors reach the latest version of the documentation. |
It looks like this can finally be considered resolved. Google search consistently ranks search results for the latest release (https://crystal-lang.org/api/latest) highest. |
For me I get:
aka, linking to https://crystal-lang.org/api/0.20.1/HTTP/Client.html
I actually want https://crystal-lang.org/api/0.30.0/HTTP/Client.html, or "latest" (https://crystal-lang.org/api/latest/HTTP/Client.html).
This can be done with a sitemap (https://en.wikipedia.org/wiki/Sitemaps) using weighting, aka priority, etc.