Commit 9812222

Merge pull request #177 from nationalarchives/chore/add-adr-for-noindex-and-sitemap-changes
[FCL-284] Add ADR explaining the change to our robots.txt
2 parents 7ef9cac + e53878d commit 9812222

# 20. Allow crawling in robots.txt and use sitemap to promote discovery

Date: 2024-09-16

## Status

Accepted

## Context

At the moment our `robots.txt` explicitly disallows crawling of the site. This is nominally to discourage search engines from indexing judgments; however, search engines may still index pages that are linked from other sites, using related terms on those pages to infer the title or content of the document.

For each document on the service we provide a `noindex` robots directive, both through an HTML meta tag and an HTTP header, but because crawling and scraping of the page is forbidden by `robots.txt`, search engines will not crawl the page and so cannot discover that they shouldn't index it. This means that pages may still appear in search engine results.
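
The per-document directive is delivered in two equivalent forms (a sketch; the exact markup on the service may differ):

```html
<!-- In the document's <head>: asks crawlers not to index this page -->
<meta name="robots" content="noindex">
```

The same instruction can also be sent as the `X-Robots-Tag: noindex` HTTP response header, which additionally covers non-HTML representations of a document.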

The spirit of the service is that individual documents should not appear in search indexes. To achieve this we (somewhat counter-intuitively) need to allow crawling, so that search engines can discover that they shouldn't include pages in their indexes.

## Decision

- `robots.txt` on Public UI will be changed to allow crawling of the entire site.
- A new sitemap, rooted at `sitemap.xml`, will be provided which allows easy discovery of all documents for the purposes of crawling. This means search engines can rapidly discover the entire corpus of documents, crawl them, and mark them for exclusion from their search indexes.
- This sitemap will be referenced in `robots.txt`, as well as manually submitted to major search engines.
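
The resulting `robots.txt` might look like this (a sketch using a placeholder hostname; the live file may differ):

```text
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

With crawling allowed, the per-page `noindex` directives can then do the work of keeping individual documents out of search results.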

## Consequences

- Search engines that obey robots `noindex` directives will be better able to proactively flag documents as not for inclusion in their indexes.
- Users looking to crawl the site will be able to discover all site content.
