Archiving the Intranet, thankfully, is a task made simple using the following technologies:
- Cloud Platform
- AWS S3
- AWS CloudFront
- HTTrack CLI
- NodeJS Server
Access to the snapshot is granted if you:
- Have access to the MoJ Intranet, and
- Have an MoJ Intranet account that meets the necessary permissions, as defined in that codebase.
Access points:
Please get in touch with the Intranet team on Slack for further information.
- A user logs into the Intranet.
- The user clicks a link to the archive.
- This submits a POST request to the archive NodeJS server. The payload contains an expiry and the user's agency. The request is signed with a shared secret, and the server validates the signature.
- The NodeJS server responds by redirecting to the CloudFront distribution. The redirect carries cookies, so that the user can access the snapshot.
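
The signing and validation step in the flow above could look roughly like the sketch below. It is illustrative only: the HMAC-SHA256 scheme, the hex encoding, the JSON payload shape and the function names `signPayload` and `isValidRequest` are all assumptions, not taken from the Intranet or Archiver codebases.

```javascript
// Sketch only: payload field names, HMAC-SHA256 and hex encoding are
// assumptions, not the scheme used by the real Intranet or Archiver code.
const crypto = require("node:crypto");

// Intranet side: sign the serialised payload with the shared secret.
function signPayload(payload, sharedSecret) {
  return crypto.createHmac("sha256", sharedSecret).update(payload).digest("hex");
}

// Archive side: recompute the signature, compare in constant time,
// then check the expiry (assumed to be a Unix timestamp in seconds).
function isValidRequest(payload, signature, sharedSecret, now = Date.now()) {
  const expected = Buffer.from(signPayload(payload, sharedSecret), "hex");
  const received = Buffer.from(String(signature), "hex");
  if (expected.length !== received.length) return false;
  if (!crypto.timingSafeEqual(expected, received)) return false;
  const { expiry } = JSON.parse(payload);
  return Number(expiry) * 1000 > now;
}

const secret = "example-shared-secret"; // placeholder value
const payload = JSON.stringify({
  agency: "hq",
  expiry: Math.floor(Date.now() / 1000) + 60,
});
const signature = signPayload(payload, secret);
console.log(isValidRequest(payload, signature, secret));
```

Comparing with `crypto.timingSafeEqual` rather than `===` avoids leaking how many leading characters of a forged signature were correct.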
Find the config file at deploy/<namespace>/config.yml.
Update the SNAPSHOT_SCHEDULE environment variable with values for the desired agency.
It should follow the pattern <namespace>::<agency>::<day-of-week>::<hh:mm>.
Multiple values should be comma-separated.
e.g. dev::hq::Mon::17:30::3,dev::hmcts::Thu::17:30::3
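
To make the schedule format concrete, here is a minimal parsing sketch. `parseSchedule` is a hypothetical helper, not a function from the Archiver. The example values carry a trailing number that the pattern does not describe; the sketch assumes it is a crawl depth and treats it as optional.

```javascript
// Sketch only: parseSchedule is hypothetical, and reading the trailing
// number as a crawl depth is an assumption based on the example values.
function parseSchedule(value) {
  return value
    .split(",")
    .map((entry) => entry.trim())
    .filter(Boolean)
    .map((entry) => {
      // "17:30" survives the split because it contains a single colon only.
      const [namespace, agency, dayOfWeek, time, depth] = entry.split("::");
      if (!namespace || !agency || !dayOfWeek || !/^\d{2}:\d{2}$/.test(time ?? "")) {
        throw new Error(`Invalid schedule entry: ${entry}`);
      }
      const [hour, minute] = time.split(":").map(Number);
      return {
        namespace,
        agency,
        dayOfWeek,
        hour,
        minute,
        depth: depth === undefined ? undefined : Number(depth),
      };
    });
}

console.log(parseSchedule("dev::hq::Mon::17:30::3,dev::hmcts::Thu::17:30::3"));
```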
Snapshot scheduling should cover the project's use case; manually creating a snapshot is limited to developers for debugging.
To do so, port-forward to the running service and make a POST request to the /spider endpoint.
# example POST request
curl -X POST http://localhost:2000/spider -d "agency=hq&env=dev&depth=2"
See the Cloud Platform and Commands sections below.
Note that creating a snapshot of the intranet from a local machine caused resource-related issues, such as VPN timeouts and rate limiting.
Requires
- Docker
Clone to your machine:
git clone https://github.com/ministryofjustice/intranet-archive.git && cd intranet-archive
Start docker compose:
make run
There is a script designed to help you install the Dory Proxy, if you'd like to.
If you chose to install Dory, you can access the application here:
Otherwise, access the application here:
Let's begin with servers and their interactions within...
The Archiver has an Nginx server, which displays responses from the underlying NodeJS server. Node processes form requests and decides how to treat them. If a request is acceptable, Node instructs HTTrack to perform a website copy operation, using predefined options and a custom plugin.
At the very heart of the Archiver sits HTTrack. This application is configured by Node to take a snapshot of the MoJ Intranet. Potentially, you can point the Archiver at any website address and, using the settings for the Intranet, it will attempt to create an isolated copy of it.
The output of HTTrack can be seen in Docker Compose's stdout in the running terminal window; however, a more detailed and linear output stream is available in the hts-log.txt file. You can find this in the root of the snapshot.
During the build of the Archiver, we came across many challenges, two of which almost prevented our proof of concept from succeeding. The first was an inability to display images. The second was an inability to download them.
1) The HTTrack srcset problem
In modern browsers, the srcset attribute is used to render a correctly sized image for the device it is loaded on. This helps to manage bandwidth and save the user money. The trouble is that HTTrack doesn't rewrite the URLs in srcset attributes, so images that rely on the attribute fail to display.
Using srcset in the Archive adds little value, so we decided to remove it completely. We use HTTrack's -V option, which lets us execute a command on every file that is downloaded. In particular, we run the following sed command, where $0 is the file reference in HTTrack.
# find all occurrences of srcset in the file referenced by $0
# select and remove, including contents.
sed -i 's/srcset="[^"]*"//g' $0
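
For readers less familiar with sed, the same substitution can be illustrated in Node. `stripSrcset` is a hypothetical helper written for this example; the Archiver itself runs the sed command above. The sample markup is made up.

```javascript
// Sketch only: stripSrcset is a hypothetical helper that mirrors the sed
// expression; it is not part of the Archiver.
function stripSrcset(html) {
  // Same pattern as the sed command: srcset="..." including its contents.
  return html.replace(/srcset="[^"]*"/g, "");
}

const before = '<img src="/a.jpg" srcset="/a-480.jpg 480w, /a-800.jpg 800w" alt="A">';
console.log(stripSrcset(before));
```

The src attribute is left untouched, so the browser falls back to the single image URL that HTTrack did rewrite.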
2) The HTTrack /agency-switcher problem
We do not want the archiver to crawl the /agency-switcher page, because the page is unnecessary in the context of browsing an agency's archived snapshot.
We use a custom command to replace the agency switcher link on every page with a link to the root of the CDN domain. The root shows the index page, from which the user can navigate to the agency they want to view.
# find all occurrences of href="https://intranet.justice.gov.uk/agency-switcher/" in the file referenced by $0
# and replace them with href="/".
sed -i 's|href="https://intranet.justice.gov.uk/agency-switcher/"|href="/"|g' $0
All processing for HTTrack is managed in the process.js file located in the NodeJS application. There you will find all the options used to set HTTrack up.
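
As a rough illustration of how Node might assemble an HTTrack command line, consider the sketch below. It is not the contents of process.js: `buildHttrackArgs`, its parameters and the example output path are hypothetical, and combining both sed expressions into a single -V command is an assumption. The flags themselves (-O for the output path, -rN for mirror depth, -V for a per-file command) are standard HTTrack options.

```javascript
// Sketch only: buildHttrackArgs and its parameters are hypothetical; the
// real options are defined in process.js.
function buildHttrackArgs({ url, outputDir, depth }) {
  // One sed call with two expressions, run on each downloaded file ($0).
  const onFile = [
    "sed -i",
    `-e 's/srcset="[^"]*"//g'`,
    `-e 's|href="https://intranet.justice.gov.uk/agency-switcher/"|href="/"|g'`,
    "$0",
  ].join(" ");

  return [
    url,
    "-O", outputDir, // output path for the mirror
    `-r${depth}`,    // mirror depth
    "-V", onFile,    // command executed after each file is downloaded
  ];
}

const args = buildHttrackArgs({
  url: "https://intranet.justice.gov.uk/",
  outputDir: "/tmp/snapshot-example", // placeholder path
  depth: 3,
});
console.log(args);
```

An array like this could be handed to `child_process.spawn("httrack", args)`, which passes each element as a separate argument and so avoids an extra layer of shell quoting.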
To understand the build process further, please look at the Makefile.
In line with good security practice, /access is the only route that is publicly open when this application is deployed to the Cloud Platform.
The /access route redirects users to the CloudFront distribution, where they can access the snapshot.
The private routes, /status and /spider, are for developer use only. To access these endpoints, port-forward to the service. See the command below.
It may be possible to interact with running pods with help from this cheatsheet. Please be aware that with every call to the CP k8s cluster, you will need to provide the namespace, as shown below:
kubectl -n intranet-archive-dev
Kubernetes
# list available pods for the namespace
kubectl -n intranet-archive-dev get pods
# copy a log file from a pod to your local machine
# update pod-id, agency and date
kubectl -n intranet-archive-dev cp intranet-archive-dev-<pod-id>:/archiver/snapshots/intranet.justice.gov.uk/<agency>/<date>/hts-log.txt ~/hts-log.txt
# port-forward local port 2000 to port 80 of the running service
kubectl -n intranet-archive-dev port-forward service/intranet-archive-service 2000:80
Make
Command | Description |
---|---|
make image | Used by the GitHub action, cd.yml, during the build step |
make launch | Checks if the intranet docker instance is running; if not, launches dory and docker in the background and opens the site in the system's default browser |
make run | Launches the application locally with docker compose up, requiring env + dory |
make down | Alias of docker compose down. |
make bash | Opens a bash shell on the spider container. The application must already be running (e.g. via make run) before this can be used. |