Archive utility to capture working snapshots of justice.gov.uk. We use the following technologies to achieve this:
- Cloud Platform
- AWS S3
- AWS CloudFront
- HTTrack CLI
- NodeJS Server
Please ask the Central Digital Product Team for the INDEX URL to view the archives.
Access is granted if you are in possession of our basic-auth credentials; these are different from the credentials mentioned above.
Access point for archive-user: Cloud Platform
Note that creating a snapshot from a local machine caused resource-related issues, such as rate limiting.
Requires
- Docker
Clone to your machine:
git clone https://github.com/ministryofjustice/justice-website-archive.git && cd justice-website-archive
Start docker compose:
make run
There is a script designed to help you install the Dory Proxy, if you'd like to.
If you chose to install Dory, you can access the application here:
Otherwise, access the application here:
Let's begin with servers and their interactions within...
The Archiver has an Nginx server. It displays responses from the underlying NodeJS server, where Node handles form requests and decides how to treat them.
Essentially, if happy with the request, Node will instruct HTTrack to perform a website copy operation, and it does this with predefined options.
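As a rough illustration of the copy operation, the sketch below prints a minimal HTTrack command rather than running it. The function name `build_httrack_cmd` and the output directory are hypothetical; the only HTTrack option used is `-O` (output path), and the options the Archiver really uses are the ones defined in process.js.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints an httrack command instead of running it.
# The Archiver's real options are set by Node in process.js.
build_httrack_cmd() {
  local url="$1"       # start address of the site to copy
  local out_dir="$2"   # directory that receives the snapshot
  # -O tells HTTrack where to write the mirror and its log files
  printf 'httrack %s -O %s\n' "$url" "$out_dir"
}

build_httrack_cmd "https://www.justice.gov.uk" "/tmp/snapshot"
```

Printing the command first makes it easy to check the target and output path before a long-running copy begins.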
Supercronic is used within the app for scheduling. We have two schedules defined:
- S3 data-sync
- Daily snapshot
Using the AWS CLI, our data-sync executes aws s3 sync /snapshot/ s3://our-bucket every 6 minutes whilst a spider operation is alive. When the operation completes, the schedule is cancelled and a final data-sync takes place to ensure all snapshot data has been transferred.
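The data-sync lifecycle can be sketched as a plain shell loop. This is not how the app does it (the app schedules the sync with Supercronic); the loop only illustrates the "sync while alive, then sync once more" behaviour. `SYNC_CMD`, `INTERVAL` and `sync_while_alive` are hypothetical names introduced for the example.

```shell
#!/usr/bin/env bash
# Illustration only: the Archiver uses a Supercronic schedule, not a loop.
# SYNC_CMD and INTERVAL are hypothetical knobs so the sketch can be tried safely.
SYNC_CMD="${SYNC_CMD:-aws s3 sync /snapshot/ s3://our-bucket}"
INTERVAL="${INTERVAL:-360}"   # 6 minutes, in seconds

sync_while_alive() {
  local spider_pid="$1"
  # kill -0 sends no signal; it only checks that the process still exists
  while kill -0 "$spider_pid" 2>/dev/null; do
    $SYNC_CMD            # unquoted on purpose so the command splits into words
    sleep "$INTERVAL"
  done
  # last and final sync once the spider operation has finished
  $SYNC_CMD
}
```

Setting `SYNC_CMD="echo sync"` and `INTERVAL=0` before calling `sync_while_alive <pid>` lets you watch the behaviour without touching S3.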
At 3 am each morning, a snapshot process is launched by sending an authorised POST request to the node service. Once accepted, S3 data-sync is scheduled; S3 data-sync only runs during a snapshot process.
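A daily 3 am schedule in crontab syntax, which Supercronic reads, might look like the sketch below. The environment variables `ARCHIVER_URL`, `ARCHIVE_USER` and `ARCHIVE_PASS`, the use of basic auth for the authorised request, and the `write_crontab` helper are all assumptions made for illustration; check the repository for the real schedule and endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes an example crontab file for a daily snapshot.
# ARCHIVER_URL, ARCHIVE_USER and ARCHIVE_PASS are placeholder variable names.
write_crontab() {
  local target="$1"
  # The quoted 'EOF' keeps the ${...} references literal in the file,
  # so they are expanded when the job runs, not when the file is written.
  cat > "$target" <<'EOF'
# minute hour day-of-month month day-of-week command
0 3 * * * curl --silent --fail -X POST -u "${ARCHIVE_USER}:${ARCHIVE_PASS}" "${ARCHIVER_URL}"
EOF
}
```

The `0 3 * * *` expression fires at 03:00 every day; `--fail` makes curl exit non-zero if the service rejects the request.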
At the very heart of the Archiver sits HTTrack. This application is configured by Node to take a snapshot of the MoJ Intranet. Potentially, you can point the Archiver at any website address and, using the settings for the Intranet, it will attempt to create an isolated copy of it.
HTTrack's output appears in Docker Compose's stdout in the running terminal window. However, a more detailed and linear output stream is available in the hts-log.txt file, which you can find in the root of the snapshot.
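For a quick health check of a snapshot, you could count how many lines of hts-log.txt mention a given level. This sketch assumes log lines contain labels such as `Error:` or `Warning:`; confirm that against your own log before relying on the numbers. `count_log_entries` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts log lines containing "<level>:" (case-insensitive).
# Assumes hts-log.txt labels entries with words such as "Error:" or "Warning:".
count_log_entries() {
  local level="$1"
  local log_file="$2"
  # grep -c prints the number of matching lines; it exits 1 when the count
  # is zero, so "|| true" keeps the function's exit status at 0
  grep -c -i -- "${level}:" "$log_file" || true
}
```

For example, `count_log_entries Error hts-log.txt` prints the number of lines that contain `Error:`.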
All application processing for HTTrack is managed in the process.js file, located in the NodeJS application. There you will find all the options used to set HTTrack up.
To understand the build process further, please look at the Makefile.
Interact with running pods with help from this cheatsheet. Please be aware that with every call to the CP k8s cluster, you will need to provide the namespace, as shown below:
kubectl -n justice-archiver-dev
# make interaction a little easier; we can create repeatable
# variables, our namespace is the same name as the app, defined
# in ./kubectl_deploy/development/deployment.tpl
# set some vars, gets the first available pod (only one in our case)
K8S_NSP="justice-archiver-dev"
K8S_POD=$(kubectl -n ${K8S_NSP} get pod -l app=${K8S_NSP} -o jsonpath="{.items[0].metadata.name}")
After setting the above variables (copy -> paste -> execute), the following blocks of commands will also work using copy -> paste -> execute.
# list available pods and their status for the namespace
kubectl get pods -n ${K8S_NSP}
# describe the first available pod
kubectl describe pod ${K8S_POD} -n ${K8S_NSP}
# monitor the system log of the first pod
kubectl logs -f ${K8S_POD} -n ${K8S_NSP}
# open an interactive shell on an active pod
kubectl exec -it ${K8S_POD} -n ${K8S_NSP} -- bash
Once you have an interactive shell, you can communicate with S3:
# list bucket directories
aws s3 ls s3://${S3_BUCKET_NAME}/
# get a list of snapshots
aws s3 ls s3://${S3_BUCKET_NAME}/www.justice.gov.uk/
# get a list of snapshot files - replace <date>
aws s3 ls s3://${S3_BUCKET_NAME}/www.justice.gov.uk/<date>-03-00/ --recursive --human-readable
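To pick the most recent snapshot without reading the listing by eye, the output of `aws s3 ls` can be filtered. The sketch assumes snapshot directories are listed as `PRE <name>/` lines and that names follow the zero-padded date pattern seen in this document (for example 2023-10-24-18-03), so that a plain sort orders them chronologically. `latest_snapshot` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `aws s3 ls` output on stdin and prints the
# newest snapshot directory name. Assumes zero-padded, date-based names.
latest_snapshot() {
  # keep only prefix ("PRE") lines, strip the trailing slash,
  # then sort so the last line is the most recent date
  awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }' | sort | tail -n 1
}
```

Usage, from the interactive shell on the pod: `aws s3 ls "s3://${S3_BUCKET_NAME}/www.justice.gov.uk/" | latest_snapshot`.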
Copy a log file from a pod to your local machine
# change the date to one that exists
SCRAPE_DATE="2023-10-24-18-03";
# uses K8S_NSP & K8S_POD variables
kubectl -n ${K8S_NSP} cp ${K8S_POD}:/archiver/snapshots/www.justice.gov.uk/"${SCRAPE_DATE}"/hts-log.txt ~/hts-log.txt
We use a Makefile to reduce some complex or repetitive commands to simple make commands.