www.justice.gov.uk Archiver

Workflow

Archive utility to capture working snapshots of justice.gov.uk. We use the following technologies to achieve this:

Cloud Platform
AWS S3
AWS CloudFront
HTTrack CLI
NodeJS Server

Viewing the latest snapshot

Please ask the Central Digital Product Team for the INDEX URL to view the archives.

Creating a snapshot

Access is granted if you are in possession of our basic-auth credentials; these are different from the credentials mentioned above.

Access point for archive-user: Cloud Platform

Local development

Note that creating a snapshot from a local machine has caused resource-related issues, such as rate limiting.

Requires

Docker

Installation

Clone to your machine:

git clone https://github.com/ministryofjustice/justice-website-archive.git && cd justice-website-archive

Start docker compose:

make run

There is a script designed to help you install the Dory Proxy, if you'd like to.

If you chose to install Dory, you can access the application here:

http://spider.justice.docker/

Otherwise, access the application here:

http://localhost:8080/

Understanding application logic

Let's begin with servers and their interactions within...

The Archiver has an Nginx server. It displays responses from the underlying NodeJS server, where Node processes form requests and decides how to handle them.

In essence, if Node accepts the request, it instructs HTTrack to perform a website copy operation using predefined options.
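As a rough illustration of the kind of command Node might hand to HTTrack, here is a minimal sketch. The helper name build_httrack_cmd is hypothetical and the option set is illustrative only; the project's real options are defined in process.js and may differ. The flags used (-O for the output path, a "+" filter to keep the crawl on one host, -v to log on screen) are standard HTTrack options.

```shell
#!/bin/sh
# Hypothetical helper: prints an httrack command for one snapshot run.
# The real options live in process.js; the ones below are illustrative only.
build_httrack_cmd() {
  url="$1"       # site to copy, e.g. https://www.justice.gov.uk/
  out_dir="$2"   # directory the snapshot is mirrored into
  # Strip the scheme and any path to get the bare host name.
  host=$(printf '%s' "$url" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')
  # -O sets the output path, "+*host/*" is a filter that allows links on
  # that host, -v logs progress on screen.
  printf 'httrack "%s" -O "%s" "+*%s/*" -v\n' "$url" "$out_dir" "$host"
}

build_httrack_cmd "https://www.justice.gov.uk/" "/archiver/snapshots/www.justice.gov.uk/2023-10-24-18-03"
```

The function only prints the command, so it can be inspected before anything is run.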

Supercronic is used within the app for scheduling. We have two schedules defined:

S3 data-sync
Daily snapshot

S3 data-sync

Using the AWS CLI, our data-sync executes aws s3 sync /snapshot/ s3://our-bucket every 6 minutes while a spider operation is alive. When the operation completes, the schedule is cancelled and one final data-sync runs to ensure all snapshot data has been transferred.
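The six-minute schedule described above could be expressed as a single crontab line. The sketch below prints such a line; the helper name sync_cron_line is hypothetical, and how the project actually registers and cancels the schedule with Supercronic is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: prints a crontab line that a scheduler such as
# Supercronic could load while a spider operation is alive.
# The directory and bucket name are the placeholders used in this README.
sync_cron_line() {
  src="$1"      # local snapshot directory
  bucket="$2"   # destination bucket name
  # "*/6 * * * *" fires every 6 minutes (minutes 0, 6, 12, ... 54).
  printf '*/6 * * * * aws s3 sync %s s3://%s\n' "$src" "$bucket"
}

sync_cron_line "/snapshot/" "our-bucket"
```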

Daily snapshot

At 3 am each morning, a snapshot process is launched by sending an authorised POST request to the node service. Once accepted, S3 data-sync is scheduled; S3 data-sync only runs during a snapshot process.
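A daily trigger of this kind might look like the crontab line printed by the sketch below. Everything specific in it is an assumption made for illustration: the /spider endpoint, the "url" form field, and the ARCHIVE_USER and ARCHIVE_PASS variable names are not taken from the project. It also assumes the scheduler runs the job through a shell so that the two variables expand at run time.

```shell
#!/bin/sh
# Hypothetical helper: prints a crontab line for the 3 am snapshot trigger.
# The endpoint, the "url" form field and the ARCHIVE_USER / ARCHIVE_PASS
# variable names are assumptions, not taken from the project.
daily_cron_line() {
  endpoint="$1"   # where the Node service accepts snapshot requests
  site="$2"       # site to snapshot
  # "0 3 * * *" fires at 03:00 every day. curl --data sends a POST and
  # --user supplies the basic-auth credentials.
  printf '0 3 * * * curl --fail --silent --user "$ARCHIVE_USER:$ARCHIVE_PASS" --data "url=%s" %s\n' "$site" "$endpoint"
}

daily_cron_line "http://localhost:8080/spider" "https://www.justice.gov.uk/"
```

Keeping the credentials in environment variables avoids writing them into the crontab itself.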

HTTrack

At the very heart of the Archiver sits HTTrack. This application is configured by Node to take a snapshot of justice.gov.uk. In principle, you can point the Archiver at any website address and, using the same settings, it will attempt to create an isolated copy of it.

Debugging

HTTrack's output appears in Docker Compose's stdout in the running terminal window; however, a more detailed, linear output stream is available in the hts-log.txt file, which you can find in the root of the snapshot.

Testing and making modifications

All HTTrack processing is managed in the process.js file, located in the NodeJS application. There you will find all the options used to configure HTTrack.

To understand the build process further, please look at the Makefile.

Kubernetes

Interact with running pods with help from this cheatsheet. Please be aware that with every call to the CP k8s cluster, you will need to provide the namespace, as shown below:

kubectl -n justice-archiver-dev

Useful commands

# make interaction a little easier; we can create repeatable 
# variables, our namespace is the same name as the app, defined 
# in ./kubectl_deploy/development/deployment.tpl

# set some vars, gets the first available pod (only one in our case)
K8S_NSP="justice-archiver-dev"
K8S_POD=$(kubectl -n "${K8S_NSP}" get pod -l app="${K8S_NSP}" -o jsonpath="{.items[0].metadata.name}")

Once the variables above are set (copy, paste, execute), the following blocks of commands can be run the same way.

# list available pods and their status for the namespace
kubectl get pods -n ${K8S_NSP}

# describe the first available pod
kubectl describe pod ${K8S_POD} -n ${K8S_NSP}

# monitor the system log of the first pod
kubectl logs -f ${K8S_POD} -n ${K8S_NSP}

# open an interactive shell on an active pod
kubectl exec -it ${K8S_POD} -n ${K8S_NSP} -- bash

Once you have an interactive shell, you can communicate with S3:

# list bucket directories
aws s3 ls s3://${S3_BUCKET_NAME}/

# get a list of snapshots
aws s3 ls s3://${S3_BUCKET_NAME}/www.justice.gov.uk/

# get a list of snapshot files - replace <date> 
aws s3 ls s3://${S3_BUCKET_NAME}/www.justice.gov.uk/<date>-03-00/ --recursive --human-readable

Copy a log file from a pod to your local machine

# change the date to one that exists
SCRAPE_DATE="2023-10-24-18-03";

# uses K8S_NSP & K8S_POD variables  
kubectl -n ${K8S_NSP} cp ${K8S_POD}:/archiver/snapshots/www.justice.gov.uk/"${SCRAPE_DATE}"/hts-log.txt ~/hts-log.txt
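Because the copy fails with an unhelpful error when the date is mistyped, a small format check before running kubectl cp can save a round trip. The sketch below is an optional guard; the function name is_scrape_date is hypothetical, and it only checks the YYYY-MM-DD-HH-MM shape used in the example above, not whether that snapshot exists on the pod.

```shell
#!/bin/sh
# Hypothetical guard: succeeds only if $1 looks like YYYY-MM-DD-HH-MM,
# the shape of the SCRAPE_DATE example above. It does not check that the
# snapshot actually exists on the pod.
is_scrape_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

SCRAPE_DATE="2023-10-24-18-03"
if is_scrape_date "$SCRAPE_DATE"; then
  echo "ok: ${SCRAPE_DATE}"
else
  echo "unexpected format: ${SCRAPE_DATE}" >&2
fi
```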

Makefile

We use a Makefile to reduce complex or repetitive commands to simple make commands.

Make commands

| Command | Description |
|---|---|
| `make image` | Used by the GitHub action, cd.yml, during the build step |
| `make launch` | Checks if the docker instance is running; if not, launches dory and docker in the background and opens the site in the system's default browser |
| `make run` | Launches the application locally with `docker compose up`, requiring `env` + `dory` |
| `make down` | Alias of `docker compose down` |
| `make shell` | Opens a bash shell on the spider container. The application must already be running (e.g. via `make run`) before this can be used |
| `make sync` | Opens a bash shell and executes `s3sync`. Uploads all assets to AWS S3 |
