Table of contents

  1. Workflow
  2. Viewing the latest snapshot
  3. Creating a snapshot
  4. Local development
    1. Installation
    2. Understanding application logic
  5. HTTrack
    1. Debugging
    2. Testing and making modifications
  6. Kubernetes
    1. Commands
  7. Makefile
    1. Commands

Workflow

Archive utility to capture working snapshots of justice.gov.uk. We use the following technologies to achieve this:

  1. Cloud Platform
  2. AWS S3
  3. AWS CloudFront
  4. HTTrack CLI
  5. NodeJS Server

Viewing the latest snapshot

Please ask the Central Digital Product Team for the INDEX URL to view the archives.

Creating a snapshot

Access is granted if you are in possession of our basic-auth credentials; these are different from the credentials mentioned above.

Access point for archive-user: Cloud Platform

Local development

Note that creating a snapshot from a local machine has caused resource-related issues, such as rate limiting.

Requires

  • Docker

Installation

Clone to your machine:

git clone https://github.com/ministryofjustice/justice-website-archive.git && cd justice-website-archive

Start docker compose:

make run

There is a script designed to help you install the Dory Proxy, if you'd like to.

If you chose to install Dory, you can access the application here:

http://spider.justice.docker/

Otherwise, access the application here:

http://localhost:8080/




Understanding application logic

Let's begin with servers and their interactions within...

The Archiver has an Nginx server, which displays responses from the underlying NodeJS server. Node processes form requests and decides how to handle them.

If Node accepts the request, it instructs HTTrack to perform a website copy operation using predefined options.
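
As a rough illustration of what such a copy operation looks like on the command line, the sketch below assembles an httrack call for a target URL. The real options live in process.js and are not reproduced here. The flags used (-O for the output path, -r for the mirror depth, -c for the number of connections) are genuine HTTrack options, but their values and the function name build_httrack_cmd are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the httrack command that would mirror a site.
# The flag values below are illustrative and are NOT the ones set in process.js.
build_httrack_cmd() {
  url="$1"             # site to copy, e.g. https://www.justice.gov.uk/
  stamp="$2"           # snapshot folder name, e.g. 2023-10-24-18-03
  host="${url#*://}"   # strip the scheme
  host="${host%%/*}"   # strip any path, leaving only the hostname
  printf 'httrack %s -O %s -r%s -c%s\n' \
    "$url" "/archiver/snapshots/${host}/${stamp}" 6 4
}

build_httrack_cmd "https://www.justice.gov.uk/" "2023-10-24-18-03"
```

Printing the command rather than running it makes it easy to inspect the resulting options before starting a long copy operation.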

Supercronic is used within the app for scheduling. We have two schedules defined:

  1. S3 data-sync
  2. Daily snapshot

S3 data-sync

Using the AWS CLI, our data-sync executes aws s3 sync /snapshot/ s3://our-bucket every 6 minutes while a spider operation is alive. When the operation completes, the schedule is cancelled and a final data-sync runs to ensure all snapshot data has been transferred.
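
The schedule's lifecycle can be sketched as a pair of shell functions that add and remove a crontab entry. This is only an illustration: the crontab path, the function names and the way Supercronic is told to pick up the change are assumptions, and the bucket name is the placeholder used above.

```shell
#!/bin/sh
# Hypothetical crontab location; the real path is not given in this README.
CRONTAB_FILE="${CRONTAB_FILE:-/tmp/archiver-crontab}"
# Standard five-field cron expression: run every 6th minute.
SYNC_JOB='*/6 * * * * aws s3 sync /snapshot/ s3://our-bucket'

# Add the six-minute sync entry when a spider operation starts (idempotent).
schedule_sync() {
  touch "$CRONTAB_FILE"
  grep -qxF "$SYNC_JOB" "$CRONTAB_FILE" || printf '%s\n' "$SYNC_JOB" >> "$CRONTAB_FILE"
}

# Remove the entry when the operation completes. A final, one-off
# "aws s3 sync" would follow this step; it is not run here.
cancel_sync() {
  grep -vxF "$SYNC_JOB" "$CRONTAB_FILE" > "${CRONTAB_FILE}.tmp" || true
  mv "${CRONTAB_FILE}.tmp" "$CRONTAB_FILE"
}
```

Matching the whole line with grep -xF keeps any other entries in the crontab untouched when the sync job is cancelled.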

Daily snapshot

At 3 am each morning, a snapshot process is launched by sending an authorised POST request to the node service. Once accepted, S3 data-sync is scheduled; S3 data-sync only runs during a snapshot process.
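
A request of this kind might be sent with curl, as in the sketch below. Only the POST method and the need for authorisation come from the description above. The endpoint path, the url form field, the use of basic auth and the ARCHIVE_USER and ARCHIVE_PASS variable names are all assumptions.

```shell
#!/bin/sh
# Hypothetical endpoint; the real path handled by the Node service is not
# documented in this README.
SPIDER_ENDPOINT="${SPIDER_ENDPOINT:-http://localhost:8080/spider}"

request_snapshot() {
  target_url="$1"
  # --data-urlencode makes curl send a POST with a form-encoded body.
  # --fail makes curl exit non-zero on an HTTP error such as 401.
  curl --fail --silent --show-error \
    --user "${ARCHIVE_USER}:${ARCHIVE_PASS}" \
    --data-urlencode "url=${target_url}" \
    "$SPIDER_ENDPOINT"
}

# A crontab entry for the 3 am run could then look like this
# (the script path is hypothetical):
#   0 3 * * * /usr/local/bin/request-snapshot.sh
```

Calling request_snapshot "https://www.justice.gov.uk/" with the two credential variables exported would submit the request.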




HTTrack

At the very heart of the Archiver sits HTTrack. Node configures this application to take a snapshot of justice.gov.uk. In principle, you can point the Archiver at any website address and, using the same settings, it will attempt to create an isolated copy of that site.

Debugging

HTTrack's output appears in Docker Compose's stdout in the running terminal window. However, a more detailed, linear output stream is available in the hts-log.txt file, which you can find in the root of the snapshot.
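
When the log is long, a quick count of problem lines can help decide where to look first. The sketch below assumes that HTTrack reports problems on lines containing "Error:" or "Warning:"; the function names are hypothetical.

```shell
#!/bin/sh
# Count the lines in a log file that contain a given marker.
count_log_lines() {
  marker="$1"
  log_file="$2"
  # grep -c prints 0 but exits non-zero when nothing matches, so ignore the status.
  grep -c -F "$marker" "$log_file" || true
}

# Print a two-line summary of errors and warnings for a log file.
summarise_log() {
  log_file="$1"
  printf 'errors: %s\n' "$(count_log_lines 'Error:' "$log_file")"
  printf 'warnings: %s\n' "$(count_log_lines 'Warning:' "$log_file")"
}

# Usage: summarise_log ./hts-log.txt
```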

Testing and making modifications

All application processing for HTTrack is managed in the process.js file located in the NodeJS application. You will find all the options used to set HTTrack up.

To understand the build process further, please look at the Makefile.




Kubernetes

Interact with running pods with help from this cheatsheet. Please be aware that with every call to the CP k8s cluster, you will need to provide the namespace, as shown below:

kubectl -n justice-archiver-dev
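
Because the namespace is easy to forget, one option is a small wrapper function that always supplies it. The name kc is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Namespace taken from the example above.
K8S_NSP="justice-archiver-dev"

# Hypothetical wrapper: prepend the namespace to every kubectl call.
kc() {
  kubectl -n "$K8S_NSP" "$@"
}

# Usage: kc get pods
```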

Useful commands

# make interaction a little easier; we can create repeatable 
# variables, our namespace is the same name as the app, defined 
# in ./kubectl_deploy/development/deployment.tpl

# set some vars, gets the first available pod (only one in our case)
K8S_NSP="justice-archiver-dev"
K8S_POD=$(kubectl -n "${K8S_NSP}" get pod -l app="${K8S_NSP}" -o jsonpath="{.items[0].metadata.name}")

After setting the above variables (copy -> paste -> execute) the following blocks of commands will work using copy -> paste -> execute too.

# list available pods and their status for the namespace
kubectl get pods -n ${K8S_NSP}

# describe the first available pod
kubectl describe pod ${K8S_POD} -n ${K8S_NSP}

# monitor the system log of the first pod
kubectl logs -f ${K8S_POD} -n ${K8S_NSP}

# open an interactive shell on an active pod
kubectl exec -it ${K8S_POD} -n ${K8S_NSP} -- bash

Once you have an interactive shell, you can communicate with S3:

# list bucket directories
aws s3 ls s3://${S3_BUCKET_NAME}/

# get a list of snapshots
aws s3 ls s3://${S3_BUCKET_NAME}/www.justice.gov.uk/

# get a list of snapshot files - replace <date> 
aws s3 ls s3://${S3_BUCKET_NAME}/www.justice.gov.uk/<date>-03-00/ --recursive --human-readable
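
To find the newest snapshot without reading the whole listing, the output of aws s3 ls can be filtered. This sketch assumes that snapshot folders appear as PRE entries and that their names sort chronologically, as the date-based names above suggest; latest_snapshot is a hypothetical helper.

```shell
#!/bin/sh
# Read `aws s3 ls` output on stdin and print the newest snapshot folder name.
# Folder lines look like "PRE 2023-10-24-03-00/"; file lines are ignored.
latest_snapshot() {
  awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }' | sort | tail -n 1
}

# Usage inside the pod:
#   aws s3 ls "s3://${S3_BUCKET_NAME}/www.justice.gov.uk/" | latest_snapshot
```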

Copy a log file from a pod to your local machine

# change the date to one that exists
SCRAPE_DATE="2023-10-24-18-03";

# uses K8S_NSP & K8S_POD variables  
kubectl -n ${K8S_NSP} cp ${K8S_POD}:/archiver/snapshots/www.justice.gov.uk/"${SCRAPE_DATE}"/hts-log.txt ~/hts-log.txt




Makefile

We use a Makefile to reduce complex or repetitive commands to simple make commands.

Make commands

| Command | Description |
| --- | --- |
| make image | Used by the GitHub Action, cd.yml, during the build step. |
| make launch | Checks whether the Docker instance is running; if not, launches Dory and Docker in the background and opens the site in the system's default browser. |
| make run | Launches the application locally with docker compose up; requires env and Dory. |
| make down | Alias of docker compose down. |
| make shell | Opens a bash shell on the spider container. The application must already be running (e.g. via make run) before this can be used. |
| make sync | Opens a bash shell and executes s3sync, which uploads all assets to AWS S3. |