
Site Archiving Toolkit

What is this thing?

The Site Archiving Toolkit allows you to quickly and easily make both flattened HTML and web archive versions of websites. These scripts provide a relatively easy-to-use command line interface for crawling sites with both HTTrack and Browsertrix Crawler (from the Webrecorder project) running in Docker.

Check out this video to see what it does and how to use it:

Site Archiving Toolkit - reclaim.tv

Check out this example to see the types of archives it makes:

archiving.ca.reclaim.cloud

These archives are zipped and easy to download, and can then be placed on just about any web server and made public! Here's an example of one in use:

digciz.jadin.me

Features

  • Crawl an entire site / domain for offline browsing, preservation, or whatever other purpose
  • Crawls can run in the background after they’ve been started, even if you close your terminal
  • Accepts multiple URLs at once, to queue up multiple crawl jobs on Reclaim Cloud, Linux, or macOS (this is not supported on the Windows version)
  • Preview archived pages using a local web server
  • Automatically creates zip files for easy download/upload
  • Override crawl settings using the archive.ini file. Delete the file to return to defaults!

How do I use it?

The Site Archiving Toolkit is designed first and foremost to run on Reclaim Cloud, but it can also be used on any computer that has Docker installed.

Using the Site Archiving Toolkit on Reclaim Cloud

Install the Site Archiving Toolkit using the Marketplace. Open the terminal (either via SSH or the built-in Web SSH feature) to start crawling sites.

The archive command will start crawling a site. Here are some examples:

This will crawl all pages on the "url.com" domain over HTTPS:

archive https://url.com

You can give the archive command a list of URLs separated by spaces, and it will crawl them sequentially:

archive https://url.com https://anotherurl.com

Once you start a crawl using the archive command, you no longer need to keep your terminal open; the crawl will keep running in the background. If you need to stop crawling a site, open a new terminal and use quit-crawlers, which will quit all HTTrack and Browsertrix Crawler jobs:

quit-crawlers

Previewing and Downloading your archived sites

Visit the environment URL of your Site Archiving Toolkit environment to see all completed and in-progress crawls. When they are finished, you can view them and download them as zip files. If you need to delete old crawls that are no longer needed, you can find them in the crawls directory at /root/site-archiving-toolkit/crawls, which is also bookmarked in the Reclaim Cloud file manager.
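
If you prefer the terminal to the file manager, you can manage old crawls directly from that directory. A minimal sketch (the folder name below is only a placeholder; list the directory first to see what your crawls are actually called):

ls /root/site-archiving-toolkit/crawls
rm -rf /root/site-archiving-toolkit/crawls/old-crawl-folder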

Using the Site Archiving Toolkit on your own computer

  • Install Docker Desktop
  • Launch Docker Desktop
  • Download the latest version of the Site Archiving Toolkit for your OS from the releases page
  • Unzip the release and place it somewhere convenient (maybe your Home directory or Documents folder)
  • Open the Terminal on macOS, or PowerShell on Windows
  • cd to the folder you unzipped the release into (ex: cd ~/Documents/site-archiving-toolkit)

From here you can run any of the following commands on macOS or Linux:

./archive.sh to archive sites

./quit-crawlers.sh to quit any in-progress crawls

./attach.sh to re-attach to an in-progress crawl. This is useful if you started one earlier and closed your terminal, and now you want to check back in on its status.

./start-server.sh to start a local web server so you can preview your archived sites. After running this command, open up a web browser and navigate to http://localhost

./stop-server.sh to stop the local web server
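
Putting that together, a typical macOS or Linux session might look like the sketch below. The path and URL are only examples, and it assumes the scripts take URLs as arguments the same way the archive command does on Reclaim Cloud:

cd ~/Documents/site-archiving-toolkit
./archive.sh https://example.com
./start-server.sh

Then preview the results at http://localhost and run ./stop-server.sh when you're done.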

Similar commands are available on Windows when using PowerShell:

.\archive.ps1 to archive sites. Note that the Windows version only supports one URL at a time.

.\quit-crawlers.ps1 to quit any in-progress crawls

.\attach.ps1 to re-attach to an in-progress crawl. This is useful if you started one earlier and closed your terminal, and now you want to check back in on its status.

.\start-server.ps1 to start a local web server so you can preview your archived sites. After running this command, open up a web browser and navigate to http://localhost

.\stop-server.ps1 to stop the local web server
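
A comparable PowerShell session might look like this (again, the path and URL are only examples, and remember that the Windows version takes a single URL per run):

cd ~\Documents\site-archiving-toolkit
.\archive.ps1 https://example.com
.\start-server.ps1

Then preview the results at http://localhost and run .\stop-server.ps1 when you're done.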