
Dataset rsync instructions

Aaron Elkiss edited this page Mar 13, 2024 · 14 revisions

Prerequisites

All of the following commands should work in bash on Linux and macOS, as well as in PowerShell on Windows. You must have a working installation of rsync and, for certain commands, python, ruby, or perl.

Syncing a full dataset

Replace $TREE with the name of the rsync tree you are connecting to (such as ht_text_pd) and $LOCAL_PATH with the path on your local filesystem where you want to write the dataset, for example /path/to/datasets.

rsync --copy-links --delete --ignore-errors --recursive --times --verbose datasets.hathitrust.org::$TREE $LOCAL_PATH

Sync a list of IDs

Step 1: generate list of paths from list of HathiTrust Volume IDs

id_list.txt must be a plain text file containing one HathiTrust Volume ID per line, with Unix line endings and no additional encoding (URL escaping, quotes, etc.).
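The requirements above can be checked before generating paths. The following is a minimal sketch, not part of the HathiTrust tooling: the function name check_id_line is hypothetical, and the quote and URL-escape checks are heuristics that assume volume IDs never legitimately contain quote characters or percent-encoded sequences.

```python
import re

def check_id_line(raw):
    """Return a list of problems found in one raw line of id_list.txt.

    Heuristic sketch: assumes IDs have the form namespace.id and never
    contain quotes or %XX sequences.
    """
    problems = []
    # A carriage return means the file does not use Unix line endings.
    if raw.endswith("\r\n") or raw.endswith("\r"):
        problems.append("carriage return line ending")
    line = raw.rstrip("\r\n")
    if line != line.strip():
        problems.append("leading or trailing whitespace")
    if '"' in line or "'" in line:
        problems.append("quote character")
    if re.search(r"%[0-9A-Fa-f]{2}", line):
        problems.append("looks URL-escaped")
    # The namespace is everything before the first period.
    namespace, dot, local_id = line.strip().partition(".")
    if not (namespace and dot and local_id):
        problems.append("not of the form namespace.id")
    return problems
```

To check a whole file, open it with open("id_list.txt", newline="") so that Python does not translate line endings, then call check_id_line on each line and report any line whose result is not empty.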

w/ python

First run pip install pairtree, then save this script as ids_to_ppath.py:

import sys, pairtree

for line in sys.stdin:
  # Split "namespace.id" on the first period only; the ID part may itself contain periods.
  (n,i) = line.strip().split('.',1)
  print("/".join([n, 'pairtree_root', pairtree.id2path(i), pairtree.id_encode(i)]))

Then run:

python ids_to_ppath.py < id_list.txt > path_list.txt
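To make it clearer what ends up in path_list.txt, here is a dependency-free sketch of the same mapping. It follows the character-encoding rules of the Pairtree specification as I understand them and is not the pairtree library's own code. The names encode_id and volume_id_to_path are hypothetical, and the IDs in the usage note are made up for illustration.

```python
# Bytes the Pairtree spec hex-encodes, in addition to any byte outside 0x21-0x7e.
HEX_ENCODED = set(b'"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
SUBSTITUTIONS = str.maketrans({"/": "=", ":": "+", ".": ","})

def encode_id(local_id):
    """Encode the part of a volume ID after the namespace for use in a path."""
    out = []
    for byte in local_id.encode("utf-8"):
        if byte < 0x21 or byte > 0x7E or byte in HEX_ENCODED:
            out.append("^%02x" % byte)  # caret plus two lowercase hex digits
        else:
            out.append(chr(byte))
    return "".join(out).translate(SUBSTITUTIONS)

def volume_id_to_path(volume_id):
    """Map 'namespace.id' to 'namespace/pairtree_root/<pairs>/<encoded id>'."""
    # Split on the first period only; the ID part may contain periods.
    namespace, local_id = volume_id.strip().split(".", 1)
    encoded = encode_id(local_id)
    # Break the encoded ID into two-character directory names.
    pairs = [encoded[i:i + 2] for i in range(0, len(encoded), 2)]
    return "/".join([namespace, "pairtree_root", *pairs, encoded])
```

For example, volume_id_to_path("mdp.39015012345678") returns mdp/pairtree_root/39/01/50/12/34/56/78/39015012345678. For real work, prefer the library-based script above, since it is what these instructions were written against.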

w/ ruby

First run gem install pairtree to install the pairtree gem.

ruby -e 'require "pairtree";ARGF.each {|l|l.chomp!;n,i=l.split(/\./,2);puts "#{n}/pairtree_root/#{Pairtree::Path.id_to_path i}"}' id_list.txt > path_list.txt

w/ perl

First install the File::Pairtree CPAN module.

perl -MFile::Pairtree -ne 'chomp;($n,$i)=split /\./,$_,2;print "$n/".File::Pairtree::id2ppath($i).File::Pairtree::s2ppchars($i)."\n"' id_list.txt > path_list.txt

Step 2: sync files

Replace $TREE with the name of the rsync tree you are connecting to (such as ht_text_pd) and $LOCAL_PATH with the path on your local filesystem where you want to write the dataset, for example /path/to/datasets.

rsync --copy-links --delete --ignore-errors --recursive --times --verbose --files-from=path_list.txt datasets.hathitrust.org::$TREE $LOCAL_PATH
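Because --delete removes local files that are not present on the server, it can be worth previewing a sync first. The sketch below assembles the rsync command from this page as an argument list and optionally adds rsync's --dry-run flag, which reports what would be transferred or deleted without changing anything. The function build_rsync_command and its parameters are hypothetical helpers for illustration, not part of any HathiTrust tooling.

```python
RSYNC_HOST = "datasets.hathitrust.org"
# The flags used by the commands on this page.
BASE_FLAGS = [
    "--copy-links", "--delete", "--ignore-errors",
    "--recursive", "--times", "--verbose",
]

def build_rsync_command(tree, local_path, files_from=None, dry_run=False):
    """Build the rsync argument list for a full sync or a list-of-paths sync."""
    cmd = ["rsync", *BASE_FLAGS]
    if files_from:
        # Restrict the sync to the paths generated in Step 1.
        cmd.append("--files-from=" + files_from)
    if dry_run:
        # Preview only: rsync lists its actions but makes no changes.
        cmd.append("--dry-run")
    cmd += [RSYNC_HOST + "::" + tree, local_path]
    return cmd

if __name__ == "__main__":
    preview = build_rsync_command(
        "ht_text_pd", "/path/to/datasets",
        files_from="path_list.txt", dry_run=True,
    )
    print(" ".join(preview))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the preview looks right; call the function again with dry_run=False to build the command that performs the real sync.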