bag-nabit is a tool for downloading public datasets and attaching provenance information to them.
The tool is intended for library projects that back up public domain resources and share
them with patrons.
bag-nabit writes a dialect of BagIt from either local files or remote URLs, and can also:
- Store request and response headers in a headers.warc file.
- Attach timestamps and public-key signatures to the bag's tagmanifest file.
- Verify format compliance and provenance chains on an existing bag-nabit bag.
bag-nabit is designed for researchers who want to use copies of public data downloaded from libraries and archives, and who need to be able to verify the provenance of that data if necessary. A typical scenario would be a library providing a copy of a government dataset whose funding has expired and which is no longer available from the original source.
The tool is inspired by the WARC and wacz-auth formats, by bagit-python, and by the C2PA format for attaching provenance information to media files.
Design goals are:
- Usability:
  - Bags should be usable for capturing both web content and file content delivered out of band.
  - Bag content should be directly usable. For example, an archived CSV file should be readable as a CSV, rather than requiring a web archive reader.
- Transportability and self-documentation:
  - Users should be able to read metadata as a text file.
  - Avoid depending on custom file formats and tooling for consuming bags. As much as possible, bags should be composed of standard file formats verifiable with existing tools. Notably, use standard WARC files for headers and standard openssl commands for signatures and timestamps.
- Integrity and provenance:
  - Archivists should be able to vouch for a bag when creating it. Like C2PA and wacz-auth, bags should include certificate chains showing who (based on control of a private key) vouches for the integrity of the dataset.
  - Bags should be easy to copy and back up in different archives. Vouching should continue to work after a particular archive goes offline. We therefore rely on the existing PKI infrastructure (such as email, domain, and document signing certificates) to establish the identity of the signer.
  - Like WARC, users should be able to access request and response headers for the original source of data.
bag-nabit is not yet available on PyPI, but can be installed from source:
pip install https://github.com/harvard-lil/bag-nabit/archive/refs/heads/main.zip
Or installed as a tool by uv:
uv tool install --from git+https://github.com/harvard-lil/bag-nabit nabit
Or run from uvx:
uvx --from git+https://github.com/harvard-lil/bag-nabit nabit --help
Create a bag from a single URL:
nabit archive example_bag -u https://example.com/
Create a bag from multiple URLs, files, and directories:
nabit archive example_bag -u https://example.com/ -u https://example.com/other -p /path/to/local/files
Create a bag with metadata headers:
nabit archive example_bag -u https://example.com/ -i "Title:Example Dataset"
Create a bag with a timestamp:
nabit archive example_bag -u https://example.com/ -t digicert
Create a bag with multiple signatures and a timestamp:
nabit archive example_bag -u https://example.com/ -s mykey.pem:mychain.pem -s anotherkey.pem:anotherchain.pem -t digicert
Amend an existing bag to add files, metadata, signatures, and/or timestamps (editing data will remove existing signatures):
nabit archive example_bag --amend -u https://example.com/another -i "Subtitle: Another Header" -s mykey.pem:mychain.pem -t digicert
Validate a bag's contents:
nabit validate example_bag
Usage: [OPTIONS] COMMAND [ARGS]...
BagIt package signing tool
Options:
--help Show this message and exit.
Commands:
archive Archive files and URLs into a BagIt package.
validate Validate a BagIt package.
Usage: [OPTIONS] BAG_PATH
Archive files and URLs into a BagIt package. bag_path is the destination
directory for the package.
Options:
-a, --amend Update an existing archive. May add OR
OVERWRITE existing data.
-u, --url TEXT URL to archive (can be repeated). May be a
bare url or a JSON dict with a "url" key and
an optional "output" key
-p, --path TEXT File or directory to archive (can be
repeated). May be a bare path or a JSON dict
with a "path" key and an optional "output"
key
-c, --collect TEXT Collection tasks in JSON format
--hard-link Use hard links when copying files (when
possible)
-i, --info TEXT bag-info.txt metadata in key:value format
(can be repeated)
--signed-metadata FILE JSON file to be copied to data/signed-
metadata.json
--unsigned-metadata FILE JSON file to be copied to unsigned-
metadata.json
--signed-metadata-json TEXT JSON string to be written to data/signed-
metadata.json
--unsigned-metadata-json TEXT JSON string to be written to unsigned-
metadata.json
-s, --sign <cert_chain>:<key_file>
Sign using certificate chain and private key
files (can be repeated)
-t, --timestamp <tsa_keyword> | <cert_chain>:<url>
Timestamp using either a TSA keyword or a
cert chain path and URL (can be repeated)
--timeout FLOAT Timeout for collection tasks (default: 5.0)
--collect-errors [fail|ignore] How to handle collection task errors
(default: fail)
--help Show this message and exit.
Usage: [OPTIONS] BAG_PATH
Validate a BagIt package. bag_path is the path to the package directory to
validate.
Options:
--help Show this message and exit.
After creating a bag, or obtaining a bag from another source, the bag can be edited manually --
for example, by adding files to the data/ directory or editing a metadata file. However, new
data files or edits to signed files will invalidate the signature, causing nabit validate
to fail.
To re-sign a bag after editing, use the --amend flag to nabit archive:
nabit archive example_bag --amend -s mykey.pem:mychain.pem -t digicert
This command will regenerate the manifest files, remove any signatures and timestamps that no longer validate, and then re-run the signing and timestamping process.
Bags can be signed with any key accepted by openssl cms, such as domain keys, email keys, or
document signing keys. Protecting these keys is very important: an attacker who obtains one
can not only create fake bags, but also publish fake websites, send fake email, or abuse
whatever other purpose the key serves.
In many situations it may make sense to create and sign bags on different machines or by different people. A typical workflow might be:
- A worker machine or trusted volunteer creates the bag and timestamps it, but does not sign it:
nabit archive example_bag -u https://example.com/ -t digicert
- The data is transferred to a high-security machine where a signature and a second timestamp are added:
nabit archive example_bag --amend -s mykey.pem:mychain.pem -t digicert
- The signed bag is then published to the archive, perhaps simply by copying the bag directory to a public file server.
It is not recommended to collect URLs from untrusted sources without validating their destination.
bag-nabit currently WILL capture URLs that point to local IP addresses, such as localhost or the local network.
This is a security risk, as it may allow an attacker to capture sensitive data from local networks, especially on
cloud hosting where well-known internal URLs may expose sensitive configuration data.
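One way to screen untrusted URLs before passing them to nabit archive is sketched below. This is not part of bag-nabit; the function name resolves_to_non_public_address is hypothetical, and the check uses only the Python standard library. It inspects only the submitted URL, so it does not cover redirect targets or DNS answers that change between the check and the fetch.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolves_to_non_public_address(url: str) -> bool:
    """Return True if the URL's host resolves to any address that is not
    globally routable (loopback, private, link-local, and so on).

    Hypothetical helper, not part of bag-nabit.
    """
    host = urlsplit(url).hostname
    if host is None:
        return True  # no host to check; treat as unsafe
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # unresolvable; treat as unsafe
    for info in infos:
        # info[4][0] is the address string; drop any IPv6 zone id ("%eth0").
        address = ipaddress.ip_address(info[4][0].split("%")[0])
        if not address.is_global:
            return True
    return False


if __name__ == "__main__":
    for candidate in ["http://127.0.0.1/", "http://169.254.169.254/"]:
        print(candidate, resolves_to_non_public_address(candidate))
```

URLs that fail this check could be rejected or reviewed by hand before archiving.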
bag-nabit is not primarily a web archiving tool, but it supports collection backends that can gather both web content and file content. Collection tasks can be provided as JSON content passed to the --collect flag to nabit archive:
nabit archive example_bag --collect '[
{"backend": "url", "url": "https://example.com/", "output": "example_com.html"},
{"backend": "path", "path": "/path/to/local/file"}
]'
Currently supported collection backends are:
- url: fetch URLs with Python requests, following redirects. Write metadata to data/headers.warc. Equivalent to the -u flag to nabit archive. Keys:
  - url: the URL to fetch
  - output (optional): the path to save the fetched content to in the bag, relative to data/files/. If not provided, the content will be saved to data/files/<url_path>, where <url_path> is the last path component of the URL.
- path: copy local files or directories to the bag. Equivalent to the -p flag to nabit archive. Keys:
  - path: the path to the local file or directory to copy
  - output (optional): the path to save the copied content to in the bag.
Future backends could include ftp, web crawlers, etc.
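When the task list is generated by a script, building the JSON programmatically avoids quoting mistakes. The following is a minimal sketch using only the keys documented above; the helper names url_task, path_task, and archive_command are hypothetical and not part of bag-nabit.

```python
import json
import shlex


def url_task(url, output=None):
    """Build a task dict for the "url" backend (hypothetical helper)."""
    task = {"backend": "url", "url": url}
    if output is not None:
        task["output"] = output
    return task


def path_task(path, output=None):
    """Build a task dict for the "path" backend (hypothetical helper)."""
    task = {"backend": "path", "path": path}
    if output is not None:
        task["output"] = output
    return task


def archive_command(bag_path, tasks):
    """Return the argument list for `nabit archive <bag_path> --collect <json>`."""
    return ["nabit", "archive", bag_path, "--collect", json.dumps(tasks)]


tasks = [
    url_task("https://example.com/", output="example_com.html"),
    path_task("/path/to/local/file"),
]
command = archive_command("example_bag", tasks)

# shlex.join produces a shell-safe string for display or logging;
# the list form can be passed directly to subprocess.run.
print(shlex.join(command))
```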
bag-nabit reads and writes a special dialect of BagIt designed for attaching provenance to publicly hosted resources.
bag-nabit-flavored bags have the following notable features:
- headers.warc records provenance information for files downloaded from the web.
- signatures/ contains a chain of signature files and timestamp files for the tagmanifest.
- standard locations for metadata that is either signed with the bag, or editable after the bag is created.
The layout of a bag-nabit bag is as follows:
- bagit.txt: standard BagIt file
- bag-info.txt: standard BagIt file
- manifest-sha256.txt: standard BagIt file
- tagmanifest-sha256.txt: standard BagIt file
- unsigned-metadata.json: optional metadata, not signed, editable after bag is created
- data/
  - files/: directory of files added to the bag
    - ...
  - headers.warc: optional, request and response headers from HTTP fetches for files in files/
  - signed-metadata.json: optional metadata, signed along with the bag contents
- signatures/: directory of signature files
  - tagmanifest-sha256.txt.p7s -- signature file for tagmanifest-sha256.txt
  - tagmanifest-sha256.txt.p7s.tsr -- timestamp file for tagmanifest-sha256.txt.p7s
  - tagmanifest-sha256.txt.p7s.tsr.crt -- certificate file for tagmanifest-sha256.txt.p7s.tsr
  - ... -- other signature files in chain
headers.warc is a standard WARC file containing request and
response headers from HTTP fetches for files in data/files/.
headers.warc is not required to exist, and if it exists, may not include entries for every file in data/files/.
This allows bag-nabit to be used both for resources accessible via HTTP and otherwise.
Response records in headers.warc are stored as Revisit records, using a custom WARC-Profile header as allowed by the WARC specification:
WARC-Type: revisit
WARC-Profile: file-content; filename="files/data.html"
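To illustrate how a reader might map a revisit record back to a file in the bag, here is a minimal sketch that extracts the filename from headers shaped like the example above. The function name revisit_filename is hypothetical, and the parsing assumes exactly the quoting shown in the example; real WARC files are better read with a WARC library.

```python
import re

# Matches the profile value shown in the example above. Escaping rules
# inside the quoted filename are not covered here and are an assumption.
_PROFILE_RE = re.compile(r'^file-content;\s*filename="(?P<filename>[^"]*)"$')


def revisit_filename(header_block: str):
    """Return the filename named by a file-content revisit record,
    or None if the headers do not describe one (hypothetical helper)."""
    headers = {}
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    if headers.get("warc-type") != "revisit":
        return None
    match = _PROFILE_RE.match(headers.get("warc-profile", ""))
    return match.group("filename") if match else None


example = (
    "WARC-Type: revisit\n"
    'WARC-Profile: file-content; filename="files/data.html"\n'
)
print(revisit_filename(example))
```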
The signatures/ directory contains a chain of signature files and timestamp files for the tagmanifest.
Signing the tagmanifest file is sufficient to ensure the integrity of the bag, as the tagmanifest contains cryptographic hashes for all tag files, including the manifest file which contains hashes for all data files.
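The hash chain described here can be checked independently of bag-nabit. The sketch below is a simplified illustration, not the tool's own validation logic: it assumes manifest lines of the form "<sha256> <path>", ignores BagIt's percent-encoding of special characters in paths, and does not detect extra files that no manifest lists. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_mismatches(bag_dir: Path, manifest_name: str) -> list[str]:
    """Return the paths listed in one manifest whose file is missing
    or whose hash differs from the recorded value."""
    mismatches = []
    text = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        expected, relative_path = line.split(maxsplit=1)
        target = bag_dir / relative_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatches.append(relative_path)
    return mismatches


def bag_hash_mismatches(bag_dir: Path) -> list[str]:
    """Follow the chain: tagmanifest covers tag files (including the
    manifest), and the manifest covers the data files."""
    return (
        manifest_mismatches(bag_dir, "tagmanifest-sha256.txt")
        + manifest_mismatches(bag_dir, "manifest-sha256.txt")
    )
```

An empty result means every listed file matches its recorded hash, so a valid signature over tagmanifest-sha256.txt then extends to those files.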
The signatures directory can contain two kinds of attestation files:
- .p7s files are PKCS#7 signature files, which assert a domain or email address vouching for the bag contents.
  - .p7s files are created with the command:
    openssl cms -sign -binary -md sha256 -in <original_file> -out <signature_file> -inkey <key_file> -signer <first_cert> [-certfile <remaining_chain>] -outform PEM -nosmimecap -cades
  - .p7s files can be validated with the command:
    openssl cms -verify -binary -content <original_file> -in <signature_file> -inform PEM -purpose any
- .tsr files are timestamp response files, which assert a time before which the bag was created.
  - .tsr files are created with the command:
    openssl ts -query -data <original_file> -sha256 -cert
  - .tsr files can be validated with the commands:
    # verify certificate chain
    openssl verify <certificate_file>
    # validate timestamp
    openssl ts -verify -data <original_file> -in <timestamp_file> -CAfile <certificate_file>
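For scripted checks, the validation commands listed above can be assembled as argument lists and run with subprocess. This is a sketch that assumes openssl is on the PATH; the function names are hypothetical, and the arguments are copied from the commands above without additions.

```python
import subprocess


def cms_verify_command(original_file, signature_file):
    """Argument list for validating a .p7s signature."""
    return [
        "openssl", "cms", "-verify", "-binary",
        "-content", original_file,
        "-in", signature_file,
        "-inform", "PEM",
        "-purpose", "any",
    ]


def ts_verify_commands(original_file, timestamp_file, certificate_file):
    """Argument lists for validating a .tsr timestamp: first the
    certificate chain, then the timestamp itself."""
    return [
        ["openssl", "verify", certificate_file],
        [
            "openssl", "ts", "-verify",
            "-data", original_file,
            "-in", timestamp_file,
            "-CAfile", certificate_file,
        ],
    ]


def all_succeed(commands):
    """Run each command and return True only if every one exits with 0."""
    return all(
        subprocess.run(command, capture_output=True).returncode == 0
        for command in commands
    )
```

For example, all_succeed([cms_verify_command("tagmanifest-sha256.txt", "signatures/tagmanifest-sha256.txt.p7s")]) would check the first signature in a bag.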
To allow creation of arbitrary certificate chains, each attestation includes the full file name of the file it attests to. For example, this layout:
tagmanifest-sha256.txt
signatures/
tagmanifest-sha256.txt.p7s
tagmanifest-sha256.txt.p7s.tsr
tagmanifest-sha256.txt.p7s.tsr.crt
indicates that the signature in tagmanifest-sha256.txt.p7s attests to the integrity of tagmanifest-sha256.txt,
and the timestamp in tagmanifest-sha256.txt.p7s.tsr, whose TSA uses the certificate chain in tagmanifest-sha256.txt.p7s.tsr.crt,
attests to the time tagmanifest-sha256.txt.p7s was created.
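The naming convention means the attested file can be derived from each attestation's name by stripping its final suffix. The following minimal sketch shows that derivation; the function name describe_attestations is hypothetical, and this is not bag-nabit's own implementation.

```python
# Suffixes that mark an attestation, and what kind each one is.
# .crt files accompany a timestamp and are not attestations themselves.
ATTESTATION_KINDS = {".p7s": "signature", ".tsr": "timestamp"}


def describe_attestations(names):
    """Return (kind, attestation_name, attested_name) tuples for the
    file names found in signatures/, shortest names first so that each
    attestation follows the file it attests to."""
    result = []
    for name in sorted(names, key=len):
        for suffix, kind in ATTESTATION_KINDS.items():
            if name.endswith(suffix):
                result.append((kind, name, name[: -len(suffix)]))
    return result


layout = [
    "tagmanifest-sha256.txt.p7s",
    "tagmanifest-sha256.txt.p7s.tsr",
    "tagmanifest-sha256.txt.p7s.tsr.crt",
]
for kind, attestation, target in describe_attestations(layout):
    print(f"{kind}: {attestation} attests to {target}")
```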
Note: As of this version, there is no way for a signer to attach metadata (such as intents) to a signature. Therefore all signatures are understood to vouch for the contents of the entire bag.
Note: Signatures SHOULD always be followed by a timestamp, so that the final interpretation of a signature chain is: "the controller of this email address/domain, at or before this time, vouches for the contents of the bag".
Note: Bags created by bag-nabit include only the base tag files (bagit.txt, bag-info.txt, manifest-sha256.txt)
in the tagmanifest-sha256.txt file, which is allowed but not encouraged by the BagIt spec. This is to allow
unsigned metadata outside the data/ directory.
The BagIt specification allows minimal metadata to be stored in the bag-info.txt file,
and any extensive metadata to be stored within the data/ directory.
Headers can be added to bag-info.txt with the --info / -i flag to nabit archive.
To clarify the distinction between signed and unsigned metadata, bag-nabit extends
this specification to encourage metadata to be stored in two files:
- data/signed-metadata.json: metadata signed with the bag.
- unsigned-metadata.json: metadata not signed, editable after bag is created.
In practice any files within data/ will be signed, and any files outside data/ will not, but the provided filenames are encouraged to ensure that users will understand the distinction.
bag-nabit does not currently specify anything regarding the
contents of the metadata files.
We use uv to manage development dependencies. After cloning the repository, to run from source:
uv run nabit
This will automatically install dependencies and run the command.
To run tests:
uv run pytest
Some tests use the inline-snapshot library. If the tool output changes
intentionally, you may need to run uv run pytest --inline-snapshot=review
to review the changes and apply them to test files.
After making changes to the command line interface, run uv run scripts/update_docs.py
to update README.md.
- Currently only SHA-256 is supported for hashing. This is only out of expediency, and could be extended to other cryptographically secure hashes.
- Because empty directories are not included in BagIt manifest files, signatures do not verify the presence or absence of empty directories.