JSON Scrapbook File Format
Scrapbooking for the 2020s
This page describes the JSON Scrapbook file format of the Scrapyard browser add-on.
Although the RDF file format of the legacy ScrapBook browser add-on is widely accepted, it has some inherent drawbacks:
- It is not possible to store arbitrary structured data along with the archived item. For example, notes in a rich-text format or archive index.
- There is no top-level partitioning, so multiple RDF files are necessary to represent different domains of the content.
- It is not possible to easily export and import a dedicated subset of the archived items.
The JSON Scrapbook file format described here is intended to address these shortcomings, also providing efficiency and extensibility.
Files in the JSON Scrapbook format are JSON Lines files with the .jsbk file extension. Each file contains metadata on its first line.
There are two physical file layouts:
- The index layout is optimal for synchronization through a cloud service. Files in this layout contain only the item objects (described below). Archived items themselves are stored as multiple files in their dedicated directories.
The following listing demonstrates a possible index physical file layout.
.
├── index.jsbk
└── objects
    └── 50593137CE5046BCA0A580B0A00430E9
        ├── archive
        │   ├── favicon.ico
        │   └── index.html
        ├── archive_content.blob
        ├── archive_index.json
        ├── comments.json
        ├── comments_index.json
        ├── icon.json
        ├── item.json
        ├── notes.json
        └── notes_index.json
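As a rough illustration of the index layout, the per-item files can be located from an item's UUID. The helper name `object_paths` is hypothetical; the file names are taken from the listing above, and the unpacked `archive` directory is left out.

```python
from pathlib import Path

# File names as they appear in the listing above.
OBJECT_FILES = (
    "item.json",
    "icon.json",
    "archive_content.blob",
    "archive_index.json",
    "notes.json",
    "notes_index.json",
    "comments.json",
    "comments_index.json",
)


def object_paths(root, uuid):
    """Map each per-item file name to its expected path under <root>/objects/<uuid>."""
    base = Path(root) / "objects" / uuid
    return {name: base / name for name in OBJECT_FILES}
```

Not every file has to exist for every item; a reader would typically check `path.exists()` before opening one.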
- The export layout is optimal for bulk export. In this layout, all archived items are stored in a single JSON-lines file, one per line in the following form:
{object_name: {JSON representation}, ...}
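A minimal sketch of reading a file in the export layout: the first non-empty line is parsed as the metadata, and every following line as one record that maps object names to their JSON representations. The function name `read_export` is hypothetical, and the object name `item` used in the sample is an assumption based on the file names of the index layout.

```python
import json


def read_export(lines):
    """Split the lines of an export-layout file into (metadata, records)."""
    non_empty = (line for line in lines if line.strip())
    meta = json.loads(next(non_empty))  # metadata sits on the first line
    records = [json.loads(line) for line in non_empty]
    return meta, records


# Tiny in-memory sample; a real file would be passed as an open file object.
sample = [
    '{"format": "JSON Scrapbook", "version": 1, "type": "export"}',
    '{"item": {"type": "bookmark", "uuid": "A1", "title": "Example"}}',
]
meta, records = read_export(sample)
```

Because each record occupies its own line, a large export can also be processed as a stream instead of being collected into a list.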
The metadata object on the first line of a file has the following form:
{
  "format": "JSON Scrapbook",                 // String, mandatory
  "version": 1,                               // Integer, mandatory
  "type": "index",                            // String, mandatory
  "contains": "everything",                   // String, mandatory
  "generator": "Scrapyard",                   // String, optional
  "uuid": "620B64F084BA449A953FC80EEC4F8D27", // String, mandatory
  "name": "default",                          // String, optional
  "entities": 1000,                           // Integer, mandatory
  "timestamp": 1645554142000,                 // Integer, mandatory
  "date": "2022-02-22T22:22:22.222",          // String, optional
  "comment": "File comment"                   // String, optional
}
- The "type" field designates the physical file layout and may take the following values:
  - "index"
  - "export"
- The "contains" field designates the type of item objects contained inside the file:
  - "shelves" - the contained item object may be of type "shelf".
  - "folders" or missing - the file represents an export of a single shelf; there are no item objects of type "shelf".
- The "name" field contains the original name of the exported shelf.
- The "entities" field contains the number of objects in the file.
- The "timestamp" field contains a timestamp (ms) of the last modification.
- The "date" field contains a human-readable date of the last modification and is optional.
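The field rules above can be checked mechanically. The following is only a sketch: the name `meta_problems` is hypothetical, and since the listing marks "contains" as mandatory while the field list allows it to be missing, the sketch does not require it.

```python
# Mandatory metadata fields and their JSON types, as marked in the listing.
# "contains" is deliberately left out (assumption, see the note above).
MANDATORY_META = {
    "format": str,
    "version": int,
    "type": str,
    "uuid": str,
    "entities": int,
    "timestamp": int,
}


def meta_problems(meta):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    for field, expected in MANDATORY_META.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        elif type(meta[field]) is not expected:  # strict check, so True is not an int
            problems.append(f"wrong type for field: {field}")
    if meta.get("type") not in ("index", "export"):
        problems.append("type must be 'index' or 'export'")
    return problems
```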
The item object contains metadata of a shelf, folder, bookmark, or archive.
{
  "type": "archive",                            // String, mandatory
  "uuid": "50593137CE5046BCA0A580B0A00430E9",   // String, mandatory
  "parent": "F7A5DC6862C2459B986CCE85F2394AC0", // String, mandatory
  "title": "This is an example",                // String, mandatory
  "url": "http://www.example.com",              // String, optional
  "content_type": "text/html",                  // String, optional
  "contains": "text",                           // String, optional
  "size": 1024,                                 // Integer, optional
  "tags": "comma,separated",                    // String, optional
  "todo_state": "TODO",                         // String, optional
  "todo_date": "2022-02-22",                    // String, optional
  "todo_pos": 0,                                // Integer, optional
  "details": "TODO details",                    // String, optional
  "date_added": 1570076393657,                  // Integer, mandatory
  "date_modified": 1663500045342,               // Integer, mandatory
  "content_modified": 1663500045342,            // Integer, optional
  "has_icon": true,                             // Boolean, optional
  "has_comments": false,                        // Boolean, optional
  "has_notes": false,                           // Boolean, optional
  "is_site": false,                             // Boolean, optional
  "pos": 1                                      // Integer, optional
}
- The "uuid" and "parent" fields contain an RFC 4122 v4 UUID. The special UUID "default" represents the default shelf in Scrapyard.
- The "type" field may take the following values:
  - "shelf"
  - "folder"
  - "bookmark"
  - "archive"
  - "separator"
  - "notes"
- The "is_site" field is set for the folders that contain pages created during site capture.
- The "contains" field may take the following values:
  - "text" or missing:
    - Index layout: the archive_content.blob file contains UTF-8-encoded text.
    - Export layout: the archive object contains a JSON string that does not require any encoding.
  - "bytes":
    - Index layout: the archive_content.blob file contains raw bytes of the archive.
    - Export layout: the archive object contains a JSON string with the Base64-encoded bytes of the archive.
  - "files":
    - Index layout: the archive is unpacked, its files are stored in the archive subfolder.
    - Export layout: the archive object contains a JSON string with the Base64-encoded bytes of a ZIP-archive.
- The "content_type" field contains the MIME type of the archived content. "text/html" is assumed if omitted.
- The "size" field contains an approximate size of the archive in bytes.
- The "date_modified" field contains a timestamp representing the time when the item object of the archive was modified.
- The "content_modified" field contains a timestamp representing the time when the archive, comments, or notes objects of the archive were modified.
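For the export layout, the rules for the "contains" field translate into a small decoding step. This is a sketch under the assumption that the archive object carries its payload in a "content" string; the function name `decode_archive` is hypothetical.

```python
import base64
import io
import zipfile


def decode_archive(item, archive):
    """Decode archive["content"] according to item["contains"] (export layout).

    Returns str for "text", bytes for "bytes", and a {file name: bytes} dict for "files".
    """
    kind = item.get("contains", "text")  # a missing field means "text"
    content = archive["content"]
    if kind == "text":
        return content  # plain JSON string, no extra encoding
    raw = base64.b64decode(content)
    if kind == "bytes":
        return raw
    if kind == "files":
        # The decoded bytes are a ZIP archive holding the archived files.
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
    raise ValueError(f"unknown 'contains' value: {kind!r}")
```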
In the export layout, items are sorted in such a way that no descendant appears before its parent object.
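One way to produce such an ordering is a depth-first, pre-order walk over the parent links. The sketch below is not necessarily how Scrapyard does it; the name `order_parents_first` is hypothetical, and the code assumes that the parent links form a tree without cycles.

```python
def order_parents_first(items):
    """Return the items reordered so that no descendant precedes its parent.

    Items whose parent is not part of the input are treated as roots.
    Siblings keep their relative input order.
    """
    uuids = {item["uuid"] for item in items}
    children = {}
    roots = []
    for item in items:
        parent = item.get("parent")
        if parent in uuids:
            children.setdefault(parent, []).append(item)
        else:
            roots.append(item)

    ordered = []
    stack = list(reversed(roots))  # reversed so that pop() yields input order
    while stack:
        item = stack.pop()
        ordered.append(item)
        stack.extend(reversed(children.get(item["uuid"], [])))
    return ordered
```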
The icon object contains the icon image represented as a data URL.
{
  "url": "..." // String, mandatory
}
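Since the icon is stored as a data URL, a consumer has to split it into a media type and the image bytes. A minimal sketch using only the Python standard library; the function name `parse_data_url` is hypothetical.

```python
import base64
from urllib.parse import unquote_to_bytes


def parse_data_url(url):
    """Split a data URL into (media_type, payload_bytes)."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, sep, data = url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URL has no comma separator")
    is_base64 = header.endswith(";base64")
    media_type = header[: -len(";base64")] if is_base64 else header
    # Non-Base64 payloads are percent-encoded.
    payload = base64.b64decode(data) if is_base64 else unquote_to_bytes(data)
    # An empty media type falls back to the default defined for data URLs.
    return media_type or "text/plain;charset=US-ASCII", payload
```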
The archive object incorporates the archived content.
{
  "content": "..." // String, mandatory
}
In the index layout, archive content is stored inside the archive_content.blob file.
The archive_index object contains the text index of the archive.
{
  "content": ["index", "content"] // Array of String, mandatory
}
This object is omitted in the export file layout.
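How an index object might be used for lookup can be sketched as follows. This assumes that "content" holds individual words, which the format description does not state explicitly; the function name `index_matches` is hypothetical.

```python
def index_matches(index, query):
    """Return True if every whitespace-separated query word occurs in index["content"]."""
    words = {word.lower() for word in index["content"]}
    return all(term.lower() in words for term in query.split())
```

The same helper would apply to the index objects of notes and comments, since they share the same shape.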
The notes object contains the notes attached to the corresponding item.
{
  "format": "text", // String, mandatory
  "content": "...", // String, mandatory
  "html": "..."     // String, optional
}
- The "format" field may take the following values:
  - "text" - plain text.
  - "html" - HTML.
  - "markdown" - Markdown-formatted text.
  - "org" - Org-formatted text.
  - "delta" - format used by the Quill text editor.
- The "html" field contains a representation of the notes rendered as HTML.
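A consumer that only needs to display notes could prefer the pre-rendered "html" field and fall back to the source text for the formats that need no renderer. This is a sketch; the name `notes_as_html` is hypothetical, and rendering of "markdown", "org", and "delta" sources is intentionally left out.

```python
import html


def notes_as_html(notes):
    """Return displayable HTML for a notes object, preferring the "html" field."""
    if notes.get("html"):
        return notes["html"]
    fmt = notes["format"]
    if fmt == "html":
        return notes["content"]
    if fmt == "text":
        # Escape the plain text so it is safe to embed in a page.
        return "<pre>" + html.escape(notes["content"]) + "</pre>"
    raise ValueError(f"no renderer available for format {fmt!r}")
```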
The notes_index object contains the text index of the notes.
{
  "content": ["index", "content"] // Array of String, mandatory
}
This object is omitted in the export file layout.
The comments object contains the text of item comments.
{
  "content": "..." // String, mandatory
}
The comments_index object contains the text index of the comments.
{
  "content": ["index", "content"] // Array of String, mandatory
}
This object is omitted in the export file layout.