JSON Scrapbook File Format

GChristensen edited this page Nov 19, 2022 · 22 revisions

Scrapbooking for the 2020s

This page describes the JSON Scrapbook file format of the Scrapyard browser add-on.

Rationale

Although the RDF file format of the legacy ScrapBook browser add-on is widely accepted, it has some inherent drawbacks:

  • It is not possible to store arbitrary structured data along with the archived item, for example, notes in a rich-text format or an archive index.
  • There is no top-level partitioning, so multiple RDF files are necessary to represent different domains of the content.
  • It is not possible to easily export and import a dedicated subset of the archived items.

The JSON Scrapbook file format described here is intended to address these shortcomings while also providing efficiency and extensibility.

Physical File Layouts

Files in the JSON Scrapbook format are JSON Lines files with the .jsbk file extension. Each file contains metadata on its first line.

There are two physical file layouts:

  • The index layout is optimal for synchronization through a cloud service. Files in this layout contain only the item objects (described below). Archived items themselves are stored as multiple files in their dedicated directories.

The following listing demonstrates a possible index physical file layout.

.
├── index.jsbk
└── objects
    └── 50593137CE5046BCA0A580B0A00430E9
        ├── archive
        │   ├── favicon.ico
        │   └── index.html
        ├── archive_content.blob
        ├── archive_index.json
        ├── comments.json
        ├── comments_index.json
        ├── icon.json
        ├── item.json
        ├── notes.json
        └── notes_index.json
  • The export layout is optimal for bulk export. In this layout, all archived items are stored in a single JSON Lines file, one item per line, in the following form:
{object_name: {JSON representation}, ...}
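
As a rough illustration of the export layout, the following Python sketch reads a .jsbk stream, treating the first line as the file metadata and every following line as one record. It assumes that the object names used as keys ("item", "archive", and so on) match the object names described below; the read_jsbk function and the sample records are hypothetical and not part of Scrapyard.

```python
import io
import json


def read_jsbk(stream):
    """Split a JSON Lines stream into (metadata, records)."""
    # Skipping blank lines is a tolerance chosen for this sketch only.
    lines = [line for line in stream if line.strip()]
    if not lines:
        raise ValueError("empty file: the metadata line is missing")
    metadata = json.loads(lines[0])  # first line: file metadata
    records = [json.loads(line) for line in lines[1:]]  # one record per line
    return metadata, records


# Hypothetical two-record export file, kept in memory for the demonstration.
sample = io.StringIO(
    '{"format": "JSON Scrapbook", "version": 1, "type": "export"}\n'
    '{"item": {"type": "shelf", "uuid": "A1", "title": "demo"}}\n'
    '{"item": {"type": "bookmark", "uuid": "B2", "parent": "A1", "title": "example"}}\n'
)
metadata, records = read_jsbk(sample)
print(metadata["type"], len(records))
```

For a file on disk, the same function could be called on a handle opened with open(path, encoding="utf-8").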

Objects

<file metadata>

{
  "format":    "JSON Scrapbook",                     // String, mandatory
  "version":   1,                                    // Integer, mandatory
  "type":      "index",                              // String, mandatory
  "contains":  "everything",                         // String, mandatory
  "generator": "Scrapyard",                          // String, optional
  "uuid":      "620B64F084BA449A953FC80EEC4F8D27",   // String, mandatory
  "name":      "default",                            // String, optional 
  "entities":  1000,                                 // Integer, mandatory   
  "timestamp": 1645554142000,                        // Integer, mandatory   
  "date":      "2022-02-22T22:22:22.222",            // String, optional
  "comment":   "File comment"                        // String, optional  
}
Notes
  • The "type" field designates the physical file layout and may take the following values:
    • "index"
    • "export"
  • The "contains" field designates the type of item objects contained inside the file:
    • "shelves" - the contained item object may be of type "shelf".
    • "folders" or missing - the file represents an export of a single shelf; there are no item objects of type "shelf".
  • The "name" field contains the original name of the exported shelf.
  • The "entities" field contains the number of objects in the file.
  • The "timestamp" field contains a timestamp (ms) of the last modification.
  • The "date" field contains a human-readable date of the last modification and is optional.
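
A reader of the format might want to check the metadata line before processing the rest of the file. The sketch below is one possible check, not Scrapyard's own validation. It follows the "mandatory" markers in the listing above, except for "contains", which is left out because the notes allow it to be missing. The validate_metadata name and the error messages are hypothetical.

```python
# Mandatory fields and their JSON types, taken from the listing above.
# "contains" is deliberately omitted: the notes say it may be missing.
MANDATORY_FIELDS = {
    "format": str,
    "version": int,
    "type": str,
    "uuid": str,
    "entities": int,
    "timestamp": int,
}


def validate_metadata(meta):
    """Return a list of problems found in a file metadata object."""
    problems = []
    for field, expected in MANDATORY_FIELDS.items():
        if field not in meta:
            problems.append(f"missing mandatory field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(meta[field], bool) or not isinstance(meta[field], expected):
            problems.append(f"wrong type for field: {field}")
    if isinstance(meta.get("type"), str) and meta["type"] not in ("index", "export"):
        problems.append(f"unknown layout type: {meta['type']}")
    return problems


good = {
    "format": "JSON Scrapbook", "version": 1, "type": "index",
    "uuid": "620B64F084BA449A953FC80EEC4F8D27",
    "entities": 1000, "timestamp": 1645554142000,
}
bad = {"format": "JSON Scrapbook", "version": 1, "type": "zip",
       "entities": 1000, "timestamp": 1645554142000}
print(validate_metadata(good))
print(validate_metadata(bad))
```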

item

Contains metadata of a shelf, folder, bookmark, or archive.

{
  "type":             "archive",                          // String, mandatory
  "uuid":             "50593137CE5046BCA0A580B0A00430E9", // String, mandatory
  "parent":           "F7A5DC6862C2459B986CCE85F2394AC0", // String, mandatory
  "title":            "This is an example",               // String, mandatory
  "url":              "http://www.example.com",           // String, optional
  "content_type":     "text/html",                        // String, optional
  "contains":         "text",                             // String, optional
  "size":             1024,                               // Integer, optional
  "tags":             "comma,separated",                  // String, optional 
  "todo_state":       "TODO",                             // String, optional
  "todo_date":        "2022-02-22",                       // String, optional
  "todo_pos":         0,                                  // Integer, optional
  "details":          "TODO details",                     // String, optional  
  "date_added":       1570076393657,                      // Integer, mandatory 
  "date_modified":    1663500045342,                      // Integer, mandatory
  "content_modified": 1663500045342,                      // Integer, optional
  "has_icon":         true,                               // Boolean, optional
  "has_comments":     false,                              // Boolean, optional
  "has_notes":        false,                              // Boolean, optional
  "is_site":          false,                              // Boolean, optional
  "pos":              1                                   // Integer, optional
}
Notes
  • The "uuid" and "parent" fields contain an RFC 4122 v4 UUID. The special UUID "default" represents the default shelf in Scrapyard.
  • The "type" field may take the following values:
    • "shelf"
    • "folder"
    • "bookmark"
    • "archive"
    • "separator"
    • "notes"
  • The "is_site" field is set for the folders that contain pages created during site capture.
  • The "contains" field may take the following values:
    • "text" or missing:
      • Index layout: the archive_content.blob file contains UTF-8-encoded text.
      • Export layout: the archive object contains a JSON string that does not require any encoding.
    • "bytes":
      • Index layout: the archive_content.blob file contains raw bytes of the archive.
      • Export layout: the archive object contains a JSON string with the Base64-encoded bytes of the archive.
    • "files":
      • Index layout: the archive is unpacked, its files are stored in the archive subfolder.
      • Export layout: the archive object contains a JSON string with the Base64-encoded bytes of a ZIP-archive.
  • The "content_type" field contains the MIME type of the archived content. "text/html" is assumed if omitted.
  • The "size" field contains an approximate size of the archive in bytes.
  • The "date_modified" field contains a timestamp representing the time when the item object of the archive was modified.
  • The "content_modified" field contains a timestamp representing the time when the archive, comments, or notes objects of the archive were modified.
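
The export-layout rules for the "contains" field can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the archive object is taken to be a dictionary with a "content" string, as described in the archive section, and the decode_export_content function is a hypothetical helper.

```python
import base64
import io
import zipfile


def decode_export_content(item, archive):
    """Decode the "content" string of an export-layout archive object."""
    contains = item.get("contains", "text")  # a missing field means "text"
    content = archive["content"]
    if contains == "text":
        return content  # plain JSON string, no further decoding
    raw = base64.b64decode(content)
    if contains == "bytes":
        return raw  # raw bytes of the archive
    if contains == "files":
        # The bytes form a ZIP archive; return its members as {name: bytes}.
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
    raise ValueError(f"unknown 'contains' value: {contains}")


# Build a tiny in-memory ZIP archive to exercise the "files" branch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("index.html", "<html></html>")
packed = base64.b64encode(buffer.getvalue()).decode("ascii")

text = decode_export_content({}, {"content": "<html></html>"})
data = decode_export_content({"contains": "bytes"}, {"content": "AAEC"})
files = decode_export_content({"contains": "files"}, {"content": packed})
print(sorted(files))
```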

In the export layout, items are sorted in such a way that no descendant appears before its parent object.
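
One way to produce or verify such an ordering is a breadth-first walk from the items whose parent is not in the file. The sketch below is an illustration only and is not taken from Scrapyard. The function names are hypothetical, the "pos" field is ignored, and items caught in a parent cycle are silently dropped by parents_first.

```python
from collections import defaultdict, deque


def is_parents_first(items):
    """Check that no item precedes a parent that is present in the list."""
    known = {item["uuid"] for item in items}
    seen = set()
    for item in items:
        parent = item.get("parent")
        if parent in known and parent not in seen:
            return False
        seen.add(item["uuid"])
    return True


def parents_first(items):
    """Reorder items so that every parent comes before its descendants."""
    known = {item["uuid"] for item in items}
    children = defaultdict(list)
    queue = deque()
    for item in items:
        parent = item.get("parent")
        if parent in known:
            children[parent].append(item)
        else:
            queue.append(item)  # root: its parent is not in this file
    ordered = []
    while queue:
        item = queue.popleft()
        ordered.append(item)
        queue.extend(children[item["uuid"]])
    return ordered


# Hypothetical items with shortened UUIDs, listed in the wrong order.
shuffled = [
    {"uuid": "B", "parent": "F", "type": "bookmark"},
    {"uuid": "F", "parent": "S", "type": "folder"},
    {"uuid": "S", "type": "shelf"},
]
ordered = parents_first(shuffled)
print([item["uuid"] for item in ordered])
```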

icon

Contains the icon image represented as a data URL.

{
  "url": "..." // String, mandatory
}
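
To recover the image bytes from such a data URL, a consumer could use a decoder along these lines. The sketch follows the general data URL syntax (media type, optional ";base64" marker, comma, payload). The decode_data_url function is hypothetical, and the sample URL carries only the eight-byte PNG signature rather than a real icon.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_url(url):
    """Return (media_type, data) for a data URL."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, _, payload = url[len("data:"):].partition(",")
    media_type, *params = header.split(";")
    if "base64" in params:
        data = base64.b64decode(payload)
    else:
        data = unquote_to_bytes(payload)  # percent-encoded payload
    # An empty media type defaults to text/plain in the data URL scheme.
    return media_type or "text/plain", data


icon = {"url": "data:image/png;base64,iVBORw0KGgo="}
media_type, data = decode_data_url(icon["url"])
print(media_type, len(data))
```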

archive

Incorporates the archived content.

{
  "content": "..." // String, mandatory
}
Notes

In the index layout, archive content is stored inside the archive_content.blob file.
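
A minimal sketch of reading archive content from the index layout is given below. It assumes the directory structure from the listing in the Physical File Layouts section (objects/<uuid>/archive_content.blob and objects/<uuid>/archive) and the "contains" rules from the item notes. The read_index_content function is a hypothetical helper, and the demonstration builds a throwaway directory instead of touching a real scrapbook.

```python
import tempfile
from pathlib import Path


def read_index_content(root, item):
    """Read the archived content of an item stored in the index layout."""
    object_dir = Path(root) / "objects" / item["uuid"]
    contains = item.get("contains", "text")  # a missing field means "text"
    if contains == "files":
        # Unpacked archive: collect every file under the archive subfolder.
        archive_dir = object_dir / "archive"
        return {
            path.relative_to(archive_dir).as_posix(): path.read_bytes()
            for path in sorted(archive_dir.rglob("*"))
            if path.is_file()
        }
    blob = object_dir / "archive_content.blob"
    if contains == "bytes":
        return blob.read_bytes()
    return blob.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    uuid = "50593137CE5046BCA0A580B0A00430E9"
    object_dir = Path(tmp) / "objects" / uuid
    (object_dir / "archive").mkdir(parents=True)
    (object_dir / "archive_content.blob").write_text("<html>hi</html>", encoding="utf-8")
    (object_dir / "archive" / "index.html").write_bytes(b"<html></html>")

    as_text = read_index_content(tmp, {"uuid": uuid})
    as_bytes = read_index_content(tmp, {"uuid": uuid, "contains": "bytes"})
    as_files = read_index_content(tmp, {"uuid": uuid, "contains": "files"})

print(as_text, sorted(as_files))
```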

archive_index

Contains the text index of the archive.

{
  "content": ["index", "content"] // Array of String, mandatory
}
Notes

Omitted in the export file layout.

notes

{
  "format": "text", // String, mandatory
  "content": "...", // String, mandatory
  "html": "..."     // String, optional
}
Notes

Contains the notes attached to the corresponding item.

  • The "format" field may take the following values:
    • "text" - plain text.
    • "html" - HTML.
    • "markdown" - Markdown-formatted text.
    • "org" - Org-formatted text.
    • "delta" - format used by the Quill text editor.
  • The "html" field contains the representation of the notes rendered as HTML.
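
A consumer that only needs to display notes might prefer the pre-rendered "html" field and fall back to the raw content where that is safe. The sketch below shows one such policy. It is an assumption of this example rather than documented Scrapyard behavior, and the notes_to_html function is hypothetical. Formats that need a dedicated renderer ("markdown", "org", "delta") return None when no "html" field is present.

```python
import html


def notes_to_html(notes):
    """Pick an HTML representation for a notes object, if one is available."""
    if notes.get("html"):
        return notes["html"]  # pre-rendered representation
    fmt = notes["format"]
    if fmt == "html":
        return notes["content"]
    if fmt == "text":
        # Escape plain text so that it displays literally.
        return "<pre>" + html.escape(notes["content"]) + "</pre>"
    return None  # rendering this format needs an external renderer


rendered = notes_to_html(
    {"format": "markdown", "content": "**hi**", "html": "<p><strong>hi</strong></p>"}
)
plain = notes_to_html({"format": "text", "content": "a < b"})
missing = notes_to_html({"format": "org", "content": "* heading"})
print(rendered, plain, missing)
```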

notes_index

Contains the text index of the notes.

{
  "content": ["index", "content"] // Array of String, mandatory
}
Notes

Omitted in the export file layout.

comments

Contains the text of item comments.

{
  "content": "..." // String, mandatory
}

comments_index

Contains the text index of the comments.

{
  "content": ["index", "content"] // Array of String, mandatory
}
Notes

Omitted in the export file layout.