Update Webrecorder tools mentioned in readme #36

Open: wants to merge 2 commits into base: master
2 changes: 1 addition & 1 deletion README.rst
@@ -3,7 +3,7 @@ WARCIT

``warcit`` is a command-line tool to convert on-disk directories of web documents (commonly HTML, web assets and any other data files) into ISO standard web archive (WARC) files.

-Conversion to WARC file allows for improved durability in a standardized format, and allows for any web files stored on disk to be uploaded into `Webrecorder <https://github.com/webrecorder/webrecorder>`_, or replayed locally with `Webrecorder Player <https://github.com/webrecorder/webrecorderplayer-electron/releases>`_ or `pywb <https://github.com/ikreymer/pywb>`_
+Conversion to WARC file allows for improved durability in a standardized format, and allows any website files stored on disk to be replayed locally through `ReplayWeb.page <https://webrecorder.net/replaywebpage/>`_ or `pywb <https://github.com/webrecorder/pywb>`_

(Many other tools also operate on WARC files, see: `Awesome Web Archiving -- Tools and Software <https://github.com/iipc/awesome-web-archiving#tools--software>`_)

26 changes: 13 additions & 13 deletions conversions-and-transclusions.md
@@ -1,6 +1,6 @@
# WARCIT Media Conversion Workflow

-With the 0.4.0, warcit introduces a new workflow for converting video/audio files into web-friendly formats
+With the 0.4.0 release, warcit introduces a new workflow for converting video/audio files into web-friendly formats
and then placing them into WARCs along with transclusion metadata to enable access from within a containing page.

To allow for maximum flexibility, this process is split into two phases: conversion and transclusion WARC creation.
@@ -14,7 +14,7 @@ converted files into a separate directory, recreating the same directory structu

For example, given a directory structure:

```
```txt
- data/
- videos/
- video_file.flv
@@ -27,15 +27,15 @@ Running:
warcit-converter http://www.example.com/ ./data/
```

-with the default rules will result in converted files written into `./conversions` directory (by default).
+with the default rules will result in converted files written into `./conversions` directory (by default).

The full url of each file, as with warcit, is created by prepending the prefix to the path in the directory.

The input media in this example would have a full url of `http://example.com/media/video_file.flv` and
`http://example.com/media/an_audio_file.ra`. The converted files simply have additional extensions added
for the full url, such as: `http://example.com/media/video_file.flv.mp4`, `http://example.com/media/an_audio_file.ra.webm`, etc...
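
The URL rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (prefix plus path relative to the input directory, with the new extension appended for converted files), not warcit's actual code; the helper names `full_url` and `converted_url` are hypothetical.

```python
from pathlib import PurePosixPath


def full_url(prefix: str, root_dir: str, file_path: str) -> str:
    """Prepend the URL prefix to the file's path relative to root_dir."""
    # Hypothetical helper: warcit's own implementation may differ.
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root_dir))
    return prefix.rstrip("/") + "/" + rel.as_posix()


def converted_url(url: str, ext: str) -> str:
    """A converted file keeps the original URL and gains one extra extension."""
    return url + "." + ext.lstrip(".")


original = full_url("http://example.com/", "./data", "./data/media/video_file.flv")
print(original)
print(converted_url(original, "mp4"))
```

Appending the extension rather than replacing it keeps the converted URL visibly tied to the original file.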

```
```txt
- data/
- media/
- video_file.flv
@@ -61,9 +61,9 @@ The [default rule set](https://github.com/webrecorder/warcit/blob/video-conversi

The current output formats are two web-focused formats and a preservation format:

-* .webm -- vpx9 + opus encoded video + audio, an open format for the web
-* .mp4 -- H.264 + AAC encoded video + audio, primarily for Safari and Apple based platforms.
-* .mkv -- [FFV1](https://en.wikipedia.org/wiki/FFV1) codec in a Matroska container.
+- .webm -- vpx9 + opus encoded video + audio, an open format for the web
+- .mp4 -- H.264 + AAC encoded video + audio, primarily for Safari and Apple based platforms.
+- .mkv -- [FFV1](https://en.wikipedia.org/wiki/FFV1) codec in a Matroska container.

(For audio only content, .webm, .mp3 and .flac are used instead)

@@ -204,6 +204,7 @@ objects. Further, it is possible that additional transclusions + conversions may
And, different versions of a page may have different numbers of videos.

For example:

- A page has multiple videos but only one was initially available. Later, additional content with two more videos was discovered.
- A page has one video that needed to be converted and one that played natively in the current browser. Later, the other video also needed to be converted.
- An initial capture of a page has two videos that were converted. A later capture has only one video (the other was removed, or shifted to a new page, etc...)
@@ -232,7 +233,7 @@ For a given entry, `urn:embeds:http://example.com/watch_page.html`, 2 videos may
However, at a later time, another transclusion is discovered for the same page and
added with a new metadata record:

```
```json
{
"transclusions":
{"http://example.com/media/yet_another_video.flv": {..., "formats": {...}},
@@ -246,7 +247,7 @@ added with a new metadata record:
When loading the page `20160102/http://www.example.com/watch_video.html`, both transclusion
metadata records will be loaded, and all 3 videos will be readded to the page, if possible.

```
```txt
- "http://example.com/media/video_file.flv"
- "http://example.com/media/another_video_file.flv"
- "http://example.com/media/yet_another_video.flv"
@@ -260,7 +261,7 @@ the actual creation date is set in the `WARC-Creation-Date` header.
However, if a later version of the same page contains *different* transclusions, only those transclusions
should be loaded. For example, the `20170102` version of the page may have only one video:

```
```json
{
"transclusions":
{"http://example.com/media/video_file.flv": {..., "formats": {...}},
@@ -277,7 +278,7 @@ When replaying a particular page, all of the exact match transclusions will be u

* When replaying `20160203/http://www.example.com/watch_video.html`, the closest transclusion metadata are:

```
```txt
20160102/http://www.example.com/watch_video.html -- 2 videos
20160102/http://www.example.com/watch_video.html -- 1 video
```
@@ -286,12 +287,11 @@ Since there two records at the exact same timestamp, they will be combined and 3

* When replaying `20170203/http://www.example.com/watch_video.html` the closest transclusion metadata record is:

```
```txt
20170102/http://www.example.com/watch_video.html -- 1 video
```

Since there is only one match, the 1 video from this record is used. Additional transclusion records
farther away are not searched.

If additional captures require custom sets of transclusions, additional records can be added at the exact capture time.
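
The selection rule described above (use the records at the closest timestamp, combine records that share that timestamp, ignore records farther away) can be sketched as follows. This is a minimal illustration under stated assumptions, not the replay system's actual code: the `(timestamp, urls)` record shape and the names `parse_ts` and `select_transclusions` are hypothetical, and "closest" is assumed to mean the smallest absolute time difference.

```python
from datetime import datetime

# Template used to pad short timestamps (e.g. "20160102") to 14 digits.
_PAD = "00000101000000"


def parse_ts(ts: str) -> datetime:
    """Parse a 4- to 14-digit capture timestamp, padding missing fields."""
    return datetime.strptime(ts + _PAD[len(ts):], "%Y%m%d%H%M%S")


def select_transclusions(records, replay_ts):
    """Return the URLs from all records at the timestamp closest to replay_ts.

    `records` is a list of (timestamp, [urls]) pairs (a hypothetical shape).
    Records sharing the closest timestamp are merged; others are ignored.
    """
    if not records:
        return []
    target = parse_ts(replay_ts)
    closest = parse_ts(min(records, key=lambda r: abs(parse_ts(r[0]) - target))[0])
    urls = []
    for ts, rec_urls in records:
        if parse_ts(ts) == closest:
            for url in rec_urls:
                if url not in urls:  # keep order, drop duplicates
                    urls.append(url)
    return urls


records = [
    ("20160102", ["http://example.com/media/video_file.flv",
                  "http://example.com/media/another_video_file.flv"]),
    ("20160102", ["http://example.com/media/yet_another_video.flv"]),
    ("20170102", ["http://example.com/media/video_file.flv"]),
]

print(select_transclusions(records, "20160203"))  # three videos, merged from two records
print(select_transclusions(records, "20170203"))  # one video
```

Merging only the records that share the closest timestamp is what lets a later capture carry a smaller, different set of transclusions without inheriting the earlier ones.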