"How does it work?" diagrams (WIP)
Matteo Cargnelutti edited this page Mar 22, 2023 · 4 revisions
Simplified diagrams to explain how Scoop captures the web.
flowchart LR
A[Scoop]
B[Playwright]
C[Chromium]
D[Website]
E[HTTP Proxy]
A <--> |Controls| B
B <--> C
C <--> D
A <-.-> |Capture| E <-.-> C
Unless specified otherwise, everything the browser "sees" is captured via the HTTP proxy, which allows for the enforcement of time and size constraints while preserving partial responses.
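As a rough illustration of how a size constraint can be enforced while still preserving a partial response, here is a minimal sketch that assumes the proxy sees a response body as a sequence of chunks. The name `collectWithLimit` and the `maxBytes` parameter are hypothetical and are not part of Scoop's API. A time budget could be applied in the same spirit.

```javascript
/**
 * Accumulates body chunks until a byte budget is exhausted.
 * Hypothetical helper for illustration: not Scoop's actual implementation.
 *
 * @param {Iterable<Uint8Array>} chunks - Body chunks as seen by the proxy (assumed shape).
 * @param {number} maxBytes - Size budget for this exchange (assumed option).
 * @returns {{ body: Uint8Array, truncated: boolean }}
 */
function collectWithLimit(chunks, maxBytes) {
  const kept = [];
  let total = 0;
  let truncated = false;

  for (const chunk of chunks) {
    const remaining = maxBytes - total;
    if (chunk.byteLength > remaining) {
      // Keep only the part that still fits, then stop reading:
      // what was received so far is preserved as a partial response.
      if (remaining > 0) {
        kept.push(chunk.subarray(0, remaining));
        total += remaining;
      }
      truncated = true;
      break;
    }
    kept.push(chunk);
    total += chunk.byteLength;
  }

  // Merge the kept chunks into one contiguous buffer.
  const body = new Uint8Array(total);
  let offset = 0;
  for (const part of kept) {
    body.set(part, offset);
    offset += part.byteLength;
  }
  return { body, truncated };
}
```

The `truncated` flag lets the caller record that the exchange was cut short instead of discarding it.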
flowchart LR
A[Scoop]
B[curl, yt-dlp ...]
C[Resource]
D[HTTP Proxy]
A <--> |Controls| B
B <--> C
A <-.-> |Capture| D <-.-> B
Unless specified otherwise, everything captured "out of band" goes through the HTTP proxy.
flowchart TD
A(Url)
B(Options)
A-->C
B-->C
C[Scoop class]
C-->D
D([Filter options])
D-->E
E([Filter url])
E-->F
F{{Ready to capture}}
- Filter options: Defaults are used for any option that is not explicitly provided.
- Filter url: The URL must be well-formed and must not match the blocklist.
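The two filtering steps could look roughly like the sketch below. The option names, default values, and blocklist patterns are placeholders chosen for illustration, and `filterOptions` / `filterUrl` are hypothetical function names rather than a description of Scoop's internals.

```javascript
// Hypothetical defaults and blocklist, for illustration only.
const DEFAULTS = { captureTimeout: 60000, maxCaptureSize: 200 * 1024 * 1024, screenshot: true };
const BLOCKLIST = [/^https?:\/\/localhost/i, /^https?:\/\/127\./];

/**
 * Returns a complete options object: known options keep the caller's value,
 * missing ones fall back to defaults, unknown keys are dropped.
 */
function filterOptions(options = {}) {
  const filtered = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS)) {
    if (options[key] !== undefined) {
      filtered[key] = options[key];
    }
  }
  return filtered;
}

/**
 * Returns the normalized URL, or throws if it is malformed or blocklisted.
 */
function filterUrl(url) {
  let parsed;
  try {
    parsed = new URL(url); // Throws a TypeError on malformed input.
  } catch {
    throw new Error(`Invalid URL: ${url}`);
  }
  if (BLOCKLIST.some((pattern) => pattern.test(parsed.href))) {
    throw new Error(`URL matches blocklist: ${parsed.href}`);
  }
  return parsed.href;
}
```

Only once both functions return without throwing would the instance be considered ready to capture.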
stateDiagram-v2
state "Start Browser" as browser
state "Start Proxy" as intercepter
state "Detect non-web resource" as nonwebdetect
state "Capture of non-web resource" as nonwebcapture
state "Initial page load" as pageload
state "Capture page info" as pageinfo
state "Browser scripts" as browserscripts
state "Network idle" as networkidle
state "Scroll up" as scrollup
state "Screenshot" as screenshot
state "DOM snapshot*" as domsnapshot
state "PDF snapshot*" as pdfsnapshot
state "Capture video(s) as attachments" as capturevideo
state "Detect noarchive directive" as detectnoarchive
state "Capture of certificates" as certscapture
state "Gather Provenance Info" as provenanceinfo
state "Teardown" as teardown
[*] --> browser
browser --> intercepter
intercepter --> nonwebdetect
nonwebdetect --> nonwebcapture
nonwebdetect --> pageload
nonwebcapture --> certscapture
pageload --> pageinfo
pageinfo --> browserscripts
browserscripts --> networkidle
networkidle --> scrollup
scrollup --> screenshot
screenshot --> domsnapshot
domsnapshot --> pdfsnapshot
pdfsnapshot --> capturevideo
capturevideo --> detectnoarchive
detectnoarchive --> certscapture
certscapture --> provenanceinfo
provenanceinfo --> teardown
teardown --> [*]
- Steps marked with * are deactivated by default.
- Unless specified otherwise
- Capture state is used to determine whether the next step should run.
- Each step counts towards the overall capture time and size limits, unless specified otherwise via an option flag. See options list for details.
- At the end of this capture process, Scoop holds everything it captured in memory, as state of the Scoop class.
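The notes above can be sketched as a small step runner. This is a simplified model under stated assumptions: the state names, the `enabled` flag, and the `startedAt` / `timeoutMs` fields are all hypothetical, and Scoop's real step handling may differ.

```javascript
// Hypothetical state names; Scoop's actual states may be named differently.
const STATES = { CAPTURE: "CAPTURE", PARTIAL: "PARTIAL" };

/**
 * Runs capture steps in order and returns the names of the steps that ran.
 *
 * @param {Array<{name: string, enabled?: boolean, run: Function}>} steps
 * @param {{state: string, startedAt: number, timeoutMs: number}} ctx
 * @returns {Promise<string[]>}
 */
async function runSteps(steps, ctx) {
  const ran = [];
  for (const step of steps) {
    // Steps marked with * in the diagram are skipped unless switched on.
    if (step.enabled === false) continue;

    // Capture state decides whether the next step runs at all.
    if (ctx.state !== STATES.CAPTURE) break;

    // Every step draws from the same overall time budget.
    if (Date.now() - ctx.startedAt > ctx.timeoutMs) {
      ctx.state = STATES.PARTIAL;
      break;
    }

    await step.run(ctx);
    ran.push(step.name);
  }
  return ran;
}
```

A step can end the sequence early by changing `ctx.state`, for example when a size limit is reached while it runs.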
flowchart TD
A(Scoop Instance)
B(gzip flag*)
A-->C
B-->C
C[scoopToWARC]
C-->D
D([Check capture state])
D-->E
E([Generate WARC info section])
E-->F
F([Generate WARC records section])
F-->G
G([Merge sections])
G-->H
H[WARC as ArrayBuffer]
Notes:
- The WARC sections are generated using warcio.js, which also handles per-segment GZIP compression
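The "Merge sections" step can be pictured as plain byte concatenation, assuming each section is already serialized as a `Uint8Array`. Independently gzipped members remain a valid gzip stream when concatenated, so the same merge would apply whether or not compression is on. The name `mergeSections` is hypothetical, and the sketch deliberately does not call warcio.js.

```javascript
/**
 * Concatenates already-serialized sections into a single ArrayBuffer.
 * Hypothetical helper for illustration only.
 *
 * @param {Uint8Array[]} sections - For example, the info section followed by the records section.
 * @returns {ArrayBuffer}
 */
function mergeSections(sections) {
  const total = sections.reduce((sum, section) => sum + section.byteLength, 0);
  const merged = new Uint8Array(total);

  let offset = 0;
  for (const section of sections) {
    merged.set(section, offset); // Copy this section right after the previous one.
    offset += section.byteLength;
  }
  return merged.buffer;
}
```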
flowchart TD
A(Scoop Instance)
B(RAW flag*)
C(Signing server info*)
A-->D
B-->D
C-->D
D[scoopToWACZ]
D-->E
E([Check capture state])
E-->F
F([scoopToWARC])
F-->G
G([js-wacz on WARC])
H[Signing Server]
G<-.->|Optional| H
J[WACZ as ArrayBuffer]
G-->J
I[RAW exchanges]
G<-.->|Optional|I
stateDiagram-v2
state "Invoke" as args
state "Parse options" as options
state "Scoop.capture(url, options)" as capture
state "capture.toWARC / toWACZ" as export
state "Save to disk" as save
state "Exit with error code" as exit
[*] --> args
args --> options
options --> capture
capture --> export: Scoop Instance
export --> save: ArrayBuffer
save --> exit
exit --> [*]
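The command-line flow in the last diagram could be modeled as below. This is a sketch under assumptions: `deps.capture` stands in for `Scoop.capture(url, options)`, `deps.save` stands in for writing to disk, and the `--format` / `--output` flag names and the exit codes 0 and 1 are illustrative choices, not a statement about Scoop's actual CLI.

```javascript
/** Parses a minimal, hypothetical argument list: <url> [--format warc|wacz] [--output path]. */
function parseArgs(argv) {
  const args = { url: undefined, format: "wacz", output: undefined };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "--format") args.format = argv[++i];
    else if (argv[i] === "--output") args.output = argv[++i];
    else args.url = argv[i];
  }
  if (!args.url) throw new Error("Missing URL");
  return args;
}

/**
 * Invoke -> parse options -> capture -> export -> save -> exit code.
 *
 * @param {string[]} argv - Arguments after the program name.
 * @param {{capture: Function, save: Function}} deps - Injected stand-ins (assumed shape).
 * @returns {Promise<number>} Exit code: 0 on success, 1 on any error.
 */
async function runCli(argv, deps) {
  try {
    const { url, format, output = `archive.${format}` } = parseArgs(argv);
    const capture = await deps.capture(url, {}); // Yields a Scoop instance.
    const bytes = format === "warc" ? await capture.toWARC() : await capture.toWACZ();
    await deps.save(output, bytes); // Receives the exported ArrayBuffer.
    return 0;
  } catch (err) {
    return 1;
  }
}
```

Passing the capture and save functions in as dependencies keeps the sketch runnable without a browser or network access.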