Documentation (#69)
* Bump versions of GitHub workflow actions

* Improved a lot of documentation

* Moved README.md to docs directory and updated package documentation. This avoids README showing up in godoc

* Fix for #70 Misleading comment
johnerikhalse authored Dec 11, 2023
1 parent 614c93d commit b976c17
Showing 10 changed files with 208 additions and 63 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -11,7 +11,7 @@ jobs:
steps:
- uses: actions/setup-go@v4
with:
go-version: 1.17
go-version: 1.19
- name: Checkout
uses: actions/checkout@v4
- name: golangci-lint
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -11,5 +11,5 @@ jobs:
- uses: actions/checkout@v4
- uses: actions/setup-go@v4
with:
go-version: '^1.17.0'
go-version: '^1.19.0'
- run: go test ./...
26 changes: 20 additions & 6 deletions doc.go
@@ -15,18 +15,32 @@
*/

/*
Package gowarc allows parsing, creating and validating WARC-records.
Reading, writing and validating WARC-files is also supported.
Package gowarc provides a framework for handling WARC files, enabling their parsing, creation, and validation.
WARC
# WARC Overview
The WARC format offers a standard way to structure, manage and store billions of resources collected from the web and elsewhere.
It is used to build applications for harvesting, managing, accessing, mining and exchanging content.
To learn more about the WARC standard, read the specification at https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
For more details, visit the WARC specification: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
Creating a WARC record
# WARC record creation
To create a WARC record.
The [WarcRecordBuilder], initialized via [NewRecordBuilder], is the primary tool for creating WARC records.
By default, the WarcRecordBuilder generates a record id and calculates the 'Content-Length' and 'WARC-Block-Digest'.
Use [WarcFileWriter], initialized with [NewWarcFileWriter], to write WARC files.
# WARC record parsing
To parse single WARC records, use the [Unmarshaler] initialized with [NewUnmarshaler].
To read entire WARC files, employ the [WarcFileReader] initialized through [NewWarcFileReader].
# Validation and repair
The gowarc package supports validation during both the creation and parsing of WARC records.
Control over the scope of validation and the handling of validation errors can be achieved by setting the appropriate
options in the [WarcRecordBuilder], [Unmarshaler], or [WarcFileReader].
*/
package gowarc
20 changes: 8 additions & 12 deletions README.md → docs/README.md
@@ -1,12 +1,11 @@
![Lint](https://github.com/nlnwa/gowarc/workflows/golangci-lint/badge.svg)
![GoReleaser](https://github.com/nlnwa/gowarc/workflows/goreleaser/badge.svg)
[![Release](https://img.shields.io/github/release/nlnwa/gowarc.svg)](https://github.com/nlnwa/gowarc/releases/latest)
[![License](https://img.shields.io/github/license/nlnwa/gowarc)](/LICENSE)
[![PkgGoDev](https://pkg.go.dev/badge/github.com/nlnwa/gowarc)](https://pkg.go.dev/github.com/nlnwa/gowarc)

> This project is currently in alpha. Expect API changes and enhanced documentation to come.
# gowarc

A library for creating, parsing and evaluating WARC-records, written in go.
A library for creating, parsing and evaluating WARC-files, written in go.

### What is WARC?

@@ -26,6 +25,8 @@ $ go get github.com/nlnwa/gowarc

#### Create a new WARC record

To get you started, here is a simple example of how to create a new WARC record.

```go
package main

@@ -54,16 +55,11 @@ func main() {
}
```

#### Expected output

```
WARC record: version: WARC/1.1, type: response, id: <urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>
```

### godoc

For complete documentation and examples consult the godoc online at: https://pkg.go.dev/github.com/nlnwa/gowarc

## Command line

https://github.com/nlnwa/warchaeology is a command line tool that use gowarc.
## Command line tools

[warchaeology](https://github.com/nlnwa/warchaeology) is a command line tool based on gowarc.
96 changes: 87 additions & 9 deletions example_test.go
@@ -14,26 +14,104 @@
* limitations under the License.
*/

package gowarc
package gowarc_test

import "fmt"
import (
	"bufio"
	"bytes"
	"fmt"
	"github.com/nlnwa/gowarc"
	"io"
)

func Example_basic() {
	builder := NewRecordBuilder(Response)
func ExampleNewRecordBuilder() {
	builder := gowarc.NewRecordBuilder(gowarc.Response)
	_, err := builder.WriteString("HTTP/1.1 200 OK\nDate: Tue, 19 Sep 2016 17:18:40 GMT\nServer: Apache/2.0.54 (Ubuntu)\n" +
		"Last-Modified: Mon, 16 Jun 2013 22:28:51 GMT\nETag: \"3e45-67e-2ed02ec0\"\nAccept-Ranges: bytes\n" +
		"Content-Length: 19\nConnection: close\nContent-Type: text/plain\n\nThis is the content")
	if err != nil {
		panic(err)
	}
	builder.AddWarcHeader(WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
	builder.AddWarcHeader(WarcDate, "2006-01-02T15:04:05Z")
	builder.AddWarcHeader(ContentLength, "257")
	builder.AddWarcHeader(ContentType, "application/http;msgtype=response")
	builder.AddWarcHeader(WarcBlockDigest, "sha1:B285747AD7CC57AA74BCE2E30B453C8D1CB71BA4")
	builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
	builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
	builder.AddWarcHeader(gowarc.ContentLength, "257")
	builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")
	builder.AddWarcHeader(gowarc.WarcBlockDigest, "sha1:B285747AD7CC57AA74BCE2E30B453C8D1CB71BA4")

	if wr, v, err := builder.Build(); err == nil {
		fmt.Println(wr, v)
	}
	// Output: WARC record: version: WARC/1.1, type: response, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
}
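The example above sets `Content-Length` and `WARC-Block-Digest` by hand. To show how values of that shape relate to the block bytes, here is a minimal standard-library sketch. The helper names `blockDigest` and `contentLength` are hypothetical, and the uppercase hex encoding is an assumption chosen to match the look of the digests in this diff; it does not claim to be how gowarc computes them.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"strconv"
)

// blockDigest returns a "sha1:<HEX>" label for the block. The name and the
// uppercase hex encoding are assumptions made for this sketch.
func blockDigest(block []byte) string {
	sum := sha1.Sum(block)
	return fmt.Sprintf("sha1:%X", sum[:])
}

// contentLength returns the block size in bytes as a decimal header value.
func contentLength(block []byte) string {
	return strconv.Itoa(len(block))
}

func main() {
	block := []byte("abc")
	fmt.Println("Content-Length:", contentLength(block))
	fmt.Println("WARC-Block-Digest:", blockDigest(block))
}
```

Both values depend only on the block bytes, so a reader can recompute them to check a record against its headers.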

func ExampleUnmarshaler() {
	data := bytes.NewBufferString("  WARC/1.1\r\n" +
		"WARC-Date: 2017-03-06T04:03:53Z\r\n" +
		"WARC-Record-ID: <urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>\r\n" +
		"WARC-Filename: temp-20170306040353.warc.gz\r\n" +
		"WARC-Type: warcinfo\r\n" +
		"Content-Type: application/warc-fields\r\n" +
		"Warc-Block-Digest: sha1:AF4D582B4FFC017D07A947D841E392A821F754F3\r\n" +
		"Content-Length: 34\r\n" +
		"\r\n" +
		"format: WARC File Format 1.1\r\n" +
		"\r\n\r\n")
	input := bufio.NewReader(data)

	// Create a new unmarshaler
	unmarshaler := gowarc.NewUnmarshaler(gowarc.WithSpecViolationPolicy(gowarc.ErrWarn), gowarc.WithSyntaxErrorPolicy(gowarc.ErrWarn))
	wr, off, validation, err := unmarshaler.Unmarshal(input)
	if err == nil {
		fmt.Printf("Offset: %d, %s\n%s", off, wr, validation)
	}

	// Output: Offset: 2, WARC record: version: WARC/1.1, type: warcinfo, id: urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008
	// gowarc: Validation errors:
	// 1: gowarc: record was found 2 bytes after expected offset
	// 2: block: wrong digest: expected sha1:AF4D582B4FFC017D07A947D841E392A821F754F3, computed: sha1:8A936F9FD60D664CF95B1FFB40F1C4093E65BB40
}

func ExampleNewWarcFileWriter() {
	nameGenerator := &gowarc.PatternNameGenerator{Directory: "directory-name"}

	w := gowarc.NewWarcFileWriter(gowarc.WithFileNameGenerator(nameGenerator))
	defer func() {
		w.Close()
	}()

	builder := gowarc.NewRecordBuilder(gowarc.Response, gowarc.WithStrictValidation())
	_, err := builder.WriteString("HTTP/1.1 200 OK\r\nDate: Tue, 19 Sep 2016 17:18:40 GMT\r\nContent-Length: 19 ....")
	if err != nil {
		panic(err)
	}
	builder.AddWarcHeader(gowarc.WarcRecordID, "<urn:uuid:e9a0cecc-0221-11e7-adb1-0242ac120008>")
	builder.AddWarcHeader(gowarc.WarcDate, "2006-01-02T15:04:05Z")
	builder.AddWarcHeader(gowarc.ContentType, "application/http;msgtype=response")

	if wr, _, err := builder.Build(); err == nil {
		w.Write(wr)
	}
}

func ExampleNewWarcFileReader() {
	reader, err := gowarc.NewWarcFileReader("test.warc.gz", 0, gowarc.WithStrictValidation())
	if err != nil {
		fmt.Println("Error creating warc reader:", err)
		return
	}

	for {
		record, _, _, err := reader.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Println("Error reading record:", err)
			return
		}
		fmt.Println("Record type:", record.Type().String())
		fmt.Println("Record version:", record.Version())
		// Do more with record as per needs
	}

}
6 changes: 3 additions & 3 deletions go.mod
@@ -16,8 +16,8 @@ require (
github.com/kr/pretty v0.3.1 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
golang.org/x/net v0.17.0 // indirect
golang.org/x/sys v0.13.0 // indirect
golang.org/x/text v0.13.0 // indirect
golang.org/x/net v0.19.0 // indirect
golang.org/x/sys v0.15.0 // indirect
golang.org/x/text v0.14.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)
16 changes: 8 additions & 8 deletions go.sum
@@ -1034,7 +1034,7 @@ golang.org/x/crypto v0.0.0-20220622213112-05595931fe9d/go.mod h1:IxCIyHEi3zRg3s0
golang.org/x/crypto v0.0.0-20220722155217-630584e8d5aa/go.mod h1:IxCIyHEi3zRg3s0A5j5BB6A9Jmi73HwBIUl50j+osU4=
golang.org/x/crypto v0.0.0-20221012134737-56aed061732a/go.mod h1:IxCIyHEi3zRg3s0A5j5BB6A9Jmi73HwBIUl50j+osU4=
golang.org/x/crypto v0.1.0/go.mod h1:RecgLatLF4+eUMCP1PoPZQb+cVrJcOPbHkTkbkB9sbw=
golang.org/x/crypto v0.14.0/go.mod h1:MVFd36DqK4CsrnJYDkBA3VC4m2GkXAM0PvzMCn4JQf4=
golang.org/x/crypto v0.16.0/go.mod h1:gCAAfMLgwOJRpTjQ2zCCt2OcSfYMTeZVSRtQlPC7Nq4=
golang.org/x/exp v0.0.0-20190121172915-509febef88a4/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/exp v0.0.0-20190510132918-efd6b22b2522/go.mod h1:ZjyILWgesfNpC6sMxTJOJm9Kp84zZh5NQWvqDGG3Qr8=
@@ -1150,8 +1150,8 @@ golang.org/x/net v0.0.0-20221014081412-f15817d10f9b/go.mod h1:YDH+HFinaLZZlnHAfS
golang.org/x/net v0.1.0/go.mod h1:Cx3nUiGt4eDBEyega/BKRp+/AlGL8hYe7U9odMt2Cco=
golang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=
golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=
golang.org/x/net v0.17.0 h1:pVaXccu2ozPjCXewfr1S7xza/zcXTity9cCdXQYSjIM=
golang.org/x/net v0.17.0/go.mod h1:NxSsAGuq816PNPmqtQdLE42eU2Fs7NoRIZrHJAlaCOE=
golang.org/x/net v0.19.0 h1:zTwKpTd2XuCqf8huc7Fo2iSy+4RHPd10s4KzeTnVr1c=
golang.org/x/net v0.19.0/go.mod h1:CfAk/cbD4CthTvqiEl8NpboMuiuOYsAr/7NOjZJtv1U=
golang.org/x/oauth2 v0.0.0-20180821212333-d2e6202438be/go.mod h1:N/0e6XlmueqKjAGxoOufVs8QHGRruUQn6yWY3a++T0U=
golang.org/x/oauth2 v0.0.0-20190226205417-e64efc72b421/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
golang.org/x/oauth2 v0.0.0-20190604053449-0f29369cfe45/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
@@ -1302,15 +1302,15 @@ golang.org/x/sys v0.0.0-20220919091848-fb04ddd9f9c8/go.mod h1:oPkhp1MJrh7nUepCBc
golang.org/x/sys v0.1.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.13.0 h1:Af8nKPmuFypiUBjVoU9V20FiaFXOcuZI21p0ycVYYGE=
golang.org/x/sys v0.13.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.15.0 h1:h48lPFYpsTvQJZF4EKyI4aLHaev3CxivZmv7yZig9pc=
golang.org/x/sys v0.15.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/term v0.0.0-20201117132131-f5c789dd3221/go.mod h1:Nr5EML6q2oocZ2LXRh80K7BxOlk5/8JxuGnuhpl+muw=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=
golang.org/x/term v0.1.0/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=
golang.org/x/term v0.5.0/go.mod h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k=
golang.org/x/term v0.8.0/go.mod h1:xPskH00ivmX89bAKVGSKKtLOWNx2+17Eiy94tnKShWo=
golang.org/x/term v0.13.0/go.mod h1:LTmsnFJwVN6bCy1rVCoS+qHT1HhALEFxKncY3WNNh4U=
golang.org/x/term v0.15.0/go.mod h1:BDl952bC7+uMoWR75FIrCDx79TPU9oHkTZ9yRbYOrX0=
golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.1-0.20180807135948-17ff2d5776d2/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
@@ -1324,8 +1324,8 @@ golang.org/x/text v0.3.8/go.mod h1:E6s5w1FMmriuDzIBO73fBruAKo1PCIq6d2Q6DHfQ8WQ=
golang.org/x/text v0.4.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8=
golang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8=
golang.org/x/text v0.9.0/go.mod h1:e1OnstbJyHTd6l/uOt8jFFHp6TRDWZR/bV3emEE/zU8=
golang.org/x/text v0.13.0 h1:ablQoSUd0tRdKxZewP80B+BaqeKJuVhuRxj/dkrun3k=
golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
golang.org/x/text v0.14.0 h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ=
golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
golang.org/x/time v0.0.0-20180412165947-fbb02b2291d2/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ=
golang.org/x/time v0.0.0-20181108054448-85acf8d2951c/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ=
golang.org/x/time v0.0.0-20190308202827-9d24e82272b4/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ=
35 changes: 33 additions & 2 deletions record.go
@@ -34,28 +34,52 @@ const (
crlfcrlf = "\r\n\r\n" // Carriage return, Newline, Carriage return, Newline
)

// WarcRecord is the interface implemented by types that can represent a WARC record.
// A new instance of WarcRecord is created by a [WarcRecordBuilder].
type WarcRecord interface {
// Version returns the WARC version of the record.
Version() *WarcVersion

// Type returns the WARC record type.
Type() RecordType

// WarcHeader returns the WARC header fields.
WarcHeader() *WarcFields

// Block returns the content block of the record.
Block() Block

// RecordId returns the WARC-Record-ID header field.
RecordId() string

// ContentLength returns the Content-Length header field.
ContentLength() (int64, error)

// Date returns the WARC-Date header field.
Date() (time.Time, error)

// String returns a string representation of the record.
String() string

// Closer closes the record and releases any resources associated with it.
io.Closer

// ToRevisitRecord takes a RevisitRef referencing the record we want to make a revisit of and returns a revisit record.
ToRevisitRecord(ref *RevisitRef) (WarcRecord, error)
// RevisitRef extracts a RevisitRef current record if it is a revisit record.

// RevisitRef extracts a RevisitRef from the current record if it is a revisit record.
RevisitRef() (*RevisitRef, error)

// CreateRevisitRef creates a RevisitRef which references the current record.
//
// The RevisitRef might be used by another records ToRevisitRecord to create a revisit record referencing this record.
// The RevisitRef might be used by another record's ToRevisitRecord to create a revisit record referencing this record.
CreateRevisitRef(profile string) (*RevisitRef, error)

// Merge merges this record with its referenced record(s)
//
// It is implemented only for revisit records, but this function will be enhanced to also support segmented records.
Merge(record ...WarcRecord) (WarcRecord, error)

// ValidateDigest validates block and payload digests if present.
//
// If option FixDigest is set, an invalid or missing digest will be corrected in the header.
@@ -71,13 +95,18 @@ type WarcRecord interface {
ValidateDigest(validation *Validation) error
}

// WarcVersion represents a WARC specification version.
//
// For record creation, only WARC 1.0 and 1.1 are supported, represented by the constants [V1_0] and [V1_1].
// During parsing of a record, the WarcVersion will take on the version value found in the record itself.
type WarcVersion struct {
id uint8
txt string
major uint8
minor uint8
}

// String returns a string representation of the WARC version in the format used by WARC files i.e. 'WARC/1.0' or 'WARC/1.1'.
func (v *WarcVersion) String() string {
return "WARC/" + v.txt
}
@@ -96,8 +125,10 @@ var (
V1_1 = &WarcVersion{id: 2, txt: "1.1", major: 1, minor: 1} // WARC 1.1
)

// RecordType represents the type of a WARC record.
type RecordType uint16

// String returns a string representation of the record type.
func (rt RecordType) String() string {
switch rt {
case 1:
14 changes: 14 additions & 0 deletions unmarshaler.go
@@ -25,10 +25,23 @@ import (
"io"
)

// Unmarshaler is the interface implemented by types that can unmarshal a WARC record. A new instance of Unmarshaler is created by calling [NewUnmarshaler].
// NewUnmarshaler accepts a number of options that can be used to control the unmarshalling process. See [WarcRecordOption] for details.
//
// Unmarshal parses the WARC record from the given reader and returns:
// - The parsed WARC record. If an error occurred during the parsing, the returned WARC record might be nil.
// - The offset value indicating the number of bytes that were discarded before the start of a record was found.
// - A pointer to a [Validation] object that stores any errors or warnings encountered during the parsing process.
// The validation object is only populated if the error policy is set to ErrWarn or ErrFail.
// - The standard error object in Go. If no error occurred during the parsing, this object is nil. Otherwise, it contains details about the encountered error.
//
// If the reader contains multiple records, Unmarshal parses the first record and returns.
// If the reader contains no records, Unmarshal returns an [io.EOF] error.
type Unmarshaler interface {
Unmarshal(b *bufio.Reader) (WarcRecord, int64, *Validation, error)
}

// unmarshaler implements the Unmarshaler interface.
type unmarshaler struct {
opts *warcRecordOptions
warcFieldsParser *warcfieldsParser
@@ -45,6 +58,7 @@ func NewUnmarshaler(opts ...WarcRecordOption) Unmarshaler {
return u
}

// Unmarshal implements the Unmarshal method in the Unmarshaler interface.
func (u *unmarshaler) Unmarshal(b *bufio.Reader) (WarcRecord, int64, *Validation, error) {
var r *bufio.Reader
var offset int64