LDDB ("Linked Data Data Base") is the core data storage in XL.
It represents a set of conventions and methods, mostly about storing and querying JSON in PostgreSQL 9+.
- Use compact, flat JSON-LD, with PNames only in keys and `@type` values.
- Use an outer object with a `@graph` array. This represents a "record" in LDDB, and is logically equivalent to a named graph (albeit represented as a default graph).
- Put the description of the named graph as item 0 in the array, and its `mainEntity` as item 1.
- Use `sameAs` as record and entity aliases.
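The conventions above can be sketched with a minimal, hypothetical record (all IDs and property values below are made up for illustration; real LDDB records carry many more properties):

```python
import json

# A minimal, hypothetical LDDB record: an outer object with a "@graph"
# array, where item 0 describes the record itself and item 1 is its
# mainEntity. The IDs are invented for this example.
record = {
    "@graph": [
        {
            "@id": "https://example.org/record/abc123",
            "@type": "Record",
            "mainEntity": {"@id": "https://example.org/entity/xyz789"},
            "sameAs": [{"@id": "https://example.org/record/legacy-1"}],
        },
        {
            "@id": "https://example.org/entity/xyz789",
            "@type": "Instance",
            "sameAs": [{"@id": "https://example.org/entity/legacy-2"}],
        },
    ]
}

graph = record["@graph"]
# Item 0 (the record description) points at item 1 (the main entity).
assert graph[0]["mainEntity"]["@id"] == graph[1]["@id"]
print(json.dumps(record, indent=2))
```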
If the built-in JSON support in PostgreSQL doesn't cut it, read on.
Use `psql` and `COPY (...) TO STDOUT`:

```sh
$ psql -h $HOST -U $USER -tc "COPY (SELECT ...) TO STDOUT;" | sed 's/\\\\/\\/g'
```
(The `sed` part converts escaped `psql` output to valid JSON.)
For repetitive processing, consider redirecting the output to a file and processing it locally.
```sh
$ psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'bib' AND deleted = false) TO STDOUT;" | sed 's/\\\\/\\/g' | gzip > stg-lddb-bib.json.lines.gz
$ zcat stg-lddb-bib.json.lines.gz | ...
```
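The same dump can be processed from Python instead of a shell pipeline. A minimal sketch (the file name matches the example above; adjust to your own dump):

```python
import gzip
import json

def iter_records(path):
    """Yield parsed JSON objects from a gzipped file of JSON lines,
    skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example usage (assumes the dump created above exists locally):
# n = sum(1 for _ in iter_records("stg-lddb-bib.json.lines.gz"))
```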
jq is a command-line tool to process JSON. Most package managers provide it. (The examples below also use `time` and AWK for convenience...)
Find and count all authorized ("pre-coordinated") ComplexSubjects:
```sh
$ time psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'auth' AND deleted = false) TO STDOUT;" | sed 's/\\\\/\\/g' |
    jq '.["@graph"][1] | select(.["@type"] == "ComplexSubject") | .prefLabel' |
    awk '{print NR" "$0}'
1 "Varumärken--juridik och lagstiftning"
2 "Räkenskaper--historia"
3 "Skiften--juridik och lagstiftning"
4 "Substitutionsprincipen--miljöaspekter"
...
```
Count all usages of anonymous ("post-coordinated") ComplexSubjects:
```sh
$ zcat stg-lddb-bib.json.lines.gz |
    jq '.["@graph"][2].subject[]? | select(.["@type"] == "ComplexSubject") | .prefLabel' |
    awk '{printf "\r%s", NR}'
1234...
```
Find all ISBN values containing punctuation:
```sh
$ time zcat stg-lddb-bib.json.lines.gz |
    jq '.["@graph"][1]?.identifiedBy[]? |
        select(.["@type"] == "ISBN" and .value)? |
        .value | match(".+([^ ] ?[;:,]$|^[;:,])")? |
        .captures[0].string' |
    awk '{ a[$0]++ }
         END { for (k in a) print k": " a[k] }'
"6 ;": 1
") :": 6
") ;": 5
"1 ;": 1
...
```
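The same regular expression can be tried out in Python, which is handy for checking what the jq filter above will catch. A sketch (the pattern is copied verbatim from the jq example; `re.search` behaves closely enough to jq's `match` here):

```python
import re

# The pattern from the jq filter above: a value ending in optional-space
# plus one of ";:," (after a non-space character).
PATTERN = re.compile(r".+([^ ] ?[;:,]$|^[;:,])")

def has_stray_punctuation(value):
    """Return True if an ISBN value matches the pattern above."""
    return bool(PATTERN.search(value))
```

Quick check: `has_stray_punctuation("91-7119-123-4 ;")` is true, while a clean value like `"9171191234"` does not match.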
Count the types of `_marcUncompleted` (including none):

```sh
$ time zcat stg-lddb-bib.json.lines.gz |
    jq -c '.["@graph"][]|._marcUncompleted?|type' |
    awk '{a[$0]++; printf "\r"; for (k in a) printf "%s %s; ", a[k], k }'
28 "array"; 214680 "null"; 27 "object"; ...
```
Find and count all `_marcUncompleted` patterns (fields and subfields):

```sh
$ time zcat stg-lddb-bib.json.lines.gz |
    jq -c '.["@graph"][] | select(has("_marcUncompleted"))? |
        ._marcUncompleted |
        if type == "object" then [.] else . end |
        .[] | [keys, ._unhandled]' |
    awk '{ a[$0]++ } END { for (k in a) print a[k]": "k }' |
    sort -nr
31: [["773","_unhandled"],["o"]]
28: [["945"],null]
1: [["586","_unhandled"],["ind1"]]
1: [["024","_unhandled"],"ind1"]
...
```
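The count-and-sort idiom used in the awk pipelines above (`awk '{ a[$0]++ } END ...' | sort -nr`) maps directly onto `collections.Counter` in Python. A sketch, with hypothetical helper names:

```python
from collections import Counter

def count_lines(lines):
    """Count occurrences of each distinct line, like awk '{ a[$0]++ }'."""
    return Counter(line.rstrip("\n") for line in lines)

def print_sorted(counts):
    """Print counts highest first, like the END block piped to sort -nr."""
    for value, n in counts.most_common():
        print(f"{n}: {value}")

# Usage over a stream, e.g.:
#   import sys
#   print_sorted(count_lines(sys.stdin))
```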
Example stub:
```python
import json
import sys

for i, l in enumerate(sys.stdin):
    if not l.rstrip():
        continue
    # Convert escaped psql output to valid JSON (same as the sed step above).
    l = l.replace('\\\\"', '\\"')
    if i % 100000 == 0:
        print("At line", i, file=sys.stderr)
    try:
        data = json.loads(l)
        # PROCESS DATA HERE
    except ValueError as e:
        print("ERROR at", i, "in data:", file=sys.stderr)
        print(l, file=sys.stderr)
        print(e, file=sys.stderr)
```
- See the instructions under "In General" to create a local output stream of bib, auth or hold.
- Run the `lddb_json_shape.py` script (using Python 3, ideally PyPy3).

Example for the `auth` collection in the STG environment:
```sh
$ psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'auth' AND deleted = false) TO STDOUT;" | sed 's/\\\\/\\/g' | gzip > stg-lddb-auth.json.lines.gz
$ time zcat stg-lddb-auth.json.lines.gz | pypy3 librisxl-tools/scripts/lddb_json_shape.py YOUR-OUTPUT-DIRECTORY
```
NOTE: On some systems, `zcat` may be installed as `gzcat`.
TIP: When crunching lots of data, use PyPy for speed.