
LDDB

Introduction

LDDB ("Linked Data Data Base") is the core data storage in XL.

It is a set of conventions and methods, mostly for storing and querying JSON in PostgreSQL 9+.

Design

  1. Use compact, flat JSON-LD, with prefixed names (PNames) only in keys and @type values.
  2. Use an outer object with a @graph array. This represents a "record" in LDDB, and is logically equivalent to a named graph (although it is stored as a default graph).
  3. Put the description of the named graph as item 0 in the array, and its mainEntity as item 1.
  4. Use sameAs as record and entity aliases.
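The conventions above can be illustrated with a minimal record sketch. The identifiers and property values here are made up for illustration; real records carry many more properties:

```json
{
  "@graph": [
    {
      "@id": "https://libris.kb.se/fnrbrgl",
      "@type": "Record",
      "mainEntity": {"@id": "https://libris.kb.se/fnrbrgl#it"},
      "sameAs": [{"@id": "http://libris.kb.se/bib/1234"}]
    },
    {
      "@id": "https://libris.kb.se/fnrbrgl#it",
      "@type": "Instance",
      "sameAs": [{"@id": "http://libris.kb.se/resource/bib/1234"}]
    }
  ]
}
```

Item 0 describes the record (the named graph) itself, and item 1 is its mainEntity, as per points 3 and 4 above.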

Crunching JSON Lines from PSQL

In General

If the built-in JSON support in PostgreSQL doesn't cut it, read on.

Use psql and COPY (...) TO STDOUT:

$ psql -h $HOST -U $USER -tc "COPY (SELECT ...) TO STDOUT;" | sed 's/\\\\/\\/g'

(The sed part converts escaped psql output to valid JSON.)

For repetitive processing, consider redirecting the output to a file and processing it locally.

$ psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'bib' AND deleted = false) TO stdout;" | sed 's/\\\\/\\/g' | gzip > stg-lddb-bib.json.lines.gz
$ zcat stg-lddb-bib.json.lines.gz |  ...

Using JQ (and AWK)

jq is a command-line tool for processing JSON. Most package managers provide it.

(The examples below also use time and AWK for convenience.)

Examples:

Find and count all authorized ("pre-coordinated") ComplexSubjects:

$ time psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'auth' AND deleted = false) TO STDOUT;" | sed 's/\\\\/\\/g' |
    jq '.["@graph"][1] | select(.["@type"] == "ComplexSubject") | .prefLabel' |
    awk '{print NR" "$0}'

1 "Varumärken--juridik och lagstiftning"
2 "Räkenskaper--historia"
3 "Skiften--juridik och lagstiftning"
4 "Substitutionsprincipen--miljöaspekter"
...

Count all usages of anonymous ("post-coordinated") ComplexSubjects:

$ zcat stg-lddb-bib.json.lines.gz |
    jq '.["@graph"][2].subject[]? | select(.["@type"] == "ComplexSubject") | .prefLabel' |
    awk '{printf "\r%s", NR}'

1234...

Find all ISBN values containing punctuation:

$ time zcat stg-lddb-bib.json.lines.gz |
    jq '.["@graph"][1]?.identifiedBy[]? |
        select(.["@type"] == "ISBN" and .value)? |
        .value | match(".+([^ ] ?[;:,]$|^[;:,])")? |
        .captures[0].string' |
    awk '{ a[$0]++ }
         END { for (k in a) print k": " a[k] }'

"6 ;": 1
") :": 6
") ;": 5
"1 ;": 1
...

Count the types of _marcUncompleted (including none):

$ time zcat stg-lddb-bib.json.lines.gz |
    jq -c '.["@graph"][]|._marcUncompleted?|type' |
    awk '{a[$0]++; printf "\r"; for (k in a) printf "%s %s; ", a[k], k }'

28 "array"; 214680 "null"; 27 "object";  ...

Find and count all _marcUncompleted patterns (fields and subfields):

$ time zcat stg-lddb-bib.json.lines.gz |
    jq -c '.["@graph"][] | select(has("_marcUncompleted"))? |
           ._marcUncompleted |
           if type == "object" then [.] else . end |
           .[] | [keys, ._unhandled]' |
    awk '{ a[$0]++ } END { for (k in a) print a[k]": "k }' |
    sort -nr

31: [["773","_unhandled"],["o"]]
28: [["945"],null]
1: [["586","_unhandled"],["ind1"]]
1: [["024","_unhandled"],"ind1"]
...

Using Python

Example stub:

import json
import sys

for i, line in enumerate(sys.stdin):
    if not line.rstrip():
        continue
    # sys.stdin yields str, not bytes; fix any remaining double-escaped
    # quotes (in case the sed step in the COPY pipeline was skipped).
    line = line.replace('\\\\"', '\\"')
    if i % 100000 == 0:
        print("At line", i, file=sys.stderr)
    try:
        data = json.loads(line)

        # PROCESS DATA HERE

    except ValueError as e:
        print("ERROR at", i, "in data:", file=sys.stderr)
        print(line, file=sys.stderr)
        print(e, file=sys.stderr)
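As a concrete fill-in for the "PROCESS DATA HERE" slot, here is a sketch that counts the @type of each record's mainEntity (item 1 of @graph, per the Design section). The script and function names are made up for illustration:

```python
import json
import sys
from collections import Counter

def count_main_entity_types(lines):
    """Count the @type of each record's mainEntity (item 1 in @graph)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        data = json.loads(line)
        graph = data.get("@graph", [])
        if len(graph) > 1:
            counts[graph[1].get("@type")] += 1
    return counts

if __name__ == "__main__":
    for type_name, n in count_main_entity_types(sys.stdin).most_common():
        print(n, type_name)
```

Run it on a dump created as above, e.g. zcat stg-lddb-bib.json.lines.gz | python3 count_main_entity_types.py (assuming that filename).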

Create JSON shapes with statistics

  1. See instructions under "In General" to create a local output stream of bib, auth or hold.
  2. Run the lddb_json_shape.py script (using Python 3, ideally PyPy3).

Example for the auth collection in the STG environment:

$ psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'auth' AND deleted = false) TO stdout;" | sed 's/\\\\/\\/g' | gzip > stg-lddb-auth.json.lines.gz
$ time zcat stg-lddb-auth.json.lines.gz | pypy3 librisxl-tools/scripts/lddb_json_shape.py YOUR-OUTPUT-DIRECTORY

NOTE: On some systems, zcat may be installed as gzcat.

TIP: When crunching lots of data, use PyPy for speed.