LDDB ("Linked Data Data Base") is the core data storage in XL.
It represents a set of conventions and methods, mostly about storing and querying JSON in PostgreSQL 9+.
- Use compact, flat JSON-LD, with PNames only in keys and `@type` values.
- Use an outer object with a `@graph` array. This represents a "record" in LDDB, and is logically equivalent to a named graph (albeit represented as a default graph).
- Put the description of the named graph as item 0 in the array, and its `mainEntity` as item 1.
- Use `sameAs` as record and entity aliases.
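The conventions above can be sketched with a minimal, hypothetical record (all IDs and property values below are made up for illustration; real LDDB records carry many more properties):

```python
import json

# A minimal, hypothetical LDDB record: an outer object with a "@graph"
# array, where item 0 describes the record itself and item 1 is its
# mainEntity. The IDs are invented for this example.
record = {
    "@graph": [
        {
            "@id": "https://example.org/record/abc123",
            "@type": "Record",
            "mainEntity": {"@id": "https://example.org/entity/xyz789"},
            "sameAs": [{"@id": "https://example.org/record/legacy-1"}],
        },
        {
            "@id": "https://example.org/entity/xyz789",
            "@type": "Instance",
            "sameAs": [{"@id": "https://example.org/entity/legacy-2"}],
        },
    ]
}

graph = record["@graph"]
# Item 0 (the record description) points at item 1 (the main entity).
assert graph[0]["mainEntity"]["@id"] == graph[1]["@id"]
print(json.dumps(record, indent=2))
```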
If the built-in JSON support in PostgreSQL doesn't cut it, read on.
Use `psql` and `COPY (...) TO STDOUT`:

```sh
$ psql -h $HOST -U $USER -tc "COPY (SELECT ...) TO STDOUT;" | sed 's/\\\\/\\/g'
```
(The `sed` part converts escaped `psql` output to valid JSON.)
For repetitive processing, consider redirecting the output to a file and processing it locally.
```sh
$ psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'bib' AND deleted = false) TO STDOUT;" | sed 's/\\\\/\\/g' | gzip > stg-lddb-bib.json.lines.gz
$ zcat stg-lddb-bib.json.lines.gz | ...
```
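The same dump can be processed from Python instead of a shell pipeline. A minimal sketch (the file name matches the example above; adjust to your own dump):

```python
import gzip
import json

def iter_records(path):
    """Yield parsed JSON objects from a gzipped file of JSON lines,
    skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example usage (assumes the dump created above exists locally):
# n = sum(1 for _ in iter_records("stg-lddb-bib.json.lines.gz"))
```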
jq is a command-line tool to process JSON. Most package managers provide it. (The examples below also use `time` and AWK for convenience...)
Find and count all authorized ("pre-coordinated") ComplexSubjects:
```sh
$ time psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'auth' AND deleted = false) TO STDOUT;" | sed 's/\\\\/\\/g' |
    jq '.["@graph"][1] | select(.["@type"] == "ComplexSubject") | .prefLabel' |
    awk '{print NR" "$0}'
1 "Varumärken--juridik och lagstiftning"
2 "Räkenskaper--historia"
3 "Skiften--juridik och lagstiftning"
4 "Substitutionsprincipen--miljöaspekter"
...
```
Count all usages of anonymous ("post-coordinated") ComplexSubjects:
```sh
$ zcat stg-lddb-bib.json.lines.gz |
    jq '.["@graph"][2].subject[]? | select(.["@type"] == "ComplexSubject") | .prefLabel' |
    awk '{printf "\r%s", NR}'
1234...
```
Find all ISBN values containing punctuation:
```sh
$ time zcat stg-lddb-bib.json.lines.gz |
    jq '.["@graph"][1]?.identifiedBy[]? |
        select(.["@type"] == "ISBN" and .value)? |
        .value | match(".+([^ ] ?[;:,]$|^[;:,])")? |
        .captures[0].string' |
    awk '{ a[$0]++ }
         END { for (k in a) print k": " a[k] }'
"6 ;": 1
") :": 6
") ;": 5
"1 ;": 1
...
```
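The same regular expression can be tried out in Python, which is handy for checking what the jq filter above will catch. A sketch (the pattern is copied verbatim from the jq example; `re.search` behaves closely enough to jq's `match` here):

```python
import re

# The pattern from the jq filter above: a value ending in optional-space
# plus one of ";:," (after a non-space character).
PATTERN = re.compile(r".+([^ ] ?[;:,]$|^[;:,])")

def has_stray_punctuation(value):
    """Return True if an ISBN value matches the pattern above."""
    return bool(PATTERN.search(value))
```

Quick check: `has_stray_punctuation("91-7119-123-4 ;")` is true, while a clean value like `"9171191234"` does not match.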
Count the types of `_marcUncompleted` (including none):

```sh
$ time zcat stg-lddb-bib.json.lines.gz |
    jq -c '.["@graph"][]|._marcUncompleted?|type' |
    awk '{a[$0]++; printf "\r"; for (k in a) printf "%s %s; ", a[k], k }'
28 "array"; 214680 "null"; 27 "object"; ...
```
Find and count all `_marcUncompleted` patterns (fields and subfields):

```sh
$ time zcat stg-lddb-bib.json.lines.gz |
    jq -c '.["@graph"][] | select(has("_marcUncompleted"))? |
        ._marcUncompleted |
        if type == "object" then [.] else . end |
        .[] | [keys, ._unhandled]' |
    awk '{ a[$0]++ } END { for (k in a) print a[k]": "k }' |
    sort -nr
31: [["773","_unhandled"],["o"]]
28: [["945"],null]
1: [["586","_unhandled"],["ind1"]]
1: [["024","_unhandled"],"ind1"]
...
```
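The count-and-sort idiom used in the awk pipelines above (`awk '{ a[$0]++ } END ...' | sort -nr`) maps directly onto `collections.Counter` in Python. A sketch, with hypothetical helper names:

```python
from collections import Counter

def count_lines(lines):
    """Count occurrences of each distinct line, like awk '{ a[$0]++ }'."""
    return Counter(line.rstrip("\n") for line in lines)

def print_sorted(counts):
    """Print counts highest first, like the END block piped to sort -nr."""
    for value, n in counts.most_common():
        print(f"{n}: {value}")

# Usage over a stream, e.g.:
#   import sys
#   print_sorted(count_lines(sys.stdin))
```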
Example stub:
```python
import json
import sys

for i, l in enumerate(sys.stdin):
    if not l.rstrip():
        continue
    # Convert escaped psql output to valid JSON (same as the sed step above).
    l = l.replace('\\\\"', '\\"')
    if i % 100000 == 0:
        print("At line", i, file=sys.stderr)
    try:
        data = json.loads(l)
        # PROCESS DATA HERE
    except ValueError as e:
        print("ERROR at", i, "in data:", file=sys.stderr)
        print(l, file=sys.stderr)
        print(e, file=sys.stderr)
```
- See the instructions under "In General" to create a local output stream of bib, auth or hold.
- Run the `lddb_json_shape.py` script (using Python 3, ideally PyPy3).

Example for the `auth` collection in the STG environment:
```sh
$ psql -h $HOST -Uwhelk -tc "COPY (SELECT data FROM lddb WHERE collection = 'auth' AND deleted = false) TO STDOUT;" | sed 's/\\\\/\\/g' | gzip > stg-lddb-auth.json.lines.gz
$ time zcat stg-lddb-auth.json.lines.gz | pypy3 librisxl-tools/scripts/lddb_json_shape.py YOUR-OUTPUT-DIRECTORY
```
NOTE: On some systems, `zcat` may be installed as `gzcat`.
TIP: When crunching lots of data, use PyPy for speed.