page lengths of novels in the DNB catalogue
We analyse the number of pages of novels (i.e., fictional literary
works) in the German National Library (DNB).
It is not trivial to extract all novels from a big catalogue like that
of the German National Library. “Librarians estimate that genre
information is present in the expected MARC field for less than a
quarter of the volumes in HathiTrust Digital Library” (Underwood et
al. 2013), and we encounter the same problem, which calls for an
innovative solution.
Our approach is to
1. extract a list of writers from Wikidata together with their GND id (a minimal query sketch follows this list),
2. download linked data about the DNB books, and
3. join the writer list with the list of books using the GND id.
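As an illustration of the first step, the following Python sketch queries the public Wikidata SPARQL endpoint for writers that have a GND id. It is a minimal sketch, not the extraction actually used below (which works on the Wikidata dump); the identifiers P106 (occupation), P279 (subclass of), Q36180 (writer), and P227 (GND ID) are assumptions of this example.

# Minimal sketch: query the Wikidata SPARQL endpoint for writers with a GND id.
# Assumed identifiers: P106 = occupation, P279 = subclass of, Q36180 = writer, P227 = GND ID.
import requests

QUERY = """
SELECT ?author ?authorLabel ?gnd WHERE {
  ?author wdt:P106/wdt:P279* wd:Q36180 ;  # occupation is (a subclass of) writer
          wdt:P227 ?gnd .                 # GND id
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en". }
}
LIMIT 100
"""

r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "dnb-page-lengths-example"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(row["gnd"]["value"], row["authorLabel"]["value"])

The GND ids obtained in this way are what the DNB records are joined on in the last step.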
This repository documents the evolution of this process, which turned
out to be less straightforward than it might seem. One reason is the
size of the data and the complexity of the queries.
These are the different approaches we have tried, ordered chronologically:
querying Wikidata using a SPARQL endpoint
extracting authors from the Wikidata dump whose occupation property is a subclass of Writer
restricting to authors with a page in the German Wikipedia and works classified as “Roman”
This page shows the results for the latest approach, which uses the
tools and methods developed before with different restrictions on
authors and works. Details on data extraction, cleansing, and joining
are described in one of the earlier documents.
We have also compared the page lengths against the list “1001 Books You
Must Read Before You Die”.
We restrict our analysis to works from the DNB dump which adhere to
the following conditions:
They were published in or after 1913 (issued_norm
>= 1913).
At least one of their authors has a GND id and an occupation
property in Wikidata and a sitelink to Wikipedia.
The work has an extractable page number (extent matches the regex
"\[?([0-9]+)\]? S(\.|eiten?);?"; see the sketch after this list).
The work's DNB property P60493 contains the character sequence
“roman” or “Roman”.
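To make the last two conditions concrete, here is a small Python sketch of the page-number extraction and the “Roman” test. It is an illustration only; the example extent strings are invented, and the actual field handling is done by json2json.py.

# Sketch of the extent regex and the P60493 test from the conditions above.
import re

# an optional bracketed number followed by "S." / "Seite(n)", optionally ending in ";"
EXTENT_RE = re.compile(r"\[?([0-9]+)\]? S(\.|eiten?);?")

def extract_pages(extent):
    """Return the page count from a DNB extent string, or None if it does not match."""
    m = EXTENT_RE.search(extent)
    return int(m.group(1)) if m else None

def is_novel(p60493):
    """True if the DNB genre statement contains 'roman' or 'Roman'."""
    return "roman" in p60493 or "Roman" in p60493

# invented example values, not taken from the dump
print(extract_pages("318 S."))        # -> 318
print(extract_pages("[224] Seiten"))  # -> 224
print(extract_pages("1 CD-ROM"))      # -> None
print(is_novel("Kriminalroman"))      # -> True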
When analysing publishers, we further limit the maximum number of pages
per work to 5000 to exclude errors.
Checksums:
for i in DNBTitel.rdf.gz DNBTitel_normalised_enriched.json.gz gnditems_2017-09-05_14:59.json.gz; do
  echo "$i\t" $(ls -lh $i | awk '{print $5"\t"$6,$7,$8"\t"}') $(md5sum < $i)
done
file name size date md5 hash
DNBTitel.rdf.gz 1.6G May 12 13:52 4dce7ed7e38bdc5f61491861b4a1082c
DNBTitel_normalised_enriched.json.gz 1.1G Sep 5 16:29 8e640bca81e6ac7504da00e223d766d1
gnditems_2017-09-05_14:59.json 102M Sep 5 16:12 943a6a50e2c19afb73fb859b64b20f06
All values of the P60493 property for items that fulfill our conditions:
./json2json.py -f \
-p "issued_norm,pages_norm,P60493,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913) print $3}' | sort -n | uniq -c | sort -nr \
> P60493.tsv
Print the top matches:
echo "Bezeichnung\tHäufigkeit"
sed -e "s/^ *//" -e "s/ /\t/" P60493.tsv \
| awk -F'\t' '{if ($2 ~ /[rR]oman/) print $2"\t"$1}' \
| head -n50
echo -n "*Gesamtsumme*\t"
sed -e "s/^ *//" -e "s/ /\t/" P60493.tsv \
| awk -F'\t' '{if ($2 ~ /[rR]oman/) sum+=$1} END {print sum}'
Bezeichnung Häufigkeit
Roman 120985
Kriminalroman 8657
[Roman] 2671
roman 2657
Science-fiction-Roman 986
historischer Roman 937
Kriminal-Roman 903
Western-Roman 760
Roman aus d. amerikan. Westen 408
heiterer Roman 405
Ein Roman 403
Westernroman 315
Arztroman 313
ein Roman 274
romanzo 259
Fantasy-Roman 250
Science-Fiction-Roman 248
histor. Roman 241
[roman] 228
Abenteuerroman 221
romanas 219
Horror-Roman 210
Wildwestroman 203
Ein heiterer Roman 200
[Kriminalroman] 184
Wildwest-Roman 181
Roman. 163
historischer Kriminalroman 163
Abenteuer-Roman 161
Zukunftsroman 160
zwei Romane in einem Band 157
Utop. Roman 153
Romanzo 153
Frauenroman 135
Planetenroman 131
e. Roman 129
utop. Roman 128
Histor. Roman 114
Jugendroman 108
Kinderroman 103
ein unheimlicher Roman 98
Roman ; [Thriller] 97
Wild-West-Roman 93
Heiterer Roman 93
Detektivroman 90
John-Sinclair-Roman 89
Roman für Kinder 87
zwei Romane 86
e. klass. Western-Roman 85
Detektiv-Roman 85
Gesamtsumme 180219
./json2json.py -f \
-p "issued_norm,pages_norm,publisher,P60493,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $3 == "Reclam") print $4}' | sort | uniq -c | sort -nr \
> reclam.tsv
Condition for all items:
extracting data
all items with a page number
./json2json.py -f \
-p "issued_norm,pages_norm" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913) print $1"\t"$2}' | sort -n \
> items_per_year-page.tsv
all items with a page number and an author with a Wikipedia link
./json2json.py -f \
-p "issued_norm,pages_norm,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913) print $1"\t"$2}' | sort -n \
> items_per_year-page_author.tsv
all novels
./json2json.py -f \
-p "issued_norm,P60493" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $2 ~ /[rR]oman/) print $1}' | sort -n \
> items_per_year-novel.tsv
all novels with a page number
./json2json.py -f \
-p "issued_norm,pages_norm,P60493" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $3 ~ /[rR]oman/) print $1"\t"$2}' | sort -n \
> items_per_year-novel_page.tsv
all novels with a page number and an author with a Wikipedia link
./json2json.py -f \
-p "issued_norm,pages_norm,P60493,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $3 ~ /[rR]oman/) print $1"\t"$2}' | sort -n \
> items_per_year-novel_page_author.tsv
wc -l items_per_year*.tsv
filter items
page 8346148
page + author 1349949
novel 353498
novel + page 316518
novel + page + author 180219
reset
set encoding utf8
set grid
set datafile separator "\t"
set xrange [1913:2017]
set xtics 10,10
set xlabel 'year'
set ylabel 'items'
set key left Left reverse
set y2tics
set term pngcairo enhanced size 800,600
set out 'img/items_per_year.png'
plot \
'< datamash -g1 count 1 < items_per_year-page.tsv' using 1:2 with lines axes x1y2 title 'Buecher mit Seitenzahlangabe (rechte y-Achse)',\
'< datamash -g1 count 1 < items_per_year-page_author.tsv' using 1:2 with lines title 'Buecher mit Seitenzahlangabe und Autor*in in Wikipedia',\
'< datamash -g1 count 1 < items_per_year-novel.tsv' using 1:2 with lines title 'Romane',\
'< datamash -g1 count 1 < items_per_year-novel_page.tsv' using 1:2 with lines title 'Romane mit Seitenzahlangabe',\
'< datamash -g1 count 1 < items_per_year-novel_page_author.tsv' using 1:2 with lines title 'Romane mit Seitenzahlangabe und Autor*in in Wikipedia'
set term svg enhanced size 800,600
set out 'img/items_per_year.svg'
replot
# relative frequency
set ylabel 'items'
set format y "%2.0f%%"
set term pngcairo enhanced size 800,600
set out 'img/items_per_year_rel.png'
plot \
'< datamash -g1 count 1 < items_per_year-page.tsv' using 1:(100*($2/8346148)) with lines title 'Buecher mit Seitenzahlangabe',\
'< datamash -g1 count 1 < items_per_year-page_author.tsv' using 1:(100*($2/1349949)) with lines title 'Buecher mit Seitenzahlangabe und Autor*in in Wikipedia',\
'< datamash -g1 count 1 < items_per_year-novel.tsv' using 1:(100*($2/353498)) with lines title 'Romane',\
'< datamash -g1 count 1 < items_per_year-novel_page.tsv' using 1:(100*($2/316518)) with lines title 'Romane mit Seitenzahlangabe',\
'< datamash -g1 count 1 < items_per_year-novel_page_author.tsv' using 1:(100*($2/180219)) with lines title 'Romane mit Seitenzahlangabe und Autor*in in Wikipedia'
set term svg enhanced size 800,600
set out 'img/items_per_year_rel.svg'
replot
Absolute numbers:
Relative numbers:
for i in $(ls items_per_*tsv); do
  echo $i $(datamash count 1 sum 2 < $i)
done
filter items pages mean pages
page 8346148 1327973922 159
page + author 1349949 296472297 220
novel 353498 0
novel + page 316518 98947311 313
novel + page + author 180219 60717476 337
Of the original 14,102,309 items, we use 180,219 items with 60,717,476
pages. Those items fulfill the following conditions:
We can extract the year in which they were issued.
They were issued in or after 1913.
We can extract their extent (number of pages).
Their DNB property P60493 contains the character sequence “roman” or “Roman”.
At least one of their authors has a GND id in Wikidata and a
Wikipedia page (in any Wikipedia language version).
For this set we did not require that any other values (e.g., publisher)
are available, but some analyses below further restrict this set.
Compute frequencies:
sort -nk2 items_per_year-novel_page_author.tsv | datamash -g2 count 2 > pages_freq.tsv
Plot distribution:
reset
set term svg enhanced size 800,600
set out 'img/pages.svg'
set grid
set xrange [0:2000]
set logscale y
set format y "10^%T"
set xlabel 'number of pages'
set ylabel 'frequency'
plot 'pages_freq.tsv' using 1:2 with lines title ''
set term pngcairo enhanced size 800,600
set out 'img/pages.png'
replot
# showing Bogen boundaries
unset logscale
unset format y
set xtics 0,16
# zoom into range 400 to 600 to see 16-patterns of pages
set xrange [400:600]
set term pngcairo enhanced size 800,600
set out 'img/pages_400-600.png'
plot 'pages_freq.tsv' using 1:2 with lines title ''
set term svg enhanced size 800,600
set out 'img/pages_400-600.svg'
replot
# zoom into range 200 to 400 to see 16-patterns of pages
set xrange [200:400]
set term pngcairo enhanced size 800,600
set out 'img/pages_200-400.png'
plot 'pages_freq.tsv' using 1:2 with lines title ''
set term svg enhanced size 800,600
set out 'img/pages_200-400.svg'
replot
# zoom into range 0 to 200 to see 16-patterns of pages
set xrange [0:200]
set term pngcairo enhanced size 800,600
set out 'img/pages_000-200.png'
plot 'pages_freq.tsv' using 1:2 with lines title ''
set term svg enhanced size 800,600
set out 'img/pages_000-200.svg'
replot
Bin pages in multiples of 16:
steps = 16
limit = 1009
# Assumes pages_freq.tsv is sorted by page number (column 1) with the
# frequency in column 2; everything above `limit` goes into one open-ended bin.
with open("pages_freq_" + str(steps) + ".tsv", "wt") as out:
    with open("pages_freq.tsv", "rt") as f:
        bin = 0          # upper boundary of the current bin
        binstr = ""      # label of the current bin
        sumcount = 0     # frequencies accumulated in the current bin
        for line in f:
            page, count = map(int, line.strip().split())
            if page > limit:
                if bin != limit:
                    # flush the last regular bin before switching to the open-ended bin
                    if sumcount > 0:
                        print(binstr, sumcount, file=out, sep='\t')
                        sumcount = 0
                    bin = limit
                    binstr = str(limit) + " und mehr"
            elif page > bin:
                # flush the finished bin and advance to the bin containing `page`
                if sumcount > 0:
                    print(binstr, sumcount, file=out, sep='\t')
                    sumcount = 0
                while page > bin:
                    bin += steps
                binstr = str(bin - steps + 1) + "-" + str(bin)
            sumcount += count
        print(binstr, sumcount, file=out, sep='\t')
reset
set grid y
set datafile separator "\t"
set xlabel 'page ranges'
set ylabel 'number of books'
set style data histogram
set style fill solid 1.0 noborder
set xtics rotate
set term pngcairo enhanced size 1000,600 font "Arial,10"
set out 'img/pages_16.png'
plot 'pages_freq_16.tsv' using 2:xticlabels(1) title ''
set term svg enhanced size 1000,600 font "Arial,10"
set out 'img/pages_16.svg'
replot
Let’s plot the median number of pages per year:
export LC_ALL=C
datamash -g1 median 2 mean 2 min 2 max 2 count 2 q1 2 q3 2 < items_per_year-novel_page_author.tsv > issued_pages_stats.tsv
reset
set encoding utf8
set term pngcairo enhanced size 800,600
set out 'img/issued_pages_decade.png'
set grid
set datafile separator "\t"
set xlabel 'year'
set ylabel 'number of pages'
set xrange [1913:2020]
set xtics 10,10
set term pngcairo enhanced size 800,600
set out 'img/issued_pages_1913.png'
plot \
'issued_pages_stats.tsv' using 1:7:8 with filledcurves fs transparent solid 0.2 noborder lc rgb "green" title '1st and 3rd quartile',\
'issued_pages_stats.tsv' using 1:2 with lines lw 2 lt 3 lc rgb "green" title 'median'
# ,\
# 'issued_pages_stats.tsv' using 1:3 with lines lw 2 lt 3 lc rgb "blue" title 'mean'
set term svg enhanced size 800,600
set out 'img/issued_pages_1913.svg'
replot
Plot cumulative frequency distribution of the number of pages:
reset
set encoding utf8
set term pngcairo enhanced size 800,600
set out 'img/cumulative_page_distrib.png'
set grid
set datafile separator "\t"
set xlabel 'number of pages'
set ylabel 'P[x < number of pages]'
set logscale x
# divide the y-value by the number of books in the dataset
plot \
'../1001-books/counts.tsv' using 1:($2/1001) smooth cumulative with lines title '1001 books',\
'pages_freq.tsv' using 1:($2/180219) smooth cumulative with lines title 'DNB'
set term svg enhanced size 800,600
set out 'img/cumulative_page_distrib.svg'
replot
The page distribution of the 1001-books list is skewed towards longer
books. Let's compare two specific page ranges: more than 1000 pages
vs. between 100 and 400 pages.
echo "dataset\t>1000 pages\t100-400 pages\tratio"
for file in ../1001-books/counts.tsv pages_freq.tsv; do
  awk -F'\t' '
    {
      SUM += $2;
      if ($1 > 1000) SUMBIG += $2;
      if ($1 >= 100 && $1 <= 400) SUMSMALL += $2
    } END {
      printf("%s\t%s (%2.1f%%)\t%s (%2.1f%%)\t%2.4f\n", FILENAME, SUMBIG, SUMBIG/SUM*100, SUMSMALL, SUMSMALL/SUM*100, SUMBIG/SUMSMALL)
    }' $file
done
dataset >1000 pages 100-400 pages ratio
1001-books 23 (2.3%) 682 (68.1%) 0.0337
DNB 1056 (0.6%) 129167 (71.7%) 0.0082
TODO: plot distribution of the number of authors per work
./json2json.py -f \
-p "issued_norm,pages_norm,P60493,creator_wd.*.name,creator_wd.*.sitelinks" \
-c "creator_wd.*.name,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $3 ~ /[rR]oman/) {sum[$4]+=$2; count[$4]+=1}} END {for (p in sum) printf("%s\t%s\t%s\t%s\n", sum[p], count[p], int(sum[p]/count[p]), p)}' \
> author_pages_stats.tsv
Authors with the most novels in the data set (as Wikidata links):
./json2json.py -f \
-p "issued_norm,pages_norm,P60493,creator_wd.*.name,creator_wd.*.id,creator_wd.*.sitelinks" \
-c "creator_wd.*.name,creator_wd.*.id,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $3 ~ /[rR]oman/) print "[[https://www.wikidata.org/wiki/"$5"]["$4"]]"}' \
| sort -S1G | uniq -c | sort -nr | head -n50
without restriction to “[rR]oman”
./json2json.py -f \
-p "issued_norm,pages_norm,creator_wd.*.name,creator_wd.*.id,creator_wd.*.sitelinks" \
-c "creator_wd.*.name,creator_wd.*.id,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913) print "[[https://www.wikidata.org/wiki/"$4"]["$3"]]"}' \
| sort -S1G | uniq -c | sort -nr | head -n50
sort -S1G -nr author_pages_stats.tsv | head -n20
author pages items mean pages
Heinz G. Konsalik 692652 2232 310
Colleen McCullough 419930 133 3157
Marie Louise Fischer 331311 1264 262
Utta Danella 324470 778 417
Stephen King 293562 577 508
Fyodor Dostoyevsky 269869 390 691
Lion Feuchtwanger 248688 501 496
Eleanor Hibbert 235388 635 370
Johannes Mario Simmel 195975 403 486
Thomas Mann 191233 359 532
Gert Fritz Unger 188493 1013 186
Pearl S. Buck 185999 596 312
Robert Ludlum 185467 358 518
Hedwig Courths-Mahler 184677 647 285
Theodor Fontane 173444 565 306
Heinrich Mann 172019 394 436
Nora Roberts 171520 381 450
Hans Fallada 169877 396 428
Leo Tolstoy 163126 204 799
Georgette Heyer 159427 576 276
sort -S1G -nrk3 author_pages_stats.tsv | head -n20
author pages items mean pages
Pierre Alexis Ponson du Terrail 3200 1 3200
Colleen McCullough 419930 133 3157
Petra Mönter 2290 1 2290
Stefano D’Arrigo 1470 1 1470
Vikram Seth 11208 8 1401
Jonathan Littell 4149 3 1383
Margaret George 35617 30 1187
Lucien Rebatet 1142 1 1142
Miquel de Palol 2266 2 1133
Cornelia Wusowski 14343 13 1103
William H. Gass 2184 2 1092
William King 1072 1 1072
Franz Erhard Walther 1071 1 1071
Péter Nádas 6414 6 1069
Gregory David Roberts 4250 4 1062
Hans Albrecht Moser 3171 3 1057
Francisco Casavella 1038 1 1038
Susanna Clarke 3068 3 1022
Baltasar Gracián 1013 1 1013
Elizabeth Arthur 2012 2 1006
There are probably some errors among those …
reset
set encoding utf8
set term pngcairo enhanced size 800,600
set out 'img/author_pages.png'
set grid
set datafile separator "\t"
set xrange [*:10000]
set logscale
set format y "10^%T"
set format x "10^%T"
set xlabel 'number of items'
set ylabel 'mean number of pages per item'
set label "Heinz G.\nKonsalik" left at 2232, 310 offset .5, .3
set label "Colleen McCullough" left at 133, 3157 offset .5, .3
set label "Margaret George" left at 30, 1187 offset .5, .3
# set label "Guenther Bentele" left at 27, 3842 offset .5, .3
# set label "Johann\nWolfgang\nvon\nGoethe" left at 5169, 235 offset -1.8, 3.6
plot 'author_pages_stats.tsv' using 2:3 with points pt 7 title ''
set term svg enhanced size 800,600
set out 'img/author_pages.svg'
replot
TODO: top lists for different occupations
TODO: item count vs. mean page count colored by occupation
The 100 longest novels (by extracted page count):
./json2json.py -f -p "issued_norm,pages_norm,title,_id,P60493,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $5 ~ /[rR]oman/) {print $2"\t[[http://d-nb.info/"$4"]["$3"]] ("$1")"}}' \
| sort -S1G -nr | head -n100
Novels by Franz Kafka, sorted by page count:
./json2json.py -f \
-p "issued_norm,pages_norm,title,_id,P60493,creator_wd.*.name,creator_wd.*.sitelinks" \
-c "creator_wd.*.name,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $5 ~ /[rR]oman/ && $6 == "Franz Kafka") {print $2"\t[[http://d-nb.info/"$4"]["$3"]] ("$1")"}}' \
| sort -S1G -nr | head -n50
For the publisher analysis we additionally consider only books with at
most 5000 pages to avoid skewing the page counts due to errors.
Extract data:
./json2json.py -f -p "issued_norm,pages_norm,publisher,P60493,creator_wd.*.sitelinks" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $4 ~ /[rR]oman/ && $2 <= 5000) print $3"\t"$2}' \
| sort -S1G > publisher.tsv
datamash -s -g1 count 1 < publisher.tsv | sort -t$'\t' -S1G -nrk2 | head -n20
publisher items
Heyne 17249
Rowohlt 9356
Goldmann 8848
Ullstein 4986
Dt. Taschenbuch-Verl. 3864
Fischer-Taschenbuch-Verl. 3612
Suhrkamp 3513
RM-Buch-und-Medien-Vertrieb [u.a.] 3461
Piper 3363
Diogenes 2303
Dt. Buch-Gemeinschaft 1954
Weltbild 1912
Fischer-Taschenbuch-Verlag 1853
Büchergilde Gutenberg 1810
Droemer Knaur 1719
Rowohlt-Taschenbuch-Verl. 1678
Blanvalet 1630
Bastei-Verl. Lübbe 1478
Zsolnay 1238
Lübbe 1205
After normalisation of the publisher names: see below
LC_ALL=C datamash -s -g1 count 1 sum 2 mean 2 < publisher.tsv | sort -t$'\t' -S1G -nrk3 | head -n20
publisher items page sum mean pages
Heyne 17249 6066956 352
Goldmann 8848 2898130 328
Rowohlt 9356 2604056 278
RM-Buch-und-Medien-Vertrieb [u.a.] 3461 1565075 452
Ullstein 4986 1536849 308
Dt. Taschenbuch-Verl. 3864 1281876 332
Fischer-Taschenbuch-Verl. 3612 1280201 354
Piper 3363 1264808 376
Suhrkamp 3513 1071240 305
Weltbild 1912 925697 484
Blanvalet 1630 774248 475
Dt. Buch-Gemeinschaft 1954 746935 382
Droemer Knaur 1719 716908 417
Diogenes 2303 715190 311
Büchergilde Gutenberg 1810 679455 375
Rowohlt-Taschenbuch-Verl. 1678 610853 364
Aufbau-Verl. 1205 525199 436
Fischer-Taschenbuch-Verlag 1853 519204 280
Dt. Bücherbund 1139 514752 452
Lübbe 1205 505148 419
LC_ALL=C datamash -s -g1 count 1 sum 2 mean 2 < publisher.tsv | sort -t$'\t' -S1G -nrk4 | head -n20
publisher items page sum mean pages
Ander 1 3202 3202
K. M. John 1 1258 1258
Dörfler 1 1232 1232
Wissenschaftl. Buchges. 7 8052 1150
Uitg. NAS 1 1075 1075
Parkland 3 3214 1071
Blanvalet-Verlag 1 1056 1056
Nord 1 1032 1032
Wissenschaftl. Buchges 2 2030 1015
Schweizer Druck- u. Verl.-haus 1 1003 1003
Jokers-Ed. 1 989 989
Zentralverl. d. NSDAP Eher 1 980 980
Uitg.De Arbeiderspers 1 972 972
Implex-Verl. 1 971 971
Libr. General Française 1 955 955
Parkland-Verlag 8 7397 925
Lesering. Das Bertelsmann Buch 1 924 924
Roder 1 904 904
Leon 1 904 904
List-Taschenbuchverl. 1 896 896
How is the number of items per publisher related to the mean number of
pages per publisher?
LC_ALL=C datamash -s -g1 count 1 sum 2 mean 2 < publisher.tsv > publisher_page_stats.tsv
reset
set term pngcairo enhanced size 800,600
set out 'img/publisher_pages.png'
set grid
set datafile separator "\t"
set logscale
set xlabel 'number of items'
set ylabel 'mean number of pages per item'
plot 'publisher_page_stats.tsv' using 2:4 with points pt 7 title ''
set term svg enhanced size 800,600
set out 'img/publisher_pages.svg'
replot
top normalised publishers
The following rankings only comprise the normalised publishers.
We now clean up the publisher mapping by deleting all rows that should
not be regarded as the same publisher, and then create a big
intermediate file (a sketch of how such a mapping might be applied
follows the command):
./json2json.py -m publisher_map.tsv -f -p "issued_norm,pages_norm,publisher_norm,title,_id,P60493,creator_wd.*.name,creator_wd.*.id" \
DNBTitel_normalised_enriched.json.gz \
| awk -F'\t' '{if ($1 >= 1913 && $2 <= 5000 && $6 ~ /[rR]oman/) print $0}' \
> publisher_data.tsv
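The exact behaviour of json2json.py -m is not documented here. As a rough illustration, the following Python sketch applies a normalisation map under the assumption that publisher_map.tsv has two tab-separated columns, a raw publisher string and its normalised name (this layout is an assumption of the example, not a confirmed description of the file).

# Illustrative sketch only: apply a publisher normalisation map.
# Assumption: publisher_map.tsv has two tab-separated columns,
# raw publisher name -> normalised publisher name.
import csv

def load_publisher_map(path="publisher_map.tsv"):
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                mapping[row[0]] = row[1]
    return mapping

def normalise(publisher, mapping):
    # publishers without a mapping keep their original name (assumption)
    return mapping.get(publisher, publisher)

In the actual pipeline this normalisation is done by json2json.py itself, which emits the publisher_norm field used in the awk filter above.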
cut -f3 publisher_data.tsv | sort -S1G | uniq -c | sort -nr
publisher items
Heyne 17430
Rowohlt 11354
Goldmann 8887
Ullstein 5597
Suhrkamp 3554
Piper 3394
Aufbau 2957
Kiepenheuer & Witsch 1285
Reclam 1117
Insel 1063
Hoffmann und Campe 988
Hanser 854
Luchterhand Literaturverlag 784
Manesse 390
Eichborn 360
Berlin Verlag 238
Nagel & Kimche 228
Ammann 150
Schöffling & Co. 147
Wallstein 60
Verbrecher Verlag 37
Blumenbar 30
Rogner & Bernhard 23
Wiesenburg 20
Voland & Quist 9
Urs Engeler Editor 4
awk -F'\t' '{sum[$3]+=$2; count[$3]+=1} END {for (p in sum) printf("%s\t%s\t%s\t%s\n", sum[p], count[p], int(sum[p]/count[p]), p)}' publisher_data.tsv \
| sort -S1G -nr
publisher page sum items mean pages
Heyne 6148284 17430 352
Rowohlt 3319270 11354 292
Goldmann 2911633 8887 327
Ullstein 1708227 5597 305
Piper 1274961 3394 375
Aufbau 1203891 2957 407
Suhrkamp 1086269 3554 305
Kiepenheuer & Witsch 422237 1285 328
Insel 382329 1063 359
Hoffmann und Campe 374922 988 379
Hanser 298526 854 349
Reclam 283163 1117 253
Luchterhand Literaturverlag 253884 784 323
Manesse 205907 390 527
Eichborn 117060 360 325
Berlin Verlag 72008 238 302
Nagel & Kimche 53012 228 232
Schöffling & Co. 48106 147 327
Ammann 45497 150 303
Wallstein 14337 60 238
Verbrecher Verlag 12290 37 332
Rogner & Bernhard 9376 23 407
Blumenbar 7595 30 253
Wiesenburg 4799 20 239
Voland & Quist 2349 9 261
Urs Engeler Editor 1197 4 299
awk -F'\t' '{sum[$3]+=$2; count[$3]+=1} END {for (p in sum) printf("%s\t%s\t%s\t%s\n", sum[p], count[p], int(sum[p]/count[p]), p)}' publisher_data.tsv \
| sort -S1G -nrk3
publisher page sum items mean pages
Manesse 205907 390 527
Rogner & Bernhard 9376 23 407
Aufbau 1203891 2957 407
Hoffmann und Campe 374922 988 379
Piper 1274961 3394 375
Insel 382329 1063 359
Heyne 6148284 17430 352
Hanser 298526 854 349
Verbrecher Verlag 12290 37 332
Kiepenheuer & Witsch 422237 1285 328
Schöffling & Co. 48106 147 327
Goldmann 2911633 8887 327
Eichborn 117060 360 325
Luchterhand Literaturverlag 253884 784 323
Ullstein 1708227 5597 305
Suhrkamp 1086269 3554 305
Ammann 45497 150 303
Berlin Verlag 72008 238 302
Urs Engeler Editor 1197 4 299
Rowohlt 3319270 11354 292
Voland & Quist 2349 9 261
Blumenbar 7595 30 253
Reclam 283163 1117 253
Wiesenburg 4799 20 239
Wallstein 14337 60 238
Nagel & Kimche 53012 228 232
Mean and median page count per decade and publisher:
awk -F'\t' '{print int($1/10)"\t"$3"\t"$2}' publisher_data.tsv | sort | datamash -g1,2 mean 3 median 3 | sed "s/,/./g" | sort -n > publisher_pages_decades.tsv
reset
set encoding utf8
set term pngcairo enhanced size 800,600
set out 'img/publisher_pages_decades.png'
set grid
set datafile separator "\t"
set xlabel 'year'
set ylabel 'median number of pages'
set key top left horizontal maxcols 4
plot \
'< grep Heyne publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Heyne',\
'< grep Rowohlt publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Rowohlt',\
'< grep Goldmann publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Goldmann',\
'< grep Ullstein publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Ullstein',\
'< grep Suhrkamp publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Suhrkamp',\
'< grep Piper publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Piper',\
'< grep Aufbau publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 8 lw 2 title 'Aufbau',\
'< grep Kiepenheuer publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 8 lw 2 title 'Kiepenheuer & Witsch',\
'< grep Reclam publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Reclam',\
'< grep Insel publisher_pages_decades.tsv' using ($1*10):4 with linespoints pt 7 lw 2 title 'Insel'
set term svg enhanced size 800,600
set out 'img/publisher_pages_decades.svg'
replot
Iterate over publishers:
for publisher in $( awk -F' \t' ' {print $2}' publisher_map.tsv | sort -u | sed " s/ /###/g" ) ; do
# get publisher name
publisher=$( echo $publisher | sed " s/###/ /g" )
# echo "$publisher\t" $(awk -F'\t' -v p="$publisher" '{if ($3 == p) print $2"\t hier dann Titel, Autor, Jahr"}' publisher_data.tsv | wc -l)
# extract all works
echo " \n**** $publisher \n"
echo " | pages | author: title (year) |"
awk -F' \t' -v p=" $publisher " ' {if ($3 == p) print "| "$2" | [[https://www.wikidata.org/wiki/"$8"]["$7"]]: [[http://d-nb.info/"$5"]["$4"]] ("$1")"}' publisher_data.tsv | sort -t' |' -nrk2 | head -n20
done
Luchterhand Literaturverlag
pages author: title (year)