Add script to get CG-Dump; generate rvk.tsv (#1058)
See also #2024.
dr0i committed Aug 27, 2024
1 parent 52ff1f1 commit 2c80e61
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions scripts/generateRvkConcordance.sh
@@ -0,0 +1,22 @@
#!/bin/bash
# Date: 2024-08
# Description: Fetches the monthly aggregated dump from Culturegraph and generates rvk.tsv.
# Called from crontab on the second Wednesday of each month.
# Takes 5.5 h as a single process on quaoar.
# Generated TSV: ~257 MB
# See https://github.com/hbz/lobid-resources/issues/1058.

URL_ROOT="https://data.dnb.de/culturegraph/"
TARGET_FNAME="/data/other/cg/aggregate.marcxml.gz"

FNAME=$(curl "$URL_ROOT" | grep '<a href="aggregate_' | sed 's#.*<a href="aggregate_\([^"]*\)".*#aggregate_\1#')
echo "Got filename: $FNAME"
wget "$URL_ROOT$FNAME" -O "$TARGET_FNAME"

FNAME_SIZE=$(ls -s "$TARGET_FNAME" | cut -d ' ' -f1)
if [ "$FNAME_SIZE" -gt 8654321 ]; then # 9593288 blocks was aggregate_20240507.marcxml.gz
  cd "$(dirname "$0")/.." || exit 1 # repository root, independent of the caller's working directory
  export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
  mvn exec:java -Dexec.mainClass="org.lobid.resources.run.CulturegraphXmlFilterHbzRvkToTsv" -Dexec.args="$TARGET_FNAME"
fi
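
The filename lookup in the script relies on the directory listing containing exactly one `aggregate_` link. If the listing ever held several dumps, `FNAME` would span multiple lines and the `wget` URL would be malformed. A minimal sketch of a more defensive lookup follows. It assumes that file names embed a sortable date, as in `aggregate_20240507.marcxml.gz`. The helper name `latest_aggregate` and the sample listing are hypothetical and are not part of the committed script or real server output.

```shell
#!/bin/bash
# Hypothetical helper: reads an HTML directory listing on stdin and prints
# the newest aggregate_* file name (assumes names embed a YYYYMMDD date).
latest_aggregate() {
  # Keep only href attributes that point at aggregate_* files,
  # strip the surrounding href="...", then take the last name in sort order.
  grep -o 'href="aggregate_[^"]*"' \
    | sed 's/^href="//; s/"$//' \
    | sort \
    | tail -n 1
}

# Made-up sample listing with two dumps, for illustration only.
sample='<a href="aggregate_20240402.marcxml.gz">aggregate_20240402.marcxml.gz</a>
<a href="aggregate_20240507.marcxml.gz">aggregate_20240507.marcxml.gz</a>'

FNAME=$(printf '%s\n' "$sample" | latest_aggregate)
echo "Got filename: $FNAME"
```

In the script itself this would be used as `FNAME=$(curl "$URL_ROOT" | latest_aggregate)`. An empty result then signals that no dump was listed, which could be checked before calling `wget`.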
