
qEndpoint CLI Indexing datasets


To follow these instructions, you need to have the qEndpoint CLI installed; the installation steps can be found on the qEndpoint-CLI-commands page.

Datasets

Wikidata

To index Wikidata, you first need to download a dump of the dataset from here. You can download either the truthy dump or the full dump. (The difference is explained here.)
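
As an example, a truthy dump can be fetched from the Wikimedia dump server (a sketch; the exact file name changes over time, so check the dumps page for the current one):

# Download the latest "truthy" N-Triples dump (the URL is an assumption,
# verify it on the Wikimedia dumps page)
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2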

BSBM

The Berlin SPARQL Benchmark (BSBM) is a benchmark that uses a generator to create custom-sized datasets. You can find this generator inside the bsbmtools package.
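
For example, the generator can be downloaded from the bsbmtools project page (a sketch; the archive and folder names assume the 0.2 release, check the project page for the current one):

# Download and unpack the BSBM tools (archive and folder names are assumptions)
wget https://sourceforge.net/projects/bsbmtools/files/latest/download -O bsbmtools.zip
unzip bsbmtools.zip
cd bsbmtools-0.2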

The generator works by generating a given number of products; below is a table mapping some product counts to the number of triples and the equivalent HDT size. (specs)

Products   Triples   HDT Size
10K        3.53M     191MB
50K        17.5M     933MB
100K       34.9M     1.9GB
200K       69.5M     3.6GB
500K       174M      9GB
1M         347M      18GB
2M         693M      38GB

The generator can be invoked as follows:

# Generate a dataset with 100000 products
./generate -s nt -pc 100000
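
The generator writes its output to dataset.nt by default (an assumption based on the default settings); rdf2hdt accepts compressed input, so you can gzip the file to save disk space:

# Compress the generated dataset; rdf2hdt can read .gz files directly
gzip dataset.nt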

An update dataset can be generated using the -ud option. See the benchmark specification to understand the difference.

# Generate an update dataset with 100000 products
./generate -s nt -ud -pc 100000

LUBM

The Lehigh University Benchmark (LUBM) is a benchmark that also uses a generator to create a custom dataset. Less complex than BSBM, it can still be used to test the generation. The dataset size is described by the number of universities.

To generate the dataset, we use this script, which generates a dataset with a given number of universities.

# Generate a dataset with 1000 universities (LUBM1K)
./lubm.sh -t -u 1000

Indexing

Once you have your dataset, you can create an HDT file using the rdf2hdt command.

# Index the file dataset.nt.gz into dataset.hdt
rdf2hdt dataset.nt.gz dataset.hdt
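
You can optionally sanity-check the result with the hdtVerify command (a sketch, assuming hdtVerify is available in your CLI installation; it validates the dictionary ordering of the generated file):

# Verify the generated HDT file
hdtVerify dataset.hdt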

If the dataset is too large to fit into memory, you can configure the indexing process to use a disk-based algorithm.

First create a config file; we will call it option.hdtspec.

The default config to enable the disk-based generation is the following:

option.hdtspec

# Set the loader type; "cat" splits the generation into small HDTs that are
# then merged into one with HDTCat
loader.type=cat
# Configure the cat loader; we use the "disk" loader type so that the small
# HDTs are created with the disk-based generation
loader.cattree.loadertype=disk
loader.cattree.futureHDTLocation=cfuture.hdt
loader.cattree.location=cattree
loader.cattree.memoryFaultFactor=1
loader.cattree.kcat=20
# Disk-based generation configs: slower than the in-memory implementation,
# but more resilient to out-of-memory errors
loader.disk.futureHDTLocation=future_msd.hdt
loader.disk.location=gen
loader.disk.compressWorker=3
# HDTCat configuration
hdtcat.location=catgen
hdtcat.location.future=catgen.hdt

You can now pass this config file to the rdf2hdt command with the -config option.

# Index the file dataset.nt.gz into dataset.hdt using option.hdtspec configs
rdf2hdt -config option.hdtspec dataset.nt.gz dataset.hdt

You should now have the dataset.hdt HDT file, which you can use with qEndpoint or with any other software that supports HDT files.
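
For example, to serve the file with qEndpoint you can place it where the endpoint expects its index (a sketch; the hdt-store path and index name are assumptions based on the default layout, check the qEndpoint documentation for your version):

# Copy the HDT into the default qEndpoint store (paths are assumptions)
mkdir -p qendpoint/hdt-store
cp dataset.hdt qendpoint/hdt-store/index_dev.hdt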

Custom dictionary

You can define a custom dictionary type using the dictionary.type property; it can also be used to generate HDTq (quad) HDT files.

You can use the -printoptions argument to see all options.
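
For example, to index with a language-aware dictionary, the property can be added to the spec file passed to rdf2hdt (a minimal sketch; the value names come from the table below):

option.hdtspec

# Select the dictionary type (see the table below for the possible values)
dictionary.type=dictionaryMultiObjLang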

The currently supported dictionaries are split by three capabilities:

  • Quads: whether the HDT indexes quads (HDTq)
  • Type: whether the HDT indexes by literal type (allows fast retrieval in the library)
  • Language: whether the HDT indexes by literal language (allows fast retrieval in the library)
Value                                         Quads  Type  Language
<http://purl.org/HDT/hdt#dictionaryFour>
<http://purl.org/HDT/hdt#dictionaryFourQuad>  ✔️
dictionaryMultiObj                                   ✔️
dictionaryMultiObjLang                               ✔️    ✔️
dictionaryMultiObjLangQuad                    ✔️     ✔️    ✔️

/!\ Not all of these dictionaries may be available in the rdfhdt HDT libraries.

Convert dictionary

Once your data is indexed, you might want to create another HDT with a different dictionary type without reindexing your data. The hdtconvert tool is made for that.

It takes the old dataset, the new dataset location, and the new dictionary type.

# Convert dataset-old.hdt into dataset-new.hdt with the new dictionary being a dictionaryMultiObjLang
hdtconvert dataset-old.hdt dataset-new.hdt dictionaryMultiObjLang

Not all dictionary types are available for conversion.

Update

To add triples to an HDT file, you first need to create an HDT containing all the triples you want to add to your dataset. Once this is done, you can use the hdtDiffCat command to merge the HDTs into one.

# Combine multiple HDT files (the last one is the output)
hdtDiffCat dataset1.hdt dataset2.hdt output.hdt

If you want to remove triples from an HDT file, you can also use hdtDiffCat, this time with the -diff parameter.

# Remove the triples of dataset3.hdt from dataset1.hdt and output the result in output.hdt
hdtDiffCat dataset1.hdt -diff dataset3.hdt output.hdt

These two commands can be combined to cat and diff at the same time.

# Cat dataset1.hdt and dataset2.hdt and remove the triples from dataset3.hdt into output.hdt
hdtDiffCat dataset1.hdt dataset2.hdt -diff dataset3.hdt output.hdt