qEndpoint CLI Indexing datasets
To follow these instructions, you need to have the qEndpoint CLI installed; the installation is described on the qEndpoint-CLI-commands page.
To index Wikidata, you first need to download a dump of the dataset from here. You can download either the truthy dump or the full (all) dump. (The difference is explained here.)
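As a sketch, assuming the usual layout of the public Wikimedia dump server, the dumps can be fetched directly with wget:
# Truthy dump (only direct, "truthy" statements)
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2
# Full dump (all statements)
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2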
The Berlin SPARQL Benchmark (BSBM) is a benchmark that uses a generator to create benchmarks of custom size. You can find this generator in the bsbmtools package.
The generator works by creating a certain number of products; the table below lists some product counts with the corresponding number of triples and the equivalent HDT size. (specs)
Products | Triples | HDT Size |
---|---|---|
10K | 3.53M | 191MB |
50K | 17.5M | 933MB |
100K | 34.9M | 1.9GB |
200K | 69.5M | 3.6GB |
500K | 174M | 9GB |
1M | 347M | 18GB |
2M | 693M | 38GB |
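Before running the generator, download and unpack bsbmtools; the SourceForge location and archive name below are assumptions, so check the BSBM page for the current release.
# Download and unpack bsbmtools (assumed location and version)
wget https://sourceforge.net/projects/bsbmtools/files/latest/download -O bsbmtools.zip
unzip bsbmtools.zip
cd bsbmtools-0.2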
The generator can then be run as follows:
# Generate a dataset with 100000 products
./generate -s nt -pc 100000
An update dataset can be generated using the -ud option. See the benchmark specification to understand the difference.
# Generate an update dataset with 100000 products
./generate -s nt -ud -pc 100000
The Lehigh University Benchmark (LUBM) is a benchmark that also uses a generator to create a custom dataset. Although less complex than BSBM, it can still be used to test the generation. The dataset size is described by the number of universities.
To generate the dataset, we are using this script, which generates a dataset with a given number of universities.
# Generate a dataset with 1000 universities (LUBM1K)
./lubm.sh -t -u 1000
Once you have your dataset, you can create an HDT file using the rdf2hdt command.
# Index the file dataset.nt.gz into dataset.hdt
rdf2hdt dataset.nt.gz dataset.hdt
If the dataset is too large to fit into memory, you can configure the indexing process to use a disk-based algorithm.
First, create a config file; we will call it option.hdtspec.
The default config to enable the disk-based generation is the following:
option.hdtspec
# Set the loader type; cat splits the generation into small HDTs and merges them
# into one with HDTCat
loader.type=cat
# Configure the cat loader; we use the disk loader type so the small HDTs are
# created with the disk-based generation
loader.cattree.loadertype=disk
loader.cattree.futureHDTLocation=cfuture.hdt
loader.cattree.location=cattree
loader.cattree.memoryFaultFactor=1
loader.cattree.kcat=20
# Disk-based generation settings: slower than the in-memory implementation, but
# more resilient against out-of-memory errors.
loader.disk.futureHDTLocation=future_msd.hdt
loader.disk.location=gen
loader.disk.compressWorker=3
# HDTCat configuration
hdtcat.location=catgen
hdtcat.location.future=catgen.hdt
You can now pass this config file to the rdf2hdt command with the -config option.
# Index the file dataset.nt.gz into dataset.hdt using the option.hdtspec config
rdf2hdt -config option.hdtspec dataset.nt.gz dataset.hdt
You should now have the dataset.hdt HDT file, which you can use with qEndpoint or with any other software that reads HDT files.
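For example, to use the file with qEndpoint, the pre-indexed HDT can be copied into the endpoint's store; the qendpoint/hdt-store directory and the index_dev.hdt file name are assumed defaults, check the qEndpoint documentation for your setup.
# Use the pre-indexed HDT as the qEndpoint store index (assumed default layout)
mkdir -p qendpoint/hdt-store
cp dataset.hdt qendpoint/hdt-store/index_dev.hdt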
You can define a custom dictionary type using the dictionary.type property; it can also be used to generate HDTq HDT files.
You can use the -printoptions argument to see all available options.
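As a minimal sketch, the dictionary type can be set in the same option.hdtspec file used above (dictionaryMultiObjLang is one of the values listed in the table below):
# option.hdtspec: index literals by type and language
dictionary.type=dictionaryMultiObjLang
The full list of options can then be printed with:
# Show all available options
rdf2hdt -printoptions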
The currently supported dictionaries are split according to three capabilities:
- Quads: whether the HDT indexes quads (HDTq)
- Type: whether the HDT indexes literals by type (allows fast retrieval in the library)
- Language: whether the HDT indexes literals by language (allows fast retrieval in the library)
Value | Quads | Type | Language |
---|---|---|---|
<http://purl.org/HDT/hdt#dictionaryFour> | ❌ | ❌ | ❌ |
<http://purl.org/HDT/hdt#dictionaryFourQuad> | ✔️ | ❌ | ❌ |
dictionaryMultiObj | ❌ | ✔️ | ❌ |
dictionaryMultiObjLang | ❌ | ✔️ | ✔️ |
dictionaryMultiObjLangQuad | ✔️ | ✔️ | ✔️ |
/!\ Not all of these dictionaries might be available in the rdfhdt HDT libraries.
With your dataset indexed, you might want to create another HDT with a different dictionary type from the already indexed HDT, without reindexing your data. The hdtconvert tool is made for that.
It takes the old dataset, the location of the future dataset, and the new dictionary type.
# Convert dataset-old.hdt into dataset-new.hdt with the new dictionary being a dictionaryMultiObjLang
hdtconvert dataset-old.hdt dataset-new.hdt dictionaryMultiObjLang
Not all dictionary types are available for conversion.
To add triples to an HDT file, you first need to create an HDT containing all the triples you want to add to your dataset. Once this is done, you can use the hdtDiffCat command to merge the HDTs into one.
# Combine multiple HDT files (the last one is the output)
hdtDiffCat dataset1.hdt dataset2.hdt output.hdt
If you want to remove triples from an HDT file, you can also use hdtDiffCat, this time with the -diff parameter.
# Remove the triples of dataset3.hdt from dataset1.hdt and output the result in output.hdt
hdtDiffCat dataset1.hdt -diff dataset3.hdt output.hdt
These two commands can be combined to cat and diff at the same time.
# Cat dataset1.hdt and dataset2.hdt and remove the triples from dataset3.hdt into output.hdt
hdtDiffCat dataset1.hdt dataset2.hdt -diff dataset3.hdt output.hdt