In this tutorial we are gonna add a resource Lex Lex
to Karp.
All commands are supposed to run after poetry shell
(that is, in an activated virtual environment).
- Create a resource config
- In this example we have saved the config in
path/to/parolelexplus.json
- In this example we have saved the config in
- Add a resource:
karp-cli resource create path/to/parolelexplus.json
- Publish the resource:
karp-cli resource publish <resource_id>
- In our example:
karp-cli resource publish parolelexplus
- In our example:
- Before adding entries, we can dry-run and validate them:
karp-cli entries validate --resource_id <resource_id> <path/to/entries>
- In our example:
karp-cli entries validate --resource_id parolelexplus path/to/parolelexplus.jsonl
- In our example:
- Adding entries, can be done in 2 ways, both methods can read
json
,jsonl
,json.gz
orjsonl.gz
files:- Add all in one go:
karp-cli entries add <resource_id> <path/to/entries>
- In our example:
karp-cli entries add parolelexplus path/to/parolelexplus.jsonl
- In our example:
- Add all but in chunks:
karp-cli entries add <resource_id> <path/to/entries> --chunked --chunk-size <number (default: 1000)>
- This version works by addind
<chunk-size>
entries to the database and indexes them in the search-service (e.g. ElasticSearch). - The pro with this method is that not all entries are held in memory.
- The con is that if a later chunk fails, all previously added chunks are stored, that is the resource is partially added. Manual intervention can be needed.
- In our example:
karp-cli entries add parolelexplus path/to/parolelexplus.jsonl --chunked --chunk-size 10000
- This version works by addind
- NOTE If are going to add a large resource this can take several hours, please use
nohup
for better experience.- For instance
nohup karp-cli entries add <resource_id> <path/to/entries> --chunked > stdout.log 2> stderr.log.jsonl &
. - Karp print logs to stderr in jsonl format.
- In our example:
nohup karp-cli entries add parolelexplus path/to/parolelexplus.jsonl --chunked > stdout.log 2> stderr.log.jsonl &
- For instance
- Add all in one go: