The DBpedia Spotlight Model Editor was originally developed by Idio to tweak DBpedia Spotlight models up to version 0.6 - 0.7. Both repos have since been archived and are no longer maintained.
In 2018, DBpedia moved Spotlight's codebase to another repository, namely DBpedia Spotlight Model.
This repo is an attempt to resuscitate the model-editor tool and make it work with the new(ish) DBpedia Spotlight Model entity linking system.
Many thanks to Idio (and especially to @dav009).
In order to use the Model Editor, you will need:
- Java 1.8
- sbt (> 1.0)
- (Optional) A compiled and installed DBpedia Spotlight Model tool (if you want to test against a development version of Spotlight Model)
- A pre-computed language model (downloaded from here)
Install Java, Maven (mvn) and sbt on your system. If you are editing the latest models (i.e. version 1.1), you are all set.
For any other version, you should:
git clone https://github.com/dbpedia-spotlight/dbpedia-spotlight-model
cd dbpedia-spotlight-model && mvn install
This builds a development version of Spotlight Model into your local Maven repository.
Then clone this repo, cd into it, and change the Spotlight Model reference in your build.sbt file to point to that build.
In either case, run sbt test; if it's all green, you are ready to go.
This is a command line tool for editing a model, and there are essentially two ways to use it.
Since models are usually large, your machine needs a lot of RAM (sometimes more than 16 GB).
You can compile a jar via sbt assembly, which will produce target/scala-2.10/dbpedia-model-editor.jar.
You can use it as a CLI in the following way:
java -Xmx15g -jar target/scala-2.10/dbpedia-model-editor.jar <command> <subcommand> <args> ...
A script that calls the tool via sbt runMain has been provided in the model-editor.sh file.
You can use it like this:
./model-editor.sh <command> <subcommand> <args> ...
Note that you might need to tweak the -Xmx value passed in .sbtopts depending on your machine or use case.
Commands and subcommands allow you to perform certain actions on a Spotlight model manually, including:
- Add new surface forms
- Add new entity URIs
- Create associations between surface forms and DBpedia URIs
- Remove associations between surface forms and DBpedia URIs
- Make surface forms spottable or invisible
- Modify the context vectors
Models are published on DBpedia's Databus here. Before running these operations, you need to download the one corresponding to the language you are going to modify.
Once you have it all compiled and ready to go, you can test things by running:
./model-editor.sh explore <path-to-model-folder>/<lang>/model/ 20
This should print the stats for 20 surface forms from the model you just downloaded.
Start by freeing as much RAM as possible. Each entry below describes one command (and, where applicable, a subcommand) that is passed to the jar or script as follows:
- command: explore
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: number of surface forms
- result: outputs arg2 SurfaceForms with their respective candidates, priors and statistics
example:
./model-editor.sh explore <path-to-model>/<lang>/model/ 40
All topic related actions are carried out using the topic command followed by one of the following subcommands:
- search: checking if a topic is in the stores
- check-context: printing the context of a topic
- clean-set-context: cleaning and setting the context of a topic
- command: topic
- subcommand: search
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: dbpediaURI
- result: looks for the given DbpediaId in the model and returns whether that topic exists in the model or not
example:
./model-editor.sh topic search <path-to-model> Michael_Schumacher
- command: topic
- subcommand: check-context
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: pipe-separated list of dbpediaUris
- result: prints the context of each given topic
example:
./model-editor.sh topic check-context en/model Barack_Obama\|United_States
- command: topic
- subcommand: clean-set-context
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: pathToFile
- result: the context words and counts for the topics in the file are cleared. The specified context words are then stemmed and added, with their respective counts, to the context vector of the given topics.
Each line of the given input file should look like:
dbpediaUri <tab> contextWordsSeparatedByPipe <tab> countsSeparatedByPipe
contextWordsSeparatedByPipe and countsSeparatedByPipe must contain the same number of items.
example:
./model-editor.sh topic clean-set-context en/model folder/fileWithContextChanges
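As a minimal sketch of preparing such an input file, the snippet below writes two lines with real tab characters and then prints how many context words and counts each line carries. The file name context_changes.tsv and all words and counts are made up for illustration; only the three-column layout comes from the format described above.

```shell
# Hypothetical file name; the words and counts below are invented sample data.
changes_file="context_changes.tsv"

# printf turns \t into a real tab and reuses the format for every three arguments,
# so each line is: dbpediaUri <tab> words|separated|by|pipe <tab> counts|separated|by|pipe
printf '%s\t%s\t%s\n' \
  'Michael_Schumacher' 'formula|racing|driver' '12|7|3' \
  'Barack_Obama' 'president|senator' '20|5' > "$changes_file"

# Print, per topic, how many context words and how many counts were given.
# The two numbers on each line should be equal.
awk -F '\t' '{
  printf "%s: %d words, %d counts\n", $1, split($2, w, "[|]"), split($3, c, "[|]")
}' "$changes_file"
```

The resulting file can then be passed as arg2 of topic clean-set-context.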
All surface form related actions are carried out using the surfaceform command followed by one of the following subcommands:
- stats: printing stats of a surface form
- candidates: printing the list of candidates of a surface form
- make-spottable: making surface forms spottable
- make-unspottable: making surface forms unspottable
- copy-candidates: adding to a surfaceFormA all candidates of a surfaceFormB
- command: surfaceform
- subcommand: stats
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: surfaceForm
- result: outputs statistics of the given surfaceForm
example:
./model-editor.sh surfaceform stats <path-to-model>/<lang>/model/ evrimleri
outputs statistics for the surface form evrimleri
- command: surfaceform
- subcommand: candidates
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: surfaceForm
- result: outputs the candidate topics of a surface form
example:
./model-editor.sh surfaceform candidates <path-to-model>/<lang>/model/ evrimleri
would check the candidate topics for the surface form evrimleri
- command: surfaceform
- subcommand: make-unspottable
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: either
  - a list of surface forms separated by |, e.g. how\|How\|Hello\ World
  - a file containing one surfaceForm per line (if option -f is passed)
- result: each SF won't be spottable anymore
example:
./model-editor.sh surfaceform make-unspottable <path-to-model> surfaceForm1\|surfaceForm2\|
./model-editor.sh surfaceform make-unspottable <path-to-model> pathTo/File/withSF -f
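When the list of surface forms grows, the -f variant avoids having to escape | and spaces on the command line. The sketch below writes one surface form per line and, for comparison, joins the same lines into the pipe-separated form. The file name surface_forms.txt is hypothetical; the sample surface forms are the ones used in the description above.

```shell
# Hypothetical file name, one surface form per line (spaces need no escaping here).
sf_file="surface_forms.txt"
printf '%s\n' 'how' 'How' 'Hello World' > "$sf_file"

# The same list as a single pipe-separated string, as used without -f.
joined=$(paste -s -d '|' "$sf_file")
echo "$joined"
```

The file would be passed as arg2 together with -f, as in the second example above.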
- command: surfaceform
- subcommand: copy-candidates
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: path to a file containing pairs of surface forms. Each line should be:
<originSurfaceForm> <tab> <destinySurfaceForm>
- result: copies the candidate topics from each originSurfaceForm as candidate topics to destinySurfaceForm
example:
./model-editor.sh surfaceform copy-candidates <path-to-model> pathToFile
- command: surfaceform
- subcommand: make-spottable
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: either
  - a list of surface forms separated by |, e.g. how\|How\|Hello\ World
  - a file containing one surfaceForm per line (if option -f is passed)
- result: each SF will be spottable
example:
./model-editor.sh surfaceform make-spottable <path-to-model> surfaceForm1\|surfaceForm2\|
./model-editor.sh surfaceform make-spottable <path-to-model> pathTo/File/withSF -f
All association related actions are carried out using the association command followed by one of the following subcommands:
- remove
- command: association
- subcommand: remove
- arg1: pathToSpotlightModel/model
- arg2: pathToInputFile
- result: all associations between SFs and Topics in the given input file will be deleted from the model.
Every line in the input file describes one association to delete and should follow the format:
dbpediaURI <tab> Surface Form
example:
./model-editor.sh association remove en/model /path/to/file/file_with_associations
- command: fsa
- subcommand: find
- arg1: path to DBpedia Spotlight model (e.g. en/model)
- arg2: pipe-separated list of surface forms
- result: the FSA spots for each surface form
example:
./model-editor.sh fsa find en/model Nintendo\ Wii\|barack
When updating the model with lots of SF, Topics and Context Words, it is best to do it from a file.
Each line of the file should follow the format:
dbpedia_id <tab> surfaceForm1|surfaceForm2... <tab> contextW1|contextW2... <tab> contextW1Counts|ContextW2Counts
Before making actual changes to the model, it might be useful to see how many SF, DBpedia topics and links between those two are missing:
./model-editor.sh file-update check path/to/en/model path_to_file/with/model/changes
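Before handing a changes file to the tool, a quick structural check can catch stray spaces used instead of tabs. The function below is only a sketch: check_changes_file and changes.tsv are hypothetical names, the sample line is invented, and it assumes (as for clean-set-context) that context words and counts are expected to line up one to one.

```shell
# check_changes_file FILE
# Returns non-zero if a line does not have 4 tab-separated fields, or if the
# number of context words (field 3) differs from the number of counts (field 4).
check_changes_file() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    NF != 4 {
      printf "line %d: expected 4 tab-separated fields, got %d\n", NR, NF
      bad = 1
      next
    }
    split($3, words, "[|]") != split($4, counts, "[|]") {
      printf "line %d: number of context words and counts differ\n", NR
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Invented sample line: dbpedia_id <tab> surface forms <tab> context words <tab> counts
printf 'Michael_Schumacher\tSchumacher|Schumi\tformula|racing\t12|7\n' > changes.tsv
check_changes_file changes.tsv && echo "changes.tsv is well-formed"
```

This only validates the layout of the file; whether the surface forms and topics exist in the model is what file-update check reports.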
To apply the changes, make sure you have enough RAM to hold all the stores of the model (around 15 GB), then run:
./model-editor.sh file-update all path/to/en/model path_to_file/with/model/changes
If you don't have enough RAM, you can update the SF and DbpediaTopics in one step and the Context Words in another; this requires less memory.
1. Go to the model folder and rename context.mem to context2.mem. This prevents the jar from loading the context store.
2. Update the surfaceform store, resource store and candidate store by calling:
./model-editor.sh file-update all path/to/en/model path_to_file/with/model/changes
3. A new file path_to_file/with/model/changes_just_context will be generated by the previous command. This file maps dbpediaIds (the model's internal indexes) to contextWords, and it is processed in the following steps.
4. Rename context2.mem to context.mem, and rename every other file in the model folder to something else. (If this is not done, the other stores will be loaded and will consume all your RAM.)
5. Update the context store by calling:
./model-editor.sh file-update context-only path/to/en/model path_to_file/with/model/changes_just_context
6. Rename all files back to their usual names and enjoy a freshly baked model.
Steps 1-4 can be applied while ignoring steps 5 and 6 when:
- wanting to add SFs
- wanting to link SFs with an already existing Dbpedia Topic
Steps 5-6 can be applied while ignoring steps 1-4 when:
- wanting to add Context Words to a Dbpedia Topic
Important:
- steps 1-4 will only add SF and Dbpedia Topics if they don't exist.
- steps 1-4 will make all specified SF spottable.
- steps 5-6 only ADD context words to the context of a Dbpedia Topic.
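The six low-memory steps above can be sketched as a small shell function. This is an illustration under stated assumptions, not part of the tool: hide_store, restore_store and low_memory_update are hypothetical helper names, the .hidden suffix is a made-up stand-in for "rename to something else", every store is assumed to be a regular file directly inside the model folder, and MODEL_EDITOR is an invented variable for overriding the script path.

```shell
# Hypothetical helpers: a store renamed with a ".hidden" suffix is not loaded.
hide_store()    { mv "$1" "$1.hidden"; }
restore_store() { mv "$1.hidden" "$1"; }

# low_memory_update MODEL_DIR CHANGES_FILE
low_memory_update() {
  model_dir=$1
  changes=$2
  editor=${MODEL_EDITOR:-./model-editor.sh}   # invented override variable

  # Steps 1-3: update surface forms, resources and candidates without the context store.
  hide_store "$model_dir/context.mem"
  "$editor" file-update all "$model_dir" "$changes"
  status=$?
  restore_store "$model_dir/context.mem"
  [ "$status" -eq 0 ] || return "$status"

  # Step 4: hide every file except the context store.
  for f in "$model_dir"/*; do
    [ -f "$f" ] || continue
    [ "$f" = "$model_dir/context.mem" ] && continue
    hide_store "$f"
  done

  # Step 5: update the context store from the generated *_just_context file.
  "$editor" file-update context-only "$model_dir" "${changes}_just_context"
  status=$?

  # Step 6: give every hidden file its usual name back.
  for f in "$model_dir"/*.hidden; do
    [ -f "$f" ] || continue
    restore_store "${f%.hidden}"
  done
  return "$status"
}
```

It would be called as low_memory_update path/to/en/model path_to_file/with/model/changes. Working on a copy of the model folder is advisable, since an interrupted run can leave files under their temporary names.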
One of the best ways to play with the models and modify them is to use the Scala console. From the project root, you can run:
sbt console
(Note: we provide a .sbtopts file with sensible defaults, but you might want to tweak them, allocating less or more RAM depending on your circumstances.)
Once you start a Scala console, you can use it like ipython to create instances of the Scala classes we have, load the models, check whether DBpedia ids exist, add new DBpedia ids, add new surface forms, etc.
Example:
import org.idio.dbpedia.spotlight.SpotlightModelReader
val spotlightModel = SpotlightModelReader.getSpotlightModel("<path-to-model>/model")
spotlightModel.showSomeSurfaceForms(10) // shows 10 surface forms
spotlightModel.getStatsForSurfaceForm("Barack Obama") // prints stats for the entities associated with that surface form
spotlightModel.searchForDBpediaResource("Caetano_Veloso") // returns a boolean: false if the resource is not in the model
spotlightModel.addNew("ikimono_gakari_sf","ikimono_gakari_dbpedia_uri", 1 , Array()) // adds a new entity
spotlightModel.exportModels("/new/path/of/folder/model/") // exports the modified model to the given folder
Copyright 2014 Idio
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0