kim gerdes edited this page Oct 11, 2018 · 13 revisions

This page is a short guide to the Arborator, a collaborative dependency annotation software.

It contains information for

  • annotators
  • validators (the referees between different annotations that decide on the final annotation to be included in the corpus)
  • arborator administrators, who assign the annotation and validation tasks
  • and site administrators (for installation of the software on an apache web server)
The software is designed for use in Mozilla Firefox. Google Chrome works partly. Other browsers are not supported. There is also a simple viewing version of Arborator, useful for integrating dependency graphs in your webpage.

the editor

Most functions are accessible by keyboard and mouse:

shortcuts

  • TAB - shows next sentence. if not yet loaded, loads it. loops through sentences
  • SPACE - edit next word. if no word handled yet, opens first word, loops through words
  • BACKSPACE - edit previous word. if no word handled yet, opens last word, loops through words
  • RETURN - accepts the changed head, function, or category. if SHIFT is held down and a new head is proposed, the existing link is preserved
  • ESC - stops editing, closes open menus
  • CURSOR UP or DOWN - if editing: open function menu, if SHIFT or CTRL is also pressed: open category menu
  • "c" - if editing: open category menu
  • "f" - if editing: open function menu
  • in open menus: first letter of function or category loops through the funcs/cats with that letter
  • ctrl-s - save

mouse

  • pull governor on dependent token creates link, function menu opens. if shift is held down, existing link is preserved
  • pull word to the top until the line changes color and let go to create a root link
  • click on function name or category opens corresponding menus
  • double-click on a token: opens the table with all features

tree status

each annotator can assign a status to each tree:

  • no tree if no tree has been saved by the user yet. clicking on the "no tree" label saves the currently visible tree
  • todo
  • ok
  • problem
The status can be seen by other users on the project page.

export

the graph of a tree can be exported in

  • pdf
  • ps
  • odg
  • jpg
  • png
  • tiff
  • svg

compare mode

  • validators and administrators have access to the compare mode (if there are two different annotations of the sentence):
    • the fruit icon opens a list of the annotations for the sentence.
    • different annotations of the given sentence can be checked
    • clicking on the fruit icon again gives a graphical representation of the different annotations of the sentence.
    • wrong links in the unified tree can be erased and the corrected tree can be saved.

connecting and splitting

Administrators see a special button, a chain symbol, at the far right of the buttons for each sentence. This allows for

  • connecting the sentence with the following sentence into one line. A simple click on the chain button opens an approval dialog.
  • splitting the sentence at any position into two separate lines. To do that, the administrator selects the word after which he or she wants to split the sentence. The word is then highlighted in color. Then a Ctrl-click on the chain button opens an approval dialog.

the project page

  • shows
    • the assigned texts
      • the annotator can change the text status by clicking on the default status (todo)
      • the default status tags are todo, ok, and problem
      • The status can be seen by administrators on the project page.
    • all the texts of the database and their annotators and validators
    • administrators additionally see a list of all the assignments per user
  • allows searching for words, functions, and other features in the database. See section "query"
  • allows for different types of conll export:
    • by assignment: the text is exported with all its sentences, a file per user. i.e. all sentences of a text are exported even if the annotator did not save them as his or her own. in that case, the parser's trees are taken
    • by existing trees: all saved trees for a given text are exported, a file per user, non-annotated sentences of a text do not appear in the file.
  • administrators can
    • assign annotators and validators: to assign a person,
      • click on "assign", choose the person,
        • click on "+" to assign a simple annotator (can't see the other annotators' annotations)
        • click on "✓" to assign a validator (can see the other annotators' annotations)
    • erase texts
    • add texts, see below

adding texts

  • Click on "Add files to the database".
  • Above: Upload a new file to the site. Below: list of already uploaded files containing syntactic analyses (CoNLL)
    • To upload: Click on "Browse", choose the file, "OK", then "Upload"
      • accepted file formats: Malt, CoNLL 10 (including orféo, ie. CoNLL 10+3), CoNLL 14
  • To include the file in the project: Click on the "^" button. If all goes well, the file is added to the database. Currently you have to refresh the Project page (F5) to see it.
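Before uploading, it can help to check which of the accepted formats a file looks like. The sketch below is not part of Arborator; it only counts tab-separated columns on the first token line and assumes that "CoNLL 10", "CoNLL 10+3" and "CoNLL 14" mean 10, 13 and 14 columns. The function name guess_conll_format is made up for this example, and Malt files are deliberately not detected.

```python
# Column counts are an assumption read off the format names on this page.
KNOWN_COLUMN_COUNTS = {
    10: "CoNLL 10",
    13: "CoNLL 10+3 (orféo)",
    14: "CoNLL 14",
}


def guess_conll_format(lines):
    """Guess the format from the first token line of an iterable of lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        columns = len(line.split("\t"))
        return KNOWN_COLUMN_COUNTS.get(columns, "unknown (%d columns)" % columns)
    return "empty file"
```

Pass it an open file object or a list of lines. Anything that is not 10, 13 or 14 columns wide is reported with its column count so you can inspect it by hand.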

exercise modes

  • teacher visible mode: dumb exercise where students have to copy the teacher's tree which is visible but not directly modifiable. a good start with 3 sentences or so.
  • no feedback: the student can't see the teacher's tree and gets no feedback, but the admin can export the results of the students' annotations compared to the teacher trees
  • percentage: when students save, they see what percentage of dependencies and POS tags they got wrong, but not where the errors are
  • graphical feedback: when students save, they can see where there are problems compared to the teacher's tree and they have to find the right annotation.
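To make the percentage mode concrete, here is a minimal sketch of how such a score could be computed. It is an illustration, not Arborator's actual code: the tree representation (a dict from token index to a (head, function, category) triple) and the function name error_percentages are assumptions made for this example.

```python
def error_percentages(student, teacher):
    """Return (dependency error %, category error %) against the teacher tree.

    Both trees map a token index to a (head, function, category) triple.
    A token missing from the student tree counts as wrong on both scores.
    """
    total = len(teacher)
    if total == 0:
        return 0.0, 0.0
    missing = (None, None, None)
    dep_wrong = 0
    cat_wrong = 0
    for index, (head, function, category) in teacher.items():
        s_head, s_function, s_category = student.get(index, missing)
        if (s_head, s_function) != (head, function):
            dep_wrong += 1  # wrong governor or wrong function label
        if s_category != category:
            cat_wrong += 1
    return 100.0 * dep_wrong / total, 100.0 * cat_wrong / total
```

Here a dependency counts as correct only if both the governor and the function match; whether Arborator scores the two separately is not stated on this page.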

query

The Arborator allows for queries in the database. In the field in the upper right corner of the project page, Google-like queries can be carried out: space-separated query terms, quotes around multiple words to search for the whole string including spaces, AND (default), OR, *, ... Hit "Enter" and the system returns a list of results with links to the corresponding sentences (with snippets of sentences containing all the query terms if no features were used). Note that feature searches are much slower as they are not precompiled.

  • Feature queries: Colon-separated attribute-value pairs can be included, for example cat:N. "func" or "function" can be used to access function names. These are valid queries:
    • 'agréable cat:I func:para_disfl'
    • 'lemma:pouvoir func:fixed tag:NOUN journée'
Note, however, that a query with several features looks for a single word that has all of them. For example, the query 'cat:N lemma:red' looks for words that have both the features cat=N and lemma=red.
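As an illustration of the query syntax, the following sketch splits a query string into plain search terms and attribute:value features. It is not the parser Arborator uses; split_query is a hypothetical name, and the AND, OR and * operators are left out and would simply be treated as ordinary terms.

```python
import re

# A token is either a double-quoted string or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def split_query(query):
    """Split a query into (plain terms, {attribute: value} features)."""
    terms, features = [], {}
    for quoted, bare in TOKEN.findall(query):
        if quoted:
            terms.append(quoted)  # quoted strings keep their spaces
            continue
        attribute, colon, value = bare.partition(":")
        if colon and attribute and value:
            if attribute == "function":
                attribute = "func"  # both spellings name the function
            features[attribute] = value
        else:
            terms.append(bare)
    return terms, features
```

For the second example query above, this gives the single term journée and the features lemma, func and tag, all of which have to hold for the same word. A repeated attribute simply overwrites the earlier value in this sketch.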

using the integrated Mate Parser for Bootstrapping

Put 'mate = 1' in the config file. Admins then see a mate box on the project page; clicking it takes the validated trees and adds, for every sentence, a tree annotated by the "mate" user.

Simple steps: Say you have an annotated example treebank and a simple text file you want to parse. On the server:

  1. Create a new project (new folder in projects of name XXX)
  2. ' cd lib ' and then ' python createDB.py XXX'
On the website of the new project:
  1. Click on "Add files to the database" to upload treebank
  2. Click on "Add files to the database", then check "A sentence per line" to upload the text file (one sentence per line, no empty lines!). Ignore the weird json, go back one step, and refresh
  3. assign the treebank sample to a user as validator; the user has to change the status from todo to ok
  4. check "all validated trees" in the mate box
  5. wait before admiring the parse results.
This can be repeated as soon as new trees have been validated.
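The text file for step 2 must have one sentence per line and no empty lines. A small helper like the following can clean a file before upload. It is a sketch, and the name one_sentence_per_line is made up here; it does not segment sentences, it only strips whitespace and drops empty lines.

```python
def one_sentence_per_line(text):
    """Strip surrounding whitespace on each line and drop empty lines."""
    kept = [line.strip() for line in text.splitlines()]
    return "".join(sentence + "\n" for sentence in kept if sentence)
```

Read your file, pass its content through the function, and write the result to the file you upload.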

account editing

  • On the bottom of each page are links to logout and to edit the user account. The page provides
    • information on past site access
    • change of password and real name
  • Administrators additionally have a link for user administration which allows
    • To create or invite new users
    • To edit or delete users
    • To edit the main config file and default user

User Administration

  • Users can be simple annotators or validators
  • Annotators can only see their own trees and the trees by users specified in the project.cfg (generally the trees by parser)
  • Validators can see all the trees on the texts for which they are assigned as validators and can use the compare tools.
  • Admins are declared in the user admin pages. They should obtain the "Admin Level" of 1 (only the site admin should have 3). Admins can
    • assign annotation and validation tasks
    • see all the trees
    • upload conll files
    • erase texts
    • export texts as conll
Be careful when editing user.ini files by hand: Don't leave temporary files in the user folder!

site administration and installation

The software is written in Python (server side) and Javascript (client side). It is released under the GNU Affero GPL v3 licence. The latest version can be obtained on the Arborator's launchpad page.

To install the Arborator, the whole source dump has to be unzipped to a folder on an apache server.

Arborator uses SQLite, which needs to be installed along with the corresponding Python sqlite module (standard in recent versions of Python). The tree transformation for the import options also uses two non-standard modules: nltk.featstruct (you'll have to install nltk, on ubuntu, that's done with sudo apt-get install python-nltk) and jellyfish (for fuzzy matching when importing, install by typing sudo pip install jellyfish). All other used python modules are standard: difflib, generators, glob, hashlib, json, optparse, os, random, re, shutil, sqlite3, subprocess, sys, time, traceback, urllib, xml.dom
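A quick way to see whether the modules named above are available is to try importing them. This is a small sketch, not a script shipped with Arborator; missing_modules is a name chosen for this example. Run it with the same Python interpreter that the web server uses.

```python
def missing_modules(names):
    """Return the module names from the list that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The three modules Arborator needs beyond a bare Python install.
    print(missing_modules(["sqlite3", "nltk.featstruct", "jellyfish"]))
```

An empty list means everything was found; otherwise install the listed modules as described above.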

The Arborator includes various open source scripts and software:

  • Javascript tools
    • JQuery
    • JQueryUI
    • Raphael
    • JQuery.fileupload
  • Python tools
    • logintools
  • Java tools
    • Batik
    • svg2office

making Arborator run locally on an ubuntu machine for test purposes

install apache

either the whole lamp (if you want to use php or mysql):

 sudo apt-get install tasksel
 sudo tasksel install lamp-server

or simply apache alone:

 sudo apt-get install apache2

try to visit http://localhost in your browser

if it doesn't connect try

 sudo /etc/init.d/apache2 restart

linking the folder

suppose you downloaded arborator into /home/me/arborator (i.e. the file index.cgi for example is in this folder)

 sudo ln -s /home/me/arborator /var/www/arborator

make the folder accessible and writable for everyone (not a good idea for a public server!):

 sudo chmod -R a+rw /home/me/arborator

check whether all .cgi files are executable. if not:

 sudo find /home/me/arborator -name "*.cgi" -exec chmod +x {} \;
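If you prefer to check first which .cgi files are not yet executable, a short Python sketch can list them. The function name non_executable_cgi_files is made up for this example, and the folder /home/me/arborator is the one assumed in this section.

```python
import os
import stat


def non_executable_cgi_files(root):
    """Return the .cgi files under root that have no execute bit set."""
    found = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.endswith(".cgi"):
                path = os.path.join(folder, name)
                mode = os.stat(path).st_mode
                if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
                    found.append(path)
    return sorted(found)


if __name__ == "__main__":
    for path in non_executable_cgi_files("/home/me/arborator"):
        print(path)
```

If the list is empty, nothing needs to be changed.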

apache configuration

now look into /etc/apache2/sites-enabled/; there should be the default server: 000-default

open this file with your favorite editor, for example with dolphin



the information about the /var/www directory should look like this:

 <Directory /var/www/>
     Options Indexes FollowSymLinks MultiViews
     Options +FollowSymLinks
     Options +ExecCGI
     AddHandler cgi-script .cgi
     AllowOverride None
     Order allow,deny
     allow from all
 </Directory>

change it, save it, restart apache:

 sudo /etc/init.d/apache2 restart

now going on http://localhost/arborator in your browser should show the arborator start page

creating a new project

  • create a folder with the name of the project in the projects directory
  • copy project.cfg from an example project and edit it:
    • this includes giving the list of categories and functions in separate files. In the editor, functions and categories not in the list will be shown in grey and cannot be assigned.
    • each function and each category can be followed by a tab and a json/css description of how the arrows and categories should look like respectively
  • make all new folders world read-writable: the projects folder, the export/cache folder, the user folder, and all their subfolders. if one is not, you can change its mode with:
 chmod 777 files_or_folder

If you edit anything in the users folder, don't leave temporary files (...ini~).

  • to create a new database, go to the project directory (in arborator/projects) and run:
 python ../lib/createDB.py name_of_project
  • an image file name_of-the_project.png can be placed in the project folder. this will be shown on the start page.
  • you should not have to change anything outside of the projects folder.
    • only exception: you may want to add (rhapsodie xml or conll) files into the corpus folder, instead of manually uploading each file individually.
  • Please make sure your whole project folder is writable.
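The function and category list files mentioned above hold one name per line, optionally followed by a tab and a description of how it should look. The sketch below shows how such a line could be read. It assumes the description is strict JSON, which this page does not guarantee, and both the function name parse_label_line and the attribute "stroke" in the usage example are hypothetical.

```python
import json


def parse_label_line(line):
    """Split 'name<TAB>{json}' into (name, style dict); the style is optional."""
    name, tab, description = line.rstrip("\n").partition("\t")
    style = json.loads(description) if tab and description.strip() else {}
    return name, style
```

For example, parse_label_line('subj\t{"stroke": "#ff0000"}') separates the function name subj from its style description, and a line with only a name gets an empty style.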

entering parser data

In a usual setting, the data is automatically preparsed and only corrected by the annotators. CoNLL (Malt, 10 or 14) files as well as Rhapsodie XML files can be uploaded into the database from the project page. For the moment, all existing annotations on texts of the same name are erased when uploading a new file!!!

  • you can also enter the CoNLL or XML files manually into the database (using a python script) instead of clicking on each link on the site:
look at the last lines of treebankfiles.py. the two essential lines are: sql = SQL(name_of_project) and sql.enterRhapsodie("corpus/xml/example.xml") for xml and sql.enterConll("corpus/conll/example.conll10") for CoNLL (Malt, CoNLL 10 and CoNLL 14 are accepted).
  • If users get their own login by signing up, they have to log on to confirm. Only then, the user is added to the database (and thus only then the user can be assigned an annotation task). In case of automatic creation of users (or manual creation by adding ini files), click once on User Administration / edit users so that all users are also included in the database.
  • Another way, if you want to enter a data file such as a .conllu file, is to follow this example in the function main:
    import conll  # needed for conll.conllFile2trees if not already imported
    from database import SQL
    trees = conll.conllFile2trees("../projects/yourproject/export/yourexemple.conllu")
    simpleEnterSentences(SQL("yourproject"), trees, "yourexemple", "parser", eraseAllAnnos=True)
  • Attention:
    • Make sure that you have imported SQL.
    • If you set the option eraseAllAnnos=True, the earlier annotations named "yourexemple" will be erased.
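To enter a whole folder of files with the two calls named above (sql.enterRhapsodie and sql.enterConll), a loop like the following could be used. This is a sketch under assumptions: the helper enter_corpus_files is not part of Arborator, it picks the call by file extension only, and it expects to be handed an object with those two methods.

```python
import os


def enter_corpus_files(sql, folder):
    """Enter every file of a folder through the given SQL-like object.

    Files ending in .xml go to sql.enterRhapsodie, all others to
    sql.enterConll. Returns the list of paths that were entered.
    """
    entered = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip subfolders
        if name.endswith(".xml"):
            sql.enterRhapsodie(path)
        else:
            sql.enterConll(path)
        entered.append(path)
    return entered
```

With the SQL class from the example above, a call would look like enter_corpus_files(SQL("yourproject"), "corpus/conll"). Remember that uploading currently erases existing annotations on texts of the same name, so make a backup first.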


further tweaks

The database is a file called arborator.db.sqlite. It's located in the project folder.

  • before doing any work on the databases, warn potential annotators by renaming the file xsitemessage.html to sitemessage.html
  • make a backup copy of your database (simply copy the file somewhere else)
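Both precautions can be scripted. The sketch below is only an illustration: the function names are made up, the timestamped backup file name is a choice made here, and the folder that contains xsitemessage.html is passed in as a parameter because this page does not say where the file lives.

```python
import os
import shutil
import time


def backup_database(project_dir, backup_dir):
    """Copy arborator.db.sqlite to backup_dir under a timestamped name."""
    source = os.path.join(project_dir, "arborator.db.sqlite")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(backup_dir, "arborator.db.%s.sqlite" % stamp)
    shutil.copy2(source, target)
    return target


def show_site_message(site_dir):
    """Activate the warning by renaming xsitemessage.html to sitemessage.html."""
    os.rename(os.path.join(site_dir, "xsitemessage.html"),
              os.path.join(site_dir, "sitemessage.html"))
```

Call show_site_message first, give annotators a moment to save, then call backup_database before touching the database. Rename the file back when you are done.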

bulk correction

the distribution contains a script called bulkCorrectDatabase.py

it runs over the whole database and corrects features coherently. however, it's very slow; direct access to the database is usually faster.

  • the function to call in it is bulkcorrectDB.
    • it can be called with a list of treeids: bulkcorrectDB("Rhapsodie", [9795])
    • or without it: bulkcorrectDB("Rhapsodie")

TODO

  • non-destructive upload: integrate users' annotations without deleting existing annotations, even if tokens are slightly different (diff)
  • comparison mode: referee, Cohen's kappa
  • importation of a selection of features from the rhapsodie xml format.
  • special save button for validators in order to keep their original annotation if they are also annotators.
  • graphical editor:
    • little problems with the undo manager, e.g. if a menu opens and the same func/cat is chosen, don't register as dirty; something is wrong after saving. undo/redo accessibility is different from dirty/clean: after save, undo/redo should remain accessible (ok)
  • check the simple graphical viewer:
    • put ad hoc colors back in again (currently: project's colors)
  • upload page: make trash file work, make rhapsodie xml work
  • make tiny and fast js compilation
  • redesign: make project choice cookie based (and not form based)