Add Django command for extracting annotations #387

Merged Oct 2, 2023 · 47 commits

Changes from all commits
8aaf1c4
Test that removing or rejecting an annotator from a project should cl…
ianroberts May 5, 2023
43bca74
Errors should be raised, not returned
ianroberts May 5, 2023
aa949ee
Clear pending annotation(s) when rejecting a user from a project
ianroberts May 5, 2023
3409e27
Merge pull request #362 from GateNLP/annotator-removal-bug
twinkarma May 5, 2023
97c387f
correct DOI
May 11, 2023
1ab0cf2
manage version in readme citation section as well
May 11, 2023
664f8da
use safer regex pattern
May 11, 2023
a444750
Merge pull request #364 from GateNLP/correct_doi
davidwilby May 11, 2023
0d2e116
enable version.py update to take version number as argument
May 11, 2023
59d3617
Added fadein transition
twinkarma May 12, 2023
0db010c
Put transition trigger in getCurrentTask() instead
twinkarma May 12, 2023
5ac4304
Stop throwing error on initialise rpc call, added tests to check that…
twinkarma May 12, 2023
9502dcf
Merge pull request #367 from GateNLP/annotation-view-transition
twinkarma May 16, 2023
89d0cd5
Acquire a row lock on the user before checking for an annotation task…
ianroberts May 16, 2023
1c109f3
Initial attempt at implementing #164 (conditional annotation)
ianroberts Apr 24, 2023
a0d3133
Only validate components that are visible (i.e. exclude those with an…
ianroberts Apr 24, 2023
10df61d
Filter annotationOutput to include only visible widgets (those withou…
ianroberts Apr 29, 2023
7416d37
Explicitly install esbuild so it properly handles the different platf…
ianroberts May 5, 2023
112e0f4
Added some component tests for "if" expressions
ianroberts May 5, 2023
a0fbf2a
New approach - rather than allowing functions by default and then try…
ianroberts May 5, 2023
3af5c5b
A couple of enhancements that occurred to me as I was writing documen…
ianroberts May 9, 2023
ae9079e
First cut of documentation for conditional widgets, more examples needed
ianroberts May 9, 2023
c2402f8
Allow AnnotationRendererPreview in the docs to have a list of documen…
ianroberts May 10, 2023
b937fba
Add option to have AnnotationRenderer display below the form any erro…
ianroberts May 10, 2023
4a4b113
Added an example of a conditional widget that uses document as well a…
ianroberts May 10, 2023
9b95730
Merge pull request #365 from GateNLP/version_args
davidwilby May 17, 2023
4679cc3
correct error in navbar property setting
May 17, 2023
c78be97
Merge pull request #375 from GateNLP/issue-374
twinkarma May 22, 2023
1e5d360
Merge pull request #350 from GateNLP/conditional-widgets
twinkarma May 22, 2023
5ee58ce
Merge pull request #368 from GateNLP/fronend-error-page
twinkarma May 25, 2023
7cf3717
Merge pull request #376 from GateNLP/navbar_warning
twinkarma May 25, 2023
d450543
Make backup and restore scripts work with compose v2
ianroberts Jul 28, 2023
672a84d
Include backup and restore scripts in install.tar.gz
ianroberts Jul 28, 2023
a219a83
Fix version of postgres-backup-local to match the database server ver…
ianroberts Jul 28, 2023
6b1e6ea
Updated postgres version number in docs
ianroberts Jul 28, 2023
c9c9683
Merge pull request #382 from GateNLP/compose-backups
ianroberts Jul 30, 2023
b1cbc4f
If get-teamware.sh is run in an existing installation directory that …
ianroberts Jul 30, 2023
74c44c6
We need generate-docker-env *not* to load the existing .env itself wh…
ianroberts Jul 30, 2023
e06b246
Better fingerprint for an existing install
ianroberts Aug 4, 2023
44098c8
Merge pull request #383 from GateNLP/upgrade-script
ianroberts Aug 4, 2023
8cad199
Put the training and test docs in the correct lists
ianroberts Aug 4, 2023
636b1ec
In test_annotations_per_doc_not_enforced_for_training_or_test, instea…
ianroberts Aug 4, 2023
262acbd
Added debug logging that builds a table showing exactly which annotat…
ianroberts Aug 4, 2023
04118ad
Attempt to spread documents more evenly across annotators
ianroberts Aug 4, 2023
a9c28d8
Merge pull request #384 from GateNLP/spread-docs-evenly
ianroberts Aug 4, 2023
baa6882
Add Django command for extracting annotations
twinkarma Oct 2, 2023
9692df0
Merge pull request #386 from GateNLP/manual-data-download
twinkarma Oct 2, 2023
2 changes: 1 addition & 1 deletion .github/workflows/create-release.yml
@@ -35,7 +35,7 @@ jobs:
       - name: Create release artifacts
         run: |
           sed "s/DEFAULT_IMAGE_TAG=latest/DEFAULT_IMAGE_TAG=${GITHUB_REF_NAME#v}/" install/get-teamware.sh > ./get-teamware.sh
-          tar cvzf install.tar.gz README.md docker-compose*.yml generate-docker-env.sh create-django-db.sh nginx custom-policies Caddyfile
+          tar cvzf install.tar.gz README.md docker-compose*.yml generate-docker-env.sh create-django-db.sh nginx custom-policies Caddyfile backup_manual.sh backup_restore.sh

       - name: Create release
         uses: softprops/action-gh-release@v1
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,28 @@

### Fixed

In versions from 0.2.0 to 2.1.0 inclusive, the default `docker-compose.yml` file fails to back up the database, due to a mismatch between the version of the database server and the version of the backup client. This is now fixed, but in order to create a proper database backup before attempting to upgrade, you will need to manually edit your `docker-compose.yml` file and change

```yaml
pgbackups:
image: prodrigestivill/postgres-backup-local:12
```

to

```yaml
pgbackups:
image: prodrigestivill/postgres-backup-local:14
```

(change the "12" to "14"), then run `docker compose up -d` (or `docker-compose up -d`) again to upgrade just the backup tool. Once the correct backup tool is running, you can start an immediate backup using

```
docker compose run --rm -it pgbackups /backup.sh
```

(or `docker-compose` if your version of Docker does not support compose v2).

## [2.1.0] 2023-05-03

### Added
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -26,7 +26,7 @@ identifiers:
 - description: The collection of archived snapshots of all versions of GATE Teamware
   type: doi
-  value: 10.5281/zenodo.7821718
+  value: 10.5281/zenodo.7899193
keywords:
- NLP
- machine learning
16 changes: 13 additions & 3 deletions README.md
@@ -2,13 +2,13 @@

![](/frontend/public/static/img/gate-teamware-logo.svg "GATE Teamware")

-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7821718.svg)](https://doi.org/10.5281/zenodo.7821718)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7899193.svg)](https://doi.org/10.5281/zenodo.7899193)

A web application for collaborative document annotation.

Full documentation can be [found here][docs].

-GATE teamware provides a flexible web app platform for managing classification of documents by human annotators.
+GATE Teamware provides a flexible web app platform for managing classification of documents by human annotators.

## Key Features
* Configure annotation options using a highly flexible JSON config.
@@ -37,6 +37,16 @@ bash ./get-teamware.sh

[A Helm chart](https://github.com/GateNLP/charts/tree/main/gate-teamware) is also available to allow deployment on Kubernetes.

### Upgrading

**When upgrading GATE Teamware it is strongly recommended to ensure you have a recent backup of your database before starting the upgrade procedure.** Database schema changes should be applied automatically as part of the upgrade, but unexpected errors may cause data corruption - **always** take a backup before starting any significant changes to your database, so that you can roll back in the event of failure.

Check the [changelog](CHANGELOG.md) - any breaking changes and special considerations for upgrades to particular versions will be documented there.

To upgrade a GATE Teamware installation that you installed using `get-teamware.sh`, simply download and run the latest version of the script in the same folder. It will detect your existing configuration and prompt you for any new settings that have been introduced in the new version. Note that any manual changes you have made to the `docker-compose.yml` and other files will not be duplicated automatically for the new version; you will have to port the necessary changes to the new files by hand.

Upgrading a Kubernetes deployment generally consists simply of installing the new chart version with `helm upgrade`. As above, check the GATE Teamware changelog and the [chart readme](https://github.com/GateNLP/charts/tree/main/gate-teamware) for any special considerations, new or changed configuration values, etc. and ensure you have a recent database backup before starting the upgrade process.

## Building locally
Follow these steps to run the app on your local machine using `docker-compose`:
1. Clone this repository by running `git clone https://github.com/GateNLP/gate-teamware.git` and move into the `gate-teamware` directory.
@@ -63,7 +73,7 @@ Teamware is developed by the [GATE](https://gate.ac.uk) team, an academic resear
## Citation
For published work that has used Teamware, please cite this repository. One way is to include a citation such as:

-> Karmakharm, T., Wilby, D., Roberts, I., & Bontcheva, K. (2022). GATE Teamware (Version 0.1.4) [Computer software]. https://github.com/GateNLP/gate-teamware
+> Karmakharm, T., Wilby, D., Roberts, I., & Bontcheva, K. (2022). GATE Teamware (Version 2.1.0) [Computer software]. https://github.com/GateNLP/gate-teamware

Please use the `Cite this repository` button at the top of the [project's GitHub repository](https://github.com/GATENLP/gate-teamware) to get an up-to-date citation.

58 changes: 58 additions & 0 deletions backend/management/commands/download_annotations.py
@@ -0,0 +1,58 @@
import json
from django.core.management.base import BaseCommand, CommandError
from django.template.loader import render_to_string
from backend.rpcserver import JSONRPCEndpoint
from backend.views import DownloadAnnotationsView
import argparse


class Command(BaseCommand):

    help = "Download annotation data"

    def add_arguments(self, parser):
        parser.add_argument("output_path", type=str, help="Path of the output file")
        parser.add_argument("project_id", type=str, help="ID of the project")
        parser.add_argument("doc_type", type=str, help="Document type: all, training, test or annotation")
        parser.add_argument("export_type", type=str, help="Type of export: json, jsonl or csv")
        parser.add_argument("anonymize", type=self.str2bool, help="Whether the data should be anonymized")
        parser.add_argument("-j", "--json_format", type=str, help="Type of JSON format: raw (default) or gate")
        parser.add_argument("-n", "--num_entries_per_file", type=int, help="Number of entries to generate per file, default 500")

    def handle(self, *args, **options):
        annotations_downloader = DownloadAnnotationsView()

        output_path = options["output_path"]
        project_id = options["project_id"]
        doc_type = options["doc_type"]
        export_type = options["export_type"]
        anonymize = options["anonymize"]
        json_format = options["json_format"] if options["json_format"] else "raw"
        num_entries_per_file = options["num_entries_per_file"] if options["num_entries_per_file"] else 500

        print(f"Writing annotations to {output_path}\n Project: {project_id}\n Document type: {doc_type}\n"
              f" Export type: {export_type}\n Anonymized: {anonymize}\n JSON format: {json_format}\n"
              f" Num entries per file: {num_entries_per_file}\n")

        with open(output_path, "wb") as z:
            annotations_downloader.write_zip_to_file(file_stream=z,
                                                     project_id=project_id,
                                                     doc_type=doc_type,
                                                     export_type=export_type,
                                                     json_format=json_format,
                                                     anonymize=anonymize,
                                                     documents_per_file=num_entries_per_file)

    def str2bool(self, v):
        if isinstance(v, bool):
            return v
        if v.lower() in ('yes', 'true', 't', 'y', '1'):
            return True
        elif v.lower() in ('no', 'false', 'f', 'n', '0'):
            return False
        else:
            raise argparse.ArgumentTypeError('Boolean value expected.')


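For context, a minimal sketch of driving the new command programmatically through Django's `call_command` API, assuming a configured Teamware environment (e.g. inside `python manage.py shell`); the project ID, file name and option values below are placeholders, not project documentation. Positional arguments pass through the same `argparse` definitions shown above, so the `anonymize` value is converted by `str2bool`:

```python
# Hedged usage sketch - placeholder project ID and paths.
# Equivalent CLI under the same assumptions:
#   python manage.py download_annotations annotations.zip 42 all json true -j raw -n 500
from django.core.management import call_command

call_command(
    "download_annotations",
    "annotations.zip",   # output_path: zip archive to create
    "42",                # project_id (placeholder)
    "all",               # doc_type: all, training, test or annotation
    "json",              # export_type: json, jsonl or csv
    "true",              # anonymize: parsed by str2bool
    json_format="raw",          # -j/--json_format: raw (default) or gate
    num_entries_per_file=500,   # -n/--num_entries_per_file
)
```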
30 changes: 26 additions & 4 deletions backend/models.py
@@ -64,6 +64,13 @@ class ServiceUser(AbstractUser):
     agreed_privacy_policy = models.BooleanField(default=False)
     is_deleted = models.BooleanField(default=False)

+    def lock_user(self):
+        """
+        Lock this user with a SELECT FOR UPDATE. This method must be called within a transaction;
+        the lock will be released when the transaction commits or rolls back.
+        """
+        return type(self).objects.filter(id=self.id).select_for_update().get()
+
     @property
     def has_active_project(self):
         return self.annotatorproject_set.filter(status=AnnotatorProject.ACTIVE).count() > 0
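A minimal sketch of the pattern `lock_user` enables (a hypothetical wrapper, not code from this PR): re-fetching the row under `SELECT ... FOR UPDATE` inside a transaction makes concurrent requests for the same user queue up instead of racing.

```python
from django.db import transaction

def get_task_exclusively(project, user):
    # Hypothetical helper: the row lock must be taken inside a transaction
    # and is released when the transaction commits or rolls back.
    with transaction.atomic():
        # Any concurrent transaction calling lock_user() on the same row
        # blocks here until this transaction finishes.
        user = user.lock_user()
        # Check-then-assign logic can now run without interleaving.
        return project.get_annotator_task(user)
```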
@@ -485,6 +492,9 @@ def reject_annotator(self, user, finished_time=timezone.now()):
             annotator_project.status = AnnotatorProject.COMPLETED
             annotator_project.rejected = True
             annotator_project.save()
+
+            Annotation.clear_all_pending_user_annotations(user)
+
         except ObjectDoesNotExist:
             raise Exception(f"User {user.username} is not an annotator of the project.")

@@ -589,6 +599,9 @@ def get_annotator_task(self, user):
         user from annotator list if there's no more tasks or user reached quota.
         """

+        # Lock required to prevent concurrent calls from assigning two different tasks
+        # to the same user
+        user = user.lock_user()
         annotation = self.get_current_annotator_task(user)
         if annotation:
             # User has existing task
@@ -623,7 +636,7 @@ def get_current_annotator_task(self, user):

         annotation = current_annotations.first()
         if annotation.document.project != self:
-            return RuntimeError(
+            raise RuntimeError(
                 "The annotation doesn't belong to this project! Annotator should only work on one project at a time")

         return annotation
@@ -724,9 +737,18 @@ def assign_annotator_task(self, user, doc_type=DocumentType.ANNOTATION):
         Annotation task performs an extra check for remaining annotation task (num_annotation_tasks_remaining),
         testing and training does not do this check as the annotator must annotate all documents.
         """
-        if (DocumentType.ANNOTATION and self.num_annotation_tasks_remaining > 0) or \
-                DocumentType.TEST or DocumentType.TRAINING:
-            for doc in self.documents.filter(doc_type=doc_type).order_by('?'):
+        if (doc_type == DocumentType.ANNOTATION and self.num_annotation_tasks_remaining > 0) or \
+                doc_type == DocumentType.TEST or doc_type == DocumentType.TRAINING:
+            if doc_type == DocumentType.TEST or doc_type == DocumentType.TRAINING:
+                queryset = self.documents.filter(doc_type=doc_type).order_by('?')
+            else:
+                # Prefer documents which have fewer complete or pending annotations, in order to
+                # spread the annotators as evenly as possible across the available documents
+                queryset = self.documents.filter(doc_type=doc_type).alias(
+                    occupied_annotations=Count("annotations", filter=Q(annotations__status=Annotation.COMPLETED)
+                                               | Q(annotations__status=Annotation.PENDING))
+                ).order_by('occupied_annotations', '?')
+            for doc in queryset:
                 # Check that annotator hasn't annotated and that
                 # doc hasn't been fully annotated
                 if doc.user_can_annotate_document(user):
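The new ordering amounts to "pick among the least-occupied documents, breaking ties at random". For intuition, a standalone sketch of that policy in plain Python (made-up document IDs and counts, not Teamware code):

```python
import random

def pick_document(occupancy):
    """Return a document id with the fewest occupied (completed or pending)
    annotations, tie-broken randomly - the in-memory analogue of
    order_by('occupied_annotations', '?') on the aliased queryset."""
    least = min(occupancy.values())
    candidates = [doc for doc, count in occupancy.items() if count == least]
    return random.choice(candidates)

# Documents 2 and 3 are untouched, so one of them is always chosen first:
print(pick_document({"doc1": 2, "doc2": 0, "doc3": 0}))
```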
20 changes: 20 additions & 0 deletions backend/tests/test_models.py
@@ -411,6 +411,26 @@ def test_reject_annotator(self):
         self.assertEqual(AnnotatorProject.COMPLETED, annotator_project.status)
         self.assertEqual(True, annotator_project.rejected)

+    def test_remove_annotator_clears_pending(self):
+        annotator = self.annotators[0]
+        # Start a task - should be one pending annotation
+        self.project.get_annotator_task(annotator)
+        self.assertEqual(1, annotator.annotations.filter(status=Annotation.PENDING).count())
+
+        # remove annotator from project - pending annotations should be cleared
+        self.project.remove_annotator(annotator)
+        self.assertEqual(0, annotator.annotations.filter(status=Annotation.PENDING).count())
+
+    def test_reject_annotator_clears_pending(self):
+        annotator = self.annotators[0]
+        # Start a task - should be one pending annotation
+        self.project.get_annotator_task(annotator)
+        self.assertEqual(1, annotator.annotations.filter(status=Annotation.PENDING).count())
+
+        # reject annotator from project - pending annotations should be cleared
+        self.project.reject_annotator(annotator)
+        self.assertEqual(0, annotator.annotations.filter(status=Annotation.PENDING).count())
+
     def test_num_documents(self):
         self.assertEqual(self.project.num_documents, self.num_docs)
27 changes: 23 additions & 4 deletions backend/tests/test_rpc_endpoints.py
@@ -8,6 +8,7 @@

 from django.utils import timezone
 import json
+import logging

from backend.models import Annotation, Document, DocumentType, Project, AnnotatorProject, UserDocumentFormatPreference
from backend.rpc import create_project, update_project, add_project_document, add_document_annotation, \
@@ -28,7 +29,7 @@
from backend.tests.test_rpc_server import TestEndpoint



+LOGGER = logging.getLogger(__name__)

class TestUserAuth(TestCase):

@@ -1379,7 +1380,7 @@ def setUp(self):
         self.num_training_docs = 5
         self.training_docs = []
         for i in range(self.num_training_docs):
-            self.docs.append(Document.objects.create(project=self.proj,
+            self.training_docs.append(Document.objects.create(project=self.proj,
                                                      doc_type=DocumentType.TRAINING,
                                                      data={
                                                          "text": f"Document {i}",
@@ -1396,7 +1397,7 @@ def setUp(self):
         self.num_test_docs = 10
         self.test_docs = []
         for i in range(self.num_test_docs):
-            self.docs.append(Document.objects.create(project=self.proj,
+            self.test_docs.append(Document.objects.create(project=self.proj,
                                                      doc_type=DocumentType.TEST,
                                                      data={
                                                          "text": f"Document {i}",
@@ -1609,10 +1610,11 @@ def test_annotations_per_doc_not_enforced_for_training_or_test(self):
         self.proj.save()

         docs_annotated_per_user = []
-        for (i, (ann_user, _)) in enumerate(self.annotators):
+        for (ann_user, _) in self.annotators:
             # Add to project
             self.assertTrue(add_project_annotator(self.manager_request, self.proj.id, ann_user.username))

+        for (i, (ann_user, _)) in enumerate(self.annotators):
             # Every annotator should be able to complete every training document, even though
             # max annotations per document is less than the total number of annotators
             self.assertEqual(self.num_training_docs,
@@ -1623,6 +1625,7 @@
             self.assertEqual(self.num_training_docs,
                              self.proj.get_annotator_document_score(ann_user, DocumentType.TRAINING))

+        for (i, (ann_user, _)) in enumerate(self.annotators):
             # Every annotator should be able to complete every test document, even though
             # max annotations per document is less than the total number of annotators
             self.assertEqual(self.num_test_docs,
@@ -1633,6 +1636,7 @@
             self.assertEqual(self.num_training_docs,
                              self.proj.get_annotator_document_score(ann_user, DocumentType.TRAINING))

+        for (i, (ann_user, _)) in enumerate(self.annotators):
             # Now attempt to complete task normally
             num_annotated = self.complete_annotations(self.num_docs, "Annotation", annotator=i)
             docs_annotated_per_user.append(num_annotated)
@@ -1662,15 +1666,30 @@ def complete_annotations(self, num_annotations_to_complete, expected_doc_type_st

         # Expect to get self.num_training_docs tasks
         num_completed_tasks = 0
+        if expected_doc_type_str == 'Annotation':
+            all_docs = self.docs
+        elif expected_doc_type_str == 'Training':
+            all_docs = self.training_docs
+        else:
+            all_docs = self.test_docs
+
+        annotated_docs = {doc.pk: ' ' for doc in all_docs}
         for i in range(num_annotations_to_complete):
             task_context = get_annotation_task(ann_req)
             if task_context:
                 self.assertEqual(expected_doc_type_str, task_context.get("document_type"),
                                  f"Document type does not match in task {task_context!r}, " +
                                  f"annotator {ann.username}, document {i}")
+                annotated_docs[task_context['document_id']] = "\u2714"
                 complete_annotation_task(ann_req, task_context["annotation_id"], {"sentiment": answer})
                 num_completed_tasks += 1

+        # Draw a nice markdown table of exactly which documents each annotator was given
+        if annotator == 0:
+            LOGGER.debug("Annotator | " + (" | ".join(str(i) for i in annotated_docs.keys())))
+            LOGGER.debug(" | ".join(["--"] * (len(annotated_docs) + 1)))
+        LOGGER.debug(ann.username + " | " + (" | ".join(str(v) for v in annotated_docs.values())))

         return num_completed_tasks

class TestAnnotationChange(TestEndpoint):
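For illustration, the three `LOGGER.debug` calls above build one markdown table across annotators, a header for the first annotator and then one row per call. A self-contained sketch of the same idiom, with made-up document IDs and a made-up username:

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("table-demo")

annotated = {101: "\u2714", 102: " ", 103: "\u2714"}  # made-up doc ids
log.debug("Annotator | " + " | ".join(str(k) for k in annotated))
log.debug(" | ".join(["--"] * (len(annotated) + 1)))
log.debug("annotator_0 | " + " | ".join(annotated.values()))
# Rendered as markdown, this gives:
# Annotator | 101 | 102 | 103
# -- | -- | -- | --
# annotator_0 | ✔ |   | ✔
```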