from __future__ import annotations
from typing import Iterator, List, Optional, Union, Iterable
"""
[STAM](https://github.com/annotation/stam) is a standalone data model for stand-off text
annotation and is described in detail [here](https://github.com/annotation/stam).
This is a Python library (to be more specific: a Python binding written in
Rust) to work with the model.
**What can you do with this library?**
* Keep, build and manipulate an efficient in-memory store of texts and annotations on texts
* Search in annotations, data and text, either programmatically or via the [STAM Query Language](https://github.com/annotation/stam/tree/master/extensions/stam-query).
* Search annotations by data, textual content, relations between text fragments (overlap, embedding, adjacency, etc),
* Search in text (incl. via regular expressions) and find annotations targeting found text selections.
* Search in data (set,key,value) and find annotations that use the data.
* Elementary text operations with regard for text offsets (splitting text on a delimiter, stripping text).
* Convert between different kinds of offsets (absolute, relative to other structures, UTF-8 bytes vs Unicode codepoints, etc)
* Read and write resources and annotations from/to STAM JSON, STAM CSV, or an optimised binary (CBOR) representation
* The underlying [STAM model](https://github.com/annotation/stam) aims to be clear and simple. It is flexible and
does not commit to any vocabulary or annotation paradigm other than stand-off annotation.
This STAM library is intended as a foundation upon which further applications
can be built that deal with stand-off annotations on text. We implement all
the low-level logic for dealing with this so you no longer have to and can focus on your
actual application.
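
As a rough illustration of the offset conversions listed above, the following plain-Python sketch shows how a Unicode codepoint offset relates to a UTF-8 byte offset, and how an offset relative to an enclosing selection maps to an absolute one. It does not use this library; the helper names `codepoint_to_byte_offset` and `relative_to_absolute` are made up for the example.

```python
def codepoint_to_byte_offset(text: str, charpos: int) -> int:
    # Number of UTF-8 bytes occupied by the first `charpos` codepoints.
    return len(text[:charpos].encode("utf-8"))


def relative_to_absolute(parent_begin: int, begin: int, end: int) -> tuple:
    # Shift an offset pair that is relative to a parent selection
    # into absolute coordinates of the underlying text.
    return (parent_begin + begin, parent_begin + end)


text = "Café au lait"
# "é" takes two bytes in UTF-8, so offsets after it differ between the two systems.
byte_begin = codepoint_to_byte_offset(text, 5)
# "au" sits at (0, 2) relative to a selection that starts at codepoint 5.
abs_begin, abs_end = relative_to_absolute(5, 0, 2)
print(byte_begin, text[abs_begin:abs_end])
```

Encoding the prefix on every call is linear in the text length; it is only meant to make the relation between the two coordinate systems explicit.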
"""
class AnnotationStore:
"""
An Annotation Store is a collection of annotations, resources and
annotation data sets. It can be seen as the *root* of the *graph model* and the glue
that holds everything together. It is the entry point for any STAM model.
"""
def __init__(self, id=None,file=None, string=None,config=None) -> None:
"""
To instantiate an AnnotationStore, at least one of `id`, `file` or `string` must be specified as keyword arguments:
Keyword Arguments
--------------------
id: Optional[str], default: None
The public ID for a *new* store. Only specify this if you want to create a new store, rather than load an existing one.
file: Optional[str], default: None
The STAM JSON, STAM CSV or STAM CBOR file to load, or if used in combination with `id`, the filename for the new store.
string: Optional[str], default: None
STAM JSON as a string
config: Optional[dict]
A python dictionary containing configuration parameters:
* use_include: Optional[bool], default: True
Use the `@include` mechanism to point to external files, if unset, all data will be kept in a single STAM JSON file.
* workdir: Optional[str]
Set the working directory, all relative filenames (also for `@include`) will be interpreted relative to this.
* debug: Optional[bool], default: False
Enable debug mode, outputs extra information to standard error output (verbose!)
* annotation_annotation_map: Optional[bool], default: True
Enable/disable index for annotations that reference other annotations
* resource_annotation_map: Optional[bool], default: True
Enable/disable reverse index for TextResource => Annotation. Holds only annotations that **directly** reference the TextResource (via a ResourceSelector), i.e. metadata
* dataset_annotation_map: Optional[bool], default: True
Enable/disable reverse index for AnnotationDataSet => Annotation. Holds only annotations that **directly** reference the AnnotationDataSet (via DataSetSelector), i.e. metadata
* key_annotation_metamap: Optional[bool], default: True
Enable/disable reverse index for DataKey => Annotation. Holds only annotations that **directly** reference the DataKey (via DataKeySelector), i.e. metadata
* data_annotation_metamap: Optional[bool], default: True
Enable/disable reverse index for AnnotationData => Annotation. Holds only annotations that **directly** reference the AnnotationData (via AnnotationDataSelector), i.e. metadata
* textrelationmap: Optional[bool], default: True
Enable/disable the reverse index for text, it maps TextResource => TextSelection => Annotation
* generate_ids: Optional[bool], default: False
Generate pseudo-random public identifiers when missing (during deserialisation). Each will consist of 21 URL-friendly ASCII symbols after a prefix of A for Annotations, S for DataSets, D for AnnotationData, R for resources
* strip_temp_ids: Optional[bool], default: True
Strip temporary IDs during deserialisation. Temporary IDs start with an exclamation mark, a capital ASCII letter denoting the type, and a number
* shrink_to_fit: Optional[bool], default: True
Shrink data structures to optimize memory (at the cost of longer deserialisation times)
* milestone_interval: Optional[int], default: 100
Milestone placement interval (in unicode codepoints) in indexing text resources. A low number above zero increases search performance at the cost of memory and increased initialisation time.
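
The trade-off behind `milestone_interval` can be sketched in plain Python. This is an illustration only, under the assumption that a milestone records the UTF-8 byte offset of every N-th codepoint; the function names `build_milestones` and `codepoint_to_byte` are hypothetical and are not part of this library.

```python
def build_milestones(text: str, interval: int) -> list:
    # Record the UTF-8 byte offset of every `interval`-th codepoint.
    milestones = []
    bytepos = 0
    for charpos, char in enumerate(text):
        if charpos > 0 and charpos % interval == 0:
            milestones.append(bytepos)
        bytepos += len(char.encode("utf-8"))
    return milestones


def codepoint_to_byte(data: bytes, milestones: list, interval: int, charpos: int) -> int:
    # Jump to the nearest milestone at or before `charpos`, then scan forward.
    m = min(charpos // interval, len(milestones))
    bytepos = milestones[m - 1] if m > 0 else 0
    remaining = charpos - m * interval
    while remaining > 0:
        bytepos += 1
        # Skip UTF-8 continuation bytes (bit pattern 10xxxxxx).
        while bytepos < len(data) and (data[bytepos] & 0xC0) == 0x80:
            bytepos += 1
        remaining -= 1
    return bytepos


text = "naïve café, déjà vu"
data = text.encode("utf-8")
milestones = build_milestones(text, interval=4)
print(codepoint_to_byte(data, milestones, 4, 10))
```

A smaller interval stores more milestones (more memory, longer initialisation) but shortens the forward scan for each lookup.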
Example
---------
Load a store from file::
store = AnnotationStore(file="hamlet.store.json")
Instantiate a store from scratch and populate it with a resource and annotation::
store = AnnotationStore(id="test")
resource = store.add_resource(id="testres", text="Hello world")
store.annotate(id="A1",
target=Selector.textselector(resource, Offset.simple(6,11)),
data={ "id": "D1", "key": "pos", "value": "noun", "set": "testdataset"})
"""
def id(self) -> Optional[str]:
"""Returns the public identifier (by value, aka a copy)"""
def to_file(self, filename: str) -> None:
"""Saves the annotation store to file. Use either .json or .csv as extension."""
def from_file(self, filename: str) -> None:
"""Load another annotation store (only in STAM JSON format currently) into the current one. This can be done multiple times and effectively merges annotation stores."""
def save(self) -> None:
"""Saves the annotation store to the same file it was loaded from or last saved to."""
def to_json_string(self) -> str:
"""Returns the annotation store as one big STAM JSON string"""
def dataset(self, id: str) -> AnnotationDataSet:
"""Basic retrieval method that returns an :class:`AnnotationDataSet` by ID. Raises an exception if not found."""
def annotation(self, id: str) -> Annotation:
"""Basic retrieval method that returns an :class:`Annotation` by ID. Raises an exception if not found."""
def resource(self, id: str) -> TextResource:
"""Basic retrieval method that returns a :class:`TextResource` by ID. Raises an exception if not found."""
def key(self, set_id: str, key_id: str) -> DataKey:
"""Shortcut retrieval method that returns a :class:`DataKey` by ID. Raises an exception if not found."""
def annotationdata(self, set_id: str, data_id: str) -> AnnotationData:
"""Shortcut retrieval method that returns an :class:`AnnotationData` by ID"""
def add_resource(self, filename: Optional[str] = None, text: Optional[str] = None, id: Optional[str] = None) -> TextResource:
"""Create a new :class:`TextResource` and add it to the store. Returns the added instance.
If you want to store the resource as a stand-off text file, you can specify a filename. Make sure to set `use_include = True` in the Annotation Store's configuration then.
Note that any relative paths will be interpreted relative to the directory the current (root) store is in.
"""
def add_dataset(self, id: Optional[str] = None, filename: Optional[str] = None) -> AnnotationDataSet:
"""Create a new :class:`AnnotationDataSet` and add it to the store. Returns the added instance.
If you want to store the dataset as a stand-off JSON file, you can specify a filename. The dataset will be loaded from file if it exists. Make sure to set `use_include = True` in the Annotation Store's configuration then.
Note that any relative paths will be interpreted relative to the directory the current (root) store is in.
"""
def add_substore(self, filename: str) -> AnnotationSubStore:
"""
Load an existing annotation store as a dependency to this one. It will be stored in a stand-off JSON file and included using the @include mechanism.
Note that any relative paths will be interpreted relative to the directory the current (root) store is in.
Returns the added substore.
"""
def add_new_substore(self, id: str, filename: str) -> AnnotationSubStore:
"""
Add a new empty annotation store as a dependency to this one.
It will be stored in a stand-off JSON file and included using the @include mechanism.
Note that any relative paths will be interpreted relative to the directory the current (root) store is in.
Returns the added substore.
"""
def set_filename(self, filename: str) -> None:
"""Set the filename for the annotation store. The format is derived from the extension, which can be `.json` or `.csv`. This may also be a full absolute or relative path."""
def annotate(self, target: Selector, data: Union[dict,List[dict],AnnotationData,List[AnnotationData]], id: Optional[str] = None) -> Annotation:
"""Adds a new annotation. Returns the :obj:`Annotation` instance that was just created.
Parameters
-------------
target: :class:`Selector`
A target selector that determines the object of annotation
data: Union[dict,List[dict],AnnotationData,List[AnnotationData]]
A dictionary or list of dictionaries with data to set. The dictionary
may have fields: `id` (optional),`key`,`set`, and `value`.
Alternatively, you can pass an existing :class:`AnnotationData` instance.
id: Optional[str]
The public ID for the annotation. If unset, one may be autogenerated if this was
explicitly enabled in the configuration.
Example
-----------
Add an annotation on a resource that is already in the store::
store.annotate(id="A1",
target=Selector.textselector(store.resource("testres"), Offset.simple(6,11)),
data={ "id": "D1", "key": "pos", "value": "noun", "set": "testdataset"})
"""
def __iter__(self) -> Iterator[Annotation]:
"""Returns an iterator over all annotations (:class:`Annotation`) in this store.
This iterator has little runtime overhead but does not provide any filtering options, use :meth:`annotations` instead if you plan to do any filtering,
or use the equally named method on other objects for more constrained and filterable annotations (e.g. :meth:`DataKey.annotations`, :meth:`AnnotationDataSet.annotations`, :meth:`TextResource.annotations`)
"""
def annotations(self, *args, **kwargs) -> Annotations:
"""Returns an iterator over all annotations (:class:`Annotation`) in this store.
Filtering can be applied using positional arguments and/or keyword arguments. It is recommended to only use this method if you apply further filtering, otherwise the memory overhead may be very large if you have many annotations.
Otherwise you can fall back to a more low-level iterator, :meth:`__iter__` instead
Parameters
--------------
*args: tuple, optional
Filter arguments. These can be any of the following types:
* :class:`DataKey`
Returns annotations with data matching this key.
* :class:`AnnotationData`
Returns only annotations that have this exact data.
* :class:`Annotations` | [:class:`Annotation`]
Returns only annotations that match any of those specified here.
* :class:`Data` | [:class:`AnnotationData`]
Returns only annotations with data matching any of those specified here.
* :class:`dict` with keys:
* **set** - An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string (see below)
* **key** - A key, either an instance of :class:`DataKey` or a string, in the latter case you need to specify `set` as well.
* **value** - (see keyword arguments below)
**kwargs: dict, optional
* limit: (Optional[int] = None)
The maximum number of results to return (default: unlimited)
* substore: (Optional[bool] = None)
Set this to False if you want to include only results from the root store and not from any substores (default: True)
* set: (Optional[Union[str,AnnotationDataSet]] = None)
An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string
* key: (Optional[Union[str,DataKey]] = None)
An ID of a key (or a :class:`DataKey` instance), make sure to specify `set` as well if you use a string value for this parameter.
* value: (Optional[Union[str,int,float,bool]])
Constrain the search to annotations with data of a certain value. This can only be used when you also pass a :class:`DataKey` as filter.
This holds the exact value to search for, there are other variants of this keyword available, see :meth:`data` for a full list.
"""
def datasets(self) -> Iterator[AnnotationDataSet]:
"""Returns an iterator over all annotation data sets (:class:`AnnotationDataSet`) in this store"""
def resources(self) -> Iterator[TextResource]:
"""Returns an iterator over all text resources (:class:`TextResource`) in this store"""
def substores(self) -> Iterator[AnnotationSubStore]:
"""Returns an iterator over all substores (:class:`AnnotationSubStore`) in this store, i.e. stores that are included by this one as dependencies"""
def annotations_len(self) -> int:
"""Returns the number of annotations in the store (not subtracting deletions)"""
def datasets_len(self) -> int:
"""Returns the number of annotation data sets in the store (not subtracting deletions)"""
def resources_len(self) -> int:
"""Returns the number of text resources in the store (not subtracting deletions)"""
def substores_len(self) -> int:
"""Returns the number of substores in the store"""
def shrink_to_fit(self):
"""Reallocates internal data structures to tight fits to conserve memory space (if necessary). You can use this after having added lots of annotations to possibly reduce the memory consumption."""
def data(self, *args, **kwargs) -> Data:
"""Returns an iterator over all data (:class:`AnnotationData`) in this store.
Filtering can be applied using positional arguments and/or keyword arguments. It is recommended to only use this method if you apply further filtering, otherwise the memory overhead may be very large if you have a lot of data.
Parameters
-------------
*args: tuple, optional
Filter arguments, these can be of the following types:
* :class:`DataKey`
Returns data matching this key
* :class:`Annotation`
Returns data referenced by the mentioned annotation
* :class:`AnnotationData`
Returns only this exact data. Not very useful, use :meth:`test_data` instead.
* :class:`Annotations` | [:class:`Annotation`]
Returns data referenced by annotations in the provided collection.
* :class:`Data` | [:class:`AnnotationData`]
Returns only data that is in the provided :obj:`Data` collection (intersection)
* :class:`dict` with keys:
* **set** - An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string (see below)
* **key** - A key, either an instance of :class:`DataKey` or a string, in the latter case you need to specify `set` as well.
* **value** or variants (see keyword arguments below)
**kwargs: dict, optional
* limit: `Optional[int] = None`
The maximum number of results to return (default: unlimited)
* set: `Optional[Union[str,AnnotationDataSet]] = None`
An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string
* key: `Optional[Union[str,DataKey]] = None`
An ID of a key (or a :class:`DataKey` instance), make sure to specify `set` as well if you use a string value for this parameter.
* value: `Optional[Union[str,int,float,bool,List[Union[str,int,float,bool]]]]`
Search for data matching a specific value.
This holds the exact value to search for. Further variants of this keyword are listed below:
* value_not: `Optional[Union[str,int,float,bool]]`
Value must not match
* value_greater: `Optional[Union[int,float]]`
Value must be greater than specified (int or float)
* value_less: `Optional[Union[int,float]]`
Value must be less than specified (int or float)
* value_greatereq: `Optional[Union[int,float]]`
Value must be greater than specified or equal (int or float)
* value_lesseq: `Optional[Union[int,float]]`
Value must be less than specified or equal (int or float)
* value_in: `Optional[Tuple[Union[str,int,float,bool]]]`
Value must match any in the tuple (this is a logical OR statement)
* value_not_in: `Optional[Tuple[Union[str,int,float,bool]]]`
Value must not match any in the tuple
* value_in_range: `Optional[Tuple[Union[int,float]]]`
Must be a numeric 2-tuple with min and max (inclusive) values
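
The semantics of the `value*` keywords listed above can be mimicked in plain Python as follows. This is a sketch of the intended meaning only, not how the library evaluates filters; the function `value_matches` is hypothetical, and it assumes that several constraints given together must all hold.

```python
def value_matches(candidate, **kwargs) -> bool:
    # One predicate per documented keyword; `value_in_range` is inclusive on both ends.
    checks = {
        "value": lambda v, arg: v == arg,
        "value_not": lambda v, arg: v != arg,
        "value_greater": lambda v, arg: v > arg,
        "value_less": lambda v, arg: v < arg,
        "value_greatereq": lambda v, arg: v >= arg,
        "value_lesseq": lambda v, arg: v <= arg,
        "value_in": lambda v, arg: v in arg,
        "value_not_in": lambda v, arg: v not in arg,
        "value_in_range": lambda v, arg: arg[0] <= v <= arg[1],
    }
    for name, arg in kwargs.items():
        if name not in checks:
            raise TypeError(f"unknown filter keyword: {name}")
        # Assumption: all supplied constraints must hold (logical AND).
        if not checks[name](candidate, arg):
            return False
    return True


values = [3, 7, 10, 42]
in_range = [v for v in values if value_matches(v, value_in_range=(7, 10))]
not_in = [v for v in values if value_matches(v, value_not_in=(3, 42))]
print(in_range, not_in)
```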
"""
def query(self, query: str, **kwargs) -> list:
"""
Query the data using STAMQL.
Parameters
--------------
query: str
Query in `STAMQL <https://github.com/annotation/stam/tree/master/extensions/stam-query>`_.
Note that you *MUST* specify a variable to bind to in your `SELECT`
statement (this is normally optional but is required for calling from
Python).
**kwargs: dict, optional
You can bind extra context variables using keyword arguments. The keys
correspond to the variable names that these will be bound to and which
you can subsequently use in the STAMQL query. These keys
should not carry the '?' prefix you may be accustomed to in STAMQL. The
value must be instances of STAM objects such as :class:`Annotation`,
:class:`AnnotationData`, :class:`DataKey`, :class:`TextSelection` etc. These context variables
are available to the query but not propagated to the output.
Keyword arguments
-------------------
readonly: Optional[bool]
If set to `True`, queries that would mutate the store are rejected (raise an Exception).
In other words, only `SELECT` statements are allowed then.
A query returns a list consisting of dictionaries, each corresponding to one
result row. The keys in the dictionaries match with the variable names
in the STAMQL query, the values are result instances of whatever type
the query returns, i.e. Annotation, AnnotationData, TextResource,
TextSelection, AnnotationDataSet.
Examples
--------------
Query for annotations with certain kind of data::
for row in store.query('SELECT ANNOTATION ?a WHERE "some-set" "pos" = "noun";'):
    # each row is a dictionary keyed by variable name;
    # just print out the text of the annotation
    print(str(row['a']))
"""
def remove(self, item: Union[Annotation,AnnotationDataSet,TextResource, AnnotationData,DataKey], **kwargs):
"""
Remove any STAM item from the store.
Keyword arguments
-------------------
strict: Optional[bool]
In strict mode, any annotation that uses this item (where item is `AnnotationData` or `DataKey`) will be removed entirely, otherwise the annotation will be modified to remove the reference only.
"""
def view(self, selectionquery, *args, **kwargs) -> str:
"""
Execute a selection query and zero or more highlight queries, and visualise the result as HTML or ANSI-colored text.
The results are returned as a string.
Arguments
---------------
selectionquery: str
The main selection query in STAMQL. This selects what will be shown.
*args: List[str]
Each positional argument is a highlight query in STAMQL. It determines what portions of the results will be highlighted and how (attributes are supported)
Keyword arguments
-----------------------
format: Optional[str]
The format to use, can be either `html` (default) or `ansi`.
legend: Optional[bool]
Show legend or not?
titles: Optional[bool]
Show titles or not (per result)
use: Optional[str]
The variable to use for the main selection (if not set, the first will be used)
interactive: Optional[bool]
Output is slightly interactive (html only, insert some minimal javascript)
autocollapse: Optional[bool]
Collapse all tags on initial view (html only)
"""
def split(self, queries: List[str], retain: bool):
"""
Splits an annotation store by either retaining (if `retain == True`) or removing (if `retain == False`) the items selected by the queries.
Queries must be STAMQL queries that select annotations, resources or datasets.
This deletes items from the store along with all their dependencies and comes with a reasonable performance overhead.
"""
def align_texts(self, *args: list[tuple[TextSelection,TextSelection]], **kwargs) -> list[list[Annotation]]:
"""
Used to compute an alignment between two texts; it
identifies which parts of the two texts are identical and computes a mapping
between the two coordinate systems. Two related sequence alignment algorithms
from bioinformatics are implemented to accomplish this:
Smith-Waterman and Needleman-Wunsch.
The resulting alignment is added to the store as an annotation, a so-called transposition,
according to the `STAM Transpose <https://github.com/annotation/stam/tree/master/extensions/stam-transpose>`_
extension. These annotations are also returned by this function.
Alignments between text selection pairs will be computed in parallel, it may be memory intensive.
For the simpler sequential variant, use :meth:`TextSelection.align_texts()` instead.
Positional Arguments
-------------------
Each argument is a two-tuple containing two text selections (:class:`TextSelection`) to align.
Keyword Arguments
-------------------
case_insensitive: bool
Case-insensitive matching has more performance overhead
algorithm: str
The alignment algorithm to use, can be `smithwaterman`/`local` (local alignment) or `needlemanwunsch`/`global` (global alignment).
grow: bool
Grow aligned parts into larger alignments by incorporating non-matching parts. If you set this,
the function will return translations rather than transpositions.
You'll want to set `max_errors` in combination with this one to prevent very low-quality alignments.
max_errors: Union[int,float]
The maximum number of errors (max edit distance) that may occur for a transposition to be valid.
This is either an absolute integer or a relative ratio between 0.0 and 1.0, interpreted in relation to the length of the first text in the alignment.
In other words; this represents the number of characters in the search string that may be missed when matching in the larger text.
The transposition itself will only consist of fully matching parts, use `grow` if you want to include non-matching parts.
minimal_align_length: int
The minimal number of characters that must be aligned (absolute number) for a transposition/translation to be valid.
annotation_id_prefix: str
Prefix to use when assigning annotation IDs. The actual ID will have a random component
trim: bool
Strip leading and trailing whitespace/newlines from aligned text selections, keeping them as minimal as possible (default is to be as greedy as possible in selecting)
Setting this may lead to certain whitespaces not being covered even though they may align.
simple_only: bool
Only allow for alignments that consist of one contiguous text selection on either side. This is a so-called simple transposition.
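
The idea of finding identical parts and mapping between two coordinate systems can be illustrated with Python's standard library. Note that this sketch uses `difflib.SequenceMatcher`, which is a different technique from the Smith-Waterman and Needleman-Wunsch algorithms mentioned above, and it does not call this library; `identical_parts` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def identical_parts(a: str, b: str) -> list:
    # Pair up (begin, end) offsets in `a` with the offsets of the same text in `b`.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        if block.size > 0:  # the final block is a zero-length sentinel
            pairs.append(
                ((block.a, block.a + block.size), (block.b, block.b + block.size))
            )
    return pairs


a = "Hello world"
b = "Hello, world"
for (a_begin, a_end), (b_begin, b_end) in identical_parts(a, b):
    print(repr(a[a_begin:a_end]), "at", (a_begin, a_end), "<->", (b_begin, b_end))
```

Each pair of offset ranges plays the role that a pair of text selections plays in a transposition: the same text, addressed in two different coordinate systems.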
"""
# def find_data(self, **kwargs) -> Data:
# """
# Find annotation data by set, key and value.
# Returns :class:`Data`, which holds a collection of :class:`AnnotationData` instances.
# Keyword arguments
# -------------------
# set: Optional[Union[str,AnnotationDataSet]]
# The set to search for; it will search all sets if not specified
# key: Optional[Union[str,DataKey]]
# The key to search for; it will search all keys if not specified. If you specify a key, you must also specify a set!
# value: Optional[Union[str,int,float,bool]]
# The exact value to search for, if this or any of its variants mentioned below is omitted, it will search all values.
# value_not: Optional[Union[str,int,float,bool]]
# Value must not match
# value_greater: Optional[Union[int,float]]
# Value must be greater than specified (int or float)
# value_less: Optional[Union[int,float]]
# Value must be less than specified (int or float)
# value_greatereq: Optional[Union[int,float]]
# Value must be greater than specified or equal (int or float)
# value_lesseq: Optional[Union[int,float]]
# Value must be less than specified or equal (int or float)
# value_in: Optional[Tuple[Union[str,int,float,bool]]]
# Value must match any in the tuple (this is a logical OR statement)
# value_not_in: Optional[Tuple[Union[str,int,float,bool]]]
# Value must not match any in the tuple
# value_in_range: Optional[Tuple[Union[int,float]]]
# Must be a numeric 2-tuple with min and max (inclusive) values
# Examples
# -------------
# Query for specific annotation data::
# for annotationdata in store.find_data(set="some-set", key="structuretype", value="word"):
# # only returns one
# ...
# Query for all data for a key::
# for annotationdata in store.find_data(set="some-set", key="structuretype"):
# ...
# Note, the latter can be accomplished more efficiently as::
# for annotationdata in store.dataset("some-set").key("structuretype").data():
# ...
# `find_data` should be considered as a convenience/shortcut method.
# """
class Annotation:
"""
`Annotation` represents a particular *instance of annotation* and is the central
concept of the model. Annotations can be considered the primary nodes of the graph model. The
instance of annotation is strictly decoupled from the *data* or key/value of the
annotation (:class:`AnnotationData`). After all, multiple instances can be annotated
with the same label (multiple annotations may share the same annotation data).
Moreover, an `Annotation` can have multiple annotation data associated.
The result is that multiple annotations with the exact same content require less storage
space, and searching and indexing is facilitated.
This structure is not instantiated directly, only returned. Use :meth:`AnnotationStore.annotate()` to instantiate a new Annotation.
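
The decoupling of annotations from their data can be pictured with a small plain-Python sketch. It is a conceptual illustration under simplified assumptions (dictionaries instead of the library's own types); the class `DataPool` is hypothetical and not part of this library.

```python
class DataPool:
    # Stores each distinct (set, key, value) triple exactly once.
    def __init__(self):
        self._items = {}

    def get_or_add(self, dataset: str, key: str, value) -> dict:
        triple = (dataset, key, value)
        if triple not in self._items:
            self._items[triple] = {"set": dataset, "key": key, "value": value}
        return self._items[triple]


pool = DataPool()
text = "the cat saw the dog"
# Two annotations on different text offsets that carry the same label.
annotations = [
    {"target": (4, 7), "data": [pool.get_or_add("testdataset", "pos", "noun")]},
    {"target": (16, 19), "data": [pool.get_or_add("testdataset", "pos", "noun")]},
]
# Both annotations reference one and the same data object.
shared = annotations[0]["data"][0] is annotations[1]["data"][0]
targets = [text[begin:end] for begin, end in (ann["target"] for ann in annotations)]
print(shared, targets)
```

Because the label is stored once and referenced twice, looking up all annotations that carry a given piece of data amounts to following references rather than comparing values.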
"""
def id(self) -> Optional[str]:
"""Returns the public ID (by value, aka a copy)
Don't use this for extensive ID comparisons, use :meth:`has_id` instead as it is more performant (no copy)."""
def has_id(self, id: str) -> bool:
"""Tests the ID"""
def __iter__(self) -> Iterator[AnnotationData]:
"""Returns an iterator over all data (:class:`AnnotationData`) in this annotation; this has little overhead but is less suitable if you want to do further filtering, use :meth:`data` instead for that."""
def __len__(self) -> int:
"""Returns the number of data items (:class:`AnnotationData`) in this annotation"""
def select(self) -> Selector:
"""Returns a selector pointing to this annotation"""
def text(self) -> List[str]:
"""Returns the text of the annotation.
Note that this will always return a list (even if it only contains a single element),
as an annotation may reference multiple texts.
If you are sure an annotation only references a single contiguous text slice or are okay with slices being concatenated, then you can use the `str()` function instead."""
def __str__(self) -> str:
"""
Returns the text of the annotation.
If the annotation references multiple text slices, they will be concatenated with a space as a delimiter,
but note that in reality the different parts may be non-contiguous!
Use `text()` instead to retrieve a list of texts
"""
def textselections(self, **kwargs) -> TextSelections:
"""
Returns a collection of all textselections (:class:`TextSelection`) referenced by the annotation (i.e. via a *TextSelector*).
Note that this will always return a collection (even if it only contains a single element),
as an annotation may reference multiple text selections.
Text selections will be returned in textual order, except if a DirectionalSelector was used.
Text selections may be filtered using the following positional and/or keyword arguments:
Parameters
-------------------
*args: tuple, optional
Filter arguments, can be of the following types:
* :class:`DataKey`
Returns text selections referenced by annotations with data matching this key
* :class:`AnnotationData`
Returns text selections referenced by annotations that have this exact data
* :class:`Annotations` | [:class:`Annotation`]
Returns text selections referenced by any annotations that are already in the provided :obj:`Annotations` collection (intersection)
* :class:`Data` | [:class:`AnnotationData`]
Returns only textselections referenced by annotations with data that is in the provided collection.
* :class:`dict` with keys:
* **set** - An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string (see below)
* **key** - A key, either an instance of :class:`DataKey` or a string; in the latter case you need to specify `set` as well.
* **value** (see keyword arguments below)
**kwargs: dict, optional
limit: Optional[int] = None
The maximum number of results to return (default: unlimited)
value: Optional[Union[str,int,float,bool]]
Constrain the search to text selections referenced by annotations with data of a certain value. This is usually used together with passing a :obj:`DataKey` as filter in the positional arguments.
This holds the exact value to search for; other variants of this keyword are available, see :meth:`data` for a full list.
"""
def annotations_in_targets(self, *args, **kwargs) -> Annotations:
"""
Returns annotations (:class:`Annotations` containing
:class:`Annotation` instances) this annotation refers to (i.e. using an
*AnnotationSelector*)
The annotations can be filtered using positional and/or keyword
arguments; see :meth:`annotations` for full documentation. One extra keyword argument is
available for this method (see below).
Annotations will be returned in textual order unless recursive is set
or a DirectionalSelector is involved.
Keyword Arguments
-------------------
recursive: bool
Follow AnnotationSelectors recursively (default False)
"""
def annotations(self, *args, **kwargs) -> Annotations:
"""
Returns annotations (:class:`Annotations` containing
:class:`Annotation` instances) that are referring to this annotation (i.e. others
using an AnnotationSelector).
The annotations can be filtered using positional and/or keyword
arguments.
Parameters
-----------
*args: tuple, optional
These arguments can be any of the following types:
* :class:`DataKey`
Returns annotations with data matching this key.
* :class:`AnnotationData`
Returns only annotations that have this exact data.
* :class:`Annotations` | :class:`Annotation`
Returns only annotations that match any of those specified here.
* :class:`Data` | :class:`AnnotationData`
Returns only annotations with data matching any of those specified here.
* :class:`dict` with keys:
* **set** - An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string (see below)
* **key** - A key, either an instance of :class:`DataKey` or a string; in the latter case you need to specify `set` as well.
* **value** - (see keyword arguments below)
**kwargs: dict, optional
* limit: (Optional[int] = None)
The maximum number of results to return (default: unlimited)
* set: (Optional[Union[str,AnnotationDataSet]] = None)
An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string
* key: (Optional[Union[str,DataKey]] = None)
An ID of a key (or a :class:`DataKey` instance); make sure to specify `set` as well if you use a string value for this parameter.
* value: (Optional[Union[str,int,float,bool]])
Constrain the search to annotations with data of a certain value. This can only be used when you also pass a :class:`DataKey` as filter.
This holds the exact value to search for; other variants of this keyword are available, see :meth:`data` for a full list.
Example
---------
Filter by data key and value::
key = store.dataset("linguistic-set").key("part-of-speech")
for annotation in store.annotations(key, value="noun"):
...
But if you already have the key, like in the example above, you may just as well do (more efficient)::
for annotation in key.annotations(value="noun"):
...
"""
def test_annotations(self, *args, **kwargs) -> bool:
"""
Tests whether there are annotations (:class:`Annotations` containing :class:`Annotation`) that are referring to this annotation (i.e. others using an AnnotationSelector).
This method is like :meth:`annotations`, but only tests and does not return the annotations, as such it is more performant.
The annotations can be filtered using keyword arguments. See :meth:`Annotation.annotations`.
Example
---------
Test by data key and value::
key = store.dataset("linguistic-set").key("part-of-speech")
if annotation.test_annotations(key, value="noun"):
...
"""
def resources(self, limit: Optional[int] = None) -> List[TextResource]:
"""Returns a list of resources (:class:`TextResource`) this annotation refers to
Parameters
------------
`limit`: `Optional[int] = None`
The maximum number of results to return (default: unlimited)
"""
def datasets(self, limit: Optional[int] = None) -> List[AnnotationDataSet]:
"""Returns a list of annotation data sets (:class:`AnnotationDataSet`) this annotation refers to. This only returns the ones
referred to via a *DataSetSelector*, i.e. as metadata.
Parameters
------------
`limit`: `Optional[int] = None`
The maximum number of results to return (default: unlimited)
"""
def offset(self) -> Optional[Offset]:
"""Returns the offset this annotation's selector targets, exactly as specified"""
def target(self) -> Selector:
"""Returns the target selector (:class:`Selector`) for this annotation. This is mainly useful if you want to add another annotation pointing to the same target."""
def selector_kind(self) -> SelectorKind:
"""Returns the type of the selector of this annotation"""
def data(self, *args, **kwargs) -> Data:
"""
Returns annotation data (:class:`Data` containing :class:`AnnotationData`) used by this annotation.
The data can be filtered using positional and/or keyword arguments. If you don't care for any filtering and just want a simple iterator over
the data, then iterating over the annotation directly (:meth:`__iter__`) will be more efficient. Do note that implementing
any filtering yourself in Python is much less performant than letting this data method do it for you.
Parameters
-------------
*args: tuple, optional
Filter arguments, these can be of the following types:
* :class:`DataKey`
Returns data matching this key
* :class:`Annotation`
Returns data referenced by the mentioned annotation
* :class:`AnnotationData`
Returns only this exact data. Not very useful, use :meth:`test_data` instead.
* :class:`Annotations` | [:class:`Annotation`]
Returns data referenced by annotations in the provided collection.
* :class:`Data` | [:class:`AnnotationData`]
Returns only data that is in the provided :obj:`Data` collection (intersection)
* :class:`dict` with keys:
* **set** - An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string (see below)
* **key** - A key, either an instance of :class:`DataKey` or a string; in the latter case you need to specify `set` as well.
* **value** or variants (see keyword arguments below)
**kwargs: dict, optional
* limit: `Optional[int] = None`
The maximum number of results to return (default: unlimited)
* set: `Optional[Union[str,AnnotationDataSet]] = None`
An ID of a dataset (or an :class:`AnnotationDataSet` instance), only needed when specifying `key` as a string
* key: `Optional[Union[str,DataKey]] = None`
An ID of a key (or a :class:`DataKey` instance); make sure to specify `set` as well if you use a string value for this parameter.
* value: `Optional[Union[str,int,float,bool,List[Union[str,int,float,bool]]]]`
Search for data matching a specific value.
This holds the exact value to search for. Further variants of this keyword are listed below:
* value_not: `Optional[Union[str,int,float,bool]]`
Value must not match
* value_greater: `Optional[Union[int,float]]`
Value must be greater than specified (int or float)
* value_less: `Optional[Union[int,float]]`
Value must be less than specified (int or float)
* value_greatereq: `Optional[Union[int,float]]`
Value must be greater than or equal to the specified value (int or float)
* value_lesseq: `Optional[Union[int,float]]`
Value must be less than or equal to the specified value (int or float)
* value_in: `Optional[Tuple[Union[str,int,float,bool]]]`
Value must match any in the tuple (this is a logical OR statement)
* value_not_in: `Optional[Tuple[Union[str,int,float,bool]]]`
Value must not match any in the tuple
* value_in_range: `Optional[Tuple[Union[int,float]]]`
Must be a numeric 2-tuple with min and max (inclusive) values
Example
-----------
Get all part-of-speech data pertaining to this annotation::
key = store.dataset("linguistic-set").key("part-of-speech")
for data in annotation.data(filter=key):
...
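The meaning of the `value` keyword variants listed above can be summarised in a small pure-Python model. This is only an illustrative sketch for reading the keyword list, not the library's implementation: the `matches` helper is hypothetical, list-valued `value` is not modelled, and combining several constraints with a logical AND is an assumption. Real filtering should be left to this method, which is far more performant.

```python
from typing import Any


def matches(value: Any, **constraints: Any) -> bool:
    """Hypothetical model of the value_* keywords (not the library's code)."""
    checks = {
        "value": lambda v, c: v == c,
        "value_not": lambda v, c: v != c,
        "value_greater": lambda v, c: v > c,
        "value_less": lambda v, c: v < c,
        "value_greatereq": lambda v, c: v >= c,
        "value_lesseq": lambda v, c: v <= c,
        # membership in a tuple acts as a logical OR over its items
        "value_in": lambda v, c: v in c,
        "value_not_in": lambda v, c: v not in c,
        # numeric 2-tuple (min, max), both bounds inclusive
        "value_in_range": lambda v, c: c[0] <= v <= c[1],
    }
    # Assumption: multiple constraints must all hold.
    return all(checks[name](value, c) for name, c in constraints.items())


print(matches(5, value_in_range=(1, 5)))
print(matches("noun", value_in=("noun", "verb")))
```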
"""
def test_data(self, *args, **kwargs) -> bool:
"""
Tests whether certain annotation data is used by this annotation.
The data can be filtered using positional and/or keyword arguments. See :meth:`data`.
Unlike :meth:`data`, this method merely tests without returning the data, and as such is more performant.
"""
def related_text(self, operator: TextSelectionOperator, *args, **kwargs) -> TextSelections:
"""
Applies a :class:`TextSelectionOperator` to find all other
text selections that are in a specific relation with the ones from the current annotation.
Returns a collection :class:`TextSelections` containing all matching :class:`TextSelection` instances.
Text selections will be returned in textual order. They may be filtered via positional and/or keyword arguments. See :meth:`Annotation.textselections`.
If you are interested in the annotations associated with the found text selections, then
add `.annotations()` to the result.
Parameters
------------
`operator`: :class:`TextSelectionOperator`
The operator to apply when comparing text selections
Keyword Arguments
-------------------
`limit`: `Optional[int] = None`
The maximum number of results to return (default: unlimited)
See :meth:`Annotation.textselections` for further keyword arguments to filter.
Examples
-------------
Find all text selections that overlap with the annotation::
for textselection in annotation.related_text(TextSelectionOperator.overlaps()):
...
If you want to get the annotations instead, just add ``.annotations()``::
for found_annotation in annotation.related_text(TextSelectionOperator.overlaps()).annotations():
...
Assume `sentence` is an annotation representing a sentence, we can find text selections inside (embedded in) the sentence as follows::
for textselection in sentence.related_text(TextSelectionOperator.embeds()):
...
Like above, but now we actively look for annotations that are marked as words, effectively selecting
all words in a sentence::
data_word = store.dataset("structural-set").key("type").data(value="word", limit=1)[0]
for word in sentence.related_text(TextSelectionOperator.embeds()).annotations(filter=data_word):
...
"""
def json(self) -> str:
"""Returns the annotation as STAM JSON in a string with appropriate pretty-print formatting."""
def webannotation(self, **kwargs) -> str:
"""
Returns the annotation as a W3C Web Annotation in JSON-LD, as a compact single-line string without pretty formatting (immediately usable for output to JSONL).
Keyword Arguments
--------------------
`default_annotation_iri`: `str`
IRI prefix for Annotation Identifiers. Will be prepended if the annotation's public ID is not an IRI yet.
`generate_annotation_iri`: `bool`
Generate a random annotation IRI if it does not exist yet? (non-deterministic!)
`default_set_iri`: `str`
IRI prefix for Annotation Data Sets. Will be prepended if the annotation data set public ID is not an IRI yet.
`default_resource_iri`: `str`
IRI prefix for Text Resources. Will be prepended if the resource public ID is not an IRI yet.
`extra_context`: `[str]`
Extra JSON-LD context to export; these must be URLs to JSON-LD files.
`auto_generated`: `bool`
Automatically add a 'generated' triple for each annotation, with the timestamp of serialisation
`auto_generator`: `bool`
Automatically add a 'generator' triple for each annotation, with the software details
`context_namespaces`: `[(str,str)]`
Automatically generate a JSON-LD context alias for all URIs in keys, maps URI prefixes to namespace prefixes
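Because the result is a compact single-line string, it can be written to a JSONL file without further processing. The sketch below shows that pattern with hypothetical stand-in strings instead of real library output; actual web annotations are JSON-LD documents with many more fields, and the example IRIs are made up.

```python
import io
import json

# Hypothetical stand-ins for the strings returned by Annotation.webannotation().
serialised = [
    '{"id": "https://example.org/annotation/1", "type": "Annotation"}',
    '{"id": "https://example.org/annotation/2", "type": "Annotation"}',
]

# JSONL: one JSON document per line, so each compact string is written as-is.
buffer = io.StringIO()
for line in serialised:
    buffer.write(line + "\n")

# Reading back: every line parses as an independent JSON document.
parsed = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(len(parsed))
```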
"""
def test(self, operator: TextSelectionOperator, other: Annotation) -> bool:
"""
This method is called to test whether a specific spatial relation (as expressed by the
passed operator) holds between an :class:`Annotation` and another.
A boolean is returned with the test result.
"""
def test_textselection(self, operator: TextSelectionOperator, other: TextSelection) -> bool:
"""
This method is called to test whether a specific spatial relation (as expressed by the
passed operator) holds between an :class:`Annotation` and a :class:`TextSelection`.
A boolean is returned with the test result.
"""
def transpose(self, via: Annotation, **kwargs) -> Annotations:
"""
The transpose function maps an annotation, textselection, or textselection set from
one coordinate system to another. These mappings are defined in annotations called
**transpositions** and are documented here: https://github.com/annotation/stam/blob/master/extensions/stam-transpose/README.md
Transpositions link identical textual parts across resources, any annotations within
the bounds of such a mapping can then be *transposed* using this function to the other coordinate system.
The `via` parameter expresses the transposition that is being used.
The result of a transpose operation is itself again a transposition.
Keyword arguments
------------------
allow_simple: bool
Allow a simple transposition as output. By default this is set to `False`, as we usually want to have a transposed annotation
no_transposition: bool
Do not produce a transposition annotation, only output the transposed annotation (allow_simple must be set to false)
This effectively throws away the provenance information.
no_resegmentation: bool
Do not produce a resegmentation annotation. If needed for a complex transposition, a resegmented annotation is still created, but
the resegmented version (used as source in the transposition) is not linked to the original source annotation. This effectively throws away provenance information.
This only comes into play if `no_transposition == False`
transposition_id: Optional[str]
An ID to assign to the transposition that is outputted
resegmentation_id: Optional[str]
An ID to assign to the resegmentation that is outputted (if any)
debug: bool
Output debug information to stderr
"""
def substore(self) -> Optional[AnnotationSubStore]:
"""
Returns the substore this annotation is a part of, or `None` if the annotation is part of the root store.
"""
def alignments(self, **kwargs) -> list[list[Union[TextSelection,Annotation]]]:
"""
If this annotation describes a transposition (https://github.com/annotation/stam/blob/master/extensions/stam-transpose/README.md),
this will extract the alignments in the transposition to a list of lists. Each inner list holds `TextSelection` instances that are in alignment.
If you want to return `Annotation` instances instead, set the keyword argument:
Keyword arguments
------------------
annotations: bool
Return annotations instead of text selections. Note that this only works for complex transpositions; for simple transpositions you always get text selections regardless of this setting.
"""
class Annotations:
"""
An `Annotations` object holds an arbitrary collection of annotations.
The annotations are references to items in an AnnotationStore, not copies.
You can iterate over it to retrieve :class:`Annotation` instances.
"""
def __iter__(self) -> Iterator[Annotation]:
"""Iterator over all annotations in this collection"""
def __len__(self) -> int:
"""Returns the number of annotations in the collection"""
def __next__(self) -> Annotation:
"""Return the next item in the iterator"""
def __getitem__(self, int) -> Annotation:
"""Returns an annotation in the collection by index"""
def is_sorted(self) -> bool:
"""Returns a boolean indicating whether the annotations in this collection are sorted chronologically (earlier annotations before later ones). Note that this is distinct from any textual ordering."""
def data(self, *args, **kwargs) -> Data:
"""
Returns annotation data (:class:`Data` containing :class:`AnnotationData`) used by annotations in this collection.
The data can be filtered using positional and/or keyword arguments; see :meth:`Annotation.data`.
If no filters are set (default), all data from all annotations are returned (without duplicates).
"""
def test_data(self, *args, **kwargs) -> bool:
"""
Tests whether certain annotation data is used by any annotation in this collection.
The data can be filtered using keyword arguments. See :meth:`data`.
Unlike :meth:`data`, this method merely tests without returning the data, and as such is more performant.
"""
def annotations(self, *args, **kwargs) -> Annotations:
"""
Returns annotations (:class:`Annotations` containing :class:`Annotation`) that reference annotations in the current collection (i.e. annotations that target any of the current annotations using an AnnotationSelector).
The annotations can be filtered using positional and/or keyword arguments; see :meth:`Annotation.annotations`.
If no filters are set (default), all annotations are returned (without duplicates) in chronological order.
Example
-----------
Say `annotation` represents a word; we can get all annotations with key "part-of-speech" that point to this annotation::
key = store.dataset("linguistic-set").key("part-of-speech")
for pos_annotation in annotation.annotations(filter=key):
data = pos_annotation.data(filter=key, limit=1)[0]
...
"""
def annotations_in_targets(self, *args, **kwargs) -> Annotations:
"""