Annotations
Note: This is an old document: Some of these concepts/words/definitions are deprecated, or meanings have drifted.
What’s an annotation? (Grounded) annotations are information associated to intervals of time.
An annotation is called "grounded" when the information is directly about a specific interval of time. Often, this annotation is also associated to a specific channel.
This is about operating on time-series. Time-series annotate time. Generally speaking, an annotation is “information (signifier) about something (signified)”. We’ll focus on time annotations here, and more specifically on interval annotations, where the signified is an interval of time (conditioned by a source or context -- e.g. sensor, channel, monitored asset, etc.). Note that the representation of the signified, or parts thereof, can be unknown or implicit -- which happens often when we can factor out some commonness (e.g. we’re referring to the same sensor, or a given day, etc.).
We will first focus on a (widely applicable) subset of the problem: We have a single sensor that produces a digital signal and we acquire, timestamp and store this data. We may have captured a continuous unbroken stream of the signal, or we may have acquired it in “sessions” of continuous data, with gaps in between.
From there we want to carry out several analysis activities, many of which boil down to creating annotations, and possibly storing and retrieving these: Spectrums, feature vectors, detection probabilities, tags, etc. Further, we may have other relevant time-series data from other sensors or sources.
Generally, an interval annotation (again, assuming the context/reference is known) is a triple `(bt, tt, metadata)`, where `bt` (for “bottom time” or “beginning time”) and `tt` (for “top time” or “terminal time”) are the bounds of the interval in some known unit. The way we represent and store these annotations is a different story. For example, `tt` may be irrelevant, or redundant (if `tt - bt` is fixed).
The waveform samples are annotations, but would we store them as `(bt, sample value)` pairs? Probably not.
The waveform (`wf`) produced by the sensor is assumed to be regular with a fixed (and known) sample rate (`sr`). This means that given the time-stamp of a single sample, we can derive the time-stamps of any other (previous or subsequent) samples in a sequence of contiguous (no samples lost) samples.
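For instance (a minimal sketch, assuming `bt` is in seconds and `sr` in Hz; the function names are just illustrative):

```python
def sample_timestamp(bt, sr, i):
    """Timestamp of the i-th sample of a contiguous sequence starting at bt."""
    return bt + i / sr

def sample_index(bt, sr, t):
    """Index of the sample covering time t (assuming t falls within the sequence)."""
    return int((t - bt) * sr)

# Example: at 44.1 kHz, the 44100th sample after bt=10.0s falls at t=11.0s.
assert sample_timestamp(10.0, 44100, 44100) == 11.0
assert sample_index(10.0, 44100, 11.0) == 44100
```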
So should we store the full waveform under a single key (associated to the initial timestamp)? Probably not either, because the waveform could become too big to store in a single file or DB record. The sensible thing to do is somewhere between storing individual timestamped samples, and a single timestamped sequence of samples.
Examples of grounded annotations:
- A digital waveform is really just a sequence of annotations of analog audio.
- A sound chunk is an annotation of a (fixed sized) segment of an audio channel.
- If we tag an (arbitrary length) segment of a waveform, that's a grounded annotation.

By contrast, an abstract annotation is information we associate (explicitly or implicitly) to several intervals of time. This could be expressed by a logical expression, an explicit collection of other annotations, etc.
Examples of abstract annotations:
- A sound chunk tag is an abstract annotation. We tag an `sref`, which is a key to a sound chunk. It's an annotation of a sound chunk (which is itself a grounded annotation).
- Collections of waveforms are abstract annotations: We're giving a name to a collection of grounded annotations here.
- Ontologies are abstract annotations.
How do we represent grounded annotations? I'll describe next how to represent an annotation with a dict (or equivalently, mongo doc).
A (grounded) annotation of some sensor data can be represented by a dict with the fields:
- `bt`: bottom time (lower bound of the time interval). By convention, inclusive.
- `tt`: top time (upper bound of the time interval). By convention, exclusive.
- `c`: The channel/source. This should identify unambiguously a time series that we're annotating, or a set of (time aligned) channels when the annotation has to do with information (that might be captured by these channels) about the time segment itself.
- `v`: The value of the annotation. This could be a number, but could be in general any value, object, structure... Whatever information we want to associate to this `[bt, tt)` interval.

Note: Believe me, I know `(bt, tt)` is a bit obscure, but I voted against other less obscure naming such as `(gt, it)` or `(lt, ut)` for (good or bad) reasons.
If the situation arises that we may have annotations that are of a different nature, we'll need to also include a `kind` field to indicate this. For example, we might be annotating segments with semantic tags, or acoustic properties, or associating other contextual information to this segment. The `kind` field serves to separate these different kinds of annotations. If all annotations in the collection are of the same nature, we don't need this.
Two more miscellaneous notes:
- In cases where we may wish to query/filter/use the length of the interval (i.e. `tt - bt`), it would be useful to add a `len` field as well.
- When the lengths of the annotations are all the same (such as sound chunks of a same channel), the `tt` field could be unnecessary.
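For example, such a dict (or mongo doc) might look like this (a sketch; the field values, the `'tag'` kind, and the `len` trick are just illustrations of the conventions above):

```python
# A grounded annotation as a dict (or mongo doc).
# Times are in seconds here (any known unit works); [bt, tt) with bt inclusive, tt exclusive.
annot = {
    'bt': 12.0,        # bottom time (inclusive)
    'tt': 13.5,        # top time (exclusive)
    'c': 'sensor_07',  # channel/source being annotated (made-up name)
    'v': 'door_slam',  # the value: a tag here, but could be any object/structure
}

# If the collection mixes annotations of different natures, add a 'kind' field:
annot_with_kind = dict(annot, kind='tag')

# If interval lengths will be queried/filtered on, a redundant 'len' field can help:
annot_with_len = dict(annot, len=annot['tt'] - annot['bt'])
```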
Annotations could be:
- Manual: User selects a range and enters some info about it (tag, etc.)
- Automatic: A process runs through a channel’s signal stream and annotates it according to some criteria. Examples:
- stationary annotations in CAN data: partitioning a signal's time-series into pieces of equal value
- near-stationary annotations in audio: finding and annotating segments of an audio stream that are nearly stationary (silence, constant buzz, etc.)
- outlier segments: Finding and annotating segments that have a high density of outlier scores
Annotation operations could be:
- Meta annotations:
  - Examples:
    - Ontology of annotations
    - Event annotations: Annotations that are defined by the co-occurrence and/or sequence of annotations.
    - Annotation-indexing annotations (to accelerate annotation queries)
- Search:
  - All annotations of a given kind
  - All annotations relevant to a given segment (see the query sketch after this list), for example:
    - Completely contained in segment
    - Overlap with segment
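A sketch of what those two segment-relevance searches could look like, assuming annotations are dicts with `bt`/`tt` fields as above (the function names are just illustrative):

```python
def contained_in(annots, bt, tt):
    """Annotations whose interval is completely contained in the query segment [bt, tt)."""
    return [a for a in annots if a['bt'] >= bt and a['tt'] <= tt]

def overlapping(annots, bt, tt):
    """Annotations whose interval overlaps the query segment [bt, tt)."""
    return [a for a in annots if a['bt'] < tt and a['tt'] > bt]

# With annotations stored as mongo docs, the overlap filter would be something like:
# {'bt': {'$lt': tt}, 'tt': {'$gt': bt}}
```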
Examples: Disregarding the indexing/grouping aspects of (segment) annotations, here are a few common forms of annotations:
- `filepath, tag` (whole file is tagged)
- `folder, tag` (whole contents of folder is tagged)
- If we want to tag only one interval of a file: `filepath, bt, tt, tag` (part of the TS data of the file is tagged)
- If we want to tag an interval of a reference (i.e. some source, or set of sources), without needing to specify explicitly where/how the data is stored: `reference, bt, tt, tag`
- If even the reference needn’t be specified explicitly: `bt, tt, tag`
- I said tag, but it could be any kind of descriptor, such as stats about that interval: `bt, tt, stats`
- If we’re dealing with regular sized intervals, we don’t need to specify both `bt` and `tt`: `bt, tag`
- Instead of a tag, the descriptor can be a feature vector: `bt, fv`
- Or a model output: `bt, model_output`
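As a rough illustration of the last forms: if the channel reference is factored out (e.g. into the name of the store holding the annotations) and all chunks have the same known length, an annotation shrinks to a `bt, fv` pair (the store name, chunk length and values below are made up):

```python
# Everything in this store implicitly refers to one channel, and every chunk
# implicitly covers chunk_len seconds, so tt = bt + chunk_len needn't be stored.
chunk_len = 0.97  # made-up fixed chunk length, in seconds
sensor_07_features = {  # bt -> feature vector (fv)
    0.00: [0.12, 0.55, 0.31],
    0.97: [0.10, 0.60, 0.29],
    1.94: [0.43, 0.02, 0.77],
}
```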
A segment is a reference to an interval of time, and possibly some data that "happened" during this period. It's a grounded annotation. But this annotator (or "segmenter") could have various characteristics. If such segmenters are often used, we should distinguish them with named categories.
Below are some proposals around that aspect.

Logically, a collection of time-stamped “sessions” of continuous signal. Physically, probably stored as fixed-sized (or not) timestamped blocks with interfaces to operate on these as if it were a single continuous stream.
It might be a good idea to have a look at how other streaming systems (e.g. YouTube, Spotify) deal with this situation.

This segmented storage makes sense for streaming sound (or any source that might have large volumes of sound), but not so much for uploaded sound files. If we decide (which would really make things easier) to make blocks fixed-sized (as specified in my table), we'll have a "problem" with sounds whose length is not a multiple of the block size: How do we store the "tail" (the remainder)?
My answer: We don't.
Not ideal, but fixed size blocks will simplify our life for sure. The question is: "is it TOO restrictive?"
For streaming channels, this won't be a problem since we'll have those channel-holding devices send things in blocks anyway, and missing one block means missing at most 0.97s (if that's the default size we choose). Not much next to the hours of sound we should be getting from it.
But what if we get our channel data from uploading many small files? We systematically lose the end of the file. We can't even capture any blocks if the file is smaller than the block size.
One answer to this would be "we treat files differently; we only use block sequences for streaming channels". That could be a good solution. This means we can use the file itself as the source, instead of having to repeat the data with blocks. But this means that our signal storage interface must handle both cases.
So my vote would be to make blocks fixed size, and not block-segment uploaded files. But my opinion is not strong (yet). This is an invitation to pros/cons analysis.
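To make the trade-off concrete, here is a minimal sketch of fixed-size blocking where the tail is simply dropped (the block size and function name are illustrative, not a decided spec):

```python
def fixed_size_blocks(wf, block_size):
    """Split a waveform (sequence of samples) into fixed-size blocks, dropping the tail."""
    n_blocks = len(wf) // block_size
    return [wf[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# A 10-sample waveform with block_size=4 yields two blocks; the 2-sample tail is lost.
assert fixed_size_blocks(list(range(10)), 4) == [[0, 1, 2, 3], [4, 5, 6, 7]]
# A file smaller than block_size yields no blocks at all (the problem mentioned above).
assert fixed_size_blocks([1, 2, 3], 4) == []
```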
Loosely put, a block is a (usually stored) continuous sequence of items. More precisely, a block contains the sub-time-series for a given interval.
In the case of waveforms stored in a filesystem, blocks are (for instance) WAV or PCM audio files that together can be used to reconstruct the signal or any part thereof. If the blocks are time-stamped, then we can reconstruct the signal that was recorded between any time interval -- though if we didn’t acquire enough signal, the reconstruction won’t be complete, or may only provide several contiguous segments with gaps between them.
To reconstruct the signal, we simply find the blocks whose (explicit or implicit) coverage time intervals intersect with our query interval, remove the samples that happened before or after, and glue everything together to reconstruct the waveform (or waveforms).
What makes this possible is the assumption that the blocks together cover all the data that was recorded, and a block will contain all the (fixed sample rate) samples recorded in its coverage, in order, and without repetition.
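A minimal sketch of that reconstruction, assuming non-overlapping, timestamped, fixed-sample-rate blocks represented as `(block_bt, samples)` pairs (the representation and names are just for illustration):

```python
import math

def reconstruct(blocks, bt, tt, sr):
    """Glue together the samples recorded in [bt, tt) from timestamped blocks.

    blocks: iterable of (block_bt, samples) pairs; assumed non-overlapping and
    sorted by block_bt. Times in seconds, sr in Hz. Gaps in coverage simply
    yield fewer samples (no attempt is made to fill them here).
    """
    out = []
    for block_bt, samples in blocks:
        block_tt = block_bt + len(samples) / sr
        if block_tt <= bt or block_bt >= tt:
            continue  # block doesn't intersect the query interval
        # Keep only the samples whose timestamps fall within [bt, tt)
        start = max(0, math.ceil((bt - block_bt) * sr))
        stop = min(len(samples), math.ceil((tt - block_bt) * sr))
        out.extend(samples[start:stop])
    return out
```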
Blocks could be fixed size or not. Fixed size approaches lose flexibility, but reduce the complexity (since we don’t need to compute, or retrieve, the size of each individual block).
Blocks could be overlapping or not. If we can assume blocks don’t overlap, it makes reconstructing the signal easier.
Further grouping and indexing of blocks themselves can be useful. For example, we can group all sequences of contiguous blocks (with no gaps between them) under a “session”. The session acts as a super-block: The block of a sequence of blocks. But this mechanism’s logic is identical to the block mechanism, generalized to any kind of time-interval annotation sequences, with possibly different properties (e.g. no overlap or gaps allowed).
We mentioned wanting to compute, time-stamp, and store spectrograms, feature vectors, etc. These are often computed over fixed size segments of waveform, and often with a fixed size step between these segments. This means that we can also take advantage of the block mechanism here. But it needs to be extended to take care of the fact that the items of the block have a width.
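For instance, with a fixed chunk width and a fixed step between chunks, the interval covered by each item of such a block can be derived from its index (a sketch; the parameter names are illustrative):

```python
def item_interval(block_bt, i, chunk_size, chunk_step, sr):
    """[bt, tt) interval (in seconds) covered by the i-th item (e.g. spectrum or
    feature vector) of a block whose first item starts at block_bt."""
    bt = block_bt + i * chunk_step / sr
    tt = bt + chunk_size / sr
    return bt, tt

# Example: 2048-sample chunks taken every 1024 samples at 44.1 kHz.
assert item_interval(10.0, 0, 2048, 1024, 44100) == (10.0, 10.0 + 2048 / 44100)
```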
Notes on annotations

A waveform sample is an annotation. It is pointing to an instant in time (the signified) and associating a sensor reading to it (the signifier). Though the reading is an “instantaneous” one, we can still think of a sample as a time-interval annotation since it informs us of what the sensor was “sensing” in a tiny interval of time (we’re “sampling” the analog signal in that interval).

A (continuous) waveform is an annotation of the interval of time it covers. We may not necessarily know the interval of time, but the signal it recorded happened in an interval of time.