AAPB-CLAMS Annotation Repository

This repository contains datasets for manual annotation projects for the AAPB-CLAMS collaboration.

Project Information

The American Archive of Public Broadcasting (AAPB) has engaged the CLAMS team to develop information extraction systems for digital archives of public media (primarily video and audio from publicly funded TV shows and radio broadcasts). This collaboration will facilitate the research and preservation of significant historical content in this media collection. This repository provides training and evaluation data for the machine learning-based CLAMS apps used in this process.

Repository Structure

This repository contains:

  • Annotation batches
  • Project subdirectories
  • This README file and some other documentation.

Annotation batches

Annotation batches are stored in the batches subdirectory, which tracks source data for the whole annotation endeavor.

Smaller selections of the AAPB collection are chosen and cataloged as batches in this subdirectory. These sets are chosen for a variety of reasons, but are typically designed to help evaluate or improve CLAMS applications. A batch is the set of identifying GUIDs/tags for a group of media assets. An AAPB GUID is a unique identifying string that can be used on the AAPB website to find a particular media asset and its supporting files, e.g. cpb-aacip-96d289b264c at https://americanarchive.org/catalog/cpb-aacip-96d289b264c. An AAPB GUID is not a universally unique identifier; it is unique only within the scope of the AAPB.

Batches precede annotation projects, but some batches may be created just for the purpose of a single annotation project. Annotation projects then choose appropriate batches to structure the work (see the raw annotation data section). Each batch is defined by a BATCH_NAME.txt file in this directory.

  • Batches are often named after their relevant GitHub issue from the AAPB-CLAMS collaboration repository.

  • Each line in the file must be either a single AAPB GUID or a comment starting with a #. The first lines are typically batch-level comments, while later comment lines may specify sources for subsequent AAPB GUIDs.

Typically, batch-level comments start and end with a comment line with just hyphens, for example, for aapb-annotations-44 we have:

# --------------------------------------------------------------------------------
# A set of videos that have various instances of "scenes with text" that are ideal
# for creating labeled data for roles and fillers (key-value pairs) extraction.
#
# See https://github.com/clamsproject/aapb-annotations/issues/44 for the selection
# process and other additional information.
# --------------------------------------------------------------------------------
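
As an illustration, a batch file like the one above could be read with a small Python sketch along these lines (the read_batch helper and the batch file path are hypothetical, not part of this repository's tooling):

from pathlib import Path

def read_batch(batch_file):
    """Yield AAPB GUIDs from a batch file, skipping comment lines."""
    for line in Path(batch_file).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and batch-level/source comments
        yield line

# e.g. print the AAPB catalog URL for every GUID in a (hypothetical) batch file
for guid in read_batch("batches/aapb-annotations-44.txt"):
    print("https://americanarchive.org/catalog/" + guid)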

Project Subdirectories

Every other directory in this repository represents a specific annotation project. The subdirectory name is the name of the project. A project directory includes raw output from the annotation process, gold-formatted final output data files for tool ingestion, software for converting files from raw to gold format, and a project-specific README.md explaining the project and its annotation guidelines.

Raw annotation data

Important

YYMMDD-batchName directory

This directory contains output files from the manual annotation process created by an annotation tool or by hand.

The raw annotation files are organized by batch name and the starting date of the annotation. A single annotation "period" is the complete process of annotating one batch of source data (AAPB assets). The YYMMDD- prefix must indicate when the annotation of a batch was conducted (that is, when the batch was first prepared and used for annotation). The batchName part of the directory name must match the basename of one of the .txt files in the annotation batches directory. The date and batch name prefixes are used for sorting annotation processes and for machine ingestion of the raw data.
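
As an illustration only, a directory name following this convention could be validated with a sketch like the following (the check_raw_dir helper and paths are hypothetical, not part of the repository's tooling):

import re
from pathlib import Path

RAW_DIR_PATTERN = re.compile(r"(\d{6})-(.+)")

def check_raw_dir(raw_dir, batches_dir="batches"):
    # Split a YYMMDD-batchName directory name into its two parts and make
    # sure the batch name matches a batch definition file.
    match = RAW_DIR_PATTERN.fullmatch(Path(raw_dir).name)
    if match is None:
        raise ValueError(f"{raw_dir} does not follow the YYMMDD-batchName convention")
    date, batch_name = match.groups()
    if not (Path(batches_dir) / f"{batch_name}.txt").exists():
        raise ValueError(f"no batch definition found for {batch_name}")
    return date, batch_name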

Different annotation tools create different file formats, so the raw annotation files must be converted into a common format for the gold data.

Gold dataset files

Important

golds directory

This directory contains the public "gold" dataset generated by conversion scripts.

The gold dataset is a set of files in a format ready for machine consumption, primarily for

  1. training ML models for CLAMS apps,
  2. evaluation of CLAMS app outputs,
  3. other public usage.

There are some rules on the content and structure of the gold directory:

  1. There must be one file per GUID, and the GUID should be part of the filename.
  2. The number of gold files in this directory must match the total number of GUIDs across all annotated batches. This means there cannot be any overlap between the assets in different batches.
  3. The golds directory may have subdirectories, but these subdirectories should not reflect batch structure. For example, the scene-recognition project has subdirectories for time points and for time frames. No deeper directory structure is allowed.

Given these rules, which the conversion code described below follows, users of the AAPB-CLAMS dataset should find the gold data easier to use for machine consumption than the raw data.
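
As a rough, illustrative sketch of how these rules could be checked (the helper names are hypothetical, and a real check would restrict itself to the batches actually annotated for a given project):

from pathlib import Path

def batch_guids(batches_dir="batches"):
    # Collect the GUIDs listed in every batch file, skipping comment lines.
    guids = set()
    for batch_file in Path(batches_dir).glob("*.txt"):
        for line in batch_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                guids.add(line)
    return guids

def missing_golds(golds_dir, batches_dir="batches"):
    # Report GUIDs that have no gold file whose name contains them.
    gold_names = [p.name for p in Path(golds_dir).rglob("*") if p.is_file()]
    return [g for g in batch_guids(batches_dir) if not any(g in n for n in gold_names)]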

Scripts for format conversion

Important

(usually) process.{sh,py} and dependencies

This is typically a single script that processes the raw annotation files and generates the gold data. The input file format (i.e., the direct output from the annotation process) can vary (e.g. .csv, .json, .txt). The output file format must be a common machine-readable data format (CSV, JSON, but definitely not MMIF), and is subject to change to meet future requirements of the consuming software. Users of a gold dataset should therefore be aware of the version of the gold dataset they are using, and are recommended to use permalinks to refer to a specific version of the gold dataset in their code or documentation.
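
The exact conversion logic is project-specific; purely as an illustrative sketch (the column names, file naming, and command-line usage here are hypothetical assumptions, not the actual format of any project), a minimal process.py could look like:

import csv
import sys
from pathlib import Path

def process(raw_dir, gold_dir):
    # Convert each raw CSV (assumed here to be named by GUID) into a gold CSV,
    # keeping only the hypothetical 'start', 'end', and 'label' columns.
    Path(gold_dir).mkdir(parents=True, exist_ok=True)
    for raw_file in Path(raw_dir).glob("*.csv"):
        with open(raw_file, newline="") as f:
            rows = list(csv.DictReader(f))
        with open(Path(gold_dir) / raw_file.name, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["start", "end", "label"])
            writer.writeheader()
            for row in rows:
                writer.writerow({k: row[k] for k in ("start", "end", "label")})

if __name__ == "__main__":
    process(sys.argv[1], sys.argv[2])  # e.g. python process.py 230101-some-batch golds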

In addition to the main script, if the code requires additional dependencies/scripts, they should be placed at the same level in that subdirectory. Dependencies on third-party modules can be documented in the README.md file or in a machine-friendly file that lists the dependencies (e.g. requirements.txt for pip).

Finally, see the conventions section for naming conventions for common field/column names in gold data.

README file and other project documentation

Important

README.md (and possibly guidelines.{md,ppt})

Project-specific information, including but not limited to:

  • Annotation project name

  • One-line summary of the project

  • Annotator summary. Some basic demographic information about the annotators: age group, language proficiency, occupational characteristics, etc. No personally identifiable information, unless the annotator wants to be credited.

  • Annotation environment/tool information (name, version, link, user manual, etc.). In most cases, there is a separate codebase (ideally on https://github.com/clamsproject/) for the annotation tool, which includes the manual.

  • Project changes: version changes, selection of asset batches, change in annotator personnel, etc.

  • Raw-to-gold conversion code explanation

    • dependencies, short description of process.py
    • formats of raw and gold files
    • field description, with data types
    • differences between raw and gold files, i.e. information added or discarded by process.py
  • Annotation guidelines - sometimes provided as a separate file named guidelines.{md,ppt}. This section should sufficiently document how the annotation was done and the conditions/assumptions under which the dataset exists:

    • What tool is used, and how it is used.
    • What to annotate.
    • Options for label choices.
    • Label formatting.
    • Differentiation between labels, edge cases, and other decisions made during annotation.
    • Concerns, limitations, and precision details (e.g. time imprecision).

Note

README.md and guidelines.{md,ppt} files are expected to be actively maintained by the project manager. All guideline files should be version-controlled.

Repository-level Conventions

Important

Media Time = hh:mm:ss.mmm with a DOT
Annotation times are usually somewhat imprecise, because audiovisual phenomena are inherently fuzzy, as is the process of visualizing and labelling them.
Some estimates of this imprecision are given as a Margin of Error.
Directionality definitions help frame the boundaries meant by annotated times.
The fields in the gold datasets should be standardized.

Please see the Repository-level Conventions file for more on standardizations and conventions.
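
For example, a media time in the hh:mm:ss.mmm convention could be parsed with a small sketch like this (illustrative only, not part of the repository's tooling):

import re

TIME_PATTERN = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})")

def media_time_to_ms(timestamp):
    # Convert an 'hh:mm:ss.mmm' media time (note the DOT before the
    # milliseconds) into an integer number of milliseconds.
    match = TIME_PATTERN.fullmatch(timestamp)
    if match is None:
        raise ValueError(f"{timestamp!r} is not in hh:mm:ss.mmm format")
    hh, mm, ss, mmm = (int(g) for g in match.groups())
    return ((hh * 60 + mm) * 60 + ss) * 1000 + mmm

# e.g. media_time_to_ms("00:01:30.500") == 90500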

List of Current Projects/Subdirectories

This section is manually updated and may be incomplete.

  • january-slates - slates are actual visible frames within the video media that contain the metadata and other identifying information of that video.

    • e.g. program name, director, producer, etc.
    • The project was done in January; this naming convention is now outdated.
  • newshour-chyron - drawn from the NewsHour TV broadcast, this project annotates text appearing on screen, usually above or below the main action, saying things such as "Breaking News" or "Joan, author".

  • newshour-namedentity - from NewsHour. This project annotated named entities found within the video transcripts, along with the character spans that denote each named entity and its type (see newshour-namedentity/{guidelines,readme}.md).

  • newshour-namedentity-wikipedialink - from NewsHour. This project used the previous project's dataset and added an extra label giving the Wikidata link that refers to each annotated named entity, e.g. https://www.wikidata.org/wiki/Q931148.

  • newshour-transcript - from NewsHour. This project aligned transcripts to the videos by finding the start and end times of closed-captioning text, 10 tokens at a time.

  • role-filler-binding - This project applies the linguistic notion of role-filler binding to extract and organize OCR (optical character recognition) text into a structured, readable set of metadata pairs. Each pair is usually a role (of a production collaborator or of a person within the video) and the capitalized person name that fills that role.

  • scene-recognition - This project builds a dataset meant to train ML models to recognize scenes/frames/timeframes of interest to GBH/AAPB/CLAMS for extracting metadata such as slates, chyrons, credits, and important people being interviewed. This is a combined effort to recognize these kinds of frames and to find the timeframes where they exist in aggregate, drawing upon findings from previous projects.

Issue Tracking and Conversation Archive

Progress and other discussion by AAPB/CLAMS/WGBH is tracked via open and closed GitHub issues. For other inquiries, please email the CLAMS.ai admin.
