Add an XSD #273

Merged: 20 commits, Mar 21, 2025
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "xml-schemas/ebu-tt-m-xsd"]
path = xml-schemas/ebu-tt-m-xsd
url = https://github.com/ebu/ebu-tt-m-xsd.git
6 changes: 3 additions & 3 deletions examples/non-dapt-divs.xml
@@ -5,11 +5,11 @@
 <div> <!-- This cannot be a Script Event because it has no xml:id -->
 <p> <!-- Would be a Text if its parent were a Script Event --> </p>
 </div>
-<div> <!-- div parent of another div -->
+<div xml:id="d2_1"> <!-- div parent of another div -->
 <div xml:id="d2"> <!-- Possibly a Script Event --></div>
 </div>
-<div> <!-- double layer of nesting -->
-<div>
+<div xml:id="d3_1"> <!-- double layer of nesting -->
+<div xml:id="d3_1_1">
 <div xml:id="d3" begin="..." end="..." xml:lang="ja" foo:bar="baz">
 <!-- A Script Event with possibly unexpected attributes -->
 </div>
44 changes: 44 additions & 0 deletions schema-validator/README.md
@@ -0,0 +1,44 @@
# DAPT XSD Validator

Basic command line utility for validating DAPT documents using
the XML Schema Definition in the w3c/dapt repository.

This script uses the MIT-licensed [`xmlschema`](https://github.com/sissaschool/xmlschema) library.

This script is provided as-is with no warranties of any kind.
The repository's `LICENSE.md` applies, with the contents of this folder
being considered a _code example_.

## Build

1. Install poetry - [installation instructions](https://python-poetry.org/docs/#installing-with-the-official-installer)
2. Ensure that Python 3.11 or later is available,
   for example by running `poetry env use 3.11`
3. Install the dependencies by running `poetry install`

## Usage

```sh
poetry run validate -dapt_in path/to/dapt_file.ttml
```

Alternatively, pass the document to be validated via stdin, e.g.:

```sh
poetry run validate < path/to/dapt_file.ttml
```
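
The two input modes above can be sketched in Python as follows. This is a minimal illustration, not the script's own code: the helper name `read_input` and the `None` default are assumptions made for the example, and only the `-dapt_in` option name is taken from the usage above. The point it shows is that the document is read as bytes in both cases.

```python
import argparse
import os
import sys
import tempfile


def read_input(argv: list[str]) -> bytes:
    """Return the document bytes from -dapt_in if given, else from stdin."""
    parser = argparse.ArgumentParser()
    # Assumption of this sketch: a default of None means "read from stdin".
    parser.add_argument('-dapt_in', default=None)
    args = parser.parse_args(argv)
    if args.dapt_in is None:
        # sys.stdin is a text stream; its .buffer attribute yields bytes.
        return sys.stdin.buffer.read()
    with open(args.dapt_in, 'rb') as f:
        return f.read()


# Write a tiny placeholder document to a temporary file and read it back.
with tempfile.NamedTemporaryFile(suffix='.ttml', delete=False) as tmp:
    tmp.write(b'<tt xmlns="http://www.w3.org/ns/ttml"/>')

data = read_input(['-dapt_in', tmp.name])
os.remove(tmp.name)
print(data.decode('utf-8'))
```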

### Validating without pruning unrecognised vocabulary

By default this script prunes unrecognised vocabulary before
XSD validation, as required by
[DAPT §5.2.1 Unrecognised vocabulary](https://www.w3.org/TR/dapt/#unrecognised-elements-and-attributes);
this behaviour can be disabled via the command line parameter
`-noprune`.
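
As a rough illustration of what pruning means in practice, the sketch below removes elements and attributes in unrecognised namespaces from a small fragment. It is a simplified stand-in, not the script's actual code: the two-entry namespace list, the `begin`/`end` attribute list, the `urn:example:other` namespace and the names `namespace_of` and `prune` are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: a deliberately tiny allow-list, for illustration only.
RECOGNISED = {
    'http://www.w3.org/XML/1998/namespace',
    'http://www.w3.org/ns/ttml',
}
KNOWN_NO_NS_ATTRS = {'begin', 'end'}


def namespace_of(name: str) -> str:
    """Return the namespace part of a '{ns}local' name, or '' if none."""
    return name[1:].split('}', 1)[0] if name.startswith('{') else ''


def prune(el: ET.Element) -> ET.Element:
    # Drop child elements in unrecognised namespaces; recurse into the rest.
    for child in list(el):
        if namespace_of(child.tag) in RECOGNISED:
            prune(child)
        else:
            el.remove(child)
    # Drop attributes in unrecognised namespaces, and attributes in no
    # namespace that are not on the known list.
    for key in list(el.attrib):
        ns = namespace_of(key)
        if (ns and ns not in RECOGNISED) or \
                (not ns and key not in KNOWN_NO_NS_ATTRS):
            del el.attrib[key]
    return el


doc = ET.fromstring(
    '<div xmlns="http://www.w3.org/ns/ttml" xmlns:o="urn:example:other" '
    'xml:id="d1" begin="1s" foo="bar" o:test="x"><p/><o:x/></div>'
)
prune(doc)
remaining_attrs = sorted(doc.attrib)
remaining_children = [child.tag for child in doc]
print(remaining_attrs)
print(remaining_children)
```

Here `foo`, `o:test` and `<o:x/>` are removed, while `xml:id`, `begin` and `<p/>` survive.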


## Tests

```sh
poetry run python -m unittest
```
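
The test sources themselves are not reproduced here. As a hedged sketch of the kind of `unittest` test case that the command above would discover, the following re-declares a small namespace helper locally, modelled on `get_namespace` in `src/validate.py`, so that it runs on its own. The class name `GetNamespaceTest` and its test methods are hypothetical.

```python
import unittest


def get_namespace(tag: str) -> str:
    """Return the namespace of a '{ns}local' name, or '' if unqualified."""
    if not tag.startswith('{'):
        return ''
    if '}' not in tag:
        raise ValueError('No closing brace found')
    return tag[1:].split('}', 1)[0]


class GetNamespaceTest(unittest.TestCase):  # hypothetical test case

    def test_qualified_name(self):
        self.assertEqual(
            get_namespace('{http://www.w3.org/ns/ttml}div'),
            'http://www.w3.org/ns/ttml')

    def test_unqualified_name(self):
        self.assertEqual(get_namespace('begin'), '')

    def test_missing_closing_brace(self):
        with self.assertRaises(ValueError):
            get_namespace('{unterminated')


# Run the test case programmatically so the sketch works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GetNamespaceTest)
result = unittest.TextTestRunner().run(suite)
```
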
88 changes: 88 additions & 0 deletions schema-validator/poetry.lock

Some generated files are not rendered by default.

24 changes: 24 additions & 0 deletions schema-validator/pyproject.toml
@@ -0,0 +1,24 @@
[tool.poetry]
name = "dapt-xsd-val"
version = "0.1.0"
description = "Thin wrapper around xmlschema to support XSD validation using the supplied DAPT XSD"
authors = ["Nigel Megitt <[email protected]>"]
readme = "README.md"
packages = [
{ include = "src" },
{ include = "test" },
]

[tool.poetry.dependencies]
python = ">=3.11"
xmlschema = "^3.4.3"

[tool.poetry.group.dev.dependencies]
flake8 = "^7.1.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.scripts]
validate = "src.validate:main"
Empty file.
175 changes: 175 additions & 0 deletions schema-validator/src/validate.py
@@ -0,0 +1,175 @@
import os
import sys
import argparse
import xmlschema
import xml.etree.ElementTree as ElementTree
from xml.etree.ElementTree import Element as Element
import logging

logging.getLogger().setLevel(logging.INFO)


schema_path = os.path.normpath(
    os.path.join(
        os.path.dirname(__file__),
        '../../xml-schemas/dapt.xsd')
)
metadata_items_schema_path = os.path.normpath(
    os.path.join(
        os.path.dirname(__file__),
        '../../xml-schemas/ttml2-metadata-items.xsd')
)

recognised_namespaces = set([
    'http://www.w3.org/XML/1998/namespace',  # xml
    'http://www.w3.org/ns/ttml',  # tt
    'http://www.w3.org/ns/ttml#parameter',  # ttp
    'http://www.w3.org/ns/ttml#audio',  # tta
    'http://www.w3.org/ns/ttml#metadata',  # ttm
    'http://www.w3.org/ns/ttml/feature/',
    'http://www.w3.org/ns/ttml/profile/dapt#metadata',  # daptm
    'http://www.w3.org/ns/ttml/profile/dapt/extension/',
    'urn:ebu:tt:metadata',  # ebuttm
])

# We will not prune attributes in no namespace if they are
# defined on any element in TTML or DAPT, even if they are
# not defined on the specific element on which they occur.
known_no_ns_attributes = set([
    'agent',
    'animate',
    'begin',
    'calcMode',
    'clipBegin',
    'clipEnd',
    'condition',
    'dur',
    'encoding',
    'end',
    'family',
    'fill',
    'format',
    'keySplines',
    'keyTimes',
    'length',
    'name',
    'range',
    'region',
    'repeatCount',
    'src',
    'style',
    'timeContainer',
    'type',
    'weight',
])


def get_namespace(tag: str) -> str:
    if (len(tag) == 0 or tag[0] != '{'):
        return ''

    if '}' not in tag:
        raise ValueError('No closing brace found')

    return tag.split('{', 1)[1].split('}', 1)[0]


def get_unqualified_name(tag: str) -> str:
    if '}' not in tag:
        return tag

    return tag.split('}', 1)[1]


def prune_unrecognised_vocabulary(el: Element):
    to_remove = []
    for child in el:
        if get_namespace(child.tag) not in recognised_namespaces:
            logging.debug('pruning element {}'.format(child.tag))
            to_remove.append(child)
        else:
            prune_unrecognised_vocabulary(child)

    for e in to_remove:
        el.remove(e)

    # Iterate over a copy of the keys, because attributes are removed
    # from el.attrib inside the loop.
    for attr_key in list(el.keys()):
        attr_ns = get_namespace(attr_key)

        attr_name = get_unqualified_name(attr_key)
        if (attr_ns and attr_ns not in recognised_namespaces) \
                or \
                (not attr_ns and attr_name not in known_no_ns_attributes):
            logging.debug('pruning {}@{}'.format(el.tag, attr_key))
            el.attrib.pop(attr_key)

    return el


def validate_dapt(args):
    # xmlschema gets baffled following the import of metadata_items,
    # so make it load it explicitly instead, which seems to work.
    schema_paths = [schema_path, metadata_items_schema_path]
    logging.info('Creating schema from XSDs at {}'.format(schema_paths))
    schema = xmlschema.XMLSchema(schema_paths)
    schema.build()
    # validity is a status string ('valid', 'invalid' or 'notKnown'),
    # which is always truthy, so compare it explicitly.
    if schema.validity == 'valid':
        logging.info('Schemas are valid')
    else:
        logging.error('Schemas are not valid, exiting early')
        return -1

    try:
        logging.info('Validating document at {}'.format(args.dapt_in.name))
        dapt_in_bytes = args.dapt_in.read()
        in_xml_str = str(dapt_in_bytes, encoding='utf-8', errors='strict')
        root = ElementTree.fromstring(in_xml_str)

        if not args.noprune:
            logging.info('Pruning unrecognised vocabulary')
            prune_unrecognised_vocabulary(el=root)
        schema.validate(source=root)
    except xmlschema.XMLSchemaValidationError as valex:
        logging.error(str(valex))
        logging.error(
            'Document is not valid {} pruning unrecognised vocabulary.'.format(
                'before' if args.noprune else 'after'
            ))
        return -1

    logging.info(
        'Document is syntactically valid with respect to the '
        'DAPT XML Schema Definition{}; '
        'this does not check all semantic requirements of the '
        'DAPT specification.'.format(
            '' if args.noprune else ' after pruning unrecognised vocabulary'
        ))
    return 0


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        '-dapt_in',
        type=argparse.FileType('rb'),
        # Use the binary stdin stream to match FileType('rb'): sys.stdin
        # is a text stream whose read() returns str, which validate_dapt
        # cannot decode.
        default=sys.stdin.buffer,
        nargs='?',
        action='store',
        help='Input DAPT file to validate')
    parser.add_argument(
        '-noprune',
        default=False,
        required=False,
        action='store_true',
        help='If set, attempts to validate without '
             'pruning unrecognised vocabulary.')
    parser.set_defaults(func=validate_dapt)

    args = parser.parse_args()
    return args.func(args)


if __name__ == "__main__":
    # execute only if run as a script
    sys.exit(main())
Empty file.
34 changes: 34 additions & 0 deletions schema-validator/test/fixtures/valid_dapt.ttml
@@ -0,0 +1,34 @@
<?xml version="1.0" encoding="UTF-8"?>
<tt xmlns="http://www.w3.org/ns/ttml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tta="http://www.w3.org/ns/ttml#audio"
xmlns:ttm="http://www.w3.org/ns/ttml#metadata"
xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
xmlns:ebuttm="urn:ebu:tt:metadata"
ttp:contentProfiles="http://www.w3.org/ns/ttml/profile/dapt1.0/content"
daptm:scriptRepresents="audio"
daptm:scriptType="originalTranscript"
xml:lang="en">
<head>
<metadata xmlns:otherns="urn:some:other:namespace">
<ttm:agent xmlns:nttm="http://www.netflix.com/ns/ttml#metadata"
nttm:voice="en-US-Wavenet-B" foo="bar" otherns:test="x" type="person" xml:id="actor_A" id="huh">
<ttm:name type="full">Matthias Schoenaerts</ttm:name>
</ttm:agent>
<ttm:agent type="character" xml:id="character_2">
<ttm:name type="alias">BOOKER</ttm:name>
<ttm:actor agent="actor_A"/>
</ttm:agent>
<otherns:x/>
<otherns:y/>
</metadata>
</head>
<body>
<div xml:id="se1" begin="3s" end="10s" ttm:agent="character_2" daptm:represents="audio.dialogue" daptm:onScreen="ON">
<ttm:desc daptm:descType="scene">high mountain valley</ttm:desc>
<metadata></metadata>
<p daptm:langSrc="en"><span>Look at this beautiful valley.</span></p>
</div>
</body>
</tt>