diff --git a/schedule/week_2/images/tag_annotation_update.png b/schedule/week_2/images/tag_annotation_update.png new file mode 100644 index 00000000..5c71edcc Binary files /dev/null and b/schedule/week_2/images/tag_annotation_update.png differ diff --git a/schedule/week_2/images/tag_discontinuity_update.png b/schedule/week_2/images/tag_discontinuity_update.png new file mode 100644 index 00000000..f41efb2a Binary files /dev/null and b/schedule/week_2/images/tag_discontinuity_update.png differ diff --git a/schedule/week_2/images/tag_markup_update.png b/schedule/week_2/images/tag_markup_update.png new file mode 100644 index 00000000..81890005 Binary files /dev/null and b/schedule/week_2/images/tag_markup_update.png differ diff --git a/schedule/week_2/images/tag_no_markup_update.png b/schedule/week_2/images/tag_no_markup_update.png new file mode 100644 index 00000000..8f358ead Binary files /dev/null and b/schedule/week_2/images/tag_no_markup_update.png differ diff --git a/schedule/week_2/images/tag_overlap_update.png b/schedule/week_2/images/tag_overlap_update.png new file mode 100644 index 00000000..6e764487 Binary files /dev/null and b/schedule/week_2/images/tag_overlap_update.png differ diff --git a/schedule/week_2/tag.md b/schedule/week_2/tag.md index 6ace0091..9e8d5dc1 100644 --- a/schedule/week_2/tag.md +++ b/schedule/week_2/tag.md @@ -1,22 +1,103 @@ # Text As Graph (TAG) -The Text As Graph (TAG) model conceptualizes of documents as a *hypergraph* for text. +The Text As Graph (TAG) model conceptualizes documents as a *hypergraph* for text. Because you may be unfamiliar with the hypergraph model, we will briefly outline its key features. Like any *graph*, a hypergraph consists of nodes and edges, where the edges connect one node to another. A hypergraph contains both regular edges, which connect one node to one other node, and *hyperedges*, which connect a *set of nodes* to another *set of nodes*. 
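The distinction between regular edges and hyperedges can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not part of TAG or of any TAG implementation: the names `Hyperedge`, `Hypergraph`, `connect`, and `edges_touching`, as well as the node labels, are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    """Connects one set of nodes to another set of nodes (hypothetical class)."""
    left: frozenset
    right: frozenset

    def is_regular(self):
        # A regular edge is the special case with exactly one node on each side.
        return len(self.left) == 1 and len(self.right) == 1


@dataclass
class Hypergraph:
    """A bag of nodes plus the (hyper)edges that connect them."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def connect(self, left, right):
        # Store both sides as frozensets so the edge itself is immutable.
        edge = Hyperedge(frozenset(left), frozenset(right))
        self.nodes |= edge.left | edge.right
        self.edges.append(edge)
        return edge

    def edges_touching(self, node):
        # An edge touches a node if the node appears on either side.
        return [e for e in self.edges if node in e.left or node in e.right]


graph = Hypergraph()
regular = graph.connect({"a"}, {"b"})              # one node to one other node
hyper = graph.connect({"label"}, {"a", "b", "c"})  # one set of nodes to another set
```

Treating a regular edge as a hyperedge with one node on each side keeps the sketch small; a real implementation may well represent the two kinds of edge differently.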
-As you may be unfamiliar with a hypergraph model, we will briefly outline its key points. A *graph* consists of nodes and edges, where the edges connect one node to another. A hypergraph contains both regular edges, which connect one node to one other node, and *hyperedges*, which connect a *set of nodes* to another *set of nodes*. For more information about modeling text as a hypergraph see [“It’s more than just overlap: Text As Graph”](https://www.balisage.net/Proceedings/vol19/html/Dekker01/BalisageVol19-Dekker01.html) +You can find more information about modeling text as a hypergraph in [“It’s more than just overlap: Text As Graph”](https://www.balisage.net/Proceedings/vol19/html/Dekker01/BalisageVol19-Dekker01.html). For now, keep in mind that the hypergraph is a powerful data structure that allows you to represent a greater quantity of textual information in an inclusive and more refined way. ## Why are we looking at TAG? -The only document model in wide use in digital editions projects is XML, which is the only technology that has sufficient maturity and a sufficiently large community to be practical for general production purposes. The reason we nonetheless introduce TAG (and [LMNL](lmnl_syntax.md)) is that looking at non-XML ways of modeling documents encourages developers to think first about the model, and then about the relationship of the model to the syntax. In other words, thinking about how to model documents in LMNL and TAG can improve the quality of our models, whether we use XML or an alternative. +The only document model in wide use in digital editions projects is XML, which is the only technology that has sufficient maturity and a sufficiently large community to be practical for general production purposes. The reason we nonetheless introduce TAG (and [LMNL](lmnl_syntax.md)) is that looking at non-XML ways of modeling documents encourages us to think first about the model, and then about the relationship of the model to the syntax. 
In other words, thinking about how to model documents in LMNL and TAG can improve the quality of our models, whether we use XML or an alternative. -This tutorial is based on the first version of TAG, which has undergone several changes since it was introduced (e.g., directed Text-to-Text edges have been replaced by undirected ones; TAGML serves as a markup language for TAG, etc.). The differences are not important for the purpose of using TAG to encourage critical thinking about document modeling, so we have kept the original description, even though it has now been superseded in some respects. Up-to-date information about TAG is maintained at the [TAG portal on GitHub](https://github.com/HuygensING/tag). +A study of TAG's features, therefore, serves the purpose of encouraging critical thinking about document modeling. The TAG model is under active development, so in the following paragraphs we will not discuss its syntax, query language or schema, but focus on the properties of the data model. -## TAG counterparts to XML `text()` nodes +## TAG edges +The TAG data model distinguishes a number of different edges; below we describe just the two main ones. -The text in a TAG document is a sequence of Text nodes, where the sequence begins with a Document node. The simplest TAG document, which contains only text and no markup, looks something like: +### TAG undirected edges +All edges in TAG's hypergraph are undirected. The graph models you may be more familiar with, such as a variant graph, have directed edges. This means that an edge can only be traversed in one direction, from node A to node B. Undirected edges, conversely, can be traversed in both directions. -![](images/tag_no_markup.png) +### TAG hyperedges +TAG uses hyperedges to associate markup with its textual content. Hyperedges can connect one or more nodes with each other, in contrast to regular edges, which connect one node to one other node. This means that there can be multiple Markup nodes on one Text node. 
An example of a hyperedge is given below. -## TAG Markup-to-Text hyperedges +## TAG Nodes + +The TAG model distinguishes four kinds of nodes in the hypergraph. They are briefly described below and illustrated using a simple example. + +### TAG Document node +The Document node represents a single TAG document. It marks the start of a sequence of Text nodes and serves as a root node. See the root node in the image below. + +### TAG Text nodes + +A Text node represents (a part of) the textual content of the document. Whitespace is included in the textual content. The simplest TAG document, which contains only text and no markup, looks something like: + +![](images/tag_no_markup_update.png) + +### TAG Markup nodes +Markup nodes store the name of the markup. They are connected to one or more Text nodes with a hyperedge. In the figure below, the hyperedge connects the Markup node `verb` with the Text node containing `est`. + +![](images/tag_markup_update.png) + +### TAG Annotation nodes +Annotations in TAG are comparable to XML attributes. Information is stored as a key:value pair. Annotation nodes have two properties: the name of the annotation (the key) and the value of the annotation (the value). + +Below is an illustration of an annotation with the key `POS` and the value `fin` on the Markup node: + +![](images/tag_annotation_update.png) + + +## Modeling text in TAG +We mentioned above that the properties of the hypergraph for text data model cater for the modeling of complex text features. In other words: what's hard in XML is not hard in TAG. + +Let's take a look at the textual examples we used when illustrating [the limitations of XML](https://github.com/Pittsburgh-NEH-Institute/Institute-Materials-2017/blob/master/schedule/week_2/xml_limitations.md) and see how they translate to TAG. + +### Overlap + +Consider a fragment of Percy Bysshe Shelley’s “Ozymandias” (1818): + + +> Who said—“Two vast and trunkless legs of stone +> Stand in the desart ... 
+ +What in an XML transcription leads to overlapping structures, and thus to XML that is not well formed: + +```xml +Who said — “Two vast and trunkless legs of stone +Stand in the desart…. +``` + +is easily expressed in TAG: + +![](images/tag_overlap_update.png) + +The phrase “Two vast and trunkless legs of stone stand in the desart” is split between two lines, each of which also contains other phrases. There is no valid way to mark this up in XML except by prioritizing one hierarchy (phrases or lines) and representing the other with empty milestones. In TAG, however, neither hierarchy is primary; phrases and lines both contain Text nodes, and both types of relationships are encoded in the same way. (See also a [complete graphic representation of “Ozymandias”](images/ozymandias_hypergraph.svg), generated by Alexandria.) + +### Discontinuity + +```xml +"and what is the use of a book," thought Alice "without pictures or conversation?" +``` + +can be modeled in TAG: + +![](images/tag_discontinuity_update.png) + +## TAG syntax +TAGML stands for _TAG Markup Language_ and, as a syntax, it is a serialization of the TAG model. It is designed to represent all features of a text in a straightforward manner. + +A simple TAGML example is: + +``` +[line>The rain in Spain falls mainly on the plain.` being the start-tag and the `