Add sectioner code #1

marcverhagen · 2018-02-27T14:56:09Z

Run this on a TarsqiDocument and have it add docelement tags.

Focus on some fairly generic heuristics for splitting and a few specific ones for the Thyme data. Generic heuristics to use:

Short lines with only a few words, not ending in a period, potentially starting with a number or some other indicator that something is a header.
Empty lines, which almost always indicate that the text above and below is not in the same sentence. This is similar to what the current document structure parser in https://github.com/tarsqi/ttk does.
Certain XML tags when available.
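
A minimal sketch of the first two generic heuristics, to make the idea concrete. The function names and the word-count threshold are assumptions for illustration only; nothing here is taken from the existing ttk code, and the numbering cue and XML tags are left out.

```python
# Hypothetical threshold; the right value would need tuning on real data.
MAX_HEADER_WORDS = 6


def looks_like_header(line, max_words=MAX_HEADER_WORDS):
    """Return True for a short, non-empty line that does not end in a period."""
    stripped = line.strip()
    if not stripped:
        return False
    return len(stripped.split()) <= max_words and not stripped.endswith('.')


def split_into_blocks(text):
    """Split text on empty lines and isolate header-like lines.

    Returns a list of (kind, text) pairs, where kind is 'header' or 'block'.
    """
    blocks = []
    current = []

    def flush():
        # Close the block collected so far, if there is one.
        if current:
            blocks.append(('block', '\n'.join(current)))
            del current[:]

    for line in text.splitlines():
        if not line.strip():
            flush()  # an empty line always ends the current block
        elif looks_like_header(line):
            flush()
            blocks.append(('header', line.strip()))
        else:
            current.append(line)
    flush()
    return blocks
```

Each resulting pair could then be turned into a docelement tag; how that is done depends on the TarsqiDocument interface and is not shown here.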

The Thyme data have many variations of the short line heuristic. There are enumerations:

Height=150.00 cm,
Weight=39.40 kg,
Height=59.06 [in_i],

Singer - Hospital Summary
Admission Date: 05-Dec-2005  Dismissal Date: 07-Dec-2005
Contributing Author: Kxixcaj Q. Oarvui

The Stanford splitter merges these lines into a single sentence, and a bad one that makes the parse explode, so we should at least avoid treating them as one sentence.

Section headers are also a variation of the short line heuristic:

[end section id="20104"]

These are also usually separated by whitespace, but not always, so adding the brackets to the heuristics for Thyme may be good.
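
One way to pick up the bracketed markers regardless of surrounding whitespace is to match them line by line. This is only a sketch: the pattern is derived from the single example above, and the `start` variant is an assumption, since only an `end` marker is shown here.

```python
import re

# Matches lines such as: [end section id="20104"]
# The "start" alternative is assumed, not confirmed by the example above.
SECTION_MARKER = re.compile(r'^\s*\[(start|end) section id="([^"]+)"\]\s*$')


def find_section_markers(text):
    """Return (line_index, kind, section_id) for every marker line in text."""
    markers = []
    for index, line in enumerate(text.splitlines()):
        match = SECTION_MARKER.match(line)
        if match:
            markers.append((index, match.group(1), match.group(2)))
    return markers
```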

Also, in Thyme, a couple of consecutive words in ALL CAPS at the beginning of a line (even a long line) are a strong indication that a new section may start there. This needs to be explored further.
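
As a starting point for that exploration, the ALL CAPS cue could be tested with a regular expression. The sketch below assumes "a couple" means at least two words and that each word has at least two capital letters (to skip "A" and "I"); both choices are guesses, and the function name is hypothetical.

```python
import re

# Two or more consecutive all-caps words at the very start of a line.
ALL_CAPS_START = re.compile(r'^([A-Z]{2,}(?:\s+[A-Z]{2,})+)\b')


def all_caps_prefix(line):
    """Return the leading run of all-caps words, or None if there is none."""
    match = ALL_CAPS_START.match(line)
    return match.group(1) if match else None
```

Abbreviations at the start of an ordinary sentence would also match, so this cue probably needs to be combined with other evidence.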

marcverhagen · 2019-03-06T17:37:46Z

Some of this was done in commit 57e26cf, but no effort has been made yet to separate the general heuristics from the THYME heuristics.
