Run this on a TarsqiDocument and have it add docelement tags.
Focus on some fairly generic heuristics for splitting and a few specific ones for the Thyme data. Generic heuristics to use:
Short lines with only a few words that do not end in a period and that may start with a number or some other indicator of a header.
Empty lines, which almost always indicate that the text above and below do not belong to the same sentence. This is similar to what the current document structure parser in https://github.com/tarsqi/ttk does.
Certain XML tags when available.
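A minimal sketch of the first two generic heuristics, to make the idea concrete. The names `looks_like_header` and `split_blocks` and the five-word threshold are assumptions for illustration and are not taken from ttk; the leading-number cue and the XML tags are left out.

```python
# Hypothetical threshold: a line with at most this many words counts as "short".
MAX_HEADER_WORDS = 5


def looks_like_header(line):
    """Short-line heuristic: few words and no final period."""
    stripped = line.strip()
    if not stripped:
        return False
    if len(stripped.split()) > MAX_HEADER_WORDS:
        return False
    return not stripped.endswith(".")


def split_blocks(text):
    """Split text at empty lines and put header-like lines in their own block."""
    blocks, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current block.
            if current:
                blocks.append(current)
                current = []
        elif looks_like_header(line):
            # A header-like line closes the current block and stands alone.
            if current:
                blocks.append(current)
                current = []
            blocks.append([line])
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return ["\n".join(block) for block in blocks]
```

Each returned block could then be handed to the sentence splitter separately, so that a sentence never spans a header or an empty line.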
The Thyme data have many variations of the short line heuristic. There are enumerations:
Height=150.00 cm,
Weight=39.40 kg,
Height=59.06 [in_i],
The Stanford splitter puts these in a single sentence, and a bad one that explodes the parse, so we should at least not treat the whole enumeration as one sentence.
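One possible way to keep such enumerations apart, sketched under the assumption that every item sits on its own line and has the form `key=value` with no spaces around the equals sign. The pattern `ENUM_ITEM` and the function `split_units` are hypothetical names, and the pattern has only been checked against the three lines above.

```python
import re

# Assumed item shape: a key of one to three words, "=", then a value.
ENUM_ITEM = re.compile(r"^\s*[A-Za-z]\w*(?: \w+){0,2}=\S")


def split_units(text):
    """Return enumeration items as separate units and join other lines."""
    units, pending = [], []
    for line in text.splitlines():
        if ENUM_ITEM.match(line):
            # Flush ordinary text collected so far, then emit the item alone.
            if pending:
                units.append(" ".join(pending))
                pending = []
            units.append(line.strip())
        elif line.strip():
            pending.append(line.strip())
    if pending:
        units.append(" ".join(pending))
    return units
```

Running text between enumerations is joined back together here only so that it can be passed to the regular sentence splitter as before.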
Section headers are also a variation of the short line heuristic:
[end section id="20104"]
These are usually separated from the surrounding text by whitespace, but not always, so adding the brackets to the heuristics for Thyme may be good.
Also, in Thyme, a couple of consecutive words in ALL CAPS at the beginning of a line (even a long line) are a strong indication that a section may start there. This needs to be explored further.
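The two Thyme-specific cues could be tried out with something like the sketch below. The patterns are guesses based on the single marker shown above: accepting `start` next to `end` is an assumption, as is reading "a couple of words" as at least two words of two or more capital letters each. The name `thyme_line_type` is hypothetical.

```python
import re

# Bracketed marker on a line of its own; the "start" variant is assumed.
SECTION_MARKER = re.compile(r'^\s*\[(?:start|end) section id="\d+"\]\s*$')

# Two or more ALL CAPS words at the very beginning of a line.
ALL_CAPS_START = re.compile(r"^[A-Z]{2,}(?:\s+[A-Z]{2,})+\b")


def thyme_line_type(line):
    """Classify a line as 'section-marker', 'caps-header' or 'text'."""
    if SECTION_MARKER.match(line):
        return "section-marker"
    if ALL_CAPS_START.match(line):
        return "caps-header"
    return "text"
```

A line classified as `caps-header` may still contain running text after the capitalized words, so where exactly to split such a line remains open.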