Add sectioner code #1

Open
marcverhagen opened this issue Feb 27, 2018 · 1 comment

Comments

@marcverhagen
Member

marcverhagen commented Feb 27, 2018

Run this on a TarsqiDocument and have it add docelement tags.

Focus on some fairly generic heuristics for splitting and a few specific ones for the Thyme data. Generic heuristics to use:

  • Short lines of only a few words that do not end in a period, possibly starting with a number or some other indicator that the line is a header.

  • Empty lines, which almost always indicate that the text above and the text below are not in the same sentence. This is similar to what the current document structure parser in https://github.com/tarsqi/ttk does.

  • Certain XML tags when available.
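The first two generic heuristics above could be sketched roughly as follows. This is only an illustration and not the code in ttk: the function names, the five-word threshold and the leading-indicator pattern are all assumptions, and the functions work on plain strings rather than on a TarsqiDocument.

```python
import re

# Hypothetical threshold and pattern, chosen for illustration only.
MAX_HEADER_WORDS = 5
LEADING_INDICATOR = re.compile(r"^\s*(?:\d+[.)]?|[-*•])\s+\S")


def looks_like_header(line, max_words=MAX_HEADER_WORDS):
    """Return True for a short, non-empty line that does not end in a period."""
    stripped = line.strip()
    if not stripped:
        return False
    if len(stripped.split()) > max_words:
        return False
    return not stripped.endswith(".")


def starts_with_indicator(line):
    """Return True if the line starts with a number or a bullet-like marker."""
    return LEADING_INDICATOR.match(line) is not None


def split_on_empty_lines(text):
    """Return (start, end) character offsets of blocks separated by empty lines."""
    blocks = []
    offset = 0
    start = None
    end = None
    for line in text.splitlines(keepends=True):
        if line.strip():
            if start is None:
                start = offset
            # The block ends at the last character before the line break.
            end = offset + len(line.rstrip("\r\n"))
        elif start is not None:
            blocks.append((start, end))
            start = None
        offset += len(line)
    if start is not None:
        blocks.append((start, end))
    return blocks
```

Returning character offsets rather than strings keeps the result easy to turn into tags anchored in the source text, which is what adding docelement tags would presumably need.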

The Thyme data have many variations of the short line heuristic. There are enumerations:

Height=150.00 cm,
Weight=39.40 kg,
Height=59.06 [in_i],
Singer - Hospital Summary
Admission Date: 05-Dec-2005  Dismissal Date: 07-Dec-2005
Contributing Author: Kxixcaj Q. Oarvui

The Stanford splitter puts all of these lines in a single sentence, and a bad one that explodes the parse, so at the very least we should not recognize this block as one sentence.
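One way to keep such an enumeration from being handed to the sentence splitter as a single unit is to emit each line separately when every line in the block is short and none ends in a period. The sketch below assumes exactly that rule; the function name and the eight-word limit are hypothetical and would need tuning on the Thyme data.

```python
def split_enumeration(block, max_words=8):
    """Split a block into one segment per line if it looks like an enumeration.

    A block counts as an enumeration here when it has more than one
    non-empty line, every line has at most max_words words, and no line
    ends in a period. Otherwise the block is returned as a single segment.
    """
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    if not lines:
        return []
    is_enumeration = len(lines) > 1 and all(
        len(line.split()) <= max_words and not line.endswith(".")
        for line in lines
    )
    if is_enumeration:
        return lines
    # Regular prose: rejoin the lines and leave splitting to the sentence splitter.
    return [" ".join(lines)]


example = """Height=150.00 cm,
Weight=39.40 kg,
Height=59.06 [in_i],
Singer - Hospital Summary
Admission Date: 05-Dec-2005  Dismissal Date: 07-Dec-2005
Contributing Author: Kxixcaj Q. Oarvui"""

segments = split_enumeration(example)
```

On the example from this issue, each of the six lines becomes its own segment, while a block of ordinary prose lines is passed through as one segment.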

Section headers are also a variation of the short line heuristic:

[end section id="20104"]

These are usually separated from the surrounding text by whitespace, but not always, so adding the brackets to the Thyme-specific heuristics may be a good idea.
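Because the markers are not always set off by whitespace, matching on the brackets themselves seems safer than relying on line boundaries. A minimal sketch, assuming that markers follow the shape of the example above and that a matching `start` form also exists (only the `end` form is shown in this issue):

```python
import re

# Pattern modelled on the single example in this issue; the "start"
# alternative is an assumption.
SECTION_MARKER = re.compile(r'\[(?P<kind>start|end) section id="(?P<id>[^"]+)"\]')


def find_section_markers(text):
    """Return (start, end, kind, id) tuples for every marker found in text."""
    return [
        (match.start(), match.end(), match.group("kind"), match.group("id"))
        for match in SECTION_MARKER.finditer(text)
    ]


sample = 'Some text.[end section id="20104"]\nMore text'
markers = find_section_markers(sample)
```

Since `finditer` scans the whole string, a marker glued to the preceding sentence (as in `sample`) is still found, and its offsets can be used as a split point.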

Also, in Thyme, a couple of consecutive words in ALL CAPS at the beginning of a line (even a long line) are a strong indication that a section may start there. This needs to be explored further.
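As a starting point for that exploration, the ALL CAPS cue could be tested with a regular expression like the one below. The definition used here (two or more consecutive words of at least two uppercase ASCII letters each) is an assumption, and the test strings in the usage lines are made up rather than taken from the Thyme data.

```python
import re

# Two or more consecutive all-caps words at the start of a line.
# Requiring at least two letters per word is an assumption that keeps
# single capitals such as "A" or "I" from counting.
ALL_CAPS_PREFIX = re.compile(r"^([A-Z]{2,}(?:[ \t]+[A-Z]{2,})+)\b")


def all_caps_prefix(line):
    """Return the leading run of all-caps words, or None if there is none."""
    match = ALL_CAPS_PREFIX.match(line)
    return match.group(1) if match else None


header = all_caps_prefix("HISTORY OF PRESENT ILLNESS: The patient is well.")
no_header = all_caps_prefix("MRI was done.")
```

The function returns the matched prefix rather than a boolean so that the candidate header text can be inspected while the heuristic is being evaluated. A single capitalized abbreviation at the start of a sentence does not match.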

@marcverhagen
Member Author

Some of this was done in commit 57e26cf, but no effort has been made yet to separate the general heuristics from the THYME-specific ones.
