
NLP Ruleset Format and Creation Guide

Andrew edited this page Dec 8, 2020 · 1 revision

Structure

Broadly speaking, an information extraction project folder has the following structure:

+ $YOUR_PROJECT_ROOT_FOLDER
+- norm/ (deprecated - avoid using)
+-- resources_norm_norm$SOMECONCEPT1desc.txt
+-- resources_norm_norm$SOMECONCEPT2desc.txt
+-- ...
+- regexp/ 
+-- resources_regexp_reSOMECONCEPT1.txt
+-- resources_regexp_reSOMECONCEPT2.txt
+-- ...
+- rules/
+-- resources_rules_matchrules.txt
+- used_resources.txt

To get started, make a copy of the template project folder path/to/template/folder/here and rename it to match your project name.

Defining terms lists

The regexp/ folder typically contains terms lists, each of which corresponds to a given disease category. For example, in the provided template, resources_regexp_reStroke.txt is a terms list that defines stroke.

Each line in the file is a separate representation of that term. You can define an arbitrary number of terms lists, as long as their file names follow the format resources_regexp_re$YOURCONCEPTNAME$.txt and each one is referenced in used_resources.txt.

Note

Terms lists are parsed line by line as regular expressions. Parentheses, brackets, certain punctuation, and other special regex characters may therefore behave unexpectedly. To use one of these characters literally in a search term, precede it with a backslash (\) so that the regex compiler does not treat it as a special character.
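The effect of escaping can be checked quickly outside the pipeline. The sketch below uses Python's re module purely for illustration; the search term and sample text are made up, and the regex flavour used by the extraction tool itself may differ in details.

```python
import re

# Hypothetical search term containing regex special characters.
term = "stroke (ischaemic)"
text = "History of stroke (ischaemic) in 2015."

# Unescaped: the parentheses are read as a capture group, so the pattern
# looks for "stroke ischaemic" and misses the literal parentheses.
unescaped = re.compile(term)

# Escaped by hand: each parenthesis is preceded by a backslash.
escaped = re.compile(r"stroke \(ischaemic\)")

# re.escape adds the backslashes automatically.
auto_escaped = re.compile(re.escape(term))

print(unescaped.search(text))
print(escaped.search(text).group(0))
print(auto_escaped.search(text).group(0))
```

Only the two escaped patterns find the term as written in the text.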

Defining rules

The second file of interest is resources_rules_matchrules.txt inside the rules folder.

This file contains a list of rules that provide a mapping from a regular expression to a certain concept type. Additionally, filtering can be done to restrict matches to within certain document sections.

There are two forms of rules: concept mention rules and remove (exclusion) rules.

Concept mention rules are rules that define how to extract concepts. They take on the form:

'RULENAME="cm_$YOUR_RULE_NAME$",REGEXP="$SOME_REGEX$",LOCATION="$SECTION_LIMIT$",NORM="$NORMALIZED_FORM$"'

where

$YOUR_RULE_NAME$ can be any descriptive name without a space

$SOME_REGEX$ is a regular expression that matches what you wish to extract

$SECTION_LIMIT$ and $NORMALIZED_FORM$ are described under "A few notes" below.

Remove rules

Remove rules are executed after concept mention rules. Any extracted concept that lies within the text matched by a remove rule is removed from the returned concept set.

Remove rules take on the form:

'RULENAME="rem_$YOUR_RULE_NAME$",REGEXP="$SOME_REGEX$",LOCATION="$SECTION_LIMIT$",NORM="REMOVE"'

For a naïve example, to deal with the sentence

"no previous history of cond1, cond2, cond3, cond4...."

one could write a remove rule of

'RULENAME="rem_01",REGEXP="\b((no previous history of)( )?(.+)(,\s*.+)*)$",LOCATION="NA",NORM="REMOVE"'

Even if cond1, cond2, cond3, cond4 are all extracted by some defined concept mention rule, they will not be returned as part of the results, because they are located within the match of the remove rule rem_01. (In this particular case, the context annotator should already mark it as a negative annotation, which makes the rule superfluous.)
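The containment behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the remove regex is the one from rem_01, while CONCEPT_RE, the condition names, and the extract function are hypothetical stand-ins for a concept mention rule and the matching engine.

```python
import re

# Remove rule regex copied from rem_01.
REMOVE_RE = re.compile(r"\b((no previous history of)( )?(.+)(,\s*.+)*)$")

# Hypothetical concept mention regex standing in for a cm_ rule.
CONCEPT_RE = re.compile(r"\b(stroke|diabetes)\b")


def extract(sentence):
    """Return concept matches that do not lie inside any remove-rule match."""
    concept_spans = [m.span() for m in CONCEPT_RE.finditer(sentence)]
    remove_spans = [m.span() for m in REMOVE_RE.finditer(sentence)]
    kept = []
    for start, end in concept_spans:
        # A concept is dropped when a remove span fully contains it.
        inside = any(r_start <= start and end <= r_end
                     for r_start, r_end in remove_spans)
        if not inside:
            kept.append(sentence[start:end])
    return kept


print(extract("no previous history of stroke, diabetes"))  # []
print(extract("admitted with stroke"))  # ['stroke']
print(extract("stroke confirmed; no previous history of diabetes"))  # ['stroke']
```

In the third sentence the remove match starts at "no previous history of", so only the mention before it survives.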

A few notes

References can be made to a terms list through the use of %re$YOURCONCEPTNAME$. For instance, to refer to the terms list located in regexp/resources_regexp_reStroke.txt, %reStroke can be used.

$SECTION_LIMIT$ can take on one of two forms:

"NA" - removes section filtering altogether (i.e. can match anywhere)

"SEC:~SECTION_CODE_1~SECTION_CODE_2~...~SECTION_CODE_N~" - matches only those section IDs. The IDs used are defined by your section tagging dictionary.

$NORMALIZED_FORM$ is the concept code/normalized form to associate with the given regular expression rule. If desired, references can be made back to the match found in the regex through the use of regular expression groups by including group~($groupNumber) in your normalized form.
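To make the pieces of a rule line concrete, the following Python sketch splits a rule into its fields, interprets the LOCATION value, and expands a %re reference into an alternation of the terms list's lines. Everything here is an assumption made for illustration: the rule cm_stroke, the section codes HPI and PMH, the terms in term_lists, and the helper functions are hypothetical, and the actual tool may resolve references differently.

```python
import re

# KEY="value" pairs; a value ends at a quote followed by the next ,KEY="
# or by the end of the line, so commas inside REGEXP are left intact.
FIELD_RE = re.compile(r'([A-Z]+)="(.*?)"(?=,[A-Z]+="|$)')


def parse_rule(line):
    """Split one rule line into a dict of its four fields."""
    return dict(FIELD_RE.findall(line.strip().strip("'")))


def parse_location(location):
    """Return None for "NA" (no section filtering), else a set of section codes."""
    if location == "NA":
        return None
    if not location.startswith("SEC:"):
        raise ValueError("unrecognised LOCATION: " + location)
    return {code for code in location[len("SEC:"):].split("~") if code}


def expand_terms(regex, term_lists):
    """Replace each %reNAME reference with an alternation of that list's lines."""
    def substitute(match):
        return "|".join(term_lists[match.group(1)])
    return re.sub(r"%(re\w+)", substitute, regex)


# Hypothetical contents of regexp/resources_regexp_reStroke.txt, one term per line.
term_lists = {"reStroke": ["stroke", "cerebrovascular accident", "CVA"]}

# Hypothetical concept mention rule with hypothetical section codes.
line = r'RULENAME="cm_stroke",REGEXP="\b(%reStroke)\b",LOCATION="SEC:~HPI~PMH~",NORM="STROKE"'

rule = parse_rule(line)
sections = parse_location(rule["LOCATION"])
expanded = expand_terms(rule["REGEXP"], term_lists)

print(rule["RULENAME"], sorted(sections))
print(expanded)
print(re.search(expanded, "known CVA in 2010").group(1))
```

Joining the lines with | is one plausible reading of how a terms list behaves inside a rule; since each line is itself a regular expression, any special characters in it still need escaping as described earlier.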