Add vrt-augment-name-attrs: Add ne structures based on NER attributes #19

Open
wants to merge 18 commits into
base: master

@janiemi (Collaborator) commented Nov 18, 2024

Add vrt-augment-name-attrs, a VRT tool to add ne structures (elements) and their attributes (annotations) based on positional NER attributes (nertag2, nertags2, nerbio2) as produced by vrt-finnish-nertag.

Usage:

usage: vrt-augment-name-attrs [-h]
                              [--out file | --in-place | --backup bak | --in-sibling EXT]
                              [--version] [--word attr] [--lemma attr] [--nertag attr]
                              [--multi-nertag attr] [--maximal-only]
                              [file]

Augment VRT input containing positional name attributes with <ne> structures with
attributes. The tool expects a positional attribute to contain NER tags for maximal names,
matching the regular expression "(Ena|Nu|Ti)mex[A-Z][a-z]+[A-Z][a-z]+-[BEF]", as produced by
vrt-finnish-nertag. Another attribute can contain a set of NER tags for possible nested
names, with "-N" appended where N is the nesting level.

positional arguments:
  file                  input file (default stdin)

options:
  [… Standard VRT tool options …]
  --word attr           positional attribute name for word form is attr (default: "word")
  --lemma attr          positional attribute name for base form (lemma) is attr; if the
                        input does not contain attr, use word forms in the place of lemmas
                        (specify "" to suppress a warning) (default: "lemma")
  --nertag attr         positional attribute name for maximal NER tag is attr (default:
                        "nertag2")
  --multi-nertag attr   positional attribute name for multiple (nested, non-maximal) NER
                        tags is attr (default: "nertags2")
  --maximal-only        enclose only maximal names, no nested ones

Tests are included for the tool.
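
For reviewers, the tag format described in the usage text can be checked with a small standalone sketch. This is not the code from vrt_augment_name_attrs.py: the function name `parse_ner_tag` and the returned tuple are hypothetical, and reading the "-N" suffix as an integer nesting level (with `None` for maximal tags) is an assumption. Only the regular expression itself is taken from the help text and the commit messages.

```python
import re

# Regular expression from the tool description, split into named groups.
# The optional "-N" suffix is the nesting level used in the multi-tag attribute.
NER_TAG_RE = re.compile(
    r"(?P<type>(?:Ena|Nu|Ti)mex[A-Z][a-z]+[A-Z][a-z]+)"
    r"-(?P<flag>[BEF])"
    r"(?:-(?P<level>[0-9]+))?"
)


def parse_ner_tag(value):
    """Split a NER tag value into (type, flag, level).

    Hypothetical helper: returns None for "no tag" ("_" or an empty
    string) and raises ValueError for a value that does not match.
    """
    if value in ("", "_"):
        return None
    match = NER_TAG_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid NER tag: {value}")
    level = match.group("level")
    return (
        match.group("type"),
        match.group("flag"),
        int(level) if level is not None else None,
    )
```

With this sketch, `parse_ner_tag("EnamexPrsHum-B")` yields the type, the flag and no level, while a value such as `"EnamexLocGpl-F-2"` also yields the nesting level 2.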

vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Handle intra-name tags and comments so that they are kept inside the
  name. However, cases such as "<tag> nameword1 </tag> nameword2" are
  tagged as "<tag> <ne> nameword1 </tag> nameword2 </ne>" and "nameword1
  <tag> nameword2 </tag>" as "<ne> nameword1 <tag> nameword2 </ne>
  </tag>", which is not optimal nesting.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Add options --word, --lemma and --nertag for specifying the names of
  the corresponding positional attributes.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Warn on invalid NER tags and on NER end tags without a start tag.
- Fix a crash on an empty NER tag value.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Add nested <ne> structures for NER tags listed in positional attribute
  nertags2 (name can be specified with --multi-nertag), unless
  --maximal-only is specified.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Fix a crash when a nested name ends at the last word of a maximal
  name.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- VrtNameAttrAugmenter.main: Extract nested function set_attrnums to
  make the main loop code more compact.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Treat a completely empty nertag attribute value in the same way as
  "_" (no NER tag).
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Check the validity of NER tag values more precisely: they must match
  the regexp "(Ena|Nu|Ti)mex[A-Z][a-z]+[A-Z][a-z]+-[BEF](-[0-9]+)?".
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Warn if a name has no end tag within a sentence.
- Warn if a name is open at the end of input (which means that the input
  is incomplete, as a sentence is open, too).
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Allow input without a lemma attribute, in which case the word form is
  used instead. Nevertheless, warn if the input has no lemma, unless
  --lemma="" is specified.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Fix the warning "No NER end tag for ... within sentence" to mention
  the NER tag that is currently open. (Previously, it indicated the
  value of the nertag attribute of the last word of the sentence.)
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Reword the warning for a missing end tag to 'NER start tag without end
  tag within sentence: "..."' to be more consistent with other similar
  warnings. Also change the line number in the warning to refer to the
  line of the start tag.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Check that NER start and end tag types match, output <ne> structures
  only if they match and warn on mismatches. This is done for both
  maximal and nested NER tags.
- For nested NER tags, also warn if a tag is already open at the same
  level as a new start tag.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Show the line number of the actual tag in nested tag warning messages,
  instead of the last line number of the maximal name.
vrt-tools/libvrt/tools/vrt_augment_name_attrs.py:
- Do not output double quotation marks around tag names in warning
  messages.
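
The start/end checks listed in the commit messages above can be pictured with a simplified sketch for maximal names within one sentence. It is an illustration under assumptions, not the tool's implementation: the function name `find_maximal_names`, the span tuples and the warning wording are hypothetical, and it assumes that "-B" opens a name, "-E" closes it and "-F" marks a single-token name, with "_" or an empty value on all other tokens. Nested tags, intra-name structural tags and line numbers are left out.

```python
def find_maximal_names(tags):
    """Return (spans, warnings) for the NER tag values of one sentence.

    tags is a list with one tag value per token. Each span is a tuple
    (first_token, last_token, name_type) with 0-based token indices.
    A span is produced only when start and end tag types match.
    """
    spans = []
    warnings = []
    open_type = None   # type of the currently open name, if any
    open_start = None  # token index of its start tag
    for i, value in enumerate(tags):
        if value in ("", "_"):
            continue  # no NER tag on this token
        name_type, _, flag = value.rpartition("-")
        if flag == "F":
            spans.append((i, i, name_type))
        elif flag == "B":
            if open_type is not None:
                warnings.append(f"start tag without end tag: {open_type}")
            open_type, open_start = name_type, i
        elif flag == "E":
            if open_type is None:
                warnings.append(f"end tag without start tag: {name_type}")
            elif open_type != name_type:
                warnings.append(
                    f"end tag {name_type} does not match start tag {open_type}"
                )
            else:
                spans.append((open_start, i, name_type))
            open_type = open_start = None
    if open_type is not None:
        # The name is still open at the end of the sentence
        warnings.append(f"start tag without end tag: {open_type}")
    return spans, warnings
```

For a sentence tagged `["EnamexPrsHum-B", "_", "EnamexPrsHum-E"]`, the sketch returns one span covering tokens 0 to 2 and no warnings. A mismatched or missing end tag gives no span and one warning.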