Xponents 3.5 Begin Again, Again

@mubaldino mubaldino released this 07 Feb 16:07
· 251 commits to master since this release

Happy Valentine's Day

Xponents 3.5.5 BeginAgain (Again)

  • Full Evaluation: internal evaluation work was redone from start to finish to home in on outlier gazetteer entries and
    patterns of rogue entries from new data sources. The evaluation called out and fixed serious false-positive and recall
    errors.
  • Log4J Remediation: While Log4J is not the primary choice of logging facility, it is a dependency that appears
    mainly in the Solr 7.x server distribution. Vulnerable Log4J JAR files were removed and replaced with the latest ones.
  • API Changes:
    • TextEntity is a text span and requires a (start, end) offset pair; its only constructor
      requires that pair. Subclasses can have a zero-argument constructor by exception, such as PoLiMatch.
    • GeonamesUtility.isCountry() now returns true only for PCLI entries; others are historical country names or territories.
    • REST API now has method and match-id on most matches, to be more consistent.
    • The codes feature can be requested in the REST API, for example features=geo,taxons,patterns,codes.
      For now, this emits tagged acronyms for administrative boundaries.
    • Xponents Core TextUtils now offers trivial text span testing for common punctuation.
      For example, a common punctuation test was needed to quickly check whether MARC & U looks like
      an entity or is a false positive when tagging the phrase Marc U. These are fairly obvious
      pre-filters to employ just after tagging and before serious reasoning happens.
  • Geocoding: Tamped down on acronym false-positives in UPPERCASE and lowercase
    documents, given that the added gazetteer data includes many codes.
    • Default behavior: country codes and province codes are NOT emitted, although they are tagged.
      The caller requests them explicitly using the codes feature. So USA, COD,
      or MA are not emitted by default, although those bare tokens may represent
      countries or provinces. Codes that qualify other place names are still emitted.
    • Gazetteer tagging omissions: numerous transliterated short names for Pacific/Asian islands (A xx, I-xx)
      and various other false-positive places are NOT tagged, although present in the gazetteer.
    • About 500 dictionary words in French, German and English were added to the stop-filter
      for tokens that are commonly not places, e.g., amend, adept.
  • Bugs Fixed:
    • Geocoder Rule HeatMap memory leak fixed
    • German is removed as a country -- it's a nationality or an adjective
    • Tagger will throw ExtractionException if it tags 100,000 or more locations from the gazetteer
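
The punctuation pre-filter mentioned under API Changes can be illustrated with a small sketch. This is not the Xponents TextUtils API; the function names below are hypothetical, and the rule (flag a match when the matched text contains punctuation that the dictionary phrase lacks) is an assumption about how such a test might work, based only on the MARC & U versus Marc U example.

```python
import unicodedata


def punctuation_in(text: str) -> set:
    """Collect the punctuation and symbol characters found in text."""
    # Unicode general categories starting with "P" (punctuation) or "S" (symbol).
    return {ch for ch in text if unicodedata.category(ch)[0] in ("P", "S")}


def looks_like_false_positive(matched_text: str, phrase: str) -> bool:
    """Hypothetical pre-filter: True if the matched text carries punctuation
    that does not appear in the dictionary phrase it was matched against."""
    return bool(punctuation_in(matched_text) - punctuation_in(phrase))


if __name__ == "__main__":
    # The tagger matched the phrase "Marc U" against the text "MARC & U".
    print(looks_like_false_positive("MARC & U", "Marc U"))
    # Punctuation that belongs to the phrase itself is not flagged.
    print(looks_like_false_positive("St. Louis", "St. Louis"))
```

A check like this is cheap enough to run on every match right after tagging, which is the point made above: discard the obvious mismatches before any heavier reasoning.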

DISTRIBUTIONS:

TESTING:

Deploy: https://github.com/OpenSextant/Xponents/blob/master/Examples/Docker/docker-compose.yml

Install client library (ATTACHED)

pip3 install opensextant-1.4.6.tar.gz

Use Test suite: https://github.com/OpenSextant/Xponents/blob/master/test/xlayer-test-suite.py

DEFAULT_URL=localhost:8787
python3 xlayer-test-suite.py   $DEFAULT_URL

Test output:

  • Consult the Docker logs for the container, e.g., docker logs xponents, to confirm that the server is alive
  • Review the console output -- unit test results for normal geotagging, postal geotagging and tests in Arabic and Japanese should appear.
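
Beyond the test suite, the codes feature described in the API changes can be tried with a hand-built request. The sketch below is only illustrative: the endpoint path and the JSON field names (docid, text) are assumptions, not confirmed by these notes, and the helper names are hypothetical. Only the features value, features=geo,taxons,patterns,codes, comes from the notes.

```python
import json
import urllib.request

# Assumption: this path is illustrative; check the server documentation.
ASSUMED_PATH = "/xlayer/rest/process"


def build_request(docid, text, features=("geo", "taxons", "patterns", "codes")):
    """Build a request body; features is sent in the comma-separated form
    quoted in the release notes. Field names are assumptions."""
    return {"docid": docid, "text": text, "features": ",".join(features)}


def post_request(host, payload, timeout=10):
    """POST the payload as JSON and decode the JSON reply.
    Requires a running server, e.g. the Docker deployment above."""
    req = urllib.request.Request(
        f"http://{host}{ASSUMED_PATH}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Build (but do not send) a request that asks for codes explicitly.
payload = build_request("doc-1", "We drove from Boston, MA to Springfield.")
print(json.dumps(payload))
```

With the server from the Deploy step running, post_request("localhost:8787", payload) would send the request. Dropping "codes" from the features tuple should correspond to the default behavior, in which bare country and province codes are tagged but not emitted.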