Xponents 3.5 Begin Again, Again
Happy Valentines
Xponents 3.5.5 BeginAgain (Again)
- Full Evaluation: internal evaluation work was redone start to finish to hone outlier gazetteer entries and patterns of rogue entries from new data sources. The evaluation called out and fixed serious false-positive and recall errors.
- Log4J Remediation: While Log4J is not the primary logging facility, it is a dependency that appears mainly in the Solr 7.x server distribution. Vulnerable Log4J JAR files were removed and replaced with the latest ones.
- API Changes:
  - `TextEntity` is a text span and requires a start, end offset pair. Only its constructor requires that pair. Subclasses such as `PoLiMatch` can have a zero-argument constructor by exception.
  - `GeonamesUtility.isCountry()` now returns true only for `PCLI` entries; others are historical country names or territories.
  - REST API now has `method` and `match-id` on most matches, to be more consistent.
  - The `codes` feature can be requested in the REST API: `features=geo,taxons,patterns,codes`, for example. For now this emits tagged acronyms for administrative boundaries.
- Xponents Core `TextUtils` now offers trivial text span testing for common punctuation. For example, a common punctuation test was needed to quickly check whether `MARC __&__ U` looks like an entity or is a false positive when tagging the phrase `Marc U`. These are fairly obvious pre-filters to employ just after tagging and before serious reasoning happens.
- Geocoding: Tamped down on acronym false positives in UPPERCASE and lowercase documents, given that the added gazetteer data includes lots of codes.
  - Default behavior: country codes and province codes are NOT emitted, although they are tagged. The caller requests them explicitly using the `codes` feature. So `USA`, `COD` or `MA` are not emitted by default, although those bare tokens may represent countries or provinces. Such codes qualifying other place names will be emitted.
  - Gazetteer tagging omissions: numerous transliterated short names for Pacific/Asian islands (`A xx`, `I-xx`) and various other false-positive places are NOT tagged, although present in the gazetteer.
  - About 500 dictionary words in French, German and English were added to the stop-filter for tokens that are commonly not places, e.g., `amend`, `adept`, etc.
- Bugs Fixed:
  - Geocoder Rule `HeatMap` memory leak fixed.
  - `German` is removed as a country; it is a nationality or an adjective.
  - Tagger will throw `ExtractionException` if it tags 100,000 or more locations from the gazetteer.
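
The two post-tagging filters described in the list above (the punctuation test and the default suppression of bare codes) can be pictured with a short sketch. It is written in Python for brevity and is not the Java `TextUtils` or geocoder code: the `Match` class, the function names and the set of punctuation characters are all hypothetical, chosen only to illustrate the behavior the notes describe.

```python
import re
from dataclasses import dataclass

# Hypothetical set of punctuation that rarely occurs inside a real place name.
# Hyphens, apostrophes and periods are left out because names such as
# "Winston-Salem" or "St. Louis" use them legitimately.
_IRREGULAR_PUNCT = re.compile(r"[&_/\\|*+=<>{}\[\]()@#%^~:;!?]")


@dataclass
class Match:
    """Hypothetical stand-in for a tagger match."""
    text: str                      # raw text of the matched span
    is_code: bool = False          # country/province code such as "MA"
    qualifies_place: bool = False  # code qualifying a name, as in "Boston, MA"


def has_irregular_punct(span_text: str) -> bool:
    """True if the span contains punctuation hinting at a false positive."""
    return _IRREGULAR_PUNCT.search(span_text) is not None


def emit(matches, features=("geo",)):
    """Apply both filters and return the matches that would be emitted."""
    want_codes = "codes" in features
    kept = []
    for m in matches:
        if has_irregular_punct(m.text):
            continue  # e.g. "MARC __&__ U" matched against "Marc U"
        if m.is_code and not (want_codes or m.qualifies_place):
            continue  # bare code, and the caller did not ask for codes
        kept.append(m)
    return kept
```

With this sketch, a bare `USA` is dropped unless `codes` is among the requested features, while `MA` in "Boston, MA" is kept either way because it qualifies another place name.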
DISTRIBUTIONS:
- Python: See attached Opensextant Python API 1.4.6
- Docker: https://hub.docker.com/r/mubaldino/opensextant - see the "xponents-3.5" tag. `latest` is now also a tag.
- Gazetteer: see the Docker image; copy `xponents-solr` out of the Docker image to use it outside of Docker.
- Java, Maven:
TESTING:
Deploy: https://github.com/OpenSextant/Xponents/blob/master/Examples/Docker/docker-compose.yml
Install client library (ATTACHED)
pip3 install opensextant-1.4.6.tar.gz
Use Test suite: https://github.com/OpenSextant/Xponents/blob/master/test/xlayer-test-suite.py
DEFAULT_URL=localhost:8787
python3 xlayer-test-suite.py $DEFAULT_URL
Test output:
- Consult the logs of the Docker container, e.g. `docker logs xponents`, to see that the server is alive.
- Review the console output: unit test results for normal geotagging, postal geotagging, and tests in Arabic and Japanese should appear.
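
As a complement to reading the logs, a small manual smoke test can confirm that something is listening on the port and then send one request that opts in to the `codes` feature. This is only a sketch using the Python standard library, not the attached client library. The endpoint path `/xlayer/rest/process` and the JSON keys `docid` and `text` are assumptions to verify against the REST documentation for your version; the `features` value is the one quoted in these notes.

```python
import json
import socket
import urllib.request

# Assumed endpoint path; verify against the REST documentation.
PROCESS_PATH = "/xlayer/rest/process"


def parse_host_port(url: str, default_port: int = 8787):
    """Split 'host:port' (the DEFAULT_URL form) into (host, port)."""
    host, _, port = url.partition(":")
    return host, int(port) if port else default_port


def is_listening(url: str, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds.

    This only shows that the port is open, not that the gazetteer is loaded.
    """
    host, port = parse_host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_request(server: str, text: str,
                  features=("geo", "taxons", "patterns", "codes"),
                  docid: str = "smoke-test-1") -> urllib.request.Request:
    """Build a POST request; the payload keys are assumptions."""
    payload = {"docid": docid, "text": text, "features": ",".join(features)}
    return urllib.request.Request(
        url=f"http://{server}{PROCESS_PATH}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(server: str = "localhost:8787"):
    """Return the decoded JSON reply, or None if nothing is listening."""
    if not is_listening(server):
        return None
    req = build_request(server, "We drove from Boston, MA to the USA border.")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Calling `smoke_test()` against a running container should return the parsed reply; it returns `None` when the port is closed.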