Releases: OpenSextant/Xponents
Xponents Core Python API v1.6
Please see related repo:
Python API release here - https://github.com/OpenSextant/Xponents-Core/releases/tag/python-v1.6.2
This tag, v3.7.4, is for the Xponents REST server tested with the python library.
Also tested against v3.5.
OpenSextant Python API
OpenSextant ("Xponents") Python library
This library provides the most common data model, utilities and Xponents REST client to interact with OpenSextant/Xponents solutions. The main resources interfaced here are:
- Xponents REST API (opensextant.xlayer package offers XlayerClient)
- Xponents Gazetteer API (opensextant.gazetteer package offers Solr Gazetteer mechanics)
- Xponents Gazetteer ETL (opensextant.gazetteer offers classes used to curate the gazetteer in ./solr from raw sources)
Version: opensextant library, v1.5 (2024-March), attached below.
Install: pip install opensextant-1.5.8.tar.gz
Details
- Xponents REST usage for Python
- Gazetteer queries, for example; see the Python section at bottom.
- opensextant API pydoc
For the REST client below, please deploy the Docker image per https://hub.docker.com/r/mubaldino/opensextant
Use the resulting server_host:port as url below.
Correct usage for opensextant.xlayer with REST API looks like this:
from opensextant.xlayer import XlayerClient
# Assumption: PlaceCandidate is importable from the top-level opensextant package.
from opensextant import PlaceCandidate

# url, docid and text are supplied by the caller.
client = XlayerClient(url)  # opensextant server URL or simply "server_host:port"
tags = client.process(docid, text, features=["geo", "postal"])
# Confidence threshold = 20
# Array of opensextant.TextMatch,
# -- Consult subclass -- PlaceCandidate is a match that has geographic information
# -- Consult label to determine nature of tag. It is one of "country", "coord", "postal", "place"
# -- Consult TextMatch.attrs dictionary for useful metadata, e.g., as shown geolocation "confidence" should be used wisely.
#    100 point scale, where 20 is a default cut-off (below that, a tag is unlikely to be a location or to be correct).
# -- Consult PlaceCandidate metadata in attrs as well as the place attribute for location metadata.
for t in tags:
    if t.filtered_out:
        # Add "filtered" to features to see what is filtered out.
        continue
    conf = int(t.attrs.get("confidence", -1))
    if isinstance(t, PlaceCandidate):
        if t.label == "coord":
            print("Found a coordinate")
        if conf >= 25:
            print("Found a high confidence place")
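The loop above can be wrapped into a small reusable helper. The following is a minimal sketch, not part of the opensextant library: the function name confident_places and the stand-in tags built with SimpleNamespace are hypothetical, and the only fields assumed are the filtered_out flag and the attrs "confidence" value used in the usage example.

```python
from types import SimpleNamespace


def confident_places(tags, min_confidence=20):
    """Return tags that are not filtered out and meet the confidence cut-off.

    Hypothetical helper: it relies only on the filtered_out flag and the
    attrs["confidence"] value used in the usage example.
    """
    selected = []
    for t in tags:
        if t.filtered_out:
            continue
        conf = int(t.attrs.get("confidence", -1))
        if conf >= min_confidence:
            selected.append(t)
    return selected


# Stand-in objects so the sketch runs without a server;
# real tags would come from client.process().
demo_tags = [
    SimpleNamespace(text="Boston", label="place", filtered_out=False, attrs={"confidence": 45}),
    SimpleNamespace(text="It", label="place", filtered_out=True, attrs={"confidence": 10}),
    SimpleNamespace(text="Ma", label="place", filtered_out=False, attrs={"confidence": 15}),
]
demo_selected = [t.text for t in confident_places(demo_tags)]
print(demo_selected)
```

Tags that carry no confidence value default to -1 in this sketch and are dropped, which may not be what you want for every tag type.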
Xponents v3.7 baseline
Xponents 3.7 provides these refinements on tagging:
- filtering noise and abbreviations
- enhanced tag filtering for CJK and Arabic language groups, as well as improved Spanish stopwords
- bare country codes and bare, short administrative codes are omitted, e.g., UK, Uk, or uk is tagged as a country, but filtered out if it is not qualified/preceded by a city or province.
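The bare-code rule in the last item can be illustrated with a short sketch. This is not the Xponents implementation: the Tag dataclass, its kind field and filter_bare_codes are hypothetical, and "preceded by" is simplified to mean the immediately preceding tag in the list, ignoring text offsets.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    """Hypothetical stand-in for a tagger match; not the opensextant TextMatch class."""
    text: str
    kind: str                  # "city", "province" or "code" in this sketch
    filtered_out: bool = False


def filter_bare_codes(tags):
    """Mark a code as filtered out unless the preceding tag is a city or province."""
    previous = None
    for tag in tags:
        if tag.kind == "code":
            qualified = previous is not None and previous.kind in ("city", "province")
            tag.filtered_out = not qualified
        previous = tag
    return tags


qualified_run = filter_bare_codes([Tag("London", "city"), Tag("UK", "code")])
bare_run = filter_bare_codes([Tag("uk", "code")])
print([(t.text, t.filtered_out) for t in qualified_run])
print([(t.text, t.filtered_out) for t in bare_run])
```

The code is still tagged in both runs; only the filtered_out flag differs, which mirrors the "tagged, but filtered out" wording above.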
Library cleanup
- Converge geodesy and giscore libraries from opensextant under Xponents, rather than as separate dependencies. Java compatibility concerns, long term.
- Latest Apache Commons and logging libraries updated
- Updated opensextant python support library released at https://github.com/OpenSextant/Xponents/releases/tag/python-v1.5.8
Docker release of Xponents REST API and Gazetteer is here: https://hub.docker.com/r/mubaldino/opensextant . The mubaldino/opensextant:xponents-3.5 image is the latest rev; v3.7 of the docker image is pending.
Xponents 3.5 Final
Xponents 3.5 addresses primarily:
- addition of postal detection and geocoding, refined over the course of the past year
- remediation of Log4J vulnerabilities, bringing the level of that library version to 2.17.2
- the start of simplified documentation across gazetteer curation, Java API usage, and other references
- formal release of a Python client API and scripting for interfacing to the REST API. See this Xponents release tag https://github.com/OpenSextant/Xponents/releases/tag/python-v1.4.7
Docker release of Xponents REST API and Gazetteer is here: https://hub.docker.com/r/mubaldino/opensextant . The mubaldino/opensextant:xponents-3.5 image is the latest rev.
Xponents 3.5 Begin Again, Again
Happy Valentines
Xponents 3.5.5 BeginAgain (Again)
- Full Evaluation: internal evaluation work was redone start to finish to hone outlier gazetteer entries and patterns of rogue entries from new data sources. Evaluation work called out and fixed serious false-positive and recall errors.
- Log4J Remediation: While Log4J is not the primary choice of logging facility, it is a dependency that appears mainly in the Solr 7.x server distribution. Vulnerable Log4J JAR files were removed and latest ones were injected.
- API Changes:
  - TextEntity is a text span and requires a start, end offset pair. Only its constructor requires that pair. Other subclasses can have a zero argument constructor by exception, such as PoLiMatch.
  - GeonamesUtility.isCountry() now only returns true for PCLI entries; others are historical country names or territories.
  - REST API now has method and match-id on most matches to be more consistent.
  - codes feature can be requested in REST API: features=geo,taxons,patterns,codes for example. This will emit tagged acronyms for admin boundaries for now.
  - Xponents Core TextUtils now offers trivial text span testing for common punctuation. For example, to quickly test if MARC __&__ U looks like an entity or is a false positive when tagging the phrase Marc U, a common punct test was needed. These were fairly obvious pre-filters to employ just after tagging and before serious reasoning happens.
- Geocoding: Tamped down on acronym false-positives on UPPERCASE and lowercase documents given the added gazetteer data includes lots of codes.
  - Default behavior: country codes and province codes are NOT emitted although tagged. These are requested explicitly by caller using the codes feature. Right, so USA or COD or MA are not emitted by default although those bare tokens may represent countries or provinces. Such codes qualifying other placenames will be emitted.
  - Gazetteer tagging omissions: numerous transliterated short names for Pacific/Asian islands A xx, I-xx and various other false-positive places are NOT tagged, although present in the gazetteer.
  - About 500 dictionary words in French, German and English were added to the stop-filter for tokens commonly not places. E.g., amend, adept, etc.
- Bugs Fixed:
  - Geocoder Rule HeatMap memory leak fixed
  - German is removed as a country -- it's a nationality or an adjective
  - Tagger will throw ExtractionException if it tags 100,000 or more locations from gazetteer
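The punctuation pre-filter mentioned under API Changes can be illustrated in Python. This is a minimal sketch of the idea only, not the Java TextUtils API: the function name has_inner_punctuation and the choice of punctuation characters are assumptions made for this example.

```python
import string

# Assumed set of "common punctuation"; the characters Xponents actually checks may differ.
COMMON_PUNCT = set(string.punctuation)


def has_inner_punctuation(text, start, end):
    """Return True if the span text[start:end] has punctuation between its first and last characters."""
    inner = text[start:end].strip()[1:-1]
    return any(ch in COMMON_PUNCT for ch in inner)


doc = "Shop at MARC & U today. Later, Marc U called."
span_a = doc.find("MARC & U")
span_b = doc.find("Marc U")
punct_a = has_inner_punctuation(doc, span_a, span_a + len("MARC & U"))
punct_b = has_inner_punctuation(doc, span_b, span_b + len("Marc U"))
print(punct_a, punct_b)
```

A check like this is cheap enough to run on every match right after tagging, so spans that only matched because punctuation was ignored can be set aside before heavier geocoding rules run.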
DISTRIBUTIONS:
- Python: See attached Opensextant Python API 1.4.6
- Docker: https://hub.docker.com/r/mubaldino/opensextant - see "xponents-3.5" tag. Now latest is also a tag
- Gazetteer: see Docker image; Copy xponents-solr out of docker image to use it outside of Docker
- Java, Maven:
TESTING:
Deploy: https://github.com/OpenSextant/Xponents/blob/master/Examples/Docker/docker-compose.yml
Install client library (ATTACHED)
pip3 install opensextant-1.4.6.tar.gz
Use Test suite: https://github.com/OpenSextant/Xponents/blob/master/test/xlayer-test-suite.py
DEFAULT_URL=localhost:8787
python3 xlayer-test-suite.py $DEFAULT_URL
Test output:
- Consult docker logs on docker container, ala docker logs xponents, to see that server is alive
- Review output to console -- unit tests results for normal geotagging, postal geotagging and tests in Arabic and Japanese should appear.
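Before running the test suite it can help to confirm that something is listening at the server address. The sketch below is not part of the Xponents test suite: parse_server and is_listening are hypothetical helpers, and the default port 8787 is taken from the DEFAULT_URL shown above. It only checks that the TCP port accepts connections, not that the REST API is healthy.

```python
import socket


def parse_server(server, default_port=8787):
    """Split "host:port" into (host, port); the port falls back to default_port."""
    host, _, port = server.partition(":")
    return host, int(port) if port else default_port


def is_listening(server, timeout=2.0):
    """Return True if a TCP connection to the server address succeeds."""
    host, port = parse_server(server)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(is_listening("localhost:8787"))
```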
Xponents 3.5 "Begin Again"
Happy New Year
Xponents 3.5.4 BeginAgain
- Full Evaluation: internal evaluation work was redone start to finish to hone outlier gazetteer entries and patterns of rogue entries from new data sources. Evaluation work called out and fixed serious false-positive and recall errors.
- Log4J Remediation: While Log4J is not the primary choice of logging facility, it is a dependency that appears mainly in the Solr 7.x server distribution. Vulnerable Log4J JAR files were removed and latest ones were injected.
- API Changes:
  - TextEntity is a text span and requires a start, end offset pair. Only its constructor requires that pair. Other subclasses can have a zero argument constructor by exception, such as PoLiMatch.
  - GeonamesUtility.isCountry() now only returns true for PCLI entries; others are historical country names or territories.
  - REST API now has method and match-id on most matches to be more consistent.
DISTRIBUTIONS:
- Python: See attached Opensextant Python API 1.4.5
- Docker: https://hub.docker.com/r/mubaldino/opensextant - see "xponents-3.5" tag
- Gazetteer: see Docker image; Copy xponents-solr out of docker image to use it outside of Docker
- Java, Maven:
TESTING:
Deploy: https://github.com/OpenSextant/Xponents/blob/master/Examples/Docker/docker-compose.yml
Install client library
pip3 install opensextant-1.4.5.tar.gz
Use Test suite: https://github.com/OpenSextant/Xponents/blob/master/test/xlayer-test-suite.py
DEFAULT_URL=localhost:8787
python3 xlayer-test-suite.py $DEFAULT_URL
Test output:
- Consult docker logs on docker container, ala docker logs xponents, to see that server is alive
- Review output to console -- unit tests results for normal geotagging, postal geotagging and tests in Arabic and Japanese should appear.
Xponents Core & SDK v3.3.5 patch
- Docker offline image
- Bug: TaxCat .configure() method accidentally called a second time in PlaceGeocoder
- JavaDoc 8+ maintenance on HTML5 and javadoc comments
- Maven plugin versions updated
- XText module moved to 3.3.5 to release with Xponents Examples
- XCoord retested coordinate patterns on DMS and DM to ensure +/- symbols are detected and coordinate precision is provided. Moved TEST cases on certain patterns to appropriate family to test
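To make the sign handling in the XCoord item concrete, here is a small Python sketch that converts a signed DMS or DM string to decimal degrees. It does not reproduce the XCoord patterns: DMS_PATTERN and dms_to_decimal are hypothetical, the accepted separators are an assumption, and no range validation is done on minutes or seconds.

```python
import re

# Assumed simplified pattern: optional sign, then degrees, minutes and optional seconds
# separated by spaces or the usual degree/minute/second symbols.
DMS_PATTERN = re.compile(
    r"^\s*(?P<sign>[+-]?)\s*(?P<deg>\d{1,3})[°\s]+(?P<min>\d{1,2})['′\s]*"
    r"(?:(?P<sec>\d{1,2}(?:\.\d+)?)[\"″]?)?\s*$"
)


def dms_to_decimal(text):
    """Convert a signed DMS or DM string to decimal degrees, or return None if it does not match."""
    m = DMS_PATTERN.match(text)
    if not m:
        return None
    value = int(m.group("deg")) + int(m.group("min")) / 60.0
    if m.group("sec"):
        value += float(m.group("sec")) / 3600.0
    # The sign is applied last so that values such as "-0 30" stay negative.
    return -value if m.group("sign") == "-" else value


print(dms_to_decimal("+42 21 30"))
print(dms_to_decimal("-71°03'"))
```

Capturing the sign as its own group, rather than parsing a signed integer for degrees, is what keeps a leading minus from being lost when the degrees part is zero.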
Xponents Core & SDK v3.2.2 data patch
Not fully released for this round. Docker image will be produced for v3.3.
Issues:
- "taxcat" data sets are now more reliably harvested and scripted so everything is easily reproduced.
- Docker script simplification for Xponents REST
- Substantial additions in patterns extraction: "Email" test case added in Java API; "FlexPat" capability now in beta in Python API.
- Python 3.x readiness and testing
- Streamlined and retested entire Solr Gazetteer build from latest geoname sources
- Streamlined and retested script/dist.sh build and distribution
Xponents Core & SDK v3.2
release code name: Dead Heat, Summer 2019.
Xponents was refactored in the following way for this turning point release:
- Core contains lighter weight parsers and base classes and data models (artifact: opensextant-xponents-core)
- Tagger SDK contains the beefier Solr-based taggers and REST services (artifact: opensextant-xponents)
- Xlayer REST was folded into SDK
- XText only needs to make use of Core
NO functional changes or data changes were made. This release is strictly an organizational matter relative to 3.1.1.
This includes the binary distribution of Xponents SDK (JARs, config files, docs) but no Xponents Solr data, due to size limitations. Docker Hub will have the full release.
Xponents SDK v3.0 (2018-OCTOBER)
Download: Xponents SDK @ Data-Releases
Improvements:
- CLI: Command line improvements on testing and running Example demos using a Groovy script (./script/xponents-demo.sh)
- Stopword tuning by language, when language ID for text is known: see @genediazjr "stopwords-iso" project; Stopwords for Tagalog, Urdu, Farsi, Chinese, Korean, etc, contributed there. This contributes to noise reduction in post-processing naive tagger output.
- Solr: SolrTextTagger was incorporated into Solr 7.4.0 formally. Solr 7.4 is minimum requirement. SolrJ partially deprecated; "FST50" postings format for FST is now used.
- Geocoder Rules: NAME, CODE patterns teased apart, e.g., "Boise, ID", "Boise, Id." are valid locations, where admin boundary code qualifies city. "Boise id" is not valid, though.
- Consolidation: Xponents is now one library. XText and Xlayer are separate related modules.
- Social Geo: org.opensextant.data.social and org.opensextant.extractors.geo.social represent the core functionality that was previously in test TweetGeocoder from OpenSextant 1.0.
- LangID: CyboZu LangDetect is incorporated as an extractor, but we still require the use of a valid ISO-639 table and a Language object model to manage LangID concepts.
- JSON: Jodd JSON library is now formally supported by XText and XLayer and other areas where JSON is used.
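The NAME, CODE distinction from the Geocoder Rules item can be sketched with a regular expression. This is an illustration only, not the rule as implemented in Xponents: NAME_CODE and is_name_code are hypothetical, and the pattern assumes a capitalized name, a comma, and a two-letter code written as "ID" or "Id.".

```python
import re

# Assumed simplification: a capitalized name, a comma, then a two-letter code
# written either as "ID" or as "Id." (capital, lowercase, period).
NAME_CODE = re.compile(r"^[A-Z][A-Za-z .'-]*,\s*(?:[A-Z]{2}|[A-Z][a-z]\.)$")


def is_name_code(phrase):
    """Return True if the phrase looks like "NAME, CODE" with a comma-qualified admin code."""
    return bool(NAME_CODE.match(phrase))


checks = {p: is_name_code(p) for p in ("Boise, ID", "Boise, Id.", "Boise id")}
print(checks)
```

The comma is what separates the cases here: without it, a short lowercase token such as "id" is far more likely to be an ordinary word than an administrative code.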