RELEASES

Visit https://github.com/OpenSextant/Xponents/releases for latest library releases and notes.

2024

Xponents 3.7

  • Core API: Convergence
    • Geodesy and GISCore geodetic libraries are now merged into the main Xponents Core
    • Tests and some data for the Geodesy/GISCore modules are copied over; however, the primary test data remains in their respective repos, geodesy and giscore.
    • Substantial JavaLint and security fixes were made to these 10-year-old libraries. Also dependencies such as FindBugs were replaced.
    • Migrated to a separate repo: Core is now maintained separately in the Xponents-Core repo. This is the first release in which the ./Core/ folder does not appear in this source tree.
  • Xponents REST:
    • Noise filtering: added minlen parameter to Xlayer (Xponents REST) service. By default it filters out trivial matches. About 3-5% reduction in noise tags from codes, short names, etc. -- things that are typically neither useful geocodings nor names of places.
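
As a rough illustration of what a minimum-length filter does, the sketch below drops short matches from a list of tagged spans. The dict layout, the `filter_by_minlen` name and the default of 3 are assumptions made for this example; they are not the Xlayer implementation or its actual default.

```python
def filter_by_minlen(matches, minlen=3):
    """Keep only matches whose matched text has at least `minlen` characters.

    `matches` is a list of dicts with a "text" key (a hypothetical layout,
    not the real Xlayer response format).
    """
    return [m for m in matches if len(m["text"].strip()) >= minlen]


tags = [
    {"text": "MA", "offset": 10},
    {"text": "Boston", "offset": 20},
    {"text": "US", "offset": 40},
]
kept = filter_by_minlen(tags, minlen=3)
print([m["text"] for m in kept])
```

Short codes such as MA or US are the kind of tags a length threshold removes; a caller who wants them back would lower the threshold.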

Xponents 3.6.7: Springy

  • Core API

    • Building up TextInput class to carry better metrics to enable filters and language-specific tuning.
  • PlaceGeocoder tuning:

    • Refined tagging and filtering of CJK texts using mostly length and punctuation filters; non-trivial names and phrases will likely be grams of 3 or more characters and have no interceding space or punctuation.

2023

Xponents 3.6.6: New Year Release

Patch focus -- handle acronym and short match noise from Postal codes and other artifacts.

  • Core API
    • library calls for checking CJK or Middle-eastern scripts (Arabic, Hebrew, etc)
    • library call for checking if a match is an A.B.Breviation... loosely.
  • PlaceGeocoder tuning:
    • Applied simpler abbreviation flagging on place candidate mentions, allowing U.K. and U.S. and E.E.U.U., etc. to be tagged properly and emitted as country match
    • Significant CJK/AR filtering as an interim solution for noise filtering and dealing with a lack of language-specific tokenization and heuristics for match filtering or promoting.
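
A loose abbreviation test of the kind described above could look like the following. This is a minimal sketch under stated assumptions: the regular expression and the `looks_like_abbreviation` name are hypothetical and only cover single letters separated by periods, which is narrower than whatever the Core API call actually accepts.

```python
import re

# Two or more single letters separated by periods, e.g. U.K. or E.E.U.U.
# The trailing period is optional so "U.S" at the end of a sentence passes.
# [^\W\d_] matches any Unicode letter (word character minus digits and "_").
_ABBREV = re.compile(r"(?:[^\W\d_]\.)+[^\W\d_]\.?")


def looks_like_abbreviation(text):
    """Return True if `text` is shaped like a dotted abbreviation."""
    return _ABBREV.fullmatch(text.strip()) is not None


for sample in ["U.K.", "U.S", "E.E.U.U.", "UK", "90210"]:
    print(sample, looks_like_abbreviation(sample))
```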

Xponents 3.6.5

  • PlaceGeocoder tuning:
    • Stop filters: Spanish, Arabic and Russian stopwords and non-places updated.
    • Libraries: Apache Commons libraries updated for IO, Compress, and Codec
    • Tagging filters: omit ADMIN level matches that are short codes, as well as avoid tagging bare country codes

Xponents 3.6.4

  • PostalGeocoder tuning:
    • Optimization: separate tagger and geocoding passes into separate classes; consolidate postal code filters; feed pre-tagging matches from PlaceGeocoder to PostalGeocoder as an option. PostalGeocoder still runs that pass internally if not fed externally.
    • Tagging Omission Fixed: POSTAL CITY or CITY POSTAL are now possible, only if a city represents an administrative area, such as ADM2-ADM4 or PPLA gazetteer feature. These would be major metropolitan areas with their own set of multiple postal codes.
    • Invalid USGS data -- and other data with lat, lon = (0,0) aka 'Unknown' location info -- is filtered out by tagger.
    • Core API -- TextMatch class is sortable to allow for tagger applications to sort matches by appearance (offset low to high)
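
Sorting matches by order of appearance can be sketched as below. `Span` is a hypothetical stand-in for a match object, not the Xponents TextMatch class; the sketch only shows the idea of ordering on the start offset (then the end offset) while ignoring the matched text.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Span:
    """Hypothetical match span; comparison uses (start, end) only."""
    start: int
    end: int
    text: str = field(compare=False)


spans = [
    Span(42, 48, "Boston"),
    Span(0, 5, "Paris"),
    Span(17, 23, "London"),
]
in_reading_order = sorted(spans)
print([s.text for s in in_reading_order])
```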

Xponents 3.6.1

  • Taxon matching refinements and expansion. REST API now allows for geotagger to fully extract/report all taxons
  • #81 Xponents Core: XCoord parsing of MGRS
  • #77 Streamlined dependencies and now understand them better, e.g., Log4J 2.20 updated. In Examples, XText + Tika 2.7 updated and dependencies slimmed down.
  • Lots of code cleanup and documentation tweaks
  • Python client library will be released in v3.7 with new gazetteer

2022

Xponents 3.5.10

  • Maven Dependency review and scrub

Xponents 3.5.9

  • Library versions: Post-Log4shell cleanup
    • SLF4J 1.7.36, Logback 1.2.11, Log4J 2.17.2
    • Commons IO 2.11.0
    • Tika 1.28.3 (Examples and XText)
    • Spatial4J 0.8; JTS 1.x removed from dependencies, but still used in Solr distro
  • JavaLint results from Sonarqube scans implemented - addressing the removal of Java 5, 6, 7 code style, code formatting
  • Geocoding:
    • Postal code precision set at +/- 5000m instead of 50,000m
    • NameCodeRule rule - improved the validation for how "Name, Code" or "Name, AdminName" pairings are managed; Avoid repetitive lists such as "Country, Country, ..." or "Province, Province,... "; Avoid repetitive codes such as "CODE, CODE" as in CM MA
    • Gazetteer regenerated and substantial abbreviations and stop words added.

Xponents 3.5.7 Postal Patience

  • Geocoding:
    • Following fixes related to features=["postal",...] in Python API (XlayerClient) or REST API (/xlayer/rest/process)
    • Omit postal tags where random punctuation appears amongst valid postal data. If it is valid, you should cleanse your text a bit. Postal codes and abbreviations are very common and can often be confused with other coded data
    • Postal tags left undecorated by the end of a call are omitted, i.e., tagging for postal features may detect a valid postal code (e.g., ZIP 90210) -- but will only report it if it is qualified by a province or valid city (e.g., CA 90210 or Beverly Hills, 90210). Bare digits are not reported.
    • Precision of postal feature "A/POST" was 50,000m; it is now 5,000m.
    • Use features=["filtered",...] to retrieve items that are tagged, but omitted

Xponents 3.5.5 BeginAgain (Re-release)

  • Geocoding: Tamped down on acronym false-positives on UPPERCASE and lowercase documents given the added gazetteer data includes lots of codes.
    • Default behavior: country codes and province codes are NOT emitted although tagged. These are requested explicitly by caller using the codes feature. Right, so USA or COD or MA are not emitted by default although those bare tokens may represent countries or provinces. Such codes qualifying other placenames will be emitted.
    • Gazetteer tagging omissions: numerous transliterated short names for Pacific/Asian islands A xx, I-xx and various other false-positive places are NOT tagged, although present in the gazetteer.
    • About 500 dictionary words in French, German and English were added to the stop-filter for tokens commonly not places. E.g., amend, adept, etc.
  • Bugs Fixed:
    • Geocoder Rule HeatMap memory leak fixed
    • German is removed as a country -- it's a nationality or an adjective
    • Tagger will throw ExtractionException if it tags 100,000 or more locations from gazetteer
  • API Changes:
    • codes feature can be requested in REST API: features=geo,taxons,patterns,codes for example. This will emit tagged acronyms for admin boundaries for now.
    • Xponents Core TextUtils now offers trivial text span testing for common punctuation. For example, to quickly test if MARC __&__ U looks like an entity or is a false positive when tagging the phrase Marc U, a common punctuation test was needed. These were fairly obvious pre-filters to employ just after tagging and before serious reasoning happens.

Xponents 3.5.4 BeginAgain

  • Full Evaluation: internal evaluation work was redone start to finish to hone outlier gazetteer entries and patterns of rogue entries from new data sources. Evaluation work called out and fixed serious false-positive and recall errors
  • Log4J Remediation: While Log4J is not the primary choice of logging facility, it is a dependency that appears mainly in the Solr 7.x server distribution. Vulnerable Log4J JAR files were removed and latest ones were injected.
  • API Changes:
    • TextEntity is a text span and requires a start, end offset pair. Only constructor requires that pair. Other subclasses can have a zero argument constructor by exception, such as PoLiMatch
    • GeonamesUtility.isCountry() now only returns true for PCLI entries; others are historical country names or territories.
    • REST API now has method and match-id on most matches to be more consistent

2021

Xponents 3.5.0 GiveThanks

Release Objective: Sustainable & expanded gazetteer concepts. Improved geoinferencing rules and evaluation.

  • Features: "postal" data from Geonames.org incorporated into its own tagger; support for tagging postal codes and abbreviations is available as an SDK tagger "PostalGeocoder" and also from the REST API, when using features="postal"
  • Gazetteer ETL: continued to refactor data sources thoroughly adding a full SQLite curation pipeline for master gazetteer and postal data. Python API and ./solr/script contain the bulk.
  • Gazetteer ETL: employing Google Books wordstats (2012) to identify common words
  • Fuzzy Matching: Honed the concept of phonetic and non-diacritic matching to help optimize gazetteer entry validation and tuning and tagger matching.
  • Gazetteer: Added entries for historical countries, UAE, and numerous variants and nationalities
  • Geolocation: converted gazetteer data and scoring to a 100 point scale for ID bias and name bias; shed 50MB from each data store by not carrying around unnecessary floating point accuracy.
  • Python API: opensextant, v1.4:
  • Java API: opensextant-xponents-core v3.5:
  • Java API: opensextant-xponents v3.5. Changes:
    • ScoredPlace no longer subclasses Place: It is a holder for score and place
    • Parameters no longer has output_* fields. tag_* fields are used only to indicate user processing/output options
  • REST API: Xponents Server v3.5:
  • Sonar Scan: Java8 code style and compliance fixes throughout.

Xponents 3.4.0 Vamp

  • NOT released on docker. Skip.
  • Opensextant v1.3 python library updated and streamlined for Python3 conventions
  • Gazetteer curation refactored and added serious sources such as Geonames.org and NaturalEarth
  • Xponents Solr Gazetteer quarterly release, 2021-Q1
  • Geolocation precision improvements based on GeocodeRules used in PlaceGeocoder
  • Sonar Scan: Java8 code style and compliance fixes throughout.

Xponents 3.3.6 Sonar

  • Added Docker notes on Sonarqube scanner and fixed basic javadoc and unit test issues

2020

Xponents 3.3.5 Clarity

Xponents 3.3.3 New Vision

(( That is, "2020" = New Vision. Happy new year!))

  • XCoord improvements - accommodating for missing degree and other symbols
  • PlaceGeocoder uses feature class weighting using some common sense and less than obvious math. The effect is to rank certain features higher than others. This could be data-driven, but for now it is implemented with a simple hash table of FEATURE CODES ==> WEIGHTS.
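
The hash-table idea above can be pictured with a small sketch. The weights, the default, the candidate layout and the `rank_by_feature` name are all invented for illustration; only the feature codes (PCLI, ADM1, PPLC, PPL) are standard Geonames designations, and the actual PlaceGeocoder table and math are not shown in these notes.

```python
# Hypothetical weights: higher means the feature type is preferred.
FEATURE_WEIGHTS = {
    "PCLI": 1.0,  # independent political entity (country)
    "ADM1": 0.9,  # first-order administrative division
    "PPLC": 0.8,  # capital city
    "PPL": 0.5,   # populated place
}
DEFAULT_WEIGHT = 0.3  # assumed fallback for codes not in the table


def rank_by_feature(candidates):
    """Order candidate places so that higher-weighted features come first."""
    return sorted(
        candidates,
        key=lambda c: FEATURE_WEIGHTS.get(c["feat_code"], DEFAULT_WEIGHT),
        reverse=True,
    )


candidates = [
    {"name": "Paris", "feat_code": "PPL"},
    {"name": "Paris", "feat_code": "PPLC"},
    {"name": "Paris", "feat_code": "HTL"},
]
ranked = rank_by_feature(candidates)
print([c["feat_code"] for c in ranked])
```

A lookup table keeps the ranking easy to inspect and adjust by hand, which fits the note that a data-driven approach is left for later.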

2019

Xponents 3.3.0 Holly Jolly

  • Holiday 2019 release: Date/time pattern detection and normalization improvements.
  • Python API improvements, mainly converting XlayerClient to return array of TextMatch objects, rather than raw python dicts.
  • Minor gazetteer improvements

Xponents 3.2.1 Dead of Night

  • Halloween 2019: script simplification, CLASSPATH and other cleanup. XCoord: Reduced ambiguity in some DM vs. DMS patterns

Xponents 3.2.0 Dead Heat

  • July 2019: Refactoring to split a lighter-weight "Core API" from the heavier, more involved tagger SDK

Xponents 3.1.0 Summer Solstice

  • Reverse geocoding added on request. Xlayer exposes the results of enriching found coordinates. Note: use of Solr {geofilt} does not work with a large number of rows of point data -- RPT wants to work with shapes and apparently tries to load resources to support more advanced shape queries. Ran out of memory with all invocations of Solr Spatial mechanisms.
  • OpenJDK 8 and 12 testing; Experiments on GC settings

2018

Xponents 3.0.6 Pi Day/Equinox

  • Xtemporal now reduces matches filtering out submatches or duplicate date/time matches.
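
One way to picture this kind of reduction is the sketch below, which removes exact duplicates and any span wholly contained in a span already kept. The tuple layout and the `reduce_matches` name are assumptions for illustration, not the Xtemporal code.

```python
def reduce_matches(spans):
    """Drop duplicate spans and spans contained in another span.

    `spans` is a list of (start, end, text) tuples with an exclusive end
    offset (a hypothetical layout).
    """
    # Sort by start offset, longest span first, so that a containing span
    # is always visited before anything it contains.
    ordered = sorted(set(spans), key=lambda s: (s[0], -(s[1] - s[0])))
    kept = []
    for s in ordered:
        contained = any(k[0] <= s[0] and s[1] <= k[1] for k in kept)
        if not contained:
            kept.append(s)
    return kept


# Offsets refer to the text "12 March 2018 10:30".
found = [
    (0, 13, "12 March 2018"),
    (3, 13, "March 2018"),     # submatch of the first span
    (0, 13, "12 March 2018"),  # exact duplicate
    (14, 19, "10:30"),
]
print(reduce_matches(found))
```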

Xponents 3.0.5 SuperBowl

  • Reviewed low recall due to name-code filters and rules. NAME,.....CODE will not filter out a CODE if CODE is a country.
  • Solr/Lucene 7.6+
  • Tika 1.19+ on XText

Xponents 3.0.4 Columbus Day

  • Command line improvements on testing
  • Consolidate all tests and examples under single Groovy script

Xponents 3.0.3 Day of Remembrance (9/11)

  • Account for all decent stop word lists (see genediazjr "stopwords-iso" project); Stopwords for Tagalog, Urdu, Farsi, Chinese, Korean, etc, contributed there. These lists just make output less noisy when the language of text is known.
  • More LanguageID-driven tests added for PlaceGeocoder
  • NAME, CODE patterns teased apart
  • Solr 7.4+ is required now; SolrTextTagger miraculously is embedded in Solr, so less has to be done externally to configure it all.
  • Solr 7.4: removed deprecated Solr request optimizations, and other deprecated SolrJ usage.

Xponents 3.0 Fourth of July

  • Refactor: all major libraries converged into one project: Basic, Patterns, Xponents are now just "Xponents"
  • Refactor: XText is moved up to its own top level OpenSextant project
  • Feature: Tweet geocoding was moved from Examples to a formal part of Xponents org.opensextant.data.social and org.opensextant.extractors.geo.social represent the core functionality.
  • Feature: Added Language ID API wrapping CyboZu LangDetect; the Xponents langid extractor, though, adds a fair amount of wrapping using the Xponents Language object to make language data (ISO639 codes, etc.) easier to use. As well, where CyboZu LangDetect fails on short texts or other data, Xponents has some fall-back approaches to attempt alternate lang IDs for CJK languages (Chinese/Japanese/Korean).
  • Versions: Solr 7.3 is core Solr/Lucene version
  • Data: Formally support JSON through Jodd.org JSON package primarily with data transforms for "geocoding" data. This supports both REST (Xlayer project) and social media ingest and export.

DEPRECATED: 2.10 and earlier

Xponents 2.10.4 thru 2.10.6:

  • Bug fix: PlaceCandidate had opposite sort order (improper implementation of Comparable in ScoredPlace)
  • Bug fix: SolrGazetteer.findPlaces() missing " AND " in solr clause
  • Added demonstration code under Examples for work with gazetteer
  • Fix span tag detection in MatcherUtils

2017

Xponents 2.10.x Revival

  • Solr6 + Java8 support. Solr4 discontinued
  • Province Names table: Geocoded matches now have Province Name along side ADM1
  • Applied noise filters to TaxonMatcher
  • MatcherUtils: utility class to help special cases in sifting through matches, e.g. matches on HTML or tagged data need more filtering.

Xponents 2.9.9 Fourth of July

  • Much improved stop filters for nearly all Solr-supported languages (using lucene stopword resources)
  • Reviewed geocoding rules; Assigned default confidence to country matches, instead of 0.
  • Solr6 staging

Xponents 2.9.6

  • Addition of a MapReduce (MR) experimentation area. Capability iterates over JSON data that has "text" and "id" fields; Geotags text for coordinates and place names.
  • Filters: with the advent of MR, tokenization issues and false-positives in specific languages became glaringly obvious. Fixed: geotagging in Japanese (and Chinese, Korean, etc.). As well, improved some common stop word lists for Vietnamese and Spanish, leveraging stop word lists from Lucene and Carrot2.

Xponents 2.9.3

  • Solr Gazetteer: non-places split into general and spanish as the major groupings.
  • PlaceGeocoder: nonsense filter handles short matches for diacritic mismatch
  • Filters: CLASSPATH order is important, gazetteer/conf must appear first in classpath to override anything else.
    Do NOT include test JARs in production setting or in evaluation -- If you do, then be aware of CLASSPATH.

2016

Xponents 2.9.0

  • Fresh look at how resource files are pulled from CLASSPATH: InputStream (getResourceAsStream()) is the primary entry point to pulling in any sort of config file or data resource. Getting File or URL should be left to the caller of APIs. If such things are offered in these APIs it is for mere convenience. Pulling items from JAR, file system, CLASSPATH, etc. seems to behave differently in different environments: e.g., HDFS, Server vs. Client Applications, etc.
  • Solr 4.x refactoring: Provision Solr from Jetty v9; No longer using the crippled jetty-runner v8 JAR.
  • Patterns: Streamlined constructors given the resource file issue at top
  • Extraction: PlaceGeocoder now weights findings against explicitly mentioned countries to improve disambiguation. About 1% improvement in F-score.

Xponents 2.8.18

  • Extraction: NonsenseFilter added short name + number pattern to filter out unlikely name match, that is aimed at rare gazetteer entries. Alternatively, mark such things in Gazetteer as SearchOnly = true (to avoid tagging at all, by default)

Xponents 2.8.17

  • Basics: US State metadata for mapping FIPS/ISO and ADM1/Postal code pairings
  • Extraction: NonsenseFilter added to deal with odd punctuation situations as a result of over-tagging or deep tokenization.

Xponents 2.8.16

  • Basics: Country data improvements include territories, timezones, languages spoken, etc; Backed by GeonamesUtility and GeodeticUtility
  • Basics: Place object is fleshed out more with population data, when available; ASCII and other name heuristic flags; Geohash options; overall improved Geocoding interface; Backed by techniques in TextUtils
  • Patterns: XCoord and GeocoordMatch reimplement Geocoding interface; Date/Time pattern fixes
  • Extraction: PlaceGeocoder added support for Arabic and CJK text parsing if given a language ID; Refactored rules stack and performance on scoring candidate names. Overall improvement in default score for a place match; Tweaked JRCNames to allow for better false-positive negation.
  • XText 2.9.x: Tika 1.13 upgrade; Improved Web/Sharepoint crawling logic (not perfect). Allows user to filter links worth capturing and converting
  • XText 2.9.x: TikaHTML parser/converter was not yielding reasonably obvious metadata tags (title, org, author, etc.) so I pulled in JerichoHTML to get tags.

2015

Xponents 2.8.5 - December 2015:

  • adding timezone and language metadata;
  • PlaceGeocoder: rules and tracing improved.
  • PlaceGeocoder: Added nationality detection using XTax; inferred countries lightly rank candidates higher.

Xponents 2.8.x - November 2015: Long overdue refactor

  • Extraction/Geo: PlaceGeocoder now emitting reasonable choice for location of names; Still initial draft. Heavily involved in rules development in Java here. Evaluation of these features is still very much a personal/internal thing. -- TODO: document rules in plain language -- TODO: someday opensource evaluation tools
  • Patterns (new): Splintered off FlexPat-based libraries into this new module. If all a user wants is regex style patterns, they do not need Tika or Solr or any of that.
  • Basics: TextUtils now has more text case checking tools
  • MOVES: -- Basics 'flexpat' ---> Patterns -- Extraction 'xcoord','xtemporal','poli' --> Patterns

Xponents 2.7.19 - November 2015, bug fixes and fine tuning .16 patches

Xponents 2.7.16 - October 2015

  • Extraction: 'PlaceGeocoder' saw a focused effort on improving how popular well-known entities can be used to negate gazetteer tagging. This solution makes better use of XTax as a naive entity tagger. Overall, recall is maximized at the same time geo-tagging precision is maximized. As well, the foundation of "Geocode Rules" is established but needs further documentation.

Xponents 2.7.15 - October 2015

  • Java 8: tested strict javadoc compilation and fixed errors. Warnings remain
  • Basics: added timezone/UTC offset table to country objects (courtesy of geonames.org)
  • packaging: removed deprecated code such as progress listeners
  • Extraction: Retested Gazetteer spatial query, as certain standard solr spatial mechanisms force index to load into RAM, e.g., sort-by-dist
  • Extraction: lower-case and case-insensitive matching enabled in GazetteerMatcher for odd cases like working with social media
  • Extraction/Gazetteer: added ability to upload JSON form of gazetteer records, e.g. aliases for existing known gazetteer entries

Xponents 2.7.8 - July 2015

  • Java 7 is the norm, but tested compilation and running on Java 8.
  • XText: improved semantics for found hyperlinks in web crawls
  • XText: Tika 1.8 is latest
  • Basics: fixed country code hash maps; added more text utility for handling unicode situations: Emojis and other language issues.
  • Basics: Enhanced the concept of a "Geocoding" interface to include ADM1 Name in addition to ADM1 code
  • Extraction: Honed use of JRCNames as a keyword tagging resource in XTax
  • Extraction: Devised a rule set for a full range of geocoding ideas in PlaceGeocoder (coords, countries, places) while looking at filtering out terms and tokens for performance reasons.
  • Dist: Improved distribution packaging (script/dist.xml)

2014

Xponents 2.5.1 - July 2014

  • Java 7+ required now; Java 6 source syntax supported, but release will be Java 7 binary
  • Javadoc cleanup
  • XText refactor, given added archive file support; concept of caching and crawling is optional and moved out of main conversion logic.

Xponents 2.4.3 - June 2014

  • Extraction: MGRS filters for well known dates/months, lower case (default is to filter out lowercase), and Line endings in Latband/GZD
  • XText bug fixes; check style review: v1.5.4
  • POM cleanup and indentation; review unspecified compile time dependencies

Xponents 2.3 - May 2014

  • minor tweaks in APIs
  • added set_match_id(match, counter) to FlexPat matchers

Xponents XText 1.5 - May 2014

  • numerous fixes in XText proper, and many path normalization fixes in ConvertedDocument
  • added Mail crawler and MessageConverter for handling email
  • many improvements to JPEG/EXIF conversion