bie-index is a grails web application that indexes taxonomic content in DwC-A and provides search web services for this content. This includes:
- faceted taxonomic search with synonymy support
- search across other entities including:
- regions
- spatial layers
- data resources
- institutions
- collections
- autocomplete services
- CSV download services
- bulk taxonomic lookup services by name or GUID
- retrieval of full classification (major and non-major taxonomic ranks e.g. sub-order, sub-family)
This project provides JSON webservices and an interface for admin users. It does not include a HTML interface for end users. There is a set of front end components available providing the species pages listed here:
- bie-plugin - a grails plugin providing species pages & search interfaces
- ala-bie - the ALA version of this front end
- generic-bie - a generic version of this front end for use as a starting point for other countries reusing ALA components.
For an introduction to the approach to names within the ALA, nameology is a good place to start.
This application currently supports the ingestion of Darwin Core archive (DwC-A) with the following mandatory darwin core fields in the core file:
Additional fields can be added which will allow more sophisticated handling of names
- nameComplete An explicit complete name. If not specified, the scientificNameAuthorship (if present) is appended to the scientificName to make the canonical name.
- nameFormatted An explicitly formatted name, with name, author and structural parts marked by
<span class="...">
elements. If not explicitly present, it is constructed from the information available. See Name Formatting for details. - source A URL for a source of the name. This can be used to link to a page containing the original data.
- datasetID The uid of the collectory dataset ID. If provided, then a link to the dataset description will be provided
Additional fields added to the core file e.g. establishmentMeans or any other field will also be indexed and available for facetted searching.
An extension file of vernacular names is also supported. The format support here aligns with the same format supported by the ala-names-matching API.
Additional fields , which will allow more sophisticated handling of vernacular names are:
- status An indicator of the relative importance of the vernacular name. The controlled vocabulary is read from vernacularNameStatus.json
- source A URL for a source of the vernacular name. This can be used to link to a page containing the original data.
- datasetID The uid of the collectory dataset ID. If provided, then a link to the dataset description will be provided
An extension file of additional identifiers is also supported. The format aligns with the GBIF identfier format.
Additional fields, which will allow more sophisticated handling of identifiers are:
- status An indicator of the relative importance of the identifier. The controlled vocabulary is read from identifierStatus.json
- source A URL for a source of the identifier. This can be used to link to a page containing the original data.
- datasetID The uid of the collectory dataset ID. If provided, then a link to the dataset description will be provided
A Darwin Core Archive may contain an eml.xml
metadata file, in the Ecological Metadata Language format.
If available, default information is gathered from the metadata file:
- datasetName Derived from
eml/dataset/title
Below is an example meta.xml that would be provided in a darwin core archive.
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
<core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy=""" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
<files>
<location>taxon.csv</location>
</files>
<id index="0" />
<field term="http://rs.tdwg.org/dwc/terms/taxonID"/>
<field index="1" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
<field index="2" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID"/>
<field index="3" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
<field index="4" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
<field index="5" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
</core>
<extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy=""" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/VernacularName">
<files>
<location>vernacular.csv</location>
</files>
<coreid index="0" />
<field term="http://rs.tdwg.org/dwc/terms/taxonID"/>
<field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
</extension>
</archive>
- UK Species inventory
In addition to indexing the content of the darwin core archive, the ingestion & index creation (optionally) indexes data from the following ALA components. It does this by harvesting JSON feeds from the listed components.
- Layers & regions - https://spatial.ala.org.au/layers - spatial layers available in the system and regions (e.g. states, countries)
- Collectory - https://collections.ala.org.au - data resource, collections, institutions
- Lists and traits - https://lists.ala.org.au - conservation lists, sensitive species lists, traits
- Biocache services - https://biocache.ala.org.au/ws - images, occurrence record counts
- Biocollect - https://biocollect.ala.org.au - projects
This application makes use of the following technologies
- Apache SOLR 6.6.x
- Grails 6.0.0
- Tomcat 7 or higher
- Java 11 or higher
Some taxonIDs are now URLs, rather than LSIDs.
When provided to the server un-encoded, everything is fine.
However, if encoded with slashes being replaced by %2F
then tomcat treats this as a security error and
returns a 400 error.
To allow encoded slashes in tomcat, start the server with -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
An image scan will search the biocache for suitable images to act as an example image for the taxon.
The image scan configuration can be located in /data/bie-index/config/image-lists.json
and
references by imageListsUrl
in the configuration properties.
An example image list configuration is
{
"boosts": [
"record_type:Image^10",
"record_type:HumanObservaton^20",
"record_type:Observation^20",
"-record_type:PreservedSpecimen^20"
],
"ranks": [
{
"rank": "family",
"idField": null,
"nameField": "family"
},
{
"rank": "genus",
"idField": "genus_guid",
"nameField": "genus"
},
{
"rank": "species",
"idField": "species_guid",
"nameField": "taxon_name"
}
],
"lists": [
{
"uid": "dr4778",
"imageId": "imageId"
}
]
}
For each candidate taxon, a query is constructed that boosts certain characteristics of the occurrence in the
hopes of finding something that doesn't look terribly dead. The boosts
element contains the list of boosts to apply
(in the example, observations and images get a boost, preserved specimens are downgraded and images from dr130 are
preferred.
The required
and preferred
elements contain lists of filter queries that are applied to the search.
For example,geospatial_kosher:true
restricts searches to occurrences that
appear to be geospatially usable.
Only certain ranks get images attached to them.
The ranks
element contains taxon ranks that should have images associated with them,
along with the fields in the occurrence records that allow an occurrence to be found.
The idField
provides a specific biocache field that will be searched for the taxon identifier.
It can be left null for ranks that have no biocache index fields.
The nameField
provides a specific biocache field that will be searched for the taxon name.
Again, it can be left null for ranks with no name.
The lsid
field is always searched, to see if there is an ocurrence record that is specifically
named as the taxon.
The lists
elment contains species lists that manually specify preferred heroic images for species.
The species lists should have a column that supplies the image imageId, so that the image
can be found on the image server. Multiple lists can be used, with highest priority going to the first entry.
The uid
holds the list identifier to load.
Vernacular names can be drawn from species lists on the list server.
The vernacular name configuration can be located in /data/bie-index/config/vernacular-lists.json
and
referenced by vernacularListsUrl
in the configuration properties.
An example vernacular name list configuration is
{
"defaultVernacularNameField": "name",
"defaultNameIdField": "nameID",
"defaultKingdomField": "kingdom",
"defaultPhylumField": "phylum",
"defaultClassField": "class",
"defaultOrderField": "order",
"defaultFamilyField": "family",
"defaultRankField": "rank",
"defaultStatusField": "status",
"defaultLanguageField": "language",
"defaultsourceField": "source",
"defaultTemporalField": "temporal",
"defaultLocationIdField": "locationID",
"defaultLocalityField": "locality",
"defaultCountryCodeField": "countryCode",
"defaultSexField": "sex",
"defaultLifeStageField": "lifeStage",
"defaultIsPluralField": "isPlural",
"defaultIsPreferredField": "isPreferred",
"defaultOrganismPartField": "organismPart",
"defaultLabelsField": "labels",
"defaultTaxonRemarksField": "taxonRemarks",
"defaultProvenanceField": "provenance",
"defaultStatus": "common",
"defaultLanguage": "en",
"lists": [
{
"uid": "drt1464664375273",
"taxonRemarksField": "Notes",
"defaultLanguage": "xul"
"defaultStatus": "traditionalKnowledge"
},
{
"uid": "drt1464664375274",
"defaultLanguage": "fr",
"statusField": "priority"
}
]
}
The default entries provide useful defaults for things like the list fields that hold various pieces of information.
These can be overridden at the list level.
The various fields refer to the fields that can be part of the GBIF vernacular names extension.
The defaultLanguage
and defaultStatus
entries provide per-list defaults for language and status entries.
Languages should be ISO-639 two- or three-letter codes or AIATSIS codes; the bie-plugin can expand these out.
The uid
holds the list identifier to load.
Avoid using vernacularName or commonName as the vernacular name field The list server treats these in a special way, causing problems when attempting to retrieve the names.
Names with a status of deprecated
appear last in lists of names and will not be used as the "headline"
vernacular name.
They are generally names that are now offensive or doubtful.
Conservation status information can also be drawn from species lists.
The conservation configuration can be located in /data/bie-index/config/conservation-lists.json
and
referenced by conservationListsUrl
in the configuration properties.
An example conservation status list configuration is
{
"defaultSourceField": "status",
"defaultKingdomField": "kingdom",
"lists": [
{
"uid": "dr656",
"field": "conservationStatusAUS_s",
"term": "conservationStatusAUS",
"label": "AUS"
},
{
"uid": "dr655",
"field": "conservationStatusVIC_s",
"term": "conservationStatusVIC",
"label": "VIC",
"sourceField": "statusName",
"kingdomField": "kgm"
}
]
}
The uid
supplies the list identifier.
The field
supplies the solr field which will be used to store the conservation status.
The term
supplies the name of the status field.
label
gives the label to apply to the conservation status.
sourceField
gives the name of the field that contains the conservation status.
kingdomField
gives the name of the field that contains the kingdom -- handy for name lookups, if available.
To use all species lists that are recorded as both authoritative and threatened, have lists
as an empty array. These
lists must have a status
column indicating the conservation status.
Calculating weights for search and autosuggest operations gets rather complicated, score-calculated weights for seach operations are built into each document during the import process.
Anything with an idxtype field is annotated with weights.
The weighting rules come from a configuration file, which defaults to
default-weights.json and which can be
set by import.weightConfigUrl
in the configuration.
An example set of weighting rules is
{
"script": "nashorn",
"global": {
"rules": [
{
"term": "taxonomicStatus",
"exists": true,
"rules": [
{
"value": "accepted",
"weight": 2.0
},
{
"value": "misapplied",
"weight": 0.5
}
]
},
]
},
"weights": [
{
"field": "searchWeight",
"rules": []
},
{
"field": "suggestWeight",
"rules": [
{
"term": "scientificName",
"exists": true,
"condition": "_value.length() > 4",
"weightExpression": "_weight * 1.0 / (1.0 + Math.log(_value.length() * 0.01 + 1.0))",
"comment": "The longer the name, the less it should be suggested. Mean name length is 16"
}
]
}
]
}
The top level contains the following fields:
- script The name of the script engine for evaluating condition and weightExpression rules. By default, this is the nashorn javascript engine.
- global Rules to apply to all weights.
- weights Specific weight fields. These have their own field names and rules that are applied after the global rules.
Rules consist of the following entries:
- term The term this rule applies to. The term may be absent, unless the exists field is set to true.
- exists Ensures that the term either exists or does not exist. If absent, then the rule applies wether the term exists or not, although most rules will not trigger on an empty value.
- value The value required for the rule to trigger (case sensitive for string values). If the value is a list, then any matching term will trigger the rule.
- match A regular expression to match the term against. If the value is a list, then any matching term will trigger the rule.
- condition A script expression to test. The script is in whatever scripting language
is used and must return a boolean value. Any term from the input document
can be used in the expression (eg.
kinmgdom == 'Plantae'
). If there is a term supplied then the value is supplied to the script as_value
. If the value is a list, then any matching term will trigger the rule. - weight The weight to multiply the existing weight by.
- weightExpression A script expression to calculate the weight, in
a similar manner to condition The returned value does not necessarily
multiply the existing weight, it needs to be explicitly included, as
_weight
so that you can do clever things with the weight value. - rules Sub-rules. These inherit things like term from their parent and only trigger if the parent conditions are true. The weight adjustments occur after the application of any parent adjustment. Note that sub-rules can refer to different terms, creating and and-condition.
- comment If you want to document something.
As an example, the above rules, applied to the input
[idxtype:'TAXON', taxonomicStatus:'misapplied', scientificName:'Atrobucca brevis']
and with a start value of 1.0 would give.
- Global rules weight = 1.0 * 0.5 = 0.5 for a taxonomicStatus of misapplied.
- searchWeight = 0.5
- suggestWeight = 0.5 * (1 / (1 + ln(0.15 + 1))) = 0.4387, since the length of the scientificName is greater than 4.
The favourites function allows lists from the lists tool to be used to mark taxa or
common names as having a "favourite" status.
The favourite status is a term, such as preferred
or iconic
that can be used to
mark entries for faceting and weight calculation.
The favourites configuation comes from a configuration file, which defaults to default-favourites.json and which
can be set by import.favouritesConfigUrl
in the configuration.
An example favourites configuration is:
{
"defaultTerm": "favourite",
"lists": [
{
"uid": "dr4778",
"termField": "favourite",
"defaultTerm": "interest"
},
{
"uid": "dr781",
"defaultTerm": "iconic"
}
]
}
The top level contains the following entries:
- defaultTerm The term to use by default for any list where the favourite term is not specified.
- lists The lists, in the lists tool, that contain favourite lists.
Each list can contain
- uid (required) The UID of the list
- termField If the list contains per-entry terms, the field which contains the term. Defaults to none.
- defaultTerm The term to use if one is not specified for the entry. Defaults to the global default term.
Favourites only mark selected taxa and their associated common names with favourite terms. Once marked, it is up to the bie-plugin otr weighting rules to make use of these terms.