The HuBMAP Search API is a thin wrapper around Elasticsearch. It handles data indexing and reindexing into the backend Elasticsearch instance. It also accepts search queries and passes them through to Elasticsearch after a data access security check.
The API documentation is available on SmartAPI at https://smart-api.info/ui/7aaf02b838022d564da776b03f357158
This repository relies on the search-adaptor submodule to function. The file .gitmodules contains the URL and the specific branch of the submodule to use. After cloning this repository and switching to the target branch, load the latest search-adaptor submodule:
git submodule update --init --remote
Front end developers who need to work on the portal index should start in the addl_index_transformations/portal subdirectory.
After checking out the repo, installing the dependencies, and starting a local Elasticsearch instance, tests should pass:
pip install -r src/requirements.txt
pip install -r src/requirements-dev.txt
# on mac:
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full
## On macOS 13, Elasticsearch is not compatible with the default JDK. To work around this, install openjdk and disable the machine learning functionality.
brew install openjdk
echo 'export ES_JAVA_HOME="/opt/homebrew/opt/openjdk"' >> ~/.zshrc
echo 'xpack.ml.enabled: false' >> /opt/homebrew/etc/elasticsearch/elasticsearch.yml
elasticsearch & # Wait for it to start...
./test.sh
- Make new feature or bug fix branches from the main branch (the default branch).
- Make PRs to main.
- As a codeowner, Zhou (github username yuanzhou) is automatically assigned as a reviewer to each PR. When all other reviewers have approved, he will approve as well, merge to TEST infrastructure, and redeploy and reindex the TEST instance.
- A developer or someone on the team who is familiar with the change will test/qa the change.
- When any current changes in main have been approved after test/qa on TEST, Zhou will release to PROD using the same docker image that has been tested on TEST infrastructure.
- Make new feature branches off the main branch (the default branch).
- Make PRs to dev-integrate.
- As a codeowner, Zhou (github username yuanzhou) is automatically assigned as a reviewer to each PR. When all other reviewers have approved, he will approve as well, merge to dev-integrate, and redeploy and reindex the DEV instance.
- When a feature branch is ready for testing and release, Zhou will make a PR to main for testing on the TEST infrastructure as above.
The search-api base URL for each deployment environment:
- DEV: https://search-api.dev.hubmapconsortium.org
- TEST: https://search-api.test.hubmapconsortium.org
- PROD: https://search.api.hubmapconsortium.org
This endpoint returns a list of supported indices. No globus token is required to make the request.
GET /indices
The Authorization header with globus token is optional
POST /search
The Authorization header with globus token is optional
POST /<index>/search
Because of data access restrictions, indexed entries are protected. Calls to the above endpoints take the search query as a JSON body and, to reach non-public data, the Authorization header with a Bearer token (globus nexus token). There are four cases when making a search call:
- Case #1: The Authorization header is missing. The call defaults to the entities index with only public data entries.
- Case #2: The Authorization header carries a valid token, but the member doesn't belong to the HuBMAP-Read group. The call is directed to the entities index with only public data entries.
- Case #3: The Authorization header is present but the token is invalid or expired. The API returns 401 (if someone is sending a token, they might be expecting more than public stuff).
- Case #4: The Authorization header is present with a valid token that has the group access. All of the user-specified search query DSL (Domain Specific Language) detail is passed to Elasticsearch, just like making queries against Elasticsearch directly.
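The four cases above can be sketched as a small decision function. This is only an illustration, not the Search API's actual implementation: the function name `resolve_access`, its parameters, and the returned labels are hypothetical, and token validation and group lookup are assumed to have happened elsewhere.

```python
from typing import Optional, Tuple


def resolve_access(has_auth_header: bool,
                   token_valid: Optional[bool] = None,
                   in_read_group: bool = False) -> Tuple[int, str]:
    """Return (HTTP status, access scope) for a search call.

    All names are hypothetical; only the four cases come from the text.
    """
    # Case #1: no Authorization header, so only public entries are searched.
    if not has_auth_header:
        return 200, "public"
    # Case #3: header present but the token is invalid or expired.
    if not token_valid:
        return 401, "denied"
    # Case #2: valid token, but no HuBMAP-Read group membership.
    if not in_read_group:
        return 200, "public"
    # Case #4: valid token with group access, query is passed through.
    return 200, "full"
```

Checking the header first and the token second mirrors the stated rationale for Case #3: a caller who sends a token gets a 401 rather than a silent fallback to public data.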
NOTE: currently, the Search API doesn't support comma-separated list or wildcard expression of index names in the URL path used to limit the request.
Similar to making a request against /search but for getting the count:
GET /count
Similar to making a request against /<index>/search but for getting the count:
GET /<index>/count
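The search and count paths share the same shape, with an optional index segment in front. A minimal sketch of building them follows; `endpoint_url` is a hypothetical helper, not part of the Search API or any client library, and the check for commas and wildcards simply reflects the NOTE above that such index expressions are not supported in the URL path.

```python
from typing import Optional


def endpoint_url(base_url: str, action: str, index: Optional[str] = None) -> str:
    """Build a /search or /count URL, optionally scoped to a single index."""
    if action not in ("search", "count"):
        raise ValueError(f"unsupported action: {action}")
    base = base_url.rstrip("/")
    if index is None:
        # No index given: the API falls back to its default index.
        return f"{base}/{action}"
    # Comma-separated lists and wildcard expressions are not supported.
    if "," in index or "*" in index:
        raise ValueError("index must be a single, literal index name")
    return f"{base}/{index}/{action}"
```

For example, `endpoint_url("https://search-api.dev.hubmapconsortium.org", "count", "entities")` yields the DEV URL ending in `/entities/count`.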
{
  "query": {
    "match": {
      "uuid": "4cac248a51b6767e029663b273e7a8b2"
    }
  }
}
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "donor.group_name": "Vanderbilt TMC"
          }
        }
      ],
      "filter": [
        {
          "match": {
            "entity_type.keyword": "Sample"
          }
        }
      ]
    }
  }
}
{
  "aggs": {
    "created_by_user_displayname": {
      "filter": {
        "term": {
          "entity_type.keyword": "Dataset"
        }
      }
    }
  }
}
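import requests

# Placeholders: supply a real entity uuid and globus nexus token.
uuid = '4cac248a51b6767e029663b273e7a8b2'
nexus_token = '<globus-nexus-token>'

# Match a single entity by its uuid.
query_dict = {
    'query': {
        'match': {
            'uuid': uuid
        }
    }
}
response = requests.post(
    'https://search-api.dev.hubmapconsortium.org/search',
    json=query_dict,
    headers={'Authorization': 'Bearer ' + nexus_token})
# Raise on error responses such as 401 for an invalid or expired token.
response.raise_for_status()
hits = response.json()['hits']['hits']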
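Each entry in the `hits` list wraps the indexed document under the standard Elasticsearch `_source` key. As a small sketch of unpacking a response like the one above, the hypothetical helper `extract_sources` below pulls out just the documents; the sample response is made up for illustration and is not real API output.

```python
def extract_sources(response_json: dict) -> list:
    """Return the _source document of every hit in a search response."""
    hits = response_json.get("hits", {}).get("hits", [])
    # Hits without _source (e.g. when source filtering is off) are skipped.
    return [hit["_source"] for hit in hits if "_source" in hit]


# Made-up response in the usual Elasticsearch shape, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"uuid": "4cac248a51b6767e029663b273e7a8b2"}},
            {"_id": "2"},
        ]
    }
}
documents = extract_sources(sample_response)
```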
All configuration options reside in the src/instance/search-config.yaml file, which lets you specify the configuration for each index that will be available via the search-api. This file supports any number of index configurations.
To get started, copy src/instance/search-config.yaml.example and use it as a template to define your index. Here is a sample of a defined index; the options are explained in further detail below:
default_index: my-index
indices:
  my-index:
    active: true
    public: my-index-public
    private: my-index-private
    document_source_endpoint: https://my-document-base
    elasticsearch:
      url: https://localhost:9200
      mappings: "default-config.yaml"
Options
default_index: [index name]
If you have multiple indices, you need to specify which of them is the default. A call to the base /search endpoint that does not specify an index name will use this index. This should be specified even for single-index definitions.
indices: all index definitions start after this declaration
active: [true or false] - designates whether the index is active (true) or inactive (false)
public: [index name] - specifies an index that contains data viewable by non-authenticated users or used by public-facing endpoints
private: [index name] - specifies an index that contains private data viewable only by a certain group or consortium, or more specifically by authenticated users
Note: the public and private indices should be the same index name if you only have a single index.
document_source_endpoint: [url] (optional) - configures a document source (i.e., entities) that the indexer uses to populate your index from an alternate document store
elasticsearch: configuration specific to Elasticsearch starts after this declaration
url: [url] - URL of the server hosting Elasticsearch
mappings: [file] - specifies a file that contains index settings or mappings (e.g., mapping.total_fields.limit: 5000) specific to Elasticsearch; see the Elasticsearch index settings documentation. The default settings are located in elasticsearch/search-default-config.yaml. This file is used during the index creation process, before data is ingested by the indexer.
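To illustrate how default_index, active, public, and private fit together, here is a minimal sketch that picks a concrete index name from an already parsed configuration. The helper `resolve_index` and its `authenticated` flag are hypothetical and not taken from the search-api code base; the dict mirrors the sample configuration as it might look after being loaded by a YAML parser.

```python
from typing import Optional

# The sample configuration, assumed to be already parsed into a dict.
SAMPLE_CONFIG = {
    "default_index": "my-index",
    "indices": {
        "my-index": {
            "active": True,
            "public": "my-index-public",
            "private": "my-index-private",
        }
    },
}


def resolve_index(config: dict, name: Optional[str] = None,
                  authenticated: bool = False) -> str:
    """Pick the concrete Elasticsearch index for a request (hypothetical)."""
    # Fall back to default_index when the caller names no index.
    name = name or config["default_index"]
    entry = config["indices"][name]  # KeyError if the index is not configured
    if not entry.get("active", False):
        raise ValueError(f"index {name!r} is not active")
    # Authenticated callers see the private index, everyone else the public one.
    return entry["private"] if authenticated else entry["public"]
```

With a single physical index, setting public and private to the same name makes both branches return that name.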
There are a few configurable environment variables to keep in mind:
- COMMONS_BRANCH: build argument only to be used during image creation when we need to use a branch of commons from github rather than the published PyPI package. Defaults to the master branch if not set or null.
- HOST_UID: the user id on the host machine to be mapped to the container. Defaults to 1001 if not set or null.
- HOST_GID: the user's group id on the host machine to be mapped to the container. Defaults to 1001 if not set or null.
We can set and verify the environment variable like below:
export COMMONS_BRANCH=master
echo $COMMONS_BRANCH
Note: Environment variables set like this are only stored temporarily. When you end the running bash session by exiting the terminal, they are discarded. So before rebuilding the docker image, make sure to set the environment variables again if necessary.
cd docker
./docker-development.sh [check|config|build|start|stop|down]
On TEST/STAGE/PROD environments, we use the same published docker image from DockerHub for deployment rather than building a new image.
cd docker
./docker-deployment.sh [start|stop|down]
For the Release Candidate (RC) instance, use a separate script:
./docker-rc.sh [start|stop|down]
The documentation for the API calls is hosted on SmartAPI. Modifying the search-api-spec.yaml file and committing the changes to github should update the API shown on SmartAPI. SmartAPI allows users to register API documents. The documentation is associated with this github account: [email protected].