Development documentation of lessons learned

Ming Chen edited this page Dec 15, 2016 · 14 revisions

Developers and PIs - Please use this page to record the development process from a research perspective. Specifically, what decisions did you make along the way and why. This can be formal or informal, but it is crucial to capture the information WHILE decisions are being made. Don't be shy!

Please enter content under the appropriate sections below, and create your own sub-sections for specific content. Feel free to make new sections if those below don't fit.

Data Model

Model simplification

The original model separated processes, data (information content entities), material entities, and projects. We still keep that separation, but the original model also had a distinct object for each kind of process or data. Over time this was simplified, so that we now have a much simpler model in which processes have inputs and outputs, all of which are part of a project.
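As a rough illustration of the simplified shape described above, here is a minimal sketch in plain Python. The class and field names (Project, Entity, Process, kind) are hypothetical and are not taken from the portal code, and the check that inputs and outputs belong to the same project as the process is one assumed reading of "all of which are part of a project".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Project:
    title: str


@dataclass
class Entity:
    """Anything a process can consume or produce (data or a material entity)."""
    name: str
    project: Project


@dataclass
class Process:
    """One generic process type; the kind is an attribute, not a subclass."""
    kind: str
    project: Project
    inputs: List[Entity] = field(default_factory=list)
    outputs: List[Entity] = field(default_factory=list)

    def _check_project(self, entity: Entity) -> None:
        # Assumed rule: inputs and outputs live in the process's own project.
        if entity.project != self.project:
            raise ValueError("entity belongs to a different project")

    def add_input(self, entity: Entity) -> None:
        self._check_project(entity)
        self.inputs.append(entity)

    def add_output(self, entity: Entity) -> None:
        self._check_project(entity)
        self.outputs.append(entity)
```

With this shape, adding a new kind of process means choosing a new kind value rather than adding a new class.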

Datasets

We needed to add a new object to the model for "dataset". A dataset is a subclass of data which may contain a single data file or multiple files.

A dataset only contains files, and all files must be part of the same project.
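The two dataset rules (files only, and a single project) could be enforced roughly as follows. This is only a sketch: DataFile, Dataset, and project_id are illustrative names, not the portal's actual classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str
    project_id: str


@dataclass
class Dataset:
    name: str
    project_id: str
    files: List[DataFile] = field(default_factory=list)

    def add_file(self, item: DataFile) -> None:
        # Rule 1: a dataset only contains files.
        if not isinstance(item, DataFile):
            raise TypeError("a dataset can only contain files")
        # Rule 2: every file must belong to the dataset's project.
        if item.project_id != self.project_id:
            raise ValueError("file belongs to a different project")
        self.files.append(item)
```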

Web Portal

Issue 1: Bulk Registration (Solved)

In scenario 4, when a user registers a large number of specimens/probes, it can take a long time for the system to register each metadata instance on the Agave system. We wanted an asynchronous workflow so that the user does not have to wait until the web portal finishes its work.

Agave app vs. Celery

We originally planned to develop an Agave app to do the bulk registration work and notify the system when the work is done. This solution turned out not to be ideal for the following reasons:

  • In order to specify the input for the Agave app, we would need to transmit all the metadata over the network, which is very costly.
  • Within an Agave app, it is not easy to register metadata (the input needs to be formatted as JSON). Otherwise, we might end up creating an Agave client within an Agave app, as in the very old version of the checksum app. Also, we have already implemented the data model for the web portal, so it makes much more sense to use it directly.
  • Unlike the checksum app, registering bulk metadata does not involve lots of disk usage.

Celery is well suited to building a queueing system in which workers finish assigned tasks. It can use different backends as the broker; by default we use RabbitMQ. Celery can run tasks in parallel: you can specify how many workers you want and how tasks are queued and assigned to each worker. Celery is also very compatible with our existing Django/Python framework, which makes it a good fit in our case.

Angular

Angular is Google's front-end framework for web applications. It is used in this project primarily to manage the front-end code in a cleaner and more efficient way. In particular, Angular handles modularization and dependency injection for the ids portal well. The basic Angular modules in the ids project are listed below.

app module (parent module)

  • data module
  • datasets module
  • layout module
  • probes module
  • processes module
  • projects module
  • specimen module

Each module has corresponding services and controllers that retrieve data from the back-end and control how it is rendered in the front-end.

Pagination: A pagination app is used in the front-end to navigate and access bulk metadata from the back-end. Generally, the pagination app stays in sync with the back-end Agave service on the page number. By default, 10 items are shown per page in the front-end, and the app always caches the next 100 items to speed up performance.
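The page-plus-prefetch idea can be sketched independently of Angular. The Python sketch below is one possible reading of the behavior described above, not the front-end code itself: PagedCache and the fetch(offset, limit) callable are hypothetical, and the prefetch is triggered only on a cache miss.

```python
from typing import Callable, Dict, List, Set

PAGE_SIZE = 10   # items shown per page
PREFETCH = 100   # extra items fetched ahead of the requested page


class PagedCache:
    def __init__(self, fetch: Callable[[int, int], List[dict]]):
        self._fetch = fetch                 # back-end call: (offset, limit) -> items
        self._items: Dict[int, dict] = {}   # absolute index -> item
        self._covered: Set[int] = set()     # indices already requested

    def _load(self, offset: int, limit: int) -> None:
        for i, item in enumerate(self._fetch(offset, limit)):
            self._items[offset + i] = item
        # Remember the whole requested range, even past the end of the data,
        # so a short last page is not fetched again.
        self._covered.update(range(offset, offset + limit))

    def page(self, number: int) -> List[dict]:
        """Return the items for a 1-based page number."""
        start = (number - 1) * PAGE_SIZE
        wanted = range(start, start + PAGE_SIZE)
        if not self._covered.issuperset(wanted):
            self._load(start, PAGE_SIZE + PREFETCH)
        return [self._items[i] for i in wanted if i in self._items]
```

With these numbers, one back-end request serves the current page and the ten pages after it.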

TODO: The pagination app is currently used in the project view for navigating bulk probes. Since pagination is also commonly needed when viewing specimens, processes, or other bulk metadata, it would be good to package the pagination app as an Angular directive so that it becomes plug and play. To implement that, you would need to make it a self-contained module with its own controller and services for retrieving data. One tricky point is that, depending on the implementation, you might want to isolate the directive's scope from the outside front-end and from the inner controllers/services. Reference: https://docs.angularjs.org/guide/directive

TODO: Routing for the original views needs to be updated so that the new Angular UI is used when viewing the different entities.

Agave API

Issue 1: Count of metadata (solved)

When a large amount of metadata is registered in the Agave system, it is hard to get the count or other aggregate measures over a large set of metadata. Agave provides search/query on metadata but no aggregate functions. When a portal user searches a set of metadata, the number of matching results might exceed the limit (100 by default), but we cannot tell whether there are exactly 100 matches or more than 100.

Workaround: With the Agave CLI, paginating with a combination of -o and -l gives the count of a search. Agavepy should have a similar parameter setup.
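The counting workaround amounts to paging through the results with an offset and a limit until a short page comes back. A minimal sketch follows, where search is a hypothetical callable that wraps whatever client is in use (the Agave CLI or Agavepy) and accepts offset and limit keyword arguments. The exact Agavepy call is not shown because it has not been verified here.

```python
def count_results(search, page_size=100):
    """Count all matches by paging until a page comes back short.

    `search` is an assumed callable: search(offset=..., limit=...) -> list.
    """
    total = 0
    offset = 0
    while True:
        batch = search(offset=offset, limit=page_size)
        total += len(batch)
        if len(batch) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            return total
        offset += page_size
```

This costs one request per 100 results, so it answers the "exactly 100 or more than 100" question at the price of extra round trips for large result sets.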

Followup: Since each specimen and probe is by nature treated as metadata, only the first 100 results are returned when requesting project.parts. So even if the project has a specimen as a part, the specimen will not be shown on the page if it falls outside those first 100 results.

Issue 2: Navigate/Search in viewing large number of metadata (solved)

Once bulk specimens/probes have been registered in the agave system, we need a new front-end interface for users to access the registered metadata. The original tree structure hierarchy would not look so nice when there are tons of metadata under a particular project.

Thoughts: Since we don't have to provide a whole list of metadata, a search interface might be helpful to locate a specific metadata item. Based on Issue 1, returning a whole list of metadata can be costly.

Location check

We have three types of source for the files on ids: SRA, Agave URL, and external URL. For an SRA file, we check its location by pinging the public web interface provided by NCBI. The general URL prefix for the web API is https://www.ncbi.nlm.nih.gov/public/?/ftp/sra/sra-instant/reads/ByRun/sra/ followed by a suffix composed from the SRA number, for example, SRR/SRR292/SRR292241/SRR292241.sra.
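Going by the single example above, the suffix appears to be built from the first three characters of the run accession, then the first six, then the full accession. The sketch below assumes that pattern holds for other accessions as well. The function name sra_url is illustrative.

```python
SRA_PREFIX = (
    "https://www.ncbi.nlm.nih.gov/public/?/ftp/sra/sra-instant/reads/ByRun/sra/"
)


def sra_url(accession):
    """Build the SRA file URL for a run accession such as 'SRR292241'.

    Assumed layout, inferred from one example:
    <first 3 chars>/<first 6 chars>/<accession>/<accession>.sra
    """
    return "{0}{1}/{2}/{3}/{3}.sra".format(
        SRA_PREFIX, accession[:3], accession[:6], accession
    )
```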

For files on Agave, we use Agave's files-list API to check whether the file still exists on Agave. The system id and Agave URL are needed to check the location.

For external files, we simply ping the URL to get the file. Therefore, the URL provided should directly take you to download the file.
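A URL check of this kind could look roughly like the sketch below, which uses only the Python standard library. It sends a HEAD request rather than downloading the file, which is an assumption about what "ping" means here. The function name and the injectable opener parameter are illustrative, and some servers may not answer HEAD requests.

```python
import urllib.request


def url_is_reachable(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the URL answers a HEAD request with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs.
        return False
```

The opener parameter exists only so the check can be exercised without network access. In normal use it is left at its default.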