
Service Hoarder

Carlos Badenes edited this page Mar 31, 2016 · 1 revision

This module is responsible for adding, updating, or removing documents, as well as their internal items, from sources previously added to the system by the api module. To do so, it listens for events published on the event-bus with a routing-key that matches source.#.
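To make the source.# pattern concrete, here is a minimal sketch of topic matching. It assumes the event-bus follows AMQP-style topic conventions (words separated by dots, `*` for exactly one word, `#` for zero or more words), which this page does not state explicitly; the function name `topic_matches` is hypothetical.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if routing_key matches an AMQP-style topic pattern.

    Assumption: '*' matches exactly one word, '#' matches zero or more words.
    """
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more of the remaining words.
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j < len(k) and p[i] in ("*", k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


if __name__ == "__main__":
    for key in ("source.new", "source.opened", "document.new"):
        print(key, topic_matches("source.#", key))
```

Under these assumptions, source.new, source.opened, and source.deleted all match, while document.new does not.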

Multiple threads continuously look for changes in a list of valid sources. These sources can be remote web services that publish research content under the OAI specification or in a Web Feed format (e.g. RSS, Atom). They can also be journal websites such as Elsevier or Springer, or even Wikipedia. Moreover, anyone can manually publish their own content by using our API.

[Figure: hoarder]

This module may work in synchronous or asynchronous mode:

  • In asynchronous mode, it polls the source periodically for new content. When a new resource is detected, it creates a new document from the resource's meta-information and downloads any related files. For each file, it creates a new item, which is published to the event-bus to be processed by the harvester module. Each event-message contains the uri of the document or the item, and is published only once its reference information has been stored in the internal storage. This mode is typical for sources associated with publishing services, e.g. Elsevier, Springer, RSS servers, LAN folders.

  • In synchronous mode, it requests the data only once, when the new source is added. After that, no more requests are made to that source. This mode is typical for sources associated with remote files, e.g. a zip file in Dropbox or a LAN folder.
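The difference between the two modes can be sketched as follows. This is only an illustration, not the module's actual code: `hoard`, `fetch`, and `max_polls` are hypothetical names, the poll count is bounded so the example terminates, and the waiting time between polls is omitted.

```python
from typing import Callable, Iterable, List, Set


def hoard(fetch: Callable[[], Iterable[str]], periodic: bool,
          max_polls: int = 3) -> List[str]:
    """Return the ids of resources detected as new, in detection order.

    periodic=True  -> asynchronous mode: the source is polled repeatedly.
    periodic=False -> synchronous mode: the source is requested only once.
    """
    seen: Set[str] = set()
    detected: List[str] = []
    polls = max_polls if periodic else 1
    for _ in range(polls):
        for resource_id in fetch():
            if resource_id not in seen:
                # A new resource: the real module would create the document
                # and its items here, then publish the events.
                seen.add(resource_id)
                detected.append(resource_id)
        # A real poller would wait here before the next request.
    return detected


if __name__ == "__main__":
    batches = iter([["a"], ["a", "b"], ["b", "c"]])
    print(hoard(lambda: next(batches), periodic=True))
```

Keeping a set of already-seen ids is one simple way to avoid creating the same document twice across polls; the real module may detect changes differently.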

At this stage, the resource has not been processed yet. For each resource, one event-message is published with the uri of the document, and zero or more messages are published with the uris of the items associated with the document's related files. The former are published to the channel (i.e. with the routing-key) document.<action> and the latter to item.<action>, where action is new when the resource was created, deleted when it was removed from the source, or opened when it was modified.
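As a rough sketch of the messages produced for a single resource, the helper below builds one document event followed by one event per item. The function name `events_for` and the representation of an event as a (routing-key, uri) pair are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, str]  # (routing-key, uri)


def events_for(document_uri: str, item_uris: Sequence[str],
               action: str = "new") -> List[Event]:
    """Build the events for one resource: one for the document,
    then zero or more for the items of its related files."""
    events: List[Event] = [(f"document.{action}", document_uri)]
    events += [(f"item.{action}", uri) for uri in item_uris]
    return events


if __name__ == "__main__":
    for routing_key, uri in events_for("documents/oa-upm-17",
                                       ["items/17-eciencia"]):
        print(routing_key, uri)
```

A resource without related files yields only the document event.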

Master-Slave Mode

When this module is deployed in a cluster, it must operate in a master-slave way, because each remote service should be accessed by only one instance at a time to avoid retrieving duplicate resources. This requires coordinating all the instances deployed in the environment, for which we use Apache ZooKeeper:

"... A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea is to have a znode, say "/election", such that each znode creates a child znode "/election/guid-n_" with both flags SEQUENCE|EPHEMERAL. With the sequence flag, ZooKeeper automatically appends a sequence number that is greater than any one previously appended to a child of "/election". The process that created the znode with the smallest appended sequence number is the leader. ..." [zookeeper recipes]
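The selection rule in the recipe can be illustrated without a running ZooKeeper ensemble. The sketch below only models the last step (picking the child with the smallest sequence number); the znode names are made up, and in a real deployment the list of children would be read from ZooKeeper under "/election".

```python
from typing import Iterable


def sequence_of(znode: str) -> int:
    """Extract the sequence number ZooKeeper appends after 'n_'."""
    return int(znode.rsplit("_", 1)[1])


def elect_leader(children: Iterable[str]) -> str:
    """Return the child znode with the smallest sequence number."""
    candidates = list(children)
    if not candidates:
        raise ValueError("no candidates registered under /election")
    return min(candidates, key=sequence_of)


if __name__ == "__main__":
    # Hypothetical children of /election, one per hoarder instance.
    children = ["c3-n_0000000007", "a1-n_0000000002", "b2-n_0000000010"]
    print(elect_leader(children))
```

Because the znodes are ephemeral, the leader's znode disappears if its instance dies, and running the same rule over the remaining children picks the next leader.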

[Figure: hoarder-cluster]

Routing-keys

These are the routing-keys from the event-bus that this module is listening for or publishing to:

Listen for

  • source.{new|opened|deleted}

Publish to

  • source.{closed}
  • document.{new|opened}
  • item.{new}

Use-Case

Following the use-case started in the api module, the hoarder module receives an event, reads the source from the internal storage, and composes a valid OAI-PMH request to get the list of resources for that specific time interval.
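As a hedged illustration of what such a request might look like, the sketch below builds an OAI-PMH ListRecords URL with `from` and `until` date arguments. The base URL, the dates, and the choice of the `oai_dc` metadata prefix are assumptions for the example; the page does not say which verb or prefix the hoarder actually uses.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, from_date: str, until_date: str,
                     metadata_prefix: str = "oai_dc") -> str:
    """Compose an OAI-PMH ListRecords request for a time interval."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,    # lower bound of the interval
        "until": until_date,  # upper bound of the interval
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical repository endpoint and interval.
    print(list_records_url("http://example.org/oai", "2016-01-01", "2016-03-31"))
```

A full harvester would also need to follow the resumptionToken returned by the repository when the result list is split across several responses.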

For each OAI record received, e.g. oai:oa.upm.es:17, it creates a new document with the received meta-information and stores it in the internal storage. After that, a new event is published to the channel document.new with the uri documents/oa-upm-17.

The OAI record may contain related files. For each related file, e.g. http://oa.upm.es/17/1/eciencia_upm.pdf, the hoarder module tries to download it and creates a new item associated with that document, using the local path of the downloaded file as the item's url.

Finally, the module stores each item in the internal storage and publishes an event to the channel item.new with the associated uri, e.g. items/17-eciencia.

[Figure: hoarder-use-case]
