

Goal

Phase 1 focuses on the development of the import pipeline for cpath^2. Data will be fetched in a semi-automated manner, pushed through a "Pre Merge" pipeline, and then imported into a central database.

Metadata (including provider name, URL to the data, icon, version, and release date) will be placed on a wiki page where it can be easily updated by a content manager. The code fetches this metadata on a regular basis to determine which data sources are to be imported and where the data is located. With this metadata, pathway data is fetched and pushed into the Pre Merge pipeline.

Within the Pre Merge pipeline, data will be preprocessed, converted if necessary (e.g., from PSI-MI), normalized, validated, and finally persisted to await import. Note that the original data will be preserved before being pushed into the pipeline.

Preprocessing entails everything we currently do via Python scripts that modify the original data: updating refType values, removing short labels, or replacing names with synonyms.

Normalizing will ensure that all controlled vocabulary (CV) and entity reference (ER) type elements (i.e., protein and small molecule types) have, or receive, a standard unique resource identifier as defined by the Miriam standard. In particular, Miriam defines URIs for many external resources (databases, OBO ontologies, etc.). It also provides annotation that helps resolve a URN to URLs, find a URI by a data source's name, synonym, or deprecated ID, and validate the format of the resource-specific internal identifiers. For example, the official URI for the UniProt data source is "urn:miriam:uniprot", and a protein is referred to as "urn:miriam:uniprot:P62158". So, if an ER or CV does not have such a URI, its RDF ID will be reset: one of its unification xrefs will be used to create the new URN. This is not required to produce the primary ID right away; if needed, converting to the best URI will be done during the merging step.
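
For illustration only, a minimal sketch of building such a URN from a unification xref's db/id pair. In cpath2 the data source URN resolution is delegated to miriam-lib (which also handles synonyms and deprecated names); the hard-coded map below is a hypothetical stand-in:

import java.util.HashMap;
import java.util.Map;

public class UrnSketch {
    // Hypothetical db-name -> Miriam URN lookup; the real resolution
    // comes from miriam-lib, not a static map.
    private static final Map<String, String> DATASOURCE_URNS = new HashMap<String, String>();
    static {
        DATASOURCE_URNS.put("uniprot", "urn:miriam:uniprot");
        DATASOURCE_URNS.put("chebi", "urn:miriam:chebi");
    }

    /** Builds, e.g., "urn:miriam:uniprot:P62158" from db="UniProt", id="P62158". */
    public static String buildUrn(String db, String id) {
        String dsUrn = DATASOURCE_URNS.get(db.toLowerCase());
        if (dsUrn == null)
            throw new IllegalArgumentException("unknown data source: " + db);
        return dsUrn + ":" + id;
    }
}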

A validating component will check the (pre-processed and normalized) BioPAX model using the BioPAX Validator and generate a success/failure result as well as summary (stats) data; it may also try to auto-fix some errors (the BioPAX Validator has to support this). The validation result can be used for the go/no-go decision and for data provider feedback.
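
A hedged sketch of that go/no-go decision, using a hypothetical ValidationReport holder (the actual BioPAX Validator result types differ; the names here are made up for illustration):

public class ValidationGate {

    /** Hypothetical summary of one data source's validation run. */
    public static class ValidationReport {
        final int errorCount; // errors that were not auto-fixed
        final String summary; // human-readable stats, for provider feedback
        ValidationReport(int errorCount, String summary) {
            this.errorCount = errorCount;
            this.summary = summary;
        }
    }

    /** Go/no-go: only error-free (or fully auto-fixed) data proceeds to the merge step. */
    public static boolean proceedToMerge(ValidationReport report) {
        return report.errorCount == 0;
    }
}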

After validation, each data source will be persisted in a database where it awaits import. During import, pathway data is combined following a strict set of rules and then persisted in the central PC database.

Requirements

User Requirements

  • "Protein Reference", "Small Molecule Reference " and "Controlled Vocabulary" objects should have normalized IDs.
  • If two pathway data sources contain objects from these three classes with the same normalized IDs, the redundancy should be resolved in a first come is kept basis. Only the object links from the second database is transferred.
  • Controlled Vocabularies without unification xrefs should be looked up from the individual databases.
  • Each data source potentially requires "cooking".
  • Original, precooked data should be made accessible.
  • Entities with no interactions should still provide a hit upon querying to ensure user that we know the entity they are talking about but simply have no interaction information about it.
  • Controlled vocabulary hierarchy should be kept in the backend for future querying options.

System Requirements

  • The deployment/development cycle should be as fast as possible.
  • The bulk of the validation and "cooking" of individual data sources should be done once per release and must be decoupled from the regular deployment/development cycle.
  • Updates of data sources should preferably be performed automatically as background tasks.
  • All OWL files to be imported have gone through a pre-validation stage with the data provider.
  • Do not import entities for all proteins and small molecules, for performance reasons; import them on demand. To satisfy user requirement 2, provide a secondary querying facility where a missed query is run against the warehouse of proteins and small molecules (see the sketch after this list).
  • During import, it is assumed that if protein A (note: that is a protein reference, i.e., a class of proteins) exists in two separate incoming models, the same RDF ID will be used in both models (miriam-uniprot).
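
A hedged sketch of that secondary querying facility, reusing the PaxtoolsDAO interface shown later on this page; the wiring and class name are illustrative:

import org.biopax.paxtools.model.BioPAXElement;

public class FallbackQuery {
    private final PaxtoolsDAO mainDAO;      // central PC database
    private final PaxtoolsDAO warehouseDAO; // proteins/small molecules warehouse

    public FallbackQuery(PaxtoolsDAO mainDAO, PaxtoolsDAO warehouseDAO) {
        this.mainDAO = mainDAO;
        this.warehouseDAO = warehouseDAO;
    }

    /** A miss against the main database is re-run against the warehouse. */
    public BioPAXElement lookup(String uri) {
        BioPAXElement hit = mainDAO.getByID(uri);
        return (hit != null) ? hit : warehouseDAO.getByID(uri);
    }
}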

Design Details

Summary

We want to satisfy the above user and system requirements and also make the system:

  • configurable (with a minimum of configuration files/options that one has to customize to compile/run the application)
  • built on existing out-of-the-box components where they matter (e.g., Paxtools, biopax-validator, org.bridgedb, miriam-lib, etc.)
  • modular (and managed by Maven2)
  • easy to test, and easy to write tests for

Main existing blocks to build our system on:

  • (for sure) Java 6 SE, plus Apache and other open-source projects
  • BioPAX API - Paxtools (2.0-SNAPSHOT; multiple modules: core, biopax model with hibernate annotations, I/O)
  • BioPAX Validator (2.0-SNAPSHOT, components incl.: core, biopax-rules, ontology manager, miriam-lib, and spring xml configuration)
  • RDBMS (currently MySQL5)
  • Hibernate (annotations, entitymanager, search, etc. - the latest builds from the JBoss repository); JPA (java-persistence 2.0-cr-1)
  • Spring Framework 3.0.3.RELEASE: core, context, beans, aop, orm and tx (hibernate and transactions), jdbc, and testing power!

Global Settings

  • CPATH2_HOME environment and JVM variable, e.g., CPATH2_HOME=/Users/joe/cpath2_home
  • cpath.properties (in the $CPATH2_HOME dir)
  • log4j.properties (in the $CPATH2_HOME dir)
  • hibernate.properties (currently in cpath-dao/src/main/resources)
  • Important Constants & Keys
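
To illustrate how these settings fit together, a minimal sketch (standard Java APIs only; the actual cpath2 bootstrap code may differ) that resolves CPATH2_HOME from the JVM system property or the environment and loads cpath.properties from that directory:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class Settings {
    /** Loads $CPATH2_HOME/cpath.properties; fails fast if CPATH2_HOME is unset. */
    public static Properties loadCpathProperties() throws IOException {
        String home = System.getProperty("CPATH2_HOME", System.getenv("CPATH2_HOME"));
        if (home == null)
            throw new IllegalStateException("CPATH2_HOME is not set");
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(new File(home, "cpath.properties"));
        try {
            props.load(in);
        } finally {
            in.close();
        }
        return props;
    }
}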

Spring XML (in general)

A typical application context XML file shows how most of the Hibernate/DAO configuration looks.

There are:

  • a property placeholder configuration that depends on an environment/system variable, e.g.,

<context:property-placeholder location="file:${CPATH2_HOME}/cpath.properties"/>

  • configuration variables that depend on both environment and property variables, e.g.,

<prop key="hibernate.search.default.indexBase">${CPATH2_HOME}/${main.db}</prop> (${main.db} is resolved from the cpath.properties file)

  • to be continued...

Data Source Factory

We have implemented a fantastic DataServicesFactoryBean that allows for:

  • dynamic data sources to be created, using the bean name as a key, and immediately injected into the corresponding session factory
  • connection pooling (c3p0)
  • database and schema (table) drop/create during JUnit tests, by the admin tool, and during premerge (see the sketch below)!
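
A minimal sketch of the dynamic data source idea, using the c3p0 pool directly; the driver class, URL, and credentials are placeholders, and the real DataServicesFactoryBean additionally registers the data source as a Spring bean and wires it into a session factory:

import javax.sql.DataSource;
import com.mchange.v2.c3p0.ComboPooledDataSource;

public class DataSourceSketch {
    /** Creates a pooled MySQL data source for the given database name. */
    public static DataSource createDataSource(String dbName) throws Exception {
        ComboPooledDataSource ds = new ComboPooledDataSource();
        ds.setDriverClass("com.mysql.jdbc.Driver");             // MySQL 5, as above
        ds.setJdbcUrl("jdbc:mysql://localhost:3306/" + dbName); // placeholder URL
        ds.setUser("cpath2");                                   // placeholder
        ds.setPassword("secret");                               // placeholder
        ds.setMaxPoolSize(20);
        return ds;
    }
}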

Using Hibernate

We use Spring (AnnotationSessionFactoryBean) to configure and manage Hibernate sessions, but we use the native Hibernate API (not Spring's HibernateTemplate) to persist domain objects and create queries.

  • Hibernate loads properties from a hibernate.properties file found on its classpath at runtime; these are then overridden, or more properties added, in the session factory's Spring XML configuration.
  • Spring 3.0 allows the packagesToScan property to be set so that Hibernate auto-creates mappings from the annotated classes in the listed Java packages.
  • Session factories use a DataSource and are used by DAO objects and the transaction manager...
  • important property: hibernate.connection.release_mode=after_transaction
  • important property: hibernate.jdbc.batch_size=20 (see the sketch below)
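
To illustrate why hibernate.jdbc.batch_size matters, a hedged sketch of the usual flush/clear pattern that lets Hibernate issue batched JDBC inserts when persisting many BioPAX elements (the entity type and batch handling are illustrative):

import java.util.List;
import org.hibernate.Session;

public class BatchSaver {
    private static final int BATCH_SIZE = 20; // keep in sync with hibernate.jdbc.batch_size

    public void saveAll(Session session, List<?> entities) {
        int i = 0;
        for (Object e : entities) {
            session.save(e);
            if (++i % BATCH_SIZE == 0) {
                session.flush(); // push the batched inserts to JDBC
                session.clear(); // evict persisted objects from the session cache
            }
        }
    }
}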

Using Transactions

We use annotations - the declarative transaction approach (automatic begin, commit, rollback, and connection release), e.g., <tx:annotation-driven transaction-manager="mainTransactionManager"/> (see the sketch below).
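
A minimal sketch of what this looks like in a DAO class (the class and method names are illustrative); Spring wraps the annotated method in a transaction handled by mainTransactionManager:

import org.hibernate.SessionFactory;
import org.springframework.transaction.annotation.Transactional;

public class MainDao {
    private final SessionFactory sessionFactory;

    public MainDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    @Transactional // begin/commit/rollback are handled by Spring, not by this code
    public void importElement(Object bpe) {
        // getCurrentSession() joins the transaction opened by Spring
        sessionFactory.getCurrentSession().save(bpe);
    }
}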

Currently, there is one transaction manager (HibernateTransactionManager) per session factory (database). Some cpath2 classes, e.g., Admin and Merger, have to access more than one database (session) simultaneously.

Issues

There are several issues with multiple transaction managers and session factories:

  • only one <tx:annotation-driven transaction-manager="mainTransactionManager"/>, working with a single session factory, is allowed per Spring context configuration XML file
  • it may be OK when different transaction manager beans use the same session factory (we do not use it this way)
  • it does not work, however (without immediately reporting an exception), when two different Hibernate configurations, each having its own <tx:annotation-driven/> and transaction manager bean, are loaded together into the same application context (a different design is required for this to work; see below...)

Quick Fix

Whenever two or more different databases (session factories) are required, programmatically get each DAO bean from its own ApplicationContext, and never mix two DAO XML context configurations within the same application context, i.e., do not:

ApplicationContext context =
    new ClassPathXmlApplicationContext(new String[] {
        "classpath:applicationContext-whouseProteins.xml",
        "classpath:applicationContext-whouseMolecules.xml"
    });
PaxtoolsDAO proteinsDAO = (PaxtoolsDAO) context.getBean("proteinsDAO");
PaxtoolsDAO smallMoleculesDAO = (PaxtoolsDAO) context.getBean("moleculesDAO");

but use this instead:

ApplicationContext context1 =
    new ClassPathXmlApplicationContext("classpath:applicationContext-whouseProteins.xml");
PaxtoolsDAO proteinsDAO = (PaxtoolsDAO) context1.getBean("proteinsDAO");

ApplicationContext context2 =
    new ClassPathXmlApplicationContext("classpath:applicationContext-whouseMolecules.xml");
PaxtoolsDAO smallMoleculesDAO = (PaxtoolsDAO) context2.getBean("moleculesDAO");

(TODO) avoid the following configuration (it currently creates a transaction manager mess and needs re-factoring):

	<bean id="merge" class="cpath.importer.internal.MergerImpl" scope="prototype">
	  <constructor-arg ref="paxtoolsDAO"/>
	  <constructor-arg ref="metadataDAO"/>
	  <constructor-arg ref="cPathWarehouse"/>
	</bean>

Another quick fix would be storing all the warehouse data (metadata, original pathway data in OWL, small molecules, proteins, and CVs, the latter currently stored in memory) in the same database (a single session factory, though several DAOs are possible). This would resolve 90% of the aforementioned issues and could also make the code simpler. Perhaps (this needs to be proven) the downside of this design would be a potential drop in warehouse search performance and difficulties with updating molecule and protein reference data separately.

Other Alternatives

  • clone and rename the PaxtoolsHibernateDAO class (create one per database, i.e., MainHibernateDAO, ProteinsHibernateDAO, etc.) and add qualifiers (since Spring 3.0) to @Transactional method annotations, e.g., @Transactional("proteins"); see the sketch after this list
  • using global transactions and JtaTransactionManager may be a general solution, but we do not plan to deploy cpath2 to a J2EE container (we want it to work from the console and Tomcat), and there might be other issues... (needs more research). We may want to try Atomikos (Spring Integration)
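
A sketch of the first alternative, assuming a transaction manager bean named "proteins" is configured for the proteins session factory (the value-as-qualifier support in @Transactional is a Spring 3.0 feature; class and method names are illustrative):

import org.springframework.transaction.annotation.Transactional;

public class ProteinsHibernateDAO {

    @Transactional("proteins") // routed to the "proteins" transaction manager bean
    public void importProtein(Object proteinReference) {
        // ... native Hibernate session work against the proteins database
    }
}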

Dates

Start - 16/02/2010, End - 29/03/2010 (now extended till July-August due to outstanding tasks and issues)

Identified Tasks

Prototype

  • Develop a module that provides primitive data management and search services (it is to be a wrapper around Paxtools)
  • Set up the project hosting, code base, and tracker, as well as the software development methodology.
  • Create the Maven2 project with the DAO (Hibernate + Lucene) and web application (MVC, JSP) prototype (Spring Framework).

The following methods are provided by the PaxtoolsDAO interface:

void importModel(Model model);
void importModel(File biopaxFile) throws FileNotFoundException;
<T extends BioPAXElement> T getByID(String id);
<T extends BioPAXElement> Set<T> getObjects(Class<T> filterBy);
<T extends BioPAXElement> T getByUnificationXref(UnificationXref unificationXref);
<T extends BioPAXElement> Set<T> getByQueryString(String query, Class<T> filterBy);

All return types are BioPAXElement or a class that extends BioPAXElement. Also note that getByQueryString() requires Lucene support.
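
For illustration, hypothetical client code against this interface (the URI and query string are examples only, and getByQueryString() assumes the Lucene index has been built):

import java.util.Set;
import org.biopax.paxtools.model.level3.ProteinReference;

public class DaoUsage {
    public static void demo(PaxtoolsDAO dao) {
        // look an element up by its (normalized) RDF ID
        ProteinReference pr = dao.getByID("urn:miriam:uniprot:P62158");

        // full-text search, filtered by BioPAX class
        Set<ProteinReference> hits =
                dao.getByQueryString("calmodulin", ProteinReference.class);
    }
}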

Completion

The project was successfully set up, and the DAO prototype was created (using Spring-Hibernate and the annotated BioPAX classes from Paxtools's org.biopax.paxtools.proxy..* package) and worked in simple tests. However, it did not handle the extra (non-BioPAX) "inverse" properties (e.g., entityReferenceOf, xrefOf, etc.), which is an apparent limitation for graph-theoretic and other query algorithms, and it was not well tested.

See the current PaxtoolsDAO interface and its implementation.

Pre Merge

Pre Merge Docs

  • (COMPLETE) Fetcher
    • Getting pathway data (PSI-MI, OWL files) and data source metadata
  • (COMPLETE) Normalizer
    • Given a Paxtools model, replaces IDs with Miriam URIs (URNs) for entity references (protein and small molecule) and CVs
  • (COMPLETE) Cleaner (provider-specific "cooking")
    • Fix inconsistencies within the source data prior to creating the Paxtools model
  • (COMPLETE) Converter (protein and small molecule to Entity Reference conversion)
    • Update the PSI-MI2BioPAX converter
  • (COMPLETE) Validation
    • The Validator checks a (Paxtools) model and reports errors and other information (e.g., stats); (New!) it can now auto-fix and normalize.
  • (99% COMPLETE) Warehouse
    • Data Scope and Requirements - auxiliary data materialization strategies for URN normalization and web API/search requirements
    • Data Fetch and Installation - auxiliary Data Warehouse development
    • API - interface between the Importer/web service and the Warehouse data
    • DAO and Utilities Implementation (after scoped requirements, i.e., the auxiliary data structure - importing data)
  • Integration Testing for the Import Pipeline

Merge

  • (COMPLETE) Merger
    • Import pathway interaction databases in a defined order.
    • Determine whether a small molecule reference / protein reference / CV already exists in PC; if not, import it from the Warehouse (see the sketch after this list).
    • Transfer links from the source small molecule reference / protein reference / CV to the objects in the merged database.
    • For all other objects, simply move them.
    • Refactor Paxtools controllers and DAO for cloning and merging.
  • (Mostly moved to Phase2) Integration Testing
    • Identify test points
    • Create tests
    • Implement test and concordance
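
A rough sketch of the merge rule above, using the Paxtools Model API (containsID/getByID/add). Re-pointing links from a duplicate source object to the kept one is elided, and real merging also involves cloning/completing elements rather than adding them directly:

import org.biopax.paxtools.model.BioPAXElement;
import org.biopax.paxtools.model.Model;

public class MergeSketch {
    /** First come is kept: merge a source model into the target (central) model. */
    public static void merge(Model target, Model source) {
        for (BioPAXElement bpe : source.getObjects()) {
            if (target.containsID(bpe.getRDFId())) {
                // already present: re-point the source object's links
                // to target.getByID(bpe.getRDFId()) - elided here
            } else {
                target.add(bpe); // new object: simply move it over
            }
        }
    }
}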

Outstanding Tasks

  • (DONE) Use Paxtools's core model and BioPAX elements implementation (Level 3) instead of the org.biopax.paxtools.proxy package (this required much re-factoring: putting JPA/Hibernate mapping and indexing/search annotations on "inverse" properties as well as BioPAX properties)
  • (DONE) Make it easy to create/drop individual cpath^2 databases and tables (both for tests and production)
  • (DONE) Test/Fix new PaxtoolsDAO and Spring Framework configuration (Transactions, SessionFactories, Connections, etc.)
  • (DONE) consolidate properties files into one or two, use a system environment variable (CPATH2_HOME)
  • exception handling system
  • (DONE) proper distribution configuration (using a special assembly.xml maven file in the cpath-admin module); e.g., to create a neat assembly with all jars in the /lib, configs in /etc/conf, and classpath... Using http://maven.apache.org/plugins/maven-release-plugin/
  • (DONE) fix context files to remove erroneous index files made within cpath-importer and cpath-admin
  • (DONE) figure out exactly how to detach BioPAX objects from PaxtoolsDAO to make them usable in the service layer; here are the ideas:
    • (DONE) a transactional method that gets an element (or elements) and "auto-completes" it (using Paxtools's Completer and Cloner) and/or Hibernate.initialize.
    • (DONE) a transactional method that gets the element by ID and deep-clones it (e.g., using a custom internal Cloner) into a new model instance, to which related elements are also added and from which the element can be called up later on;
    • (DONE) a method that gets the element(s), auto-completes (creating a new in-memory model) and serializes with SimpleExporter (while still inside the transaction/session) to an OWL string, which is then either returned as is or first converted to another (now completely independent of the DAO) in-memory Model.
  • (DONE) tune the validation
    • check/set individual validation rules mode (error/ignore)
    • make sure the errors are checked and result (and summary) is saved
  • (DONE) batch re-build of the Lucene indexes
  • (DONE) Warehouse tests for molecules and proteins data

Team Time Planning

This requires updating!..

Cpath^2 commitments

  • Ben: Warehouse Fetch Metadata (1/2 wk), Premerge Fetch (1/2 wk), Premerge Chef (1 wk), Merge, etc.
  • Igor: Premerge Normalization and Fetch CVs (1 wk), Validator Component (2 days), Merge whouse Dep (1 wk), Paxtools (BioPAX persistence), etc.
  • Emek: Paxtools Dependencies (1 wk), Doc (1 wk), Merge (1 wk)
  • Nadia: Warehouse Datascope
  • Other Commitments:
    • Emek: BioPAX documentation (2 days), Paxtools Paper (2 days), Other research: SBGN, GEM (1 wk)
    • Ben: Away March 5th-9th (back on the 10th), April 23rd-27th (back on the 28th)
    • Nadia: 22nd-25th Feb in Boston, BioPAX documentation (2 days), 21st-28th March in Cambridge UK, 1st-15th April in Glasgow UK.

Planned Meetings

  • Conf Call: Tuesday 3rd March, 13:00 (warehouse requirements for research)
  • Conf Call: Wednesday 10th March, 13:00 (report back options and make decisions to update the Phase plan with warehouse specifics)
  • F2F Phase 2 Plan: Monday 29th - Tuesday 30th March, 10:00am-5:30pm, Z-2170

Diagrams

Outdated, but you can still get the main idea... Sequence Diagram - Premerge pipeline

http://pathway-commons.googlecode.com/files/pre-merge.png (created with SDEdit)