Update: Please check out our new biomedical QA system for the BioASQ challenge.
The Biomedical Question Answering Framework provides an effective open-source solution for automatically finding the optimal combination of components and their configurations (the configuration space exploration problem, or CSE problem) when building a biomedical question answering system (e.g. one that responds to a question from the TREC Genomics Track such as "What is the role of PrnP in mad cow disease?").
The BioQA framework is not just one particular QA system; it represents an infinite number of possible QA solutions by integrating various related toolkits, algorithms, knowledge bases and other resources defined in a BioQA configuration space.
The framework employs the topic set and benchmarks from the question answering task of the TREC Genomics Track, as well as commonly used tools, resources, and algorithms cited by participants. A set of basic components has been selected and adapted to the CSE Framework implementation by writing wrapper code where necessary, and users can easily extend it to wrap other existing tools or newly developed algorithms. The configuration space, represented by the extended configuration descriptors (defined for the resulting set of configured components, e.g. default-sqlite-test.yaml, default-mysql-test.yaml, bioqa-test.yaml), can be explored automatically with the CSE Framework, yielding an optimal and generalizable configuration which can outperform published results of the given components for the same task.
GitHub home: https://github.com/oaqa/bioqa
Use it in your project: the artifact is publicly available in the OAQA Repository or the Central Repository.
<dependency>
  <groupId>edu.cmu.lti.oaqa.bio.core</groupId>
  <artifactId>bioqa</artifactId>
  <version>1.0.0</version>
</dependency>
Cite it in your paper
@inproceedings{Yang:2013,
author = {Yang, Zi and Garduno, Elmer and Fang, Yan and Maiberg, Avner and McCormack, Collin and Nyberg, Eric},
title = {Building Optimal Information Systems Automatically: Configuration Space Exploration for Biomedical Information Systems},
booktitle = {Proceedings of the 22nd ACM international conference on Information and knowledge management},
series = {CIKM '13},
year = {2013},
location = {San Francisco, CA, USA},
numpages = {10},
url = {http://dx.doi.org/10.1145/2505515.2505692},
doi = {10.1145/2505515.2505692},
publisher = {ACM},
address = {New York, NY, USA},
}
- Be sure Maven is installed and properly configured to fetch the dependency artifacts.
- Offline corpus annotation and indexing
  - Annotate the TREC Genomics corpus with `legalspan` and `sentence` annotations using UIMA, taking the former from the `legalspans.txt` file provided by the organizer and the latter from any sentence segmenter. Serialize the annotated CAS corresponding to each document to an XMI file.
  - Optionally, gzip each XMI file.
  - Index the annotated corpus with the Indri search engine. (You should be able to search for the extents `legalspan` and `sentence`.)
- A schema with no content can be downloaded from the emptydb project and saved to `BIOQA_HOME/data/`. If you save it in a different location or change the username/password, you need to update `src/main/resources/bioqa/persistence/local-sqlite-persistence-provider.yaml`.
- Update the YAML descriptors with the information needed to access Indri: replace `INDRI_URL` and `INDRI_PORT` with your actual Indri URL and port in `src/main/resources/bioqa/retrieval/default-sqlite.yaml` and `src/main/resources/bioqa/ie/default-sqlite.yaml`.
- Specify the main YAML as `src/main/resources/bioqa/default-sqlite-test.yaml` and execute: `mvn exec:exec -Dconfig=bioqa.default-sqlite-test`.
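The `INDRI_URL` and `INDRI_PORT` replacement described above can be scripted rather than done by hand. The following is a minimal sketch, not part of the BioQA code base: the function name `fill_placeholders`, the descriptor keys `url:` and `port:`, and the host and port values are all hypothetical and only illustrate the substitution.

```python
import re


def fill_placeholders(descriptor_text: str, values: dict[str, str]) -> str:
    """Replace every placeholder token in the text with its configured value."""
    if not values:
        return descriptor_text
    # Try longer names first so that a placeholder which is a prefix of
    # another one cannot match inside it.
    names = sorted(values, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    # A single pass avoids re-substituting text that came from a value.
    return pattern.sub(lambda match: values[match.group(0)], descriptor_text)


# Hypothetical descriptor fragment; the real keys in the YAML files may differ.
template = "url: INDRI_URL\nport: INDRI_PORT\n"
filled = fill_placeholders(
    template,
    {"INDRI_URL": "indri.example.org", "INDRI_PORT": "9001"},
)
print(filled)
```

To apply it to a real descriptor, read the file into a string, call the function, and write the result back to the same path.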
- Create your own MySQL schema, and update `src/main/resources/bioqa/persistence/local-mysql-persistence-provider.yaml` with your own `url`, `username` and `password`.
- Update the YAML descriptors with the information needed to access the Indri service, in the same way as on a single machine.
- Specify the main YAML as `src/main/resources/bioqa/default-mysql-test.yaml` and execute: `mvn exec:exec -Dconfig=bioqa.default-mysql-test`.
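The two run commands so far suggest a pattern: the `-Dconfig` value looks like the main YAML path relative to `src/main/resources`, with the `.yaml` suffix dropped and path separators turned into dots. This is an inference from the examples, not documented behavior, and the helper name `config_name` below is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed resource root, taken from the paths shown in the steps above.
RESOURCE_ROOT = PurePosixPath("src/main/resources")


def config_name(yaml_path: str) -> str:
    """Derive the dotted -Dconfig value from a main YAML path."""
    # relative_to raises ValueError if the path is not under RESOURCE_ROOT.
    relative = PurePosixPath(yaml_path).relative_to(RESOURCE_ROOT)
    return ".".join(relative.with_suffix("").parts)


for path in (
    "src/main/resources/bioqa/default-sqlite-test.yaml",
    "src/main/resources/bioqa/default-mysql-test.yaml",
):
    print(f"mvn exec:exec -Dconfig={config_name(path)}")
```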
- Update the YAML descriptors with the information needed to access Indri.
- Update the YAML descriptors with the information needed to access the annotated corpus: replace `XMI_DIR_PATH` with the directory or URL prefix that contains the annotated XMI files (or gzipped XMI files), for example `file:/PATH/TO/YOUR/XMIGZ/DIRECTORY` or `http://URL:PORT/HTTP/SERVICE/URL/TO/PROVIDE/ACCESS/TO/REMOTE/FILES`.
- Use MySQL as the persistence database, or update the main `src/main/resources/bioqa/bioqa-test.yaml` similarly to `src/main/resources/bioqa/retrieval/default-sqlite.yaml` if SQLite or another persistence medium is being used.
- Specify the main YAML and execute: `mvn exec:exec -Dconfig=bioqa.bioqa-test`.
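Since `XMI_DIR_PATH` may point at gzipped XMI files, the compression step can be batch-scripted. The sketch below is one possible way to do it and rests on assumptions: that the serialized files end in `.xmi`, and that appending `.gz` to the file name is the naming the reader expects. Neither is confirmed by this README, and the function name `gzip_xmi_files` is hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_xmi_files(xmi_dir: str) -> list[Path]:
    """Write a .gz copy next to every *.xmi file and return the new paths."""
    compressed = []
    for xmi_path in sorted(Path(xmi_dir).glob("*.xmi")):
        # Assumed naming: doc.xmi -> doc.xmi.gz (the original file is kept).
        gz_path = xmi_path.with_name(xmi_path.name + ".gz")
        with open(xmi_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        compressed.append(gz_path)
    return compressed


# Demonstration on a throwaway directory with one placeholder document.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "doc1.xmi").write_text("<xmi:XMI/>")
    demo_names = [path.name for path in gzip_xmi_files(demo_dir)]
    print(demo_names)
```

The originals are left in place so the output can be verified before anything is deleted.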
(See Section 6 of the [CSE paper][] for a more detailed component description.)
Test it on a cluster with the CSE Asynchronous Driver based on UIMA-AS
- Be sure UIMA-AS is installed on the cluster.
- Update the broker `URL` and `PORT` in `src/main/resources/bioqa/async/cse-broker.yaml`, `src/main/resources/bioqa/collection/db-collection-reader-consumer.yaml` and `src/main/resources/bioqa/collection/db-collection-reader-provider.yaml`.
- The inputs and gold-standard outputs need to be stored a priori in the `inputelements` table of the database; they are retrieved directly from the database while the program is executing.
- Update the database access information `JDBC_CONNECTION_URL`, `USERNAME`, and `PASSWORD` in both `src/main/resources/bioqa/collection/db-collection-reader-consumer.yaml` and `src/main/resources/bioqa/collection/db-collection-reader-provider.yaml`.
- Execute the producer on the cluster's master node and start the consumer on each slave node.
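Because several descriptors must be edited before a cluster run, it can help to check that no placeholder was left unreplaced. The sketch below is a hypothetical pre-flight check, not part of the framework: the function name `leftover_placeholders` and the descriptor keys in the sample text are assumptions, and only the three placeholder names come from the step above.

```python
import re

# Placeholder names taken from the database access step above.
PLACEHOLDERS = ("JDBC_CONNECTION_URL", "USERNAME", "PASSWORD")


def leftover_placeholders(descriptor_text: str, names=PLACEHOLDERS) -> list[str]:
    """Return the placeholder names that still appear as whole tokens."""
    found = []
    for name in names:
        # Case-sensitive whole-token match, so the key "username" or a
        # longer identifier containing the name is not flagged.
        token = rf"(?<![A-Za-z0-9_]){re.escape(name)}(?![A-Za-z0-9_])"
        if re.search(token, descriptor_text):
            found.append(name)
    return found


# Hypothetical descriptor fragment in which one value was forgotten.
sample = (
    "url: jdbc:mysql://db.example.org/bioqa\n"
    "username: USERNAME\n"
    "password: secret\n"
)
remaining = leftover_placeholders(sample)
print(remaining)
```

An empty list means none of the listed placeholders remain in that descriptor.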
Please refer to the OAQA Tutorial to learn how to create your own framework.
Copyright 2013 Carnegie Mellon University
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
If you have any questions or suggestions, please feel free to create an issue, or contact me.