MedFact is a set of algorithms that help assign a veracity score to text paragraphs about medical claims.
Please cite the following publication when using our source code for your research. This project is supported by the Alberta Machine Intelligence Institute (Amii).
@inproceedings{SamuelZaiane2018,
title = {{MedFact: Towards Improving Veracity of Medical Information in Social Media using Applied Machine Learning}},
author = {Samuel, Hamman and Zaiane, Osmar},
booktitle = {{31st CAIAC Canadian Conference on Artificial Intelligence (CAI)}},
pages = {108--120},
year = {2018},
organization = {{CAIAC}}
}
- This code is developed in Python 2.7.15 and tested on Anaconda
- The related Python libraries for this project can be installed via `pip install -r requirements.txt` (file generated via `pipreqs --savepath=requirements.txt .`)
- The datasets required in the `datasets` folder can be downloaded from GDrive.
- Word2vec embeddings pre-trained on text from MEDLINE/PubMed Baseline 2018 by AUEB's NLP group
- Simple English Wikipedia (SEW)
- Consumer Health Vocabulary (CHV)
- SNOMED CT International (requires UMLS account)
- Medical Sciences Stack Exchange (scraped via API)
- Health Stack Exchange (outdated with inconsistencies)
- Train the medical phrases classifier by running `train()` in `medclass.py`, which will generate and persist the trained model
- For a given incoming text paragraph, identify key phrases and medical phrases using `predict()` from `medclass.py`
- Use the incoming medical phrases to query the TRIP database with `query()` to get related articles. Optionally, also query Health Canada's knowledge base using `query()` in `healthcanada.py`
- Extract the corpus phrases from the TRIP (and optionally Health Canada) articles with `extract()` in `article.py`
- Train the accord/agreement classifier via `train()` in `accordcnn.py`
- Compare the incoming medical phrases with the corpus medical phrases via `predict()` in `accordcnn.py`
- Calculate the veracity score via `veracity()` in `medfact.py`
- Compute the confidence score via `confidence()` in `medfact.py`
- Compute the triage label via `triage()` in `medfact.py`
- Readability of the text being processed can be quantified with `metrics()` in `readability.py`
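The order of the steps above can be sketched as a simple chain of calls. This is only an illustration of the data flow: the real signatures of `predict()`, `query()`, `extract()` and `veracity()` are not shown in this README, so the argument lists, return types, the `run_pipeline` helper and all `fake_*` stand-ins below are hypothetical. In particular, the toy scoring is placeholder arithmetic, not MedFact's veracity formula.

```python
def run_pipeline(text, predict_phrases, query_articles,
                 extract_phrases, predict_accord, score):
    """Chain the stages in the order listed above.

    Every callable is a stand-in for the corresponding MedFact function;
    the argument and return shapes are assumptions made for this sketch.
    """
    incoming = predict_phrases(text)            # stands in for predict() from medclass.py
    articles = query_articles(incoming)         # stands in for query() against TRIP
    corpus = extract_phrases(articles)          # stands in for extract() in article.py
    accord = predict_accord(incoming, corpus)   # stands in for predict() in accordcnn.py
    return score(accord)                        # stands in for veracity() in medfact.py


# Hypothetical toy stages, used only to make the sketch runnable.
def fake_predict_phrases(text):
    # Keep lower-cased words longer than three characters.
    return [w.strip(".,").lower() for w in text.split() if len(w) > 3]


def fake_query_articles(phrases):
    # A fixed, made-up article instead of a real TRIP query.
    return ["Vitamin C does not prevent colds in the general population."]


def fake_extract_phrases(articles):
    return [w.strip(".,").lower() for a in articles for w in a.split() if len(w) > 3]


def fake_predict_accord(incoming, corpus):
    # Exact string match instead of a trained agreement classifier.
    return [phrase in corpus for phrase in incoming]


def fake_score(accord):
    # Fraction of matching phrases; placeholder arithmetic only.
    return sum(accord) / float(len(accord)) if accord else 0.0


result = run_pipeline(
    "Vitamin C prevents colds.",
    fake_predict_phrases,
    fake_query_articles,
    fake_extract_phrases,
    fake_predict_accord,
    fake_score,
)
```

Passing the stages in as callables keeps the sketch independent of the actual module layout; in the project itself, the functions named in the list above would take the place of the `fake_*` helpers.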
- The veracity of websites can be computed via the batch mode, which uses a web scraper to sample content on the given website's home page or other specified pages
- An example of using this mode is provided in `medfact.py` as `example2()` (the RESTful API has a URL mode that provides bulk analysis)
- The bulk mode is also useful for analysing paragraphs of text that contain multiple sentences
- To run the Flask web app locally, use the command `python medfact.py api`
- In your web browser, go to http://127.0.0.1:5000/api/text/?text= to process a text sentence, or use http://127.0.0.1:5000/api/url/?url= to analyze a website's page (full details on the API are documented in `api_docs.docx`)
- The live MedFact API will be using IaaS hosting with Cybera
- When using IaaS hosting, you can serve the Flask web app using uWSGI
- PaaS hosting configurations depend on the provider, but here is one for Heroku
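When calling the local endpoints listed above from a script rather than a browser, the query value needs to be URL-encoded. The following is a minimal sketch that only builds the request URLs from the two addresses given above; the helper names `text_endpoint` and `url_endpoint` are made up for this example, and the response format is not assumed here (it is described in `api_docs.docx`).

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# Default address of the locally running Flask app, as given above.
BASE_URL = "http://127.0.0.1:5000"


def text_endpoint(text, base_url=BASE_URL):
    """Build the URL for the text mode (hypothetical helper)."""
    return "{0}/api/text/?{1}".format(base_url, urlencode({"text": text}))


def url_endpoint(page_url, base_url=BASE_URL):
    """Build the URL for the URL mode (hypothetical helper)."""
    return "{0}/api/url/?{1}".format(base_url, urlencode({"url": page_url}))


if __name__ == "__main__":
    print(text_endpoint("Vitamin C prevents colds"))
    print(url_endpoint("https://example.com/health"))
```

The resulting strings can be opened in a browser or passed to any HTTP client while the Flask app is running locally.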