topic-modeling-harvard

A healthcare research project that uses natural language processing to classify large corpora of patient files into disease categories. The project experimented with regression and with a novel approach to improving the performance of the traditional Latent Dirichlet Allocation (LDA) algorithm by incorporating expert knowledge and prior probabilities on relevant concepts. The prior probabilities were trained on public-domain data, including Medline, Medscape, the Merck Manual, and Wikipedia. Large-scale preprocessing of more than 50,000 medical articles, representing more than 1.5 million Concept Unique Identifiers (CUIs), was used to identify the concepts that best describe each disease. Training samples were constructed with inverse probability sampling based on correlation to the main disease. Fifty-fold cross-validation with Elastic LASSO, augmented with hyperparameter optimization, was used to select relevant concepts and train their weights.
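The idea of placing prior probability on relevant concepts can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name `concept_prior`, the `base` and `strength` parameters, the toy vocabulary, and the tiny reference snippets are all hypothetical. The sketch turns concept frequencies from reference texts about each disease into one row of an asymmetric topic-word prior, so that concepts seen often in the reference material for a disease start with more prior mass in that disease's topic.

```python
from collections import Counter


def concept_prior(reference_docs, vocabulary, base=0.01, strength=1.0):
    """Build one topic's word prior from tokenized reference texts.

    Hypothetical helper: each vocabulary term gets a small symmetric
    `base` value plus `strength` times its relative frequency in the
    reference texts for that disease.
    """
    vocab_set = set(vocabulary)
    counts = Counter(
        token for doc in reference_docs for token in doc if token in vocab_set
    )
    total = sum(counts.values())
    if total == 0:
        # No reference evidence: fall back to a flat (uninformative) prior.
        return [base for _ in vocabulary]
    return [base + strength * counts[term] / total for term in vocabulary]


# Toy data (assumed for illustration only).
vocabulary = ["insulin", "glucose", "wheeze", "inhaler", "fever"]
reference = {
    "diabetes": [["insulin", "glucose", "glucose"], ["insulin", "fever"]],
    "asthma": [["wheeze", "inhaler"], ["inhaler", "wheeze", "fever"]],
}

# One prior row per disease topic, aligned with `vocabulary`.
eta = {
    disease: concept_prior(docs, vocabulary)
    for disease, docs in reference.items()
}

for disease, row in eta.items():
    print(disease, [round(value, 2) for value in row])
```

A matrix assembled from such rows could then be supplied to an LDA implementation that accepts a per-topic word prior. Whether the original project built its priors this way is not stated in the description above.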

Folders and files (3 commits)

cos-sim
create-data
regression
tests
50attempts.py
LDA.R
Medline Parse Example.py
README.md

chris-pan/topic-modeling-harvard