Mediglot: 3D Medicinal Clustering with Polyphy/Polyglot

Link to app: https://ayush-sharma410.github.io/MediGlot/
This project is an extension of the Polyglot app, developed by Hongwei (Henry) Zhou as part of his master's thesis, which is now part of the PolyPhy toolkit of network-inspired data science tools (for background on the PolyPhy hub, see here). The intention behind Polyglot is to give users hands-on experience that goes beyond the standard Euclidean measure of similarity by exploring a slime-mold-inspired measure of similarity. See below for more details!

Mediglot is a web application for visualizing 3D medicinal embeddings. Medicinal embeddings are high-dimensional vector representations of the salts present in a particular medicine. By reducing the dimensionality of these representations to 3D using UMAP, users can explore a 3D point cloud of medicines. Beyond navigating the point cloud, the application also lets users view the results of the recent Monte Carlo Physarum Machine (MCPM) metric. The algorithm simulates the self-organizing behavior of the Physarum polycephalum slime mold. This organism has been shown to discover near-optimal transport networks on its own, including an instance where it replicated the structure of the Japanese railway system. The entire network is colored by its MCPM similarity to the current "anchor point", which is the point against which all MCPM similarity scores are computed (e.g., if the anchor point is "dolo", the network is colored such that medicines with high MCPM similarity to "dolo" are brighter and dissimilar medicines are darker).
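
As a rough illustration of the coloring rule described above, the sketch below maps similarity scores to brightness values. It is not the app's actual code: the function name `similarity_to_brightness`, the `floor` parameter, and the toy scores are all assumptions made for this example, and the scores stand in for precomputed MCPM similarities rather than computing MCPM itself.

```python
def similarity_to_brightness(scores, floor=0.15):
    """Rescale similarity scores linearly to brightness values in [floor, 1.0].

    `scores` maps a medicine name to its (precomputed) similarity to the anchor.
    `floor` is a hypothetical minimum brightness so dissimilar points stay visible.
    """
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    brightness = {}
    for name, score in scores.items():
        # Normalize to [0, 1]; if all scores are equal, treat every point as fully bright.
        t = (score - lo) / span if span > 0 else 1.0
        brightness[name] = floor + (1.0 - floor) * t
    return brightness


# Hypothetical similarity scores relative to the anchor "dolo".
scores = {"dolo": 1.0, "med_a": 0.8, "med_b": 0.2, "med_c": 0.0}
brightness = similarity_to_brightness(scores)
```

With these toy values, the anchor itself receives the highest brightness and the least similar medicine receives the floor value.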

Background

For context, Mediglot follows this methodology:

We take a diverse dataset of popular medicines found on the internet and use an embedding model (such as Word2Vec or BERT) or an LLM (such as LLaMa3.1, Gemini, or Phi) to generate a set of high-dimensional points associated with each medicine name and its salts.
Use a dimensionality reduction method (such as UMAP or t-SNE) to reduce each embedding vector to 3 dimensions.
Use the novel MCPM (Monte Carlo Physarum Machine) metric to compute the similarities between a set of anchor points (in this case, medicines selected using quasi-random sampling) and the rest of the point cloud.
The web app then displays the point cloud of 3-dimensional medicine embeddings so that medicines with similar salt compositions appear clustered together (e.g., if we search for a salt named "paracetamol", all the medicines containing "paracetamol" turn green and the rest of the medicines disappear).
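
The salt search in the last step can be sketched as a simple filter. This is a minimal Python illustration of the logic only, not the web app's implementation: the function `search_by_salt`, the state labels, and the toy medicine names are hypothetical.

```python
def search_by_salt(medicines, salt):
    """Return a display state for each medicine: 'highlight' or 'hidden'.

    `medicines` maps a medicine name to the list of salts it contains.
    Matching is case-insensitive and requires an exact salt name.
    """
    query = salt.strip().lower()
    return {
        name: "highlight" if query in {s.lower() for s in salts} else "hidden"
        for name, salts in medicines.items()
    }


# Hypothetical toy data; the real dataset is far larger.
medicines = {
    "med_a": ["Paracetamol"],
    "med_b": ["Paracetamol", "Caffeine"],
    "med_c": ["Ibuprofen"],
}
state = search_by_salt(medicines, "paracetamol")
```

In the visualization, points marked `highlight` would correspond to the medicines that turn green, and points marked `hidden` to those that disappear.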

Models

Gemini-Pro

Used the Gemini API to get the embeddings for 55,000 unique medicines.
Used UMAP to reduce the dimensionality of the embeddings to 3.
Used the generated embeddings to make clusters of medicines having similar salts and compositions.
Calculated the Euclidean distances between the points to get the final point cloud.
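
For reference, the Euclidean distance step can be sketched in a few lines of Python. The helper name `euclidean_distances` and the sample coordinates are assumptions for illustration; the project's notebooks may compute distances differently.

```python
import math


def euclidean_distances(points, anchor):
    """Return the Euclidean distance from the anchor point to every point.

    `points` maps a medicine name to its 3D coordinates after dimensionality reduction.
    """
    origin = points[anchor]
    return {name: math.dist(origin, coords) for name, coords in points.items()}


# Hypothetical 3D coordinates for three medicines.
points = {
    "med_a": (0.0, 0.0, 0.0),
    "med_b": (3.0, 4.0, 0.0),
    "med_c": (1.0, 2.0, 2.0),
}
distances = euclidean_distances(points, "med_a")
```

Here `math.dist` (available in Python 3.8 and later) computes the straight-line distance between two coordinate tuples.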

LLaMa3.1

Used LLaMa3.1 to generate the embeddings for 55,000 unique medicines.
Used UMAP to reduce the dimensionality of the embeddings to 3.
Used the generated embeddings to make clusters of medicines having similar salts and compositions.
Calculated the Euclidean distances between the points to get the final point cloud.

Word2Vec

Used Word2Vec to generate the embeddings of the salts associated with each medicine for a dataset of 55,000 unique medicines.
Summed and normalized the vectors of all the salts associated with each medicine to get the final embedding for each medicine.
Used UMAP to reduce the dimensionality of the embeddings to 3.
Used the generated embeddings to make clusters of medicines having similar salts and compositions.
Calculated the Euclidean distances between the points to get the final point cloud.
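
The sum-and-normalize step for Word2Vec can be sketched as follows. This is a minimal example under stated assumptions: the source does not say which normalization is used, so L2 (unit-length) normalization is assumed here, and the function name `medicine_embedding` and the toy salt vectors are hypothetical.

```python
import math


def medicine_embedding(salts, salt_vectors):
    """Sum the vectors of a medicine's salts and scale the result to unit length.

    `salt_vectors` maps a salt name to its embedding vector (a list of floats).
    L2 normalization is an assumption; the original project may normalize differently.
    """
    dim = len(next(iter(salt_vectors.values())))
    total = [0.0] * dim
    for salt in salts:
        for i, value in enumerate(salt_vectors[salt]):
            total[i] += value
    norm = math.sqrt(sum(value * value for value in total))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [value / norm for value in total] if norm > 0 else total


# Hypothetical 2D salt vectors, kept tiny for readability.
salt_vectors = {"salt_x": [3.0, 0.0], "salt_y": [0.0, 4.0]}
embedding = medicine_embedding(["salt_x", "salt_y"], salt_vectors)
```

Normalizing after summing keeps medicines with many salts from having larger vectors than medicines with a single salt, so the later distance computations reflect composition rather than salt count.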

Authors

MediGlot was developed as part of Ayush Sharma's 2024 Google Summer of Code project, mentored by Oskar Elek and Kiran Deol.

This web visualization tool was originally created by a team of researchers at University of California, Santa Cruz, Dept. of Computational Media:

This work was published as Hongwei Zhou's M.S. thesis.

A version of the original work was published at the 2020 IEEE 5th Workshop on Visualization for the Digital Humanities (VIS4DH).
