adding some blog starters
brifordwylie committed Dec 27, 2024
1 parent a3eb6cf commit f03baa0
Showing 2 changed files with 125 additions and 0 deletions.
62 changes: 62 additions & 0 deletions docs/blogs/compound_etl.md
@@ -0,0 +1,62 @@
# Compound ETL
***From Raw SMILES to ML Pipeline Ready Molecules***

In this blog, we'll walk through the essential steps for preparing molecular data for machine learning pipelines. Starting with raw SMILES strings, we'll demonstrate how to clean, standardize, and preprocess molecules, ensuring they are ready for downstream tasks: feature selection, engineering, and modeling.

### Why Compound ETL?
Raw molecular datasets often contain inconsistencies, salts, and redundant entries. A well-structured ETL (Extract, Transform, Load) pipeline ensures the data is clean, standardized, and reproducible, which is crucial for building reliable ML models. We'll cover the following steps:

1. Validating SMILES
2. Removing duplicates
3. Handling stereochemistry
4. Selecting the largest fragment
5. Canonicalizing molecules
6. Tautomerizing molecules


### Data
**AqSolDB:** A curated reference set of aqueous solubility data created by the Autonomous Energy Materials Discovery (AMD) research group. It contains aqueous solubility values for 9,982 unique compounds curated from 9 publicly available datasets.
Source: [Nature Scientific Data](https://www.nature.com/articles/s41597-019-0151-1)

**Download from Harvard DataVerse:**
[Harvard DataVerse: AqSolDB](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OVHAW8)


### Python Packages

- **[RDKit](https://github.com/rdkit/rdkit):** Open-source toolkit for cheminformatics, used for tasks like SMILES validation, fragment selection, and stereochemistry handling.
- **[Mordred Community](https://github.com/JacksonBurns/mordred-community):** A community-maintained molecular descriptor calculator for feature extraction and engineering.



### ETL Steps
Here are the core steps of our Compound ETL pipeline:

#### 1. Check for Invalid SMILES
Validating the SMILES strings ensures that downstream processing doesn’t fail due to malformed data. This step identifies and filters out invalid or problematic entries.
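
As a rough sketch of this check (not necessarily how our pipeline implements it), RDKit's `Chem.MolFromSmiles` returns `None` when a string can't be parsed or sanitized. The helper name `is_valid_smiles` is made up for illustration.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parse-error messages; we only care about the result.
RDLogger.DisableLog("rdApp.*")


def is_valid_smiles(smiles) -> bool:
    """Return True if RDKit can parse and sanitize the SMILES string.

    Hypothetical helper for illustration only.
    """
    # Empty strings parse to an empty molecule, so reject them explicitly.
    if not isinstance(smiles, str) or not smiles.strip():
        return False
    return Chem.MolFromSmiles(smiles) is not None


raw = ["CCO", "c1ccccc1", "C1CC", "", "C(C)(C)(C)(C)C"]
valid = [s for s in raw if is_valid_smiles(s)]
```

Here the unclosed ring (`C1CC`), the empty string, and the five-bonded carbon are filtered out, leaving only ethanol and benzene.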

#### 2. Deduplicate
Duplicate molecules can skew analysis and modeling results. Deduplication ensures a clean and minimal dataset.
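
What counts as a "duplicate" is a design choice. The sketch below assumes two entries are duplicates when they share the same RDKit canonical SMILES and keeps the first occurrence; the function name `deduplicate` is hypothetical.

```python
from rdkit import Chem


def deduplicate(smiles_list):
    """Keep the first occurrence of each molecule, keyed on canonical SMILES.

    Assumption: canonical SMILES is the identity key. Unparseable entries
    are skipped rather than raising.
    """
    seen = set()
    unique = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        key = Chem.MolToSmiles(mol)  # canonical form used as the identity key
        if key not in seen:
            seen.add(key)
            unique.append(smi)
    return unique


# Three different spellings of ethanol collapse to one entry.
result = deduplicate(["CCO", "OCC", "C(O)C", "CCN"])
```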

#### 3. Handle Stereochemistry
Stereochemistry affects molecular properties significantly. This step determines whether to retain or relax stereochemical definitions, depending on the use case.
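
If the use case calls for relaxing stereochemistry, one option is RDKit's `Chem.RemoveStereochemistry`, which strips stereo information in place. This is a minimal sketch of the "relax" branch only; the helper name `smiles_without_stereo` is ours.

```python
from rdkit import Chem


def smiles_without_stereo(smiles):
    """Return canonical SMILES with stereo information removed.

    Hypothetical helper; returns None for unparseable input.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    Chem.RemoveStereochemistry(mol)  # modifies the molecule in place
    return Chem.MolToSmiles(mol)


l_alanine = "C[C@H](N)C(=O)O"
d_alanine = "C[C@@H](N)C(=O)O"

# With stereo retained the two enantiomers stay distinct...
distinct = Chem.CanonSmiles(l_alanine) != Chem.CanonSmiles(d_alanine)
# ...and with stereo relaxed they collapse to the same string.
same = smiles_without_stereo(l_alanine) == smiles_without_stereo(d_alanine)
```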

#### 4. Select Largest Fragment
Many compounds contain salts or counterions. This step extracts the largest fragment with at least one heavy atom and retains any other fragments as metadata.
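
A minimal sketch of this step, assuming "largest" means the fragment with the most heavy atoms (the function name and the return shape are illustrative, not the pipeline's actual API):

```python
from rdkit import Chem


def split_largest_fragment(smiles):
    """Return (largest_fragment_smiles, other_fragment_smiles_list).

    Assumption: size is measured by heavy-atom count, and fragments with
    no heavy atoms are never selected as the main fragment.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None, []
    frags = Chem.GetMolFrags(mol, asMols=True)
    candidates = [f for f in frags if f.GetNumHeavyAtoms() > 0]
    if not candidates:
        return None, [Chem.MolToSmiles(f) for f in frags]
    largest = max(candidates, key=lambda f: f.GetNumHeavyAtoms())
    # Keep the remaining fragments (salts, counterions) as metadata.
    others = [Chem.MolToSmiles(f) for f in frags if f is not largest]
    return Chem.MolToSmiles(largest), others


# Sodium acetate: the acetate ion is kept, the sodium ion becomes metadata.
largest, others = split_largest_fragment("CC(=O)[O-].[Na+]")
```

RDKit also ships `rdMolStandardize.LargestFragmentChooser`, which may be a better fit if you don't need to keep the discarded fragments.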

#### 5. Canonicalize Molecules
Canonicalization ensures that each molecule is represented in a unique and consistent format. This step is critical for reproducibility and efficient comparison.
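
With RDKit, a round trip through `Chem.MolFromSmiles` and `Chem.MolToSmiles` yields a canonical SMILES string. Note that this form is canonical for RDKit, so it shouldn't be compared against canonical SMILES from other toolkits. The helper name below is hypothetical.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the input can't be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


# Different atom orderings and Kekule vs. aromatic notation converge.
ethanol_forms = {canonical_smiles(s) for s in ["CCO", "OCC", "C(C)O"]}
benzene = canonical_smiles("C1=CC=CC=C1")
```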

#### 6. Tautomerize Molecules
Tautomerization standardizes different tautomeric forms of a compound into a single representation, reducing redundancy and improving consistency.
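
One way to sketch this is with RDKit's `rdMolStandardize.TautomerEnumerator`, whose `Canonicalize` method picks a single representative tautomer. The representative is chosen by a scoring scheme for consistency, so it is not necessarily the dominant form in solution. The wrapper name `canonical_tautomer_smiles` is ours.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_tautomer_smiles(smiles):
    """Return the SMILES of RDKit's canonical tautomer (None if unparseable).

    Hypothetical wrapper for illustration.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    enumerator = rdMolStandardize.TautomerEnumerator()
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))


# 2-hydroxypyridine and 2-pyridone are tautomers of the same compound.
hydroxy_form = canonical_tautomer_smiles("Oc1ccccn1")
pyridone_form = canonical_tautomer_smiles("O=c1cccc[nH]1")
```

After this step both inputs should map to the same string, so they can be treated as one compound downstream.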



### Canonicalization and Tautomerization
For an in-depth look at why **Canonicalization** and **Tautomerization** are crucial for compound preprocessing, see our blog on [Canonicalization and Tautomerization](canonicalization_and_tautomerization.md). It covers the importance of standardizing molecular representations to ensure robust and reproducible machine learning workflows.


## Conclusion
By following this Compound ETL pipeline, you can transform raw molecular data into a clean, standardized, and ML-ready format. This foundational preprocessing step sets the stage for effective feature engineering, modeling, and analysis.

Stay tuned for the next blog, where we'll dive into feature engineering for chemical compounds!
63 changes: 63 additions & 0 deletions docs/blogs/compound_explorer.md
@@ -0,0 +1,63 @@
# Compound Explorer
***A Workbench-Based Application***

In this blog, we'll walk through the steps we used to take a 'pile of SMILES' and transform them into compound data for processing and display in the Compound Explorer application.

### Why Compound Explorer?
The Workbench toolkit has a ton of functionality for constructing end-to-end AWS Machine Learning pipelines. We wanted to build an engaging and informative web application that combines components from the toolkit.


### Workbench Pipeline

1. **SMILES Processing**:
    - **Validation**: Confirm that each SMILES string is syntactically correct.
    - **Fragment Selection**: Retain the largest fragment (with at least one heavy atom) of each molecule.
    - **Canonicalization**: Generate a unique representation for each molecule.
    - **Tautomerization**: Normalize tautomers to standardize inputs.

2. **Feature Space Proximity Models**:
    - Build **KNN-based proximity graphs** for:
        - **Descriptor Features**: Using molecular descriptors (RDKit, Mordred).
        - **Fingerprints**: Using chemical fingerprints for structural similarity (ECFP).

3. **2D Projections**:
    - **LogP vs pKa**: Provide a chemically intuitive 2D space.
    - Projections (t-SNE, UMAP, etc.):
        - **Descriptor Space**
        - **Fingerprint Space**

4. **Interactive Visualization**:
    - **Hover**: Displays a molecular drawing.
    - **5 Closest Neighbors (2 sets)**:
        - **Blue Lines**: Descriptor-based neighbors.
        - **Green Lines**: Fingerprint-based neighbors.
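
To make the fingerprint side of the proximity step concrete, here is a small sketch that finds each molecule's nearest neighbors by Tanimoto similarity over Morgan (ECFP-style) fingerprints. The radius, bit size, brute-force search, and the function name `fingerprint_neighbors` are all assumptions for illustration; the application's actual proximity models may be built differently.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def fingerprint_neighbors(smiles_list, k=5, radius=2, fp_size=2048):
    """Map each SMILES to its k most similar neighbors as (smiles, similarity).

    Assumes every input SMILES is valid (run validation first).
    Brute-force O(n^2) search, fine for a sketch but not for large sets.
    """
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fps = [gen.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]

    neighbors = {}
    for i, fp in enumerate(fps):
        sims = DataStructs.BulkTanimotoSimilarity(fp, fps)
        # Rank every other molecule by similarity, most similar first.
        ranked = sorted(
            (j for j in range(len(fps)) if j != i),
            key=lambda j: sims[j],
            reverse=True,
        )
        neighbors[smiles_list[i]] = [(smiles_list[j], sims[j]) for j in ranked[:k]]
    return neighbors


smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1"]
nn = fingerprint_neighbors(smiles, k=2)
```

In this toy set, toluene comes out as benzene's closest neighbor, which is the kind of edge the green lines in the visualization would draw.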

### Interactivity Highlights
- **Neighbor Connections**: Clearly differentiate relationships with color-coded edges.
- **Hover Effects**: Enable chemists to interactively explore molecular neighborhoods.
- **Projection Linking**: Allow toggling between Descriptor and Fingerprint spaces.


### Data
**AqSolDB:** A curated reference set of aqueous solubility data created by the Autonomous Energy Materials Discovery (AMD) research group. It contains aqueous solubility values for 9,982 unique compounds curated from 9 publicly available datasets.
Source: [Nature Scientific Data](https://www.nature.com/articles/s41597-019-0151-1)

**Download from Harvard DataVerse:**
[Harvard DataVerse: AqSolDB](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OVHAW8)


### Python Packages

- **[RDKit](https://github.com/rdkit/rdkit):** Open-source toolkit for cheminformatics, used for tasks like SMILES validation, fragment selection, and stereochemistry handling.
- **[Mordred Community](https://github.com/JacksonBurns/mordred-community):** A community-maintained molecular descriptor calculator for feature extraction and engineering.



### Canonicalization and Tautomerization
For an in-depth look at why **Canonicalization** and **Tautomerization** are crucial for compound preprocessing, see our blog on [Canonicalization and Tautomerization](canonicalization_and_tautomerization.md). It covers the importance of standardizing molecular representations to ensure robust and reproducible machine learning workflows.


## Conclusion


Stay tuned for the next blog, where we'll dive into feature engineering for chemical compounds!
