Commit
1 parent
a3eb6cf
commit f03baa0
Showing
2 changed files
with
125 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# Compound ETL | ||
***From Raw SMILES to ML Pipeline Ready Molecules*** | ||
|
||
In this blog, we'll walk through the essential steps for preparing molecular data for machine learning pipelines. Starting with raw SMILES strings, we'll demonstrate how to clean, standardize, and preprocess molecules, ensuring they are ready for downstream tasks: feature selection, engineering, and modeling. | ||
|
||
### Why Compound ETL? | ||
Raw molecular datasets often contain inconsistencies, salts, and redundant entries. A well-structured ETL (Extract, Transform, Load) pipeline ensures the data is clean, standardized, and reproducible, which is crucial for building reliable ML models. We'll cover the following steps: | ||
|
||
1. Validating SMILES | ||
2. Removing duplicates | ||
3. Handling stereochemistry | ||
4. Selecting the largest fragment | ||
5. Canonicalizing molecules | ||
6. Tautomerizing molecules | ||
|
||
|
||
### Data | ||
**AqSolDB:** A curated reference set of aqueous solubility data created by the Autonomous Energy Materials Discovery (AMD) research group. It contains aqueous solubility values for 9,982 unique compounds, curated from 9 publicly available datasets. | ||
Source: [Nature Scientific Data](https://www.nature.com/articles/s41597-019-0151-1) | ||
|
||
**Download from Harvard DataVerse:** | ||
[Harvard DataVerse: AqSolDB](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OVHAW8) | ||
|
||
|
||
### Python Packages | ||
|
||
- **[RDKit](https://github.com/rdkit/rdkit):** Open-source toolkit for cheminformatics, used for tasks like SMILES validation, fragment selection, and stereochemistry handling. | ||
- **[Mordred Community](https://github.com/JacksonBurns/mordred-community):** A community-maintained molecular descriptor calculator for feature extraction and engineering. | ||
|
||
|
||
|
||
### ETL Steps | ||
Here are the core steps of our Compound ETL pipeline: | ||
|
||
#### 1. Check for Invalid SMILES | ||
Validating the SMILES strings ensures that downstream processing doesn’t fail due to malformed data. This step identifies and filters out invalid or problematic entries. | ||
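
As a minimal sketch of this step (not necessarily how our pipeline implements it), RDKit's `Chem.MolFromSmiles` returns `None` when a string cannot be parsed or sanitized, which makes a simple validity filter. The helper name `split_valid_smiles` is hypothetical, and treating empty strings as invalid is an assumption on our part.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parse-error messages for this demo
RDLogger.DisableLog("rdApp.*")


def split_valid_smiles(smiles_list):
    """Split SMILES into (valid, invalid) lists.

    A string counts as valid if RDKit can parse and sanitize it.
    Empty or non-string entries are treated as invalid (an assumption:
    RDKit itself parses "" into an empty molecule rather than None).
    """
    valid, invalid = [], []
    for smi in smiles_list:
        is_text = isinstance(smi, str) and smi.strip() != ""
        mol = Chem.MolFromSmiles(smi) if is_text else None
        if mol is not None:
            valid.append(smi)
        else:
            invalid.append(smi)
    return valid, invalid


valid, invalid = split_valid_smiles(["CCO", "c1ccccc1", "C1CC", "not_a_smiles", ""])
print("valid:", valid)
print("invalid:", invalid)
```

Keeping the rejected strings, rather than silently dropping them, makes it easier to audit what was filtered out.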
|
||
#### 2. Deduplicate | ||
Duplicate molecules can skew analysis and modeling results. Deduplication ensures a clean and minimal dataset. | ||
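
One possible way to deduplicate, sketched here under the assumption that two entries are "the same molecule" when their RDKit canonical SMILES match. The `deduplicate` helper is hypothetical; a real dataset would also need a policy for duplicates that carry different target values, which this sketch does not address.

```python
from rdkit import Chem


def deduplicate(smiles_list):
    """Keep the first occurrence of each molecule.

    Molecules are compared by RDKit canonical SMILES, so different
    spellings of the same structure collapse to one entry.
    """
    first_seen = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # invalid entries are assumed to be handled in step 1
        key = Chem.MolToSmiles(mol)
        first_seen.setdefault(key, smi)  # dicts preserve insertion order
    return list(first_seen.values())


unique = deduplicate(["CCO", "OCC", "C(O)C", "c1ccccc1", "C1=CC=CC=C1"])
print(unique)
```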
|
||
#### 3. Handle Stereochemistry | ||
Stereochemistry affects molecular properties significantly. This step determines whether to retain or relax stereochemical definitions, depending on the use case. | ||
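
To make the retain-or-relax choice concrete, the sketch below reports how many stereocenters a molecule has, how many are unassigned, and what its SMILES looks like with and without stereo information. It uses RDKit's `Chem.FindMolChiralCenters` and the `isomericSmiles` flag of `Chem.MolToSmiles`; the `stereo_summary` helper and its output keys are hypothetical.

```python
from rdkit import Chem


def stereo_summary(smiles):
    """Summarize stereocenters and give stereo-aware and stereo-free SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Each entry is (atom_index, label); label is "R", "S", or "?" if unassigned
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    return {
        "n_centers": len(centers),
        "n_unassigned": sum(1 for _, label in centers if label == "?"),
        "isomeric": Chem.MolToSmiles(mol),                   # keeps stereo
        "flat": Chem.MolToSmiles(mol, isomericSmiles=False),  # drops stereo
    }


l_ala = stereo_summary("C[C@H](N)C(=O)O")   # one enantiomer of alanine
d_ala = stereo_summary("C[C@@H](N)C(=O)O")  # the other enantiomer
unspecified = stereo_summary("CC(N)C(=O)O")  # stereocenter left undefined
print(l_ala)
print(d_ala)
print(unspecified)
```

Relaxing stereochemistry merges the two enantiomers into one "flat" entry, which may or may not be what a given modeling task wants.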
|
||
#### 4. Select Largest Fragment | ||
Many compounds contain salts or counterions. This step extracts the largest fragment with at least one heavy atom and retains any other fragments as metadata. | ||
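
A rough sketch of fragment selection, assuming "largest" means the fragment with the most heavy atoms (ties go to the first such fragment). It splits the molecule with `Chem.GetMolFrags` and returns the remaining fragments so they can be stored as metadata. The `largest_fragment` helper is hypothetical; RDKit also ships `rdMolStandardize.LargestFragmentChooser`, which may use different tie-breaking rules.

```python
from rdkit import Chem


def largest_fragment(smiles):
    """Return (parent_smiles, other_fragment_smiles).

    The parent is the fragment with the most heavy atoms; fragments
    with no heavy atoms are never chosen as the parent.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None, []
    frags = list(Chem.GetMolFrags(mol, asMols=True))
    candidates = [f for f in frags if f.GetNumHeavyAtoms() > 0]
    if not candidates:
        return None, [Chem.MolToSmiles(f) for f in frags]
    parent = max(candidates, key=lambda f: f.GetNumHeavyAtoms())
    others = [Chem.MolToSmiles(f) for f in frags if f is not parent]
    return Chem.MolToSmiles(parent), others


parent, others = largest_fragment("CC(=O)[O-].[Na+]")  # sodium acetate
print("parent:", parent)
print("other fragments:", others)
```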
|
||
#### 5. Canonicalize Molecules | ||
Canonicalization ensures that each molecule is represented in a unique and consistent format. This step is critical for reproducibility and efficient comparison. | ||
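
For illustration, a round trip through RDKit (`Chem.MolFromSmiles` followed by `Chem.MolToSmiles`) yields RDKit's canonical SMILES, so different spellings of the same structure end up as the same string. The `canonical_smiles` wrapper is a hypothetical helper, and the canonical form is specific to RDKit rather than a universal standard.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the input is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


# Three spellings of ethanol and two of benzene
for smi in ["OCC", "C(O)C", "CCO", "C1=CC=CC=C1", "c1ccccc1"]:
    print(f"{smi:>12} -> {canonical_smiles(smi)}")
```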
|
||
#### 6. Tautomerize Molecules | ||
Tautomerization standardizes different tautomeric forms of a compound into a single representation, reducing redundancy and improving consistency. | ||
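
As a sketch of what this can look like with RDKit, `rdMolStandardize.TautomerEnumerator().Canonicalize` picks one representative tautomer according to RDKit's scoring rules. The `canonical_tautomer` helper is hypothetical, and the tautomer RDKit selects is a consistent convention, not necessarily the form that dominates in solution.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_tautomer(smiles):
    """Return the SMILES of RDKit's canonical tautomer, or None if invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    enumerator = rdMolStandardize.TautomerEnumerator()
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))


# 2-hydroxypyridine and 2-pyridone are tautomers of each other
hydroxy_form = canonical_tautomer("Oc1ccccn1")
pyridone_form = canonical_tautomer("O=c1cccc[nH]1")
print(hydroxy_form, pyridone_form, hydroxy_form == pyridone_form)
```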
|
||
|
||
|
||
### Canonicalization and Tautomerization | ||
For an in-depth look at why **Canonicalization** and **Tautomerization** are crucial for compound preprocessing, see our blog on [Canonicalization and Tautomerization](canonicalization_and_tautomerization.md). It covers the importance of standardizing molecular representations to ensure robust and reproducible machine learning workflows. | ||
|
||
|
||
## Conclusion | ||
By following this Compound ETL pipeline, you can transform raw molecular data into a clean, standardized, and ML-ready format. This foundational preprocessing step sets the stage for effective feature engineering, modeling, and analysis. | ||
|
||
Stay tuned for the next blog, where we'll dive into feature engineering for chemical compounds! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Compound Explorer | ||
***A Workbench based Application*** | ||
|
||
In this blog, we'll walk through the steps we used for taking a 'pile of SMILES' and transforming them into compound data for processing and display in the Compound Explorer application. | ||
|
||
### Why Compound Explorer? | ||
The workbench toolkit has a ton of functionality for constructing end-to-end AWS Machine Learning pipelines. We wanted to construct an application that combines components in the toolkit to create an engaging and informative web application. | ||
|
||
|
||
### Workbench Pipeline | ||
|
||
1. **SMILES Processing**: | ||
- **Validation**: Check that SMILES are syntactically correct. | ||
- **Fragment Selection**: Retain the largest fragment (with at least one heavy atom) of each molecule. | ||
- **Canonicalization**: Generate a unique representation for each molecule. | ||
- **Tautomerization**: Normalize tautomers to standardize inputs. | ||
|
||
2. **Feature Space Proximity Models**: | ||
- Build **KNN-based proximity graphs** for: | ||
- **Descriptor Features**: Using molecular descriptors (RDKit, Mordred). | ||
- **Fingerprints**: Using chemical fingerprints (ECFP) for structural similarity. | ||
|
||
3. **2D Projections**: | ||
- **LogP vs pKa**: Provide a chemically intuitive 2D Space. | ||
- Projections (t-SNE, UMAP, etc): | ||
- **Descriptor Space** | ||
- **Fingerprint Space** | ||
|
||
4. **Interactive Visualization**: | ||
- **Hover**: Displays a drawing of the molecule. | ||
- **5 Closest Neighbors (2 sets)**: | ||
- **Blue Lines**: Descriptor-based neighbors. | ||
- **Green Lines**: Fingerprint-based neighbors. | ||
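
To give a feel for the fingerprint-based neighbors in the pipeline above, here is a small brute-force sketch, not the workbench implementation. It assumes Morgan fingerprints (RDKit's ECFP-style fingerprints) with radius 2 and 2048 bits, and ranks neighbors by Tanimoto similarity; the `fingerprint_neighbors` helper and its parameters are hypothetical.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def fingerprint_neighbors(smiles_list, k=5, radius=2, n_bits=2048):
    """Map each SMILES to its k most similar neighbors by Tanimoto similarity.

    Brute force (all pairs), which is fine for a demo but would need an
    index or batching for large datasets. Inputs are assumed to be valid.
    """
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    fps = [generator.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]
    neighbors = {}
    for i, fp in enumerate(fps):
        sims = DataStructs.BulkTanimotoSimilarity(fp, fps)
        # Rank every other molecule by similarity, most similar first
        ranked = sorted(
            (j for j in range(len(fps)) if j != i),
            key=lambda j: sims[j],
            reverse=True,
        )
        neighbors[smiles_list[i]] = [(smiles_list[j], sims[j]) for j in ranked[:k]]
    return neighbors


compounds = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1"]
graph = fingerprint_neighbors(compounds, k=2)
for smi, nbrs in graph.items():
    print(smi, "->", [(n, round(s, 2)) for n, s in nbrs])
```

The same neighbor lists could be built over descriptor vectors with a distance-based KNN, which would give the second set of edges shown in the visualization.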
|
||
### Interactivity Highlights: | ||
- **Neighbor Connections**: Clearly differentiate relationships with color-coded edges. | ||
- **Hover Effects**: Enable chemists to interactively explore molecular neighborhoods. | ||
- **Projection Linking**: Allow toggling between Descriptor and Fingerprint spaces. | ||
|
||
|
||
### Data | ||
**AqSolDB:** A curated reference set of aqueous solubility data created by the Autonomous Energy Materials Discovery (AMD) research group. It contains aqueous solubility values for 9,982 unique compounds, curated from 9 publicly available datasets. | ||
Source: [Nature Scientific Data](https://www.nature.com/articles/s41597-019-0151-1) | ||
|
||
**Download from Harvard DataVerse:** | ||
[Harvard DataVerse: AqSolDB](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OVHAW8) | ||
|
||
|
||
### Python Packages | ||
|
||
- **[RDKit](https://github.com/rdkit/rdkit):** Open-source toolkit for cheminformatics, used for tasks like SMILES validation, fragment selection, and stereochemistry handling. | ||
- **[Mordred Community](https://github.com/JacksonBurns/mordred-community):** A community-maintained molecular descriptor calculator for feature extraction and engineering. | ||
|
||
|
||
|
||
### Canonicalization and Tautomerization | ||
For an in-depth look at why **Canonicalization** and **Tautomerization** are crucial for compound preprocessing, see our blog on [Canonicalization and Tautomerization](canonicalization_and_tautomerization.md). It covers the importance of standardizing molecular representations to ensure robust and reproducible machine learning workflows. | ||
|
||
|
||
## Conclusion | ||
|
||
|
||
Stay tuned for the next blog, where we'll dive into feature engineering for chemical compounds! |