
Update docs for doslearn #9

Merged
merged 13 commits into from Dec 12, 2024
16 changes: 12 additions & 4 deletions README.md
@@ -36,17 +36,25 @@ An equivariant model is then trained to predict coefficients $d_b^{\text{ML}}$ t

For one of the original workflows for predicting the electron density under the density-fitting framework, readers are referred to [*SALTED*](https://github.com/andreagrisafi/SALTED), which uses symmetry-adapted Gaussian process regression (SA-GPR), implemented via sparse kernel ridge regression, to learn and predict $d_b^{\text{ML}}$.

## Electronic density of states (DOS)

The electronic density of states (DOS) describes the distribution of available electronic states in a material. From the DOS, one can infer many optical and electronic properties of a material, such as its electrical conductivity, band gap, and absorption spectrum, which makes the DOS a useful screening tool in materials design. The DOS is typically computed using DFT but, as mentioned above, DFT is prohibitively expensive for large and complex systems.

Machine learning has also been applied to the DOS, and a variety of representations have been developed for it. Thus far, there have been three main approaches: 1) projecting the DOS on a discretized energy grid, 2) projecting the integrated DOS on a discretized energy grid, and 3) decomposing the DOS using principal component analysis (PCA). Unlike the electron density, the DOS is invariant under rotations, so an invariant model can be used to predict the DOS in any of the three representations.
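
As a minimal illustration of the first approach (all values below are hypothetical), the DOS can be projected onto a uniform energy grid by Gaussian smearing of a set of eigenvalues:

```python
import numpy as np

def dos_on_grid(eigenvalues, x_dos, sigma):
    """Project the DOS onto a discretized energy grid.

    Each eigenvalue contributes a normalized Gaussian of width ``sigma``;
    summing over eigenvalues evaluates the DOS at every grid point.
    """
    diffs = x_dos[None, :] - eigenvalues[:, None]  # shape: (n_eigs, n_grid)
    gaussians = np.exp(-0.5 * (diffs / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return gaussians.sum(axis=0)

eigenvalues = np.array([-7.3, -5.1, -5.0, -1.2, 0.4])  # eV, illustrative
x_dos = np.arange(-10.0, 5.0, 0.05)                    # uniform energy grid
dos = dos_on_grid(eigenvalues, x_dos, sigma=0.3)
```

The second approach then corresponds to the cumulative integral of `dos` over the grid, e.g. `np.cumsum(dos) * (x_dos[1] - x_dos[0])`.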

# Goals

`rholearn` also operates under the density-fitting approach. The nuclear coordinates $\to$ electronic density mapping is learned *via* a feature-based equivariant neural network whose outputs are the predicted coefficients. Currently, `rholearn` is integrated with the electronic structure code `FHI-aims` for both data generation and building of real-space fields from predicted coefficients.
`rholearn` also operates under the density-fitting approach. The nuclear coordinates $\to$ electronic density mapping is learned *via* a feature-based equivariant neural network whose outputs are the predicted coefficients. Currently, `rholearn` is integrated with the electronic structure code `FHI-aims` for both data generation and building of real-space fields from predicted coefficients. `rholearn` aims to improve the scalability of the density-fitting approach to learning electronic densities.

`doslearn` represents the DOS by projecting it on a discretized energy grid. Additionally, a locality ansatz is employed, whereby the global DOS of a structure is expressed as a sum of local contributions from each atomic environment.
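
In code, the locality ansatz is just a sum over per-atom predictions; a sketch with hypothetical array shapes:

```python
import numpy as np

# Hypothetical per-atom DOS contributions for one structure: one row per
# atomic environment, one column per energy grid point.
local_dos = np.random.rand(32, 300)  # (n_atoms, n_grid), placeholder values

# Locality ansatz: the global DOS is the sum of the local contributions.
global_dos = local_dos.sum(axis=0)   # shape: (n_grid,)
```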

`rholearn` aims to improve the scalability of the density-fitting approach to learning electronic densities. It is built on top of a modular software ecosystem, with the following packages forming the main components of the workflow:
Both are built on top of a modular software ecosystem, with the following packages forming the main components of the workflow:

* **`metatensor`** ([GitHub](https://github.com/lab-cosmo/metatensor)) is used as the self-describing block-sparse data storage format, wrapping multidimensional tensors with metadata. The subpackages `metatensor-operations` and `metatensor-learn` provide convenient sparse operations and ML building blocks, respectively, that operate on the `metatensor.TensorMap` object.
* **`rascaline`** ([GitHub](https://github.com/luthaf/rascaline)) is used to transform the nuclear coordinates into local equivariant descriptors that encode physical symmetries and geometric information for input into the neural network.
* **`featomic`** ([GitHub](https://github.com/metatensor/featomic)) is used to transform the nuclear coordinates into local equivariant descriptors that encode physical symmetries and geometric information for input into the neural network.
* **`PyTorch`** is used as the learning framework, allowing definition of arbitrarily complex neural networks that can be trained by minibatch gradient descent.
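
As a rough sketch of how these pieces fit together, a SOAP power spectrum descriptor can be computed with `featomic` for a set of frames. The dataset path, the hyperparameter values, and the exact hypers schema below are assumptions that may differ between versions:

```python
import ase.io
from featomic import SoapPowerSpectrum

frames = ase.io.read("dataset.xyz", ":")  # hypothetical path to a dataset

# Illustrative hyperparameters only; consult the featomic documentation
# for the schema used by your installed version.
calculator = SoapPowerSpectrum(
    cutoff={"radius": 5.0, "smoothing": {"type": "ShiftedCosine", "width": 0.5}},
    density={"type": "Gaussian", "width": 0.3},
    basis={
        "type": "TensorProduct",
        "max_angular": 6,
        "radial": {"type": "Gto", "max_radial": 7},
    },
)

descriptor = calculator.compute(frames)  # a metatensor TensorMap
```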

Leveraging the speed- and memory-efficient operations of `torch`, and building on top of `metatensor` and `rascaline`, descriptors, models, and learning methodologies can be flexibly prototyped and customized for a specific learning task.
Leveraging the speed- and memory-efficient operations of `torch`, and building on top of `metatensor` and `featomic`, descriptors, models, and learning methodologies can be flexibly prototyped and customized for a specific learning task.


# Getting Started
2 changes: 1 addition & 1 deletion example/doslearn-aims-tutorial/README.md
@@ -6,7 +6,7 @@ This tutorial follows two parts: 1) data generation with `FHI-aims` and 2) model

First, data is generated with `FHI-aims`: SCF calculations are converged to obtain the self-consistent solutions of the Kohn-Sham equations for each frame, and the eigenvalues are output on a dense k-grid.

Second, the reference splined eigenvalues form the dataset for training a machine learning model. In `doslearn`, an invariant power spectrum descriptor-based neural network is used to learn the mapping from nuclear coordinates to the density of states. A model is trained iteratively over a number of epochs, optimizing the NN weights by backpropagation and gradient descent. Also optimized is an alignment vector that finds the optimal shift of the DOS for each frame to aid model training.
Second, the DOS of each structure is computed via Gaussian smearing of the eigenvalues obtained from the SCF calculations. The DOS for each frame is represented by cubic Hermite splines to facilitate changing the energy reference, which is an ill-defined quantity in bulk calculations. Since the chosen energy reference is arbitrary, this training workflow finds the energy reference that is optimal for model training. These splines form the dataset for training a machine learning model. In `doslearn`, an invariant power spectrum descriptor-based neural network is used to learn the mapping from nuclear coordinates to the DOS projected on a discretized energy grid. A model is trained iteratively over a number of epochs, optimizing the NN weights by backpropagation and gradient descent. An alignment vector, representing the energy reference of the DOS for each frame, is optimized alongside the weights to aid model training.
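
A minimal sketch of the spline idea using `scipy` rather than `doslearn`'s internal API: once the smeared DOS is represented as a cubic Hermite spline, it can be cheaply re-evaluated under any candidate energy-reference shift:

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

# Synthetic DOS on a uniform energy grid (illustrative values only)
x_dos = np.arange(-10.0, 5.0, 0.05)
dos = np.exp(-0.5 * ((x_dos + 3.0) / 0.5) ** 2)  # a single Gaussian peak

# Cubic Hermite splines need values and derivatives at the knots
dos_grad = np.gradient(dos, x_dos)
spline = CubicHermiteSpline(x_dos, dos, dos_grad)

# Re-evaluate the DOS relative to a trial energy reference
shift = 0.7  # eV, a candidate shift
shifted_dos = spline(x_dos + shift)
```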

## Supporting notebooks

2 changes: 1 addition & 1 deletion example/doslearn-aims-tutorial/part-1-dft/README.md
@@ -62,7 +62,7 @@ The calculation has (hopefully) converged to the SCF solution for the given inpu

Now process the SCF outputs. First, the "aims.out" file is parsed to extract various information, which is pickled to the file "calc_info.pickle" in the `FHI-aims` run directory.

Next, the eigenvalues contained in "Final_KS_eigenvalues.dat" are parsed and splines are constructed for them, according to the spline settings `DOS_SPLINES` specified in [dft-options.yaml](dft-options.yaml). These splines are stored in `TensorMap` format and saved to a series of processed data directories, at a path of the form (e.g. for structure index 0) `data/processed/0/dos/dos_spline.npz`.
Next, the eigenvalues contained in "Final_KS_eigenvalues.dat" are parsed and used to construct the DOS via Gaussian smearing. To facilitate shifting of the energy reference, the DOS is represented using cubic Hermite splines, constructed according to the spline settings `DOS_SPLINES` specified in [dft-options.yaml](dft-options.yaml). In these settings, `min_energy` and `max_energy` define the energy window for which the splines are computed; the endpoints of each spline piece are distributed uniformly on the energy grid, with a spacing defined by `interval`; and `sigma` sets the width of the Gaussian smearing used to define the DOS. The splines are stored in `TensorMap` format and saved to a series of processed data directories, at a path of the form (e.g. for structure index 0) `data/processed/0/dos/dos_spline.npz`.
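
A hypothetical `DOS_SPLINES` block in [dft-options.yaml](dft-options.yaml) might look as follows; the field names follow the description above, and the values are illustrative only:

```yaml
DOS_SPLINES:
  min_energy: -20.0  # eV, lower edge of the energy window for the splines
  max_energy: 10.0   # eV, upper edge of the energy window for the splines
  interval: 0.05     # eV, uniform spacing of the spline knots on the grid
  sigma: 0.3         # eV, width of the Gaussian smearing defining the DOS
```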

Splines generated with different parameters can be saved to different processed data directories by changing the default option for `RUN_DIR` in [dft-options.yaml](dft-options.yaml).

8 changes: 4 additions & 4 deletions example/doslearn-aims-tutorial/part-1-dft/part-1-dft.ipynb
@@ -45,7 +45,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -67,7 +67,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -80,7 +80,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "rho",
"display_name": "rholearn",
"language": "python",
"name": "python3"
},
@@ -94,7 +94,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
"version": "3.13.0"
}
},
"nbformat": 4,
2 changes: 1 addition & 1 deletion example/doslearn-aims-tutorial/part-2-ml/README.md
@@ -19,7 +19,7 @@ where each command can be wrapped in an HPC submission script.

## 2.1: Key parts of the training workflow

The training script trains a machine learning model based on the output of `FHI-aims` generated by [part-1-dft](../part-1-dft/). The training pipeline generates SOAP power spectrum descriptors for the structures in the .xyz file defined in the `XYZ` field of [dft-options.yaml](../part-1-dft/dft-options.yaml). A neural network (whose architecture is defined by the `HIDDEN_LAYER_WIDTHS` parameter in [ml-options.yaml](../part-2-ml/ml-options.yaml)) is used to predict the atomic contributions to the global DOS on an energy grid. Minibatch gradient descent is used to train the model, with concurrent optimization of an adaptive energy reference used in loss evaluation.
The training script trains a machine learning model based on the output of `FHI-aims` generated by [part-1-dft](../part-1-dft/). The training pipeline generates SOAP power spectrum descriptors for the structures in the .xyz file defined in the `XYZ` field of [dft-options.yaml](../part-1-dft/dft-options.yaml). A neural network (whose architecture is defined by the `HIDDEN_LAYER_WIDTHS` parameter in [ml-options.yaml](../part-2-ml/ml-options.yaml)) is used to predict the atomic contributions to the global DOS on an energy grid. The loss is evaluated by comparing the predictions against the spline-reconstructed DOS, evaluated relative to the optimizable energy reference. Minibatch gradient descent is used to train the model, with concurrent optimization of the energy reference of each training frame.
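
A sketch of the adaptive-energy-reference idea, not `doslearn`'s exact implementation: one learnable shift per training frame is optimized jointly with the network weights, and the loss compares predictions against the spline targets evaluated at the shifted energies. The helper `evaluate_splines` below is hypothetical:

```python
import torch

n_frames, n_grid = 8, 300
x_dos = torch.linspace(-10.0, 5.0, n_grid)          # shared energy grid
shifts = torch.zeros(n_frames, requires_grad=True)  # one shift per frame

def shifted_mse(predictions, evaluate_splines, shifts):
    # `evaluate_splines` evaluates each frame's spline DOS at the shifted
    # grid positions, returning a tensor of shape (n_frames, n_grid).
    targets = evaluate_splines(x_dos[None, :] + shifts[:, None])
    # Integrate the squared error over energy, then average over frames
    return torch.trapezoid((predictions - targets) ** 2, x_dos, dim=1).mean()

# The shifts are updated alongside the model parameters, e.g.:
# optimizer = torch.optim.Adam([*model.parameters(), shifts], lr=1e-3)
```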

## 2.2: Specify ML settings

151 changes: 151 additions & 0 deletions example/doslearn-aims-tutorial/part-2-ml/part-2-ml.ipynb
@@ -0,0 +1,151 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import ase\n",
"import ase.io\n",
"import ase.visualize\n",
"from rholearn.options import get_options\n",
"from rholearn.rholearn import train_utils as rho_train_utils\n",
"from rholearn.utils import io, system"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dft_options = get_options(\"dft\", \"doslearn\")\n",
"ml_options = get_options(\"ml\", \"doslearn\")\n",
"if dft_options.get(\"IDX_SUBSET\") is not None:\n",
" frame_idxs = dft_options.get(\"IDX_SUBSET\")\n",
"else:\n",
" frame_idxs = None\n",
"\n",
"all_frames = system.read_frames_from_xyz(dft_options[\"XYZ\"], frame_idxs)\\\n",
"\n",
"if frame_idxs is None:\n",
" frame_idxs = list(range(len(all_frames)))\n",
"\n",
"_, _, test_id = (\n",
" rho_train_utils.crossval_idx_split( # cross-validation split of idxs\n",
" frame_idxs=frame_idxs,\n",
" n_train=ml_options[\"N_TRAIN\"],\n",
" n_val=ml_options[\"N_VAL\"],\n",
" n_test=ml_options[\"N_TEST\"],\n",
" seed=ml_options[\"SEED\"],\n",
" )\n",
" )\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_energy = dft_options[\"DOS_SPLINES\"][\"max_energy\"] - ml_options[\"TARGET_DOS\"][\"max_energy_buffer\"]\n",
"min_energy = dft_options[\"DOS_SPLINES\"][\"min_energy\"]\n",
"interval = dft_options[\"DOS_SPLINES\"][\"interval\"]\n",
"\n",
"n_grid_points = int(np.ceil((max_energy - min_energy) \\\n",
" / interval))\n",
"x_dos = min_energy + np.arange(n_grid_points) * interval\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test Statistics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_mse = np.load('./evaluation/epoch_best/Test/test_MSEs.npy')\n",
"test_emse = np.load('./evaluation/epoch_best/Test/test_eMSEs.npy')\n",
"\n",
"print (\"Test Statistics\")\n",
"print (\"RMSE: \", f'{np.sqrt(np.mean(test_mse)):.5}')\n",
"plt.title(\"Test Energywise MSE\")\n",
"plt.plot(x_dos, test_emse)\n",
"plt.xlabel(\"Energy (eV)\")\n",
"plt.ylabel(\"MSE\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspecting highest error prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_index = test_id[np.argmax(test_mse)]\n",
"\n",
"# Load prediction and target\n",
"\n",
"prediction = np.load(f'./evaluation/epoch_best/{max_index}/normalized_prediction.npy')\n",
"target = np.load(f'./evaluation/epoch_best/{max_index}/aligned_target.npy')\n",
"\n",
"plt.plot(x_dos, prediction, color = 'red' , label = 'Prediction')\n",
"plt.plot(x_dos, target, color = 'blue', label = 'Target')\n",
"plt.legend()\n",
"plt.title(\"Prediction with highest error\")\n",
"plt.xlabel(\"Energy (eV)\")\n",
"plt.ylabel(\"DOS\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "rholearn",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -20,7 +20,7 @@ dependencies = [
"chemfiles",
"chemiscope",
"metatensor[torch]",
"featomic-torch @ git+https://github.com/metatensor/featomic@edd717229358891fbdcc218c871ca1f193aa23d9#subdirectory=python/featomic_torch",
"featomic-torch @ git+https://github.com/metatensor/featomic@368032010f06dec46a59f3fd7fb3720c07f53c69#subdirectory=python/featomic_torch",
"vesin",

# Extra dependencies
76 changes: 70 additions & 6 deletions rholearn/doslearn/eval.py
@@ -2,8 +2,11 @@
import time
from os.path import join

import metatensor.torch as mts
import numpy as np
import torch

from rholearn.doslearn import train_utils
from rholearn.options import get_options
from rholearn.rholearn import train_utils as rho_train_utils
from rholearn.utils import io, system
@@ -37,10 +40,23 @@ def eval():
io.log(log_path, "===== BEGIN =====")
io.log(log_path, f"Working directory: {ml_options['ML_DIR']}")

# Load Spline Positions
spline_positions = train_utils.get_spline_positions(
min_energy=dft_options["DOS_SPLINES"]["min_energy"],
max_energy=dft_options["DOS_SPLINES"]["max_energy"]
- ml_options["TARGET_DOS"]["max_energy_buffer"],
interval=dft_options["DOS_SPLINES"]["interval"],
)

# Random seed and dtype
torch.manual_seed(ml_options["SEED"])
torch.set_default_dtype(getattr(torch, ml_options["TRAIN"]["dtype"]))

# Get a callable to the evaluation subdirectory, parametrized by frame idx
rebuild_dir = lambda frame_idx: join( # noqa: E731
ml_options["EVAL_DIR"](ml_options["EVAL"]["eval_epoch"]), f"{frame_idx}"
)

# ===== Get the test data =====

io.log(log_path, "Get the test IDs and frames")
@@ -53,6 +69,15 @@
seed=ml_options["SEED"],
)
)
dtype = getattr(torch, ml_options["TRAIN"]["dtype"])
device = torch.device(ml_options["TRAIN"]["device"])
test_splines_mts = [
mts.load(join(dft_options["PROCESSED_DIR"](A), "dos_spline.npz")).to(
dtype=dtype, device=device
)
for A in test_id
]
test_splines = torch.vstack([i.blocks(0)[0].values for i in test_splines_mts])
test_frames = [all_frames[A] for A in test_id]

# ===== Load model and perform inference =====
@@ -68,19 +93,55 @@
# Perform inference
io.log(log_path, "Perform inference")
t0_infer = time.time()
test_preds_mts = model.predict(frames=test_frames, frame_idxs=test_id) # noqa: F841
test_preds_mts = model(frames=test_frames, frame_idxs=test_id)
test_preds = mts.sum_over_samples(test_preds_mts, "atom")
test_preds = test_preds[0].values
dt_infer = time.time() - t0_infer
test_preds_numpy = test_preds.detach().numpy()
for index_A, A in enumerate(test_id):
os.makedirs(rebuild_dir(A), exist_ok=True)
np.save(
join(rebuild_dir(A), "prediction.npy"), test_preds_numpy[index_A]
) # Save in each directory
io.log(log_path, rho_train_utils.report_dt(dt_infer, "Model inference complete"))
io.log(
log_path, rho_train_utils.report_dt(dt_infer / len(test_id), " or per frame")
)

# TODO: evaluate some metrics, and save predictions?
# ...
# Evaluate Test DOS RMSE
# Normalize DOS wrt system size for evaluation
test_preds_eval = mts.mean_over_samples(test_preds_mts, "atom")
test_preds_eval = test_preds_eval[0].values
# Obtain shift-invariant MSE
test_MSE, shifted_test_targets, test_shifts = train_utils.opt_mse_spline(
test_preds_eval,
model._x_dos,
test_splines,
spline_positions,
n_epochs=200,
)
errors = (test_preds_eval - shifted_test_targets) ** 2
samplewise_error = torch.trapezoid(errors, model._x_dos, dim=1).detach().numpy()
energywise_error = torch.mean(errors, dim=0).detach().numpy()
os.makedirs(rebuild_dir("Test"), exist_ok=True)
np.save(join(rebuild_dir("Test"), "test_MSEs.npy"), samplewise_error)
np.save(join(rebuild_dir("Test"), "test_eMSEs.npy"), energywise_error)
# Save predictions and targets in each folder
for index_A, A in enumerate(test_id):
np.save(
join(rebuild_dir(A), "normalized_prediction.npy"),
test_preds_eval[index_A].detach().numpy(),
)
np.save(
join(rebuild_dir(A), "aligned_target.npy"),
shited_test_targets[index_A].detach().numpy(),
)
np.save(
join(rebuild_dir(A), "target_shift.npy"),
test_shifts[index_A].detach().numpy(),
)

# io.log(
# log_path, f"Mean % NMAE per structure: {torch.mean(torch.tensor(nmaes)):.5f}"
# )
io.log(log_path, f"Test RMSE: {torch.sqrt(test_MSE):.5f}")

# ===== Finish =====
dt_eval = time.time() - t0_eval
@@ -102,6 +163,9 @@ def _get_options():
dft_options["SCF_DIR"] = lambda frame_idx: join(
dft_options["DATA_DIR"], "raw", f"{frame_idx}"
)
dft_options["PROCESSED_DIR"] = lambda frame_idx: join(
dft_options["DATA_DIR"], "processed", f"{frame_idx}", dft_options["RUN_ID"]
)
ml_options["ML_DIR"] = os.getcwd()
ml_options["CHKPT_DIR"] = rho_train_utils.create_subdir(os.getcwd(), "checkpoint")
ml_options["EVAL_DIR"] = rho_train_utils.create_subdir(os.getcwd(), "evaluation")