
Update docs for doslearn #9

Merged
merged 13 commits into from Dec 12, 2024
16 changes: 12 additions & 4 deletions README.md
@@ -36,17 +36,25 @@ An equivariant model is then trained to predict coefficients $d_b^{\text{ML}}$ t

For one of the original workflows for predicting the electron density under the density-fitting framework, readers are referred to [*SALTED*](https://github.com/andreagrisafi/SALTED), which uses symmetry-adapted Gaussian process regression (SA-GPR), implemented via sparse kernel ridge regression, to learn and predict $d_b^{\text{ML}}$.

## Electronic density of states (DOS)

The electronic density of states (DOS) describes the distribution of available electronic states in a material. From the DOS, one can infer many optical and electronic properties of a material, such as its electrical conductivity, band gap, and absorption spectrum, which makes the DOS a useful screening tool in materials design. The DOS is typically computed using DFT but, as mentioned above, DFT is prohibitively expensive for large and complex systems.

Machine learning has also been applied to the DOS, and a variety of representations have been developed for it. Thus far, there have been three main approaches: 1) projecting the DOS on a discretized energy grid, 2) projecting the integrated DOS on a discretized energy grid, and 3) decomposing the DOS using principal component analysis (PCA). Unlike the electron density, the DOS is invariant under rotations, so an invariant model can be used to predict the DOS in any of the three representations.
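
As a minimal illustration of the first approach (all values below are hypothetical), the DOS can be projected onto a uniform energy grid by Gaussian smearing of a set of eigenvalues:

```python
import numpy as np

def dos_on_grid(eigenvalues, x_dos, sigma):
    """Project the DOS onto a discretized energy grid.

    Each eigenvalue contributes a normalized Gaussian of width ``sigma``;
    summing over eigenvalues evaluates the DOS at every grid point.
    """
    diffs = x_dos[None, :] - eigenvalues[:, None]  # shape: (n_eigs, n_grid)
    gaussians = np.exp(-0.5 * (diffs / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return gaussians.sum(axis=0)

eigenvalues = np.array([-7.3, -5.1, -5.0, -1.2, 0.4])  # eV, illustrative
x_dos = np.arange(-10.0, 5.0, 0.05)                    # uniform energy grid
dos = dos_on_grid(eigenvalues, x_dos, sigma=0.3)
```

The second approach then corresponds to the cumulative integral of `dos` over the grid, e.g. `np.cumsum(dos) * (x_dos[1] - x_dos[0])`.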

# Goals

`rholearn` also operates under the density-fitting approach. The nuclear coordinates $\to$ electronic density mapping is learned *via* a feature-based equivariant neural network whose outputs are the predicted coefficients. Currently, `rholearn` is integrated with the electronic structure code `FHI-aims` for both data generation and building of real-space fields from predicted coefficients.
`rholearn` also operates under the density-fitting approach. The nuclear coordinates $\to$ electronic density mapping is learned *via* a feature-based equivariant neural network whose outputs are the predicted coefficients. Currently, `rholearn` is integrated with the electronic structure code `FHI-aims` for both data generation and building of real-space fields from predicted coefficients. `rholearn` aims to improve the scalability of the density-fitting approach to learning electronic densities.

`doslearn` represents the DOS by projecting it on a discretized energy grid. Additionally, a locality ansatz is employed, whereby the global DOS of a structure is expressed as a sum of local contributions from each atomic environment.
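
In code, the locality ansatz is just a sum over per-atom predictions; a sketch with hypothetical array shapes:

```python
import numpy as np

# Hypothetical per-atom DOS contributions for one structure: one row per
# atomic environment, one column per energy grid point.
local_dos = np.random.rand(32, 300)  # (n_atoms, n_grid), placeholder values

# Locality ansatz: the global DOS is the sum of the local contributions.
global_dos = local_dos.sum(axis=0)   # shape: (n_grid,)
```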

`rholearn` aims to improve the scalability of the density-fitting approach to learning electronic densities. It is built on top of a modular software ecosystem, with the following packages forming the main components of the workflow:
Both are built on top of a modular software ecosystem, with the following packages forming the main components of the workflow:

* **`metatensor`** ([GitHub](https://github.com/lab-cosmo/metatensor)) is used as the self-describing block-sparse data storage format, wrapping multidimensional tensors with metadata. The subpackages `metatensor-operations` and `metatensor-learn` provide convenient sparse operations and ML building blocks, respectively, that operate on the `metatensor.TensorMap` object.
* **`rascaline`** ([GitHub](https://github.com/luthaf/rascaline)) is used to transform the nuclear coordinates into local equivariant descriptors that encode physical symmetries and geometric information for input into the neural network.
* **`featomic`** ([GitHub](https://github.com/metatensor/featomic)) is used to transform the nuclear coordinates into local equivariant descriptors that encode physical symmetries and geometric information for input into the neural network.
* **`PyTorch`** is used as the learning framework, allowing definition of arbitrarily complex neural networks that can be trained by minibatch gradient descent.
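
As a rough sketch of how these pieces fit together, a SOAP power spectrum descriptor can be computed with `featomic` for a set of frames. The dataset path, the hyperparameter values, and the exact hypers schema below are assumptions that may differ between versions:

```python
import ase.io
from featomic import SoapPowerSpectrum

frames = ase.io.read("dataset.xyz", ":")  # hypothetical path to a dataset

# Illustrative hyperparameters only; consult the featomic documentation
# for the schema used by your installed version.
calculator = SoapPowerSpectrum(
    cutoff={"radius": 5.0, "smoothing": {"type": "ShiftedCosine", "width": 0.5}},
    density={"type": "Gaussian", "width": 0.3},
    basis={
        "type": "TensorProduct",
        "max_angular": 6,
        "radial": {"type": "Gto", "max_radial": 7},
    },
)

descriptor = calculator.compute(frames)  # a metatensor TensorMap
```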

Leveraging the speed- and memory-efficient operations of `torch`, and building on top of `metatensor` and `rascaline`, descriptors, models, and learning methodologies can be flexibly prototyped and customized for a specific learning task.
Leveraging the speed- and memory-efficient operations of `torch`, and building on top of `metatensor` and `featomic`, descriptors, models, and learning methodologies can be flexibly prototyped and customized for a specific learning task.


# Getting Started
2 changes: 1 addition & 1 deletion example/doslearn-aims-tutorial/README.md
@@ -6,7 +6,7 @@ This tutorial follows two parts: 1) data generation with `FHI-aims` and 2) model

First, data is generated with `FHI-aims`: SCF calculations are converged to obtain the self-consistent solutions of the Kohn-Sham equations for each frame, and the eigenvalues are output on a dense k-grid.

Second, the reference splined eigenvalues form the dataset for training a machine learning model. In `doslearn`, an invariant power spectrum descriptor-based neural network is used to learn the mapping from nuclear coordinates to the density of states. A model is trained iteratively over a number of epochs, optimizing the NN weights by backpropagation and gradient descent. Also optimized is an alignment vector that finds the optimal shift of the DOS for each frame to aid model training.
Second, the DOS of each structure is computed via Gaussian smearing of the eigenvalues obtained from the SCF calculations. The DOS for each frame is represented by cubic Hermite splines to facilitate changing the energy reference, which is an ill-defined quantity in bulk calculations. Since the chosen energy reference is arbitrary, this training workflow finds the energy reference that is optimal for model training. These splines form the dataset for training a machine learning model. In `doslearn`, an invariant power spectrum descriptor-based neural network is used to learn the mapping from nuclear coordinates to the DOS projected on a discretized energy grid. A model is trained iteratively over a number of epochs, optimizing the NN weights by backpropagation and gradient descent. An alignment vector, representing the energy reference of the DOS for each frame, is optimized alongside the weights to aid model training.
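
A minimal sketch of the spline idea using `scipy` rather than `doslearn`'s internal API: once the smeared DOS is represented as a cubic Hermite spline, it can be cheaply re-evaluated under any candidate energy-reference shift:

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

# Synthetic DOS on a uniform energy grid (illustrative values only)
x_dos = np.arange(-10.0, 5.0, 0.05)
dos = np.exp(-0.5 * ((x_dos + 3.0) / 0.5) ** 2)  # a single Gaussian peak

# Cubic Hermite splines need values and derivatives at the knots
dos_grad = np.gradient(dos, x_dos)
spline = CubicHermiteSpline(x_dos, dos, dos_grad)

# Re-evaluate the DOS relative to a trial energy reference
shift = 0.7  # eV, a candidate shift
shifted_dos = spline(x_dos + shift)
```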

## Supporting notebooks

2 changes: 1 addition & 1 deletion example/doslearn-aims-tutorial/part-1-dft/README.md
@@ -62,7 +62,7 @@ The calculation has (hopefully) converged to the SCF solution for the given inpu

Now process the SCF outputs. First, the "aims.out" file is parsed to extract various information, which is pickled to the file "calc_info.pickle" in the `FHI-aims` run directory.

Next, the eigenvalues contained in "Final_KS_eigenvalues.dat" are parsed and splines are constructed for them, according to the spline settings `DOS_SPLINES` specified in [dft-options.yaml](dft-options.yaml). These splines are stored in `TensorMap` format and saved to a series of processed data directories, at a path of the form (e.g. for structure index 0) `data/processed/0/dos/dos_spline.npz`.
Next, the eigenvalues contained in "Final_KS_eigenvalues.dat" are parsed and used to construct the DOS via Gaussian smearing. To facilitate shifting of the energy reference, the DOS is represented using cubic Hermite splines, constructed according to the spline settings `DOS_SPLINES` specified in [dft-options.yaml](dft-options.yaml). In these settings, `min_energy` and `max_energy` define the energy window for which the splines are computed; the endpoints of each spline piece are distributed uniformly on the energy grid, with a spacing defined by `interval`; and `sigma` sets the width of the Gaussian smearing used to define the DOS. The splines are stored in `TensorMap` format and saved to a series of processed data directories, at a path of the form (e.g. for structure index 0) `data/processed/0/dos/dos_spline.npz`.
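
A hypothetical `DOS_SPLINES` block in [dft-options.yaml](dft-options.yaml) might look as follows; the field names follow the description above, and the values are illustrative only:

```yaml
DOS_SPLINES:
  min_energy: -20.0  # eV, lower edge of the energy window for the splines
  max_energy: 10.0   # eV, upper edge of the energy window for the splines
  interval: 0.05     # eV, uniform spacing of the spline knots on the grid
  sigma: 0.3         # eV, width of the Gaussian smearing defining the DOS
```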

Splines generated with different parameters can be saved to different processed data directories by changing the default option for `RUN_DIR` in [dft-options.yaml](dft-options.yaml).

8 changes: 4 additions & 4 deletions example/doslearn-aims-tutorial/part-1-dft/part-1-dft.ipynb
@@ -45,7 +45,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -67,7 +67,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -80,7 +80,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "rho",
"display_name": "rholearn",
"language": "python",
"name": "python3"
},
@@ -94,7 +94,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
"version": "3.13.0"
}
},
"nbformat": 4,
2 changes: 1 addition & 1 deletion example/doslearn-aims-tutorial/part-2-ml/README.md
@@ -19,7 +19,7 @@ where each command can be wrapped in an HPC submission script.

## 2.1: Key parts of the training workflow

The training script trains a machine learning model based on the output of `FHI-aims` generated by [part-1-dft](../part-1-dft/). The training pipeline generates SOAP power spectrum descriptors for the structures in the .xyz file defined in the `XYZ` field of [dft-options.yaml](../part-1-dft/dft-options.yaml). A neural network (whose architecture is defined by the `HIDDEN_LAYER_WIDTHS` parameter in [ml-options.yaml](../part-2-ml/ml-options.yaml)) is used to predict the atomic contributions to the global DOS on an energy grid. Minibatch gradient descent is used to train the model, with concurrent optimization of an adaptive energy reference used in loss evaluation.
The training script trains a machine learning model based on the output of `FHI-aims` generated by [part-1-dft](../part-1-dft/). The training pipeline generates SOAP power spectrum descriptors for the structures in the .xyz file defined in the `XYZ` field of [dft-options.yaml](../part-1-dft/dft-options.yaml). A neural network (whose architecture is defined by the `HIDDEN_LAYER_WIDTHS` parameter in [ml-options.yaml](../part-2-ml/ml-options.yaml)) is used to predict the atomic contributions to the global DOS on an energy grid. The loss is evaluated by comparing the predictions against the spline-reconstructed DOS, evaluated relative to the optimizable energy reference. Minibatch gradient descent is used to train the model, with concurrent optimization of the energy reference of each training frame.
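
A sketch of the adaptive-energy-reference idea, not `doslearn`'s exact implementation: one learnable shift per training frame is optimized jointly with the network weights, and the loss compares predictions against the spline targets evaluated at the shifted energies. The helper `evaluate_splines` below is hypothetical:

```python
import torch

n_frames, n_grid = 8, 300
x_dos = torch.linspace(-10.0, 5.0, n_grid)          # shared energy grid
shifts = torch.zeros(n_frames, requires_grad=True)  # one shift per frame

def shifted_mse(predictions, evaluate_splines, shifts):
    # `evaluate_splines` evaluates each frame's spline DOS at the shifted
    # grid positions, returning a tensor of shape (n_frames, n_grid).
    targets = evaluate_splines(x_dos[None, :] + shifts[:, None])
    # Integrate the squared error over energy, then average over frames
    return torch.trapezoid((predictions - targets) ** 2, x_dos, dim=1).mean()

# The shifts are updated alongside the model parameters, e.g.:
# optimizer = torch.optim.Adam([*model.parameters(), shifts], lr=1e-3)
```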

## 2.2: Specify ML settings

151 changes: 151 additions & 0 deletions example/doslearn-aims-tutorial/part-2-ml/part-2-ml.ipynb
@@ -0,0 +1,151 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import ase\n",
"import ase.io\n",
"import ase.visualize\n",
"from rholearn.options import get_options\n",
"from rholearn.rholearn import train_utils as rho_train_utils\n",
"from rholearn.utils import io, system"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dft_options = get_options(\"dft\", \"doslearn\")\n",
"ml_options = get_options(\"ml\", \"doslearn\")\n",
"if dft_options.get(\"IDX_SUBSET\") is not None:\n",
" frame_idxs = dft_options.get(\"IDX_SUBSET\")\n",
"else:\n",
" frame_idxs = None\n",
"\n",
"all_frames = system.read_frames_from_xyz(dft_options[\"XYZ\"], frame_idxs)\\\n",
"\n",
"if frame_idxs is None:\n",
" frame_idxs = list(range(len(all_frames)))\n",
"\n",
"_, _, test_id = (\n",
" rho_train_utils.crossval_idx_split( # cross-validation split of idxs\n",
" frame_idxs=frame_idxs,\n",
" n_train=ml_options[\"N_TRAIN\"],\n",
" n_val=ml_options[\"N_VAL\"],\n",
" n_test=ml_options[\"N_TEST\"],\n",
" seed=ml_options[\"SEED\"],\n",
" )\n",
" )\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_energy = dft_options[\"DOS_SPLINES\"][\"max_energy\"] - ml_options[\"TARGET_DOS\"][\"max_energy_buffer\"]\n",
"min_energy = dft_options[\"DOS_SPLINES\"][\"min_energy\"]\n",
"interval = dft_options[\"DOS_SPLINES\"][\"interval\"]\n",
"\n",
"n_grid_points = int(np.ceil((max_energy - min_energy) \\\n",
" / interval))\n",
"x_dos = min_energy + np.arange(n_grid_points) * interval\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test Statistics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_mse = np.load('./evaluation/epoch_best/Test/test_MSEs.npy')\n",
"test_emse = np.load('./evaluation/epoch_best/Test/test_eMSEs.npy')\n",
"\n",
"print (\"Test Statistics\")\n",
"print (\"RMSE: \", f'{np.sqrt(np.mean(test_mse)):.5}')\n",
"plt.title(\"Test Energywise MSE\")\n",
"plt.plot(x_dos, test_emse)\n",
"plt.xlabel(\"Energy (eV)\")\n",
"plt.ylabel(\"MSE\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspecting highest error prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_index = test_id[np.argmax(test_mse)]\n",
"\n",
"# Load prediction and target\n",
"\n",
"prediction = np.load(f'./evaluation/epoch_best/{max_index}/normalized_prediction.npy')\n",
"target = np.load(f'./evaluation/epoch_best/{max_index}/aligned_target.npy')\n",
"\n",
"plt.plot(x_dos, prediction, color = 'red' , label = 'Prediction')\n",
"plt.plot(x_dos, target, color = 'blue', label = 'Target')\n",
"plt.legend()\n",
"plt.title(\"Prediction with highest error\")\n",
"plt.xlabel(\"Energy (eV)\")\n",
"plt.ylabel(\"DOS\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "rholearn",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -20,7 +20,7 @@ dependencies = [
"chemfiles",
"chemiscope",
"metatensor[torch]",
"featomic-torch @ git+https://github.com/metatensor/featomic@edd717229358891fbdcc218c871ca1f193aa23d9#subdirectory=python/featomic_torch",
"featomic-torch @ git+https://github.com/metatensor/featomic@368032010f06dec46a59f3fd7fb3720c07f53c69#subdirectory=python/featomic_torch",
"vesin",

# Extra dependencies
76 changes: 70 additions & 6 deletions rholearn/doslearn/eval.py
@@ -2,8 +2,11 @@
import time
from os.path import join

import metatensor.torch as mts
import numpy as np
import torch

from rholearn.doslearn import train_utils
from rholearn.options import get_options
from rholearn.rholearn import train_utils as rho_train_utils
from rholearn.utils import io, system
@@ -37,10 +40,23 @@ def eval():
io.log(log_path, "===== BEGIN =====")
io.log(log_path, f"Working directory: {ml_options['ML_DIR']}")

# Load Spline Positions
spline_positions = train_utils.get_spline_positions(
min_energy=dft_options["DOS_SPLINES"]["min_energy"],
max_energy=dft_options["DOS_SPLINES"]["max_energy"]
- ml_options["TARGET_DOS"]["max_energy_buffer"],
interval=dft_options["DOS_SPLINES"]["interval"],
)

# Random seed and dtype
torch.manual_seed(ml_options["SEED"])
torch.set_default_dtype(getattr(torch, ml_options["TRAIN"]["dtype"]))

# Get a callable to the evaluation subdirectory, parametrized by frame idx
rebuild_dir = lambda frame_idx: join( # noqa: E731
ml_options["EVAL_DIR"](ml_options["EVAL"]["eval_epoch"]), f"{frame_idx}"
)

# ===== Get the test data =====

io.log(log_path, "Get the test IDs and frames")
@@ -53,6 +69,15 @@
seed=ml_options["SEED"],
)
)
dtype = getattr(torch, ml_options["TRAIN"]["dtype"])
device = torch.device(ml_options["TRAIN"]["device"])
test_splines_mts = [
mts.load(join(dft_options["PROCESSED_DIR"](A), "dos_spline.npz")).to(
dtype=dtype, device=device
)
for A in test_id
]
test_splines = torch.vstack([i.blocks(0)[0].values for i in test_splines_mts])
test_frames = [all_frames[A] for A in test_id]

# ===== Load model and perform inference =====
@@ -68,19 +93,55 @@
# Perform inference
io.log(log_path, "Perform inference")
t0_infer = time.time()
test_preds_mts = model.predict(frames=test_frames, frame_idxs=test_id) # noqa: F841
test_preds_mts = model(frames=test_frames, frame_idxs=test_id)
test_preds = mts.sum_over_samples(test_preds_mts, "atom")
test_preds = test_preds[0].values
dt_infer = time.time() - t0_infer
test_preds_numpy = test_preds.detach().numpy()
for index_A, A in enumerate(test_id):
os.makedirs(rebuild_dir(A), exist_ok=True)
np.save(
join(rebuild_dir(A), "prediction.npy"), test_preds_numpy[index_A]
) # Save in each directory
io.log(log_path, rho_train_utils.report_dt(dt_infer, "Model inference complete"))
io.log(
log_path, rho_train_utils.report_dt(dt_infer / len(test_id), " or per frame")
)

# TODO: evaluate some metrics, and save predictions?
# ...
# Evaluate Test DOS RMSE
# Normalize DOS wrt system size for evaluation
test_preds_eval = mts.mean_over_samples(test_preds_mts, "atom")
test_preds_eval = test_preds_eval[0].values
# Obtain shift-invariant MSE
test_MSE, shifted_test_targets, test_shifts = train_utils.opt_mse_spline(
test_preds_eval,
model._x_dos,
test_splines,
spline_positions,
n_epochs=200,
)
errors = (test_preds_eval - shifted_test_targets) ** 2
samplewise_error = torch.trapezoid(errors, model._x_dos, dim=1).detach().numpy()
energywise_error = torch.mean(errors, dim=0).detach().numpy()
os.makedirs(rebuild_dir("Test"), exist_ok=True)
np.save(join(rebuild_dir("Test"), "test_MSEs.npy"), samplewise_error)
np.save(join(rebuild_dir("Test"), "test_eMSEs.npy"), energywise_error)
# Save predictions and targets in each folder
for index_A, A in enumerate(test_id):
np.save(
join(rebuild_dir(A), "normalized_prediction.npy"),
test_preds_eval[index_A].detach().numpy(),
)
np.save(
join(rebuild_dir(A), "aligned_target.npy"),
shited_test_targets[index_A].detach().numpy(),
)
np.save(
join(rebuild_dir(A), "target_shift.npy"),
test_shifts[index_A].detach().numpy(),
)

# io.log(
# log_path, f"Mean % NMAE per structure: {torch.mean(torch.tensor(nmaes)):.5f}"
# )
io.log(log_path, f"Test RMSE: {torch.sqrt(test_MSE):.5f}")

# ===== Finish =====
dt_eval = time.time() - t0_eval
@@ -102,6 +163,9 @@ def _get_options():
dft_options["SCF_DIR"] = lambda frame_idx: join(
dft_options["DATA_DIR"], "raw", f"{frame_idx}"
)
dft_options["PROCESSED_DIR"] = lambda frame_idx: join(
dft_options["DATA_DIR"], "processed", f"{frame_idx}", dft_options["RUN_ID"]
)
ml_options["ML_DIR"] = os.getcwd()
ml_options["CHKPT_DIR"] = rho_train_utils.create_subdir(os.getcwd(), "checkpoint")
ml_options["EVAL_DIR"] = rho_train_utils.create_subdir(os.getcwd(), "evaluation")