Merge pull request #2137 from NNPDF/make_code_tf_independent

Make the code compatible with other keras backends
NNPDF · Dec 6, 2024 · f4467d0
2 parents 13fda2a + bd17587
commit f4467d0

Showing 28 changed files with 405 additions and 394 deletions.
diff --git a/.github/workflows/pytorch_test.yml b/.github/workflows/pytorch_test.yml
@@ -0,0 +1,25 @@
+name: Test pytorch
+
+on: [push]
+
+jobs:
+  run_pytorch:
+    runs-on: ubuntu-latest
+    env:
+      KERAS_BACKEND: torch
+    steps:
+    - uses: actions/checkout@v4
+    - uses: actions/setup-python@v5
+      with:
+        python-version: "3.12"
+    - name: Install nnpdf without LHAPDF
+      shell: bash -l {0}
+      run: |
+        pip install .[nolha,torch]
+        # Since there is no LHAPDF in the system, initialize the folder and download pdfsets.index
+        lhapdf-management update --init
+    - name: Test we can run one runcard
+      shell: bash -l {0}
+      run: |
+        cd n3fit/runcards/examples
+        n3fit Basic_runcard.yml 4
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -12,7 +12,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, macos-14]
-        python-version: ["3.10"] # We need an older python version to avoid conflict with the pymongo pin
+        python-version: ["3.12"]
       fail-fast: false
     runs-on: ${{ matrix.os }}
     env:

diff --git a/conda-recipe/meta.yaml b/conda-recipe/meta.yaml
@@ -19,8 +19,9 @@ requirements:
         - pip
     run:
         - python >=3.9,<3.13
-        - tensorflow >=2.10,<2.17 # 2.17 works ok but the conda-forge package for macos doesn't
-        - psutil
+        - tensorflow >=2.17
+        - keras >=3.1
+        - psutil # to ensure n3fit affinity is with the right processors
         - hyperopt
         - mongodb
         - pymongo <4

diff --git a/doc/sphinx/source/get-started/nnpdfmodules.rst b/doc/sphinx/source/get-started/nnpdfmodules.rst
@@ -14,7 +14,7 @@ for an NNPDF fit is displayed in the figure below.
 The :ref:`n3fit <n3fitindex>` fitting code
 --------------------------------------------------------------------------------
 This module implements the core fitting methodology as implemented through
-the ``TensorFlow`` framework. The ``n3fit`` library allows
+the ``Keras`` framework. The ``n3fit`` library allows
 for a flexible specification of the neural network model adopted to
 parametrise the PDFs, whose settings can be selected automatically via
 the built-in :ref:`hyperoptimization algorithm <hyperoptimization>`. These

diff --git a/doc/sphinx/source/n3fit/index.rst b/doc/sphinx/source/n3fit/index.rst
@@ -6,8 +6,7 @@ Fitting code: ``n3fit``
 -  ``n3fit`` is the next generation fitting code for NNPDF developed by the
    N3PDF team :cite:p:`Carrazza:2019mzf`
 -  ``n3fit`` is responsible for fitting PDFs from NNPDF4.0 onwards.
--  The code is implemented in python using `Tensorflow <https://www.tensorflow.org>`_
-   and `Keras <https://keras.io/>`_.
+-  The code is implemented in python using `Keras <https://keras.io/>`_ and can run with `Tensorflow <https://www.tensorflow.org>`_ (default) or `pytorch <https://pytorch.org>`_ (with the environment variable ``KERAS_BACKEND=torch``).
 -  The sections below are an overview of the ``n3fit`` design.
 
 

diff --git a/doc/sphinx/source/n3fit/methodology.rst b/doc/sphinx/source/n3fit/methodology.rst
@@ -8,8 +8,8 @@ different in comparison to the latest NNPDF (i.e. `NNPDF3.1 <https://arxiv.org/a
 methodology.
 
 .. warning::
-   The default implementation of the concepts presented here are implemented with Keras and
-   Tensorflow. The ``n3fit`` code inherits its features, so in this document we avoid the discussion of
+   The concepts presented here are implemented with Keras.
+   The ``n3fit`` code inherits its features, so in this document we avoid the discussion of
    specific details which can be found in the `Keras documentation <https://keras.io/>`_.
 
 .. note::
@@ -90,7 +90,7 @@ random numbers used in training-validation, ``nnseed`` for the neural network in
 Neural network architecture
 ---------------------------
 
-The main advantage of using a modern deep learning backend such as Keras/Tensorflow consists in the
+The main advantage of using a modern deep learning backend such as Keras consists in the
 possibility to change the neural network architecture quickly as the developer is not forced to fine
 tune the code in order to achieve efficient memory management and PDF convolution performance.
 
@@ -132,41 +132,36 @@ See the `Keras documentation <https://www.tensorflow.org/api_docs/python/tf/kera
 
 .. code-block:: python
 
-   from tensorflow.keras.utils import plot_model
-   from n3fit.model_gen import pdfNN_layer_generator
-   from validphys.api import API
-
-   fit_info = API.fit(fit="NNPDF40_nnlo_as_01180_1000").as_input()
-   basis_info = fit_info["fitting"]["basis"]
-
-   pdf_models = pdfNN_layer_generator(
-       nodes=[25, 20, 8],
-       activations=["tanh", "tanh", "linear"],
-       initializer_name="glorot_normal",
-       layer_type="dense",
-       flav_info=basis_info,
-       fitbasis="EVOL",
-       out=14,
-       seed=42,
-       dropout=0.0,
-       regularizer=None,
-       regularizer_args=None,
-       impose_sumrule="All",
-       scaler=None,
-       parallel_models=1,
-   )
-
-   pdf_model = pdf_models[0]
-   nn_model = pdf_model.get_layer("NN_0")
-   msr_model = pdf_model.get_layer("impose_msr")
-   models_to_plot = {
-           'plot_pdf': pdf_model,
-           'plot_nn': nn_model,
-           'plot_msr': msr_model
-           }
-
-   for name, model in models_to_plot.items():
-       plot_model(model, to_file=f"./{name}.png", show_shapes=True)
+    from keras.utils import plot_model
+    from n3fit.model_gen import pdfNN_layer_generator
+    from validphys.api import API
+
+    fit_info = API.fit(fit="NNPDF40_nnlo_as_01180_1000").as_input()
+    basis_info = fit_info["fitting"]["basis"]
+
+    pdf_model = pdfNN_layer_generator(
+        nodes=[25, 20, 8],
+        activations=["tanh", "tanh", "linear"],
+        initializer_name="glorot_normal",
+        layer_type="dense",
+        flav_info=basis_info,
+        fitbasis="EVOL",
+        out=14,
+        seed=42,
+        dropout=0.0,
+        regularizer=None,
+        regularizer_args=None,
+        impose_sumrule="All",
+        scaler=None,
+    )
+
+    nn_model = pdf_model.get_layer("pdf_input")
+    msr_model = pdf_model.get_layer("impose_msr")
+    models_to_plot = {
+            'plot_pdf': pdf_model,
+            'plot_nn': nn_model,
+            'plot_msr': msr_model
+            }
 
 
 This will produce for instance the plot of the PDF model below, and can also be used to plot the

diff --git a/doc/sphinx/source/n3fit/runcard_detailed.rst b/doc/sphinx/source/n3fit/runcard_detailed.rst
@@ -42,7 +42,7 @@ The fraction of events that are considered for the training and validation sets
 
     dataset_inputs:
     - { dataset: SLAC_NC_NOTFIXED_P_EM-F2, frac: 0.75, variant: legacy_dw}
-  
+
 It is possible to run a fit with no validation set by setting the fraction to ``1.0``, in this case the training set will be used as validation set.
 
 The random seed for the training/validation split is defined by the variable ``trvlseed``.
@@ -280,7 +280,7 @@ of better than 35%) or higher.
 Inspecting and profiling the code
 ---------------------------------
 
-It is possible to inspect the ``n3fit`` code using `TensorBoard <https://www.tensorflow.org/tensorboard/>`_.
+It is possible to inspect the ``n3fit`` code using `TensorBoard <https://www.tensorflow.org/tensorboard/>`_ when running with the tensorflow backend.
 In order to enable the TensorBoard callback in ``n3fit`` it is enough with adding the following options in the runcard:
 
 
@@ -333,7 +333,7 @@ top-level option:
   parallel_models: true
 
 Note that currently, in order to run with parallel models, one has to set ``savepseudodata: false``
-in the ``fitting`` section of the runcard. Once this is done, the user can run ``n3fit`` with a 
+in the ``fitting`` section of the runcard. Once this is done, the user can run ``n3fit`` with a
 replica range to be parallelized (in this case from replica 1 to replica 4).
 
 .. code-block:: bash
@@ -346,8 +346,8 @@ should run by setting the environment variable ``CUDA_VISIBLE_DEVICES``
 to the right index (usually ``0, 1, 2``) or leaving it explicitly empty
 to avoid running on GPU: ``export CUDA_VISIBLE_DEVICES=""``
 
-Note that in order to run the replicas in parallel using the GPUs of an Apple Silicon computer (like M1 Mac), it is necessary to also install 
-the following packages:
+Note that in order to run the replicas in parallel using the GPUs of an Apple Silicon computer (like M1 Mac), it is necessary to also install
+extra packages. At the time of writing, this worked with ``tensorflow`` 2.13.
 
 .. code-block:: bash
 

diff --git a/doc/sphinx/source/tutorials/run-fit.rst b/doc/sphinx/source/tutorials/run-fit.rst
@@ -51,7 +51,7 @@ example of the ``parameter`` dictionary that defines the Machine Learning framew
     dropout: 0.0
   ...
 
-The runcard system is designed such that the user can utilize the program 
+The runcard system is designed such that the user can utilize the program
 without having to tinker with the codebase.
 One can simply modify the options in ``parameters`` to specify the
 desired architecture of the Neural Network as well as the settings for the optimization algorithm.
@@ -164,7 +164,7 @@ folder, which contains a number of files:
 - ``runcard.exportgrid``: a file containing the PDF grid.
 - ``runcard.json``: Includes information about the fit (metadata, parameters, times) in json format.
 
-.. note:: 
+.. note::
 
   The reported χ² refers always to the actual χ², i.e., without positivity loss or other penalty terms.
 
@@ -184,25 +184,26 @@ After obtaining the fit you can proceed with the fit upload and analisis by:
 
 Performance of the fit
 ----------------------
-The ``n3fit`` framework is currently based on `Tensorflow <https://www.tensorflow.org/>`_ and as such, to
-first approximation, anything that makes Tensorflow faster will also make ``n3fit`` faster.
-
-.. note:: 
-
-  Tensorflow only supports the installation via pip. Note, however, that the TensorFlow 
-  pip package has been known to break third party packages. Install it at your own risk. 
-  Only the conda tensorflow-eigen package is tested by our CI systems.
-
-When you install the nnpdf conda package, you get the 
-`tensorflow-eigen <https://anaconda.org/anaconda/tensorflow-eigen>`_ package, 
-which is not the default. This is due to a memory explosion found in some of 
+The ``n3fit`` framework is currently based on `Keras <https://keras.io/>`_
+and it is tested to run with the `Tensorflow <https://www.tensorflow.org/>`_
+and `pytorch <https://pytorch.org>`_ backends.
+This also means that anything that makes either of these packages faster will also
+make ``n3fit`` faster.
+Note that at the time of writing, ``TensorFlow`` is approximately 4 times faster than ``pytorch``.
+
+The default backend for ``keras`` is ``tensorflow``.
+In order to change the backend, the environment variable ``KERAS_BACKEND`` needs to be set (e.g., ``KERAS_BACKEND=torch``).
+
+The best results are obtained with ``tensorflow[and-cuda]`` installed from pip.
+When you install the nnpdf conda package, you get the
+`tensorflow-eigen <https://anaconda.org/anaconda/tensorflow-eigen>`_ package,
+which is not the default. This is due to a memory explosion found in some of
 the conda mkl builds.
 
-If you want to disable MKL without installing ``tensorflow-eigen`` you can always 
+If you want to disable MKL without installing ``tensorflow-eigen`` you can always
 set the environment variable ``TF_DISABLE_MKL=1`` before running ``n3fit``.
 When running ``n3fit`` all versions of the package show similar performance.
 
-
 When using the MKL version of tensorflow you gain more control of the way Tensorflow will use
 the multithreading capabilities of the machine by using the following environment variables:
 
@@ -214,7 +215,7 @@ the multithreading capabilities of the machine by using the following environmen
 These are the best values found for ``n3fit`` when using the mkl version of Tensorflow from conda
 and were found for TF 2.1 as the default values were suboptimal.
 For a more detailed explanation on the effects of ``KMP_AFFINITY`` on the performance of
-the code please see 
+the code please see
 `here <https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html>`_.
 
 By default, ``n3fit`` will try to use as many cores as possible, but this behaviour can be overriden
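
Review note on the backend selection described in the ``run-fit.rst`` changes above: the ``KERAS_BACKEND`` variable has to be set before ``keras`` is imported for the first time, since Keras does not allow switching backends afterwards. A minimal sketch of doing this from Python rather than from the shell follows. The helper ``select_keras_backend`` and the ``KNOWN_BACKENDS`` tuple are hypothetical names introduced for illustration and are not part of ``n3fit``; only the ``tensorflow`` and ``torch`` backends are reported as tested in this PR.

```python
import os
import sys

# Backend names understood by Keras 3 at the time of writing (assumption:
# later releases may accept more). This PR only reports tensorflow and torch
# as tested with n3fit.
KNOWN_BACKENDS = ("tensorflow", "torch", "jax", "numpy")


def select_keras_backend(name: str) -> str:
    """Hypothetical helper: set KERAS_BACKEND before keras is first imported."""
    if name not in KNOWN_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}, expected one of {KNOWN_BACKENDS}")
    if "keras" in sys.modules:
        # Keras reads the variable at import time, so it is too late to change it
        raise RuntimeError("keras is already imported; set KERAS_BACKEND earlier")
    os.environ["KERAS_BACKEND"] = name
    return name


if __name__ == "__main__":
    select_keras_backend("torch")
    print(os.environ["KERAS_BACKEND"])  # torch
```

Calling the helper at the very top of a script, before any ``keras`` or ``n3fit`` import, is meant to be equivalent to running ``KERAS_BACKEND=torch n3fit ...`` as in the new CI workflow.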

diff --git a/n3fit/src/n3fit/backends/keras_backend/MetaModel.py b/n3fit/src/n3fit/backends/keras_backend/MetaModel.py
@@ -8,27 +8,12 @@
 from pathlib import Path
 import re
 
+from keras import Variable
+from keras import optimizers as Kopt
+from keras.models import Model
 import numpy as np
-import tensorflow as tf
-from tensorflow.keras import optimizers as Kopt
-from tensorflow.keras.models import Model
-from tensorflow.python.keras.utils import tf_utils  # pylint: disable=no-name-in-module
-
-import n3fit.backends.keras_backend.operations as op
-
-# We need a function to transform tensors to numpy/python primitives
-# which is not part of the official TF interface and can change with the version
-if hasattr(tf_utils, "to_numpy_or_python_type"):
-    _to_numpy_or_python_type = tf_utils.to_numpy_or_python_type
-elif hasattr(tf_utils, "sync_to_numpy_or_python_type"):  # from TF 2.5
-    _to_numpy_or_python_type = tf_utils.sync_to_numpy_or_python_type
-else:  # in case of disaster
-    _to_numpy_or_python_type = lambda ret: {k: i.numpy() for k, i in ret.items()}
-
-# Starting with TF 2.16, a memory leak in TF https://github.com/tensorflow/tensorflow/issues/64170
-# makes jit compilation unusable in GPU.
-# Before TF 2.16 it was set to `False` by default. From 2.16 onwards, it is set to `True`
-JIT_COMPILE = False
+
+from . import operations as ops
 
 # Define in this dictionary new optimizers as well as the arguments they accept
 # (with default values if needed be)
@@ -55,7 +40,7 @@
 def _default_loss(y_true, y_pred):  # pylint: disable=unused-argument
     """Default loss to be used when the model is compiled with loss = Null
     (for instance if the prediction of the model is already the loss"""
-    return op.sum(y_pred)
+    return ops.sum(y_pred)
 
 
 class MetaModel(Model):
@@ -108,7 +93,7 @@ def __init__(self, input_tensors, output_tensors, scaler=None, input_values=None
             if k in input_values:
                 x_in[k] = input_values[k]
             elif hasattr(v, "tensor_content"):
-                x_in[k] = op.numpy_to_tensor(v.tensor_content)
+                x_in[k] = ops.numpy_to_tensor(v.tensor_content)
             else:
                 self.required_slots.add(k)
         super().__init__(input_tensors, output_tensors, **kwargs)
@@ -121,7 +106,6 @@ def __init__(self, input_tensors, output_tensors, scaler=None, input_values=None
         self.compute_losses_function = None
         self._scaler = scaler
 
-    @tf.autograph.experimental.do_not_convert
     def _parse_input(self, extra_input=None):
         """Returns the input data the model was compiled with.
         Introduces the extra_input in the places asigned to the placeholders.
@@ -173,8 +157,8 @@ def perform_fit(self, x=None, y=None, epochs=1, **kwargs):
         steps_per_epoch = self._determine_steps_per_epoch(epochs)
 
         for k, v in x_params.items():
-            x_params[k] = tf.repeat(v, steps_per_epoch, axis=0)
-        y = [tf.repeat(yi, steps_per_epoch, axis=0) for yi in y]
+            x_params[k] = ops.repeat(v, steps_per_epoch, axis=0)
+        y = [ops.repeat(yi, steps_per_epoch, axis=0) for yi in y]
 
         history = super().fit(
             x=x_params, y=y, epochs=epochs // steps_per_epoch, batch_size=1, **kwargs
@@ -228,13 +212,13 @@ def compute_losses(self):
                 inputs[k] = v[:1]
 
             # Compile a evaluation function
-            @tf.function
+            @ops.decorator_compiler
             def losses_fun():
                 predictions = self(inputs)
                 # If we only have one dataset the output changes
                 if len(out_names) == 2:
                     predictions = [predictions]
-                total_loss = tf.reduce_sum(predictions, axis=0)
+                total_loss = ops.sum(predictions, axis=0)
                 ret = [total_loss] + predictions
                 return dict(zip(out_names, ret))
 
@@ -244,7 +228,7 @@ def losses_fun():
 
         # The output of this function is to be used by python (and numpy)
         # so we need to convert the tensors
-        return _to_numpy_or_python_type(ret)
+        return ops.dict_to_numpy_or_python(ret)
 
     def compile(
         self,
@@ -305,13 +289,16 @@ def compile(
 
         # If given target output is None, target_output is unnecesary, save just a zero per output
         if target_output is None:
-            self.target_tensors = [op.numpy_to_tensor(np.zeros((1, 1))) for i in self.output_shape]
+            self.target_tensors = [ops.numpy_to_tensor(np.zeros((1, 1))) for _ in self.output_shape]
         else:
             if not isinstance(target_output, list):
                 target_output = [target_output]
             self.target_tensors = target_output
 
-        super().compile(optimizer=opt, loss=loss, jit_compile=JIT_COMPILE)
+        # For debug purposes it may be interesting to set in the compile call
+        # jit_compile = False
+        # run_eagerly = True
+        super().compile(optimizer=opt, loss=loss)
 
     def set_masks_to(self, names, val=0.0):
         """Set all mask value to the selected value
@@ -509,9 +496,9 @@ def get_layer_replica_weights(layer, i_replica: int):
     """
     if is_stacked_single_replicas(layer):
         weights_ref = layer.get_layer(f"{NN_PREFIX}_{i_replica}").weights
-        weights = [tf.Variable(w, name=w.name) for w in weights_ref]
+        weights = [Variable(w, name=w.name) for w in weights_ref]
     else:
-        weights = [tf.Variable(w[i_replica : i_replica + 1], name=w.name) for w in layer.weights]
+        weights = [Variable(w[i_replica : i_replica + 1], name=w.name) for w in layer.weights]
 
     return weights
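
Review note on the ``perform_fit`` hunk above: the inputs are repeated ``steps_per_epoch`` times along the first axis and Keras is then asked for ``epochs // steps_per_epoch`` epochs with ``batch_size=1``. The bookkeeping can be sketched in plain Python as below. This is an illustration only: ``repeat_rows`` and ``plan_training`` are hypothetical names, lists stand in for tensors, and the inputs are assumed to have a leading dimension of 1.

```python
def repeat_rows(rows, repeats):
    """Pure-Python stand-in for repeating along axis 0: [a, b] -> [a, a, b, b]."""
    return [row for row in rows for _ in range(repeats)]


def plan_training(x_rows, epochs, steps_per_epoch):
    """Return (repeated inputs, epochs passed to Keras, total optimizer steps).

    Assumes batch_size=1, so every repeated row corresponds to one optimizer step.
    """
    repeated = repeat_rows(x_rows, steps_per_epoch)
    keras_epochs = epochs // steps_per_epoch
    total_steps = keras_epochs * len(repeated)
    return repeated, keras_epochs, total_steps


if __name__ == "__main__":
    # A single input "sample" (leading dimension 1), as assumed in this sketch
    x = [[0.1, 0.5, 0.9]]
    repeated, keras_epochs, total_steps = plan_training(x, epochs=1000, steps_per_epoch=100)
    print(len(repeated), keras_epochs, total_steps)  # 100 10 1000
```

The total number of optimizer steps equals the requested number of epochs whenever ``steps_per_epoch`` divides ``epochs``; otherwise the floor division drops the remainder (for example, 1050 requested epochs with 100 steps per epoch would give 1000 steps in this sketch).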