diff --git a/.github/workflows/config/spelling_allowlist.txt b/.github/workflows/config/spelling_allowlist.txt
index 20bb2315a9..b398ad6d15 100644
--- a/.github/workflows/config/spelling_allowlist.txt
+++ b/.github/workflows/config/spelling_allowlist.txt
@@ -84,11 +84,13 @@ POSIX
 PSIRT
 Pauli
 Paulis
+Photonic
 Photonics
 PyPI
 Pygments
 QAOA
 QCaaS
+QEC
 QIR
 QIS
 QPP
@@ -109,6 +111,7 @@ SLED
 SLES
 SLURM
 SVD
+Sqale
 Stim
 Superpositions
 Superstaq
@@ -260,6 +263,7 @@ parallelizing
 parameterization
 performant
 photonic
+photonics
 precompute
 precomputed
 prepend
diff --git a/docs/sphinx/applications/python/vqe_advanced.ipynb b/docs/sphinx/applications/python/vqe_advanced.ipynb
index 3135b93829..aa2b19446a 100644
--- a/docs/sphinx/applications/python/vqe_advanced.ipynb
+++ b/docs/sphinx/applications/python/vqe_advanced.ipynb
@@ -480,7 +480,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Now, run the code again (the three previous cells) and specify `num_qpus` to be more than one if you have access to multiple GPUs and notice resulting speedup. Thanks to CUDA-Q, this code could be used without modification in a setting where multiple physical QPUs were availible."
+ "Now, run the code again (the three previous cells) and specify `num_qpus` to be more than one if you have access to multiple GPUs and notice the resulting speedup. Thanks to CUDA-Q, this code could be used without modification in a setting where multiple physical QPUs were available."
 ]
 },
 {
diff --git a/docs/sphinx/examples/python/executing_photonic_kernels.ipynb b/docs/sphinx/examples/python/executing_photonic_kernels.ipynb
deleted file mode 100644
index 62e8901c98..0000000000
--- a/docs/sphinx/examples/python/executing_photonic_kernels.ipynb
+++ /dev/null
@@ -1,171 +0,0 @@
-{
- "cells": [
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Executing Quantum Photonic Circuits \n",
- "\n",
- "In CUDA-Q, there are 2 ways in which one can execute quantum photonic kernels: \n",
- "\n",
- "1. `sample`: yields measurement counts \n",
- "3. `get_state`: yields the quantum statevector of the computation \n",
- "\n",
- "## Sample\n",
- "\n",
- "Quantum states collapse upon measurement and hence need to be sampled many times to gather statistics. The CUDA-Q `sample` call enables this: \n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import cudaq\n",
- "import numpy as np\n",
- "\n",
- "qumode_count = 2\n",
- "\n",
- "# Define the simulation target.\n",
- "cudaq.set_target(\"orca-photonics\")\n",
- "\n",
- "# Define a quantum kernel function.\n",
- "\n",
- "\n",
- "@cudaq.kernel\n",
- "def kernel(qumode_count: int):\n",
- "    level = qumode_count + 1\n",
- "    qumodes = [qudit(level) for _ in range(qumode_count)]\n",
- "\n",
- "    # Apply the create gate to the qumodes.\n",
- "    for i in range(qumode_count):\n",
- "        create(qumodes[i])  # |00⟩ -> |11⟩\n",
- "\n",
- "    # Apply the beam_splitter gate to the qumodes.\n",
- "    beam_splitter(qumodes[0], qumodes[1], np.pi / 6)\n",
- "\n",
- "    # measure all qumodes\n",
- "    mz(qumodes)\n",
- "\n",
- "\n",
- "result = cudaq.sample(kernel, qumode_count, shots_count=1000)\n",
- "\n",
- "print(result)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "## Get state\n",
- "\n",
- "The `get_state` function gives us access to the quantum statevector of the computation."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import cudaq\n", - "import numpy as np\n", - "\n", - "qumode_count = 2\n", - "\n", - "# Define the simulation target.\n", - "cudaq.set_target(\"orca-photonics\")\n", - "\n", - "# Define a quantum kernel function.\n", - "\n", - "\n", - "@cudaq.kernel\n", - "def kernel(qumode_count: int):\n", - " level = qumode_count + 1\n", - " qumodes = [qudit(level) for _ in range(qumode_count)]\n", - "\n", - " # Apply the create gate to the qumodes.\n", - " for i in range(qumode_count):\n", - " create(qumodes[i]) # |00⟩ -> |11⟩\n", - "\n", - " # Apply the beam_splitter gate to the qumodes.\n", - " beam_splitter(qumodes[0], qumodes[1], np.pi / 6)\n", - "\n", - " # measure some of all qumodes if need to be measured\n", - " # mz(qumodes)\n", - "\n", - "\n", - "# Compute the statevector of the kernel\n", - "result = cudaq.get_state(kernel, qumode_count)\n", - "\n", - "print(np.array(result))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The statevector generated by the `get_state` command follows little-endian convention for associating numbers with their digit string representations, which places the least significant digit on the right. That is, for the example of a 2-qumode system of level 3 (in which possible states are 0, 1, and 2), we have the following translation between integers and digit string:\n", - "$$\\begin{matrix} \n", - "\\text{Integer} & \\text{digit string representation}\\\\\n", - "& \\text{least significant bit on right}\\\\\n", - "0 = \\textcolor{blue}{0}*3^1 + \\textcolor{red}{0}*3^0 & \\textcolor{blue}{0}\\textcolor{red}{0} \\\\\n", - "1 = \\textcolor{blue}{0}*3^1 + \\textcolor{red}{1}*3^0 & \\textcolor{blue}{0}\\textcolor{red}{1}\\\\\n", - "2 = \\textcolor{blue}{0}*3^1 + \\textcolor{red}{2}*3^0 & \\textcolor{blue}{0}\\textcolor{red}{2}\\\\\n", - "3 = \\textcolor{blue}{1}*3^1 + \\textcolor{red}{0}*3^0 & \\textcolor{blue}{1}\\textcolor{red}{0} \\\\\n", - "4 = \\textcolor{blue}{1}*3^1 + \\textcolor{red}{1}*3^0 & \\textcolor{blue}{1}\\textcolor{red}{1} \\\\\n", - "5 = \\textcolor{blue}{1}*3^1 + \\textcolor{red}{2}*3^0 & \\textcolor{blue}{1}\\textcolor{red}{2} \\\\\n", - "6 = \\textcolor{blue}{2}*3^1 + \\textcolor{red}{0}*3^0 & \\textcolor{blue}{2}\\textcolor{red}{0} \\\\\n", - "7 = \\textcolor{blue}{2}*3^1 + \\textcolor{red}{1}*3^0 & \\textcolor{blue}{2}\\textcolor{red}{1} \\\\\n", - "8 = \\textcolor{blue}{2}*3^1 + \\textcolor{red}{2}*3^0 & \\textcolor{blue}{2}\\textcolor{red}{2} \n", - "\\end{matrix}\n", - "$$\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "## Parallelization Techniques\n", - "\n", - "The most intensive task in the computation is the execution of the quantum photonic kernel hence each execution function: `sample`, and `get_state` can be parallelized given access to multiple quantum processing units (multi-QPU). We emulate each QPU with a CPU." 
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(cudaq.__version__)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.12"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/sphinx/examples/python/measuring_kernels.ipynb b/docs/sphinx/examples/python/measuring_kernels.ipynb
index 40fa54bff8..d5560dfa0d 100644
--- a/docs/sphinx/examples/python/measuring_kernels.ipynb
+++ b/docs/sphinx/examples/python/measuring_kernels.ipynb
@@ -69,32 +69,59 @@
 "id": "fb5dd767-5db7-4847-b04e-ae5695066800",
 "metadata": {},
 "source": [
- "### Midcircuit Measurement and Conditional Logic\n",
+ "### Mid-circuit Measurement and Conditional Logic\n",
 "\n",
- "In certain cases, it it is helpful for some operations in a quantum kernel to depend on measurement results following previous operations. This is accomplished in the following example by performing a Hadamard on qubit 0, then measuring qubit 0 and savig the result as `b0`. Then, an if statement performs a Hadamard on qubit 1 only if `b0` is 1. Measuring this qubit 1 verifies this process as a 1 is the result 25% of the time."
+ "In certain cases, it is helpful for some operations in a quantum kernel to depend on measurement results following previous operations. This is accomplished in the following example by performing a Hadamard on qubit 0, then measuring qubit 0 and saving the result as `b0`. Then, qubit 0 can be reset and used later in the computation. In this case it is flipped to a 1. Finally, an if statement performs a Hadamard on qubit 1 if `b0` is 1. \n",
+ "\n",
+ "The results show qubit 0 is 1, indicating the reset worked, and qubit 1 has a 75/25 distribution, demonstrating the mid-circuit measurement worked as expected."
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 6,
 "id": "44001a51-3733-472c-8bc1-ee694e957708",
 "metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{ \n",
+ "  __global__ : { 10:728 11:272 }\n",
+ "   b0 : { 0:505 1:495 }\n",
+ "}\n",
+ "\n"
+ ]
+ }
+ ],
 "source": [
 "@cudaq.kernel\n",
 "def kernel():\n",
 "    q = cudaq.qvector(2)\n",
+ "    \n",
 "    h(q[0])\n",
 "    b0 = mz(q[0])\n",
+ "    reset(q[0])\n",
+ "    x(q[0])\n",
+ "    \n",
 "    if b0:\n",
- "        h(q[1])\n",
- "    mz(q[1])"
+ "        h(q[1]) \n",
+ "\n",
+ "print(cudaq.sample(kernel))"
 ]
 },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d525be71-a745-43a5-a7ca-a2720c536f8c",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
 ],
 "metadata": {
 "kernelspec": {
- "display_name": "Python 3",
+ "display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
diff --git a/docs/sphinx/index.rst b/docs/sphinx/index.rst
index 2b6837424c..71de881a2d 100644
--- a/docs/sphinx/index.rst
+++ b/docs/sphinx/index.rst
@@ -31,4 +31,4 @@ You are browsing the documentation for |version| version of CUDA-Q. You can find Other Versions

 .. |---| unicode:: U+2014 .. 
EM DASH - :trim: \ No newline at end of file + :trim: diff --git a/docs/sphinx/releases.rst b/docs/sphinx/releases.rst index 8e455ff9dd..e900454256 100644 --- a/docs/sphinx/releases.rst +++ b/docs/sphinx/releases.rst @@ -87,7 +87,7 @@ The full change log can be found `here `, +The 0.7.0 release adds support for using :doc:`NVIDIA Quantum Cloud `, giving you access to our most powerful GPU-accelerated simulators even if you don't have an NVIDIA GPU. With 0.7.0, we have furthermore greatly increased expressiveness of the Python and C++ language frontends. Check out our `documentation `__ diff --git a/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py b/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py index de5db921dd..499775c32c 100644 --- a/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py +++ b/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py @@ -16,59 +16,67 @@ exit(0) np.random.seed(1) -cudaq.set_target("nvidia", option="mqpu") +cudaq.set_target("nvidia") qubit_count = 5 sample_count = 10000 h = spin.z(0) parameter_count = qubit_count -# Below we run a circuit for 10000 different input parameters. +# prepare 10000 different input parameter sets. parameters = np.random.default_rng(13).uniform(low=0, high=1, size=(sample_count, parameter_count)) -kernel, params = cudaq.make_kernel(list) -qubits = kernel.qalloc(qubit_count) -qubits_list = list(range(qubit_count)) +@cudaq.kernel +def kernel(params: list[float]): + + qubits = cudaq.qvector(5) + + for i in range(5): + rx(params[i], qubits[i]) + -for i in range(qubit_count): - kernel.rx(params[i], qubits[i]) # [End prepare] # [Begin single] -import timeit +import time + +start_time = time.time() +cudaq.observe(kernel, h, parameters) +end_time = time.time() +print(end_time - start_time) -timeit.timeit(lambda: cudaq.observe(kernel, h, parameters), - number=1) # Single GPU result. # [End single] # [Begin split] -print('We have', parameters.shape[0], - 'parameters which we would like to execute') +print('There are', parameters.shape[0], 'parameter sets to execute') xi = np.split( parameters, - 4) # We split our parameters into 4 arrays since we have 4 GPUs available. + 4) # Split the parameters into 4 arrays since 4 GPUs are available. -print('We split this into', len(xi), 'batches of', xi[0].shape[0], ',', +print('Split parameters into', len(xi), 'batches of', xi[0].shape[0], ',', xi[1].shape[0], ',', xi[2].shape[0], ',', xi[3].shape[0]) # [End split] # [Begin multiple] # Timing the execution on a single GPU vs 4 GPUs, -# one will see a 4x performance improvement if 4 GPUs are available. +# one will see a nearly 4x performance improvement if 4 GPUs are available. 
+cudaq.set_target("nvidia", option="mqpu") asyncresults = [] num_gpus = cudaq.num_available_gpus() +start_time = time.time() for i in range(len(xi)): for j in range(xi[i].shape[0]): qpu_id = i * num_gpus // len(xi) asyncresults.append( cudaq.observe_async(kernel, h, xi[i][j, :], qpu_id=qpu_id)) - result = [res.get() for res in asyncresults] +end_time = time.time() +print(end_time - start_time) # [End multiple] diff --git a/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/hamiltonian_batching.py b/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/hamiltonian_batching.py index 18878b318a..57b142b300 100644 --- a/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/hamiltonian_batching.py +++ b/docs/sphinx/snippets/python/using/examples/multi_gpu_workflows/hamiltonian_batching.py @@ -20,29 +20,40 @@ qubit_count = 15 term_count = 100000 -kernel = cudaq.make_kernel() -qubits = kernel.qalloc(qubit_count) -kernel.h(qubits[0]) -for i in range(1, qubit_count): - kernel.cx(qubits[0], qubits[i]) -# We create a random Hamiltonian +@cudaq.kernel +def kernel(n_qubits: int): + + qubits = cudaq.qvector(n_qubits) + + h(qubits[0]) + for i in range(1, n_qubits): + x.ctrl(qubits[0], qubits[i]) + + +# Create a random Hamiltonian hamiltonian = cudaq.SpinOperator.random(qubit_count, term_count) -# The observe calls allows us to calculate the expectation value of the Hamiltonian with respect to a specified kernel. +# The observe calls allows calculation of the the expectation value of the Hamiltonian with respect to a specified kernel. # Single node, single GPU. -result = cudaq.observe(kernel, hamiltonian) +result = cudaq.observe(kernel, hamiltonian, qubit_count) result.expectation() -# If we have multiple GPUs/ QPUs available, we can parallelize the workflow with the addition of an argument in the observe call. +# If multiple GPUs/ QPUs are available, the computation can parallelize with the addition of an argument in the observe call. # Single node, multi-GPU. -result = cudaq.observe(kernel, hamiltonian, execution=cudaq.parallel.thread) +result = cudaq.observe(kernel, + hamiltonian, + qubit_count, + execution=cudaq.parallel.thread) result.expectation() # Multi-node, multi-GPU. -result = cudaq.observe(kernel, hamiltonian, execution=cudaq.parallel.mpi) +result = cudaq.observe(kernel, + hamiltonian, + qubit_count, + execution=cudaq.parallel.mpi) result.expectation() cudaq.mpi.finalize() diff --git a/docs/sphinx/using/backends/backends.png b/docs/sphinx/using/backends/backends.png new file mode 100644 index 0000000000..dcddd84402 Binary files /dev/null and b/docs/sphinx/using/backends/backends.png differ diff --git a/docs/sphinx/using/backends/backends.rst b/docs/sphinx/using/backends/backends.rst index 9ee2bbbbbf..322d1c8f3b 100644 --- a/docs/sphinx/using/backends/backends.rst +++ b/docs/sphinx/using/backends/backends.rst @@ -1,41 +1,32 @@ +************************* CUDA-Q Backends -********************** +************************* +.. _backends: + + +The CUDA-Q platform has is a powerful tool with many different backends for running hybrid quantum applications and other simulations. This page will help you understand what backends are available and what the best choices are for your purpose. + +The figure below groups the backends into four categories, and described the general purpose for each. See the following sections for a breakdown of the backends included in each section. + +.. 
image:: backends.png
+   :width: 1000
+
+Click on the links below to learn more about each category and the backends it contains. The list below covers all of the backends available in CUDA-Q.
+
+.. toctree::
+   :maxdepth: 3
+
+   Circuit Simulation 
+   Quantum Hardware (QPUs) 

 .. toctree::
-   :caption: Backend Targets
    :maxdepth: 1
+
+   Dynamics Simulation 
+
+.. toctree::
+   :maxdepth: 2
+
+   Cloud 
+

-   Simulation 
-   Quantum Hardware 
-   NVIDIA Quantum Cloud 
-   Multi-Processor Platforms 
-
-**The following is a comprehensive list of the available targets in CUDA-Q:**
-
-* :ref:`anyon `
-* :ref:`braket `
-* :ref:`density-matrix-cpu `
-* :ref:`fermioniq `
-* :ref:`infleqtion `
-* :ref:`ionq `
-* :ref:`iqm `
-* :ref:`nvidia `
-* :ref:`nvidia-fp64 `
-* :ref:`nvidia-mgpu `
-* :ref:`nvidia-mqpu `
-* :ref:`nvidia-mqpu-fp64 `
-* :doc:`nvqc `
-* :ref:`oqc `
-* :ref:`orca `
-* :ref:`qpp-cpu `
-* :ref:`quantinuum `
-* :ref:`quera `
-* :ref:`remote-mqpu `
-* :ref:`stim `
-* :ref:`tensornet `
-* :ref:`tensornet-mps `
-
-.. deprecated:: 0.8
-   The `nvidia-fp64`, `nvidia-mgpu`, `nvidia-mqpu`, and `nvidia-mqpu-fp64` targets can be
-   enabled as extensions of the unified `nvidia` target (see `nvidia` :ref:`target documentation `).
-   These target names might be removed in a future release.
\ No newline at end of file
diff --git a/docs/sphinx/using/backends/circuitsimulators.png b/docs/sphinx/using/backends/circuitsimulators.png
new file mode 100644
index 0000000000..51c9387ce4
Binary files /dev/null and b/docs/sphinx/using/backends/circuitsimulators.png differ
diff --git a/docs/sphinx/using/backends/cloud.rst b/docs/sphinx/using/backends/cloud.rst
new file mode 100644
index 0000000000..20ef1d7411
--- /dev/null
+++ b/docs/sphinx/using/backends/cloud.rst
@@ -0,0 +1,11 @@
+CUDA-Q Cloud Backends
+***************************************
+CUDA-Q provides a number of options to access hardware resources (GPUs and QPUs) through the cloud. Such options provide users with more flexible access to simulation and hardware resources. See the links below for more information on running CUDA-Q with cloud resources.
+
+
+.. toctree::
+   :maxdepth: 1
+
+   Amazon Braket (braket) 
+   NVIDIA Quantum Cloud (nvqc) 
+
diff --git a/docs/sphinx/using/backends/cloud/braket.rst b/docs/sphinx/using/backends/cloud/braket.rst
new file mode 100644
index 0000000000..cb39ae73e8
--- /dev/null
+++ b/docs/sphinx/using/backends/cloud/braket.rst
@@ -0,0 +1,105 @@
+Amazon Braket
+++++++++++++++
+
+.. _braket-backend:
+
+`Amazon Braket `__ is a fully managed AWS
+service which provides Jupyter notebook environments, high-performance quantum
+circuit simulators, and secure, on-demand access to various quantum computers.
+To get started, users must enable Amazon Braket in their AWS account by following
+`these instructions `__.
+To learn more about Amazon Braket, you can view the `Amazon Braket Documentation `__
+and `Amazon Braket Examples `__.
+A list of available devices and regions can be found `here `__.
+
+Users can run CUDA-Q programs on Amazon Braket with `Hybrid Jobs `__.
+See `this guide `__ to get started.
+
+Setting Credentials
+```````````````````
+
+After enabling Amazon Braket in AWS, set credentials using any of the documented `methods `__.
+One of the simplest ways is to use `AWS CLI `__.
+
+.. code:: bash
+
+   aws configure
+
+Alternatively, users can set the following environment variables.
+
+.. 
code:: bash
+
+   export AWS_DEFAULT_REGION=""
+   export AWS_ACCESS_KEY_ID=""
+   export AWS_SECRET_ACCESS_KEY=""
+   export AWS_SESSION_TOKEN=""
+
+Submission from C++
+`````````````````````````
+
+To target quantum kernel code for execution in Amazon Braket,
+pass the flag ``--target braket`` to the ``nvq++`` compiler.
+By default, jobs are submitted to the state vector simulator, `SV1`.
+
+.. code:: bash
+
+   nvq++ --target braket src.cpp
+
+To execute your kernels on a different device, pass the ``--braket-machine`` flag to the ``nvq++`` compiler
+to specify which machine to submit quantum kernels to:
+
+.. code:: bash
+
+   nvq++ --target braket --braket-machine "arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet" src.cpp ...
+
+where ``arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet`` refers to IQM Garnet QPU.
+
+To emulate the device locally, without submitting through the cloud,
+you can also pass the ``--emulate`` flag to ``nvq++``.
+
+.. code:: bash
+
+   nvq++ --emulate --target braket src.cpp
+
+To see a complete example for using Amazon Braket backends, take a look at our :ref:`C++ examples `.
+
+Submission from Python
+`````````````````````````
+
+The target to which quantum kernels are submitted
+can be controlled with the ``cudaq.set_target()`` function.
+
+.. code:: python
+
+   cudaq.set_target("braket")
+
+By default, jobs are submitted to the state vector simulator, `SV1`.
+
+To specify which Amazon Braket device to use, set the :code:`machine` parameter.
+
+.. code:: python
+
+   device_arn = "arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet"
+   cudaq.set_target("braket", machine=device_arn)
+
+where ``arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet`` refers to IQM Garnet QPU.
+
+To emulate the device locally, without submitting through the cloud,
+you can also set the ``emulate`` flag to ``True``.
+
+.. code:: python
+
+   cudaq.set_target("braket", emulate=True)
+
+The number of shots for a kernel execution can be set through the ``shots_count``
+argument to ``cudaq.sample``. By default, the ``shots_count`` is set to 1000.
+
+.. code:: python
+
+   cudaq.sample(kernel, shots_count=100)
+
+To see a complete example for using Amazon Braket backends, take a look at our :ref:`Python examples `.
+
+.. note::
+
+   The ``cudaq.observe`` API is not yet supported on the `braket` target.
diff --git a/docs/sphinx/using/backends/nvqc.rst b/docs/sphinx/using/backends/cloud/nvqc.rst
similarity index 93%
rename from docs/sphinx/using/backends/nvqc.rst
rename to docs/sphinx/using/backends/cloud/nvqc.rst
index 7586d08719..ba69faef43 100644
--- a/docs/sphinx/using/backends/nvqc.rst
+++ b/docs/sphinx/using/backends/cloud/nvqc.rst
@@ -1,5 +1,6 @@
 NVIDIA Quantum Cloud
----------------------
++++++++++++++++++++++
+
 NVIDIA Quantum Cloud (NVQC) offers universal access to the world’s most powerful computing platform, for every quantum researcher to do their life’s work.
 To learn more about NVQC visit this `link `__.
@@ -8,7 +9,7 @@ Apply for early access `here `).
 You may use the NVQC `backend` (Python) or `--nvqc-backend` (C++) option to select the simulator to be used by the service.
 For example, to request the `tensornet` simulator backend, the user can do the following for C++ or Python.
@@ -101,9 +102,9 @@ For example, to request the `tensornet` simulator backend, the user can do the f
 By default, the single-GPU single-precision `custatevec-fp32` simulator backend will be selected if backend information is not specified. 
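For reference, the backend selection described above looks like this in practice. A minimal sketch, assuming an NVQC API key is already set in the environment; the Bell-pair kernel is illustrative and not part of this diff:

.. code:: python

   import cudaq

   # Request the `tensornet` simulator on the NVQC managed service.
   cudaq.set_target("nvqc", backend="tensornet")

   @cudaq.kernel
   def bell():
       qubits = cudaq.qvector(2)
       h(qubits[0])
       x.ctrl(qubits[0], qubits[1])
       mz(qubits)

   print(cudaq.sample(bell))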
Multiple GPUs -+++++++++++++ +^^^^^^^^^^^^^^ -Some CUDA-Q simulator backends are capable of multi-GPU distribution as detailed in :doc:`simulators`. +Some CUDA-Q simulator backends are capable of multi-GPU distribution as detailed in :ref:`simulators `. For example, the `nvidia-mgpu` backend can partition and distribute state vector simulation to multiple GPUs to simulate a larger number of qubits, whose state vector size grows beyond the memory size of a single GPU. @@ -190,7 +191,7 @@ To select a specific number of GPUs on the NVQC managed service, the following ` Multiple QPUs Asynchronous Execution -+++++++++++++++++++++++++++++++++++++ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ NVQC provides scalable QPU virtualization services, whereby clients can submit asynchronous jobs simultaneously to NVQC. These jobs are @@ -202,13 +203,13 @@ calculating the expectation value along with parameter-shift gradients simultane .. tab:: Python - .. literalinclude:: ../../snippets/python/using/cudaq/nvqc/nvqc_mqpu.py + .. literalinclude:: ../../../snippets/python/using/cudaq/nvqc/nvqc_mqpu.py :language: python :start-after: [Begin Documentation] .. tab:: C++ - .. literalinclude:: ../../snippets/cpp/using/cudaq/nvqc/nvqc_mqpu.cpp + .. literalinclude:: ../../../snippets/cpp/using/cudaq/nvqc/nvqc_mqpu.cpp :language: cpp :start-after: [Begin Documentation] @@ -230,7 +231,7 @@ calculating the expectation value along with parameter-shift gradients simultane multi-QPU distribution may not deliver any substantial speedup. FAQ -++++ +^^^^^ 1. How do I get more information about my NVQC API submission? diff --git a/docs/sphinx/using/backends/dynamics.rst b/docs/sphinx/using/backends/dynamics.rst index 6e6e48e567..d8b6819268 100644 --- a/docs/sphinx/using/backends/dynamics.rst +++ b/docs/sphinx/using/backends/dynamics.rst @@ -1,5 +1,5 @@ -CUDA-Q Dynamics -********************************* +Dynamics Simulation ++++++++++++++++++++++ CUDA-Q enables the design, simulation and execution of quantum dynamics via the ``evolve`` API. Specifically, this API allows us to solve the time evolution @@ -8,7 +8,7 @@ backend target, which is based on the cuQuantum library, optimized for performan on NVIDIA GPU. Quick Start -+++++++++++ +^^^^^^^^^^^^ In the example below, we demonstrate a simple time evolution simulation workflow comprising of the following steps: @@ -88,7 +88,7 @@ Examples that illustrate how to use the ``dynamics`` target are available in the `CUDA-Q repository `__. Operator -+++++++++++ +^^^^^^^^^^ .. _operators: @@ -159,7 +159,7 @@ The latter is specified by the dimension map that is provided to the `cudaq.evol Time-Dependent Dynamics -++++++++++++++++++++++++ +^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. _time_dependent: @@ -221,7 +221,7 @@ the desired value for each parameter: :end-before: [End Schedule2] Numerical Integrators -++++++++++++++++++++++ +^^^^^^^^^^^^^^^^^^^^^^^^ .. _integrators: @@ -276,7 +276,7 @@ backend target. using their Docker images. Multi-GPU Multi-Node Execution -+++++++++++++++++++++++++++++++ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. _cudensitymat_mgmn: @@ -316,3 +316,4 @@ Specifically, it will detect the number of processes (GPUs) and distribute the c - Computing the expectation value of a mixed quantum state is not supported. Thus, `collapse_operators` are not supported if expectation calculation is required. - Some combinations of quantum states and quantum many-body operators are not supported. Errors will be raised in those cases. 
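For reference, the asynchronous multi-QPU submission pattern described above reduces to the following minimal sketch. It assumes at least one GPU is visible to the `nvidia` target with the `mqpu` option (as in the circuit-batching snippet earlier in this diff); the single-`rx` kernel and parameter values are illustrative:

.. code:: python

   import cudaq
   from cudaq import spin

   cudaq.set_target("nvidia", option="mqpu")

   @cudaq.kernel
   def kernel(theta: float):
       qubits = cudaq.qvector(2)
       rx(theta, qubits[0])

   hamiltonian = spin.z(0)
   num_qpus = cudaq.num_available_gpus()

   # Fan jobs out across the virtual QPUs, then gather the results.
   futures = [
       cudaq.observe_async(kernel, hamiltonian, 0.1 * i, qpu_id=i % num_qpus)
       for i in range(4 * num_qpus)
   ]
   print([f.get().expectation() for f in futures])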
+ diff --git a/docs/sphinx/using/backends/hardware.rst b/docs/sphinx/using/backends/hardware.rst index 33f4907cec..21283665fc 100644 --- a/docs/sphinx/using/backends/hardware.rst +++ b/docs/sphinx/using/backends/hardware.rst @@ -1,903 +1,24 @@ -CUDA-Q Hardware Backends -********************************* - -CUDA-Q supports submission to a set of hardware providers. +Quantum Hardware (QPU) +****************************** +CUDA-Q supports submission to a set of hardware providers. To submit to a hardware backend, you need an account with the respective provider. - -Amazon Braket -================================== - -.. _braket-backend: - -`Amazon Braket `__ is a fully managed AWS -service which provides Jupyter notebook environments, high-performance quantum -circuit simulators, and secure, on-demand access to various quantum computers. -To get started users must enable Amazon Braket in their AWS account by following -`these instructions `__. -To learn more about Amazon Braket, you can view the `Amazon Braket Documentation `__ -and `Amazon Braket Examples `__. -A list of available devices and regions can be found `here `__. - -Users can run CUDA-Q programs on Amazon Braket with `Hybrid Job `__. -See `this guide `__ to get started. - -Setting Credentials -``````````````````` - -After enabling Amazon Braket in AWS, set credentials using any of the documented `methods `__. -One of the simplest ways is to use `AWS CLI `__. - -.. code:: bash - - aws configure - -Alternatively, users can set the following environment variables. - -.. code:: bash - - export AWS_DEFAULT_REGION="" - export AWS_ACCESS_KEY_ID="" - export AWS_SECRET_ACCESS_KEY="" - export AWS_SESSION_TOKEN="" - -Submission from C++ -````````````````````````` - -To target quantum kernel code for execution in Amazon Braket, -pass the flag ``--target braket`` to the ``nvq++`` compiler. -By default jobs are submitted to the state vector simulator, `SV1`. - -.. code:: bash - - nvq++ --target braket src.cpp - - -To execute your kernels on different device, pass the ``--braket-machine`` flag to the ``nvq++`` compiler -to specify which machine to submit quantum kernels to: - -.. code:: bash - - nvq++ --target braket --braket-machine "arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet" src.cpp ... - -where ``arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet`` refers to IQM Garnet QPU. - -To emulate the device locally, without submitting through the cloud, -you can also pass the ``--emulate`` flag to ``nvq++``. - -.. code:: bash - - nvq++ --emulate --target braket src.cpp - -To see a complete example for using Amazon Braket backends, take a look at our :doc:`C++ examples <../examples/examples>`. - -Submission from Python -````````````````````````` - -The target to which quantum kernels are submitted -can be controlled with the ``cudaq::set_target()`` function. - -.. code:: python - - cudaq.set_target("braket") - -By default, jobs are submitted to the state vector simulator, `SV1`. - -To specify which Amazon Braket device to use, set the :code:`machine` parameter. - -.. code:: python - - device_arn = "arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet" - cudaq.set_target("braket", machine=device_arn) - -where ``arn:aws:braket:eu-north-1::device/qpu/iqm/Garnet`` refers to IQM Garnet QPU. - -To emulate the device locally, without submitting through the cloud, -you can also set the ``emulate`` flag to ``True``. - -.. 
code:: python - - cudaq.set_target("braket", emulate=True) - -The number of shots for a kernel execution can be set through the ``shots_count`` -argument to ``cudaq.sample``. By default, the ``shots_count`` is set to 1000. - -.. code:: python - - cudaq.sample(kernel, shots_count=100) - -To see a complete example for using Amazon Braket backends, take a look at our :doc:`Python examples <../examples/examples>`. - -.. note:: - - The ``cudaq.observe`` API is not yet supported on the `braket` target. - -Infleqtion -================================== - -.. _infleqtion-backend: - -Infleqtion is a quantum hardware provider of gate-based neutral atom quantum computers. Their backends may be -accessed via `Superstaq `__, Infleqtion’s cross-platform software API -that performs low-level compilation and cross-layer optimization. To get started users can create a Superstaq -account by following `these instructions `__. - -For access to Infleqtion's neutral atom quantum computer, Sqale, -`pre-registration `__ is now open. - -Setting Credentials -````````````````````````` - -Programmers of CUDA-Q may access Infleqtion backends from either C++ or Python. Generate -an API key from your `Superstaq account `__ and export -it as an environment variable: - -.. code:: bash - - export SUPERSTAQ_API_KEY="superstaq_api_key" - -Submission from C++ -````````````````````````` - -To target quantum kernel code for execution on Infleqtion's backends, -pass the flag ``--target infleqtion`` to the ``nvq++`` compiler. - -.. code:: bash - - nvq++ --target infleqtion src.cpp - -This will take the API key and handle all authentication with, and submission to, Infleqtion's QPU -(or simulator). By default, quantum kernel code will be submitted to Infleqtion's Sqale -simulator. - -To execute your kernels on a QPU, pass the ``--infleqtion-machine`` flag to the ``nvq++`` compiler -to specify which machine to submit quantum kernels to: - -.. code:: bash - - nvq++ --target infleqtion --infleqtion-machine cq_sqale_qpu src.cpp ... - -where ``cq_sqale_qpu`` is an example of a physical QPU. - -To run an ideal dry-run execution on the QPU, additionally pass ``dry-run`` with the ``--infleqtion-method`` -flag to the ``nvq++`` compiler: - -.. code:: bash - - nvq++ --target infleqtion --infleqtion-machine cq_sqale_qpu --infleqtion-method dry-run src.cpp ... - -To noisily simulate the QPU instead, pass ``noise-sim`` to the ``--infleqtion-method`` flag like so: - -.. code:: bash - - nvq++ --target infleqtion --infleqtion-machine cq_sqale_qpu --infleqtion-method noise-sim src.cpp ... - -Alternatively, to emulate the Infleqtion machine locally, without submitting through the cloud, -you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target -specific compiler diagnostics, before running a noise free emulation. - -.. code:: bash - - nvq++ --emulate --target infleqtion src.cpp - -To see a complete example for using Infleqtion's backends, take a look at our :doc:`C++ examples <../examples/examples>`. - -Submission from Python -````````````````````````` - -The target to which quantum kernels are submitted -can be controlled with the ``cudaq::set_target()`` function. - -.. code:: python - - cudaq.set_target("infleqtion") - -By default, quantum kernel code will be submitted to Infleqtion's Sqale -simulator. - -To specify which Infleqtion QPU to use, set the :code:`machine` parameter. - -.. code:: python - - cudaq.set_target("infleqtion", machine="cq_sqale_qpu") - -where ``cq_sqale_qpu`` is an example of a physical QPU. 
- -To run an ideal dry-run execution of the QPU, additionally set the ``method`` flag to ``"dry-run"``. - -.. code:: python - - cudaq.set_target("infleqtion", machine="cq_sqale_qpu", method="dry-run") - -To noisily simulate the QPU instead, set the ``method`` flag to ``"noise-sim"``. - -.. code:: python - - cudaq.set_target("infleqtion", machine="cq_sqale_qpu", method="noise-sim") - -Alternatively, to emulate the Infleqtion machine locally, without submitting through the cloud, -you can also set the ``emulate`` flag to ``True``. This will emit any target -specific compiler diagnostics, before running a noise free emulation. - -.. code:: python - - cudaq.set_target("infleqtion", emulate=True) - -The number of shots for a kernel execution can be set through -the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default, -the ``shots_count`` is set to 1000. - -.. code:: python - - cudaq.sample(kernel, shots_count=100) - -To see a complete example for using Infleqtion's backends, take a look at our :doc:`Python examples <../examples/examples>`. -Moreover, for an end-to-end application workflow example executed on the Infleqtion QPU, take a look at the -:doc:`Anderson Impurity Model ground state solver <../applications>` notebook. - -IonQ -================================== - -.. _ionq-backend: - -Setting Credentials -````````````````````````` - -Programmers of CUDA-Q may access the `IonQ Quantum Cloud -`__ from either C++ or Python. Generate -an API key from your `IonQ account `__ and export -it as an environment variable: - -.. code:: bash - - export IONQ_API_KEY="ionq_generated_api_key" - -Submission from C++ -````````````````````````` - -To target quantum kernel code for execution in the IonQ Cloud, -pass the flag ``--target ionq`` to the ``nvq++`` compiler. - -.. code:: bash - - nvq++ --target ionq src.cpp - -This will take the API key and handle all authentication with, and submission to, -the IonQ QPU(s). By default, quantum kernel code will be submitted to the IonQ -simulator. - -.. note:: - - A "target" in :code:`cudaq` refers to a quantum compute provider, such as :code:`ionq`. - However, IonQ's documentation uses the term "target" to refer to specific QPU's themselves. - -To execute your kernels on a QPU, pass the ``--ionq-machine`` flag to the ``nvq++`` compiler -to specify which machine to submit quantum kernels to: - -.. code:: bash - - nvq++ --target ionq --ionq-machine qpu.aria-1 src.cpp ... - -where ``qpu.aria-1`` is an example of a physical QPU. - -A list of available QPUs can be found `in the API documentation -`__. To see which backends are available -with your subscription login to your `IonQ account `__. - -To emulate the IonQ machine locally, without submitting through the cloud, -you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target -specific compiler diagnostics, before running a noise free emulation. - -.. code:: bash - - nvq++ --emulate --target ionq src.cpp - -To see a complete example for using IonQ's backends, take a look at our :doc:`C++ examples <../examples/examples>`. - -Submission from Python -````````````````````````` - -The target to which quantum kernels are submitted -can be controlled with the ``cudaq::set_target()`` function. - -.. code:: python - - cudaq.set_target('ionq') - -By default, quantum kernel code will be submitted to the IonQ -simulator. - -.. note:: - - A "target" in :code:`cudaq` refers to a quantum compute provider, such as :code:`ionq`. 
- However, IonQ's documentation uses the term "target" to refer to specific QPU's themselves. - -To specify which IonQ QPU to use, set the :code:`qpu` parameter. - -.. code:: python - - cudaq.set_target("ionq", qpu="qpu.aria-1") - -where ``qpu.aria-1`` is an example of a physical QPU. - -A list of available QPUs can be found `in the API documentation -`__. To see which backends are available -with your subscription login to your `IonQ account `__. - -To emulate the IonQ machine locally, without submitting through the cloud, -you can also set the ``emulate`` flag to ``True``. This will emit any target -specific compiler diagnostics, before running a noise free emulation. - -.. code:: python - - cudaq.set_target('ionq', emulate=True) - -The number of shots for a kernel execution can be set through -the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default, -the ``shots_count`` is set to 1000. - -.. code:: python - - cudaq.sample(kernel, shots_count=10000) - -To see a complete example for using IonQ's backends, take a look at our :doc:`Python examples <../examples/examples>`. - -Anyon Technologies/Anyon Computing -================================== - -.. _anyon-backend: - -Setting Credentials -``````````````````` - -Programmers of CUDA-Q may access the Anyon API from either -C++ or Python. Anyon requires a credential configuration file with username and password. -The configuration file can be generated as follows, replacing -the ```` and ```` in the first line with your Anyon Technologies -account details. The credential in the file will be used by CUDA-Q to login to Anyon quantum services -and will be updated by CUDA-Q with an obtained API token and refresh token. -Note, the credential line will be deleted in the updated configuration file. - -.. code:: bash - - echo 'credentials: {"username":"","password":""}' > $HOME/.anyon_config - -Users can also login and get the keys manually using the following commands: - -.. code:: bash - - # You may need to run: `apt-get update && apt-get install curl jq` - curl -X POST --user ":" -H "Content-Type: application/json" \ - https://api.anyon.cloud:5000/login > credentials.json - id_token=`cat credentials.json | jq -r '."id_token"'` - refresh_token=`cat credentials.json | jq -r '."refresh_token"'` - echo "key: $id_token" > ~/.anyon_config - echo "refresh: $refresh_token" >> ~/.anyon_config - -The path to the configuration can be specified as an environment variable: - -.. code:: bash - - export CUDAQ_ANYON_CREDENTIALS=$HOME/.anyon_config - - -Submission from C++ -````````````````````````` - -To target quantum kernel code for execution in the Anyon Technologies backends, -pass the flag ``--target anyon`` to the ``nvq++`` compiler. CUDA-Q will -authenticate via the Anyon Technologies REST API using the credential in your configuration file. - -.. code:: bash - - nvq++ --target anyon -- src.cpp ... - -To execute your kernels using Anyon Technologies backends, pass the ``--anyon-machine`` flag to the ``nvq++`` compiler -as the ``--`` to specify which machine to submit quantum kernels to: - -.. code:: bash - - nvq++ --target anyon --anyon-machine telegraph-8q src.cpp ... - -where ``telegraph-8q`` is an example of a physical QPU (Architecture: Telegraph, Qubit Count: 8). - -Currently, ``telegraph-8q`` and ``berkeley-25q`` are available for access over CUDA-Q. - -To emulate the Anyon Technologies machine locally, without submitting through the cloud, -you can also pass the ``--emulate`` flag as the ``--`` to ``nvq++``. 
This will emit any target -specific compiler warnings and diagnostics, before running a noise free emulation. - -.. code:: bash - - nvq++ --target anyon --emulate src.cpp - -To see a complete example for using Anyon's backends, take a look at our :doc:`C++ examples <../examples/examples>`. - - -Submission from Python -````````````````````````` - -The target to which quantum kernels are submitted -can be controlled with the ``cudaq.set_target()`` function. - -To execute your kernels using Anyon Technologies backends, specify which machine to submit quantum kernels to -by setting the :code:`machine` parameter of the target. -If :code:`machine` is not specified, the default machine will be ``telegraph-8q``. - -.. code:: python - - cudaq.set_target('anyon', machine='telegraph-8q') - -As shown above, ``telegraph-8q`` is an example of a physical QPU. - -To emulate the Anyon Technologies machine locally, without submitting through the cloud, -you can also set the ``emulate`` flag to ``True``. This will emit any target -specific compiler warnings and diagnostics, before running a noise free emulation. - -.. code:: python - - cudaq.set_target('anyon', emulate=True) - -The number of shots for a kernel execution can be set through -the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default, -the ``shots_count`` is set to 1000. - -.. code:: python - - cudaq.sample(kernel, shots_count=10000) - -To see a complete example for using Anyon's backends, take a look at our :doc:`Python examples <../examples/examples>`. - -IQM -================================== - -.. _iqm-backend: - -Support for submissions to IQM is currently under development. -In particular, two-qubit gates can only be performed on adjacent qubits. For more information, we refer to the respective hardware documentation. -Support for automatically injecting the necessary operations during compilation to execute arbitrary multi-qubit gates will be added in future versions. - -Setting Credentials -````````````````````````` - -Programmers of CUDA-Q may access the IQM Server from either C++ or Python. Following the `quick start guide `__, install `iqm-cortex-cli` and login to initialize the tokens file. -The path to the tokens file can either be passed explicitly via an environment variable or it will be loaded automatically if located in -the default location :code:`~/.cache/iqm-cortex-cli/tokens.json`. - -.. code:: bash - - export IQM_TOKENS_FILE="path/to/tokens.json" - -Submission from C++ -````````````````````````` - -To target quantum kernel code for execution on an IQM Server, -pass the ``--target iqm`` flag to the ``nvq++`` compiler, along with a specified ``--iqm-machine``. - -.. note:: - The ``--iqm-machine`` is a mandatory argument. This provided architecture must match - the device architecture that the program has been compiled against. The hardware architecture for a - specific IQM Server may be checked via `https:///cocos/quantum-architecture`. - -.. code:: bash - - nvq++ --target iqm --iqm-machine Adonis src.cpp - -Once the binary for a specific IQM QPU architecture is compiled, it can be executed against any IQM Server with the same QPU architecture: - -.. code:: bash - - nvq++ --target iqm --iqm-machine Adonis src.cpp -o program - IQM_SERVER_URL="https://demo.qc.iqm.fi/cocos" ./program - - # Executing the same program against an IQM Server with a different underlying QPU - # architecture will result in an error. 
- IQM_SERVER_URL="https:///cocos" ./program - -To emulate the IQM machine locally, without submitting to the IQM Server, -you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target -specific compiler diagnostics, before running a noise free emulation. - -.. code:: bash - - nvq++ --emulate --target iqm --iqm-machine Adonis src.cpp - -To see a complete example for using IQM server backends, take a look at our :doc:`C++ examples <../examples/examples>`. - -Submission from Python -````````````````````````` - -The target to which quantum kernels are submitted -can be controlled with the ``cudaq::set_target()`` function. - -.. code:: python - - cudaq.set_target("iqm", url="https:///cocos", **{"qpu-architecture": "Adonis"}) - -To emulate the IQM Server locally, without submitting to the IQM Server, -you can also set the ``emulate`` flag to ``True``. This will emit any target -specific compiler diagnostics, before running a noise free emulation. - -.. code:: python - - cudaq.set_target('iqm', emulate=True) - -The number of shots for a kernel execution can be set through -the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default, -the ``shots_count`` is set to 1000. - -.. code:: python - - cudaq.sample(kernel, shots_count=10000) - -To see a complete example for using IQM server backends, take a look at our :doc:`Python examples<../examples/examples>`. - -OQC -================================== - -.. _oqc-backend: - -`Oxford Quantum Circuits `__ (OQC) is currently providing CUDA-Q integration for multiple Quantum Processing Unit types. -The 8 qubit ring topology Lucy device and the 32 qubit Kagome lattice topology Toshiko device are both supported via machine options described below. - -Setting Credentials -````````````````````````` - -In order to use the OQC devices you will need to register. -Registration is achieved by contacting `oqc_qcaas_support@oxfordquantumcircuits.com`. - -Once registered you will be able to authenticate with your ``email`` and ``password`` - -There are three environment variables that the OQC target will look for during configuration: - -1. ``OQC_URL`` -2. ``OQC_EMAIL`` -3. ``OQC_PASSWORD`` - is mandatory - -Submission from C++ -````````````````````````` - -To target quantum kernel code for execution on the OQC platform, provide the flag ``--target oqc`` to the ``nvq++`` compiler. - -Users may provide their :code:`email` and :code:`url` as extra arguments - -.. code:: bash - - nvq++ --target oqc --oqc-email --oqc-url src.cpp -o executable - -Where both environment variables and extra arguments are supplied, precedent is given to the extra arguments. -To run the output, provide the runtime loaded variables and invoke the pre-built executable - -.. code:: bash - - OQC_PASSWORD= ./executable - -To emulate the OQC device locally, without submitting through the OQC QCaaS services, you can pass the ``--emulate`` flag to ``nvq++``. -This will emit any target specific compiler warnings and diagnostics, before running a noise free emulation. - -.. code:: bash - - nvq++ --emulate --target oqc src.cpp -o executable - - -.. note:: - - The oqc target supports a ``--oqc-machine`` option. - The default is the 8 qubit Lucy device. - You can set this to be either ``toshiko`` or ``lucy`` via this flag. - -.. 
note:: - - The OQC quantum assembly toolchain (qat) which is used to compile and execute instructions can be found on github as `oqc-community/qat `__ - -Submission from Python -````````````````````````` - -To set which OQC URL, set the :code:`url` parameter. -To set which OQC email, set the :code:`email` parameter. -To set which OQC machine, set the :code:`machine` parameter. - -.. code:: python - - import os - import cudaq - # ... - os.environ['OQC_PASSWORD'] = password - cudaq.set_target("oqc", url=url, machine="lucy") - -You can then execute a kernel against the platform using the OQC Lucy device - -.. code:: python - - kernel = cudaq.make_kernel() - qvec = kernel.qalloc(2) - kernel.h(qvec[0]) - kernel.x(qvec[1]) - kernel.cx(qvec[0], qvec[1]) - kernel.mz(qvec) - str(cudaq.sample(kernel=kernel, shots_count=1000)) - - -ORCA Computing -================================== - -.. _orca-backend: - -ORCA Computing's PT Series implement the boson sampling model of quantum computation, in which -multiple single photons are interfered with each other within a network of beam splitters, and -photon detectors measure where the photons leave this network. This process is implemented within -a time-bin interferometer (TBI) architecture where photons are created in different time-bins -and interfered within a network of delay lines. This can be represented by a circuit diagram, -like the one below, where this illustration example corresponds to 4 photons in 8 modes sent into -alternating time-bins in a circuit composed of two delay lines in series. - -.. image:: ../examples/images/orca_tbi.png - :width: 400px +.. figure:: qpus.png + :width: 1000 :align: center -Setting Credentials -``````````````````` - -Programmers of CUDA-Q may access the ORCA API from either C++ or Python. There is an environment -variable ``ORCA_ACCESS_URL`` that can be set so that the ORCA target can look for it during -configuration. - -.. code:: bash - - export ORCA_ACCESS_URL="https://" - - -Sometimes the requests to the PT-1 require an authentication token. This token can be set as an -environment variable named ``ORCA_AUTH_TOKEN``. For example, if the token is :code:`AbCdEf123456`, -you can set the environment variable as follows: - -.. code:: bash - - export ORCA_AUTH_TOKEN="AbCdEf123456" - - -Submission from C++ -````````````````````````` - -To execute a boson sampling experiment on the ORCA platform, provide the flag -``--target orca`` to the ``nvq++`` compiler. You should then pass the ``--orca-url`` flag set with -the previously set environment variable ``$ORCA_ACCESS_URL`` or an :code:`url`. - -.. code:: bash - - nvq++ --target orca --orca-url $ORCA_ACCESS_URL src.cpp -o executable - -or - -.. code:: bash - - nvq++ --target orca --orca-url src.cpp -o executable - -To run the output, invoke the executable - -.. code:: bash - - ./executable - - -To see a complete example for using ORCA server backends, take a look at our :doc:`C++ examples <../examples/hardware_providers>`. - -Submission from Python -````````````````````````` - -To set which ORCA URL to be used, set the :code:`url` parameter. - -.. code:: python - - import os - import cudaq - # ... - orca_url = os.getenv("ORCA_ACCESS_URL", "http://localhost/sample") - - cudaq.set_target("orca", url=orca_url) - - -You can then execute a time-bin boson sampling experiment against the platform using an ORCA device. - -.. 
code:: python - - bs_angles = [np.pi / 3, np.pi / 6] - input_state = [1, 1, 1] - loop_lengths = [1] - counts = cudaq.orca.sample(input_state, loop_lengths, bs_angles) - -To see a complete example for using ORCA's backends, take a look at our :doc:`Python examples <../examples/hardware_providers>`. - -Quantinuum -================================== - -.. _quantinuum-backend: - -Setting Credentials -``````````````````` - -Programmers of CUDA-Q may access the Quantinuum API from either -C++ or Python. Quantinuum requires a credential configuration file. -The configuration file can be generated as follows, replacing -the ``email`` and ``credentials`` in the first line with your Quantinuum -account details. - -.. code:: bash - - # You may need to run: `apt-get update && apt-get install curl jq` - curl -X POST -H "Content Type: application/json" \ - -d '{ "email":"@email.com","password":"" }' \ - https://qapi.quantinuum.com/v1/login > $HOME/credentials.json - id_token=`cat $HOME/credentials.json | jq -r '."id-token"'` - refresh_token=`cat $HOME/credentials.json | jq -r '."refresh-token"'` - echo "key: $id_token" >> $HOME/.quantinuum_config - echo "refresh: $refresh_token" >> $HOME/.quantinuum_config - -The path to the configuration can be specified as an environment variable: - -.. code:: bash - - export CUDAQ_QUANTINUUM_CREDENTIALS=$HOME/.quantinuum_config - - -Submission from C++ -````````````````````````` - -To target quantum kernel code for execution in the Quantinuum backends, -pass the flag ``--target quantinuum`` to the ``nvq++`` compiler. CUDA-Q will -authenticate via the Quantinuum REST API using the credential in your configuration file. -By default, quantum kernel code will be submitted to the Quantinuum syntax checker. -Submission to the syntax checker merely validates the program; the kernels are not executed. - -.. code:: bash - - nvq++ --target quantinuum src.cpp ... - -To execute your kernels, pass the ``--quantinuum-machine`` flag to the ``nvq++`` compiler -to specify which machine to submit quantum kernels to: - -.. code:: bash - - nvq++ --target quantinuum --quantinuum-machine H1-2 src.cpp ... - -where ``H1-2`` is an example of a physical QPU. Hardware specific -emulators may be accessed by appending an ``E`` to the end (e.g, ``H1-2E``). For -access to the syntax checker for the provided machine, you may append an ``SC`` -to the end (e.g, ``H1-1SC``). - -For a comprehensive list of available machines, login to your `Quantinuum user account `__ -and navigate to the "Account" tab, where you should find a table titled "Machines". - -To emulate the Quantinuum machine locally, without submitting through the cloud, -you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target -specific compiler warnings and diagnostics, before running a noise free emulation. - -.. code:: bash - - nvq++ --emulate --target quantinuum src.cpp - -To see a complete example for using Quantinuum's backends, take a look at our :doc:`C++ examples <../examples/examples>`. - - -Submission from Python -````````````````````````` - -The target to which quantum kernels are submitted -can be controlled with the ``cudaq::set_target()`` function. - -.. code:: python - - cudaq.set_target('quantinuum') - -By default, quantum kernel code will be submitted to the Quantinuum syntax checker. -Submission to the syntax checker merely validates the program; the kernels are not executed. 
- -To execute your kernels, specify which machine to submit quantum kernels to -by setting the :code:`machine` parameter of the target. - -.. code:: python - - cudaq.set_target('quantinuum', machine='H1-2') - -where ``H1-2`` is an example of a physical QPU. Hardware specific -emulators may be accessed by appending an ``E`` to the end (e.g, ``H1-2E``). For -access to the syntax checker for the provided machine, you may append an ``SC`` -to the end (e.g, ``H1-1SC``). - -For a comprehensive list of available machines, login to your `Quantinuum user account `__ -and navigate to the "Account" tab, where you should find a table titled "Machines". - -To emulate the Quantinuum machine locally, without submitting through the cloud, -you can also set the ``emulate`` flag to ``True``. This will emit any target -specific compiler warnings and diagnostics, before running a noise free emulation. - -.. code:: python - - cudaq.set_target('quantinuum', emulate=True) - -The number of shots for a kernel execution can be set through -the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default, -the ``shots_count`` is set to 1000. - -.. code:: python - - cudaq.sample(kernel, shots_count=10000) - -To see a complete example for using Quantinuum's backends, take a look at our :doc:`Python examples <../examples/examples>`. - -QuEra Computing -================================== - -.. _quera-backend: - -Setting Credentials -``````````````````` - -Programmers of CUDA-Q may access Aquila, QuEra's first generation of quantum -processing unit (QPU) via Amazon Braket. Hence, users must first enable Braket by -following `these instructions `__. -Then set credentials using any of the documented `methods `__. -One of the simplest ways is to use `AWS CLI `__. - -.. code:: bash - - aws configure - -Alternatively, users can set the following environment variables. - -.. code:: bash - - export AWS_DEFAULT_REGION="us-east-1" - export AWS_ACCESS_KEY_ID="" - export AWS_SECRET_ACCESS_KEY="" - export AWS_SESSION_TOKEN="" - - -Submission from C++ -````````````````````````` - -Not yet supported. - - -Submission from Python -````````````````````````` - -The target to which quantum kernels are submitted -can be controlled with the ``cudaq::set_target()`` function. - -.. code:: python - - cudaq.set_target('quera') - -By default, analog Hamiltonian will be submitted to the Aquila system. - -Aquila is a "field programmable qubit array" operated as an analog -Hamiltonian simulator on a user-configurable architecture, executing -programmable coherent quantum dynamics on up to 256 neutral-atom qubits. -Refer to QuEra's `whitepaper `__ for details. - -Due to the nature of the underlying hardware, this target only supports the -``evolve`` and ``evolve_async`` APIs. -The `hamiltonian` must be an `Operator` of the type `RydbergHamiltonian`. Only -other parameters supported are `schedule` (mandatory) and `shots_count` (optional). +.. toctree:: + :maxdepth: 2 + + Ion Trap QPUs + Superconducting QPUs + Neutral Atom QPUs + Photonic QPUs -For example, -.. code:: python - evolution_result = evolve(RydbergHamiltonian(atom_sites=register, - amplitude=omega, - phase=phi, - delta_global=delta), - schedule=schedule) -The number of shots for a kernel execution can be set through the ``shots_count`` -argument to ``evolve`` or ``evolve_async``. By default, the ``shots_count`` is -set to 100. -.. 
code:: python
-
-    cudaq.evolve(RydbergHamiltonian(...), schedule=s, shots_count=1000)
-
-To see a complete example for using QuEra's backend, take a look at our :doc:`Python examples <../examples/hardware_providers>`.
diff --git a/docs/sphinx/using/backends/hardware/iontrap.rst b/docs/sphinx/using/backends/hardware/iontrap.rst
new file mode 100644
index 0000000000..596618bd4f
--- /dev/null
+++ b/docs/sphinx/using/backends/hardware/iontrap.rst
@@ -0,0 +1,216 @@
+Ion Trap
+============
+
+IonQ
++++++++
+
+.. _ionq-backend:
+
+Setting Credentials
+`````````````````````````
+
+Programmers of CUDA-Q may access the `IonQ Quantum Cloud
+`__ from either C++ or Python. Generate
+an API key from your `IonQ account `__ and export
+it as an environment variable:
+
+.. code:: bash
+
+    export IONQ_API_KEY="ionq_generated_api_key"
+
+
+Submitting
+`````````````````````````
+.. tab:: Python
+
+    First, set the :code:`ionq` backend.
+
+    .. code:: python
+
+        cudaq.set_target('ionq')
+
+    By default, quantum kernel code will be submitted to the IonQ simulator.
+
+    .. note::
+
+        A "target" in :code:`cudaq` refers to a quantum compute provider, such as :code:`ionq`.
+        However, IonQ's documentation uses the term "target" to refer to specific QPUs themselves.
+
+    To specify which IonQ QPU to use, set the :code:`qpu` parameter.
+
+    .. code:: python
+
+        cudaq.set_target("ionq", qpu="qpu.aria-1")
+
+    where ``qpu.aria-1`` is an example of a physical QPU.
+
+    A list of available QPUs can be found `in the API documentation `__. To see which backends are available with your subscription, log in to your `IonQ account `__.
+
+    To emulate the IonQ machine locally, without submitting through the cloud, you can also set the ``emulate`` flag to ``True``. This will emit any target specific compiler diagnostics, before running a noise free emulation.
+
+    .. code:: python
+
+        cudaq.set_target('ionq', emulate=True)
+
+    The number of shots for a kernel execution can be set through the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default, the ``shots_count`` is set to 1000.
+
+    .. code:: python
+
+        cudaq.sample(kernel, shots_count=10000)
+
+    To see a complete example for using IonQ's backends, take a look at our :doc:`Python examples <../../examples/examples>`.
+
+
+.. tab:: C++
+
+    To target quantum kernel code for execution in the IonQ Cloud,
+    pass the flag ``--target ionq`` to the ``nvq++`` compiler.
+
+    .. code:: bash
+
+        nvq++ --target ionq src.cpp
+
+    This will take the API key and handle all authentication with, and submission to, the IonQ QPU(s). By default, quantum kernel code will be submitted to the IonQ simulator.
+
+    .. note::
+
+        A "target" in :code:`cudaq` refers to a quantum compute provider, such as :code:`ionq`.
+        However, IonQ's documentation uses the term "target" to refer to specific QPUs themselves.
+
+    To execute your kernels on a QPU, pass the ``--ionq-machine`` flag to the ``nvq++`` compiler to specify which machine to submit quantum kernels to:
+
+    .. code:: bash
+
+        nvq++ --target ionq --ionq-machine qpu.aria-1 src.cpp ...
+
+    where ``qpu.aria-1`` is an example of a physical QPU.
+
+    A list of available QPUs can be found `in the API documentation `__. To see which backends are available with your subscription, log in to your `IonQ account `__.
+
+    To emulate the IonQ machine locally, without submitting through the cloud, you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target specific compiler diagnostics, before running a noise free emulation.
+
+    .. code:: bash
+
+        nvq++ --emulate --target ionq src.cpp
+
+    To see a complete example for using IonQ's backends, take a look at our :doc:`C++ examples <../../examples/examples>`.
+
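+A minimal end-to-end Python flow using the local emulation path could look like the
+following sketch (the Bell kernel here is an illustrative assumption, not part of the
+IonQ configuration itself):
+
+.. code:: python
+
+    import cudaq
+
+    # Emulate locally; drop `emulate=True` to submit through the IonQ cloud.
+    cudaq.set_target('ionq', emulate=True)
+
+    @cudaq.kernel
+    def bell():
+        qubits = cudaq.qvector(2)
+        h(qubits[0])
+        x.ctrl(qubits[0], qubits[1])
+        mz(qubits)
+
+    # Roughly half the shots should yield `00` and the other half `11`.
+    print(cudaq.sample(bell, shots_count=1000))
+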
+
+Quantinuum
++++++++++++
+
+.. _quantinuum-backend:
+
+Setting Credentials
+```````````````````
+
+Programmers of CUDA-Q may access the Quantinuum API from either
+C++ or Python. Quantinuum requires a credential configuration file.
+The configuration file can be generated as follows, replacing
+the ``email`` and ``credentials`` in the first line with your Quantinuum
+account details.
+
+.. code:: bash
+
+    # You may need to run: `apt-get update && apt-get install curl jq`
+    curl -X POST -H "Content-Type: application/json" \
+        -d '{ "email":"@email.com","password":"" }' \
+        https://qapi.quantinuum.com/v1/login > $HOME/credentials.json
+    id_token=`cat $HOME/credentials.json | jq -r '."id-token"'`
+    refresh_token=`cat $HOME/credentials.json | jq -r '."refresh-token"'`
+    echo "key: $id_token" >> $HOME/.quantinuum_config
+    echo "refresh: $refresh_token" >> $HOME/.quantinuum_config
+
+The path to the configuration can be specified as an environment variable:
+
+.. code:: bash
+
+    export CUDAQ_QUANTINUUM_CREDENTIALS=$HOME/.quantinuum_config
+
+
+Submitting
+`````````````````````````
+.. tab:: Python
+
+    The backend to which quantum kernels are submitted
+    can be controlled with the ``cudaq::set_target()`` function.
+
+    .. code:: python
+
+        cudaq.set_target('quantinuum')
+
+    By default, quantum kernel code will be submitted to the Quantinuum syntax checker.
+    Submission to the syntax checker merely validates the program; the kernels are not executed.
+
+    To execute your kernels, specify which machine to submit quantum kernels to
+    by setting the :code:`machine` parameter of the target.
+
+    .. code:: python
+
+        cudaq.set_target('quantinuum', machine='H1-2')
+
+    where ``H1-2`` is an example of a physical QPU. Hardware-specific
+    emulators may be accessed by appending an ``E`` to the end (e.g., ``H1-2E``). For
+    access to the syntax checker for the provided machine, you may append an ``SC``
+    to the end (e.g., ``H1-1SC``).
+
+    For a comprehensive list of available machines, log in to your `Quantinuum user account `__
+    and navigate to the "Account" tab, where you should find a table titled "Machines".
+
+    To emulate the Quantinuum machine locally, without submitting through the cloud,
+    you can also set the ``emulate`` flag to ``True``. This will emit any target
+    specific compiler warnings and diagnostics, before running a noise free emulation.
+
+    .. code:: python
+
+        cudaq.set_target('quantinuum', emulate=True)
+
+    The number of shots for a kernel execution can be set through
+    the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default,
+    the ``shots_count`` is set to 1000.
+
+    .. code:: python
+
+        cudaq.sample(kernel, shots_count=10000)
+
+    To see a complete example for using Quantinuum's backends, take a look at our :doc:`Python examples <../../examples/examples>`.
+
+
+.. tab:: C++
+
+    To target quantum kernel code for execution in the Quantinuum backends,
+    pass the flag ``--target quantinuum`` to the ``nvq++`` compiler. CUDA-Q will
+    authenticate via the Quantinuum REST API using the credential in your configuration file.
+    By default, quantum kernel code will be submitted to the Quantinuum syntax checker.
+    Submission to the syntax checker merely validates the program; the kernels are not executed.
+
+    .. code:: bash
+
+        nvq++ --target quantinuum src.cpp ...
+
+    To execute your kernels, pass the ``--quantinuum-machine`` flag to the ``nvq++`` compiler
+    to specify which machine to submit quantum kernels to:
+
+    .. code:: bash
+
+        nvq++ --target quantinuum --quantinuum-machine H1-2 src.cpp ...
+
+    where ``H1-2`` is an example of a physical QPU. Hardware-specific
+    emulators may be accessed by appending an ``E`` to the end (e.g., ``H1-2E``). For
+    access to the syntax checker for the provided machine, you may append an ``SC``
+    to the end (e.g., ``H1-1SC``).
+
+    For a comprehensive list of available machines, log in to your `Quantinuum user account `__
+    and navigate to the "Account" tab, where you should find a table titled "Machines".
+
+    To emulate the Quantinuum machine locally, without submitting through the cloud,
+    you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target
+    specific compiler warnings and diagnostics, before running a noise free emulation.
+
+    .. code:: bash
+
+        nvq++ --emulate --target quantinuum src.cpp
+
+    To see a complete example for using Quantinuum's backends, take a look at our :doc:`C++ examples <../../examples/examples>`.
+
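+Because ``cudaq.observe`` is also supported, a typical expectation-value workflow against
+the local emulator can be sketched as follows (the Hamiltonian, ansatz, and angle are
+illustrative assumptions):
+
+.. code:: python
+
+    import cudaq
+    from cudaq import spin
+
+    # Emulate locally; set machine='H1-2' instead to submit through the cloud.
+    cudaq.set_target('quantinuum', emulate=True)
+
+    @cudaq.kernel
+    def ansatz(angle: float):
+        q = cudaq.qvector(2)
+        x(q[0])
+        ry(angle, q[1])
+        x.ctrl(q[1], q[0])
+
+    # A toy two-qubit Hamiltonian; shots_count controls the sampling statistics.
+    hamiltonian = spin.z(0) + 0.5 * spin.x(1)
+    result = cudaq.observe(ansatz, hamiltonian, 0.59, shots_count=2000)
+    print(result.expectation())
+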
diff --git a/docs/sphinx/using/backends/hardware/neutralatom.rst b/docs/sphinx/using/backends/hardware/neutralatom.rst
new file mode 100644
index 0000000000..4ad5bb89db
--- /dev/null
+++ b/docs/sphinx/using/backends/hardware/neutralatom.rst
@@ -0,0 +1,202 @@
+Neutral Atom
+=============
+
+Infleqtion
++++++++++++
+
+.. _infleqtion-backend:
+
+Infleqtion is a quantum hardware provider of gate-based neutral atom quantum computers. Their backends may be
+accessed via `Superstaq `__, a cross-platform software API from Infleqtion
+that performs low-level compilation and cross-layer optimization. To get started, users can create a Superstaq
+account by following `these instructions `__.
+
+For access to Infleqtion's neutral atom quantum computer, Sqale,
+`pre-registration `__ is now open.
+
+Setting Credentials
+`````````````````````````
+
+Programmers of CUDA-Q may access Infleqtion backends from either C++ or Python. Generate
+an API key from your `Superstaq account `__ and export
+it as an environment variable:
+
+.. code:: bash
+
+    export SUPERSTAQ_API_KEY="superstaq_api_key"
+
+
+Submitting
+`````````````````````````
+
+.. tab:: Python
+
+    The target to which quantum kernels are submitted
+    can be controlled with the ``cudaq::set_target()`` function.
+
+    .. code:: python
+
+        cudaq.set_target("infleqtion")
+
+    By default, quantum kernel code will be submitted to Infleqtion's Sqale
+    simulator.
+
+    To specify which Infleqtion QPU to use, set the :code:`machine` parameter.
+
+    .. code:: python
+
+        cudaq.set_target("infleqtion", machine="cq_sqale_qpu")
+
+    where ``cq_sqale_qpu`` is an example of a physical QPU.
+
+    To run an ideal dry-run execution of the QPU, additionally set the ``method`` flag to ``"dry-run"``.
+
+    .. code:: python
+
+        cudaq.set_target("infleqtion", machine="cq_sqale_qpu", method="dry-run")
+
+    To noisily simulate the QPU instead, set the ``method`` flag to ``"noise-sim"``.
+
+    .. code:: python
+
+        cudaq.set_target("infleqtion", machine="cq_sqale_qpu", method="noise-sim")
+
+    Alternatively, to emulate the Infleqtion machine locally, without submitting through the cloud,
+    you can also set the ``emulate`` flag to ``True``. This will emit any target
+    specific compiler diagnostics, before running a noise free emulation.
+
+    .. code:: python
+
+        cudaq.set_target("infleqtion", emulate=True)
+
+    The number of shots for a kernel execution can be set through
+    the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default,
+    the ``shots_count`` is set to 1000.
+
+    .. code:: python
+
+        cudaq.sample(kernel, shots_count=100)
+
+    To see a complete example for using Infleqtion's backends, take a look at our :doc:`Python examples <../../examples/examples>`.
+    Moreover, for an end-to-end application workflow example executed on the Infleqtion QPU, take a look at the
+    :doc:`Anderson Impurity Model ground state solver <../../applications>` notebook.
+
+
+.. tab:: C++
+
+    To target quantum kernel code for execution on Infleqtion's backends,
+    pass the flag ``--target infleqtion`` to the ``nvq++`` compiler.
+
+    .. code:: bash
+
+        nvq++ --target infleqtion src.cpp
+
+    This will take the API key and handle all authentication with, and submission to, Infleqtion's QPU
+    (or simulator). By default, quantum kernel code will be submitted to Infleqtion's Sqale
+    simulator.
+
+    To execute your kernels on a QPU, pass the ``--infleqtion-machine`` flag to the ``nvq++`` compiler
+    to specify which machine to submit quantum kernels to:
+
+    .. code:: bash
+
+        nvq++ --target infleqtion --infleqtion-machine cq_sqale_qpu src.cpp ...
+
+    where ``cq_sqale_qpu`` is an example of a physical QPU.
+
+    To run an ideal dry-run execution on the QPU, additionally pass ``dry-run`` with the ``--infleqtion-method``
+    flag to the ``nvq++`` compiler:
+
+    .. code:: bash
+
+        nvq++ --target infleqtion --infleqtion-machine cq_sqale_qpu --infleqtion-method dry-run src.cpp ...
+
+    To noisily simulate the QPU instead, pass ``noise-sim`` to the ``--infleqtion-method`` flag like so:
+
+    .. code:: bash
+
+        nvq++ --target infleqtion --infleqtion-machine cq_sqale_qpu --infleqtion-method noise-sim src.cpp ...
+
+    Alternatively, to emulate the Infleqtion machine locally, without submitting through the cloud,
+    you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target
+    specific compiler diagnostics, before running a noise free emulation.
+
+    .. code:: bash
+
+        nvq++ --emulate --target infleqtion src.cpp
+
+    To see a complete example for using Infleqtion's backends, take a look at our :doc:`C++ examples <../../examples/examples>`.
+
+
+
+
+QuEra Computing
+++++++++++++++++
+
+
+.. _quera-backend:
+
+Setting Credentials
+```````````````````
+
+Programmers of CUDA-Q may access Aquila, QuEra's first-generation quantum
+processing unit (QPU), via Amazon Braket. Hence, users must first enable Braket by
+following `these instructions `__.
+Then set credentials using any of the documented `methods `__.
+One of the simplest ways is to use the `AWS CLI `__.
+
+.. code:: bash
+
+    aws configure
+
+Alternatively, users can set the following environment variables.
+
+.. code:: bash
+
+    export AWS_DEFAULT_REGION="us-east-1"
+    export AWS_ACCESS_KEY_ID=""
+    export AWS_SECRET_ACCESS_KEY=""
+    export AWS_SESSION_TOKEN=""
+
+Submission from Python
+`````````````````````````
+
+The target to which quantum kernels are submitted
+can be controlled with the ``cudaq::set_target()`` function.
+
+.. code:: python
+
+    cudaq.set_target('quera')
+
+By default, the analog Hamiltonian simulation will be submitted to the Aquila system.
+
+Aquila is a "field programmable qubit array" operated as an analog
+Hamiltonian simulator on a user-configurable architecture, executing
+programmable coherent quantum dynamics on up to 256 neutral-atom qubits.
+Refer to QuEra's `whitepaper `__ for details.
+
+Due to the nature of the underlying hardware, this target only supports the
+``evolve`` and ``evolve_async`` APIs.
+The `hamiltonian` must be an `Operator` of the type `RydbergHamiltonian`. The only
+other supported parameters are `schedule` (mandatory) and `shots_count` (optional).
+
+For example,
+
+.. code:: python
+
+    evolution_result = evolve(RydbergHamiltonian(atom_sites=register,
+                                                 amplitude=omega,
+                                                 phase=phi,
+                                                 delta_global=delta),
+                              schedule=schedule)
+
+The number of shots for a kernel execution can be set through the ``shots_count``
+argument to ``evolve`` or ``evolve_async``. By default, the ``shots_count`` is
+set to 100.
+
+.. code:: python
+
+    cudaq.evolve(RydbergHamiltonian(...), schedule=s, shots_count=1000)
+
+To see a complete example for using QuEra's backend, take a look at our :doc:`Python examples <../../examples/hardware_providers>`.
diff --git a/docs/sphinx/using/backends/hardware/photonic.rst b/docs/sphinx/using/backends/hardware/photonic.rst
new file mode 100644
index 0000000000..b00cf1609c
--- /dev/null
+++ b/docs/sphinx/using/backends/hardware/photonic.rst
@@ -0,0 +1,96 @@
+Photonic
+==========
+
+ORCA Computing
++++++++++++++++
+
+.. _orca-backend:
+
+ORCA Computing's PT Series implements the boson sampling model of quantum computation, in which
+multiple single photons are interfered with each other within a network of beam splitters, and
+photon detectors measure where the photons leave this network. This process is implemented within
+a time-bin interferometer (TBI) architecture, where photons are created in different time-bins
+and interfered within a network of delay lines. This can be represented by a circuit diagram,
+like the one below, which corresponds to 4 photons in 8 modes sent into
+alternating time-bins in a circuit composed of two delay lines in series.
+
+.. image:: ../../examples/images/orca_tbi.png
+    :width: 400px
+    :align: center
+
+
+Setting Credentials
+```````````````````
+
+Programmers of CUDA-Q may access the ORCA API from either C++ or Python. There is an environment
+variable ``ORCA_ACCESS_URL`` that can be set so that the ORCA target can look for it during
+configuration.
+
+.. code:: bash
+
+    |:spellcheck-disable:|export ORCA_ACCESS_URL="https://"|:spellcheck-enable:|
+
+
+Sometimes the requests to the PT-1 require an authentication token. This token can be set as an
+environment variable named ``ORCA_AUTH_TOKEN``. For example, if the token is :code:`AbCdEf123456`,
+you can set the environment variable as follows:
+
+.. code:: bash
+
+    |:spellcheck-disable:|export ORCA_AUTH_TOKEN="AbCdEf123456"|:spellcheck-enable:|
+
+Submitting
+`````````````````````````
+
+.. tab:: Python
+
+    To specify which ORCA URL to use, set the :code:`url` parameter.
+
+    .. code:: python
+
+        import os
+        import cudaq
+        # ...
+        orca_url = os.getenv("ORCA_ACCESS_URL", "http://localhost/sample")
+
+        cudaq.set_target("orca", url=orca_url)
+
+
+    You can then execute a time-bin boson sampling experiment against the platform using an ORCA device.
+
+    .. code:: python
+
+        bs_angles = [np.pi / 3, np.pi / 6]
+        input_state = [1, 1, 1]
+        loop_lengths = [1]
+        counts = cudaq.orca.sample(input_state, loop_lengths, bs_angles)
+
+    To see a complete example for using ORCA's backends, take a look at our :doc:`Python examples <../../examples/hardware_providers>`.
+
+
+.. tab:: C++
+
+    To execute a boson sampling experiment on the ORCA platform, provide the flag
+    ``--target orca`` to the ``nvq++`` compiler. You should also pass the ``--orca-url`` flag,
+    set either to the previously defined environment variable ``$ORCA_ACCESS_URL`` or to the
+    server URL directly.
+
+    .. code:: bash
+
+        nvq++ --target orca --orca-url $ORCA_ACCESS_URL src.cpp -o executable
+
+    or
+
+    .. code:: bash
+
+        nvq++ --target orca --orca-url src.cpp -o executable
+
+    To run the output, invoke the executable:
+
+    .. code:: bash
+
+        ./executable
+
+
+    To see a complete example for using ORCA server backends, take a look at our :doc:`C++ examples <../../examples/hardware_providers>`.
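+
+Putting the pieces together, a complete submission script might look like the following
+sketch (the beam splitter angles, input state, and localhost fallback URL are illustrative
+assumptions):
+
+.. code:: python
+
+    import os
+    import numpy as np
+    import cudaq
+
+    # Point the target at the PT Series; an authentication token, if required,
+    # is read from the ORCA_AUTH_TOKEN environment variable.
+    orca_url = os.getenv("ORCA_ACCESS_URL", "http://localhost/sample")
+    cudaq.set_target("orca", url=orca_url)
+
+    # Three photons interfered in a single-loop TBI with two beam splitter angles.
+    bs_angles = [np.pi / 3, np.pi / 6]
+    input_state = [1, 1, 1]
+    loop_lengths = [1]
+
+    counts = cudaq.orca.sample(input_state, loop_lengths, bs_angles)
+    print(counts)
+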
diff --git a/docs/sphinx/using/backends/hardware/superconducting.rst b/docs/sphinx/using/backends/hardware/superconducting.rst
new file mode 100644
index 0000000000..1217c05d38
--- /dev/null
+++ b/docs/sphinx/using/backends/hardware/superconducting.rst
@@ -0,0 +1,300 @@
+Superconducting
+=================
+
+Anyon Technologies/Anyon Computing
++++++++++++++++++++++++++++++++++++
+
+.. _anyon-backend:
+
+Setting Credentials
+```````````````````
+
+Programmers of CUDA-Q may access the Anyon API from either
+C++ or Python. Anyon requires a credential configuration file with username and password.
+The configuration file can be generated as follows, replacing
+the username and password in the first line with your Anyon Technologies
+account details. The credential in the file will be used by CUDA-Q to log in to Anyon quantum services
+and will be updated by CUDA-Q with an obtained API token and refresh token.
+Note that the credential line will be deleted in the updated configuration file.
+
+.. code:: bash
+
+    echo 'credentials: {"username":"","password":""}' > $HOME/.anyon_config
+
+Users can also log in and get the keys manually using the following commands:
+
+.. code:: bash
+
+    # You may need to run: `apt-get update && apt-get install curl jq`
+    curl -X POST --user ":" -H "Content-Type: application/json" \
+        https://api.anyon.cloud:5000/login > credentials.json
+    id_token=`cat credentials.json | jq -r '."id_token"'`
+    refresh_token=`cat credentials.json | jq -r '."refresh_token"'`
+    echo "key: $id_token" > ~/.anyon_config
+    echo "refresh: $refresh_token" >> ~/.anyon_config
+
+The path to the configuration can be specified as an environment variable:
+
+.. code:: bash
+
+    export CUDAQ_ANYON_CREDENTIALS=$HOME/.anyon_config
+
+Submitting
+```````````````````
+
+.. tab:: Python
+
+    The target to which quantum kernels are submitted
+    can be controlled with the ``cudaq.set_target()`` function.
+
+    To execute your kernels using Anyon Technologies backends, specify which machine to submit quantum kernels to
+    by setting the :code:`machine` parameter of the target.
+    If :code:`machine` is not specified, the default machine will be ``telegraph-8q``.
+
+    .. code:: python
+
+        cudaq.set_target('anyon', machine='telegraph-8q')
+
+    As shown above, ``telegraph-8q`` is an example of a physical QPU.
+
+    To emulate the Anyon Technologies machine locally, without submitting through the cloud,
+    you can also set the ``emulate`` flag to ``True``. This will emit any target
+    specific compiler warnings and diagnostics, before running a noise free emulation.
+
+    .. code:: python
+
+        cudaq.set_target('anyon', emulate=True)
+
+    The number of shots for a kernel execution can be set through
+    the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default,
+    the ``shots_count`` is set to 1000.
+
+    .. code:: python
+
+        cudaq.sample(kernel, shots_count=10000)
+
+    To see a complete example for using Anyon's backends, take a look at our :doc:`Python examples <../../examples/examples>`.
+
+
+.. tab:: C++
+
+    To target quantum kernel code for execution in the Anyon Technologies backends,
+    pass the flag ``--target anyon`` to the ``nvq++`` compiler. CUDA-Q will
+    authenticate via the Anyon Technologies REST API using the credential in your configuration file.
+
+    .. code:: bash
+
+        nvq++ --target anyon src.cpp ...
+
+    To execute your kernels using Anyon Technologies backends, pass the ``--anyon-machine`` flag to the ``nvq++`` compiler
+    to specify which machine to submit quantum kernels to:
+
+    .. code:: bash
+
+        nvq++ --target anyon --anyon-machine telegraph-8q src.cpp ...
+
+    where ``telegraph-8q`` is an example of a physical QPU (Architecture: Telegraph, Qubit Count: 8).
+
+    Currently, ``telegraph-8q`` and ``berkeley-25q`` are available for access over CUDA-Q.
+
+    To emulate the Anyon Technologies machine locally, without submitting through the cloud,
+    you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target
+    specific compiler warnings and diagnostics, before running a noise free emulation.
+
+    .. code:: bash
+
+        nvq++ --target anyon --emulate src.cpp
+
+    To see a complete example for using Anyon's backends, take a look at our :doc:`C++ examples <../../examples/examples>`.
+
+
+IQM
++++++++++
+
+.. _iqm-backend:
+
+Support for submissions to IQM is currently under development.
+In particular, two-qubit gates can only be performed on adjacent qubits. For more information, we refer to the respective hardware documentation.
+Support for automatically injecting the necessary operations during compilation to execute arbitrary multi-qubit gates will be added in future versions.
+
+Setting Credentials
+`````````````````````````
+
+Programmers of CUDA-Q may access the IQM Server from either C++ or Python. Following the `quick start guide `__, install `iqm-cortex-cli` and log in to initialize the tokens file.
+The path to the tokens file can either be passed explicitly via an environment variable, or it will be loaded automatically if located in
+the default location :code:`~/.cache/iqm-cortex-cli/tokens.json`.
+
+.. code:: bash
+
+    export IQM_TOKENS_FILE="path/to/tokens.json"
+
+
+
+Submitting
+`````````````````````````
+
+.. tab:: Python
+
+    The target to which quantum kernels are submitted
+    can be controlled with the ``cudaq::set_target()`` function.
+
+    .. code:: python
+
+        cudaq.set_target("iqm", url="https:///cocos", **{"qpu-architecture": "Adonis"})
+
+    To emulate the IQM Server locally, without submitting to the IQM Server,
+    you can also set the ``emulate`` flag to ``True``. This will emit any target
+    specific compiler diagnostics, before running a noise free emulation.
+
+    .. code:: python
+
+        cudaq.set_target('iqm', emulate=True)
+
+    The number of shots for a kernel execution can be set through
+    the ``shots_count`` argument to ``cudaq.sample`` or ``cudaq.observe``. By default,
+    the ``shots_count`` is set to 1000.
+
+    .. code:: python
+
+        cudaq.sample(kernel, shots_count=10000)
+
+    To see a complete example for using IQM server backends, take a look at our :doc:`Python examples <../../examples/examples>`.
+
+
+.. tab:: C++
+
+    To target quantum kernel code for execution on an IQM Server,
+    pass the ``--target iqm`` flag to the ``nvq++`` compiler, along with a specified ``--iqm-machine``.
+
+    .. note::
+
+        The ``--iqm-machine`` is a mandatory argument. The provided architecture must match
+        the device architecture that the program has been compiled against. The hardware architecture for a
+        specific IQM Server may be checked via `https:///cocos/quantum-architecture`.
+
+    .. code:: bash
+
+        nvq++ --target iqm --iqm-machine Adonis src.cpp
+
+    Once the binary for a specific IQM QPU architecture is compiled, it can be executed against any IQM Server with the same QPU architecture:
+
+    .. code:: bash
+
+        nvq++ --target iqm --iqm-machine Adonis src.cpp -o program
+        IQM_SERVER_URL="https://demo.qc.iqm.fi/cocos" ./program
+
+        # Executing the same program against an IQM Server with a different underlying QPU
+        # architecture will result in an error.
+        IQM_SERVER_URL="https:///cocos" ./program
+
+    To emulate the IQM machine locally, without submitting to the IQM Server,
+    you can also pass the ``--emulate`` flag to ``nvq++``. This will emit any target
+    specific compiler diagnostics, before running a noise free emulation.
+
+    .. code:: bash
+
+        nvq++ --emulate --target iqm --iqm-machine Adonis src.cpp
+
+    To see a complete example for using IQM server backends, take a look at our :doc:`C++ examples <../../examples/examples>`.
+
+
+OQC
+++++
+
+.. _oqc-backend:
+
+
+
+`Oxford Quantum Circuits `__ (OQC) is currently providing CUDA-Q integration for multiple Quantum Processing Unit types.
+The 8-qubit ring topology Lucy device and the 32-qubit Kagome lattice topology Toshiko device are both supported via machine options described below.
+
+Setting Credentials
+`````````````````````````
+
+In order to use the OQC devices, you will need to register.
+Registration is achieved by contacting `oqc_qcaas_support@oxfordquantumcircuits.com`.
+
+Once registered, you will be able to authenticate with your ``email`` and ``password``.
+
+There are three environment variables that the OQC target will look for during configuration:
+
+1. ``OQC_URL``
+2. ``OQC_EMAIL``
+3. ``OQC_PASSWORD`` (mandatory)
+
+
+Submitting
+`````````````````````````
+
+
+.. tab:: Python
+
+    Set the :code:`url`, :code:`email`, and :code:`machine` parameters to specify the OQC URL,
+    email, and machine, respectively.
+
+    .. code:: python
+
+        import os
+        import cudaq
+        # ...
+        os.environ['OQC_PASSWORD'] = password
+        cudaq.set_target("oqc", url=url, machine="lucy")
+
+    You can then execute a kernel against the platform using the OQC Lucy device:
+
+    .. code:: python
+
+        kernel = cudaq.make_kernel()
+        qvec = kernel.qalloc(2)
+        kernel.h(qvec[0])
+        kernel.x(qvec[1])
+        kernel.cx(qvec[0], qvec[1])
+        kernel.mz(qvec)
+        str(cudaq.sample(kernel=kernel, shots_count=1000))
+
+
+.. tab:: C++
+
+    To target quantum kernel code for execution on the OQC platform, provide the flag ``--target oqc`` to the ``nvq++`` compiler.
+
+    Users may provide their :code:`email` and :code:`url` as extra arguments:
+
+    .. code:: bash
+
+        nvq++ --target oqc --oqc-email --oqc-url src.cpp -o executable
+
+    Where both environment variables and extra arguments are supplied, precedence is given to the extra arguments.
+    To run the output, provide the runtime-loaded variables and invoke the pre-built executable:
+
+    .. code:: bash
+
+        OQC_PASSWORD= ./executable
+
+    To emulate the OQC device locally, without submitting through the OQC QCaaS services, you can pass the ``--emulate`` flag to ``nvq++``.
+    This will emit any target specific compiler warnings and diagnostics, before running a noise free emulation.
+
+    ..
code:: bash + + nvq++ --emulate --target oqc src.cpp -o executable + + + .. note:: + + The oqc target supports a ``--oqc-machine`` option. + The default is the 8 qubit Lucy device. + You can set this to be either ``toshiko`` or ``lucy`` via this flag. + + .. note:: + + The OQC quantum assembly toolchain (qat) which is used to compile and execute instructions can be found on github as `oqc-community/qat `__ + + diff --git a/docs/sphinx/using/backends/qpus.png b/docs/sphinx/using/backends/qpus.png new file mode 100644 index 0000000000..7c0f5e7522 Binary files /dev/null and b/docs/sphinx/using/backends/qpus.png differ diff --git a/docs/sphinx/using/backends/platform.rst b/docs/sphinx/using/backends/sims/mqpusims.rst similarity index 92% rename from docs/sphinx/using/backends/platform.rst rename to docs/sphinx/using/backends/sims/mqpusims.rst index da019f3f65..9fcbc5afdc 100644 --- a/docs/sphinx/using/backends/platform.rst +++ b/docs/sphinx/using/backends/sims/mqpusims.rst @@ -1,5 +1,7 @@ -Multi-Processor Platforms ---------------------------------------------------- + +Multiple QPUs +=========================== + The CUDA-Q machine model elucidates the various devices considered in the broader quantum-classical compute node context. Programmers will have one or many host CPUs, zero or many NVIDIA GPUs, a classical QPU control space, and the @@ -17,11 +19,11 @@ specific asynchronous function invocations targeting a desired QPU. .. _mqpu-platform: -NVIDIA `MQPU` Platform -++++++++++++++++++++++ +Simulate Multiple QPUs in Parallel ++++++++++++++++++++++++++++++++++++++ -In the multi-QPU mode (:code:`mqpu` option), the NVIDIA target provides a simulated QPU for every available NVIDIA GPU on the underlying system. -Each QPU is simulated via a `cuStateVec` simulator backend as defined by the NVIDIA target. For more information about using multiple GPUs +In the multi-QPU mode (:code:`mqpu` option), the NVIDIA backend provides a simulated QPU for every available NVIDIA GPU on the underlying system. +Each QPU is simulated via a `cuStateVec` simulator backend as defined by the NVIDIA backend. For more information about using multiple GPUs to simulate each virtual QPU, or using a different backend for virtual QPUs, please see :ref:`remote MQPU platform `. This target enables asynchronous parallel execution of quantum kernel tasks. @@ -29,13 +31,13 @@ Here is a simple example demonstrating its usage. .. tab:: Python - .. literalinclude:: ../../snippets/python/using/cudaq/platform/sample_async.py + .. literalinclude:: ../../../snippets/python/using/cudaq/platform/sample_async.py :language: python :start-after: [Begin Documentation] .. tab:: C++ - .. literalinclude:: ../../snippets/cpp/using/cudaq/platform/sample_async.cpp + .. literalinclude:: ../../../snippets/cpp/using/cudaq/platform/sample_async.cpp :language: cpp :start-after: [Begin Documentation] :end-before: [End Documentation] @@ -76,13 +78,13 @@ QPU via the :code:`cudaq::get_state_async` (C++) or :code:`cudaq.get_state_async .. tab:: Python - .. literalinclude:: ../../snippets/python/using/cudaq/platform/get_state_async.py + .. literalinclude:: ../../../snippets/python/using/cudaq/platform/get_state_async.py :language: python :start-after: [Begin Documentation] .. tab:: C++ - .. literalinclude:: ../../snippets/cpp/using/cudaq/platform/get_state_async.cpp + .. 
literalinclude:: ../../../snippets/cpp/using/cudaq/platform/get_state_async.cpp
      :language: cpp
      :start-after: [Begin Documentation]
      :end-before: [End Documentation]
@@ -95,6 +97,9 @@ QPU via the :code:`cudaq::get_state_async` (C++) or :code:`cudaq.get_state_async
 
       nvq++ get_state_async.cpp --target nvidia --target-option mqpu
       ./a.out
 
+See the `Hadamard Test notebook `__ for an application that leverages the `mqpu` backend.
+
+
 .. deprecated:: 0.8
     The :code:`nvidia-mqpu` and :code:`nvidia-mqpu-fp64` targets, which are equivalent to the multi-QPU options `mqpu,fp32` and `mqpu,fp64`, respectively, of the :code:`nvidia` target, are deprecated and will be removed in a future release.
@@ -115,7 +120,7 @@ An example of MPI distribution mode usage in both C++ and Python is given below:
 
 .. tab:: Python
 
-   .. literalinclude:: ../../snippets/python/using/cudaq/platform/observe_mqpu_mpi.py
+   .. literalinclude:: ../../../snippets/python/using/cudaq/platform/observe_mqpu_mpi.py
      :language: python
      :start-after: [Begin Documentation]
@@ -125,7 +130,7 @@ An example of MPI distribution mode usage in both C++ and Python is given below:
 
 .. tab:: C++
 
-   .. literalinclude:: ../../snippets/cpp/using/cudaq/platform/observe_mqpu_mpi.cpp
+   .. literalinclude:: ../../../snippets/cpp/using/cudaq/platform/observe_mqpu_mpi.cpp
      :language: cpp
      :start-after: [Begin Documentation]
     :end-before: [End Documentation]
@@ -140,8 +145,8 @@ CUDA-Q provides MPI utility functions to initialize, finalize, or query (rank, s
 Last but not least, the compiled executable (C++) or Python script needs to be launched
 with an appropriate MPI command, e.g., :code:`mpiexec`, :code:`mpirun`, :code:`srun`, etc.
 
-Remote `MQPU` Platform
-+++++++++++++++++++++++++++
+Multi-QPU + Other Backends
++++++++++++++++++++++++++++++
 
 .. _remote-mqpu-platform:
@@ -154,14 +159,14 @@ each simulated by a `tensornet` simulator backend.
 
 .. tab:: Python
 
-   .. literalinclude:: ../../snippets/python/using/cudaq/platform/sample_async_remote.py
+   .. literalinclude:: ../../../snippets/python/using/cudaq/platform/sample_async_remote.py
      :language: python
      :start-after: [Begin Documentation]
     :end-before: [End Documentation]
 
 .. tab:: C++
 
-   .. literalinclude:: ../../snippets/cpp/using/cudaq/platform/sample_async_remote.cpp
+   .. literalinclude:: ../../../snippets/cpp/using/cudaq/platform/sample_async_remote.cpp
      :language: cpp
      :start-after: [Begin Documentation]
     :end-before: [End Documentation]
diff --git a/docs/sphinx/using/backends/sims/noisy.rst b/docs/sphinx/using/backends/sims/noisy.rst
new file mode 100644
index 0000000000..9549d815ef
--- /dev/null
+++ b/docs/sphinx/using/backends/sims/noisy.rst
@@ -0,0 +1,76 @@
+
+Density Matrix Simulators
+==================================
+
+
+Density Matrix
+++++++++++++++++
+
+.. _density-matrix-cpu-backend:
+
+Density matrix simulation is helpful for understanding the impact of noise on quantum applications. Unlike state vector simulation, which manipulates the :math:`2^n` state vector, density matrix simulation manipulates the :math:`2^n \times 2^n` density matrix, which defines an ensemble of states. To learn how you can leverage the :code:`density-matrix-cpu` backend to study the impact of noise models on your applications, see the `example here `__.
+
+The `Quantum Volume notebook `__ also demonstrates a full application that leverages the :code:`density-matrix-cpu` backend.
+
+To execute a program on the :code:`density-matrix-cpu` target, use the following commands:
+
+.. tab:: Python
+
+    .. code:: bash
+
+        python3 program.py [...] --target density-matrix-cpu
+
+    The target can also be defined in the application code by calling
+
+    .. code:: python
+
+        cudaq.set_target('density-matrix-cpu')
+
+    If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation.
+
+.. tab:: C++
+
+    .. code:: bash
+
+        nvq++ --target density-matrix-cpu program.cpp [...] -o program.x
+        ./program.x
+
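+For instance, a minimal sketch of a noisy simulation in Python might look as follows (the
+kernel, channel choice, and probability value here are illustrative assumptions):
+
+.. code:: python
+
+    import cudaq
+
+    cudaq.set_target('density-matrix-cpu')
+
+    # Attach a 10% depolarization channel to every `x` gate acting on qubit 0.
+    noise = cudaq.NoiseModel()
+    noise.add_channel('x', [0], cudaq.DepolarizationChannel(0.1))
+
+    @cudaq.kernel
+    def bit_flip_prone():
+        q = cudaq.qubit()
+        x(q)
+        mz(q)
+
+    # Sampling with the noise model yields a mixture of |0> and |1> outcomes;
+    # without it, the result would be deterministic.
+    print(cudaq.sample(bit_flip_prone, noise_model=noise))
+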
+Stim
+++++++
+
+.. _stim-backend:
+
+This backend provides a fast simulator for circuits containing *only* Clifford
+gates. Any non-Clifford gates (such as T gates and Toffoli gates) are not
+supported. This simulator is based on the `Stim `_
+library.
+
+To execute a program on the :code:`stim` target, use the following commands:
+
+.. tab:: Python
+
+    .. code:: bash
+
+        python3 program.py [...] --target stim
+
+    The target can also be defined in the application code by calling
+
+    .. code:: python
+
+        cudaq.set_target('stim')
+
+    If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation.
+
+.. tab:: C++
+
+    .. code:: bash
+
+        nvq++ --target stim program.cpp [...] -o program.x
+        ./program.x
+
+.. note::
+    CUDA-Q currently executes kernels using a "shot-by-shot" execution approach.
+    This allows for conditional gate execution (i.e., full control flow), but it
+    can be slower than executing Stim a single time and generating all the shots
+    from that single execution.
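+
+A minimal sketch of a Clifford-only workload that benefits from this backend (the kernel and
+qubit count are illustrative assumptions):
+
+.. code:: python
+
+    import cudaq
+
+    cudaq.set_target('stim')
+
+    @cudaq.kernel
+    def ghz(qubit_count: int):
+        qubits = cudaq.qvector(qubit_count)
+        h(qubits[0])
+        # A chain of CNOTs entangles all qubits; every gate here is Clifford.
+        for i in range(qubit_count - 1):
+            x.ctrl(qubits[i], qubits[i + 1])
+        mz(qubits)
+
+    # Large Clifford circuits like this one are intractable for state vector
+    # simulation but sample quickly on the Stim-based backend.
+    print(cudaq.sample(ghz, 100, shots_count=1000))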
diff --git a/docs/sphinx/using/examples/photonic_operations.rst b/docs/sphinx/using/backends/sims/photonics.rst
similarity index 53%
rename from docs/sphinx/using/examples/photonic_operations.rst
rename to docs/sphinx/using/backends/sims/photonics.rst
index 18bd1ffbfa..ba785e951f 100644
--- a/docs/sphinx/using/examples/photonic_operations.rst
+++ b/docs/sphinx/using/backends/sims/photonics.rst
@@ -1,8 +1,51 @@
+Photonics Simulators
+=======================
+
+CUDA-Q provides the ability to simulate photonics circuits. This page provides the details
+needed to run photonics simulations, followed by an introduction to photonics kernels.
+
+
+orca-photonics
+----------------
+
+The :code:`orca-photonics` backend provides a state vector simulator built on the :code:`Q++` library.
+The :code:`orca-photonics` backend supports a double-precision simulator that can run on multiple CPUs.
+
+OpenMP CPU-only
+^^^^^^^^^^^^^^^^^^^^
+
+.. _qpp-cpu-photonics-backend:
+
+This target provides a state vector simulator based on the CPU-only, OpenMP threaded `Q++ `_ library.
+To execute a program on the :code:`orca-photonics` target, use the following commands:
+
+.. tab:: Python
+
+    .. code:: bash
+
+        python3 program.py [...] --target orca-photonics
+
+    The target can also be defined in the application code by calling
+
+    .. code:: python
+
+        cudaq.set_target('orca-photonics')
+
+    If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation.
+
+.. tab:: C++
+
+    .. code:: bash
+
+        nvq++ --library-mode --target orca-photonics program.cpp [...] -o program.x
+
+
 Photonics 101
-======================
+^^^^^^^^^^^^^^^^
+The following provides a basic introduction to photonics circuits so that you can simulate your own photonics circuits.
 
 Quantum Photonic States
------------------------------
+++++++++++++++++++++++++
 
 We define a qumode (qudit) to have the states :math:`\ket{0}`, :math:`\ket{1}`, ... :math:`\ket{d}` in Dirac notation where:
@@ -43,7 +86,7 @@ relatively large.
 
 Quantum Photonics Gates
------------------------
+++++++++++++++++++++++++
 
 We can manipulate the state of a qumode via quantum photonic gates. For
 example, the create gate allows us to increase the number of photons in a
@@ -63,7 +106,7 @@ qumode up to a maximum given by the qudit level :math:`d`:
     \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix} =
     \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix}
 
-.. literalinclude:: ../../snippets/python/using/examples/create_photonic_gate.py
+.. literalinclude:: ../../../snippets/python/using/examples/create_photonic_gate.py
    :language: python
    :start-after: [Begin Docs]
   :end-before: [End Docs]
@@ -90,7 +133,7 @@ value 0, the operation has no effect:
     \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix} =
     \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix}
 
-.. literalinclude:: ../../snippets/python/using/examples/annihilate_photonic_gate.py
+.. literalinclude:: ../../../snippets/python/using/examples/annihilate_photonic_gate.py
    :language: python
   :start-after: [Begin Docs]
   :end-before: [End Docs]
@@ -125,7 +168,7 @@ As an example, the code below implements a simulation of the Hong-Ou-Mandel
 effect, in which two identical photons that interfere on a balanced beam
 splitter leave the beam splitter together.
 
-.. literalinclude:: ../../snippets/python/using/examples/beam_splitter_photonic_gate.py
+.. literalinclude:: ../../../snippets/python/using/examples/beam_splitter_photonic_gate.py
    :language: python
   :start-after: [Begin Docs]
   :end-before: [End Docs]
@@ -135,11 +178,11 @@ splitter leave the beam splitter together.
 
     { 02:491 20:509 }
 
-For a full list of photonic gates supported in CUDA-Q see
-:doc:`../../api/default_ops`.
+For a full list of photonic gates supported in CUDA-Q, see
+:doc:`Photonic Operations on Qudits <../../../api/default_ops>`.
 
 Measurements
------------------------------
+++++++++++++++
 
 Quantum theory is probabilistic and hence requires statistical inference
 to derive observations. Prior to measurement, the state of a qumode is
@@ -156,3 +199,100 @@ or :math:`\lvert \alpha_d \rvert ^2`, respectively.
 
 As we see in the example of the `beam_splitter` gate above, states 02 and 20
 are yielded roughly 50% of the times, providing and illustration of the
 Hong-Ou-Mandel effect.
+
+Executing Photonics Kernels
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+In order to execute a photonics kernel, you need to specify a photonics simulator backend, such as :code:`orca-photonics`, which is used in the example below.
+There are two ways to execute photonics kernels: :code:`sample` and :code:`get_state`.
+
+
+The :code:`sample` command can be used to generate statistics about the quantum state.
+
+
+.. code:: python
+
+    import cudaq
+    import numpy as np
+
+    qumode_count = 2
+
+    # Define the simulation target.
+    cudaq.set_target("orca-photonics")
+
+    # Define a quantum kernel function.
+
+
+    @cudaq.kernel
+    def kernel(qumode_count: int):
+        level = qumode_count + 1
+        qumodes = [qudit(level) for _ in range(qumode_count)]
+
+        # Apply the create gate to the qumodes.
+        for i in range(qumode_count):
+            create(qumodes[i])  # |00⟩ -> |11⟩
+
+        # Apply the beam_splitter gate to the qumodes.
+        beam_splitter(qumodes[0], qumodes[1], np.pi / 6)
+
+        # Measure all qumodes.
+        mz(qumodes)
+
+
+    result = cudaq.sample(kernel, qumode_count, shots_count=1000)
+
+    print(result)
+
+
+.. parsed-literal::
+
+    { 02:376 11:234 20:390 }
+
+
+The :code:`get_state` command can be used to access the statevector of the computation.
+
+.. code:: python
+
+    import cudaq
+    import numpy as np
+
+    qumode_count = 2
+
+    # Define the simulation target.
+    cudaq.set_target("orca-photonics")
+
+    # Define a quantum kernel function.
+
+
+    @cudaq.kernel
+    def kernel(qumode_count: int):
+        level = qumode_count + 1
+        qumodes = [qudit(level) for _ in range(qumode_count)]
+
+        # Apply the create gate to the qumodes.
+        for i in range(qumode_count):
+            create(qumodes[i])  # |00⟩ -> |11⟩
+
+        # Apply the beam_splitter gate to the qumodes.
+        beam_splitter(qumodes[0], qumodes[1], np.pi / 6)
+
+        # Measure the qumodes only if measurement results are needed.
+        # mz(qumodes)
+
+
+    # Compute the statevector of the kernel
+    result = cudaq.get_state(kernel, qumode_count)
+
+    print(np.array(result))
+
+
+.. parsed-literal::
+
+    [ 0.        +0.j  0.        +0.j -0.61237244+0.j  0.        +0.j
+      0.5       +0.j  0.        +0.j  0.61237244+0.j  0.        +0.j
+      0.        +0.j]
+
+The statevector generated by the :code:`get_state` command follows the little-endian convention for associating numbers with their digit string representations, which places the least significant digit on the right. That is, for the example of a 2-qumode system of level 3 (in which the possible states are 0, 1, and 2), we have the following translation between integers and digit strings:
+
+.. image:: photonics_notation.png
+    :scale: 25%
+
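+As a quick sanity check, this translation can also be reproduced programmatically. The helper
+below is an illustrative sketch (not part of the CUDA-Q API): it converts each statevector
+index of the 2-qumode, level-3 system above into its base-3 digit string.
+
+.. code:: python
+
+    import numpy as np
+
+    level = 3         # each qumode has states 0, 1, and 2
+    qumode_count = 2
+
+    for index in range(level**qumode_count):
+        # np.base_repr gives the base-3 digits; zfill pads to one digit per
+        # qumode, leaving the least significant digit on the right.
+        digits = np.base_repr(index, base=level).zfill(qumode_count)
+        print(f"{index} -> {digits}")
+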
diff --git a/docs/sphinx/using/backends/sims/photonics_notation.png b/docs/sphinx/using/backends/sims/photonics_notation.png
new file mode 100644
index 0000000000..4cc63f389a
Binary files /dev/null and b/docs/sphinx/using/backends/sims/photonics_notation.png differ
diff --git a/docs/sphinx/using/backends/sims/svsims.rst b/docs/sphinx/using/backends/sims/svsims.rst
new file mode 100644
index 0000000000..e3925308f2
--- /dev/null
+++ b/docs/sphinx/using/backends/sims/svsims.rst
@@ -0,0 +1,278 @@
+
+State Vector Simulators
+==================================
+
+CPU
+++++
+
+.. _openmp cpu-only:
+.. _qpp-cpu-backend:
+
+The `qpp-cpu` backend provides a state vector simulator based on the CPU-only, OpenMP threaded `Q++ `_ library.
+This backend is good for basic testing and experimentation with just a few qubits, but performs poorly for all but the smallest simulations; it is the default target when running on CPU-only systems.
+
+To execute a program on the :code:`qpp-cpu` target, even if a GPU-accelerated backend is available,
+use the following commands:
+
+.. tab:: Python
+
+    .. code:: bash
+
+        python3 program.py [...] --target qpp-cpu
+
+    The target can also be defined in the application code by calling
+
+    .. code:: python
+
+        cudaq.set_target('qpp-cpu')
+
+    If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation.
+
+.. tab:: C++
+
+    .. code:: bash
+
+        nvq++ --target qpp-cpu program.cpp [...] -o program.x
+        ./program.x
+
+
+Single-GPU
+++++++++++++++
+
+.. _cuquantum single-gpu:
+.. _default-simulator:
+.. _nvidia-backend:
+
+
+The :code:`nvidia` backend provides a state vector simulator accelerated with
+the :code:`cuStateVec` library. The `cuStateVec documentation `__ provides a detailed explanation of how the simulations are performed on the GPU.
+
+The :code:`nvidia` target supports multiple configurable options, including specification of the floating point precision.
+
+To execute a program on the :code:`nvidia` backend, use the following commands:
+
+.. tab:: Python
+
+    Single Precision (Default):
+
+    .. code:: bash
+
+        python3 program.py [...] --target nvidia --target-option fp32
+
+    Double Precision:
+
+    .. code:: bash
+
+        python3 program.py [...] --target nvidia --target-option fp64
+
+    The target can also be defined in the application code by calling
+
+    .. code:: python
+
+        cudaq.set_target('nvidia', option='fp64')
+
+    If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation.
+
+.. tab:: C++
+
+    Single Precision (Default):
+
+    .. code:: bash
+
+        nvq++ --target nvidia --target-option fp32 program.cpp [...] -o program.x
+        ./program.x
+
+    Double Precision:
+
+    .. code:: bash
+
+        nvq++ --target nvidia --target-option fp64 program.cpp [...] -o program.x
+        ./program.x
+
+.. note::
+    This backend requires an NVIDIA GPU and CUDA runtime libraries. If you do not have these dependencies installed, you may encounter an error stating `Invalid simulator requested`. See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies.
+
+
+In the single-GPU mode, the :code:`nvidia` backend provides the following
+environment variable options. Any environment variables must be set prior to
+setting the target. It is worth drawing attention to gate fusion, a powerful tool for improving simulation performance, which is discussed in greater detail `here `__.
+
+.. list-table:: **Environment variable options supported in single-GPU mode**
+    :widths: 20 30 50
+
+    * - Option
+      - Value
+      - Description
+    * - ``CUDAQ_FUSION_MAX_QUBITS``
+      - positive integer
+      - The max number of qubits used for gate fusion. The default value is `4`.
+    * - ``CUDAQ_FUSION_DIAGONAL_GATE_MAX_QUBITS``
+      - integer greater than or equal to -1
+      - The max number of qubits used for diagonal gate fusion. The default value is set to `-1` and the fusion size will be automatically adjusted for better performance. If 0, gate fusion for diagonal gates is disabled.
+    * - ``CUDAQ_FUSION_NUM_HOST_THREADS``
+      - positive integer
+      - Number of CPU threads used for circuit processing. The default value is `8`.
+    * - ``CUDAQ_MAX_CPU_MEMORY_GB``
+      - non-negative integer, or `NONE`
+      - CPU memory size (in GB) allowed for state-vector migration. `NONE` means unlimited (up to physical memory constraints). Default is 0GB (disabled, variable is not set to any value).
+    * - ``CUDAQ_MAX_GPU_MEMORY_GB``
+      - positive integer, or `NONE`
+      - GPU memory (in GB) allowed for on-device state-vector allocation. As the state-vector size exceeds this limit, host memory will be utilized for migration. `NONE` means unlimited (up to physical memory constraints). This is the default.
+
+.. deprecated:: 0.8
+    The :code:`nvidia-fp64` target, which is equivalent to setting the `fp64` option on the :code:`nvidia` target,
+    is deprecated and will be removed in a future release.
+
+
+
+Multi-node multi-GPU
++++++++++++++++++++++++
+
+.. _nvidia-mgpu-backend:
+
+The :code:`nvidia` backend also provides a state vector simulator accelerated with
+the :code:`cuStateVec` library with support for Multi-Node, Multi-GPU distribution of the
+state vector.
+
+This backend is necessary to scale applications that require a state vector that cannot fit in a single GPU's memory.
+
+The multi-node multi-GPU simulator expects to run within an MPI context.
+See the `Divisive Clustering `__ application to see how this backend can be used in practice.
+
+To execute a program on the multi-node multi-GPU NVIDIA target, use the following commands
+(adjust the value of the :code:`-np` flag as needed to reflect available GPU resources on your system):
+
+.. tab:: Python
+
+    Double precision simulation:
+
+    .. code:: bash
+
+        mpiexec -np 2 python3 program.py [...] --target nvidia --target-option fp64,mgpu
+
+    Single precision simulation:
+
+    .. code:: bash
+
+        mpiexec -np 2 python3 program.py [...] --target nvidia --target-option fp32,mgpu
+
+    .. note::
+
+        If you installed CUDA-Q via :code:`pip`, you will need to install the necessary MPI dependencies separately;
+        please follow the instructions for installing dependencies in the `Project Description `__.
+
+    In addition to using MPI in the simulator, you can use it in your application code by installing `mpi4py `__, and
+    invoking the program with the command
+
+    .. code:: bash
+
+        mpiexec -np 2 python3 -m mpi4py program.py [...] --target nvidia --target-option fp64,mgpu
+
+    The target can also be defined in the application code by calling
+
+    .. code:: python
+
+        cudaq.set_target('nvidia', option='mgpu,fp64')
+
+    If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation.
+
+    .. note::
+
+        * The order of the option settings is interchangeable.
+          For example, `cudaq.set_target('nvidia', option='mgpu,fp64')` is equivalent to `cudaq.set_target('nvidia', option='fp64,mgpu')`.
+
+        * The `nvidia` target has single precision as the default setting. Thus, using `option='mgpu'` implies `option='mgpu,fp32'`.
+
+.. tab:: C++
+
+    Double precision simulation:
+
+    .. code:: bash
+
+        nvq++ --target nvidia --target-option mgpu,fp64 program.cpp [...] -o program.x
+        mpiexec -np 2 ./program.x
+
+    Single precision simulation:
+
+    .. code:: bash
+
+        nvq++ --target nvidia --target-option mgpu,fp32 program.cpp [...] -o program.x
+        mpiexec -np 2 ./program.x
+
+.. note::
+
+    This backend requires an NVIDIA GPU, CUDA runtime libraries, as well as an MPI installation. If you do not have these dependencies installed, you may encounter either an error stating `invalid simulator requested` (missing CUDA libraries), or an error along the lines of `failed to launch kernel` (missing MPI installation). See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies.
+
+    The number of processes and nodes should always be a power of 2.
+
+    Host-device state vector migration is also supported in the multi-node multi-GPU configuration.
+
+
+In addition to the environment variable options supported in the single-GPU mode,
+the :code:`nvidia` backend provides the following environment variable options particularly for
+the multi-node multi-GPU configuration. Any environment variables must be set
+prior to setting the target.
+
+.. list-table:: **Additional environment variable options for multi-node multi-GPU mode**
+    :widths: 20 30 50
+
+    * - Option
+      - Value
+      - Description
+    * - ``CUDAQ_MGPU_LIB_MPI``
+      - string
+      - The shared library name for inter-process communication. The default value is `libmpi.so`.
+    * - ``CUDAQ_MGPU_COMM_PLUGIN_TYPE``
+      - `AUTO`, `EXTERNAL`, `OpenMPI`, or `MPICH`
+      - Selects the :code:`cuStateVec` `CommPlugin` for inter-process communication. The default is `AUTO`.
+        If `EXTERNAL` is selected, `CUDAQ_MGPU_LIB_MPI` should point to an implementation of the :code:`cuStateVec` `CommPlugin` interface.
+    * - ``CUDAQ_MGPU_NQUBITS_THRESH``
+      - positive integer
+      - The qubit count threshold where state vector distribution is activated. Below this threshold, simulation is performed as independent (non-distributed) tasks across all MPI processes for optimal performance. Default is 25.
+    * - ``CUDAQ_MGPU_FUSE``
+      - positive integer
+      - The max number of qubits used for gate fusion. The default value is `6` if there is more than one MPI process or `4` otherwise.
+    * - ``CUDAQ_MGPU_P2P_DEVICE_BITS``
+      - positive integer
+      - Specify the number of GPUs that can communicate by using GPUDirect P2P. Default value is 0 (P2P communication is disabled).
+    * - ``CUDAQ_GPU_FABRIC``
+      - `MNNVL`, `NVL`, or `NONE`
+      - Automatically set the number of P2P device bits based on the total number of processes when multi-node NVLink (`MNNVL`) is selected; or the number of processes per node when NVLink (`NVL`) is selected; or disable P2P (with `NONE`).
+    * - ``CUDAQ_GLOBAL_INDEX_BITS``
+      - comma-separated list of positive integers
+      - Specify the inter-node network structure (faster to slower). For example, assuming an 8-node, 4-GPUs-per-node simulation whereby intra-group network communication is faster, this `CUDAQ_GLOBAL_INDEX_BITS` environment variable can be set to `3,2`. The first `3` represents **8** nodes with fast communication and the second `2` represents **4** 8-node groups in those total 32 nodes. Default is an empty list (no customization based on the network structure of the cluster).
+    * - ``CUDAQ_HOST_DEVICE_MIGRATION_LEVEL``
+      - positive integer
+      - Specify host-device memory migration w.r.t. the network structure. If provided, this setting determines the position to insert the number of migration index bits into the `CUDAQ_GLOBAL_INDEX_BITS` list. By default, if not set, the number of migration index bits (CPU-GPU data transfers) is appended to the end of the array of index bits (aka, the state vector distribution scheme). This default behavior is optimized for systems with fast GPU-GPU interconnects (NVLink, InfiniBand, etc.)
+
+.. deprecated:: 0.8
+    The :code:`nvidia-mgpu` backend, which is equivalent to the multi-node multi-GPU double-precision option (`mgpu,fp64`) of the :code:`nvidia` target,
+    is deprecated and will be removed in a future release.
+
+The above configuration options of the :code:`nvidia` backend
+can be tuned to reduce your simulation runtimes. One of the
+performance improvements is to fuse multiple gates together during runtime. For
+example, :code:`x(qubit0)` and :code:`x(qubit1)` can be fused together into a
+single 4x4 matrix operation on the state vector rather than 2 separate 2x2
+matrix operations on the state vector. This fusion reduces memory bandwidth on
+the GPU because the state vector is transferred into and out of memory fewer
+times. By default, up to 4 gates are fused together for single-GPU simulations,
+and up to 6 gates are fused together for multi-GPU simulations. The number of
+gates fused can **significantly** affect the performance of some circuits, so users
+can override the default fusion level by setting the `CUDAQ_MGPU_FUSE`
+environment variable to another integer value as shown below.
+
+.. tab:: Python
+
+    .. code:: bash
+
+        CUDAQ_MGPU_FUSE=5 mpiexec -np 2 python3 program.py [...] --target nvidia --target-option mgpu,fp64
+
+.. tab:: C++
+
+    .. code:: bash
+
+        nvq++ --target nvidia --target-option mgpu,fp64 program.cpp [...]
-o program.x + CUDAQ_MGPU_FUSE=5 mpiexec -np 2 ./program.x + diff --git a/docs/sphinx/using/backends/sims/tnsims.rst b/docs/sphinx/using/backends/sims/tnsims.rst new file mode 100644 index 0000000000..2e0a46ad33 --- /dev/null +++ b/docs/sphinx/using/backends/sims/tnsims.rst @@ -0,0 +1,229 @@ + +Tensor Network Simulators +================================== + +.. _tensor-backends: + +CUDA-Q provides a couple of tensor-network simulator backends accelerated with +the :code:`cuTensorNet` library. Detailed technical information on the simulator can be found `here `__. +These backends are available for use from both C++ and Python. + +Tensor network simulators are suitable for large-scale simulation of certain classes of quantum circuits involving many qubits beyond the memory limit of state vector based simulators. For example, computing the expectation value of a Hamiltonian via :code:`cudaq::observe` can be performed efficiently, thanks to :code:`cuTensorNet` contraction optimization capability. On the other hand, conditional circuits, i.e., those with mid-circuit measurements or reset, despite being supported by both backends, may result in poor performance. + +Multi-node multi-GPU +++++++++++++++++++++++ + +The :code:`tensornet` backend represents quantum states and circuits as tensor networks in an exact form (no approximation). +Measurement samples and expectation values are computed via tensor network contractions. +This backend supports multi-node, multi-GPU distribution of tensor operations required to evaluate and simulate the circuit. + +To execute a program on the :code:`tensornet` target using a *single GPU*, use the following commands: + +.. tab:: Python + + .. code:: bash + + python3 program.py [...] --target tensornet + + The target can also be defined in the application code by calling + + .. code:: python + + cudaq.set_target('tensornet') + + If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. + +.. tab:: C++ + + .. code:: bash + + nvq++ --target tensornet program.cpp [...] -o program.x + ./program.x + +If you have *multiple GPUs* available on your system, you can use MPI to automatically distribute parallelization across the visible GPUs. + +.. note:: + + If you installed the CUDA-Q Python wheels, distribution across multiple GPUs is currently not supported for this backend. + We will add support for it in future releases. For more information, see this `GitHub issue `__. + +Use the following commands to enable distribution across multiple GPUs (adjust the value of the :code:`-np` flag as needed to reflect available GPU resources on your system): + +.. tab:: Python + + .. code:: bash + + mpiexec -np 2 python3 program.py [...] --target tensornet + + In addition to using MPI in the simulator, you can use it in your application code by installing `mpi4py `__, and + invoking the program with the command + + .. code:: bash + + mpiexec -np 2 python3 -m mpi4py program.py [...] --target tensornet + +.. tab:: C++ + + .. code:: bash + + nvq++ --target tensornet program.cpp [...] -o program.x + mpiexec -np 2 ./program.x + +.. note:: + + If the `CUTENSORNET_COMM_LIB` environment variable is not set, MPI parallelization on the :code:`tensornet` backend may fail. + If you are using a CUDA-Q container, this variable is pre-configured and no additional setup is needed. 
If you are customizing your installation or have built CUDA-Q from source, please follow the instructions for `activating the distributed interface `__ for the `cuTensorNet` library. This requires
+   :ref:`installing CUDA development dependencies `, and setting the `CUTENSORNET_COMM_LIB`
+   environment variable to the newly built `libcutensornet_distributed_interface_mpi.so` library.
+
+Specific aspects of the simulation can be configured by setting the following environment variables:
+
+* **`CUDA_VISIBLE_DEVICES=X`**: Makes the process only see GPU X on multi-GPU nodes. Each MPI process must only see its own dedicated GPU. For example, if you run 8 MPI processes on a DGX system with 8 GPUs, each MPI process should be assigned its own dedicated GPU via `CUDA_VISIBLE_DEVICES` when invoking `mpiexec` (or `mpirun`) commands.
+* **`OMP_PLACES=cores`**: Set this environment variable to improve CPU parallelization.
+* **`OMP_NUM_THREADS=X`**: To enable CPU parallelization, set X to `NUMBER_OF_CORES_PER_NODE/NUMBER_OF_GPUS_PER_NODE`.
+
+.. note::
+
+  This backend requires an NVIDIA GPU and CUDA runtime libraries.
+  If you do not have these dependencies installed, you may encounter an error stating `Invalid simulator requested`.
+  See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies.
+
+.. note::
+
+  Setting the random seed, via :code:`cudaq::set_random_seed`, is not supported for this backend due to a limitation of the :code:`cuTensorNet` library. This will be fixed in a future release once this feature becomes available.
+
+
+Matrix product state
++++++++++++++++++++++++
+
+The :code:`tensornet-mps` backend is based on the matrix product state (MPS) representation of the state vector/wave function, exploiting the sparsity in the tensor network via tensor decomposition techniques such as QR and SVD. As such, this backend is an approximate simulator, whereby the number of singular values may be truncated to keep the MPS size tractable.
+The :code:`tensornet-mps` backend only supports single-GPU simulation. Its approximate nature allows the :code:`tensornet-mps` backend to handle a large number of qubits for certain classes of quantum circuits on a relatively small memory footprint.
+
+To execute a program on the :code:`tensornet-mps` target, use the following commands:
+
+.. tab:: Python
+
+   .. code:: bash
+
+      python3 program.py [...] --target tensornet-mps
+
+   The target can also be defined in the application code by calling
+
+   .. code:: python
+
+      cudaq.set_target('tensornet-mps')
+
+   If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation.
+
+.. tab:: C++
+
+   .. code:: bash
+
+      nvq++ --target tensornet-mps program.cpp [...] -o program.x
+      ./program.x
+
+Specific aspects of the simulation can be configured by defining the following environment variables:
+
+* **`CUDAQ_MPS_MAX_BOND=X`**: The maximum number of singular values to keep (fixed extent truncation). Default: 64.
+* **`CUDAQ_MPS_ABS_CUTOFF=X`**: The cutoff for the largest singular value during truncation. Singular values that are smaller will be trimmed out. Default: 1e-5.
+* **`CUDAQ_MPS_RELATIVE_CUTOFF=X`**: The cutoff for singular values relative to the largest singular value. Singular values that are smaller than this fraction of the largest singular value will be trimmed out. Default: 1e-5.
+* **`CUDAQ_MPS_SVD_ALGO=X`**: The SVD algorithm to use.
Valid values are: `GESVD` (QR algorithm), `GESVDJ` (Jacobi method), `GESVDP` (`polar decomposition `__), `GESVDR` (`randomized methods `__). Default: `GESVDJ`.
+
+.. note::
+
+  This backend requires an NVIDIA GPU and CUDA runtime libraries.
+  If you do not have these dependencies installed, you may encounter an error stating `Invalid simulator requested`.
+  See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies.
+
+.. note::
+
+  Setting the random seed, via :code:`cudaq::set_random_seed`, is not supported for this backend due to a limitation of the :code:`cuTensorNet` library. This will be fixed in a future release once this feature becomes available.
+
+.. note::
+  The parallelism of the Jacobi method (the default `CUDAQ_MPS_SVD_ALGO` setting) gives the GPU better performance on small and medium-sized matrices.
+  If you expect a large number of singular values (e.g., when increasing the `CUDAQ_MPS_MAX_BOND` setting), please adjust the `CUDAQ_MPS_SVD_ALGO` setting accordingly.
+
+
+
+Fermioniq
+++++++++++
+
+.. _fermioniq-backend:
+
+`Fermioniq `__ offers a cloud-based tensor-network emulation platform, `Ava `__,
+for the approximate simulation of large-scale quantum circuits beyond the memory limit of state vector and exact tensor network based methods.
+
+The level of approximation can be controlled by setting the bond dimension: larger values yield more accurate simulations at the expense
+of slower computation time. For a detailed description of Ava, users are referred to the `online documentation `__.
+
+Users of CUDA-Q can access a simplified version of the full Fermioniq emulator (`Ava `__) from either
+C++ or Python. This version currently supports emulation of quantum circuits without noise, and can return measurement samples and/or
+compute expectation values of observables.
+
+.. note::
+   In order to use the Fermioniq emulator, users must provide access credentials. These can be requested by contacting info@fermioniq.com.
+
+   The credentials must be set via two environment variables:
+   `FERMIONIQ_ACCESS_TOKEN_ID` and `FERMIONIQ_ACCESS_TOKEN_SECRET`.
+
+.. tab:: Python
+
+   The target to which quantum kernels are submitted
+   can be controlled with the ``cudaq.set_target()`` function.
+
+   .. code:: python
+
+      cudaq.set_target('fermioniq')
+
+   You will have to specify a remote configuration id for the Fermioniq backend
+   during compilation.
+
+   .. code:: python
+
+      cudaq.set_target("fermioniq", **{
+          "remote_config": remote_config_id
+      })
+
+   For a comprehensive list of all remote configurations, please contact Fermioniq directly.
+
+   When your organization requires you to define a project id, you have to specify
+   the project id during compilation.
+
+   .. code:: python
+
+      cudaq.set_target("fermioniq", **{
+          "project_id": project_id
+      })
+
+.. tab:: C++
+
+   To target quantum kernel code for execution in the Fermioniq backends,
+   pass the flag ``--target fermioniq`` to the ``nvq++`` compiler. CUDA-Q will
+   authenticate via the Fermioniq REST API using the environment variables
+   set earlier.
+
+   .. code:: bash
+
+      nvq++ --target fermioniq src.cpp ...
+
+   You will have to specify a remote configuration id for the Fermioniq backend
+   during compilation.
+
+   .. code:: bash
+
+      nvq++ --target fermioniq --fermioniq-remote-config src.cpp ...
+
+   For a comprehensive list of all remote configurations, please contact Fermioniq directly.
+
+   When your organization requires you to define a project id, you have to specify
+   the project id during compilation.
+
+   .. code:: bash
+
+      nvq++ --target fermioniq --fermioniq-project-id src.cpp ...
+
+   To specify the bond dimension, you can pass the ``fermioniq-bond-dim`` parameter.
+
+   .. code:: bash
+
+      nvq++ --target fermioniq --fermioniq-bond-dim 10 src.cpp ...
diff --git a/docs/sphinx/using/backends/simulators.rst b/docs/sphinx/using/backends/simulators.rst
index 43d115f127..7a4e465968 100644
--- a/docs/sphinx/using/backends/simulators.rst
+++ b/docs/sphinx/using/backends/simulators.rst
@@ -1,759 +1,100 @@
-CUDA-Q Simulation Backends
-*********************************
+CUDA-Q Circuit Simulation Backends
+************************************
+.. _simulators:
+
+The simulators available in CUDA-Q are grouped in the figure below. The
+following sections follow the structure of the figure and provide additional
+technical details and code examples for using each circuit simulator.
+
+.. figure:: circuitsimulators.png
+   :width: 600
+   :align: center
+
+.. list-table:: Simulators In CUDA-Q
+   :header-rows: 1
+   :widths: 20 20 25 10 10 16
+
+   * - Simulator Name
+     - Method
+     - Purpose
+     - Processor(s)
+     - Precision(s)
+     - N Qubits
+   * - `qpp-cpu`
+     - State Vector
+     - Testing and small applications
+     - CPU
+     - single
+     - < 28
+   * - `nvidia`
+     - State Vector
+     - General purpose (default)
+     - Single GPU
+     - single / double
+     - < 33 / 32 (64 GB)
+   * - `nvidia, option=mgpu`
+     - State Vector
+     - Large-scale simulation
+     - multi-GPU multi-node
+     - single / double
+     - 33+
+   * - `tensornet`
+     - Tensor Network
+     - Shallow-depth (low-entanglement) and high-width circuits
+     - multi-GPU multi-node
+     - single / double
+     - Thousands
+   * - `tensornet-mps`
+     - Matrix Product State
+     - Square-shaped circuits
+     - Single GPU
+     - single / double
+     - Hundreds
+   * - `fermioniq`
+     - Various
+     - Various
+     - Single GPU
+     - Various
+     - Various
+   * - `nvidia, option=mqpu`
+     - State Vector
+     - Asynchronous distribution across multiple simulated QPUs to speed up applications
+     - multi-GPU multi-node
+     - single / double
+     - < 33 / 32 (64 GB)
+   * - `remote-mqpu`
+     - State Vector / Tensor Network
+     - Combine `mqpu` with other backends like `tensornet` and `mgpu`
+     - varies
+     - varies
+     - varies
+   * - `density-matrix-cpu`
+     - Density Matrix
+     - Noisy simulations
+     - CPU
+     - single
+     - < 14
+   * - `stim`
+     - Stabilizer
+     - QEC simulation
+     - CPU
+     - N/A
+     - Thousands
+   * - `orca-photonics`
+     - State Vector
+     - Photonics
+     - CPU
+     - double
+     - Varies with qudit level
+
+
+
+.. toctree::
+   :maxdepth: 2
+
+   State Vector Simulators 
+   Tensor Network Simulators 
+   Multi-QPU Simulators 
+   Noisy Simulators 
+   Photonics Simulators 
-
-.. _nvidia-backend:
-
-The simulation backends that are currently available in CUDA-Q are as follows.
-
-State Vector Simulators
-==================================
-
-The :code:`nvidia` target provides a state vector simulator accelerated with
-the :code:`cuStateVec` library.
-
-The :code:`nvidia` target supports multiple configurable options.
-
-Features
-+++++++++
-
-* Floating-point precision configuration
-
-The floating point precision of the state vector data can be configured to either
-double (`fp64`) or single (`fp32`) precision. This option can be chosen for the optimal performance and accuracy.
-
-
-* Distributed simulation
-
-The :code:`nvidia` target supports distributing state vector simulations to multiple GPUs and multiple nodes (`mgpu` :ref:`distribution `)
-and multi-QPU (`mqpu` :ref:`platform `) distribution whereby each QPU is simulated via a single-GPU simulator instance.
- - -* Host CPU memory utilization - -Host CPU memory can be leveraged in addition to GPU memory to accommodate the state vector -(i.e., maximizing the number of qubits to be simulated). - -* Trajectory simulation for noisy quantum circuits - -The :code:`nvidia` target supports noisy quantum circuit simulations using quantum trajectory method across all configurations: single GPU, multi-node multi-GPU, and with host memory. -When simulating many trajectories with small state vectors, the simulation is batched for optimal performance. - -.. _cuQuantum single-GPU: - - -Single-GPU -++++++++++++++++++++++++++++++++++ - -To execute a program on the :code:`nvidia` target, use the following commands: - -.. tab:: Python - - .. code:: bash - - python3 program.py [...] --target nvidia - - The target can also be defined in the application code by calling - - .. code:: python - - cudaq.set_target('nvidia') - - If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. - -.. tab:: C++ - - .. code:: bash - - nvq++ --target nvidia program.cpp [...] -o program.x - ./program.x - -.. _nvidia-fp64-backend: - -By default, this will leverage :code:`FP32` floating point types for the simulation. To -switch to :code:`FP64`, specify the :code:`--target-option fp64` `nvq++` command line option for `C++` and `Python` or -use `cudaq.set_target('nvidia', option='fp64')` for Python in-source target modification instead. - -.. tab:: Python - - .. code:: bash - - python3 program.py [...] --target nvidia --target-option fp64 - - The precision of the :code:`nvidia` target can also be modified in the application code by calling - - .. code:: python - - cudaq.set_target('nvidia', option='fp64') - -.. tab:: C++ - - .. code:: bash - - nvq++ --target nvidia --target-option fp64 program.cpp [...] -o program.x - ./program.x - -.. note:: - - This backend requires an NVIDIA GPU and CUDA runtime libraries. If you do not have these dependencies installed, you may encounter an error stating `Invalid simulator requested`. See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies. - -In the single-GPU mode, the :code:`nvidia` target provides the following -environment variable options. Any environment variables must be set prior to -setting the target. - -.. list-table:: **Environment variable options supported in single-GPU mode** - :widths: 20 30 50 - - * - Option - - Value - - Description - * - ``CUDAQ_FUSION_MAX_QUBITS`` - - positive integer - - The max number of qubits used for gate fusion. The default value is `4`. - * - ``CUDAQ_FUSION_DIAGONAL_GATE_MAX_QUBITS`` - - integer greater than or equal to -1 - - The max number of qubits used for diagonal gate fusion. The default value is set to `-1` and the fusion size will be automatically adjusted for the better performance. If 0, the gate fusion for diagonal gates is disabled. - * - ``CUDAQ_FUSION_NUM_HOST_THREADS`` - - positive integer - - Number of CPU threads used for circuit processing. The default value is `8`. - * - ``CUDAQ_MAX_CPU_MEMORY_GB`` - - non-negative integer, or `NONE` - - CPU memory size (in GB) allowed for state-vector migration. `NONE` means unlimited (up to physical memory constraints). Default is 0GB (disabled, variable is not set to any value). - * - ``CUDAQ_MAX_GPU_MEMORY_GB`` - - positive integer, or `NONE` - - GPU memory (in GB) allowed for on-device state-vector allocation. 
As the state-vector size exceeds this limit, host memory will be utilized for migration. `NONE` means unlimited (up to physical memory constraints). This is the default. - -.. deprecated:: 0.8 - The :code:`nvidia-fp64` targets, which is equivalent setting the `fp64` option on the :code:`nvidia` target, - is deprecated and will be removed in a future release. - -.. _nvidia-mgpu-backend: - -Multi-node multi-GPU -++++++++++++++++++++++++++++++++++ - -The NVIDIA target also provides a state vector simulator accelerated with -the :code:`cuStateVec` library with support for Multi-Node, Multi-GPU distribution of the -state vector, in addition to a single GPU. - -The multi-node multi-GPU simulator expects to run within an MPI context. -To execute a program on the multi-node multi-GPU NVIDIA target, use the following commands -(adjust the value of the :code:`-np` flag as needed to reflect available GPU resources on your system): - -.. tab:: Python - - Double precision simulation: - - .. code:: bash - - mpiexec -np 2 python3 program.py [...] --target nvidia --target-option fp64,mgpu - - Single precision simulation: - - .. code:: bash - - mpiexec -np 2 python3 program.py [...] --target nvidia --target-option fp32,mgpu - - .. note:: - - If you installed CUDA-Q via :code:`pip`, you will need to install the necessary MPI dependencies separately; - please follow the instructions for installing dependencies in the `Project Description `__. - - In addition to using MPI in the simulator, you can use it in your application code by installing `mpi4py `__, and - invoking the program with the command - - .. code:: bash - - mpiexec -np 2 python3 -m mpi4py program.py [...] --target nvidia --target-option fp64,mgpu - - The target can also be defined in the application code by calling - - .. code:: python - - cudaq.set_target('nvidia', option='mgpu,fp64') - - If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. - - .. note:: - - * The order of the option settings are interchangeable. - For example, `cudaq.set_target('nvidia', option='mgpu,fp64')` is equivalent to `cudaq.set_target('nvidia', option='fp64,mgpu')`. - - * The `nvidia` target has single-precision as the default setting. Thus, using `option='mgpu'` implies that `option='mgpu,fp32'`. - -.. tab:: C++ - - Double precision simulation: - - .. code:: bash - - nvq++ --target nvidia --target-option mgpu,fp64 program.cpp [...] -o program.x - mpiexec -np 2 ./program.x - - Single precision simulation: - - .. code:: bash - - nvq++ --target nvidia --target-option mgpu,fp32 program.cpp [...] -o program.x - mpiexec -np 2 ./program.x - -.. note:: - - This backend requires an NVIDIA GPU, CUDA runtime libraries, as well as an MPI installation. If you do not have these dependencies installed, you may encounter either an error stating `invalid simulator requested` (missing CUDA libraries), or an error along the lines of `failed to launch kernel` (missing MPI installation). See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies. - - The number of processes and nodes should be always power-of-2. - - Host-device state vector migration is also supported in the multi-node multi-GPU configuration. - - -In addition to those environment variable options supported in the single-GPU mode, -the :code:`nvidia` target provides the following environment variable options particularly for -the multi-node multi-GPU configuration. 
Any environment variables must be set -prior to setting the target. - -.. list-table:: **Additional environment variable options for multi-node multi-GPU mode** - :widths: 20 30 50 - - * - Option - - Value - - Description - * - ``CUDAQ_MGPU_LIB_MPI`` - - string - - The shared library name for inter-process communication. The default value is `libmpi.so`. - * - ``CUDAQ_MGPU_COMM_PLUGIN_TYPE`` - - `AUTO`, `EXTERNAL`, `OpenMPI`, or `MPICH` - - Selecting :code:`cuStateVec` `CommPlugin` for inter-process communication. The default is `AUTO`. If `EXTERNAL` is selected, `CUDAQ_MGPU_LIB_MPI` should point to an implementation of :code:`cuStateVec` `CommPlugin` interface. - * - ``CUDAQ_MGPU_NQUBITS_THRESH`` - - positive integer - - The qubit count threshold where state vector distribution is activated. Below this threshold, simulation is performed as independent (non-distributed) tasks across all MPI processes for optimal performance. Default is 25. - * - ``CUDAQ_MGPU_FUSE`` - - positive integer - - The max number of qubits used for gate fusion. The default value is `6` if there are more than one MPI processes or `4` otherwise. - * - ``CUDAQ_MGPU_P2P_DEVICE_BITS`` - - positive integer - - Specify the number of GPUs that can communicate by using GPUDirect P2P. Default value is 0 (P2P communication is disabled). - * - ``CUDAQ_GPU_FABRIC`` - - `MNNVL`, `NVL`, or `NONE` - - Automatically set the number of P2P device bits based on the total number of processes when multi-node NVLink (`MNNVL`) is selected; or the number of processes per node when NVLink (`NVL`) is selected; or disable P2P (with `NONE`). - * - ``CUDAQ_GLOBAL_INDEX_BITS`` - - comma-separated list of positive integers - - Specify the inter-node network structure (faster to slower). For example, assuming a 8 nodes, 4 GPUs/node simulation whereby network communication is faster, this `CUDAQ_GLOBAL_INDEX_BITS` environment variable can be set to `3,2`. The first `3` represents **8** nodes with fast communication and the second `2` represents **4** 8-node groups in those total 32 nodes. Default is an empty list (no customization based on network structure of the cluster). - * - ``CUDAQ_HOST_DEVICE_MIGRATION_LEVEL`` - - positive integer - - Specify host-device memory migration w.r.t. the network structure. If provided, this setting determines the position to insert the number of migration index bits to the `CUDAQ_GLOBAL_INDEX_BITS` list. By default, if not set, the number of migration index bits (CPU-GPU data transfers) is appended to the end of the array of index bits (aka, state vector distribution scheme). This default behavior is optimized for systems with fast GPU-GPU interconnects (NVLink, InfiniBand, etc.) - -.. deprecated:: 0.8 - The :code:`nvidia-mgpu` target, which is equivalent to the multi-node multi-GPU double-precision option (`mgpu,fp64`) of the :code:`nvidia` - is deprecated and will be removed in a future release. - -The above configuration options of the :code:`nvidia` backend -can be tuned to reduce your simulation runtimes. One of the -performance improvements is to fuse multiple gates together during runtime. For -example, :code:`x(qubit0)` and :code:`x(qubit1)` can be fused together into a -single 4x4 matrix operation on the state vector rather than 2 separate 2x2 -matrix operations on the state vector. This fusion reduces memory bandwidth on -the GPU because the state vector is transferred into and out of memory fewer -times. 
By default, up to 4 gates are fused together for single-GPU simulations, -and up to 6 gates are fused together for multi-GPU simulations. The number of -gates fused can **significantly** affect performance of some circuits, so users -can override the default fusion level by setting the setting `CUDAQ_MGPU_FUSE` -environment variable to another integer value as shown below. - -.. tab:: Python - - .. code:: bash - - CUDAQ_MGPU_FUSE=5 mpiexec -np 2 python3 program.py [...] --target nvidia --target-option mgpu,fp64 - -.. tab:: C++ - - .. code:: bash - - nvq++ --target nvidia --target-option mgpu,fp64 program.cpp [...] -o program.x - CUDAQ_MGPU_FUSE=5 mpiexec -np 2 ./program.x - - -Trajectory Noisy Simulation -++++++++++++++++++++++++++++++++++ - -When a :code:`noise_model` is provided to CUDA-Q, the :code:`nvidia` target will incorporate quantum noise into the quantum circuit simulation according to the noise model specified. - - -.. tab:: Python - - .. literalinclude:: ../../snippets/python/using/backends/trajectory.py - :language: python - :start-after: [Begin Docs] - - .. code:: bash - - python3 program.py - { 00:15 01:92 10:81 11:812 } - -.. tab:: C++ - - .. literalinclude:: ../../snippets/cpp/using/backends/trajectory.cpp - :language: cpp - :start-after: [Begin Documentation] - - .. code:: bash - - nvq++ --target nvidia program.cpp [...] -o program.x - ./program.x - { 00:15 01:92 10:81 11:812 } - - -In the case of bit-string measurement sampling as in the above example, each measurement 'shot' is executed as a trajectory, whereby Kraus operators specified in the noise model are sampled. - -For observable expectation value estimation, the statistical error scales asymptotically as :math:`1/\sqrt{N_{trajectories}}`, where :math:`N_{trajectories}` is the number of trajectories. -Hence, depending on the required level of accuracy, the number of trajectories can be specified accordingly. - -.. tab:: Python - - .. literalinclude:: ../../snippets/python/using/backends/trajectory_observe.py - :language: python - :start-after: [Begin Docs] - - .. code:: bash - - python3 program.py - Noisy with 1024 trajectories = -0.810546875 - Noisy with 8192 trajectories = -0.800048828125 - -.. tab:: C++ - - .. literalinclude:: ../../snippets/cpp/using/backends/trajectory_observe.cpp - :language: cpp - :start-after: [Begin Documentation] - - .. code:: bash - - nvq++ --target nvidia program.cpp [...] -o program.x - ./program.x - Noisy with 1024 trajectories = -0.810547 - Noisy with 8192 trajectories = -0.800049 - - -The following environment variable options are applicable to the :code:`nvidia` target for trajectory noisy simulation. Any environment variables must be set -prior to setting the target. - -.. list-table:: **Additional environment variable options for trajectory simulation** - :widths: 20 30 50 - - * - Option - - Value - - Description - * - ``CUDAQ_OBSERVE_NUM_TRAJECTORIES`` - - positive integer - - The default number of trajectories for observe simulation if none was provided in the `observe` call. The default value is 1000. - * - ``CUDAQ_BATCH_SIZE`` - - positive integer or `NONE` - - The number of state vectors in the batched mode. If `NONE`, the batch size will be calculated based on the available device memory. Default is `NONE`. - * - ``CUDAQ_BATCHED_SIM_MAX_BRANCHES`` - - positive integer - - The number of trajectory branches to be tracked simultaneously in the gate fusion. Default is 16. - * - ``CUDAQ_BATCHED_SIM_MAX_QUBITS`` - - positive integer - - The max number of qubits for batching. 
If the qubit count in the circuit is more than this value, batched trajectory simulation will be disabled. The default value is 20. - * - ``CUDAQ_BATCHED_SIM_MIN_BATCH_SIZE`` - - positive integer - - The minimum number of trajectories for batching. If the number of trajectories is less than this value, batched trajectory simulation will be disabled. Default value is 4. - -.. note:: - - Batched trajectory simulation is only available on the single-GPU execution mode of the :code:`nvidia` target. - - If batched trajectory simulation is not activated, e.g., due to problem size, number of trajectories, or the nature of the circuit (dynamic circuits with mid-circuit measurements and conditional branching), the required number of trajectories will be executed sequentially. - - -.. _OpenMP CPU-only: - -OpenMP CPU-only -++++++++++++++++++++++++++++++++++ - -.. _qpp-cpu-backend: - -This target provides a state vector simulator based on the CPU-only, OpenMP threaded `Q++ `_ library. -This is the default target when running on CPU-only systems. - -To execute a program on the :code:`qpp-cpu` target even if a GPU-accelerated backend is available, -use the following commands: - -.. tab:: Python - - .. code:: bash - - python3 program.py [...] --target qpp-cpu - - The target can also be defined in the application code by calling - - .. code:: python - - cudaq.set_target('qpp-cpu') - - If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. - -.. tab:: C++ - - .. code:: bash - - nvq++ --target qpp-cpu program.cpp [...] -o program.x - ./program.x - -Tensor Network Simulators -================================== - -.. _tensor-backends: - -CUDA-Q provides a couple of tensor-network simulator targets accelerated with -the :code:`cuTensorNet` library. -These backends are available for use from both C++ and Python. - -Tensor network simulators are suitable for large-scale simulation of certain classes of quantum circuits involving many qubits beyond the memory limit of state vector based simulators. For example, computing the expectation value of a Hamiltonian via :code:`cudaq::observe` can be performed efficiently, thanks to :code:`cuTensorNet` contraction optimization capability. On the other hand, conditional circuits, i.e., those with mid-circuit measurements or reset, despite being supported by both backends, may result in poor performance. - -Multi-node multi-GPU -+++++++++++++++++++++++++++++++++++ - -The :code:`tensornet` backend represents quantum states and circuits as tensor networks in an exact form (no approximation). -Measurement samples and expectation values are computed via tensor network contractions. -This backend supports multi-node, multi-GPU distribution of tensor operations required to evaluate and simulate the circuit. - -To execute a program on the :code:`tensornet` target using a *single GPU*, use the following commands: - -.. tab:: Python - - .. code:: bash - - python3 program.py [...] --target tensornet - - The target can also be defined in the application code by calling - - .. code:: python - - cudaq.set_target('tensornet') - - If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. - -.. tab:: C++ - - .. code:: bash - - nvq++ --target tensornet program.cpp [...] 
-o program.x - ./program.x - -If you have *multiple GPUs* available on your system, you can use MPI to automatically distribute parallelization across the visible GPUs. - -.. note:: - - If you installed the CUDA-Q Python wheels, distribution across multiple GPUs is currently not supported for this backend. - We will add support for it in future releases. For more information, see this `GitHub issue `__. - -Use the following commands to enable distribution across multiple GPUs (adjust the value of the :code:`-np` flag as needed to reflect available GPU resources on your system): - -.. tab:: Python - - .. code:: bash - - mpiexec -np 2 python3 program.py [...] --target tensornet - - In addition to using MPI in the simulator, you can use it in your application code by installing `mpi4py `__, and - invoking the program with the command - - .. code:: bash - - mpiexec -np 2 python3 -m mpi4py program.py [...] --target tensornet - -.. tab:: C++ - - .. code:: bash - - nvq++ --target tensornet program.cpp [...] -o program.x - mpiexec -np 2 ./program.x - -.. note:: - - If the `CUTENSORNET_COMM_LIB` environment variable is not set, MPI parallelization on the :code:`tensornet` backend may fail. - If you are using a CUDA-Q container, this variable is pre-configured and no additional setup is needed. If you are customizing your installation or have built CUDA-Q from source, please follow the instructions for `activating the distributed interface `__ for the `cuTensorNet` library. This requires - :ref:`installing CUDA development dependencies `, and setting the `CUTENSORNET_COMM_LIB` - environment variable to the newly built `libcutensornet_distributed_interface_mpi.so` library. - -Specific aspects of the simulation can be configured by setting the following of environment variables: - -* **`CUDA_VISIBLE_DEVICES=X`**: Makes the process only see GPU X on multi-GPU nodes. Each MPI process must only see its own dedicated GPU. For example, if you run 8 MPI processes on a DGX system with 8 GPUs, each MPI process should be assigned its own dedicated GPU via `CUDA_VISIBLE_DEVICES` when invoking `mpiexec` (or `mpirun`) commands. -* **`OMP_PLACES=cores`**: Set this environment variable to improve CPU parallelization. -* **`OMP_NUM_THREADS=X`**: To enable CPU parallelization, set X to `NUMBER_OF_CORES_PER_NODE/NUMBER_OF_GPUS_PER_NODE`. -* **`CUDAQ_TENSORNET_CONTROLLED_RANK=X`**: Specify the number of controlled qubits whereby the full tensor body of the controlled gate is expanded. If the number of controlled qubits is greater than this value, the gate is applied as a controlled tensor operator to the tensor network state. Default value is 1. -* **`CUDAQ_TENSORNET_OBSERVE_CONTRACT_PATH_REUSE=X`**: Set this environment variable to `TRUE` (`ON`) or `FALSE` (`OFF`) to enable or disable contraction path reuse when computing expectation values. Default is `OFF`. - -.. note:: - - This backend requires an NVIDIA GPU and CUDA runtime libraries. - If you do not have these dependencies installed, you may encounter an error stating `Invalid simulator requested`. - See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies. - -.. note:: - - When using contraction path reuse (`CUDAQ_TENSORNET_OBSERVE_CONTRACT_PATH_REUSE=TRUE`), :code:`tensornet` backends perform a single contraction path optimization with an opaque spin operator term. This path is then used to contract all the actual terms in the spin operator, hence saving the path finding time. 
- - As we use an opaque spin operator term as a placeholder for contraction path optimization, the resulting contraction path is not as optimal as if the actual spin operator is used. - For instance, if the spin operator is sparse (only acting on a few qubits), the contraction can be significantly simplified. - -.. note:: - - :code:`tensornet` backends only return the overall expectation value for a :class:`cudaq.SpinOperator` when using the `cudaq::observe` method. - Term-by-term expectation values will not be available in the resulting `ObserveResult` object. - If needed, these values can be computed by calling `cudaq::observe` on individual terms instead. - -Matrix product state -+++++++++++++++++++++++++++++++++++ - -The :code:`tensornet-mps` backend is based on the matrix product state (MPS) representation of the state vector/wave function, exploiting the sparsity in the tensor network via tensor decomposition techniques such as QR and SVD. As such, this backend is an approximate simulator, whereby the number of singular values may be truncated to keep the MPS size tractable. -The :code:`tensornet-mps` backend only supports single-GPU simulation. Its approximate nature allows the :code:`tensornet-mps` backend to handle a large number of qubits for certain classes of quantum circuits on a relatively small memory footprint. - -To execute a program on the :code:`tensornet-mps` target, use the following commands: - -.. tab:: Python - - .. code:: bash - - python3 program.py [...] --target tensornet-mps - - The target can also be defined in the application code by calling - - .. code:: python - - cudaq.set_target('tensornet-mps') - - If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. - -.. tab:: C++ - - .. code:: bash - - nvq++ --target tensornet-mps program.cpp [...] -o program.x - ./program.x - -Specific aspects of the simulation can be configured by defining the following environment variables: - -* **`CUDAQ_MPS_MAX_BOND=X`**: The maximum number of singular values to keep (fixed extent truncation). Default: 64. -* **`CUDAQ_MPS_ABS_CUTOFF=X`**: The cutoff for the largest singular value during truncation. Eigenvalues that are smaller will be trimmed out. Default: 1e-5. -* **`CUDAQ_MPS_RELATIVE_CUTOFF=X`**: The cutoff for the maximal singular value relative to the largest eigenvalue. Eigenvalues that are smaller than this fraction of the largest singular value will be trimmed out. Default: 1e-5 -* **`CUDAQ_MPS_SVD_ALGO=X`**: The SVD algorithm to use. Valid values are: `GESVD` (QR algorithm), `GESVDJ` (Jacobi method), `GESVDP` (`polar decomposition `__), `GESVDR` (`randomized methods `__). Default: `GESVDJ`. - -.. note:: - - This backend requires an NVIDIA GPU and CUDA runtime libraries. - If you do not have these dependencies installed, you may encounter an error stating `Invalid simulator requested`. - See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies. - -.. note:: - The parallelism of Jacobi method (the default `CUDAQ_MPS_SVD_ALGO` setting) gives GPU better performance on small and medium size matrices. - If you expect a large number of singular values (e.g., increasing the `CUDAQ_MPS_MAX_BOND` setting), please adjust the `CUDAQ_MPS_SVD_ALGO` setting accordingly. - -Clifford-Only Simulator -================================== - -Stim (CPU) -++++++++++++++++++++++++++++++++++ - -.. 
_stim-backend: - -This target provides a fast simulator for circuits containing *only* Clifford -gates. Any non-Clifford gates (such as T gates and Toffoli gates) are not -supported. This simulator is based on the `Stim `_ -library. - -To execute a program on the :code:`stim` target, use the following commands: - -.. tab:: Python - - .. code:: bash - - python3 program.py [...] --target stim - - The target can also be defined in the application code by calling - - .. code:: python - - cudaq.set_target('stim') - - If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. - -.. tab:: C++ - - .. code:: bash - - nvq++ --target stim program.cpp [...] -o program.x - ./program.x - -.. note:: - CUDA-Q currently executes kernels using a "shot-by-shot" execution approach. - This allows for conditional gate execution (i.e. full control flow), but it - can be slower than executing Stim a single time and generating all the shots - from that single execution. - - -Photonics Simulators -================================== - -The :code:`orca-photonics` target provides a state vector simulator with -the :code:`Q++` library. - -The :code:`orca-photonics` target supports supports a double precision simulator that can run in multiple CPUs. - -OpenMP CPU-only -++++++++++++++++++++++++++++++++++ - -.. _qpp-cpu-photonics-backend: - -This target provides a state vector simulator based on the CPU-only, OpenMP threaded `Q++ `_ library. - -To execute a program on the :code:`orca-photonics` target, use the following commands: - -.. tab:: Python - - .. code:: bash - - python3 program.py [...] --target orca-photonics - - The target can also be defined in the application code by calling - - .. code:: python - - cudaq.set_target('orca-photonics') - - If a target is set in the application code, this target will override the :code:`--target` command line flag given during program invocation. - -.. tab:: C++ - - .. code:: bash - - nvq++ --library-mode --target orca-photonics program.cpp [...] -o program.x - - -Fermioniq -================================== - -.. _fermioniq-backend: - -`Fermioniq `__ offers a cloud-based tensor-network emulation platform, `Ava `__, -for the approximate simulation of large-scale quantum circuits beyond the memory limit of state vector and exact tensor network based methods. - -The level of approximation can be controlled by setting the bond dimension: larger values yield more accurate simulations at the expense -of slower computation time. For a detailed description of Ava users are referred to the `online documentation `__. - -Users of CUDA-Q can access a simplified version of the full Fermioniq emulator (`Ava `__) from either -C++ or Python. This version currently supports emulation of quantum circuits without noise, and can return measurement samples and/or -compute expectation values of observables. - -.. note:: - In order to use the Fermioniq emulator, users must provide access credentials. These can be requested by contacting info@fermioniq.com - - The credentials must be set via two environment variables: - `FERMIONIQ_ACCESS_TOKEN_ID` and `FERMIONIQ_ACCESS_TOKEN_SECRET`. - -.. tab:: Python - - The target to which quantum kernels are submitted - can be controlled with the ``cudaq::set_target()`` function. - - .. code:: python - - cudaq.set_target('fermioniq') - - You will have to specify a remote configuration id for the Fermioniq backend - during compilation. - - .. 
code:: python - - cudaq.set_target("fermioniq", **{ - "remote_config": remote_config_id - }) - - For a comprehensive list of all remote configurations, please contact Fermioniq directly. - - When your organization requires you to define a project id, you have to specify - the project id during compilation. - - .. code:: python - - cudaq.set_target("fermioniq", **{ - "project_id": project_id - }) - - To specify the bond dimension, you can pass the ``bond_dim`` parameter. - - .. code:: python - - cudaq.set_target("fermioniq", **{ - "bond_dim": 5 - }) - -.. tab:: C++ - - To target quantum kernel code for execution in the Fermioniq backends, - pass the flag ``--target fermioniq`` to the ``nvq++`` compiler. CUDA-Q will - authenticate via the Fermioniq REST API using the environment variables - set earlier. - - .. code:: bash - - nvq++ --target fermioniq src.cpp ... - - You will have to specify a remote configuration id for the Fermioniq backend - during compilation. - - .. code:: bash - - nvq++ --target fermioniq --fermioniq-remote-config src.cpp ... - - For a comprehensive list of all remote configurations, please contact Fermioniq directly. - - When your organization requires you to define a project id, you have to specify - the project id during compilation. - - .. code:: bash - - nvq++ --target fermioniq --fermioniq-project-id src.cpp ... - - To specify the bond dimension, you can pass the ``fermioniq-bond-dim`` parameter. - - .. code:: bash - - nvq++ --target fermioniq --fermioniq-bond-dim 10 src.cpp ... - -Default Simulator -================================== - -.. _default-simulator: - -If no explicit target is set, i.e., if the code is compiled without any :code:`--target` flags, then CUDA-Q makes a default choice for the simulator. - -If an NVIDIA GPU and CUDA runtime libraries are available, the default target is set to `nvidia`. This will utilize the :ref:`cuQuantum single-GPU state vector simulator `. -On CPU-only systems, the default target is set to `qpp-cpu` which uses the :ref:`OpenMP CPU-only simulator `. - -The default simulator can be overridden by the environment variable `CUDAQ_DEFAULT_SIMULATOR`. If no target is explicitly specified and the environment variable has a valid value, then it will take effect. -This environment variable can be set to any non-hardware backend. Any invalid value is ignored. - -For CUDA-Q Python API, the environment variable at the time when `cudaq` module is imported is relevant, not the value of the environment variable at the time when the simulator is invoked. - -For example, - -.. tab:: Python - - .. code:: bash - - CUDAQ_DEFAULT_SIMULATOR=density-matrix-cpu python3 program.py [...] - -.. tab:: C++ - - .. code:: bash - - CUDAQ_DEFAULT_SIMULATOR=density-matrix-cpu nvq++ program.cpp [...] -o program.x - ./program.x - -This will use the density matrix simulator target. - - -.. note:: - - To use targets that require an NVIDIA GPU and CUDA runtime libraries, the dependencies must be installed, else you may encounter an error stating `Invalid simulator requested`. See the section :ref:`dependencies-and-compatibility` for more information about how to install dependencies. 
\ No newline at end of file
diff --git a/docs/sphinx/using/basics/run_kernel.rst b/docs/sphinx/using/basics/run_kernel.rst
index 37a82cf3b4..0585625884 100644
--- a/docs/sphinx/using/basics/run_kernel.rst
+++ b/docs/sphinx/using/basics/run_kernel.rst
@@ -115,7 +115,7 @@ is available, for example, by choosing the target `nvidia-mqpu`:
 if you actually have multiple QPU or CPU available. Otherwise, the sampling
 will still have to execute sequentially due to resource constraints.
 
-More information about parallelizing execution can be found at :doc:`../backends/platform` page.
+More information about parallelizing execution can be found on the :ref:`mqpu-platform` page.
 
 Observe
 +++++++++
@@ -198,7 +198,7 @@ be specified to any integer.
 Similar to `sample_async` above, observe also supports asynchronous execution.
 More information about parallelizing execution can be found at
-:doc:`../backends/platform` page.
+the :ref:`mqpu-platform` page.
 
 Running on a GPU
 ++++++++++++++++++
@@ -255,4 +255,4 @@ all of the available targets and ways to accelerate kernel execution, visit the
     ./a.out
 
 seeing an output of the order:
-    ``It took 3.18988 seconds.``
\ No newline at end of file
+    ``It took 3.18988 seconds.``
diff --git a/docs/sphinx/using/examples/examples.rst b/docs/sphinx/using/examples/examples.rst
index ef3fd40e59..1d17553771 100644
--- a/docs/sphinx/using/examples/examples.rst
+++ b/docs/sphinx/using/examples/examples.rst
@@ -1,6 +1,8 @@
 *************************
 CUDA-Q by Example
 *************************
+.. _examples:
+
 
 Examples that illustrate how to use CUDA-Q for application development are available in C++ and Python.
 
@@ -10,11 +12,9 @@ Examples that illustrate how to use CUDA-Q for application development are avail
    Introduction 
    Building Kernels 
    Quantum Operations 
-   Photonic Operations 
   Measuring Kernels <../../examples/python/measuring_kernels.ipynb>
    Visualizing Kernels <../../examples/python/visualization.ipynb>
    Executing Kernels <../../examples/python/executing_kernels.ipynb>
-   Executing Photonic Kernels <../../examples/python/executing_photonic_kernels.ipynb>
    Computing Expectation Values 
    Multi-Control Synthesis 
    Multi-GPU Workflows 
diff --git a/docs/sphinx/using/examples/expectation_values.rst b/docs/sphinx/using/examples/expectation_values.rst
index 254302e5ed..86de989ff5 100644
--- a/docs/sphinx/using/examples/expectation_values.rst
+++ b/docs/sphinx/using/examples/expectation_values.rst
@@ -32,10 +32,9 @@ at an example of this:
 Parallelizing across Multiple Processors
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-:doc:`multi-processor platforms <../backends/platform>` page.
-One typical use case of :doc:`multi-processor platforms <../backends/platform>` is to distribute the
+One typical use case of :ref:`multi-processor platforms ` is to distribute the
 expectation value computations of a multi-term Hamiltonian across multiple virtual QPUs.
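+
+As a rough illustration of the idea, the hedged sketch below distributes a small multi-term Hamiltonian over virtual QPUs; the kernel and operator are placeholders, and the `execution=cudaq.parallel.thread` argument follows the usage described on the multi-GPU workflows page.
+
+.. code:: python
+
+   import cudaq
+   from cudaq import spin
+
+   # Each virtual QPU is simulated by one GPU (equivalent to the
+   # `nvidia-mqpu` platform).
+   cudaq.set_target("nvidia", option="mqpu")
+
+   @cudaq.kernel
+   def kernel():
+       qubits = cudaq.qvector(2)
+       h(qubits[0])
+       x.ctrl(qubits[0], qubits[1])
+
+   # A small multi-term Hamiltonian standing in for a larger one.
+   hamiltonian = spin.z(0) + spin.x(1) + spin.y(0) * spin.y(1)
+
+   # Distribute the per-term expectation value computations across the
+   # available virtual QPUs and combine the partial results.
+   result = cudaq.observe(kernel, hamiltonian, execution=cudaq.parallel.thread)
+   print(result.expectation())
+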
The following shows an example using the :code:`nvidia-mqpu` platform:
diff --git a/docs/sphinx/using/examples/images/mgpu.png b/docs/sphinx/using/examples/images/mgpu.png
new file mode 100644
index 0000000000..89a1d4ffcc
Binary files /dev/null and b/docs/sphinx/using/examples/images/mgpu.png differ
diff --git a/docs/sphinx/using/examples/images/mqpu.png b/docs/sphinx/using/examples/images/mqpu.png
new file mode 100644
index 0000000000..1f475d655e
Binary files /dev/null and b/docs/sphinx/using/examples/images/mqpu.png differ
diff --git a/docs/sphinx/using/examples/multi_gpu_workflows.rst b/docs/sphinx/using/examples/multi_gpu_workflows.rst
index 4cdf2504aa..02c5648043 100644
--- a/docs/sphinx/using/examples/multi_gpu_workflows.rst
+++ b/docs/sphinx/using/examples/multi_gpu_workflows.rst
@@ -1,36 +1,23 @@
 Multi-GPU Workflows
 ===================
 
-There are many backends available with CUDA-Q which enable seamless
+There are many backends available with CUDA-Q that enable seamless
 switching between GPUs, QPUs and CPUs and also allow for workflows
-involving multiple architectures working in tandem.
+involving multiple architectures working in tandem. This page will walk through the simple steps to accelerate any quantum circuit simulation with a GPU and show how to scale large simulations
+using multi-GPU multi-node capabilities.
 
-Available Targets
-------------------
-- **`qpp-cpu`**: The QPP based CPU backend which is multithreaded to
-  maximize the usage of available cores on your system.
-- **`nvidia`**: GPU-accelerated state-vector based backend which accelerates quantum circuit
-  simulation on NVIDIA GPUs powered by cuQuantum.
-
-- **`nvidia-mgpu`**: Allows for scaling circuit simulation on multiple GPUs.
-
-- **`nvidia-mqpu`**: Enables users to program workflows utilizing
-  multiple virtual quantum processors in parallel, where each QPU is simulated by the `nvidia` backend.
-
-- **`remote-mqpu`**: Enables users to program workflows utilizing
-  multiple virtual quantum processors in parallel, where the backend used to simulate each QPU is configurable.
-
-Please see :doc:`../backends/backends` for a full list of all available backends.
-Below we explore how to effectively utilize multiple CUDA-Q targets with the same GHZ state preparation code
+From CPU to GPU
+------------------
+The code below defines a kernel that creates a GHZ state using :math:`N` qubits.
+
 .. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/multiple_targets.py
    :language: python
    :start-after: [Begin state]
    :end-before: [End state]
 
-You can execute the code by running a statevector simulator on your CPU:
+You can run a state vector simulation using your CPU with the :code:`qpp-cpu` backend. This is helpful for debugging code and testing small circuits.
 
 .. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/multiple_targets.py
    :language: python
@@ -41,8 +28,7 @@ You can execute the code by running a statevector simulator on your CPU:
 
    { 00:475 11:525 }
 
-You will notice a speedup of up to **2500x** in executing the circuit below on
-NVIDIA GPUs vs CPUs:
+As the number of qubits increases to even modest size, the CPU simulation will become impractically slow. By switching to the :code:`nvidia` backend, you can accelerate the same code on a single GPU and achieve a speedup of up to **425x**. If you have a GPU available, this is the default backend to ensure maximum productivity.
 
 ..
literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/multiple_targets.py
   :language: python
@@ -53,74 +39,79 @@ NVIDIA GPUs vs CPUs:
 
    { 0000000000000000000000000:510 1111111111111111111111111:490 }
 
-If one incrementally increases the qubit count, we
-reach a limit where the memory required is beyond the capabilities of a
-single GPU: A :math:`n` qubit quantum state has :math:`2^n` complex amplitudes, each
-of which require 8 bytes of memory to store. Hence the total memory
-required to store a :math:`n` qubit quantum state is :math:`8` bytes
-:math:`\times 2^n`. For :math:`n = 30` qubits, this is roughly :math:`8`
-GB but for :math:`n = 40`, this exponentially increases to 8700 GB.
-
-Parallelization across Multiple Processors
+Pooling the memory of multiple GPUs (`mgpu`)
 ---------------------------------------------
-The ``nvidia-mgpu`` target allows for memory from additional
-GPUs to be pooled enabling qubit counts to be scaled.
-Execution on the ``nvidia-mgpu`` backend is enabled via ``mpirun``. Users
-need to create a ``.py`` file with their code and run the command below
-in terminal:
-``mpirun -np 4 python3 test.py``
+As :code:`N` gets larger, the size of the state vector that needs to be stored in memory increases exponentially.
+The state vector has :math:`2^N` elements, each a complex number requiring 8 bytes. This means a 30-qubit simulation
+requires roughly 8 GB. Adding a few more qubits will quickly exceed the memory of a single GPU. The `mgpu` backend
+solves this problem by pooling the memory of multiple GPUs across multiple nodes to perform a single state vector simulation.
+
+
+.. image:: images/mgpu.png
+
+
+If you have multiple GPUs, you can use the following command to run the simulation across :math:`n` GPUs.
+
+
+:code:`mpiexec -np n python3 program.py --target nvidia --target-option mgpu`
+
+This code will execute in an MPI context and provide additional memory to simulate much larger state vectors.
+
+You can also set :code:`cudaq.set_target('nvidia', option='mgpu')` within the file to select the target.
+
-where 4 is the number of GPUs one has access to and ``test`` is the file
-name chosen.
+Parallel execution over multiple QPUs (`mqpu`)
+------------------------------------------------
 
-The ``nvidia-mqpu`` target uses a statevector simulator to simulate execution
-on each virtual QPU.
-The ``remote-mqpu`` platform allows to freely configure what backend is used
-for each platform QPU.
-For more information about the different platform targets, please take a look at
-:doc:`../backends/platform`.
+Batching Hamiltonian Terms
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Multiple GPUs can also come in handy for cases where applications might benefit from multiple QPUs running in parallel. The `mqpu` backend uses multiple GPUs to simulate QPUs so you can accelerate quantum applications with parallelization.
+
+
+.. image:: images/mqpu.png
+
+The simplest example is Hamiltonian batching. In this case, an expectation value of a large Hamiltonian is distributed across multiple simulated QPUs, where each QPU evaluates a subset of the Hamiltonian terms.
 
-Batching Hamiltonian Terms
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Expectation value computations of multi-term Hamiltonians can be
-asynchronously processed via the ``mqpu`` platform.
..
image:: ../../applications/python/images/hsplit.png
 
-For workflows involving multiple GPUs, save the code below in a
-``filename.py`` file and execute via:
-``mpirun -np n python3 filename.py`` where ``n`` is an integer
-specifying the number of GPUs you have access to.
+The code below evaluates the expectation value of a random 100000-term Hamiltonian. A standard :code:`observe` call will run the program on a single GPU. Adding the argument :code:`execution=cudaq.parallel.thread` or :code:`execution=cudaq.parallel.mpi` will automatically distribute the Hamiltonian terms across multiple GPUs on a single node or multiple GPUs on multiple nodes, respectively.
+
+The code is executed with :code:`mpiexec -np n python3 program.py --target nvidia --target-option mqpu` where :math:`n` is the number of GPUs available.
+
+
 .. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/hamiltonian_batching.py
    :language: python
    :start-after: [Begin Docs]
    :end-before: [End Docs]
 
-.. parsed-literal::
-    mpi is initialized? True
-    rank 0 num_ranks 1
+Circuit Batching
+^^^^^^^^^^^^^^^^^
+
+A second way to leverage the `mqpu` backend is to batch circuit evaluations across multiple simulated QPUs.
 
-Circuit Batching
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Execution of parameterized circuits with different parameters can be
-executed asynchronously via the ``mqpu`` platform.
 .. image:: ../../applications/python/images/circsplit.png
+
+One example where circuit batching is helpful might be evaluating a parameterized circuit many times with different parameters. The code below prepares a list of 10000 parameter sets for a 5-qubit circuit.
+
+
 .. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
    :language: python
    :start-after: [Begin prepare]
    :end-before: [End prepare]
 
-Let's time the execution on single GPU.
+All of these circuits can be broadcast through a single :code:`observe` call and run by default on a single GPU. The code below times this entire process.
 
 .. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
    :language: python
@@ -129,9 +120,9 @@ Let's time the execution on single GPU.
 
 .. parsed-literal::
 
-    31.7 s ± 990 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+    3.185340642929077
 
-Now let's try to time multi GPU run.
+This can be greatly accelerated by batching the circuits on multiple QPUs. The first step is to slice the large list of parameters into smaller arrays. The example below divides them into four batches in preparation to run on four GPUs.
 
 .. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
    :language: python
@@ -140,8 +131,10 @@ Now let's try to time multi GPU run.
 
 .. parsed-literal::
 
-    We have 10000 parameters which we would like to execute
-    We split this into 4 batches of 2500 , 2500 , 2500 , 2500
+    There are now 10000 parameter sets split into 4 batches of 2500 , 2500 , 2500 , 2500
+
+
+As the results are computed asynchronously, they need to be stored in a list (:code:`asyncresults`) and retrieved later with the :code:`get` command. The following loops over the parameter batches, and the sets of parameters in each batch. The parameter sets are provided as inputs to :code:`observe_async` along with a :code:`qpu_id`, which designates the GPU (of the four available) that will run the computation. A speedup of up to 4x can be expected, with results varying by problem size.
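+
+Before the included snippet, a minimal self-contained sketch of this pattern may be helpful; the kernel, observable, and batch assignment below are illustrative placeholders rather than the documented snippet itself.
+
+.. code:: python
+
+   import cudaq
+   from cudaq import spin
+   import numpy as np
+
+   cudaq.set_target("nvidia", option="mqpu")
+
+   @cudaq.kernel
+   def kernel(params: list[float]):
+       qubits = cudaq.qvector(5)
+       for i in range(5):
+           ry(params[i], qubits[i])
+
+   hamiltonian = spin.z(0)  # placeholder observable
+
+   # 10000 random parameter sets, split into one batch per virtual QPU.
+   params = np.random.default_rng(13).uniform(-np.pi, np.pi, (10000, 5))
+   batches = np.split(params, 4)
+
+   # Launch the jobs asynchronously, pinning each batch to one QPU.
+   asyncresults = []
+   for i, batch in enumerate(batches):
+       for row in batch:
+           asyncresults.append(
+               cudaq.observe_async(kernel, hamiltonian, row.tolist(), qpu_id=i))
+
+   # `get` blocks until the corresponding asynchronous job has finished.
+   expectations = [res.get().expectation() for res in asyncresults]
+
..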
literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
@@ -151,5 +144,65 @@
 
 .. parsed-literal::
 
-    85.3 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
-
+    1.1754660606384277
+
+
+
+Multi-QPU + Other Backends (`remote-mqpu`)
+-------------------------------------------
+
+
+The `mqpu` backend can be extended so that each parallel simulated QPU runs a backend other than :code:`nvidia`. This provides a way to simulate larger-scale circuits and execute parallel algorithms. This is accomplished by launching remote servers, each of which simulates a QPU.
+The code example below demonstrates this using the :code:`tensornet-mps` backend, which allows sampling of a 40-qubit circuit too large for state vector simulation. In this case, the target is specified as :code:`remote-mqpu` while an additional :code:`backend` is specified for the simulator used for each QPU.
+
+The default approach uses one GPU per QPU and can both launch and close each server automatically. This is accomplished by specifying :code:`auto_launch` and :code:`url` within :code:`cudaq.set_target`. Running the script below will then sample the 40-qubit circuit using two QPUs, each running :code:`tensornet-mps`.
+
+.. code:: python
+
+   import cudaq
+
+   backend = 'tensornet-mps'
+
+   servers = '2'
+
+   @cudaq.kernel
+   def kernel(controls_count: int):
+       controls = cudaq.qvector(controls_count)
+       targets = cudaq.qvector(40)
+       # Place controls in superposition state.
+       h(controls)
+       for target in range(40):
+           x.ctrl(controls, targets[target])
+       # Measure.
+       mz(controls)
+       mz(targets)
+
+   # Set the target to execute on and query the number of QPUs in the system;
+   # The number of QPUs is equal to the number of (auto-)launched server instances.
+   cudaq.set_target("remote-mqpu",
+                    backend=backend,
+                    auto_launch=str(servers) if servers.isdigit() else "",
+                    url="" if servers.isdigit() else servers)
+   qpu_count = cudaq.get_target().num_qpus()
+   print("Number of virtual QPUs:", qpu_count)
+
+   # We will launch asynchronous sampling tasks,
+   # and will store the results as a future we can query at some later point.
+   # Each QPU (indexed by a unique Id) is associated with a remote REST server.
+   count_futures = []
+   for i in range(qpu_count):
+       result = cudaq.sample_async(kernel, i + 1, qpu_id=i)
+       count_futures.append(result)
+   print("Sampling jobs launched for asynchronous processing.")
+
+   # Go do other work, asynchronous execution of sample tasks on-going.
+   # Get the results, note future::get() will kick off a wait
+   # if the results are not yet available.
+   for idx in range(len(count_futures)):
+       counts = count_futures[idx].get()
+       print(counts)
+
+:code:`remote-mqpu` can also be used with `mgpu`, allowing each QPU to be simulated by multiple GPUs.
+This requires manual preparation of the servers; detailed instructions are in the :ref:`remote multi-QPU platform `
+section of the docs.
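+
+If you launch and manage the server instances yourself instead, the same target can be pointed at them by passing their addresses via :code:`url` rather than :code:`auto_launch`. Below is a brief sketch, assuming two manually launched `cudaq-qpud` server instances; the hostnames and ports are placeholders to match your own setup.
+
+.. code:: python
+
+   import cudaq
+
+   # Comma-separated addresses of manually launched server instances
+   # (illustrative addresses; replace with your own).
+   servers = "localhost:30001,localhost:30002"
+
+   cudaq.set_target("remote-mqpu",
+                    backend="tensornet-mps",
+                    url=servers)
+
+   # One virtual QPU per server instance.
+   qpu_count = cudaq.get_target().num_qpus()
+   print("Number of virtual QPUs:", qpu_count)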