Perform BarrierBeforeFinalMeasurements analysis in parallel #13411

mtreinish · 2024-11-08T13:07:54Z

Summary

With #13410 removing the non-threadsafe structure from our circuit
representation, we're now able to read and iterate over a DAGCircuit from
multiple threads. This commit is the first small piece of that work: it
moves the analysis portion of the BarrierBeforeFinalMeasurements pass to
execute in parallel. The pass checks every node to ensure all its
descendants are either a measure or a barrier before reaching the end of
the circuit. This commit iterates over all the nodes and does the check
in parallel.
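
As a rough illustration of the check described above, here is a minimal sketch on a toy graph. It is not the code in this PR: `ToyDag`, `Kind`, `is_final`, and `final_ops_parallel` are hypothetical names, and plain `std::thread::scope` stands in for whatever parallel iterator the pass actually uses.

```rust
use std::thread;

#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Gate,
    Measure,
    Barrier,
    Output,
}

/// Toy stand-in for a DAG: `kinds[i]` is the type of node `i`,
/// `successors[i]` lists the nodes directly after it.
struct ToyDag {
    kinds: Vec<Kind>,
    successors: Vec<Vec<usize>>,
}

impl ToyDag {
    /// True if no descendant of `node` is a gate, i.e. every descendant
    /// is a measure, a barrier, or an output node.
    fn is_final(&self, node: usize) -> bool {
        let mut stack: Vec<usize> = self.successors[node].clone();
        let mut seen = vec![false; self.kinds.len()];
        while let Some(n) = stack.pop() {
            if seen[n] {
                continue;
            }
            seen[n] = true;
            if self.kinds[n] == Kind::Gate {
                return false;
            }
            stack.extend(self.successors[n].iter().copied());
        }
        true
    }
}

/// Checks every measure/barrier node, splitting the candidates across
/// `n_threads` scoped threads. The graph is only read, never mutated.
fn final_ops_parallel(dag: &ToyDag, n_threads: usize) -> Vec<usize> {
    let candidates: Vec<usize> = (0..dag.kinds.len())
        .filter(|&n| matches!(dag.kinds[n], Kind::Measure | Kind::Barrier))
        .collect();
    let threads = n_threads.max(1);
    // Ceiling division, clamped so `chunks` never receives zero.
    let chunk = ((candidates.len() + threads - 1) / threads).max(1);
    let mut found: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = candidates
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .copied()
                        .filter(|&n| dag.is_final(n))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    found.sort_unstable();
    found
}
```

Each thread only needs shared read access to the graph, which is why the thread-safety change in #13410 is a prerequisite.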

Details and comments

TODO:

Rebase after "Use OnceLock instead of OnceCell" (#13410) merges
Benchmark to test this actually speeds the pass up
Add handling to avoid using multithreading in a multiprocessing context

OnceLock is a thread-safe version of OnceCell that enables us to use PackedInstruction from a threaded environment. There is some overhead associated with this, primarily in memory, as the OnceLock is a larger type than a OnceCell. But the tradeoff is worth it to start leveraging multithreading for circuits. Fixes Qiskit#13219
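
To make the OnceLock point concrete, here is a small sketch of a lazily cached value read from several threads. `CachedInstruction` and `labels_from_threads` are hypothetical stand-ins and not the real PackedInstruction layout. The only thing the sketch relies on is that `std::sync::OnceLock` is `Sync` (when its contents are `Send + Sync`), while `std::cell::OnceCell` is not.

```rust
use std::sync::OnceLock;
use std::thread;

/// Hypothetical stand-in for an instruction that lazily caches a derived value.
struct CachedInstruction {
    name: &'static str,
    cache: OnceLock<String>,
}

impl CachedInstruction {
    fn new(name: &'static str) -> Self {
        Self {
            name,
            cache: OnceLock::new(),
        }
    }

    /// Builds the cached value on first access; later calls reuse it.
    /// If several threads race here, only one initializer result is stored.
    fn label(&self) -> &str {
        self.cache.get_or_init(|| format!("<{}>", self.name))
    }
}

/// Reads the cached label from `n_threads` threads through a shared reference.
/// With a `OnceCell` field this would not compile, because the struct
/// would no longer be `Sync`.
fn labels_from_threads(inst: &CachedInstruction, n_threads: usize) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|_| s.spawn(move || inst.label().to_owned()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```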


coveralls · 2024-11-08T13:32:50Z

Pull Request Test Coverage Report for Build 13271408905

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.


Details

67 of 83 (80.72%) changed or added relevant lines in 2 files are covered.
10 unchanged lines in 3 files lost coverage.
Overall coverage increased (+0.002%) to 88.317%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
crates/circuit/src/dag_circuit.rs	5	11	45.45%
crates/accelerate/src/barrier_before_final_measurement.rs	62	72	86.11%

Files with Coverage Reduction	New Missed Lines	%
crates/accelerate/src/unitary_synthesis.rs	1	93.29%
crates/qasm2/src/lex.rs	4	92.98%
qiskit/circuit/delay.py	5	75.71%

Totals
Change from base Build 13268699272:	0.002%
Covered Lines:	78853
Relevant Lines:	89284

💛 - Coveralls

This commit updates the logic in the pass to simplify the search algorithm and improve its overall efficiency. Previously the pass would search the entire DAG for all barriers and measurements and then do a BFS from each found node to check that all descendants are either barriers or measurements. Then, with the set of nodes matching that condition, a full topological sort of the DAG was run, and the topologically ordered nodes were filtered for the matching set. That sorted set was then used for filtering. This commit refactors this to do a reverse search from the output nodes, which reduces the complexity of the algorithm. The new algorithm is also conducive to parallel execution because it does a search starting from each qubit's output node. In a test with quantum volume circuits from 10 to 1000 qubits, scaling depth linearly with the number of qubits, the crossover point between the parallel and serial implementations was found at around 150 qubits.
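
One way to picture a reverse search from the output nodes is the serial sketch below. It is an assumption about the general shape of such a search, not the code from this commit: `WireDag`, `Op`, and `final_ops_reverse` are hypothetical, and the sketch walks backward from all outputs in a single queue rather than launching one search per qubit.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, PartialEq)]
enum Op {
    Gate,
    Measure,
    Barrier,
    Output,
}

/// Toy DAG: `ops[i]` is the node type; `succ[i]` and `pred[i]` are its edges.
struct WireDag {
    ops: Vec<Op>,
    succ: Vec<Vec<usize>>,
    pred: Vec<Vec<usize>>,
}

/// Walks backward from the output nodes. A measure or barrier is final
/// once every one of its successors is an output or an already-final node,
/// so only the tail of the circuit is ever visited.
fn final_ops_reverse(dag: &WireDag) -> Vec<usize> {
    // Successors of each node that are not yet known to be final.
    let mut pending: Vec<usize> = dag.succ.iter().map(|s| s.len()).collect();
    let mut queue: VecDeque<usize> = (0..dag.ops.len())
        .filter(|&n| dag.ops[n] == Op::Output)
        .collect();
    let mut found = Vec::new();
    while let Some(node) = queue.pop_front() {
        for &p in &dag.pred[node] {
            if !matches!(dag.ops[p], Op::Measure | Op::Barrier) {
                // A gate stops the backward walk on this path.
                continue;
            }
            pending[p] -= 1;
            if pending[p] == 0 {
                found.push(p);
                queue.push_back(p);
            }
        }
    }
    found.sort_unstable();
    found
}
```

Unlike a BFS from every measure and barrier followed by a topological sort, this visits only nodes reachable backward from the outputs through final operations.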

qiskit-bot · 2025-02-04T02:47:10Z

One or more of the following people are relevant to this code:

@Qiskit/terra-core

mtreinish · 2025-02-04T02:52:30Z

I ran a benchmark with a quantum volume circuit, sweeping from 10 to 1000 qubits/depth, and ran the pass on it with 1.3.2, with the pass using a serial iterator, and with the pass using a parallel iterator:

Based on these results I went with a parallel threshold of 150 qubits in b89c826, so that we use a parallel iterator only where the performance is better. This might vary in other environments though, so it would be useful for someone else to test this; we can adjust the value if it turns out not to be a good one.
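
The threshold idea can be sketched as a simple dispatch between a serial and a threaded loop. This is only an illustration under assumptions: `PARALLEL_THRESHOLD`, `search_all_wires`, and the `search_wire` callback are hypothetical names, scoped threads stand in for the parallel iterator, and whether the real comparison is `>` or `>=` is not stated in this thread.

```rust
use std::thread;

/// Assumed crossover point, taken from the 150-qubit figure quoted above.
const PARALLEL_THRESHOLD: usize = 150;

/// Runs `search_wire` once per qubit, serially for small circuits and
/// across `n_threads` scoped threads otherwise. Results stay in qubit order.
fn search_all_wires<F>(num_qubits: usize, n_threads: usize, search_wire: F) -> Vec<usize>
where
    F: Fn(usize) -> usize + Sync,
{
    if num_qubits < PARALLEL_THRESHOLD || n_threads <= 1 {
        // Below the threshold the thread overhead outweighs the gain.
        return (0..num_qubits).map(&search_wire).collect();
    }
    let wires: Vec<usize> = (0..num_qubits).collect();
    // Ceiling division; never zero because num_qubits >= PARALLEL_THRESHOLD.
    let chunk = (num_qubits + n_threads - 1) / n_threads;
    let search_wire = &search_wire;
    thread::scope(|s| {
        let handles: Vec<_> = wires
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || part.iter().map(|&q| search_wire(q)).collect::<Vec<_>>())
            })
            .collect();
        // Joining in spawn order keeps the output in qubit order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

Both branches return the same result, so the threshold is purely a performance knob and can be tuned per environment without changing behavior.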

raynelfss

This looks straightforward to me and easy to understand. I just left a couple of comments related to the documentation of the code.

raynelfss · 2025-02-11T14:59:03Z