Use OnceLock instead of OnceCell #13410

Merged (5 commits) on Nov 11, 2024

mtreinish
Member

Summary

OnceLock is a thread-safe version of OnceCell that enables us to use PackedInstruction from a threaded environment. There is some overhead associated with this, primarily in memory, as OnceLock is a larger type than OnceCell. But the tradeoff is worth it to start leveraging multithreading for circuits.
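
As a rough illustration of the difference, here is a minimal sketch using only the standard library. `CachedInstruction`, its `cached_len` field, and `read_from_threads` are hypothetical names invented for this example; this is not the actual `PackedInstruction` definition. The sizes printed at the end depend on the platform and on the stored type, so treat them as something to measure rather than a fixed number.

```rust
use std::sync::OnceLock;
use std::thread;

// Hypothetical stand-in for a struct with a lazily computed, cached field.
// This is NOT Qiskit's PackedInstruction, only an illustration.
struct CachedInstruction {
    name: String,
    // If this were std::cell::OnceCell<usize>, the struct would not be Sync
    // and could not be shared by reference across threads.
    cached_len: OnceLock<usize>,
}

impl CachedInstruction {
    fn new(name: &str) -> Self {
        Self {
            name: name.to_string(),
            cached_len: OnceLock::new(),
        }
    }

    // Computes the value on first access; later calls return the cached value.
    fn len(&self) -> usize {
        *self.cached_len.get_or_init(|| self.name.len())
    }
}

// Reads the cached value from several threads through one shared reference.
fn read_from_threads(inst: &CachedInstruction, n_threads: usize) -> Vec<usize> {
    thread::scope(|s| {
        let mut handles = Vec::with_capacity(n_threads);
        for _ in 0..n_threads {
            handles.push(s.spawn(|| inst.len()));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let inst = CachedInstruction::new("cx");
    println!("{:?}", read_from_threads(&inst, 4));

    // Memory overhead varies by platform and payload type, so print it.
    println!(
        "OnceCell<usize>: {} bytes",
        std::mem::size_of::<std::cell::OnceCell<usize>>()
    );
    println!(
        "OnceLock<usize>: {} bytes",
        std::mem::size_of::<OnceLock<usize>>()
    );
}
```

Swapping the field to `std::cell::OnceCell<usize>` should make `read_from_threads` fail to compile, because the struct is then no longer `Sync`.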

Details and comments

Fixes #13219

@mtreinish added the performance, Changelog: None, and Rust labels Nov 8, 2024
@mtreinish added this to the 2.0.0 milestone Nov 8, 2024
@mtreinish requested a review from a team as a code owner November 8, 2024 10:19
@qiskit-bot
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Nov 8, 2024
With Qiskit#13410 removing the non-threadsafe structure from our circuit
representation, we're now able to read and iterate over a DAGCircuit from
multiple threads. This commit is the first small piece of doing this: it
moves the analysis portion of the BarrierBeforeFinalMeasurements pass to
execute in parallel. The pass checks every node to ensure all its
descendants are either a measure or a barrier before reaching the end of
the circuit. This commit iterates over all the nodes and does the check
in parallel.
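
To make the check described in that commit message concrete, here is a minimal sketch using only `std::thread::scope`. The `Dag` and `Kind` types and the functions `only_final_ops_after`, `final_measures`, and `example_dag` are hypothetical simplifications made up for this example. It is not the code from the referenced commit and does not use the real `DAGCircuit` API.

```rust
use std::thread;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind {
    Gate,
    Measure,
    Barrier,
}

// Hypothetical, simplified DAG: `succ[i]` lists the direct successors of node i.
struct Dag {
    kinds: Vec<Kind>,
    succ: Vec<Vec<usize>>,
}

impl Dag {
    // True if every descendant of `node` is a measure or a barrier.
    fn only_final_ops_after(&self, node: usize) -> bool {
        let mut stack: Vec<usize> = self.succ[node].clone();
        let mut seen = vec![false; self.kinds.len()];
        while let Some(n) = stack.pop() {
            if seen[n] {
                continue;
            }
            seen[n] = true;
            if self.kinds[n] == Kind::Gate {
                return false;
            }
            stack.extend(self.succ[n].iter().copied());
        }
        true
    }
}

// Returns, in ascending order, the measure nodes that are followed only by
// measures or barriers. Each thread checks one contiguous chunk of node
// indices and only reads from the shared DAG.
fn final_measures(dag: &Dag, n_threads: usize) -> Vec<usize> {
    let n = dag.kinds.len();
    let indices: Vec<usize> = (0..n).collect();
    let t = n_threads.max(1);
    let chunk = ((n + t - 1) / t).max(1);
    thread::scope(|s| {
        let mut handles = Vec::new();
        for part in indices.chunks(chunk) {
            handles.push(s.spawn(move || {
                part.iter()
                    .copied()
                    .filter(|&i| dag.kinds[i] == Kind::Measure && dag.only_final_ops_after(i))
                    .collect::<Vec<usize>>()
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

// Two chains: gate -> measure -> gate, and gate -> measure -> barrier -> measure.
fn example_dag() -> Dag {
    Dag {
        kinds: vec![
            Kind::Gate,
            Kind::Measure,
            Kind::Gate,
            Kind::Gate,
            Kind::Measure,
            Kind::Barrier,
            Kind::Measure,
        ],
        succ: vec![vec![1], vec![2], vec![], vec![4], vec![5], vec![6], vec![]],
    }
}

fn main() {
    let dag = example_dag();
    println!("{:?}", final_measures(&dag, 3));
}
```

In `example_dag`, the measure at index 1 is followed by a gate, so it is not reported, while the measures at indices 4 and 6 are.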
@coveralls

coveralls commented Nov 8, 2024

Pull Request Test Coverage Report for Build 11766429415

Details

  • 19 of 23 (82.61%) changed or added relevant lines in 7 files are covered.
  • 5 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.008%) to 88.935%

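| Changes Missing Coverage          | Covered Lines | Changed/Added Lines | %     |
|-----------------------------------|---------------|---------------------|-------|
| crates/circuit/src/dag_node.rs    | 1             | 2                   | 50.0% |
| crates/circuit/src/dag_circuit.rs | 5             | 8                   | 62.5% |

| Files with Coverage Reduction | New Missed Lines | %      |
|-------------------------------|------------------|--------|
| crates/qasm2/src/lex.rs       | 5                | 92.48% |

| Totals                             |        |
|------------------------------------|--------|
| Change from base Build 11749692210 | 0.008% |
| Covered Lines                      | 79065  |
| Relevant Lines                     | 88902  |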

💛 - Coveralls

Contributor

@kevinhartman kevinhartman left a comment

Seems like a simple enough move. Have you tried running any benchmarks?

@mtreinish
Member Author

After 8bc4844 merged, fixing the asv tests, I did a quick asv run, which showed basically no change:

Benchmarks that have stayed the same:

| Change   | Before [8bc48442] <once-lock^2>   | After [487e1bec] <once-lock>   | Ratio   | Benchmark (Parameter)                                                                                           |
|----------|-----------------------------------|--------------------------------|---------|-----------------------------------------------------------------------------------------------------------------|
|          | 113±5μs                           | 246±70μs                       | ~2.17   | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 8192)                                        |
|          | 452±200μs                         | 827±300μs                      | ~1.83   | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 32768)                                       |
|          | 486±200μs                         | 799±200μs                      | ~1.64   | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 32768)                                       |
|          | 665±300μs                         | 962±200μs                      | ~1.45   | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 32768)                                       |
|          | 461±200μs                         | 614±200μs                      | ~1.33   | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 32768)                                       |
|          | 34.2±8μs                          | 45.2±20μs                      | ~1.32   | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 2048)                                        |
|          | 194±50μs                          | 248±50μs                       | ~1.28   | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8192)                                        |
|          | 5.89±0.2ms                        | 6.88±0.4ms                     | ~1.17   | manipulate.TestCircuitManipulate.time_DTC100_twirling                                                           |
|          | 3.42±0.03s                        | 3.97±0.2s                      | ~1.16   | utility_scale.UtilityScaleBenchmarks.time_circSU2('cz')                                                         |
|          | 222±2μs                           | 252±70μs                       | ~1.14   | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 8192)                                        |
|          | 2.26±0.5ms                        | 2.50±0.6ms                     | ~1.11   | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 131072)                                     |
|          | 172±60μs                          | 189±60μs                       | ~1.10   | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8192)                                        |
|          | 2.28±0.4ms                        | 2.51±0.5ms                     | 1.10    | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 131072)                                     |
|          | 132±10μs                          | 144±7μs                        | 1.09    | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8192)                                       |
|          | 2.34±0.1ms                        | 2.55±0.07ms                    | 1.09    | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 131072)                                      |
|          | 2.34±0.2ms                        | 2.53±0.04ms                    | 1.08    | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 131072)                                      |
|          | 41.9±1μs                          | 44.9±1μs                       | 1.07    | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 2048)                                       |
|          | 782±10μs                          | 833±40μs                       | 1.07    | circuit_construction.CliffordSynthesis.time_clifford_synthesis(10)                                              |
|          | 2.47±0.05ms                       | 2.62±0.05ms                    | 1.06    | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 131072)                                      |
|          | 2.37±0.1ms                        | 2.52±0.06ms                    | 1.06    | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 131072)                                      |
|          | 1.13±0.03ms                       | 1.19±0.03ms                    | 1.06    | circuit_construction.CircuitConstructionBench.time_circuit_extend(1, 8192)                                      |
|          | 5.78±0.2ms                        | 6.12±0.2ms                     | 1.06    | circuit_construction.CircuitConstructionBench.time_circuit_extend(14, 32768)                                    |
|          | 375±10μs                          | 393±2μs                        | 1.05    | circuit_construction.CircuitConstructionBench.time_circuit_extend(14, 2048)                                     |
|          | 343±7μs                           | 361±2μs                        | 1.05    | circuit_construction.CircuitConstructionBench.time_circuit_extend(2, 2048)                                      |
|          | 1.39±0.01ms                       | 1.46±0.02ms                    | 1.05    | circuit_construction.CircuitConstructionBench.time_circuit_extend(2, 8192)                                      |
|          | 108±6ms                           | 113±20ms                       | 1.05    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(100, 150)                              |
|          | 34.4±2μs                          | 35.6±1μs                       | 1.04    | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 2048)                                        |
|          | 462±30μs                          | 481±40μs                       | 1.04    | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 32768)                                      |
|          | 4.52±0.2ms                        | 4.68±0.3ms                     | 1.04    | circuit_construction.CircuitConstructionBench.time_circuit_extend(1, 32768)                                     |
|          | 380±2μs                           | 397±5μs                        | 1.04    | circuit_construction.CircuitConstructionBench.time_circuit_extend(20, 2048)                                     |
|          | 1.46±0.03ms                       | 1.51±0.04ms                    | 1.04    | circuit_construction.CircuitConstructionBench.time_circuit_extend(5, 8192)                                      |
|          | 5.65±0.08ms                       | 5.86±0.1ms                     | 1.04    | circuit_construction.CircuitConstructionBench.time_circuit_extend(8, 32768)                                     |
|          | 3.27±0.08ms                       | 3.39±0.05ms                    | 1.04    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(50, 10)                                |
|          | 29.4±0.2ms                        | 30.5±0.3ms                     | 1.04    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(50, 150)                                    |
|          | 296±3ms                           | 307±1ms                        | 1.04    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 131072, 128)                            |
|          | 39.2±3μs                          | 40.5±7μs                       | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 2048)                                        |
|          | 37.1±0.5μs                        | 38.1±0.3μs                     | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(14, 128)                                      |
|          | 28.9±0.3μs                        | 29.8±0.09μs                    | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(2, 128)                                       |
|          | 5.32±0.1ms                        | 5.50±0.07ms                    | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(2, 32768)                                     |
|          | 44.2±0.3μs                        | 45.4±0.4μs                     | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(20, 128)                                      |
|          | 22.3±1ms                          | 23.0±1ms                       | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(20, 131072)                                   |
|          | 22.9±0.3μs                        | 23.7±0.4μs                     | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(20, 8)                                        |
|          | 364±7μs                           | 375±7μs                        | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(5, 2048)                                      |
|          | 1.48±0.04ms                       | 1.53±0.05ms                    | 1.03    | circuit_construction.CircuitConstructionBench.time_circuit_extend(8, 8192)                                      |
|          | 4.78±0.04ms                       | 4.94±0.1ms                     | 1.03    | circuit_construction.CliffordSynthesis.time_clifford_synthesis(50)                                              |
|          | 8.91±0.3ms                        | 9.22±0.2ms                     | 1.03    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(100, 10)                               |
|          | 34.6±0.9ms                        | 35.5±0.7ms                     | 1.03    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(50, 150)                               |
|          | 604±4ms                           | 624±2ms                        | 1.03    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 131072, 8)      |
|          | 9.40±0.06ms                       | 9.65±0.06ms                    | 1.03    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('ecr')                                                |
|          | 484±4ms                           | 495±3ms                        | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_construction(14, 131072)                             |
|          | 481±1ms                           | 490±7ms                        | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_construction(2, 131072)                              |
|          | 123±0.3ms                         | 125±0.6ms                      | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_construction(20, 32768)                              |
|          | 30.9±0.1ms                        | 31.5±0.05ms                    | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_construction(20, 8192)                               |
|          | 30.8±0.2ms                        | 31.4±0.1ms                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_construction(5, 8192)                                |
|          | 572±3μs                           | 585±2μs                        | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_construction(8, 128)                                 |
|          | 123±0.5ms                         | 125±1ms                        | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_construction(8, 32768)                               |
|          | 11.9±0.05μs                       | 12.0±0.1μs                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 128)                                         |
|          | 12.8±0.4μs                        | 13.1±0.2μs                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 128)                                         |
|          | 25.3±0.05μs                       | 25.9±0.1μs                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(1, 128)                                       |
|          | 293±3μs                           | 297±8μs                        | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(1, 2048)                                      |
|          | 1.53±0.04ms                       | 1.56±0.04ms                    | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(14, 8192)                                     |
|          | 8.80±0.1μs                        | 8.93±0.06μs                    | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(2, 8)                                         |
|          | 5.94±0.1ms                        | 6.04±0.1ms                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(20, 32768)                                    |
|          | 1.48±0.03ms                       | 1.51±0.06ms                    | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(20, 8192)                                     |
|          | 32.6±0.4μs                        | 33.4±0.2μs                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(5, 128)                                       |
|          | 23.1±0.8ms                        | 23.6±0.6ms                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(5, 131072)                                    |
|          | 5.65±0.09ms                       | 5.78±0.1ms                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(5, 32768)                                     |
|          | 10.1±0.07μs                       | 10.3±0.07μs                    | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(5, 8)                                         |
|          | 22.3±0.8ms                        | 22.7±0.8ms                     | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(8, 131072)                                    |
|          | 371±2μs                           | 380±9μs                        | 1.02    | circuit_construction.CircuitConstructionBench.time_circuit_extend(8, 2048)                                      |
|          | 12.2±0.4ms                        | 12.5±0.05ms                    | 1.02    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(50, 50)                                |
|          | 599±3ms                           | 613±5ms                        | 1.02    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 131072, 131072)                         |
|          | 145±1ms                           | 147±2ms                        | 1.02    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 32768, 32768)                           |
|          | 31.4±0.5ms                        | 32.0±1ms                       | 1.02    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 8192, 8192)                             |
|          | 660±4ms                           | 675±7ms                        | 1.02    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 131072, 8192)   |
|          | 309±2ms                           | 314±3ms                        | 1.02    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 32768, 32768)   |
|          | 38.5±0.2ms                        | 39.5±0.3ms                     | 1.02    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 8192, 128)      |
|          | 46.9±0.1ms                        | 47.7±0.2ms                     | 1.02    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 8192, 2048)     |
|          | 3.59±0.1s                         | 3.67±0.1s                      | 1.02    | utility_scale.UtilityScaleBenchmarks.time_circSU2('cx')                                                         |
|          | 3.60±0.08s                        | 3.65±0.2s                      | 1.02    | utility_scale.UtilityScaleBenchmarks.time_circSU2('ecr')                                                        |
|          | 100±2ms                           | 102±2ms                        | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cx')                                                  |
|          | 32.8±0.2ms                        | 33.3±0.1ms                     | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cx')                                    |
|          | 32.6±0.5ms                        | 33.2±0.3ms                     | 1.02    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('ecr')                                   |
|          | 153±1ms                           | 156±0.9ms                      | 1.02    | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')                                               |
|          | 7.29±0.01ms                       | 7.33±0.1ms                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(1, 2048)                                |
|          | 580±5μs                           | 587±1μs                        | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(14, 128)                                |
|          | 158±0.5μs                         | 160±1μs                        | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(14, 8)                                  |
|          | 30.8±0.08ms                       | 31.2±0.3ms                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(14, 8192)                               |
|          | 30.4±0.3ms                        | 30.8±0.2ms                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(2, 8192)                                |
|          | 674±2μs                           | 682±2μs                        | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(20, 128)                                |
|          | 486±3ms                           | 490±7ms                        | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(20, 131072)                             |
|          | 7.83±0.04ms                       | 7.93±0.04ms                    | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(20, 2048)                               |
|          | 218±0.6μs                         | 221±1μs                        | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(20, 8)                                  |
|          | 7.80±0.04ms                       | 7.90±0.05ms                    | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(5, 2048)                                |
|          | 31.2±0.3ms                        | 31.5±0.2ms                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_construction(8, 8192)                                |
|          | 10.4±0.05μs                       | 10.6±0.3μs                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 128)                                         |
|          | 8.04±0.03μs                       | 8.09±0.01μs                    | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_extend(1, 8)                                         |
|          | 17.9±0.4μs                        | 18.1±0.6μs                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_extend(14, 8)                                        |
|          | 22.1±0.1ms                        | 22.5±0.2ms                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_extend(2, 131072)                                    |
|          | 34.3±0.3μs                        | 34.8±0.2μs                     | 1.01    | circuit_construction.CircuitConstructionBench.time_circuit_extend(8, 128)                                       |
|          | 25.5±0.2ms                        | 25.6±0.4ms                     | 1.01    | circuit_construction.CliffordSynthesis.time_clifford_synthesis(100)                                             |
|          | 2.25±0.01ms                       | 2.26±0.04ms                    | 1.01    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(10, 50)                                |
|          | 2.63±0.08ms                       | 2.65±0.03ms                    | 1.01    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(10, 50)                                     |
|          | 2.92±0.06ms                       | 2.94±0.06ms                    | 1.01    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(50, 10)                                     |
|          | 1.68±0.03ms                       | 1.69±0.02ms                    | 1.01    | circuit_construction.ParameterizedCirc.time_param_circSU2_100_build(10)                                         |
|          | 302±4ms                           | 306±1ms                        | 1.01    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 131072, 8)                              |
|          | 316±4ms                           | 320±2ms                        | 1.01    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 131072, 8192)                           |
|          | 128±2μs                           | 129±1μs                        | 1.01    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 8, 8)                                   |
|          | 18.5±0.2ms                        | 18.6±0.08ms                    | 1.01    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 8192, 128)                              |
|          | 1.35±0.01ms                       | 1.36±0.01ms                    | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 128, 128)       |
|          | 857±4μs                           | 863±2μs                        | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 128, 8)         |
|          | 1.23±0.01s                        | 1.24±0.01s                     | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 131072, 131072) |
|          | 18.8±0.1ms                        | 18.9±0.2ms                     | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 2048, 2048)     |
|          | 9.75±0.05ms                       | 9.82±0.07ms                    | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 2048, 8)        |
|          | 156±0.5ms                         | 157±2ms                        | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 32768, 128)     |
|          | 161±2ms                           | 163±2ms                        | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 32768, 2048)    |
|          | 153±0.4ms                         | 154±1ms                        | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 32768, 8)       |
|          | 194±2ms                           | 195±1ms                        | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 32768, 8192)    |
|          | 38.4±0.2ms                        | 38.6±0.2ms                     | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 8192, 8)        |
|          | 75.4±0.8ms                        | 75.9±0.8ms                     | 1.01    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 8192, 8192)     |
|          | 938±6ms                           | 947±4ms                        | 1.01    | circuit_construction.QasmImport.time_QV100_qasm2_import                                                         |
|          | 9.41±0.05ms                       | 9.54±0.03ms                    | 1.01    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cx')                                                 |
|          | 9.48±0.05ms                       | 9.55±0.06ms                    | 1.01    | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100('cz')                                                 |
|          | 32.9±0.2ms                        | 33.3±0.2ms                     | 1.01    | utility_scale.UtilityScaleBenchmarks.time_parse_square_heisenberg_n100('cz')                                    |
|          | 623±5ms                           | 629±3ms                        | 1.01    | utility_scale.UtilityScaleBenchmarks.time_qft('cz')                                                             |
|          | 634±2ms                           | 642±2ms                        | 1.01    | utility_scale.UtilityScaleBenchmarks.time_qft('ecr')                                                            |
|          | 812±3ms                           | 817±6ms                        | 1.01    | utility_scale.UtilityScaleBenchmarks.time_qv('ecr')                                                             |
|          | 135±0.7ms                         | 136±0.5ms                      | 1.01    | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')                                               |
|          | 155±2ms                           | 157±1ms                        | 1.01    | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr')                                              |
|          | 464±2ms                           | 463±2ms                        | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(1, 131072)                              |
|          | 29.1±0.2ms                        | 29.3±0.3ms                     | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(1, 8192)                                |
|          | 7.81±0.05ms                       | 7.81±0.05ms                    | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(14, 2048)                               |
|          | 124±0.6ms                         | 123±1ms                        | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(14, 32768)                              |
|          | 529±3μs                           | 530±0.8μs                      | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(2, 128)                                 |
|          | 7.80±0.04ms                       | 7.81±0.07ms                    | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(2, 2048)                                |
|          | 122±0.5ms                         | 123±1ms                        | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(2, 32768)                               |
|          | 490±2ms                           | 490±1ms                        | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(5, 131072)                              |
|          | 123±1ms                           | 123±1ms                        | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(5, 32768)                               |
|          | 488±2ms                           | 489±3ms                        | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(8, 131072)                              |
|          | 7.93±0.06ms                       | 7.97±0.05ms                    | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_construction(8, 2048)                                |
|          | 15.5±0.5μs                        | 15.4±0.8μs                     | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 128)                                        |
|          | 13.8±0.3μs                        | 13.8±0.8μs                     | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_copy(14, 8)                                          |
|          | 10.9±0.05μs                       | 10.8±0.6μs                     | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 128)                                         |
|          | 9.30±0.05μs                       | 9.29±0.4μs                     | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_copy(2, 8)                                           |
|          | 10.2±0.1μs                        | 10.2±0.05μs                    | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 8)                                           |
|          | 19.6±0.2ms                        | 19.6±0.3ms                     | 1.00    | circuit_construction.CircuitConstructionBench.time_circuit_extend(1, 131072)                                    |
|          | 237±1μs                           | 237±2μs                        | 1.00    | circuit_construction.MultiControl.time_multi_control_circuit(10)                                                |
|          | 679±6μs                           | 677±9μs                        | 1.00    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(10, 10)                                |
|          | 6.00±0.01ms                       | 6.00±0.06ms                    | 1.00    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(10, 150)                               |
|          | 726±30μs                          | 728±40μs                       | 1.00    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(10, 10)                                     |
|          | 5.12±0.1ms                        | 5.11±0.1ms                     | 1.00    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(100, 10)                                    |
|          | 60.9±0.5ms                        | 61.1±0.5ms                     | 1.00    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(100, 150)                                   |
|          | 2.15±0.03ms                       | 2.14±0.02ms                    | 1.00    | circuit_construction.ParameterizedCirc.time_param_circSU2_100_build(16)                                         |
|          | 1.22±0ms                          | 1.22±0ms                       | 1.00    | circuit_construction.ParameterizedCirc.time_param_circSU2_100_build(5)                                          |
|          | 542±5μs                           | 544±5μs                        | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 128, 128)                               |
|          | 392±7μs                           | 392±1μs                        | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 128, 8)                                 |
|          | 372±4ms                           | 371±3ms                        | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 131072, 32768)                          |
|          | 4.75±0.01ms                       | 4.77±0.01ms                    | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 2048, 128)                              |
|          | 4.73±0.02ms                       | 4.72±0.02ms                    | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 2048, 8)                                |
|          | 78.0±1ms                          | 78.1±0.6ms                     | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 32768, 2048)                            |
|          | 73.3±0.6ms                        | 73.4±0.6ms                     | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 32768, 8)                               |
|          | 91.1±2ms                          | 91.1±0.7ms                     | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 32768, 8192)                            |
|          | 18.4±0.3ms                        | 18.4±0.2ms                     | 1.00    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 8192, 8)                                |
|          | 617±7ms                           | 616±10ms                       | 1.00    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 131072, 128)    |
|          | 628±8ms                           | 629±6ms                        | 1.00    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 131072, 2048)   |
|          | 797±6ms                           | 801±4ms                        | 1.00    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 131072, 32768)  |
|          | 10.3±0.05ms                       | 10.3±0.08ms                    | 1.00    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 2048, 128)      |
|          | 304±3μs                           | 303±0.7μs                      | 1.00    | circuit_construction.ParameterizedCircuitConstructionBench.time_build_parameterized_circuit(20, 8, 8)           |
|          | 139±2ms                           | 139±0.5ms                      | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cx')                                                          |
|          | 143±0.5ms                         | 143±0.8ms                      | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bv_100('cz')                                                          |
|          | 9.45±0.04ms                       | 9.42±0.06ms                    | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cx')                                                          |
|          | 9.50±0.04ms                       | 9.45±0.09ms                    | 1.00    | utility_scale.UtilityScaleBenchmarks.time_bvlike('cz')                                                          |
|          | 101±2ms                           | 102±0.9ms                      | 1.00    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('cz')                                                  |
|          | 101±1ms                           | 101±0.4ms                      | 1.00    | utility_scale.UtilityScaleBenchmarks.time_parse_qft_n100('ecr')                                                 |
|          | 272±2ms                           | 272±0.5ms                      | 1.00    | utility_scale.UtilityScaleBenchmarks.time_qaoa('cx')                                                            |
|          | 359±2ms                           | 360±1ms                        | 1.00    | utility_scale.UtilityScaleBenchmarks.time_qaoa('cz')                                                            |
|          | 495±2ms                           | 495±2ms                        | 1.00    | utility_scale.UtilityScaleBenchmarks.time_qft('cx')                                                             |
|          | 835±8ms                           | 839±5ms                        | 1.00    | utility_scale.UtilityScaleBenchmarks.time_qv('cz')                                                              |
|          | 390                               | 390                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cx')                                                   |
|          | 407                               | 407                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('cz')                                                   |
|          | 407                               | 407                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_bv_100_depth('ecr')                                                  |
|          | 300                               | 300                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cx')                                                  |
|          | 300                               | 300                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('cz')                                                  |
|          | 300                               | 300                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_circSU2_depth('ecr')                                                 |
|          | 1607                              | 1607                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cx')                                                     |
|          | 1622                              | 1622                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('cz')                                                     |
|          | 1622                              | 1622                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qaoa_depth('ecr')                                                    |
|          | 1954                              | 1954                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('cx')                                                      |
|          | 1954                              | 1954                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('cz')                                                      |
|          | 1954                              | 1954                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qft_depth('ecr')                                                     |
|          | 2709                              | 2709                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('cx')                                                       |
|          | 2709                              | 2709                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('cz')                                                       |
|          | 2709                              | 2709                           | 1.00    | utility_scale.UtilityScaleBenchmarks.track_qv_depth('ecr')                                                      |
|          | 462                               | 462                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cx')                                        |
|          | 462                               | 462                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('cz')                                        |
|          | 462                               | 462                            | 1.00    | utility_scale.UtilityScaleBenchmarks.track_square_heisenberg_depth('ecr')                                       |
|          | 489±4μs                           | 485±2μs                        | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_construction(1, 128)                                 |
|          | 117±0.4ms                         | 115±0.4ms                      | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_construction(1, 32768)                               |
|          | 48.9±0.1μs                        | 48.6±0.1μs                     | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_construction(1, 8)                                   |
|          | 571±6μs                           | 566±4μs                        | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_construction(5, 128)                                 |
|          | 65.8±1μs                          | 65.3±0.7μs                     | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_construction(5, 8)                                   |
|          | 98.2±2μs                          | 97.3±0.4μs                     | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_construction(8, 8)                                   |
|          | 9.09±0.1μs                        | 8.98±0.03μs                    | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_copy(1, 8)                                           |
|          | 11.4±0.5μs                        | 11.3±0.4μs                     | 0.99    | circuit_construction.CircuitConstructionBench.time_circuit_copy(8, 8)                                           |
|          | 394±2μs                           | 389±2μs                        | 0.99    | circuit_construction.MultiControl.time_multi_control_circuit(16)                                                |
|          | 508±4μs                           | 503±3μs                        | 0.99    | circuit_construction.MultiControl.time_multi_control_circuit(20)                                                |
|          | 37.0±1ms                          | 36.7±0.1ms                     | 0.99    | circuit_construction.ParamaterizedDifferentCircuit.time_DTC100_set_build(100, 50)                               |
|          | 20.7±0.1ms                        | 20.5±0.09ms                    | 0.99    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(100, 50)                                    |
|          | 74.4±0.5ms                        | 73.8±2ms                       | 0.99    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 32768, 128)                             |
|          | 21.6±0.3ms                        | 21.5±0.3ms                     | 0.99    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 8192, 2048)                             |
|          | 9.52±0.04ms                       | 9.45±0.06ms                    | 0.99    | utility_scale.UtilityScaleBenchmarks.time_bvlike('ecr')                                                         |
|          | 342±3ms                           | 339±1ms                        | 0.99    | utility_scale.UtilityScaleBenchmarks.time_qaoa('ecr')                                                           |
|          | 60.4±2μs                          | 59.1±0.5μs                     | 0.98    | circuit_construction.CircuitConstructionBench.time_circuit_construction(2, 8)                                   |
|          | 23.8±0.3ms                        | 23.2±1ms                       | 0.98    | circuit_construction.CircuitConstructionBench.time_circuit_extend(14, 131072)                                   |
|          | 6.60±0.1ms                        | 6.47±0.2ms                     | 0.98    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(10, 150)                                    |
|          | 11.0±0.4ms                        | 10.7±0.09ms                    | 0.98    | circuit_construction.ParamaterizedDifferentCircuit.time_QV100_build(50, 50)                                     |
|          | 306±2ms                           | 300±5ms                        | 0.98    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 131072, 2048)                           |
|          | 7.44±0.1ms                        | 7.29±0.04ms                    | 0.98    | circuit_construction.ParameterizedCircuitBindBench.time_bind_params(20, 2048, 2048)                             |
|          | 145±2ms                           | 142±0.4ms                      | 0.98    | utility_scale.UtilityScaleBenchmarks.time_bv_100('ecr')                                                         |
|          | 701±2ms                           | 689±3ms                        | 0.98    | utility_scale.UtilityScaleBenchmarks.time_qv('cx')                                                              |
|          | 16.7±0.7μs                        | 16.1±0.8μs                     | 0.97    | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8)                                          |
|          | 12.9±0.3μs                        | 12.6±0.09μs                    | 0.97    | circuit_construction.CircuitConstructionBench.time_circuit_extend(8, 8)                                         |
|          | 18.5±0.8μs                        | 17.7±0.9μs                     | 0.96    | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 128)                                        |
|          | 130±4μs                           | 121±10μs                       | 0.93    | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 8192)                                       |
|          | 43.7±3μs                          | 40.4±1μs                       | 0.93    | circuit_construction.CircuitConstructionBench.time_circuit_copy(5, 2048)                                        |
|          | 50.2±4μs                          | 46.1±0.7μs                     | 0.92    | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 2048)                                       |

Benchmarks that have got worse:

| Change   | Before [8bc48442] <once-lock^2>   | After [487e1bec] <once-lock>   |   Ratio | Benchmark (Parameter)                                                      |
|----------|-----------------------------------|--------------------------------|---------|----------------------------------------------------------------------------|
| +        | 368±40μs                          | 485±20μs                       |    1.32 | circuit_construction.CircuitConstructionBench.time_circuit_copy(20, 32768) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

This is mostly expected because we only use the Python gate cache when we're interacting with Python, which will always end up being the bottleneck. The exception, I guess, is copy(), which ends up being a Rust clone that is probably more expensive now, given the tracked regression on that one benchmark. But I think a 30% slowdown in the worst case, on a benchmark that runs at the scale of hundreds of microseconds, is acceptable given that this unlocks a whole class of multithreaded pass implementations like #13419 and #13411, which will yield further speedups for the transpiler.
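To make the tradeoff above concrete, here is a minimal sketch of a lazily filled cache held in a `std::sync::OnceLock` and read from several threads. The `Instruction` struct, its `label` method, and `shared_labels` are hypothetical stand-ins for illustration only; the real `PackedInstruction` caches a Python object in its `py_op` field, not a `String`.

```rust
use std::cell::OnceCell;
use std::mem::size_of;
use std::sync::OnceLock;
use std::thread;

// Hypothetical stand-in for an instruction with a lazily computed cache.
// The real PackedInstruction caches a Python object instead of a String.
struct Instruction {
    name: &'static str,
    cached: OnceLock<String>,
}

impl Instruction {
    fn new(name: &'static str) -> Self {
        Instruction { name, cached: OnceLock::new() }
    }

    // The first caller initializes the cache; later callers (on any thread)
    // get a reference to the same value.
    fn label(&self) -> &str {
        self.cached.get_or_init(|| format!("gate:{}", self.name))
    }
}

// Read the cached value from `n` scoped threads. This compiles because
// OnceLock<String> is Sync; the same struct with OnceCell would be rejected.
fn shared_labels(inst: &Instruction, n: usize) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..n)
            .map(|_| s.spawn(|| inst.label().to_string()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let inst = Instruction::new("cx");
    println!("{:?}", shared_labels(&inst, 4));
    // Sizes depend on the platform and compiler version, so they are only
    // printed here, not asserted.
    println!(
        "OnceCell<usize>: {} bytes, OnceLock<usize>: {} bytes",
        size_of::<OnceCell<usize>>(),
        size_of::<OnceLock<usize>>()
    );
}
```

Cloning a `OnceLock<T>` also clones any value already stored in it, which is consistent with the guess above that `copy()` became somewhat more expensive, although this sketch does not measure that.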

Contributor

@kevinhartman kevinhartman left a comment

Nice, this looks good to me.

@kevinhartman kevinhartman added this pull request to the merge queue Nov 11, 2024
Merged via the queue into Qiskit:main with commit 3a9993a Nov 11, 2024
17 checks passed
@mtreinish mtreinish deleted the once-lock branch November 18, 2024 22:51
github-merge-queue bot pushed a commit that referenced this pull request Feb 12, 2025
* Use OnceLock instead of OnceCell

OnceLock is a thread-safe version of OnceCell that enables us to use
PackedInstruction from a threaded environment. There is some overhead
associated with this, primarily in memory as the OnceLock is a larger
type than a OnceCell. But the tradeoff is worth it to start leveraging
multithreading for circuits.

Fixes #13219

* Update twirling too

* Perform BarrierBeforeFinalMeasurements analysis in parallel

With #13410 removing the non-threadsafe structure from our circuit
representation we're now able to read and iterate over a DAGCircuit from
multiple threads. This commit is the first small piece doing this, it
moves the analysis portion of the BarrierBeforeFinalMeasurements pass to
execute in parallel. The pass checks every node to ensure all its
descendants are either a measure or a barrier before reaching the end of
the circuit. This commit iterates over all the nodes and does the check
in parallel.

* Remove allocation for node scan

* Refactor pass to optimize search and set parallel threshold

This commit updates the logic in the pass to simplify the search
algorithm and improve its overall efficiency. Previously the pass would
search the entire dag for all barriers and measurements and then do a
BFS from each found node to check that all descendants were either
barriers or measurements. Then, with the set of nodes matching that
condition, a full topological sort of the dag was run, and the
topologically ordered nodes were filtered for the matching set. That
sorted set was then used for filtering.

This commit refactors this to do a reverse search from the output
nodes, which reduces the complexity of the algorithm. This new algorithm
is also conducive to parallel execution because it does a search
starting from each qubit's output node. In a test with quantum volume
circuits from 10 to 1000 qubits, which scale linearly in depth and
number of qubits, a crossover point between the parallel and serial
implementations was found around 150 qubits.

* Update crates/circuit/src/dag_circuit.rs

Co-authored-by: Raynel Sanchez <[email protected]>

* Rework logic to check using StandardInstruction

* Add comments explaining the search function

* Update crates/circuit/src/dag_circuit.rs

Co-authored-by: Raynel Sanchez <[email protected]>

---------

Co-authored-by: Raynel Sanchez <[email protected]>
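
The reverse search with a parallel threshold described in the commit message above can be sketched as follows. This is a heavily simplified illustration, not the pass's actual code: it models each qubit as an independent list of operations rather than a DAG, it uses `std::thread::scope` in place of whatever parallel iterator the real implementation uses, and the names `Op`, `trailing_final_ops`, `scan`, and `PARALLEL_THRESHOLD` are all hypothetical. Only the value 150 comes from the crossover point reported above.

```rust
use std::thread;

// Hypothetical, simplified operation kinds on a single qubit wire.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Gate,
    Measure,
    Barrier,
}

// Assumed threshold, taken from the ~150 qubit crossover mentioned above.
const PARALLEL_THRESHOLD: usize = 150;

// Walk backwards from the wire's output and count the trailing run of
// measurements and barriers; stop at the first other operation.
fn trailing_final_ops(wire: &[Op]) -> usize {
    wire.iter()
        .rev()
        .take_while(|op| matches!(op, Op::Measure | Op::Barrier))
        .count()
}

// Scan every wire, serially for small circuits and in chunks across
// scoped threads once the number of wires reaches the threshold.
fn scan(wires: &[Vec<Op>]) -> Vec<usize> {
    if wires.len() < PARALLEL_THRESHOLD {
        return wires.iter().map(|w| trailing_final_ops(w)).collect();
    }
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk = wires.len().div_ceil(workers);
    thread::scope(|s| {
        let handles: Vec<_> = wires
            .chunks(chunk)
            .map(|c| {
                s.spawn(move || {
                    c.iter().map(|w| trailing_final_ops(w)).collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in spawn order keeps results aligned with the input wires.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let wires = vec![
        vec![Op::Gate, Op::Measure],
        vec![Op::Measure, Op::Gate],
        vec![Op::Gate, Op::Barrier, Op::Measure],
    ];
    println!("{:?}", scan(&wires));
}
```

Each wire's search is independent, which is the property that makes the per-output-node formulation easy to parallelize; the threshold keeps small circuits on the serial path where thread startup would cost more than it saves.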
Labels
Changelog: None Do not include in changelog performance Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop using OnceCell for PackedInstruction.py_op field
4 participants