zb: allow direct access to message properties without allocating #1158

Closed
wants to merge 2 commits

Conversation

TTWNO
Contributor

@TTWNO TTWNO commented Nov 25, 2024

  • Creates a handful of new methods (there could potentially be a few more, too) that allow a user of zbus to directly access a reference to header properties without cloning the signature or the PrimaryHeader (see the sketch after this list).
  • Secondly, it allows you to consume the message in exchange for an owned zbus::message::Body type.
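To illustrate the shape of this, here is a minimal sketch; the borrowing/consuming accessor names below are illustrative and may differ from what the PR actually adds, while the header() path follows the existing zbus API:

```rust
use zbus::{message::Body, Message};

// Today: Message::header() builds a Header, cloning the primary header and
// the signature along the way.
fn via_header(msg: &Message) {
    let header = msg.header();
    let _interface = header.interface();
    let _member = header.member();
}

// Proposed (illustrative names): borrow the same properties straight off the
// Message, then consume it for an owned Body.
fn via_direct_access(msg: Message) {
    let _interface = msg.interface();  // hypothetical borrowing accessor
    let _member = msg.member();        // hypothetical borrowing accessor
    let _body: Body = msg.into_body(); // hypothetical consuming accessor
}
```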

This increases performance relative to taking the cloned primary header and signature.
atspi (branch = "deserialize-properly") can be cloned and cargo bench run to confirm these results.

Performance improvement for atspi is in the 30-50% range; I'll attach my benchmarks here for reference.

zbus-5.x branch of atspi

     Running benches/event_parsing.rs (target/release/deps/event_parsing-08ab76e3e81624d0)
Gnuplot not found, using plotters backend
Benchmarking 1000 Messages into Events
Benchmarking 1000 Messages into Events: Warming up for 3.0000 s
Benchmarking 1000 Messages into Events: Collecting 100 samples in estimated 9.3177 s (10k iterations)
Benchmarking 1000 Messages into Events: Analyzing
1000 Messages into Events
                        time:   [923.37 µs 925.20 µs 927.12 µs]
                        change: [+109.91% +110.21% +110.52%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  15 (15.00%) high severe

     Running benches/event_parsing_100k.rs (target/release/deps/event_parsing_100k-da6ad9ef4c4bfa68)
Gnuplot not found, using plotters backend
Benchmarking 100_000 Messages into Events
Benchmarking 100_000 Messages into Events: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, or reduce sample count to 50.
Benchmarking 100_000 Messages into Events: Collecting 100 samples in estimated 9.3435 s (100 iterations)
Benchmarking 100_000 Messages into Events: Analyzing
100_000 Messages into Events
                        time:   [93.347 ms 93.455 ms 93.579 ms]
                        change: [+136.69% +137.03% +137.44%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

deserialize-properly branch of atspi

     Running benches/event_parsing.rs (target/release/deps/event_parsing-a86e9ef1e3bf8bc7)
Gnuplot not found, using plotters backend
Benchmarking 1000 Messages into Events
Benchmarking 1000 Messages into Events: Warming up for 3.0000 s
Benchmarking 1000 Messages into Events: Collecting 100 samples in estimated 6.6050 s (15k iterations)
Benchmarking 1000 Messages into Events: Analyzing
1000 Messages into Events
                        time:   [434.71 µs 435.44 µs 436.31 µs]
                        change: [-52.873% -52.782% -52.699%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low mild
  1 (1.00%) high severe

     Running benches/event_parsing_100k.rs (target/release/deps/event_parsing_100k-74d92936c60bb4e4)
Gnuplot not found, using plotters backend
Benchmarking 100_000 Messages into Events
Benchmarking 100_000 Messages into Events: Warming up for 3.0000 s
Benchmarking 100_000 Messages into Events: Collecting 100 samples in estimated 7.5467 s (200 iterations)
Benchmarking 100_000 Messages into Events: Analyzing
100_000 Messages into Events
                        time:   [38.021 ms 38.071 ms 38.123 ms]
                        change: [-59.336% -59.263% -59.188%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking 100_000 Messages into Events (with header call)
Benchmarking 100_000 Messages into Events (with header call): Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, or reduce sample count to 80.
Benchmarking 100_000 Messages into Events (with header call): Collecting 100 samples in estimated 5.8684 s (100 iterations)
Benchmarking 100_000 Messages into Events (with header call): Analyzing
100_000 Messages into Events (with header call)
                        time:   [59.386 ms 59.432 ms 59.479 ms]

Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

     Running unittests src/lib.rs (target/release/deps/atspi_common-d6719e5605143917)

Please let me know if there is a better way to implement this.
For example, an appropriate trait to use, or whether this is even something you want to support.

(Also, I know the commits/formatting might be wrong; that's why it's a draft)

Thanks for everything,
—Tait

@TTWNO TTWNO force-pushed the direct_propety_access branch from d453f4c to ae7c976 on November 25, 2024 14:24
@TTWNO
Contributor Author

TTWNO commented Nov 28, 2024

Assuming my benchmarks are accurate, using Arc::clone(signature) instead of cloning the entire signature brings the cost down from around 21ms to about 15ms. It does, however, leave some additional performance on the table compared to not taking the header as a reference.
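For context, the two kinds of clone being compared look roughly like this (an illustrative sketch, not the exact atspi/zbus code):

```rust
use std::sync::Arc;
use zvariant::Signature;

// Deep clone: copies the whole signature structure.
fn deep(sig: &Signature) -> Signature {
    sig.clone()
}

// Arc clone: one atomic increment now, plus an atomic decrement when the
// clone is dropped -- cheaper, but still not free in a hot loop.
fn shared(sig: &Arc<Signature>) -> Arc<Signature> {
    Arc::clone(sig)
}
```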

It appears that most of this overhead comes from a desire to:

  1. Limit direct access to the internals of Message (private fields)
  2. Keep Message Send and Sync (hence the Arc<...> and OnceLock<...>)
  3. Avoid deserializing the header fields multiple times (QuickFields)

While I understand that general users of your library may benefit from all of these features, atspi only has temporary Messages. They are used exclusively for Event parsing, are never passed down to our downstream users, and are immediately discarded.

Would you consider a version of Message without all this indirection? I know it's an eventual goal to get some no_std compatibility with zvariant, and having a no_std compatible Message would go a long way (and would help me get some extra little performance tweaks :)


I understand these wins seem like a small amount of time, but I'd like to reiterate that atspi:

  1. Faces occasional event storms
  2. Only gets these numbers on my very powerful machine; atspi should also be usable on a Raspberry Pi, for example. I can test there to see how the benchmarks fare.
  3. Gets almost a 50% performance improvement just by eliminating a few clones. In a loop this tight, small things add up quickly.

Please let me know if I can help in any way.

@zeenix
Contributor

zeenix commented Nov 28, 2024

Assuming my benchmarks are accurate, using Arc::clone(signature) instead of cloning the entire signature brings the cost down from around 21ms to about 15ms. It does, however, leave some additional performance on the table compared to not taking the header as a reference.

Thanks for checking. That's not as impressive as I would have thought. I guess that's the cost of dropping an Arc.

Did you try the Cow trick we talked about? I was hoping you'd prioritise that, because if cloning is the bottleneck here, that should help a lot.

@zeenix
Contributor

zeenix commented Nov 28, 2024

Would you consider a version of Message without all this indirection?

That depends on what you mean. 😊 At least, not by putting all API under the Message. It'll also be an API break if we do it in a consistent way.

Can you please first try what could be a low-hanging fruit: only cloning the reference to the signature?

I know it's an eventual goal to get some no_std compatibility with zvariant and having a no_std compatible Message would go a long way (and help me get some extra little performance tweaks :)

I'm not sure what you mean here.

@TTWNO
Contributor Author

TTWNO commented Dec 2, 2024

Did you try the Cow trick we talked about? I was hoping you'd prioritise that because if Cloning is the bottleneck here, that should help a lot.

I attempted this, with similar results. Of note, though: it is better than the Arc<_> version.
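For reference, the Cow trick here means handing out a Cow so the common path borrows and a clone only happens when the caller actually needs ownership. A minimal sketch of that shape (not the actual zbus change):

```rust
use std::borrow::Cow;
use zvariant::Signature;

// Hypothetical accessor shape: borrow the stored signature; no clone on the
// common path.
fn signature_of(stored: &Signature) -> Cow<'_, Signature> {
    Cow::Borrowed(stored)
}

// Only callers that actually need an owned Signature pay for the clone.
fn take_ownership(sig: Cow<'_, Signature>) -> Signature {
    sig.into_owned()
}
```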

Using Cow<Signature>

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/event_parsing.rs (target/release/deps/event_parsing-502de8f30e35fba0)
Gnuplot not found, using plotters backend
Benchmarking 1000 Messages into Events
Benchmarking 1000 Messages into Events: Warming up for 3.0000 s
Benchmarking 1000 Messages into Events: Collecting 100 samples in estimated 6.3859 s (15k iterations)
Benchmarking 1000 Messages into Events: Analyzing
1000 Messages into Events
                        time:   [423.13 µs 423.63 µs 424.17 µs]
                        change: [-1.6422% -1.4358% -1.2412%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

     Running benches/event_parsing_100k.rs (target/release/deps/event_parsing_100k-1df580876c5b4ca3)
Gnuplot not found, using plotters backend
Benchmarking 100_000 Messages into Events
Benchmarking 100_000 Messages into Events: Warming up for 3.0000 s
Benchmarking 100_000 Messages into Events: Collecting 100 samples in estimated 8.0410 s (200 iterations)
Benchmarking 100_000 Messages into Events: Analyzing
100_000 Messages into Events
                        time:   [40.704 ms 41.115 ms 41.559 ms]
                        change: [-0.9046% +0.0968% +1.2099%] (p = 0.86 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking 100_000 Messages into Events (with header call)
Benchmarking 100_000 Messages into Events (with header call): Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, or reduce sample count to 80.
Benchmarking 100_000 Messages into Events (with header call): Collecting 100 samples in estimated 5.6560 s (100 iterations)
Benchmarking 100_000 Messages into Events (with header call): Analyzing
100_000 Messages into Events (with header call)
                        time:   [55.217 ms 55.327 ms 55.442 ms]
                        change: [-9.5622% -9.3639% -9.1954%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

I'm not sure what you mean here.

What I meant was that it would be good if there were a version of Message that could be deserialized without std. Does Message really need to be Send + Sync? I understand why it is; it's just that since we have a very specific use case (Message as merely a temporary value to be parsed in a hot loop), those tiny locks (OnceLock<QuickFields>) and atomics (Arc<InnerMessage>) all add up.

Or perhaps Data<_, _> could become more versatile? I.e., have many of the same functions as Message but without the indirection (if you ask for the .body_signature(), it will parse it, whether or not the value has already been computed). Leave it up to libraries to use it responsibly.
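In other words, something like a parse-on-demand accessor instead of a OnceLock-cached one. A purely illustrative sketch (made-up types, not Data's or Message's real API):

```rust
use std::sync::OnceLock;

// Stand-in for whatever real signature parsing would do.
fn parse_signature(_raw: &[u8]) -> String {
    String::new()
}

// Cached style (roughly what Message does): compute once behind a OnceLock
// and reuse it; this costs a lock slot and the Send + Sync machinery.
struct Cached {
    raw: Vec<u8>,
    sig: OnceLock<String>,
}

impl Cached {
    fn body_signature(&self) -> &str {
        self.sig.get_or_init(|| parse_signature(&self.raw))
    }
}

// Parse-on-demand style (the suggestion): no lock, no cache; the caller
// decides how often to pay the parsing cost.
struct Lean<'a> {
    raw: &'a [u8],
}

impl Lean<'_> {
    fn body_signature(&self) -> String {
        parse_signature(self.raw)
    }
}
```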

@zeenix
Contributor

zeenix commented Dec 3, 2024

Attempted this with similar results. To note though, it is better than the Arc<_> version.

That seems good then. Let's do that please.

What I meant was it would be good if there was a version of Message that could be deserialized without std. Does Message really need to be Send + Sync?

Well, in general, yes. We especially need it (and many other types) to be Send + Sync for easy use with popular multithreaded runtimes like tokio. Apart from async, we also need to broadcast messages internally, and not using Arc would mean doing a deep clone multiple times, which would be bad.
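As a sketch of why the Arc matters for broadcasting (hypothetical types, not zbus's actual internals), each subscriber gets a refcount bump rather than a deep copy of the message:

```rust
use std::sync::{mpsc::Sender, Arc};

// Stand-in for the real (large) message contents.
struct InnerMessage;

// One Arc clone per receiver: an atomic increment each, instead of a full
// deep clone of the message for every subscriber.
fn broadcast(msg: &Arc<InnerMessage>, subscribers: &[Sender<Arc<InnerMessage>>]) {
    for tx in subscribers {
        let _ = tx.send(Arc::clone(msg));
    }
}
```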

If/when we target no_std, one path would be to make those constraints conditional. Also keep in mind that atomics aren't strictly necessary for Send + Sync.

those tiny locks (OnceLock<QuickFields>) and atomics (Arc<InnerMessage>) all add up.

I'm pretty sure they do, but in the end it's about the cost-to-benefit ratio. With benchmarks we can see what the cost is, so we can tell whether it's worth spending time on it and (especially) whether it's worth reducing the ergonomics of the API for it.

Or perhaps Data<_, _> could become more versatile? I.e., have many of the same functions of Message but without indirection

How would that work better than the Cow trick? We could actually use Cow for all fields, no?

@zeenix
Contributor

zeenix commented Jan 6, 2025

This was superseded by #1188.

@zeenix zeenix closed this Jan 6, 2025