zb: allow direct access to message properties without allocating #1158

Closed
wants to merge 2 commits

Conversation

TTWNO
Contributor

@TTWNO TTWNO commented Nov 25, 2024

  • Creates a handful of new methods (there could potentially be a few more, too) that allow a user of zbus to directly access a reference to header properties without cloning the signature or the PrimaryHeader (see the sketch after this list).
  • Secondly, it allows you to consume the message in exchange for an owned zbus::message::Body type.
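To illustrate the shape of this, here is a minimal sketch; the borrowing/consuming accessor names below are illustrative and may differ from what the PR actually adds, while the header() path follows the existing zbus API:

```rust
use zbus::{message::Body, Message};

// Today: Message::header() builds a Header, cloning the primary header and
// the signature along the way.
fn via_header(msg: &Message) {
    let header = msg.header();
    let _interface = header.interface();
    let _member = header.member();
}

// Proposed (illustrative names): borrow the same properties straight off the
// Message, then consume it for an owned Body.
fn via_direct_access(msg: Message) {
    let _interface = msg.interface();  // hypothetical borrowing accessor
    let _member = msg.member();        // hypothetical borrowing accessor
    let _body: Body = msg.into_body(); // hypothetical consuming accessor
}
```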

This increases performance relative to taking the cloned primary header and signature.
atspi (branch = "deserialize-properly") can be cloned and cargo bench run to confirm these results.

Performance improvement for atspi is in the 30-50% range; I'll attach my benchmarks here for reference.

zbus-5.x branch of atspi

     Running benches/event_parsing.rs (target/release/deps/event_parsing-08ab76e3e81624d0)
Gnuplot not found, using plotters backend
Benchmarking 1000 Messages into Events
Benchmarking 1000 Messages into Events: Warming up for 3.0000 s
Benchmarking 1000 Messages into Events: Collecting 100 samples in estimated 9.3177 s (10k iterations)
Benchmarking 1000 Messages into Events: Analyzing
1000 Messages into Events
                        time:   [923.37 µs 925.20 µs 927.12 µs]
                        change: [+109.91% +110.21% +110.52%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  15 (15.00%) high severe

     Running benches/event_parsing_100k.rs (target/release/deps/event_parsing_100k-da6ad9ef4c4bfa68)
Gnuplot not found, using plotters backend
Benchmarking 100_000 Messages into Events
Benchmarking 100_000 Messages into Events: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, or reduce sample count to 50.
Benchmarking 100_000 Messages into Events: Collecting 100 samples in estimated 9.3435 s (100 iterations)
Benchmarking 100_000 Messages into Events: Analyzing
100_000 Messages into Events
                        time:   [93.347 ms 93.455 ms 93.579 ms]
                        change: [+136.69% +137.03% +137.44%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

deserialize-properly branch of atspi

     Running benches/event_parsing.rs (target/release/deps/event_parsing-a86e9ef1e3bf8bc7)
Gnuplot not found, using plotters backend
Benchmarking 1000 Messages into Events
Benchmarking 1000 Messages into Events: Warming up for 3.0000 s
Benchmarking 1000 Messages into Events: Collecting 100 samples in estimated 6.6050 s (15k iterations)
Benchmarking 1000 Messages into Events: Analyzing
1000 Messages into Events
                        time:   [434.71 µs 435.44 µs 436.31 µs]
                        change: [-52.873% -52.782% -52.699%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low mild
  1 (1.00%) high severe

     Running benches/event_parsing_100k.rs (target/release/deps/event_parsing_100k-74d92936c60bb4e4)
Gnuplot not found, using plotters backend
Benchmarking 100_000 Messages into Events
Benchmarking 100_000 Messages into Events: Warming up for 3.0000 s
Benchmarking 100_000 Messages into Events: Collecting 100 samples in estimated 7.5467 s (200 iterations)
Benchmarking 100_000 Messages into Events: Analyzing
100_000 Messages into Events
                        time:   [38.021 ms 38.071 ms 38.123 ms]
                        change: [-59.336% -59.263% -59.188%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking 100_000 Messages into Events (with header call)
Benchmarking 100_000 Messages into Events (with header call): Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, or reduce sample count to 80.
Benchmarking 100_000 Messages into Events (with header call): Collecting 100 samples in estimated 5.8684 s (100 iterations)
Benchmarking 100_000 Messages into Events (with header call): Analyzing
100_000 Messages into Events (with header call)
                        time:   [59.386 ms 59.432 ms 59.479 ms]

Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

     Running unittests src/lib.rs (target/release/deps/atspi_common-d6719e5605143917)

Please let me know if there is a better way to implement this.
For example, an appropriate trait to use, or whether this is even something you want to support.

(Also, I know the commits/formatting might be wrong; that's why it's a draft)

Thanks for everything,
—Tait

@TTWNO TTWNO force-pushed the direct_propety_access branch from d453f4c to ae7c976 on November 25, 2024 14:24
@TTWNO
Contributor Author

TTWNO commented Nov 28, 2024

Assuming my benchmarks are accurate, using Arc::clone(signature) instead of cloning the entire signature brings the cost down from around 21ms to about 15ms. It does, however, leave some additional performance on the table compared to not taking the header as a reference.
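For context, the two kinds of clone being compared look roughly like this (an illustrative sketch, not the exact atspi/zbus code):

```rust
use std::sync::Arc;
use zvariant::Signature;

// Deep clone: copies the whole signature structure.
fn deep(sig: &Signature) -> Signature {
    sig.clone()
}

// Arc clone: one atomic increment now, plus an atomic decrement when the
// clone is dropped -- cheaper, but still not free in a hot loop.
fn shared(sig: &Arc<Signature>) -> Arc<Signature> {
    Arc::clone(sig)
}
```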

It appears that most of this overhead comes from a desire to:

  1. Limit direct access to the internals of Message (private fields)
  2. Keep Message Send and Sync (hence the Arc<...> and OnceLock<...>)
  3. Avoid deserializing the header fields multiple times (QuickFields)

While I understand that general users of your library may benefit from all of these features, atspi only has temporary Messages. They are used exclusively for Event parsing, are never passed down to our downstream users, and are immediately discarded.

Would you consider a version of Message without all this indirection? I know it's an eventual goal to get some no_std compatibility with zvariant, and having a no_std compatible Message would go a long way (and would help me get some extra little performance tweaks :)


I understand these wins seem like a small amount of time, but I'd like to reiterate that atspi:

  1. Faces occasional event storms
  2. Only gets these numbers on my very powerful machine; atspi should also be usable on a Raspberry Pi, for example. I can test there to see how the benchmarks fare.
  3. Gets almost a 50% performance improvement just by eliminating a few clones. In a loop this tight, small things add up quickly.

Please let me know if I can help in any way.

@zeenix
Contributor

zeenix commented Nov 28, 2024

Assuming my benchmarks are accurate, using Arc::clone(signature) instead of cloning the entire signature brings the cost down from around 21ms to about 15ms. It does, however, leave some additional performance on the table compared to not taking the header as a reference.

Thanks for checking. That's not as impressive as I would have thought. I guess that's the cost of dropping an Arc.

Did you try the Cow trick we talked about? I was hoping you'd prioritise that, because if cloning is the bottleneck here, that should help a lot.

@zeenix
Contributor

zeenix commented Nov 28, 2024

Would you consider a version of Message without all this indirection?

That depends on what you mean. 😊 At least, not by putting all API under the Message. It'll also be an API break if we do it in a consistent way.

Can you please first try what could be a low-hanging fruit: only cloning the reference to the signature?

I know it's an eventual goal to get some no_std compatibility with zvariant and having a no_std compatible Message would go a long way (and help me get some extra little performance tweaks :)

I'm not sure what you mean here.

@TTWNO
Contributor Author

TTWNO commented Dec 2, 2024

Did you try the Cow trick we talked about? I was hoping you'd prioritise that because if Cloning is the bottleneck here, that should help a lot.

I attempted this, with similar results. Of note, though: it is better than the Arc<_> version.
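For reference, the Cow trick here means handing out a Cow so the common path borrows and a clone only happens when the caller actually needs ownership. A minimal sketch of that shape (not the actual zbus change):

```rust
use std::borrow::Cow;
use zvariant::Signature;

// Hypothetical accessor shape: borrow the stored signature; no clone on the
// common path.
fn signature_of(stored: &Signature) -> Cow<'_, Signature> {
    Cow::Borrowed(stored)
}

// Only callers that actually need an owned Signature pay for the clone.
fn take_ownership(sig: Cow<'_, Signature>) -> Signature {
    sig.into_owned()
}
```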

Using Cow<Signature>

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/event_parsing.rs (target/release/deps/event_parsing-502de8f30e35fba0)
Gnuplot not found, using plotters backend
Benchmarking 1000 Messages into Events
Benchmarking 1000 Messages into Events: Warming up for 3.0000 s
Benchmarking 1000 Messages into Events: Collecting 100 samples in estimated 6.3859 s (15k iterations)
Benchmarking 1000 Messages into Events: Analyzing
1000 Messages into Events
                        time:   [423.13 µs 423.63 µs 424.17 µs]
                        change: [-1.6422% -1.4358% -1.2412%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

     Running benches/event_parsing_100k.rs (target/release/deps/event_parsing_100k-1df580876c5b4ca3)
Gnuplot not found, using plotters backend
Benchmarking 100_000 Messages into Events
Benchmarking 100_000 Messages into Events: Warming up for 3.0000 s
Benchmarking 100_000 Messages into Events: Collecting 100 samples in estimated 8.0410 s (200 iterations)
Benchmarking 100_000 Messages into Events: Analyzing
100_000 Messages into Events
                        time:   [40.704 ms 41.115 ms 41.559 ms]
                        change: [-0.9046% +0.0968% +1.2099%] (p = 0.86 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking 100_000 Messages into Events (with header call)
Benchmarking 100_000 Messages into Events (with header call): Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, or reduce sample count to 80.
Benchmarking 100_000 Messages into Events (with header call): Collecting 100 samples in estimated 5.6560 s (100 iterations)
Benchmarking 100_000 Messages into Events (with header call): Analyzing
100_000 Messages into Events (with header call)
                        time:   [55.217 ms 55.327 ms 55.442 ms]
                        change: [-9.5622% -9.3639% -9.1954%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

I'm not sure what you mean here.

What I meant was that it would be good if there were a version of Message that could be deserialized without std. Does Message really need to be Send + Sync? I understand why it is; it's just that since we have a very specific use case (Message as merely a temporary value to be parsed in a hot loop), those tiny locks (OnceLock<QuickFields>) and atomics (Arc<InnerMessage>) all add up.

Or perhaps Data<_, _> could become more versatile? I.e., have many of the same functions as Message but without the indirection (if you ask for the .body_signature(), it will parse it, whether or not the value has already been computed). Leave it up to libraries to use it responsibly.
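In other words, something like a parse-on-demand accessor instead of a OnceLock-cached one. A purely illustrative sketch (made-up types, not Data's or Message's real API):

```rust
use std::sync::OnceLock;

// Stand-in for whatever real signature parsing would do.
fn parse_signature(_raw: &[u8]) -> String {
    String::new()
}

// Cached style (roughly what Message does): compute once behind a OnceLock
// and reuse it; this costs a lock slot and the Send + Sync machinery.
struct Cached {
    raw: Vec<u8>,
    sig: OnceLock<String>,
}

impl Cached {
    fn body_signature(&self) -> &str {
        self.sig.get_or_init(|| parse_signature(&self.raw))
    }
}

// Parse-on-demand style (the suggestion): no lock, no cache; the caller
// decides how often to pay the parsing cost.
struct Lean<'a> {
    raw: &'a [u8],
}

impl Lean<'_> {
    fn body_signature(&self) -> String {
        parse_signature(self.raw)
    }
}
```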

@zeenix
Contributor

zeenix commented Dec 3, 2024

Attempted this with similar results. To note though, it is better than the Arc<_> version.

That seems good then. Let's do that please.

What I meant was it would be good if there was a version of Message that could be deserialized without std. Does Message really need to be Send + Sync?

Well, in general, yes. We especially need it (and many other types) to be Send + Sync for easy use with popular multithreaded runtimes like tokio. Apart from async, we also need to broadcast messages internally, and not using Arc would mean doing a deep clone multiple times, which would be bad.
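As a sketch of why the Arc matters for broadcasting (hypothetical types, not zbus's actual internals), each subscriber gets a refcount bump rather than a deep copy of the message:

```rust
use std::sync::{mpsc::Sender, Arc};

// Stand-in for the real (large) message contents.
struct InnerMessage;

// One Arc clone per receiver: an atomic increment each, instead of a full
// deep clone of the message for every subscriber.
fn broadcast(msg: &Arc<InnerMessage>, subscribers: &[Sender<Arc<InnerMessage>>]) {
    for tx in subscribers {
        let _ = tx.send(Arc::clone(msg));
    }
}
```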

If/when we target no_std, one path would be to make those constraints conditional. Also keep in mind that atomics aren't strictly necessary for Send + Sync.

those tiny locks (OnceLock<QuickFields>) and atomics (Arc<InnerMessage>) all add up.

I'm pretty sure they do, but in the end it's about the cost-to-benefit ratio. With benchmarks we can see what the cost is, so we can tell whether it's worth spending time on it and (especially) whether it's worth reducing the ergonomics of the API for it.

Or perhaps Data<_, _> could become more versatile? I.e., have many of the same functions of Message but without indirection

How would that work better than the Cow trick? We could actually use Cow for all fields, no?

@zeenix
Contributor

zeenix commented Jan 6, 2025

This was superseded by #1188.

@zeenix zeenix closed this Jan 6, 2025