
Avoid dropping elements on consume (idea) #93

Closed
erenon opened this issue Sep 15, 2022 · 5 comments

erenon commented Sep 15, 2022

Hi, thanks for the great library!

Here's my use-case: this is a low-latency application. Thread A sends elements over an SPSC channel to Thread B. The elements contain heap allocations, e.g. String. (I know, it's funny to have both "low latency" and "heap-allocated elems" in the same system, but that's how it is.)

With any popular queue solution, including rtrb, popping takes ownership of the elem, so the consumer eventually drops it (unless it sends the consumed elem back using a different channel). This makes Thread A continuously allocate and Thread B continuously deallocate, eventually making them contend for the same lock. This has measurable effects.

To solve this issue, I ported my C++ queue to rust: https://github.com/erenon/cueue
This has the following differences compared to rtrb:

  • It default-initializes every element in the queue (this requires T to implement Default).
  • It does not drop consumed elements, but leaves them in the queue as-is. This avoids the heap ping-pong.
  • Produced elements are returned to the producer (in the write_chunk): this allows reuse of resources.
  • It employs an mmap trick that makes the queue truly circular. This makes the internal logic a bit simpler and the reader/writer interface a bit nicer (e.g. the reader always returns a single slice, not two).

Perhaps some ideas can be salvaged to make rtrb even better? (I would prefer the widely used rtrb over my tiny lib, if it was possible) Thanks!

mgeier (Owner) commented Sep 16, 2022

Thanks for reaching out!

(unless it sends the consumed elem back using a different channel)

That's what I would do, and it's what I actually did for some time: I created a data type that, when dropped, sends itself (or rather its contents) back over a secondary queue, where it can be re-used:
https://github.com/AudioSceneDescriptionFormat/asdf-rust/blob/aba3abb37d47aff51834c9b3a79c86cabd90b004/src/streamer.rs#L66-L78

But later I changed that to avoid the heap-allocated type.

Perhaps some ideas can be salvaged to make rtrb even better?

I'm open to improvements, but I think it's also OK to have multiple implementations with different properties, and therefore different strengths and weaknesses. And different APIs.

(I would prefer the widely used rtrb over my tiny lib, if it was possible)

I'm not sure how widely it's actually used. You can also check out https://github.com/agerasev/ringbuf, which seems to be more popular.

And your cueue project has already gathered many GitHub stars in its very short existence; maybe it will overtake rtrb soon?

  • It default-initializes every element in the queue (this requires T to implement Default).

I wouldn't want to force this requirement onto all users.

I think that's a good reason to have a separate project.

  • It does not drop consumed elements, but leaves them in the queue as-is. This avoids the heap ping-pong.
  • Produced elements are returned to the producer (in the write_chunk): this allows reuse of resources.

I'm sure this has its uses, but some users might actually want to have move semantics, e.g. when the items are managing some kind of resource.

Of course one could always use an Option<T> for that, but that seems like a clumsy workaround.

  • It employs an mmap trick that makes the queue truly circular. This makes the internal logic a bit simpler and the reader/writer interface a bit nicer (e.g. the reader always returns a single slice, not two).

Yes, I've thought about it, and this sounds really interesting, but I didn't want to be too platform-specific.
I think that's another good reason to have a separate project.

I'm also not sure about the performance implications, but I'd hope that the mmap trick is faster.

The API would be completely different, so there wouldn't be much left from the original rtrb.

It would be interesting if the two approaches (and maybe even more?) could share some code. I'm not sure though, since the mmap approach uses different (simpler) internal logic, as you mentioned.


Another aspect: I'm playing with the thought of putting all the memory into a single allocation, which I've implemented in #75. This is supposed to remove one indirection, but I don't know yet if that's worthwhile.
But AFAICT, this would be incompatible with the mmap approach, right?

erenon (Author) commented Sep 19, 2022

All good points, thanks for taking a look! I think we can conclude that the two projects are sufficiently different that there is no point in sharing code, only ideas.

(I can't say I really understand #75, but fwiw, depending on how you count allocations, the "dynamic" part of cueue resides in a single allocation, and the Reader and Writer reference that)

erenon closed this as completed Sep 19, 2022
mgeier (Owner) commented Sep 20, 2022

I think we can conclude that the two projects are sufficiently different that there is no point in sharing code, only ideas.

I agree.
I've already made a few suggestions: erenon/cueue#6, erenon/cueue#7

(I can't say I really understand #75, but fwiw, depending on how you count allocations, the "dynamic" part of cueue resides in a single allocation, and the Reader and Writer reference that)

Yes, I've just looked at that, but there is an additional allocation created by Arc::new().

However, you seem to have an additional buffer pointer, which acts as a "shortcut" of sorts, avoiding the second indirection. But instead of two chained indirections, you make two separate ones, one for cb and one for buffer. And then there are write_begin and read_begin, which seem to be additional indirections, but I'm not sure about the implications of that.

In rtrb, I have two indirections to get to producer.buffer.data_ptr. The idea would be that #75 avoids the Arc by re-implementing part of its functionality manually. Conceptually, I would get something like producer.buffer[i] for accessing a slot, because the RingBuffer struct itself holds the data.

In the end I would really only make a single dynamic allocation, which holds the two atomic indices for head/tail (with cacheline padding), an atomic bool for the Arc-like behavior, and all the buffer elements.

So far, I haven't seen convincing performance improvements, so I'm not sure if it is worth the additional code.

In your cueue case however, since you already have the atomic indices in the same allocation as your slots, you might as well add an atomic bool and get rid of the Arc. That way you might even get rid of the alloc dependency (for whatever that's worth), because you do your actual allocations with libc rather than the standard library.

BTW, I'm wondering about your cueue() function, which doesn't seem to mention alignment anywhere. I guess if your T doesn't have an alignment bigger than 128 everything should be fine, but what if it has more?

erenon (Author) commented Sep 21, 2022

Yes, I've just looked at that, but there is an additional allocation created by Arc::new().

Good point, now I understand what you mean much better. I retained buffer because computing it from cb cost precious nanoseconds (measurable). And I see how I could get rid of Arc, good idea!

BTW, I'm wondering about your cueue() function, which doesn't seem to mention alignment anywhere. I guess if your T doesn't have an alignment bigger than 128 everything should be fine, but what if it has more?

Where's 128 coming from? I understand the alignment must be at most the page size (and a power of two), which is 4096 or 4*4096 on the systems I have access to. I've never heard of a use-case for a larger alignment.

mgeier (Owner) commented Sep 22, 2022

Where's 128 coming from?

From https://github.com/erenon/cueue/blob/f536ef748cd36f72888ee9c1ac4f5b69d6575129/src/lib.rs#L251

I thought your items followed immediately after the two padded indices, but now I've seen that you are using an offset of cbsize, which is set to the page size. So IIUC, the beginning of your buffer has an alignment of much more than 128. I guess that's fine.

In #75 I was using Layout::extend(), which automatically takes care of the proper alignment, so I was wondering how this can be achieved manually.
