issue: 1792164 Socket error queue support #900

Open
wants to merge 21 commits into
base: master

Conversation

igor-ivanov
Collaborator

No description provided.

NOTIFY_ON_EVENTS should be used in all places to provide a
single way of passing epoll events.

Signed-off-by: Igor Ivanov <[email protected]>
@swx-jenkins2

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/accl-libvma-pr/3422/ for details (Mellanox internal link).

@swx-jenkins2

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/accl-libvma-pr/3423/ for details (Mellanox internal link).

Add a flags argument that carries the flags of the original recv() call.
It is needed to return information from the error queue, which
must be done when MSG_ERRQUEUE is passed.

Signed-off-by: Igor Ivanov <[email protected]>
Zero copy (MSG_ZEROCOPY) was introduced in Linux kernel 4.14, so
previous versions do not have the related options.

Signed-off-by: Igor Ivanov <[email protected]>
Passing the MSG_ZEROCOPY flag is the most obvious step to enable
copy avoidance, but not the only one.
The kernel is permissive when applications pass undefined flags
to the send system call. By default it simply ignores these.
To avoid enabling copy avoidance mode for legacy processes that
accidentally already pass this flag, a process must first
signal intent by setting the SO_ZEROCOPY socket option.
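
The two-step opt-in described above can be sketched from the application's side as follows. This is a minimal illustration rather than code from this pull request: the helper names `enable_zerocopy` and `send_zerocopy` are hypothetical, and the fallback constants are the generic Linux values, which a few architectures may define differently.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

// Fallback values for build hosts whose headers predate Linux 4.14.
// These are the generic Linux definitions (assumption: a few
// architectures use different numbers).
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

// Step 1: signal intent on the socket.
// Returns true when the option was accepted; errno is set otherwise.
bool enable_zerocopy(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
}

// Step 2: request copy avoidance for one particular send call.
// If step 1 was skipped, the flag is expected to be ignored and the
// data is copied as usual.
ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    return send(fd, buf, len, MSG_ZEROCOPY);
}
```

A caller would typically invoke `enable_zerocopy()` once after creating the socket and fall back to plain `send()` if it returns false.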

Signed-off-by: Igor Ivanov <[email protected]>
Extend the pbuf allocation functions with a new pbuf_type parameter
to pass the requested type of memory to the socket.
The socket layer needs this information to manage different types
of mem_buf_desc_t elements.

Signed-off-by: Igor Ivanov <[email protected]>
Three flags are added:
VMA_TX_PACKET_ZEROCOPY - to use on sockinfo/dst_entry layers
TCP_WRITE_ZEROCOPY - to use inside lwip tcp_write
TF_SEG_OPTS_ZEROCOPY - to mark tcp segment with zero copy attribute

Signed-off-by: Igor Ivanov <[email protected]>
@swx-jenkins2

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/accl-libvma-pr/3424/ for details (Mellanox internal link).

@swx-jenkins2

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/accl-libvma-pr/3426/ for details (Mellanox internal link).

@swx-jenkins2

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/accl-libvma-pr/3427/ for details (Mellanox internal link).

@swx-jenkins2

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/accl-libvma-pr/3428/ for details (Mellanox internal link).

@swx-jenkins2

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/accl-libvma-pr/3429/ for details (Mellanox internal link).

These changes make the MSG_ZEROCOPY send flow workable,
including the notification mechanism.
The mechanism is needed to notify the process when it is safe to
reuse a previously passed buffer. It queues completion notifications
on the socket error queue.

However, copy avoidance is not yet done internally, so all data
is still copied into internal buffers, as without MSG_ZEROCOPY.

Full zcopy support will be implemented later.

Signed-off-by: Igor Ivanov <[email protected]>
ZCOPY packets should notify the application as soon as possible to
confirm that user buffers are free to reuse. So force a completion
signal for such work requests.

Signed-off-by: Igor Ivanov <[email protected]>
A TCP write can create several memory descriptors with an identical
zcopy id for a single write call. The notification should be sent
only when the last one is freed.
This change does not completely guarantee correctness when, during
the same write() call, a memory descriptor is assigned the current
zcopy id after a previous memory descriptor has already received its
tx completion and ack.

A zcopy operation should allocate a memory buffer to track the
unique counter correctly.
So tcp_write() should avoid appending a portion of data to an
existing pbuf.
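
The "notify only when the last descriptor is freed" rule can be pictured as a per-id reference count. The class below is a hypothetical sketch, not VMA's actual data structure; `zcopy_tracker`, `attach` and `release` are names invented for this illustration.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical tracker: counts live memory descriptors per zcopy id.
class zcopy_tracker {
public:
    // Called for each memory descriptor created by one write() call.
    void attach(uint32_t zcopy_id) { ++m_pending[zcopy_id]; }

    // Called when a descriptor is freed (tx completion and ack seen).
    // Returns true only when it was the last descriptor for this id,
    // that is, when the application may be notified.
    bool release(uint32_t zcopy_id)
    {
        auto it = m_pending.find(zcopy_id);
        if (it == m_pending.end()) {
            return false;  // unknown or already notified id
        }
        if (--it->second > 0) {
            return false;  // other descriptors still in flight
        }
        m_pending.erase(it);
        return true;
    }

private:
    std::unordered_map<uint32_t, unsigned> m_pending;
};
```

The sketch shares the limitation noted in the commit message: if one descriptor is released before a later descriptor of the same write() call has been attached, the count drops to zero early and the notification fires too soon.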

Signed-off-by: Igor Ivanov <[email protected]>
To process TX completions effectively, VMA should
poll TX from the internal thread too. Otherwise
tx memory descriptors cannot be freed in time, because
the user application would have to call a write() operation
to force it.

Signed-off-by: Igor Ivanov <[email protected]>
Flexible tuning is added to control RX and TX polling.

Signed-off-by: Igor Ivanov <[email protected]>
rx() processing should allow returning information from the error
queue and incoming data in a single call.
Depending on the user application, this means that rx() logic should
return:
1. only incoming data
2. only error queue data
3. incoming and error queue data
Error processing logic is done accordingly.

Signed-off-by: Igor Ivanov <[email protected]>
An LSO operation cannot be done when the payload is smaller
than the mss.
This change allows LSO to be used correctly.

Signed-off-by: Igor Ivanov <[email protected]>
pasis and others added 3 commits December 30, 2020 16:18
The zcopy notification mechanism (error queue) adds an EPOLLERR event
to the respective epfd_info object, and the event is never removed.
As a result, epoll_wait() returns the EPOLLERR event endlessly and
never enters the polling loops.

Fix this by removing the EPOLLERR event when the socket is no longer
"errorable". The fix avoids fake EPOLLERR events and allows
epoll_wait_helper() to perform polling.
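
The idea behind this fix can be reduced to recomputing the error bit of a socket's ready-event mask. The helper below is a hypothetical sketch; `refresh_error_event` and the `is_errorable` parameter are invented for illustration and do not reflect the actual epfd_info interface.

```cpp
#include <cstdint>
#include <sys/epoll.h>

// Hypothetical helper: recompute the EPOLLERR bit of a ready-event mask.
// is_errorable stands for "the socket still has something to report",
// for example a non-empty error queue.
uint32_t refresh_error_event(uint32_t ready_events, bool is_errorable)
{
    const uint32_t err_bit = static_cast<uint32_t>(EPOLLERR);
    if (is_errorable) {
        return ready_events | err_bit;
    }
    return ready_events & ~err_bit;  // drop the stale EPOLLERR
}
```

With the bit cleared once the error queue is drained, a mask that contained only EPOLLERR becomes empty, so the caller has no ready event to report and can proceed to polling.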
In a retransmit scenario it is possible to get duplicate ids in the
zcopy callback. In this case, ee_data is overwritten with a value that
may be lower than the previous value. This leads to missed
notifications.

As a workaround, don't overwrite ee_data with a lower value.
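
The workaround amounts to letting the upper bound of a pending notification only grow. A minimal sketch, assuming the pending notification is kept as an inclusive id range (`zc_notification` and `extend_notification` are hypothetical names; `hi` plays the role of ee_data):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical pending notification covering ids lo..hi inclusive.
struct zc_notification {
    uint32_t lo;
    uint32_t hi;
};

// Record a completed id without ever lowering the upper bound, so a
// duplicate id from a retransmit cannot shrink the reported range.
// Note: this plain comparison ignores 32-bit counter wraparound.
void extend_notification(zc_notification &n, uint32_t completed_id)
{
    n.hi = std::max(n.hi, completed_id);
}
```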
The control message should be handled only if the user
passes a buffer for it.
An error queue request must be processed before data.

Signed-off-by: Igor Ivanov <[email protected]>
@swx-jenkins5

Can one of the admins verify this patch?


4 participants