Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add socket_path option to enable unix socket traffic to dogstatsd6 #199

Merged
merged 5 commits into from
Jun 6, 2017

Conversation

xvello
Copy link
Contributor

@xvello xvello commented May 16, 2017

What does this PR do?

dogstatsd6 will allow custom metrics to be sent via a datagram unix socket instead of UDP (see DataDog/datadog-agent#252). This PR add a use_unix_socket option to the DogstatsD object in order to allow using this communication channel.

Motivation

Containers make it harder to reach the dogstatsd server from client applications and we don't have a future-proof network solution that could work across all orchestrators. This allows to bypass the network stack altogether and use a host-local protocol. See DataDog/docker-dd-agent#195 for more context.

As unix datagram and UDP share the same semantics, the patch is pretty small and unobstrusive.

Socket connection is set to non blocking: if the server is unresponsive, the packets are dropped, mirroring what would happen with UDP.

Additional Notes

Still WIP as we need to:

  • Work on better error handling if the server stops (currently, the exceptions are caught and packets are dropped, but CPU usage increases significantly)

@xvello xvello changed the title WIP add use_unix_socket option to enable unix socket traffic to dogstatsd6 [WIP] add use_unix_socket option to enable unix socket traffic to dogstatsd6 May 16, 2017
Copy link

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a small comment otherwise 👍 for me

@@ -24,7 +24,8 @@ class DogStatsd(object):
OK, WARNING, CRITICAL, UNKNOWN = (0, 1, 2, 3)

def __init__(self, host='localhost', port=8125, max_buffer_size=50, namespace=None,
constant_tags=None, use_ms=False, use_default_route=False):
constant_tags=None, use_ms=False, use_default_route=False,
use_unix_socket=None):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call this param socket_path

@masci masci requested a review from yannmh May 17, 2017 07:09
xvello added 2 commits May 17, 2017 11:53
this mimics UDP behaviour for unix socket implementation
@xvello
Copy link
Contributor Author

xvello commented May 17, 2017

Added an exception handling for socket.timeout (silently drop the packets), but I need some python-fu on how to handle dogstatsd crashing (generic socket.error).

Currently, the exception is graceful handled and traffic resumes as soon as dogstatsd restarts, but the CPU usage is significant (trying on a new socket for every packet). Should we detect repeat failures and throttle?

@yannmh
Copy link
Member

yannmh commented May 24, 2017

@xvello is it still a WIP?

@xvello xvello changed the title [WIP] add use_unix_socket option to enable unix socket traffic to dogstatsd6 add use_unix_socket option to enable unix socket traffic to dogstatsd6 May 26, 2017
@xvello
Copy link
Contributor Author

xvello commented May 26, 2017

@yannmh removed the WIP flag as stress test went well
I'd need some input from you on whether to include a "connection failed N times, dropping packets for a second" mechanism (or similar), to avoid high CPU usage when the statsd server is down.
Could either be in this PR or in another down the road.

@@ -265,6 +282,9 @@ def _send_to_server(self, packet):
try:
# If set, use socket directly
(self.socket or self.get_socket()).send(packet.encode(self.encoding))
except socket.timeout:
# dogstatsd is overflowing, drop the packets (mimicks the UDP behaviour)
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? From what I understand, we should never get a timeout on non-blocking mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python socket implementation implements non-blocking send as a zero-timeout send and sends a socket.timeout exception if the write does not return immediately (the queue, managed by the kernel, is full), I chose to silently drop the packet, mirroring what UDP would do. With dogstatsd6's goroutine-based intake, this should only happen if the CPU is saturated.

If needed, the socket queue length is configurable via sysctl net.unix.max_dgram_qlen (10 on my test machines).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to silently drop the packet, mirroring what UDP would do

@xvello I wonder if we should still add some logging here. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could turn into log spamming really fast. But I'll open another PR to add a fail_on_error option

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could turn into log spamming really fast

That's a good point!

@yannmh
Copy link
Member

yannmh commented May 26, 2017

I am in favor of keeping the code very simple, i.e. avoid adding any back-off mechanism.

Do you know what errors to expect? error: [Errno 61] Connection refused sounds like one, are there more?
If the list is relatively small, maybe we can filter those and avoid the systematic retry we do on:

self.get_socket().send(packet.encode(self.encoding))
.
It may also not be necessary to re-create the socket.

@xvello
Copy link
Contributor Author

xvello commented May 29, 2017

Here are the errors I could produce:

socket.connect

  • [Errno 2] No such file or directory: the socket is not created yet (dogstatsd not started?)
  • [Errno 111] Connection refused: the socket file is present but no dogstatsd is listening (server killed -9 and didn't unlink the socket file)

In these cases, we should retry for the next packet

socket.send

  • [Errno 111] Connection refused once then [Errno 107] Transport endpoint is not connected if dogstatds stops listening (unlinking the socket or not). To resume communication once dogstatsd is ready again, we need to recreate the connection, even if the filename didn't change (the receiver changed) : we should drop the packet, but retry later to re-create the connection to catch the new dogstatsd. Indeed, trying twice per packet should be avoided in this case.

  • socket.timeout if the queue is full (dogstatsd not reading fast enough) : I'm in favor of dropping the packet

@xvello xvello changed the title add use_unix_socket option to enable unix socket traffic to dogstatsd6 add socket_path option to enable unix socket traffic to dogstatsd6 Jun 6, 2017
@xvello
Copy link
Contributor Author

xvello commented Jun 6, 2017

Tested one last time communicating with dogstatsd6, the packet drop logic works, so does recovery after dsd restart. Let's merge that!
@yannmh do we have something else waiting for a release?

@xvello xvello merged commit 0d89b5a into master Jun 6, 2017
@xvello xvello deleted the xvello/unix_socket branch June 6, 2017 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants