Bring down networking-related costs #55

Open
clezag opened this issue Jan 23, 2024 · 1 comment

clezag commented Jan 23, 2024

Browsing the AWS cost center, I've noticed that we've spent an unreasonable chunk of change on "EC2-Other" charges, which on closer inspection turned out to be almost exclusively networking-related:

[screenshot: AWS cost breakdown showing the EC2-Other charges]

Note that this is almost half our current AWS bill and we aren't really doing anything heavy yet.

Judging from this graph, it looks like it was most likely a temporary increase in usage:
[screenshot: network usage graph showing a temporary spike]

@christian-roggia @Luscha do you have an idea how to mitigate this?
Could this be caused by repeated image pulling during error states (imagePullPolicy: Always)?
Maybe the Filebeat/Elasticsearch log shipping during bursts of error logging?
Or do we have to rethink our networking setup?

@clezag clezag added this to the 1st Service on Production milestone Jan 23, 2024
christian-roggia (Contributor) commented

Almost always when we are looking at spikes in network-related costs one of the following is the root cause:

  • monitoring
  • logging
  • error retries with no exponential back-off (see the sketch after this list)
  • infinite loops
  • heavy network operations
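
On the retry point, here is a minimal sketch of what "with back-off" means in practice. The endpoint, timings, and attempt count are hypothetical, not taken from our services:

```python
import random
import time
import urllib.error
import urllib.request

# Hypothetical health-check endpoint, used only to illustrate the pattern.
URL = "https://api.example.com/health"

def fetch_with_backoff(url: str, max_attempts: int = 5) -> bytes:
    """Retry a flaky request with exponential back-off and jitter.

    A tight retry loop with no back-off can move a surprising amount of
    billable traffic while a dependency stays down.
    """
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == max_attempts - 1:
                raise
            # Wait 1s, 2s, 4s, 8s ... plus jitter so replicas don't retry in lockstep.
            time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("unreachable")
```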

Image pulling should be fine, especially considering that Kubernetes internally applies an exponential back-off in case of failure and will reuse locally downloaded images unless Always is used as the image pull policy instead of IfNotPresent.
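
If you want to rule this out quickly, something like the following audit could help. This is a sketch using the official Kubernetes Python client and assumes local kubeconfig access to the cluster:

```python
# Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# List every container that re-pulls its image on each pod start; this only
# matters cost-wise if those pods restart often (e.g. crash loops).
for pod in v1.list_pod_for_all_namespaces().items:
    for c in pod.spec.containers:
        if c.image_pull_policy == "Always":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                  f"container {c.name} pulls {c.image} on every start")
```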

I would, however, verify that the current setup keeps communication within the VPC wherever possible, rather than going out to the public internet just to come back into the AWS network immediately after.

While investigating please take into account the following:

  • ingress data flow is usually free
  • egress is usually expensive if the data goes out of the AWS network
  • data transfer within the same region is usually quite cheap

https://aws.amazon.com/ec2/pricing/on-demand/
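
As a purely illustrative back-of-envelope (the per-GB rates below are assumptions, not quoted prices; the real rates are on the pricing page above), the gap between internet egress and in-region traffic adds up quickly:

```python
# Illustrative only: assumed unit prices, not an AWS quote.
gb_per_day = 500                   # hypothetical daily traffic volume
internet_egress_usd_per_gb = 0.09  # assumed rate for traffic leaving AWS
intra_region_usd_per_gb = 0.01     # assumed rate for cross-AZ traffic in-region

print(f"out to the internet:    ~${gb_per_day * internet_egress_usd_per_gb:.2f}/day")
print(f"kept within the region: ~${gb_per_day * intra_region_usd_per_gb:.2f}/day")
```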

This incident tells me it's probably a good time to set up proper monitoring and alerts. Root cause analysis is significantly easier when you have proper observability of the system and metrics you can work with.
