This repository has been archived by the owner on Sep 25, 2019. It is now read-only.

Out Of Memory errors #58

Open
Splaktar opened this issue Oct 7, 2015 · 11 comments

Comments

@Splaktar
Contributor

Splaktar commented Oct 7, 2015

There are a number of reasons that the Hub stops responding to requests (OOM, exceptions, hangs, etc.). The goal isn't to solve every one of these problems, because more will likely be introduced in the future via open source contributions and limited automated testing.

We need to set up a system to improve the Hub's HA, ideally via pm2 and possibly other packages. This is a common problem with Node.js projects, and there are many examples and guides for handling it. We just need someone to set it up, test it, and finally work with me on deployment.
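As a starting point, a minimal pm2 config along these lines could cover the restart handling; the app name, entry script, and memory limits below are placeholders, not values from this repo:

// ecosystem.config.js: a pm2 sketch only; adjust name/script to match the actual Hub setup
module.exports = {
  apps: [{
    name: 'gdgx-hub',                      // placeholder app name
    script: 'server.js',                   // placeholder entry point
    exec_mode: 'fork',
    instances: 1,
    autorestart: true,                     // restart after crashes such as OOM aborts
    max_memory_restart: '400M',            // recycle the process before it hits the OOM ceiling
    node_args: '--max_old_space_size=512'  // cap V8's old-space heap
  }]
};

Then pm2 start ecosystem.config.js runs it, and pm2 startup wires it into the init system so it survives reboots.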


On Oct 6th at 6am the Hub started responding to all requests with a 502 error, and the console just logged each request and then timed out processing it after 2 seconds. This appears to be different from the previous issue with a resource leak, which left a clear exception.

I've restarted the Hub and it's back online.

We need to spin up another Hub node and connect it to the load balancer so that if one goes down, we don't lose service. Then we probably also need to enable Stackdriver Monitoring and alerts so that we get emailed when the Health Checks fail for a node under the load balancer. We currently get no such notification.

@Splaktar Splaktar added the bug label Oct 7, 2015
@Splaktar Splaktar added this to the v0.1.0 milestone Oct 7, 2015
@tasomaniac
Member

What you mentioned is a workaround, right? Would we really need more than one server if we didn't have these kinds of problems? I guess we don't have that many users.

@Splaktar
Contributor Author

Happened again tonight:

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)

@tasomaniac it's not so much a workaround as it is a proper HA configuration.

@Splaktar Splaktar modified the milestones: v0.2.0, v0.1.0 Nov 1, 2015
@Splaktar
Contributor Author

Splaktar commented Nov 1, 2015

Happened again a few days ago, but I didn't have time to investigate or collect a stack trace.

@tasomaniac
Member

:(


@Splaktar
Contributor Author

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-28-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.7
npm ERR! npm  v2.11.3
npm ERR! code ELIFECYCLE
npm ERR! [email protected] startProd: `grunt serve:dist`
npm ERR! Exit status 134
npm ERR! 
npm ERR! Failed at the [email protected] startProd script 'grunt serve:dist'.
npm ERR! This is most likely a problem with the gdgx-hub package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     grunt serve:dist
npm ERR! You can get their info via:
npm ERR!     npm owner ls gdgx-hub
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR!     /opt/hub/npm-debug.log

@Splaktar
Contributor Author

Here's the latest status from when the Hub stopped responding and started giving 502 errors:

  System information as of Mon Nov 30 10:41:11 UTC 2015
  System load:  0.0               Processes:           4398
  Usage of /:   24.7% of 9.69GB   Users logged in:     0
  Memory usage: 49%               IP address for eth0: 10.111.216.151
  Swap usage:   0%
  => There are 4318 zombie processes.

4318 zombies do not look good... but the resources don't seem to be otherwise bottlenecked (RAM and disk are fine).
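For what it's worth, here's a quick way to watch that zombie count from Node (just a diagnostic sketch, not something in the Hub codebase):

// zombie-check.js: counts processes currently in the Z (zombie) state via ps
var exec = require('child_process').exec;

exec("ps -eo stat= | grep -c '^Z'", function (err, stdout) {
  // grep exits non-zero when there are no matches, so treat an error as zero zombies
  var zombies = parseInt(stdout, 10) || 0;
  console.log('zombie processes: ' + zombies);
});

Something like that could be polled by Stackdriver or a cron job so we get alerted before the count climbs that high.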

@Splaktar
Contributor Author

Splaktar commented Dec 1, 2015

OK, I've spun up a second Hub node (a small instance; I tried a micro instance but ran into ENOMEM errors with grunt).

Now clustering and load balancing seem to be working:

hub:

[1510] worker-2317 just said hi. Replying.
[1510] was master: true, now master: true

hub-backup:

[2317] Risky is up. I'm worker-2317
[2317] Cancel masterResponder
[2317] was master: false, now master: false

Then kill hub:

[2317] worker-1510 has gone down...
[2317] was master: false, now master: true

And the handoff is seamless with no interruption to traffic. I tried a few iterations of this in both directions and it seemed to work great.

This does not solve the fact that the Hub instances sometimes run out of memory or otherwise stop responding, but it should reduce the impact. I've started to set up Stackdriver monitoring to alert us when one of them stops responding, but I haven't completed that process yet.
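For context, the handoff pattern in those logs boils down to a peer heartbeat plus self-promotion. Roughly like this (a sketch only; the peer hostname, port, and /ping route are placeholders, not the Hub's actual election code):

// failover-sketch.js: each node polls its peer and promotes itself to master when the peer stops answering
var http = require('http');

var isMaster = false;
var peer = process.env.PEER_HOST || 'hub-backup';  // placeholder peer hostname

function promote() {
  if (!isMaster) {
    console.log('[' + process.pid + '] ' + peer + ' has gone down...');
    console.log('[' + process.pid + '] was master: false, now master: true');
    isMaster = true;
  }
}

function checkPeer() {
  var req = http.get({ host: peer, port: 8080, path: '/ping' }, function (res) {
    res.resume();  // peer answered; keep the current role
  });
  req.setTimeout(2000, function () { req.abort(); });
  req.on('error', promote);  // connection refused, reset, or aborted after the timeout
}

setInterval(checkPeer, 5000);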

@Splaktar
Contributor Author

Still seeing OOM errors bringing the server down:

FATAL ERROR: node::smalloc::Alloc(v8::Handle<v8::Object>, size_t, v8::ExternalArrayType) Out Of Memory
Aborted (core dumped)
npm ERR! Linux 3.19.0-39-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "runScript" "startProd"
npm ERR! node v0.12.9
npm ERR! npm  v2.14.9
npm ERR! code ELIFECYCLE
npm ERR! [email protected] startProd: `grunt serve:dist`
npm ERR! Exit status 134

The hub-backup also stopped responding to requests, but it did not leave any kind of stack trace, crash, or logs. I really want to move this to a managed service, as this is far too much trouble at the moment.

@Splaktar Splaktar changed the title Hub stops responding to requests Implement pm2 recovery/restart Dec 30, 2015
@Splaktar
Contributor Author

Both VMs locked up last night, so even pm2 wouldn't have helped. We may need to go further and implement Kubernetes to orchestrate the containers and restart them when they fail health checks.

@Splaktar Splaktar changed the title Implement pm2 recovery/restart Out Of Memory errors Jun 10, 2016
@Splaktar Splaktar modified the milestones: v0.2.0, v0.2.1 Aug 29, 2016
@Splaktar
Contributor Author

Splaktar commented Mar 13, 2017

If we implement #100, then this should be much less of an issue. It's also been many months since these errors occurred, though I think that's due to the auto-restart resolution in #88.

@Splaktar Splaktar modified the milestones: v0.3.0, v0.2.1 Mar 17, 2017