grow-cluster / create-cluster shouldn't return until finished initializing #210

Open
jvivian opened this issue Jul 20, 2016 · 3 comments

Comments

@jvivian
Contributor

jvivian commented Jul 20, 2016

Suggestion from @briandoconnor

There is a status endpoint for determining if it's finished initializing.

@hannes-ucsc
Contributor

Are you referring to the instance status checks? AFAIK, those aren't conclusive since they don't check on anything going on inside the instance. And how could they, unless there were code running inside the VM that reports the status to the host? AFAIK, there is no such code. Waiting for the status checks to report "ok" only works by coincidence, because the checks typically take longer to initialize than the instance takes to boot. It is not a reliable mechanism. If you put a sleep 3600 into the boot process, the status checks would still report "ok" after five minutes, I think.

CGCloud waits until the last cloud-init stage is finished, which happens late in the boot process, close to when rc.local is run. Even if we wait for the init process to enter the final run level, the daemons started by init will still go through their own initialization asynchronously. This includes, for example, the Mesos slave daemons registering with the master.

To deal with this asynchrony, the unit tests ask the master daemon (Spark or Mesos) to report the number of slaves. We could move that functionality into create-cluster. Then again, I wouldn't want it to get hung up on a few sticky instances when I'm creating a cluster of hundreds of instances. So that check would have to leave some wiggle room, requiring only, say, 90% of the instances to join.
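A minimal sketch of what such a check with wiggle room could look like. This is not CGCloud code: `wait_for_slaves` and its parameters are hypothetical, and `count_slaves` stands in for whatever callable asks the Spark or Mesos master how many slaves have registered.

```python
import math
import time


def wait_for_slaves(count_slaves, expected, min_fraction=0.9,
                    timeout=600.0, poll_interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until at least min_fraction of the expected slaves have joined.

    count_slaves is a hypothetical zero-argument callable returning the
    number of slaves currently registered with the master.
    Returns the last observed count, or raises TimeoutError.
    """
    # Number of slaves that must join before we consider the cluster usable.
    required = math.ceil(expected * min_fraction)
    deadline = clock() + timeout
    while True:
        seen = count_slaves()
        if seen >= required:
            return seen
        if clock() >= deadline:
            raise TimeoutError(
                "only %d of %d slaves joined (%d required)"
                % (seen, expected, required))
        sleep(poll_interval)
```

The timeout keeps a handful of stuck instances from blocking the command forever, and the returned count lets the caller report how many slaves actually joined.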

@hannes-ucsc
Contributor

ping @briandoconnor

@hannes-ucsc
Contributor

hannes-ucsc commented Jul 28, 2016

Just now, I saw an instance get stuck while booting, before it had even started ssh, while the instance status check said it was ok. I take this as further evidence that the status checks are inconclusive as far as boot completion is concerned.
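For comparison, one signal that does depend on something running inside the instance is whether the ssh port accepts connections. A rough sketch using only the Python standard library follows; `port_is_open` is a hypothetical helper, not part of CGCloud. An open port 22 still doesn't prove that boot has completed, only that the instance got at least as far as starting ssh.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Called as `port_is_open(address, 22)`, this would have returned False for the stuck instance described above, even though its status check said ok.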

2 participants