Improve working with disabled MyClouds #27

Open
martinheidegger opened this issue Jan 20, 2022 · 3 comments


@martinheidegger
Contributor

martinheidegger commented Jan 20, 2022

When a MyCloud is disabled, the CLI currently shows errors during all new operations, as in #26, because the underlying code cannot look up information about a MyCloud in the disabled state.

Currently re-enabling is a multi-step manual process that involves going into the AWS console.

I am thinking of 3 steps to improve the situation:

  1. If the CLI lambda returns an empty result, we should use an AWS command to check whether the lambda in question has been disabled and, if it has, throw an error that says so.
  2. Find a way to avoid the manual steps when re-enabling a MyCloud.
  3. If the CLI runs into a "mycloud disabled" error, it should prompt the user, asking whether they want to re-enable it.
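
The three steps above could fit together roughly as in the following sketch. Everything in it is hypothetical: `diagnose`, `invokeWithDisabledCheck`, `MyCloudDisabledError`, and the `Deps` interface are illustrative names, not existing tradleconf APIs. It also assumes that a disabled MyCloud can be recognized by a reserved concurrency of 0 on the Lambda, which this issue does not confirm; the real check may need a different signal.

```typescript
// Hypothetical sketch; none of these names exist in tradleconf.

/** Error raised when the MyCloud behind the CLI is disabled. */
class MyCloudDisabledError extends Error {
  functionName: string;
  constructor(functionName: string) {
    super(`MyCloud is disabled (function "${functionName}" cannot run)`);
    this.name = "MyCloudDisabledError";
    this.functionName = functionName;
  }
}

type Diagnosis = "ok" | "disabled" | "unknown-empty-result";

function isEmpty(result: unknown): boolean {
  return result === undefined || result === null || result === "";
}

/**
 * Step 1: classify a Lambda result.
 * ASSUMPTION: "disabled" shows up as reserved concurrency 0.
 */
function diagnose(result: unknown, reservedConcurrency: number | undefined): Diagnosis {
  if (!isEmpty(result)) return "ok";
  return reservedConcurrency === 0 ? "disabled" : "unknown-empty-result";
}

/** Injected dependencies, so the flow can be tested without AWS access. */
interface Deps {
  invoke(functionName: string): Promise<unknown>;
  getReservedConcurrency(functionName: string): Promise<number | undefined>;
  confirm(question: string): Promise<boolean>;
  enable(): Promise<void>;
}

async function invokeWithDisabledCheck(deps: Deps, functionName: string): Promise<unknown> {
  const result = await deps.invoke(functionName);
  if (!isEmpty(result)) return result;

  // Only ask AWS about the function state when the result is empty.
  const concurrency = await deps.getReservedConcurrency(functionName);
  if (diagnose(result, concurrency) !== "disabled") return result;

  // Step 3: offer to re-enable instead of failing with an obscure error.
  const wantsEnable = await deps.confirm("This MyCloud is disabled. Re-enable it now?");
  if (!wantsEnable) throw new MyCloudDisabledError(functionName);

  // Step 2: a single automated call replaces the manual console steps.
  await deps.enable();
  return deps.invoke(functionName);
}
```

Keeping `diagnose` as a pure function means the "is it disabled?" decision can be unit tested on its own, while the AWS lookup and the prompt stay behind the `Deps` interface.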
@urbien
Member

urbien commented Jan 20, 2022

I agree with point 3, which is easy to fix, but how big of a project is the rest?
I suggest trying a different take first.
Why do we even have a disable function? It is because of Lambda cold starts: it takes AWS about 10 seconds to start a Lambda if it was not accessed recently, that is, if it is not warm. You can think of it this way:

  • A warm Lambda reuses the same VM and Docker container, while a cold one needs to start a new one
  • A warm Lambda has Node.js started already
  • A warm Lambda has already done everything we do at Lambda initialization time and cached it in the file system and in memory

So we are running a special Lambda that constantly pings the other Lambdas so that they stay warm.
This is an expensive process (@spwilko, is it $50 a month?), so tradleconf disable stops it.
disable also stops the scheduler Lambda, which runs every minute and runs various jobs that cost us money.

The alternative solution for the warmup is a configuration option released by AWS called "provisioned concurrency". We can play with it and see how much it costs per month. If it is cheap enough, we may not need the disable function at all, so there would be no need to fix it. But we may need a "slow down" feature, which would decrease the provisioned concurrency and reduce the rate at which jobs run to once a day.
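
To get a first feeling for the monthly cost before experimenting, a back-of-the-envelope estimate could look like the sketch below. It relies on the fact that provisioned concurrency is billed per GB-second of configured capacity for as long as it is enabled; requests and execution duration are billed on top and are not covered here. The function names are hypothetical, and the price is deliberately a parameter: it has to be taken from the AWS pricing page for the region in question.

```typescript
/** Inputs for a rough provisioned-concurrency estimate (hypothetical helper). */
interface ProvisionedConfig {
  concurrency: number;      // number of pre-initialized execution environments
  memoryMb: number;         // configured memory per environment
  pricePerGbSecond: number; // ASSUMPTION: look this up per region, do not hardcode
}

const SECONDS_PER_DAY = 24 * 60 * 60;

/** Cost of keeping the environments provisioned for one full day. */
function provisionedCostPerDay(cfg: ProvisionedConfig): number {
  const gb = cfg.memoryMb / 1024;
  return cfg.concurrency * gb * SECONDS_PER_DAY * cfg.pricePerGbSecond;
}

/** Same estimate scaled to a month; 30 days is a simplification. */
function provisionedCostPerMonth(cfg: ProvisionedConfig, days: number = 30): number {
  return provisionedCostPerDay(cfg) * days;
}

// Usage with a placeholder price (NOT a real AWS quote):
const estimate = provisionedCostPerMonth({
  concurrency: 1,
  memoryMb: 512,
  pricePerGbSecond: 0.00001,
});
console.log(`Estimated monthly cost: $${estimate.toFixed(2)}`);
```

Because the cost is linear in `concurrency`, a "slow down" feature that halves the provisioned concurrency would also halve this part of the bill.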

Hybrid strategy - dormant, but not cold

  1. Keep one Lambda (onMessage) provisioned at low concurrency (saving money) and start warming up the other Lambdas upon the first customer request. At this point, increase the frequency of scheduler jobs from once a day to once a minute. This fits the pattern of mostly idle MyClouds with occasional testing. After testing has ended, as evidenced by an hour without requests to the onMessage Lambda, we switch back to dormant.
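
The dormant/active switch described above boils down to a small decision function. The sketch below is one possible reading of it, with hypothetical names (`planFor`, `Plan`) and with the one-hour idle threshold and the two scheduler rates taken from the description; how the plan is then applied to the warmer and the scheduler is left out.

```typescript
type Mode = "dormant" | "active";

/** What the MyCloud should be doing right now (hypothetical shape). */
interface Plan {
  mode: Mode;
  schedulerIntervalMs: number; // how often scheduler jobs run
  warmOtherLambdas: boolean;   // whether the warmer pings the other Lambdas
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

/**
 * Decide the mode from the time of the last onMessage request.
 * `lastRequestAtMs` is undefined when no request has been seen yet.
 */
function planFor(lastRequestAtMs: number | undefined, nowMs: number): Plan {
  const idleForMs = lastRequestAtMs === undefined ? Infinity : nowMs - lastRequestAtMs;

  // An hour without requests is taken as "testing has ended".
  if (idleForMs >= HOUR_MS) {
    return { mode: "dormant", schedulerIntervalMs: DAY_MS, warmOtherLambdas: false };
  }
  return { mode: "active", schedulerIntervalMs: MINUTE_MS, warmOtherLambdas: true };
}
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the transition rule easy to test.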

MicroVM Snapshotting - future solution for "cold start"

In cloudpal, we plan to use the MicroVM snapshotting mechanism to avoid cold starts. Snapshotting has been slowly productized for the Firecracker MicroVM (which underlies AWS Lambda) and is now quite reliable. But it is not clear when AWS is going to start using it, as there is also a challenge of uniqueness / randomness: each restored snapshot is identical (problem described here).

Snapshotting can be further significantly improved by a super-awesome OS paging mechanism called REAP. Compared to baseline snapshotting, REAP slashes cold-start delays by 3.7x. It is tested with the help of Hive. REAP was a research project and at this point is not in active development.

But there is more recent work, called SnapFaaS, that analyzes the limitations of REAP and offers an alternative approach that claims a significant improvement over REAP. Still, according to the paper, the lower bound for this optimization is a 15ms cold start. Language-specific sandboxing runtimes, like WASM, can achieve cold starts of 10-20 microseconds, as the same paper states, and some CDNs already have such runtimes in production. But they are not generic (we can't run there unless we rewrite MyCloud in Rust) and, more importantly, they provide a much lower level of protection from the host (cloud provider).
SnapFaaS seems to be in active development at Princeton but still needs to solve the randomness problem.

@spwilko
Member

spwilko commented Jan 20, 2022

Lambda costs for a single MyCloud seem to be < $0.15 per day.
Here's a breakdown for South America for 1 day:

Service                  Cost ($)
DynamoDB                 0.35
S3                       0.21
Key Management Se...     0.17
Lambda                   0.13
CloudWatch               0.10
Total                    0.96

@martinheidegger
Contributor Author

martinheidegger commented Jan 21, 2022

but how big of a project is the rest?

The suggested points build upon each other and can be completed step by step. Each is a small step that is easy to do, which makes them good tasks for people starting with mycloud/tradleconf development.

In the meantime I also thought of further ways to improve this situation beyond the initial tasks:

  • Allow specifying a reason for disabling that can be shown in the CLI, so the admin can later see why they (or a coworker) disabled a specific MyCloud.
  • Extend tim and the protocol to show the user information when a MyCloud is disabled.
Offtopic: Slowdown / Why disable?

we may need to have a "slow down" feature

"Slow down" is a performance knob. I think that is an interesting thought, but it should be separate from, and in addition to, a disable switch. After all, when I re-enable a MyCloud I want it to run in the same configuration as before it was disabled.

We can make a different issue for this?!

Why do we even have disable function?

While the cost reasons are valid, I think there are two other valid reasons:

  • To disable a MyCloud in an "emergency": e.g. if one of the enabled products is destructive, or while waiting for an update that fixes a very problematic security issue.
  • Scheduled tasks can be not just expensive but also annoying (e.g. sending out emails), and pausing them may be a comfortable thing to do.

The discussion on "why disable" or "how to disable" is something we should have but maybe let's do that in a separate issue?
