diff --git a/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md b/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md index 5df46055..fb6c9289 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md @@ -1,6 +1,6 @@ --- title: "Analyzing costs" -weight: 145 +weight: 110 --- In this section we will use AWS Cost Explorer to look at the costs of our EMR cluster, including the underlying EC2 Spot Instances. diff --git a/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md b/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md index 59bfcb7f..91091e14 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md @@ -1,6 +1,6 @@ --- title: "Automations and monitoring" -weight: 110 +weight: 145 --- When adopting EMR into your analytics flows and data processing pipelines, you will want to launch EMR clusters and run jobs in a programmatic manner. There are many ways to do so with AWS SDKs that can run in different environments such as Lambda functions invoked by AWS Data Pipeline or AWS Step Functions, with third-party tools like Apache Airflow, and more.
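Since the paragraph above stays at the level of "many ways to do so", here is a minimal sketch of the SDK route. It assembles a `RunJobFlow` request for an instance-fleet cluster in plain Python; every concrete value (release label, instance types, capacities, bucket and subnet names) is an illustrative placeholder, not a workshop value, and the commented-out line shows where an AWS SDK client such as boto3 would submit it:

```python
def build_cluster_request(name, log_uri, subnet_id):
    """Assemble a minimal EMR RunJobFlow request for a Spot-based instance-fleet cluster.

    All concrete values (release label, instance types, capacities) are
    placeholders for illustration; adapt them to your workload.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.5.0",
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_uri,
        "Instances": {
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": False,
            "InstanceFleets": [
                {   # Master node on On-Demand for stability
                    "InstanceFleetType": "MASTER",
                    "TargetOnDemandCapacity": 1,
                    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                },
                {   # Task fleet on Spot, diversified across instance sizes
                    "InstanceFleetType": "TASK",
                    "TargetSpotCapacity": 32,
                    "InstanceTypeConfigs": [
                        {"InstanceType": "r5.xlarge", "WeightedCapacity": 4},
                        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 8},
                    ],
                },
            ],
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request("spot-spark-demo", "s3://my-bucket/logs/", "subnet-12345678")
# Inside a Lambda function or a pipeline task, an SDK client would submit it:
# boto3.client("emr").run_job_flow(**request)
print(request["Instances"]["InstanceFleets"][1]["TargetSpotCapacity"])  # → 32
```

The same request dict can be submitted from a Lambda function, a Step Functions task, or an Airflow operator, which is what makes the SDK route convenient for pipelines.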
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md b/content/running_spark_apps_with_emr_on_spot_instances/calltoaction.md similarity index 70% rename from content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md rename to content/running_spark_apps_with_emr_on_spot_instances/calltoaction.md index 2e7428e7..8b877911 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/calltoaction.md @@ -1,26 +1,20 @@ ---- -title: "Conclusions and cleanup" -weight: 150 ---- - -**Congratulations!** you have reached the end of the workshop. In this workshop, you learned about the need to be flexible with EC2 instance types when using Spot Instances, and how to size your Spark executors to allow for this flexibility. You ran a Spark application solely on Spot Instances using EMR Instance Fleets, you verified the results of the application, and saw the cost savings that you achieved by running the application on Spot Instances. - - -#### Cleanup - -Select the correct tab, depending on where you are running the workshop: -{{< tabs name="EventorOwnAccount" >}} - {{< tab name="In your own account" include="cleanup_ownaccount" />}} - {{< tab name="In an AWS event" include="cleanup_event.md" />}} -{{< /tabs >}} - - -#### Thank you - -We hope you found this workshop educational, and that it will help you adopt Spot Instances into your Spark applications running on Amazon EMR, in order to optimize your costs. -If you have any feedback or questions, click the "**Feedback / Questions?**" link in the left pane to reach out to the authors of the workshop. - -#### Other Resources: -Visit the [**Amazon EMR on EC2 Spot Instances**](https://aws.amazon.com/ec2/spot/use-case/emr/) page for more information, customer case studies and videos. 
-Read the blog post: [**Best practices for running Apache Spark applications using Amazon EC2 Spot Instances with Amazon EMR**](https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-spark-applications-using-amazon-ec2-spot-instances-with-amazon-emr/) -Watch the AWS Online Tech-Talk: [**Best Practices for Running Spark Applications Using Spot Instances on EMR - AWS Online Tech Talks**](https://www.youtube.com/watch?v=u5dFozl1fW8) +--- +title: "Call to Action" +weight: 170 +--- + +**Congratulations!** You have reached the end of the workshop. In this workshop, you learned about the need to be flexible with EC2 instance types when using Spot Instances, and how to size your Spark executors to allow for this flexibility. You ran a Spark application solely on Spot Instances using EMR Instance Fleets, you verified the results of the application, and you saw the cost savings that you achieved by running the application on Spot Instances. + +#### Thank you + +We hope you found this workshop educational, and that it will help you adopt Spot Instances into your Spark applications running on Amazon EMR, in order to optimize your costs. +If you have any feedback or questions, click the "**Feedback / Questions?**" link in the left pane to reach out to the authors of the workshop. + +#### Other Resources: +Visit the [**Amazon EMR on EC2 Spot Instances**](https://aws.amazon.com/ec2/spot/use-case/emr/) page for more information, customer case studies and videos.
+
+Read the blog post: [**Best practices for running Apache Spark applications using Amazon EC2 Spot Instances with Amazon EMR**](https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-spark-applications-using-amazon-ec2-spot-instances-with-amazon-emr/)
+
+Watch the AWS Online Tech-Talk:
+
+{{< youtube u5dFozl1fW8 >}}
\ No newline at end of file
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/cleanup.md b/content/running_spark_apps_with_emr_on_spot_instances/cleanup.md new file mode 100644 index 00000000..844d8755 --- /dev/null +++ b/content/running_spark_apps_with_emr_on_spot_instances/cleanup.md @@ -0,0 +1,13 @@
+---
+title: "Cleanup"
+weight: 150
+---
+
+#### Cleanup
+
+Select the correct tab, depending on where you are running the workshop:
+{{< tabs name="EventorOwnAccount" >}}
+    {{< tab name="In your own account" include="cleanup_ownaccount" />}}
+    {{< tab name="In an AWS event" include="cleanup_event.md" />}}
+{{< /tabs >}}
+
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md b/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md index cd9e56a1..5b842dc1 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md @@ -5,8 +5,16 @@ disableToc: true hidden: true ---
-1. In the EMR Management Console, check that the cluster is in the **Terminated** state. If it isn't, then you can terminate it from the console.
+1. Go to the EMR Management Console and terminate the cluster.
2. Go to the [Cloud9 Dashboard](https://console.aws.amazon.com/cloud9/home) and delete your environment.
3. Delete the VPC you deployed via CloudFormation, by going to the CloudFormation service in the AWS Management Console, selecting the VPC stack (default name is Quick-Start-VPC) and clicking the Delete option.
Make sure that the deletion has completed successfully (this should take around 1 minute); the status of the stack will be DELETE_COMPLETE (the stack will move to the Deleted list of stacks).
4. Delete your S3 bucket from the AWS Management Console - choose the bucket from the list of buckets and hit the Delete button. This approach will also empty the bucket and delete all existing objects in the bucket.
-5. Delete the Athena table by going to the Athena service in the AWS Management Console, find the **emrworkshopresults** Athena table, click the three dots icon next to the table and select **Delete table**. \ No newline at end of file
+5. Delete the Athena table by going to the Athena service in the AWS Management Console, find the **emrworkshopresults** Athena table, click the three dots icon next to the table and select **Delete table**.
+6. Delete the CloudFormation stack for the FIS experiment templates. Run the following command:
+```
+aws cloudformation delete-stack --stack-name fis-spot-interruption
+```
+7. Delete the CloudFormation stack for tracking the Spot interruptions. Run the following command:
+```
+aws cloudformation delete-stack --stack-name track-spot-interruption
+```
\ No newline at end of file
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md index 5e946cab..b3c782e0 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md @@ -90,9 +90,6 @@ Some notable metrics: * **ContainerAllocated** - this represents the number of containers that are running on your cluster, on the Core and Task Instance Fleets. These would be the Spark executors and the Spark Driver.
* **MemoryAllocatedMB** & **MemoryAvailableMB** - you can graph them both to see how much memory the cluster is actually consuming for the wordcount Spark application out of the memory that the instances have. -#### Terminate the cluster -When you are done examining the cluster and using the different UIs, terminate the EMR cluster from the EMR management console. This is not the end of the workshop though - we still have some interesting steps to go. - #### Number of executors in the cluster With 32 Spot Units in the Task Instance Fleet, EMR launched either 8 * xlarge (running one executor) or 4 * 2xlarge instances (running 2 executors) or 2 * 4xlarge instances (running 4 executors), so the Task Instance Fleet provides 8 executors / containers to the cluster. The Core Instance Fleet launched one xlarge instance, able to run one executor. diff --git a/content/running_spark_apps_with_emr_on_spot_instances/simulating_recovery.md b/content/running_spark_apps_with_emr_on_spot_instances/simulating_recovery.md deleted file mode 100644 index b9d45750..00000000 --- a/content/running_spark_apps_with_emr_on_spot_instances/simulating_recovery.md +++ /dev/null @@ -1,34 +0,0 @@ ---- -title: "(Optional) simulating recovery" -weight: 149 ---- - -EMR replenishes the target capacity if some EC2 Instances failed, were terminated/stopped, or received an EC2 Spot Interruption. - -In this optional step, you will re-run the cluster and the Spark application, and terminate some of the Task Fleet instances in order to observe the recovery capabilities of EMR and Spark, and check that the application still completes successfully. Since it is not possible to simulate an EC2 Spot Interruption in an EMR cluster, we will have to manually terminate EC2 instances to receive a similar effect. - -{{% notice note %}} -This is an optional step that will take approximately 20 minutes more than the original running time of the workshop. 
Feel free to skip this step and go directly to the **Conclusions and cleanup** step. -{{% /notice %}} - -{{% notice info %}} -[Click here](https://aws.amazon.com/blogs/big-data/spark-enhancements-for-elasticity-and-resiliency-on-amazon-emr/) For an in-depth blog post about Spark's resiliency in EMR -{{% /notice %}} - -#### Step objectives: -1. Observe that EMR replenishes the target capacity if some EC2 Instances failed, were terminated/stopped, or received an EC2 Spot Interruption. -2. Observe that the Spark application running in your EMR cluster still completes successfully, despite losing executors due to instance terminations. - -#### Re-launch your cluster and application -1. In the EMR console, select the cluster that you launched in this workshop, and click the **Clone** button. -2. In the popup dialog **"Would you like to include steps"** select **Yes** and click **Clone**. -3. EMR console duplicated all the cluster settings for you - click the **Create cluster** button. -4. Refresh the Summary tab in the EMR console until the status of the cluster is **Running step** and the Master, Core and Task fleets are all in the **Running** state. - -#### Manually terminate some of the EMR Task Fleet nodes -1. Go to the EC2 console, and identify the instances in your Task Fleet. You can do so by using the following filter: **Key=aws:elasticmapreduce:instance-group-role & Value=TASK**. If you have other EMR clusters running in the account/region, make sure you identify your own cluster by further filtering according to its Name tag. -2. Randomly select half of the instances that were filtered, and click Actions -> Instance State -> Terminate -> **Yes, Terminate** - -#### Verify that EMR replenished the capacity, and that the step completed successfully -1. 
Within 2-3 minutes, refresh the EC2 console as well as the Task Fleet in the EMR console under the Hardware tab, and you should see new Task Fleet instances created by EMR to replenish the capacity, after you terminated the previous instances. -2. In the EMR console, go to the Steps tab. Refresh the tab until your application has reached the **Completed** status. Because some instances were terminated mid-run, the Step will still complete, but will take longer than you previously observed in the workshop, because Spark had to repeat some of the work. diff --git a/content/running_spark_apps_with_emr_on_spot_instances/spot_interruption_experiment.md b/content/running_spark_apps_with_emr_on_spot_instances/spot_interruption_experiment.md new file mode 100644 index 00000000..dc263865 --- /dev/null +++ b/content/running_spark_apps_with_emr_on_spot_instances/spot_interruption_experiment.md @@ -0,0 +1,111 @@ +--- +title: "Creating the Spot Interruption Experiment" +weight: 97 +--- + +In this section, you're going to start creating the experiment to [trigger the interruption of Amazon EC2 Spot Instances using AWS Fault Injection Simulator (FIS)](https://aws.amazon.com/blogs/compute/implementing-interruption-tolerance-in-amazon-ec2-spot-with-aws-fault-injection-simulator/). When using Spot Instances, you need to be prepared to be interrupted. With FIS, you can test the resiliency of your workload and validate that your application is reacting to the interruption notices that EC2 sends before terminating your instances. You can target individual Spot Instances or a subset of instances in clusters managed by services that tag your instances such as ASG, Fleet and EMR. + +You're going to use the CLI, so launch your terminal to run the commands included in this section. + +#### What do you need to get started? + +Before you start launching Spot interruptions with FIS, you need to create an experiment template. 
Here is where you define which resources you want to interrupt (targets), and when you want to interrupt the instance. + +You're going to use the following CloudFormation template which creates the IAM role (`FISSpotRole`) with the minimum permissions FIS needs to interrupt an instance, and the experiment template (`FISExperimentTemplate`) you're going to use to trigger a Spot interruption: + +``` +AWSTemplateFormatVersion: 2010-09-09 +Description: FIS for Spot Instances +Parameters: + InstancesToInterrupt: + Description: Number of instances to interrupt + Default: 3 + Type: Number + + DurationBeforeInterruption: + Description: Number of minutes before the interruption + Default: 2 + Type: Number + +Resources: + + FISSpotRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Statement: + - Effect: Allow + Principal: + Service: [fis.amazonaws.com] + Action: ["sts:AssumeRole"] + Path: / + Policies: + - PolicyName: root + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: 'ec2:DescribeInstances' + Resource: '*' + - Effect: Allow + Action: 'ec2:SendSpotInstanceInterruptions' + Resource: 'arn:aws:ec2:*:*:instance/*' + + FISExperimentTemplate: + Type: AWS::FIS::ExperimentTemplate + Properties: + Description: "Interrupt multiple random instances" + Targets: + SpotIntances: + ResourceTags: + ResourceTagKey: ResourceTagValue + Filters: + - Path: State.Name + Values: + - running + ResourceType: aws:ec2:spot-instance + SelectionMode: !Join ["", ["COUNT(", !Ref InstancesToInterrupt, ")"]] + Actions: + interrupt: + ActionId: "aws:ec2:send-spot-instance-interruptions" + Description: "Interrupt multiple Spot instances" + Parameters: + durationBeforeInterruption: !Join ["", ["PT", !Ref DurationBeforeInterruption, "M"]] + Targets: + SpotInstances: SpotIntances + StopConditions: + - Source: none + RoleArn: !GetAtt FISSpotRole.Arn + Tags: + Name: "FIS_EXP_NAME" + +Outputs: + FISExperimentID: + Value: !GetAtt FISExperimentTemplate.Id 
+```
+
+Here are some important notes about the template:
+
+* You can configure how many instances you want to interrupt with the `InstancesToInterrupt` parameter. The template defaults to interrupting **three** instances.
+* You can also configure how much time you want the experiment to run with the `DurationBeforeInterruption` parameter. By default, it's two minutes. This means that as soon as you launch the experiment, the instances will receive the two-minute Spot interruption warning.
+* The most important section is `Targets` in the experiment template. The template has two placeholders, `ResourceTagKey` and `ResourceTagValue`, which are the key/value of the tags used to choose the instances to interrupt. We're going to run a `sed` command to replace them with the proper values for this workshop.
+* Notice that instances are **chosen randomly**, and only those that are in the `running` state.
+
+#### Create the EC2 Spot Interruption Experiment with FIS
+
+Let's continue by creating the Spot interruption experiment template using CloudFormation. You can view the CloudFormation template (**fisspotinterruption.yaml**) at GitHub [here](https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/fis/fisspotinterruption.yaml).
To download it, you can run the following command:
+
+```
+wget https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/fis/fisspotinterruption.yaml
+```
+
+Now, simply run the following commands to create the FIS experiment:
+
+```
+export FIS_EXP_NAME=fis-spot-interruption
+sed -i -e "s#ResourceTagKey#aws:elasticmapreduce:instance-group-role#g" fisspotinterruption.yaml
+sed -i -e "s#ResourceTagValue#TASK#g" fisspotinterruption.yaml
+sed -i -e "s#FIS_EXP_NAME#$FIS_EXP_NAME#g" fisspotinterruption.yaml
+aws cloudformation create-stack --stack-name $FIS_EXP_NAME --template-body file://fisspotinterruption.yaml --capabilities CAPABILITY_NAMED_IAM
+aws cloudformation wait stack-create-complete --stack-name $FIS_EXP_NAME
+```
\ No newline at end of file
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/spot_interruption_fis.md b/content/running_spark_apps_with_emr_on_spot_instances/spot_interruption_fis.md new file mode 100644 index 00000000..5192d10e --- /dev/null +++ b/content/running_spark_apps_with_emr_on_spot_instances/spot_interruption_fis.md @@ -0,0 +1,94 @@
+---
+title: "Interrupting a Spot Instance"
+weight: 98
+---
+
+In this section, you're going to launch a Spot interruption using FIS and then verify that the capacity has been replenished and that EMR was able to continue running the Spark job. This will help you confirm the low impact on your workloads when you implement Spot effectively. Moreover, you can discover hidden weaknesses, and make your workloads fault-tolerant and resilient.
+
+#### (Optional) Re-Launch the Spark Application
+
+The Spark job takes around seven to eight minutes to finish, so by the time you arrive at this part of the workshop, the job is probably about to finish or has finished already. Here are the commands you need to run to re-launch the Spark job in EMR.
+
+First, you need to empty the results folder in the S3 bucket.
Run the following command (replace the bucket name with yours):
+
+```
+export S3_BUCKET=your_bucket_name
+aws s3 rm --recursive s3://$S3_BUCKET/results/
+```
+
+Then, get the EMR cluster ID and use it in place of `YOUR_CLUSTER_ID` in the following command to re-launch the Spark application:
+
+```
+aws emr add-steps --cluster-id YOUR_CLUSTER_ID --steps Type=CUSTOM_JAR,Name="Spark application",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[spark-submit,--deploy-mode,cluster,--executor-memory,18G,--executor-cores,4,s3://$S3_BUCKET/script.py,s3://$S3_BUCKET/results/]
+```
+
+Now go ahead and run the Spot interruption experiment before the job completes.
+
+#### Launch the Spot Interruption Experiment
+After creating the experiment template in FIS, you can start a new experiment to interrupt three Spot Instances (unless you changed the template). Run the following commands:
+
+```
+FIS_EXP_TEMP_ID=$(aws cloudformation describe-stacks --stack-name $FIS_EXP_NAME --query "Stacks[0].Outputs[?OutputKey=='FISExperimentID'].OutputValue" --output text)
+FIS_EXP_ID=$(aws fis start-experiment --experiment-template-id $FIS_EXP_TEMP_ID --no-cli-pager --query "experiment.id" --output text)
+```
+
+Wait around 30 seconds, and you should see that the experiment completes. Run the following command to confirm:
+
+```
+aws fis get-experiment --id $FIS_EXP_ID --no-cli-pager
+```
+
+At this point, FIS has triggered a Spot interruption notice, and in two minutes the instances will be terminated.
+
+Go to the CloudWatch Logs log group `/aws/events/spotinterruptions` to see which instances are being interrupted.
+
+You should see a log message like this one:
+
+![SpotInterruptionLog](/images/running-emr-spark-apps-on-spot/spotinterruptionlog.png)
+
+#### Verify that EMR Instance Fleet replenished the capacity
+
+Run the following command to see how many instances are currently running before the Spot interruption:
+
+```
+aws ec2 describe-instances --filters\
+ Name='tag:aws:elasticmapreduce:instance-group-role',Values='TASK'\
+ Name='instance-state-name',Values='running'\
+ | jq '.Reservations[].Instances[] | "Instance with ID:\(.InstanceId) launched at \(.LaunchTime)"'
+```
+
+You should see a list of instances with the date and time when they were launched, like this:
+
+```output
+"Instance with ID:i-06a82769173489f32 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-06c97b509c5e274e0 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-002b073c6479a5aba launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-0e96071afef3fc145 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-0a3ddb3903526c712 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-05717d5d7954b0250 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-0bcd467f88ddd830e launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-04c6ced30794e965b launched at 2022-04-06T14:02:49+00:00"
+```
+
+Wait around three minutes while the interrupted instances terminate and the new instances finish bootstrapping.
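Rather than watching the clock, you can script this wait. The sketch below keeps the AWS call abstract: `fetch_running_count` is a placeholder that a real script would back with the same `aws ec2 describe-instances` tag filter shown above (or boto3); the polling logic is what's illustrated here.

```python
import time

def wait_for_capacity(fetch_running_count, expected, timeout_s=600, interval_s=15):
    """Poll until the number of running Task Fleet instances is back to `expected`.

    `fetch_running_count` is a caller-supplied function; in a real script it
    would call the EC2 DescribeInstances API with the tag filters used above.
    It is kept abstract here so the polling logic is easy to test locally.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        count = fetch_running_count()
        if count >= expected:
            return count
        time.sleep(interval_s)
    raise TimeoutError(f"capacity not replenished within {timeout_s}s")

# Example with a fake fetcher that "replenishes" after a few polls:
responses = iter([5, 5, 6, 8])
print(wait_for_capacity(lambda: next(responses), expected=8, interval_s=0))  # → 8
```

This is only a convenience sketch; manually re-running the describe-instances command, as the workshop does next, works just as well.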
Run the previous command again to confirm that the new Spot Instances are running; the output will look like the following:
+
+```output
+"Instance with ID:i-06a82769173489f32 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-06c97b509c5e274e0 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-002b073c6479a5aba launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-0e96071afef3fc145 launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-04c6ced30794e965b launched at 2022-04-06T14:02:49+00:00"
+"Instance with ID:i-0136152e14053af81 launched at 2022-04-06T14:11:25+00:00"
+"Instance with ID:i-0ff78141712e93850 launched at 2022-04-06T14:11:25+00:00"
+"Instance with ID:i-08818dc9ba688c3da launched at 2022-04-06T14:11:25+00:00"
+```
+
+Notice how the launch times of the last three instances differ from the others.
+
+#### Verify that the Spark application completed successfully
+
+Follow the same steps from ["Examining the cluster"](/running_spark_apps_with_emr_on_spot_instances/examining_cluster.html) to launch the Spark History Server and explore the details of the recent Spark job submission. On the home screen, click the latest App ID (if it's empty, wait for the job to finish) to see the execution details. You should see something like this:
+
+![SparkJobCompleted](/images/running-emr-spark-apps-on-spot/sparkjobcompleted.png)
+
+Notice how, around two minutes after the job started, three executors were removed (each executor is a Spot Instance). The job didn't stop, and when the new Spot Instances were launched by EMR, Spark included them as new executors again to catch up on completing the job. The job took around eight minutes to finish. If you don't see executors being added, you could re-launch the Spark job and start the FIS experiment right away.
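If you would rather identify the replacement instances programmatically than by eye, you can group the instances by launch time; the latest group is the replenished capacity. A local sketch over made-up sample data (the instance IDs and timestamps are illustrative; a real script would feed in the CLI or boto3 response):

```python
import json
from collections import defaultdict

# Sample of the fields we care about from a DescribeInstances response;
# in a real script this JSON would come from the AWS CLI or boto3.
sample = json.loads("""
[
  {"InstanceId": "i-06a82769173489f32", "LaunchTime": "2022-04-06T14:02:49+00:00"},
  {"InstanceId": "i-0136152e14053af81", "LaunchTime": "2022-04-06T14:11:25+00:00"},
  {"InstanceId": "i-0ff78141712e93850", "LaunchTime": "2022-04-06T14:11:25+00:00"}
]
""")

def group_by_launch_time(instances):
    """Group instance IDs by launch time; the latest group holds the replacements."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["LaunchTime"]].append(inst["InstanceId"])
    # ISO 8601 timestamps sort correctly as strings
    return dict(sorted(groups.items()))

groups = group_by_launch_time(sample)
newest = list(groups)[-1]
print(f"replenished instances: {groups[newest]}")
# → replenished instances: ['i-0136152e14053af81', 'i-0ff78141712e93850']
```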
\ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/spot_savings_summary.md b/content/running_spark_apps_with_emr_on_spot_instances/spot_savings_summary.md index a12e19d8..3accd702 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/spot_savings_summary.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/spot_savings_summary.md @@ -1,6 +1,6 @@ --- title: "Spot savings summary" -weight: 96 +weight: 100 --- When our cluster has finished bootstrapping and the Spark application is running or has already completed, we can have a look at how much we are saving by running Spot Instances. The Spot Savings Summary feature in the EC2 Spot console provides a current snapshot of the Spot Instances in our account, and the current savings. diff --git a/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md b/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md index 1c8b5d15..913d90cb 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md @@ -1,6 +1,6 @@ --- title: "Tracking Spot interruptions" -weight: 100 +weight: 96 --- Now we're in the process of getting started with adopting Spot Instances for our EMR clusters. We're still not sure that our jobs are fully resilient and what would actually happen if some of the EC2 Spot Instances in our EMR clusters get interrupted, when EC2 needs the capacity back for On-Demand. @@ -9,39 +9,29 @@ Now we're in the process of getting started with adopting Spot Instances for our In most cases, when running fault-tolerant workloads, we don't really need to track the Spot interruptions as our applications should be built to handle them gracefully without any impact to performance or availability. 
However, when we get started with running our EMR jobs on Spot Instances, this can be useful, as our organization can use these logs to correlate possible EMR job failures or prolonged execution times with Spot interruptions that occurred during the Spark run time. {{% /notice %}}
+Let's set up CloudWatch Logs to log Spot interruptions, so if there are any failures in our EMR applications, we'll be able to check if the failures correlate with a Spot interruption.
-Let's set up an email notification for when Spot interruptions occur, so if there are any failures in our EMR applications, we'll be able to check if the failures correlate to a Spot interruption.
+#### Creating the CloudFormation Stack to Track EC2 Spot Interruptions
-#### Creating an SNS topic for the notifications
-1. Create a new SNS topic and subscribe to the topic with your email address
-For guidance, you can follow steps #1 & #2 in the [Amazon SNS getting started guide] (https://docs.aws.amazon.com/sns/latest/dg/sns-getting-started.html)
-1. You will receive an email with the subject "AWS Notification - Subscription Confirmation". Click the "**Confirm subscription**" link in the email in order to allow SNS to send email to the endpoint (your email).
+We've created a CloudFormation template that includes all the resources you need to track EC2 Spot Interruptions. The stack creates the following:
-#### Creating a CloudWatch Events rule for the Spot Interruption notifications
+* An EventBridge event rule for tracking EC2 Spot Interruption Warnings
+* A CloudWatch Logs log group to log interruptions and instance details
+* An IAM role that allows the event rule to write to CloudWatch Logs
-1. You now have an SNS topic that CloudWatch Events can send the EC2 Spot Interruption Notification to, let's configure CloudWatch to do so. In the AWS Management Console, go to Cloudwatch -> Events -> Rules and click **Create Rule**.
+You can view the CloudFormation template (**cloudwatchlogs.yaml**) at GitHub [here](https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/track-spot-interruptions/cloudwatchlogs.yaml). To download it, you can run the following command: -1. Under **Service Name** select EC2 and under Event Type select **EC2 Spot Instance Interruption Warning** - -1. On the right side of the console, click **Add Target**, scroll down and select **SNS topic** -> select your topic name, Your result should look like this: -![tags](/images/running-emr-spark-apps-on-spot/cloudwatcheventsrule.png) -1. Click **Configure Details** in the bottom right corner. -1. Provide a name to your CloudWatch Events rule and click **Create rule**. - -#### Verifying that the notification works - -The only way to simulate a Spot Interruption Notification is to use Spot Fleet. Spot Fleet is an EC2 instance provisioning and management tool that is not used in this workshop for any of the actual EMR/Spark work (not to be confused with EMR Instance Fleets). We will only use Spot Fleet to trigger a Spot Interruption that will help you verify that the notification that you set up works. +``` +wget https://raw.githubusercontent.com/awslabs/ec2-spot-workshops/master/workshops/track-spot-interruptions/cloudwatchlogs.yaml +``` -1. In the AWS console, go to EC2 -> Spot Requests -> click **Request Spot Instances** -1. Leave all the settings as-is and check the box next to "**Maintain target capacity**", then at the bottom click **Launch**. This will create a Spot Fleet with one instance and you will get a Success window. -1. Still in the Spot console, click the console refresh button until your fleet has started the one instance (when capacity changes to **1 of 1**). -1. With your fleet checked, click Actions -> **Modify target capacity** -> Change "**New target capacity**" to 0, leave Terminate instances checked -> Click **Submit** -1. 
Within a minute or two you should receive an SNS notification from the topic you created, with a JSON event that indicates that the Spot Instance in the fleet was interrupted. +After downloading the CloudFormation template, run the following command in a terminal: -```json -{"version":"0","id":"6009a9f4-cc7a-8a77-46f2-310520b31e0f","detail-type":"EC2 Spot Instance Interruption Warning","source":"aws.ec2","account":"","time":"2019-05-27T04:52:57Z","region":"eu-west-1","resources":["arn:aws:ec2:eu-west-1b:instance/i-0481ef86f172b68d7"],"detail":{"instance-id":"i-0481ef86f172b68d7","instance-action":"terminate"}} +``` +aws cloudformation create-stack --stack-name track-spot-interruption --template-body file://cloudwatchlogs.yaml --capabilities CAPABILITY_NAMED_IAM +aws cloudformation wait stack-create-complete --stack-name track-spot-interruption ``` -Go ahead and terminate the fleet request itself by checking the fleet, click actions -> **Cancel Spot request** -> **Confirm**. +You should see an event rule in the Amazon EventBridge console, like this: -From now on, any EC2 Spot interruption in the account/region that you set this up in will alert you via email. Disable or delete the CloudWatch Event rule if you are not interested in the notifications. +![Spot Interruption Event Rule](/images/tracking-spot/itn-event-rule.png) diff --git a/content/running_spark_apps_with_emr_on_spot_instances/verifying_results.md b/content/running_spark_apps_with_emr_on_spot_instances/verifying_results.md index 75d483d3..87d21cb7 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/verifying_results.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/verifying_results.md @@ -1,6 +1,6 @@ --- title: "Verifying the app's results" -weight: 140 +weight: 99 --- In this section we will use Amazon Athena to run a SQL query against the results of our Spark application in order to make sure that it completed successfully. 
diff --git a/static/images/running-emr-spark-apps-on-spot/sparkjobcompleted.png b/static/images/running-emr-spark-apps-on-spot/sparkjobcompleted.png
new file mode 100644
index 00000000..822d69b5
Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/sparkjobcompleted.png differ
diff --git a/static/images/running-emr-spark-apps-on-spot/spotinterruptionlog.png b/static/images/running-emr-spark-apps-on-spot/spotinterruptionlog.png
new file mode 100644
index 00000000..12f01ef1
Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/spotinterruptionlog.png differ
diff --git a/static/images/tracking-spot/itn-event-rule.png b/static/images/tracking-spot/itn-event-rule.png
new file mode 100644
index 00000000..21797c80
Binary files /dev/null and b/static/images/tracking-spot/itn-event-rule.png differ
diff --git a/workshops/fis/fisspotinterruption.yaml b/workshops/fis/fisspotinterruption.yaml
new file mode 100644
index 00000000..e596bff5
--- /dev/null
+++ b/workshops/fis/fisspotinterruption.yaml
@@ -0,0 +1,70 @@
+---
+AWSTemplateFormatVersion: 2010-09-09
+Description: FIS for Spot Instances
+Parameters:
+  InstancesToInterrupt:
+    Description: Number of instances to interrupt
+    Default: 3
+    Type: Number
+
+  DurationBeforeInterruption:
+    Description: Number of minutes before the interruption
+    Default: 2
+    Type: Number
+
+Resources:
+
+  FISSpotRole:
+    Type: AWS::IAM::Role
+    Properties:
+      AssumeRolePolicyDocument:
+        Statement:
+          - Effect: Allow
+            Principal:
+              Service: [fis.amazonaws.com]
+            Action: ["sts:AssumeRole"]
+      Path: /
+      Policies:
+        - PolicyName: root
+          PolicyDocument:
+            Version: "2012-10-17"
+            Statement:
+              - Effect: Allow
+                Action: 'ec2:DescribeInstances'
+                Resource: '*'
+              - Effect: Allow
+                Action: 'ec2:SendSpotInstanceInterruptions'
+                Resource: 'arn:aws:ec2:*:*:instance/*'
+
+  FISExperimentTemplate:
+    Type: AWS::FIS::ExperimentTemplate
+    Properties:
+      Description: "Interrupt multiple random instances"
+      Targets:
+        SpotInstances:
+          ResourceTags:
+            ResourceTagKey: ResourceTagValue
+          Filters:
+            - Path: State.Name
+              Values:
+                - running
+          ResourceType: aws:ec2:spot-instance
+          SelectionMode: !Join ["", ["COUNT(", !Ref InstancesToInterrupt, ")"]]
+      Actions:
+        interrupt:
+          ActionId: "aws:ec2:send-spot-instance-interruptions"
+          Description: "Interrupt multiple Spot instances"
+          Parameters:
+            durationBeforeInterruption: !Join ["", ["PT", !Ref DurationBeforeInterruption, "M"]]
+          Targets:
+            SpotInstances: SpotInstances
+      StopConditions:
+        - Source: none
+      RoleArn: !GetAtt FISSpotRole.Arn
+      Tags:
+        Name: "FIS_EXP_NAME"
+
+Outputs:
+  FISExperimentID:
+    Value: !GetAtt FISExperimentTemplate.Id
+...
diff --git a/workshops/track-spot-interruptions/cloudwatchlogs.yaml b/workshops/track-spot-interruptions/cloudwatchlogs.yaml
new file mode 100644
index 00000000..6085f322
--- /dev/null
+++ b/workshops/track-spot-interruptions/cloudwatchlogs.yaml
@@ -0,0 +1,84 @@
+---
+AWSTemplateFormatVersion: '2010-09-09'
+Description: 'Logging instance details in response to EC2 Spot Instance Interruption Warnings'
+Parameters:
+  CloudWatchLogGroupName:
+    Description: Name of the CloudWatch Log Group logs will be written to.
+    Default: /aws/events/spotinterruptions
+    Type: String
+  CloudWatchLogGroupRetentionPeriodDays:
+    Description: Number of days to retain CloudWatch Logs
+    Default: 7
+    Type: Number
+Resources:
+  RBREventRule:
+    Type: AWS::Events::Rule
+    Properties:
+      Description: EventRule
+      EventPattern:
+        source:
+          - aws.ec2
+        detail-type:
+          - EC2 Instance Rebalance Recommendation
+      State: ENABLED
+      RoleArn: !GetAtt CloudWatchLogRole.Arn
+      Targets:
+        - Arn:
+            Fn::GetAtt:
+              - CloudWatchLogGroup
+              - Arn
+          Id: CloudWatchLogGroup
+    Metadata:
+      SamResourceId: RBREventRule
+  ITNEventRule:
+    Type: AWS::Events::Rule
+    Properties:
+      Description: EventRule
+      EventPattern:
+        source:
+          - aws.ec2
+        detail-type:
+          - EC2 Spot Instance Interruption Warning
+      State: ENABLED
+      RoleArn: !GetAtt CloudWatchLogRole.Arn
+      Targets:
+        - Arn:
+            Fn::GetAtt:
+              - CloudWatchLogGroup
+              - Arn
+          Id: CloudWatchLogGroup
+    Metadata:
+      SamResourceId: ITNEventRule
+  CloudWatchLogGroup:
+    Type: AWS::Logs::LogGroup
+    Properties:
+      LogGroupName:
+        Ref: CloudWatchLogGroupName
+      RetentionInDays:
+        Ref: CloudWatchLogGroupRetentionPeriodDays
+  CloudWatchLogRole:
+    Type: AWS::IAM::Role
+    Properties:
+      AssumeRolePolicyDocument:
+        Version: 2012-10-17
+        Statement:
+          - Effect: Allow
+            Principal:
+              Service: events.amazonaws.com
+            Action: 'sts:AssumeRole'
+      Policies:
+        - PolicyName: "AllowLogging"
+          PolicyDocument:
+            Version: '2012-10-17'
+            Statement:
+              - Effect: "Allow"
+                Action:
+                  - 'logs:*'
+                Resource: "*"
+
+Outputs:
+  CloudWatchLogGroupName:
+    Description: CloudWatch Log Group Name
+    Value:
+      Ref: CloudWatchLogGroup
+...
\ No newline at end of file
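The FIS experiment template above builds two strings with `!Join` intrinsics: a `SelectionMode` of the form `COUNT(n)` and an ISO 8601 duration of the form `PTnM` for `durationBeforeInterruption`. As a quick sanity check of what those expressions resolve to (a sketch outside the template; the helper names are mine):

```python
def fis_selection_mode(instances_to_interrupt: int) -> str:
    # Mirrors: !Join ["", ["COUNT(", !Ref InstancesToInterrupt, ")"]]
    return "COUNT({})".format(instances_to_interrupt)

def fis_duration(minutes: int) -> str:
    # Mirrors: !Join ["", ["PT", !Ref DurationBeforeInterruption, "M"]]
    # FIS expects an ISO 8601 duration, e.g. PT2M for two minutes.
    return "PT{}M".format(minutes)

# With the template's parameter defaults (3 instances, 2 minutes):
print(fis_selection_mode(3), fis_duration(2))
```

So with the defaults, the experiment interrupts three randomly selected running Spot Instances, each receiving its interruption notice two minutes before termination.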