This project builds a serverless data analytics application that processes Common Crawl files stored in AWS S3. The application is composed of AWS Lambda, AWS Batch, containers, and ArchiveSpark. The solution can be deployed using a CloudFormation stack or the AWS CLI.
The source code is specific to this research work, and you are free to adapt it for your own work. See Customization for details.
As always, you are more than welcome to contribute and make this solution better and/or support more use cases.
- A task file is uploaded to the S3 bucket
- S3 bucket triggers a lambda function
- Lambda function reads the content in the task file and submits multiple batch jobs
- Each batch job downloads a Common Crawl WARC file, processes it, creates derivative content, and uploads the resulting output files to the target S3 bucket
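The following is a minimal sketch of what such a Lambda handler might look like, assuming the task file is JSON and uses the field names documented under Usage below. The loop structure and the `warcs` field are illustrative assumptions; the actual index.py in this repository is authoritative.

```python
import json
import boto3

s3 = boto3.client("s3")
batch = boto3.client("batch")

def handler(event, context):
    # Read the task file that triggered this invocation
    rec = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=rec["bucket"]["name"], Key=rec["object"]["key"])
    task = json.loads(obj["Body"].read())

    # Submit one Batch job per WARC entry; the "warcs" list is an assumed
    # schema, with a fallback to a single-entry task file
    for i, warc in enumerate(task.get("warcs", [task])):
        batch.submit_job(
            jobName=f"warc-job-{i}",
            jobQueue=task["jobQueue"],
            jobDefinition=task["jobDefinition"],
            containerOverrides={
                "environment": [
                    {"name": "WARC_FILENAME", "value": warc["WARC_FILENAME"]},
                    {"name": "WARC_URL", "value": warc["WARC_URL"]},
                ]
            },
        )
```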
Click Next to continue
- Stack name: Stack names can include letters (A-Z and a-z), numbers (0-9), and dashes (-).
- Parameters: Parameters are defined in your template and allow you to input custom values when you create or update a stack.
Name | Description |
---|---|
BatchRepositoryName | any valid name for the Batch process repository |
DockerImage | any valid name for the Docker image. e.g. yinlinchen/vtl:iiifs3_v3 |
JDName | any valid name for the Job definition |
JQName | any valid name for the Job queue |
LambdaFunctionName | any valid name for the Lambda function |
LambdaRoleName | any valid name for the Lambda role |
S3BucketName | any valid name for the S3 bucket |
Leave the defaults as they are and click Next
Make sure all checkboxes under the Capabilities section are CHECKED
Click Create stack
After the CloudFormation stack is deployed successfully, you will see a service deployed in AWS Batch. You can see a list of resources in the CloudFormation output.
Run the following in your shell to deploy the application to AWS:
- New deployment

  ```sh
  aws cloudformation create-stack --stack-name STACKNAME --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```

- Update the deployment

  ```sh
  aws cloudformation update-stack --stack-name STACKNAME --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```
- After the CloudFormation stack is deployed successfully, you will see a service deployed in AWS Batch. You can see a list of resources in the CloudFormation output.

See CloudFormation create-stack for the --parameters option.
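For example, a sketch of overriding parameters at creation time (the parameter names come from the table above; the bucket and image values are placeholders):

```sh
aws cloudformation create-stack --stack-name STACKNAME \
  --template-body file://awsbatchwarc.template \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters ParameterKey=S3BucketName,ParameterValue=my-task-bucket \
               ParameterKey=DockerImage,ParameterValue=yinlinchen/vtl:iipc_v3
```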
To delete the deployed application that you created, use the AWS CLI. Assuming you used your project name for the stack name, you can run the following:
```sh
aws cloudformation delete-stack --stack-name warc-analytics-wac2021
```
- Prepare a task.json file.
- Upload this task.json file to the S3 bucket. You can find this S3 bucket's information in the CloudFormation output.
- The output result files will be uploaded to the S3 bucket. Note: you need to update the Docker image file and the S3 bucket information. See here.
- Example task file: task.json
Name | Description |
---|---|
jobQueue | Batch job queue name |
jobDefinition | Batch job definition name |
region | AWS region, e.g. us-east-1 |
WARC_FILENAME | WARC file name, e.g. CC-MAIN-20200524210325-20200525000325-00041 |
WARC_URL | WARC segment path, e.g. 1590347385193.5/warc/ |
Note: You can get WARC_FILENAME and WARC_URL from the Common Crawl warc.paths file, e.g. crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/warc/CC-MAIN-20200524210325-20200525000325-00041.warc.gz
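A sketch of a complete task file, assuming one WARC entry per file and using the fields from the table above (the queue and definition names are placeholders; use the values from your CloudFormation output):

```json
{
  "jobQueue": "my-job-queue",
  "jobDefinition": "my-job-definition",
  "region": "us-east-1",
  "WARC_FILENAME": "CC-MAIN-20200524210325-20200525000325-00041",
  "WARC_URL": "1590347385193.5/warc/"
}
```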
- index.py: Submits a batch job when a task file is uploaded to an S3 bucket. Note: After you update the index.py file, compress it into an app.zip file for CloudFormation packaging, as shown below.
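For example:

```sh
# compress the updated handler for CloudFormation packaging
zip app.zip index.py
```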
```sh
aws cloudformation package --template-file template/template.yaml --s3-bucket iipc-warc-code --output-template-file awsbatchwarc.template
```
- New deployment

  ```sh
  aws cloudformation create-stack --stack-name warctest --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```

- Update the deployment

  ```sh
  aws cloudformation update-stack --stack-name warctest --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```
- Compute Environment: Type: `SPOT`, MinvCpus: `0`, MaxvCpus: `256`, InstanceTypes: `optimal`
- Job Definition: Type: `container`, Image: `DockerImage`, Vcpus: `4`, Memory: `4096`
- Job Queue: Priority: `10`
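These settings correspond roughly to CloudFormation resources like the following sketch (the subnet, security group, and IAM role references are placeholders; the actual template in this repository is authoritative):

```yaml
ComputeEnvironment:
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: MANAGED
    ServiceRole: !GetAtt BatchServiceRole.Arn      # placeholder role
    ComputeResources:
      Type: SPOT
      MinvCpus: 0
      MaxvCpus: 256
      InstanceTypes:
        - optimal
      Subnets:
        - subnet-xxxxxxxx                          # placeholder subnet
      SecurityGroupIds:
        - sg-xxxxxxxx                              # placeholder security group
      InstanceRole: !Ref EcsInstanceProfile        # placeholder instance profile
      SpotIamFleetRole: !GetAtt SpotFleetRole.Arn  # required when Type is SPOT

JobQueue:
  Type: AWS::Batch::JobQueue
  Properties:
    Priority: 10
    ComputeEnvironmentOrder:
      - Order: 1
        ComputeEnvironment: !Ref ComputeEnvironment
```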
The current Docker image is built for this research; you can update the source code to fit your needs. The image is published on Docker Hub: yinlinchen/vtl:iipc_v3.
Note: The output result S3 bucket needs to be changed to an S3 bucket that you have permission to write to. It is located here.
The default storage size is about 8 GB, while the ECS-optimized AMI provides 30 GB. Also, downloading Docker images takes time, so create a custom AMI with the Docker image already pulled. The steps to create a custom AMI are listed below.
- Create an instance using an Amazon ECS-optimized AMI. AMIs are region specific, so pick the correct AMI ID for your region.

  Region | AMI ID
  ---|---
  us-east-1 | ami-09a3cad575b7eabaa
  us-west-1 | ami-0ec4aded37931c8b5
  us-west-2 | ami-0b58521c622a24969

- Pull the image from the registry:

  ```sh
  docker pull yinlinchen/vtl:iipc_v3
  ```

- Create a compute resource AMI.
  - You must stop the ECS container agent and remove any persistent data checkpoint files before creating your AMI:

    ```sh
    sudo systemctl stop ecs
    sudo rm -rf /var/lib/ecs/data/*
    ```
- Create a new AMI from your running instance (see the example after this list). Learn more.
- Note your custom AMI ID.
- Use that information to update the CloudFormation template here.
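One way to perform the AMI-creation step from the CLI (the instance ID and image name below are placeholders):

```sh
aws ec2 create-image --instance-id i-0123456789abcdef0 \
  --name "ecs-optimized-iipc-v3" \
  --description "ECS-optimized AMI with yinlinchen/vtl:iipc_v3 pre-pulled"
```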