This project builds a serverless data analytics application that processes Common Crawl files stored in AWS S3. The application is composed of AWS Lambda, AWS Batch, containers, and ArchiveSpark. The solution can be deployed using a CloudFormation stack or the AWS CLI.
The source code is specific to this research work, and you are free to adapt it for your own work. See Customization for details.
As always, you are more than welcome to contribute and make this solution better and/or support more use cases.
- A task file is uploaded to the S3 bucket
- S3 bucket triggers a lambda function
- Lambda function reads the content in the task file and submits multiple batch jobs
- Each batch job downloads a Common Crawl WARC file, processes it, creates derivative content, and uploads the resulting output files to the target S3 bucket
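The following is a minimal sketch of what such a Lambda handler might look like, assuming the task file is JSON and uses the field names documented under Usage below. The loop structure and the `warcs` field are illustrative assumptions; the actual index.py in this repository is authoritative.

```python
import json
import boto3

s3 = boto3.client("s3")
batch = boto3.client("batch")

def handler(event, context):
    # Read the task file that triggered this invocation
    rec = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=rec["bucket"]["name"], Key=rec["object"]["key"])
    task = json.loads(obj["Body"].read())

    # Submit one Batch job per WARC entry; the "warcs" list is an assumed
    # schema, with a fallback to a single-entry task file
    for i, warc in enumerate(task.get("warcs", [task])):
        batch.submit_job(
            jobName=f"warc-job-{i}",
            jobQueue=task["jobQueue"],
            jobDefinition=task["jobDefinition"],
            containerOverrides={
                "environment": [
                    {"name": "WARC_FILENAME", "value": warc["WARC_FILENAME"]},
                    {"name": "WARC_URL", "value": warc["WARC_URL"]},
                ]
            },
        )
```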
Click Next to continue
- Stack name: Stack names can include letters (A-Z and a-z), numbers (0-9), and dashes (-).
- Parameters: Parameters are defined in your template and allow you to input custom values when you create or update a stack.
Name | Description |
---|---|
BatchRepositoryName | any valid name for the Batch process repository |
DockerImage | any valid name for the Docker image. e.g. yinlinchen/vtl:iiifs3_v3 |
JDName | any valid name for the Job definition |
JQName | any valid name for the Job queue |
LambdaFunctionName | any valid name for the Lambda function |
LambdaRoleName | any valid name for the Lambda role |
S3BucketName | any valid name for the S3 bucket |
Leave the defaults as they are and click Next
Make sure all checkboxes under the Capabilities section are CHECKED
Click Create stack
After the CloudFormation stack is deployed successfully, you will see a service deployed in AWS Batch. You can see a list of resources in the CloudFormation output.
Run the following in your shell to deploy the application to AWS:
- New deployment

  ```sh
  aws cloudformation create-stack --stack-name STACKNAME --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```

- Update the deployment

  ```sh
  aws cloudformation update-stack --stack-name STACKNAME --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```
- After the CloudFormation stack is deployed successfully, you will see a service deployed in AWS Batch. You can see a list of resources in the CloudFormation output.

See CloudFormation create-stack for the --parameters option.
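For example, a sketch of overriding parameters at creation time (the parameter names come from the table above; the bucket and image values are placeholders):

```sh
aws cloudformation create-stack --stack-name STACKNAME \
  --template-body file://awsbatchwarc.template \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters ParameterKey=S3BucketName,ParameterValue=my-task-bucket \
               ParameterKey=DockerImage,ParameterValue=yinlinchen/vtl:iipc_v3
```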
To delete the deployed application that you created, use the AWS CLI. Assuming you used your project name for the stack name, you can run the following:
```sh
aws cloudformation delete-stack --stack-name warc-analytics-wac2021
```
- Prepare a task.json file.
- Upload this task.json file to the S3 bucket. You can find this S3 bucket's information in the CloudFormation output.
- The output result files will be uploaded to the S3 bucket. Note: you need to update the Docker image file and the S3 bucket information. See here.
- Example task file: task.json
Name | Description |
---|---|
jobQueue | Batch job queue name |
jobDefinition | Batch job definition name |
region | AWS region, e.g. us-east-1 |
WARC_FILENAME | WARC file name, e.g. CC-MAIN-20200524210325-20200525000325-00041 |
WARC_URL | WARC segment path, e.g. 1590347385193.5/warc/ |
Note: You can get WARC_FILENAME and WARC_URL from the Common Crawl warc.paths file, e.g. crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/warc/CC-MAIN-20200524210325-20200525000325-00041.warc.gz
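A sketch of a complete task file, assuming one WARC entry per file and using the fields from the table above (the queue and definition names are placeholders; use the values from your CloudFormation output):

```json
{
  "jobQueue": "my-job-queue",
  "jobDefinition": "my-job-definition",
  "region": "us-east-1",
  "WARC_FILENAME": "CC-MAIN-20200524210325-20200525000325-00041",
  "WARC_URL": "1590347385193.5/warc/"
}
```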
- index.py: Submits a batch job when a task file is uploaded to an S3 bucket. Note: After you update the index.py file, compress it into an app.zip file for CloudFormation packaging, as shown below.
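For example:

```sh
# compress the updated handler for CloudFormation packaging
zip app.zip index.py
```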
```sh
aws cloudformation package --template-file template/template.yaml --s3-bucket iipc-warc-code --output-template-file awsbatchwarc.template
```
- New deployment

  ```sh
  aws cloudformation create-stack --stack-name warctest --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```

- Update the deployment

  ```sh
  aws cloudformation update-stack --stack-name warctest --template-body file://awsbatchwarc.template --capabilities CAPABILITY_NAMED_IAM
  ```
- Compute Environment: Type: `SPOT`, MinvCpus: `0`, MaxvCpus: `256`, InstanceTypes: `optimal`
- Job Definition: Type: `container`, Image: `DockerImage`, Vcpus: `4`, Memory: `4096`
- Job Queue: Priority: `10`
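These settings correspond roughly to CloudFormation resources like the following sketch (the subnet, security group, and IAM role references are placeholders; the actual template in this repository is authoritative):

```yaml
ComputeEnvironment:
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: MANAGED
    ServiceRole: !GetAtt BatchServiceRole.Arn      # placeholder role
    ComputeResources:
      Type: SPOT
      MinvCpus: 0
      MaxvCpus: 256
      InstanceTypes:
        - optimal
      Subnets:
        - subnet-xxxxxxxx                          # placeholder subnet
      SecurityGroupIds:
        - sg-xxxxxxxx                              # placeholder security group
      InstanceRole: !Ref EcsInstanceProfile        # placeholder instance profile
      SpotIamFleetRole: !GetAtt SpotFleetRole.Arn  # required when Type is SPOT

JobQueue:
  Type: AWS::Batch::JobQueue
  Properties:
    Priority: 10
    ComputeEnvironmentOrder:
      - Order: 1
        ComputeEnvironment: !Ref ComputeEnvironment
```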
The current Docker image is built for this research; you can update the source code to fit your needs. The image is published on Docker Hub: yinlinchen/vtl:iipc_v3.
Note: The output result S3 bucket needs to be changed to an S3 bucket that you have permission to write to. It is located here.
The default storage size is about 8 GB, while the ECS-optimized AMI provides 30 GB. Also, downloading Docker images takes time, so create a custom AMI with the Docker image already pulled. The steps to create a custom AMI are listed below.
- Create an instance using an Amazon ECS-optimized AMI. AMIs are region specific, so pick the correct AMI ID for your region.

  Region | AMI ID
  ---|---
  us-east-1 | ami-09a3cad575b7eabaa
  us-west-1 | ami-0ec4aded37931c8b5
  us-west-2 | ami-0b58521c622a24969

- Pull the image from the registry:

  ```sh
  docker pull yinlinchen/vtl:iipc_v3
  ```

- Create a compute resource AMI.
  - You must stop the ECS container agent and remove any persistent data checkpoint files before creating your AMI:

    ```sh
    sudo systemctl stop ecs
    sudo rm -rf /var/lib/ecs/data/*
    ```
- Create a new AMI from your running instance (see the example after this list). Learn more.
- Note your custom AMI ID.
- Use that information to update the CloudFormation template here.
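One way to perform the AMI-creation step from the CLI (the instance ID and image name below are placeholders):

```sh
aws ec2 create-image --instance-id i-0123456789abcdef0 \
  --name "ecs-optimized-iipc-v3" \
  --description "ECS-optimized AMI with yinlinchen/vtl:iipc_v3 pre-pulled"
```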