feature generation
Library : boto3, pandas, sklearn
aws-build-deployment-package -> pandas, sklearn
In a machine learning job, raw input data generally needs pre-processing to prepare it as features for training. In the featurization workload, we use the Amazon Fine Food Reviews text dataset and transform each review into a TF-IDF vector. To run the workload in parallel on a FaaS environment with different RAM configurations, we partition the input dataset into files of various sizes. Because a global TF-IDF vector must be calculated from the partitioned inputs, the function is invoked multiple times: in parallel to process each partition, and then once more to aggregate the results.
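The split between per-partition work and global aggregation can be sketched as a small map/reduce in plain Python. This is an illustration only, not the repository's implementation: the function names `extract_partition`, `reduce_partitions`, and `tfidf`, the tokenizer, and the summary layout are all assumptions. The IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, which is also what scikit-learn applies when `smooth_idf=True`; no vector normalization is done here.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def extract_partition(reviews):
    """Map step (hypothetical): summarize one partition of reviews.

    Returns per-review term counts, the document frequency of each term
    inside this partition, and the number of reviews in the partition.
    """
    term_counts = [Counter(tokenize(review)) for review in reviews]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each review counts a term once
    return {"term_counts": term_counts, "doc_freq": doc_freq, "n_docs": len(reviews)}


def reduce_partitions(partials):
    """Reduce step (hypothetical): merge summaries into global IDF weights."""
    n_docs = sum(p["n_docs"] for p in partials)
    doc_freq = Counter()
    for p in partials:
        doc_freq.update(p["doc_freq"])  # adds the per-partition frequencies
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(counts, idf):
    """Weight one review's raw term counts by the global IDF."""
    return {term: count * idf[term] for term, count in counts.items()}
```

The point of the sketch is that IDF depends on document frequencies across the whole dataset, so no single partition can compute it alone. Each parallel invocation only needs to emit its local counts, and one final invocation merges them.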
Orchestrator (code): Invokes the function multiple times for parallel processing.
Feature Extractor (code): Data preprocessing; extracts words from each sentence.
Feature Reducer (code): Generates the global TF-IDF vector.
Get-job-status (code): Checks the number of S3 objects in the bucket.
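As a rough illustration of the Orchestrator's role, the sketch below fans out one asynchronous Lambda invocation per partition file and returns how many it started. It is a guess at the shape of such a function, not the repository's code: the helper `fan_out`, the function name `feature-extractor`, and the `key` payload field are hypothetical. The boto3 calls used (`invoke` with `InvocationType="Event"`, and the `list_objects_v2` paginator) are standard.

```python
import json


def fan_out(lambda_client, function_name, bucket, keys):
    """Asynchronously invoke the extractor once per partition file.

    Returns the number of invocations started, which a later step can
    compare against the number of result objects written to S3.
    """
    for key in keys:
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: do not wait for the result
            Payload=json.dumps({"bucket": bucket, "key": key}).encode("utf-8"),
        )
    return len(keys)


def lambda_handler(event, context):
    # boto3 is imported here so fan_out can be exercised without AWS access.
    import boto3

    bucket = event["bucket"]
    pages = boto3.client("s3").get_paginator("list_objects_v2").paginate(Bucket=bucket)
    keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
    # "feature-extractor" is a hypothetical function name.
    return fan_out(boto3.client("lambda"), "feature-extractor", bucket, keys)
```

Passing the client in as an argument keeps the fan-out logic testable with a stub object in place of a real Lambda client.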
Step Functions state machine code
{
  "StartAt": "OrcheStrator",
  "States": {
    "OrcheStrator": {
      "Type": "Task",
      "Resource": "[ORCHESTRATOR-FUNCTION-ARN]",
      "ResultPath": "$.num_of_file",
      "Next": "Wait X Seconds"
    },
    "Wait X Seconds": {
      "Type": "Wait",
      "Seconds": 12,
      "Next": "Get Job Status"
    },
    "Get Job Status": {
      "Type": "Task",
      "Resource": "[GET-JOB-STATUS-FUNCTION-ARN]",
      "Next": "Job Complete?",
      "InputPath": "$.num_of_file",
      "ResultPath": "$"
    },
    "Job Complete?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "Next": "Wait X Seconds"
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "Feature Reducer"
        }
      ],
      "Default": "Wait X Seconds"
    },
    "Feature Reducer": {
      "Type": "Task",
      "Resource": "[FEATURE-REDUCER-FUNCTION-ARN]",
      "End": true
    }
  }
}
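The polling loop in the state machine depends on what the Get Job Status function returns. The following is a minimal sketch of such a function under several assumptions: that `$.num_of_file` is a bare integer, that results are written under a `features/` prefix, and that the bucket name `[RESULT-BUCKET]` is a placeholder. None of these come from the repository. Because `"ResultPath": "$"` replaces the whole state with the function's output, the sketch returns `num_of_file` alongside `status` so that `"InputPath": "$.num_of_file"` still resolves on the next pass through the loop.

```python
def count_objects(s3_client, bucket, prefix=""):
    """Count the objects under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    return sum(page.get("KeyCount", 0) for page in pages)


def job_status(done, expected):
    """Build the state passed on to the "Job Complete?" choice."""
    # Any status other than "SUCCEEDED" sends the state machine back to
    # "Wait X Seconds", either via the "FAILED" branch or via "Default".
    status = "SUCCEEDED" if done >= expected else "RUNNING"
    return {"status": status, "num_of_file": expected}


def lambda_handler(event, context):
    # boto3 is imported here so the helpers can be tested without AWS access.
    import boto3

    expected = int(event)  # assumption: the input is the bare file count
    # "[RESULT-BUCKET]" and "features/" are hypothetical placeholders.
    done = count_objects(boto3.client("s3"), "[RESULT-BUCKET]", "features/")
    return job_status(done, expected)
```

Using the paginator matters once a run produces more than 1,000 result objects, since a single `list_objects_v2` call returns at most 1,000 keys.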
Workload Input : Text
Workload Output : Text
Lambda payload (test event) example:
The dataset bucket must contain the amazon-fine-food-reviews dataset as one or more partition files (reviews10mb.csv, reviews20mb.csv, reviews50mb.csv, reviews100mb.csv). The original dataset is available at https://snap.stanford.edu/data/web-FineFoods.html
{
"bucket": "[DATASET-BUCKET]"
}