Unable to download pretrained models to EMR cluster using AWS step function #1576

Open
1 task done
NSManogna opened this issue Oct 28, 2024 · 1 comment
Labels
question Further information is requested

Comments

@NSManogna

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

We are trying to run a deidentification pipeline that reads data from a MariaDB table and writes to Snowflake.
These are the steps involved in executing the script through an AWS Step Functions state machine:
1) Create the EMR cluster using the step function
2) Install dependencies
3) Trigger the deidentification script

Current Behavior

When I run the script manually on the EMR cluster, it runs fine. But when I execute the same script on the EMR cluster through the step function, it fails at the step that downloads the pretrained models.
error_log.txt
The complete log is attached for reference.
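
One way to narrow down a difference between a manual run and a run launched as an EMR step is to log the runtime context from inside the script in both cases and compare the two outputs. The following is a minimal sketch and not part of the original script; the helper name `runtime_context` and the list of environment variables are assumptions chosen for illustration.

```python
import getpass
import json
import os
import sys

# Hypothetical list of variables worth comparing; adjust as needed.
ENV_KEYS = ["HOME", "PATH", "JAVA_HOME", "SPARK_HOME", "AWS_DEFAULT_REGION"]


def current_user():
    """Return the login name, or None if it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return None


def runtime_context(env_keys=ENV_KEYS):
    """Collect facts that can differ between an interactive session and a step."""
    return {
        "user": current_user(),
        "cwd": os.getcwd(),
        "home_dir": os.path.expanduser("~"),
        "python": sys.executable,
        # Unset variables are reported as None so they stay visible in the diff.
        "env": {key: os.environ.get(key) for key in env_keys},
    }


if __name__ == "__main__":
    print(json.dumps(runtime_context(), indent=2, sort_keys=True))
```

Calling `runtime_context()` at the top of the script in both runs and comparing the printed JSON would show whether the user, working directory, home directory, or environment differs between the two.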

Expected Behavior

We expect it to download the models and load the deidentified data into the table. Here is the log produced when we run the script manually on EMR:
output.log

Steps To Reproduce

Use the code below to create the step function and run the script.
{
  "Comment": "A description of my state machine",
  "StartAt": "StoreClusterId",
  "States": {
    "StoreClusterId": {
      "Type": "Pass",
      "Result": {
        "ClusterId": "j-3LN9LXY44F0W2"
      },
      "ResultPath": "$.Input",
      "Next": "CopyZipFile"
    },
    "CopyZipFile": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId": "j-3LN9LXY44F0W2",
        "Step": {
          "Name": "Copy ZIP File",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "bash",
              "-c",
              "aws s3 cp s3://inno-data/cldp_emr_phase2_v4.zip /home/hadoop/cldp_emr_phase2_v4.zip"
            ]
          }
        }
      },
      "ResultPath": "$.CopyInfo",
      "Next": "UnzipCode"
    },
    "UnzipCode": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId": "j-3LN9LXY44F0W2",
        "Step": {
          "Name": "Unzip Code",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "bash",
              "-c",
              "unzip /home/hadoop/cldp_emr_phase2_v4.zip -d /home/hadoop/cldp_emr_phase2_v4"
            ]
          }
        }
      },
      "ResultPath": "$.UnzipInfo",
      "Next": "SetExecutePermission"
    },
    "SetExecutePermission": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId": "j-3LN9LXY44F0W2",
        "Step": {
          "Name": "Set Execute Permission",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "bash",
              "-c",
              "chmod +x /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl/setup_env_nlp.sh && /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl/setup_env_nlp.sh"
            ]
          }
        }
      },
      "ResultPath": "$.PermissionInfo",
      "Next": "RunPythonScript"
    },
    "RunPythonScript": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId": "j-3LN9LXY44F0W2",
        "Step": {
          "Name": "Run Python Script",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "bash",
              "-c",
              "chmod +x /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl/install_nlp.py && /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl/jsl_env/bin/python /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl/install_nlp.py"
            ]
          }
        }
      },
      "ResultPath": "$.StepInfo",
      "Next": "RunPythonScript2"
    },
    "RunPythonScript2": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId": "j-3LN9LXY44F0W2",
        "Step": {
          "Name": "Run Python Script",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "bash",
              "-c",
              "cd /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl && /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl/jsl_env/bin/python /home/hadoop/cldp_emr_phase2_v4/cldp_emr_phase2_v4/jsl/jsl_convatec_deid_script.py"
            ]
          }
        }
      },
      "ResultPath": "$.StepInfo",
      "End": true
    }
  }
}

It fails at the state named RunPythonScript2.
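
The five task states in the definition differ only in their step name, shell command, result path, and successor, so they could be generated from one helper instead of being written out by hand. The sketch below is one possible way to do that in Python; the function name `emr_step_state` is hypothetical, and the field values are copied from the definition above.

```python
import json

CLUSTER_ID = "j-3LN9LXY44F0W2"  # cluster ID taken from the state machine definition


def emr_step_state(step_name, command, result_path, next_state=None):
    """Build one addStep.sync task state that runs `command` through bash."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            "ClusterId": CLUSTER_ID,
            "Step": {
                "Name": step_name,
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["bash", "-c", command],
                },
            },
        },
        "ResultPath": result_path,
    }
    # A state either names its successor or ends the execution.
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


if __name__ == "__main__":
    copy_state = emr_step_state(
        "Copy ZIP File",
        "aws s3 cp s3://inno-data/cldp_emr_phase2_v4.zip /home/hadoop/cldp_emr_phase2_v4.zip",
        "$.CopyInfo",
        next_state="UnzipCode",
    )
    print(json.dumps(copy_state, indent=2))
```

Generating the states this way keeps the commands in one place, which makes it easier to change a single step (for example, to add logging to the failing one) without touching the rest of the definition.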

Spark NLP version and Apache Spark

Spark NLP version: 5.5.0
Spark version: 3.4.0

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

NSManogna added the question (Further information is requested) label Oct 28, 2024
@maziyarpanahi maziyarpanahi transferred this issue from JohnSnowLabs/spark-nlp Oct 28, 2024
@maziyarpanahi
Member

Since you mentioned a deidentification pipeline: this is a licensed pipeline that requires a licensed library. It's best to look for support either here or on Slack (in the healthcare channel).
