| 1 | +# paved path project |
| 2 | + |
| 3 | +**This project is currently in Prototype. If you have suggestions for improvements, please open a GitHub issue. We'd love to hear your feedback.** |
| 4 | + |
| 5 | +## Local development |
| 6 | +1. Install dependencies |
| 7 | +```bash |
| 8 | +pip install -r requirements.txt |
| 9 | +``` |
| 10 | + |
| 11 | +2. Train a model |
| 12 | +```bash |
| 13 | +python charnn/main.py |
| 14 | +``` |
| 15 | + |
| 16 | +3. Generate text from a model |
| 17 | +```bash |
| 18 | +python charnn/main.py charnn.task="generate" charnn.phrase="hello world" |
| 19 | +``` |
| 20 | + |
| 21 | +4. [Optional] Train a model with torchx |
| 22 | +```bash |
| 23 | +torchx run -s local_cwd dist.ddp -j 1x2 --script charnn/main.py |
| 24 | +``` |
| 25 | +> **_NOTE_**: `-j 1x2` requests a single node running 2 processes, one per GPU. Learn more in the [torchx documentation](https://pytorch.org/torchx/latest/). |
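
To make the `-j` format concrete, here is a minimal sketch that splits a value such as `1x2` into its node count and per-node process count. `parse_job_spec` is a hypothetical helper written for illustration, not part of torchx, and it assumes only the simple `NODESxPROCS` form used in this README.

```bash
# Hypothetical helper (not part of torchx): split NODESxPROCS into its parts.
parse_job_spec() {
  spec="$1"
  case "$spec" in
    *x*) ;;  # contains the separator, keep going
    *) echo "expected NODESxPROCS, got: $spec" >&2; return 1 ;;
  esac
  nodes="${spec%%x*}"   # text before the first "x"
  procs="${spec#*x}"    # text after the first "x"
  echo "nodes=${nodes} procs_per_node=${procs}"
}

parse_job_spec 1x2   # the value used above: 1 node, 2 processes
parse_job_spec 2x4   # 2 nodes, 4 processes each
```

The total number of training processes is the product of the two numbers, so `1x2` starts 2 workers.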
| 26 | +
|
| 27 | +## Development in AWS |
| 28 | +### Setup environment |
| 29 | +1. Launch an EC2 instance following [EC2 GetStarted](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html) |
| 30 | +2. Install Docker and the NVIDIA driver if they are not already installed |
| 31 | + |
| 32 | +You can use EC2 for [Local development](#local-development). However, you may need a cluster and a scheduler to manage resources (GPU, CPU, RAM, etc.) more efficiently. Options include [Slurm](https://slurm.schedmd.com/documentation.html), [Kubernetes](https://kubernetes.io/), and others. AWS provides [Batch](https://aws.amazon.com/batch/), a fully managed service that is easy to get started with. We use it as the default scheduler in this example. With torchx, the job-launching CLI is similar across all supported schedulers. |
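
As a rough illustration of how the launch command keeps the same shape across schedulers, the sketch below prints (it does not run) a torchx command for two of the schedulers named in this README. `torchx_cmd` is a hypothetical helper, not part of torchx, and the printed `aws_batch` command is simplified: a real Batch launch also needs image options such as `--image`.

```bash
# Hypothetical helper: print (do not run) a torchx launch command.
# $1 is the scheduler name; any remaining arguments stand in for
# scheduler-specific options such as -cfg.
torchx_cmd() {
  scheduler="$1"
  shift
  if [ "$#" -gt 0 ]; then
    echo "torchx run -s ${scheduler} $* dist.ddp -j 1x2 --script charnn/main.py"
  else
    echo "torchx run -s ${scheduler} dist.ddp -j 1x2 --script charnn/main.py"
  fi
}

torchx_cmd local_cwd
torchx_cmd aws_batch -cfg queue=torchx-gpu
```

Only the `-s` value and the scheduler-specific options change; the component (`dist.ddp`) and the script arguments stay the same.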
| 33 | + |
| 34 | +### Create a container image on AWS ECR |
| 35 | +Before launching a job in Batch, we need to create a Docker image containing the executable (`charnn/main.py` and its dependencies). Follow the instructions in [docker/README.md](https://github.com/facebookresearch/recipes/tree/main/torchrecipes/paved_path/docker). |
| 36 | + |
| 37 | +### AWS Batch |
| 38 | +1. Create a Batch environment through the wizard: https://docs.aws.amazon.com/batch/latest/userguide/Batch_GetStarted.html |
| 39 | + > **_NOTE_**: Configure the Compute Environment and the Job Queue (name it "torchx-gpu"). You do not need to define a job if you launch with torchx. |
| 40 | +2. Set up environment variables |
| 41 | +```bash |
| 42 | +export REGION="us-west-2" # or whichever region you use |
| 43 | +export JOB_QUEUE="torchx-gpu" # must match the name of your Job Queue |
| 44 | +export ECR_URL="YOUR_AWS_ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/charnn" # defined in docker/README.md |
| 45 | +``` |
| 46 | +3. Launch a model training job with torchx |
| 47 | +```bash |
| 48 | +torchx run --workspace '' -s aws_batch \ |
| 49 | + -cfg queue=$JOB_QUEUE,image_repo=$ECR_URL/charnn dist.ddp \ |
| 50 | + --script charnn/main.py --image $ECR_URL/charnn:latest \ |
| 51 | + --cpu 8 --gpu 2 -j 1x2 --memMB 20480 |
| 52 | +``` |
| 53 | +You should see output similar to the following. You can monitor and manage the job in the AWS Batch console through the `UI URL`. |
| 54 | +``` |
| 55 | +torchx 2022-08-29 22:01:22 INFO Found credentials in environment variables. |
| 56 | +aws_batch://torchx/torchx-gpu:main-pqwtnnj6dqhr0 |
| 57 | +torchx 2022-08-29 22:01:23 INFO Launched app: aws_batch://torchx/torchx-gpu:main-pqwtnnj6dqhr0 |
| 58 | +torchx 2022-08-29 22:01:23 INFO AppStatus: |
| 59 | + State: PENDING |
| 60 | + Num Restarts: -1 |
| 61 | + Roles: |
| 62 | + Msg: <NONE> |
| 63 | + Structured Error Msg: <NONE> |
| 64 | + UI URL: https://us-west-2.console.aws.amazon.com/batch/home?region=us-west-2#jobs/mnp-job/d6c98cdd-693e-47b8-981b-69e119743768 |
| 65 | +``` |
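
Two small helpers can make the launch step less error-prone; both are hypothetical sketches, not part of torchx. `check_env` reports any variable from step 2 that is unset or empty before you launch, and `parse_handle` splits the app handle printed in the output above into its parts, assuming the `scheduler://session/app_id` layout seen in that sample.

```bash
# Hypothetical pre-flight check: report each named variable that is unset or empty.
check_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup of the variable called $name
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical parser: split scheduler://session/app_id into its three parts.
parse_handle() {
  handle="$1"
  scheduler="${handle%%://*}"   # text before "://"
  rest="${handle#*://}"         # text after "://"
  session="${rest%%/*}"         # text before the first "/"
  app_id="${rest#*/}"           # everything after the first "/"
  echo "scheduler=${scheduler} session=${session} app_id=${app_id}"
}

check_env REGION JOB_QUEUE ECR_URL && echo "environment looks ready"
parse_handle "aws_batch://torchx/torchx-gpu:main-pqwtnnj6dqhr0"
```

In the sample handle, the app id appears to combine the job queue name and the job name. The full handle can also be passed to `torchx status` to check on the job from the command line.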