Commit d7baacb

hudeven authored and facebook-github-bot committed
add docs about local dev and AWS dev
Summary: Pull Request resolved: #34

Test Plan: Imported from OSS

Reviewed By: dracifer

Differential Revision: D39113708

Pulled By: hudeven

fbshipit-source-id: 218a9cefaa70064d692ae3d1d287373107b97899
1 parent 051be7c commit d7baacb

1 file changed: +65 -0 lines changed

torchrecipes/paved_path/README.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# paved path project

**This project is currently in Prototype. If you have suggestions for improvements, please open a GitHub issue. We'd love to hear your feedback.**

## Local development
1. Install dependencies
```bash
pip install -r requirements.txt
```

2. Train a model
```bash
python charnn/main.py
```

3. Generate text from a model
```bash
python charnn/main.py charnn.task="generate" charnn.phrase="hello world"
```

4. [Optional] Train a model with torchx
```bash
torchx run -s local_cwd dist.ddp -j 1x2 --script charnn/main.py
```
> **_NOTE_**: `-j 1x2` specifies a single node running 2 processes, typically one per GPU. Learn more about torchx in the [torchx documentation](https://pytorch.org/torchx/latest/).

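The `-j` value in step 4 can be read as `{nodes}x{processes per node}`. As a minimal sketch of that reading, the hypothetical helper `parse_j` below (not part of torchx) splits a simple `NxM` value using plain shell parameter expansion. It assumes exactly that form and does no validation.

```bash
# parse_j is a hypothetical helper for illustration; it is not part of torchx.
# It splits a simple "-j" value such as "1x2" into a node count and a
# per-node process count.
parse_j() {
  spec="$1"
  nodes="${spec%%x*}"   # text before the first "x"
  procs="${spec#*x}"    # text after the first "x"
  echo "nodes=${nodes} procs_per_node=${procs}"
}

parse_j "1x2"   # prints: nodes=1 procs_per_node=2
```
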
## Development in AWS
### Setup environment
1. Launch an EC2 instance following [EC2 GetStarted](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html)
2. Install Docker and the NVIDIA driver if they are not already installed

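Before moving on, it can help to confirm that the tools from step 2 are on the `PATH`. The sketch below uses a hypothetical `check_tool` helper. It assumes the NVIDIA driver provides the `nvidia-smi` command, which is the usual case, and it only reports what is missing rather than installing anything.

```bash
# check_tool is a hypothetical helper: it reports whether a command is on PATH
# and returns a non-zero status when the command is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Report on both tools without aborting the shell if one is absent.
for tool in docker nvidia-smi; do
  check_tool "$tool" || true
done
```
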
You can use EC2 for [Local development](#local-development). However, you may need a cluster and a scheduler to manage resources (GPU, CPU, RAM, etc.) more efficiently. There are various options, such as [Slurm](https://slurm.schedmd.com/documentation.html) and [Kubernetes](https://kubernetes.io/). AWS provides [Batch](https://aws.amazon.com/batch/), a fully managed service that is easy to get started with. We will use it as the default scheduler in this example. With torchx, the job launching CLI is similar for all supported schedulers.

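To illustrate how similar the launch commands are, the sketch below builds the command line for either scheduler used in this README. `torchx_cmd` is a hypothetical helper that only prints the command and does not run it. Scheduler-specific options, such as the image, resource flags, and `-cfg` values that AWS Batch needs, are deliberately left out, so the printed `aws_batch` line is not a complete launch command.

```bash
# torchx_cmd is a hypothetical helper: it prints (does not run) a simplified
# torchx launch command for the given scheduler, e.g. local_cwd or aws_batch.
torchx_cmd() {
  scheduler="$1"
  echo "torchx run -s ${scheduler} dist.ddp -j 1x2 --script charnn/main.py"
}

torchx_cmd local_cwd
torchx_cmd aws_batch
```
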
### Create a container image on AWS ECR
Before launching a job in Batch, we need to create a Docker image containing the executable (`charnn/main.py` and its dependencies). Please follow [docker/README.md](https://github.com/facebookresearch/recipes/tree/main/torchrecipes/paved_path/docker).

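The image is later referenced by its registry address. As a sketch, assuming the standard ECR address layout `ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/REPOSITORY:TAG`, a hypothetical `ecr_image` helper could assemble that address. The account ID below is a placeholder, and the actual repository name is whatever docker/README.md has you create.

```bash
# ecr_image is a hypothetical helper that assembles an ECR image address
# from an account ID, region, repository name, and tag (in that order).
ecr_image() {
  echo "$1.dkr.ecr.$2.amazonaws.com/$3:$4"
}

# 123456789012 is a placeholder account ID, not a real one.
ecr_image 123456789012 us-west-2 charnn latest
# prints: 123456789012.dkr.ecr.us-west-2.amazonaws.com/charnn:latest
```
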
### AWS Batch
1. Create a Batch environment through the wizard: https://docs.aws.amazon.com/batch/latest/userguide/Batch_GetStarted.html
> **_NOTE_**: Configure a Compute Environment and a Job Queue (named "torchx-gpu"). You do not need to define a job if you launch with torchx.
2. Set up env variables
```bash
export REGION="us-west-2" # or any region in your case
export JOB_QUEUE="torchx-gpu" # must match the name of your Job Queue
export ECR_URL="YOUR_AWS_ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/charnn" # defined in docker/README.md
```
3. Launch a model training job with torchx
```bash
torchx run --workspace '' -s aws_batch \
    -cfg queue=$JOB_QUEUE,image_repo=$ECR_URL/charnn dist.ddp \
    --script charnn/main.py --image $ECR_URL/charnn:latest \
    --cpu 8 --gpu 2 -j 1x2 --memMB 20480
```
You should see output like the example below. You can monitor and manage the job in the AWS Batch console through the `UI URL`.
```
torchx 2022-08-29 22:01:22 INFO Found credentials in environment variables.
aws_batch://torchx/torchx-gpu:main-pqwtnnj6dqhr0
torchx 2022-08-29 22:01:23 INFO Launched app: aws_batch://torchx/torchx-gpu:main-pqwtnnj6dqhr0
torchx 2022-08-29 22:01:23 INFO AppStatus:
State: PENDING
Num Restarts: -1
Roles:
Msg: <NONE>
Structured Error Msg: <NONE>
UI URL: https://us-west-2.console.aws.amazon.com/batch/home?region=us-west-2#jobs/mnp-job/d6c98cdd-693e-47b8-981b-69e119743768
```
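
The second line of the sample output is the app handle. Judging from this sample, it has the shape `scheduler://session/app_id`. Under that assumption, the hypothetical `parse_handle` helper below (not part of torchx) splits a handle into its parts with shell parameter expansion.

```bash
# parse_handle is a hypothetical helper for illustration; it is not part of torchx.
# It assumes the handle looks like "scheduler://session/app_id".
parse_handle() {
  handle="$1"
  scheduler="${handle%%://*}"   # text before "://"
  rest="${handle#*://}"         # text after "://"
  session="${rest%%/*}"         # text before the first "/"
  app_id="${rest#*/}"           # text after the first "/"
  echo "scheduler=${scheduler} session=${session} app_id=${app_id}"
}

parse_handle "aws_batch://torchx/torchx-gpu:main-pqwtnnj6dqhr0"
# prints: scheduler=aws_batch session=torchx app_id=torchx-gpu:main-pqwtnnj6dqhr0
```

The full handle can also be passed to torchx subcommands such as `torchx status` or `torchx log` to follow the job from the command line.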
