Merging NCCL to Master #20

jiazhihao · 2021-02-04T08:57:39Z

No description provided.

* Added Hello World config for circleci
* Remove internal submodules
* OSS Automated Fix: Addition of Contributing
* OSS Automated Fix: Addition of Code of Conduct
* [CircleCI] Added simple onnx test for importing a model.
* [CircleCI] Added pytest job to config.yml for running onnx tests.

Co-authored-by: Zhihao Jia <[email protected]>

…e time a model takes when training (#94)

* Update README.md
* [ModelTiming] Added python script to examples folder for measuring the time per image that is used when training resnet152. Also added a file resnet.py that contains functions for generating the resnet model object
* [ModelTiming] Changed resnet_torch so it can be imported to other places.
* [ModelTiming] Changed import library for resnet

* [ModelTiming] Added python script to examples folder for measuring the time per image that is used when training resnet152. Also added a file resnet.py that contains functions for generating the resnet model object
* [ModelTiming] Changed resnet_torch so it can be imported to other places.
* [ModelTiming] Changed import library for resnet
* [ModelTiming] Added script for training resnet152 with DDP. This allows the model to be trained on multiple gpus on multiple nodes.
* [ModelTiming] Changed gpu to local_rank in the resnet152_ddp_training.py script.
* [ModelTiming] resnet152_ddp_training script now obtains the master address from the Slurm environment

facebook-github-bot · 2021-08-31T09:33:17Z

Hi @jiazhihao!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

jiazhihao and others added 30 commits November 16, 2020 02:57

Initial NCCL implementation

383ccb4

Merge branch 'master' of https://github.com/flexflow/FlexFlow into nccl

a93f309

More work on NCCL implementation

36480b2

[NCCL] bug fixes

36e62ed

Merge branch 'master' of https://github.com/flexflow/FlexFlow into nccl

4af4408

Conflicts: legion

[Mapper] implement a new Mapper for FlexFlow

1c1154d

Merge branch 'master' of https://github.com/flexflow/FlexFlow into nccl

b313fa1

[Mapper] bug fixes

0ee0e68

[NCCL] set NCCL_LAUNCH_MODE to PARALLEL

6ef8a6b

[Python] bug fixes

be97885

Merge branch 'master' of https://github.com/flexflow/FlexFlow into nccl

edfd059

[Python] fix Makefile

fdf8466

[Python] bug fixes for set_weights and get_weights

de1a571

[NCCL] add execution fences to prevent deadlocks across NCCL Allreduce calls

097e073

[NCCL] currently disable memoization optimization since it is incompatible with Legion tracing

46bd812

[Fusion] initial implementation for the fusion optimizations

c2e451a

[Fusion] fix a deadlock bug

a4c4221

[Fusion] multiple bug fixes

[Python] disable DEBUG mode by default

3a38954

Merge branch 'master' of https://github.com/flexflow/FlexFlow into nccl

4dad079

Merge branch 'master' of https://github.com/flexflow/FlexFlow into nccl

bb49825

[Fusion] add sanity checks

f803110

[Fusion] bug fixes

6e79895

[Print] update print_tensor API [Legion] version update

[Python] update data loader interface to support more generic tensors

a268bc1

Resolve legion version conflict

4d3088c

[docs]: try add docs

e0389bb

[docs]: add more comments for ops

24967e9

[docs]: done with all comments of all ops

662749c

[docs]: done with python api

0da4f54

[docs]: minor fix

8e8d067

Resolve merge conflicts

a2018e1

jiazhihao and others added 29 commits February 5, 2021 18:28

[Linear] bug fixes for parameter parallel

4a637a2

Merge branch 'nccl' of https://github.com/flexflow/FlexFlow into nccl

749427f

Merge branch 'nccl' of github.com:flexflow/FlexFlow into nccl

363bcac

[NCCL] rename FF_ENABLE_NCCL to FF_USE_NCCL

e41729c

[Python]: fix for dataloader

b33f2e5

[Mapper] allow registering multiple sharding functors for different parallel configurations [NCCL] eliminate MPI dependency

2c50cc9

Merge branch 'nccl' of https://github.com/flexflow/FlexFlow into nccl

d0b86b0

Merge branch 'nccl' of github.com:flexflow/FlexFlow into nccl

0a7c324

[Python] changes to use new sharding functors

c6bb2c9

Merge branch 'nccl' of github.com:flexflow/FlexFlow into nccl

676cf74

[CircleCI] fix config for python_root

955b5a6

[CircleCI] another fix

9fb8165

[Makefile] minor fix

c7c094a

[ONNX]: checkpoint, add constant

fe088ab

Merge remote-tracking branch 'upstream/nccl' into nccl

e70eb46

[Makefile] including FlexFlow.mk last to avoid overwriting PYTHON_EXE

8446fa7

Merge branch 'nccl' of https://github.com/flexflow/FlexFlow into nccl

0448522

[Simulator] Add memory usage penalty to the simulator. For strategies with memory usage exceeding device capacity, a penalty will be added to the simulated performance. This allows the MCMC search to discover strategies that satisfy the memory constraints.

888bcf4
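
The memory-penalty idea in the commit above can be illustrated with a minimal sketch. This is not the FlexFlow simulator code: the function names, the linear penalty form, and the `penalty_per_byte` and `beta` parameters are all assumptions chosen for illustration. It only shows how adding a penalty for memory overflow to the simulated run time steers a Metropolis-style search away from strategies that do not fit on the device.

```python
import math
import random


def penalized_cost(run_time_ms, memory_bytes, capacity_bytes, penalty_per_byte=1e-6):
    """Simulated cost plus a penalty for memory beyond device capacity.

    Hypothetical form: the penalty grows linearly with the number of bytes
    over capacity and is zero when the strategy fits on the device.
    """
    overflow = max(0, memory_bytes - capacity_bytes)
    return run_time_ms + penalty_per_byte * overflow


def mcmc_accept(current_cost, proposed_cost, beta=0.05, rng=random):
    """Metropolis acceptance rule on the penalized cost.

    Improvements are always accepted; a worse proposal is accepted with
    probability exp(-beta * (proposed_cost - current_cost)).
    """
    if proposed_cost <= current_cost:
        return True
    return rng.random() < math.exp(-beta * (proposed_cost - current_cost))


# A strategy that is faster but exceeds a 2 GiB device ends up costlier
# than a slower strategy that fits, so the search tends to reject it.
GIB = 2 ** 30
fits = penalized_cost(10.0, 1 * GIB, 2 * GIB)
overflows = penalized_cost(8.0, 3 * GIB, 2 * GIB)
```

With these assumed numbers, `overflows` is far larger than `fits` even though its raw run time is lower, which is the effect the commit message describes.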

[Example][Python] Add a multi-head attention example

c458ae5

[Search] add support for inference

5a27e6b

[Loss][Metrics] allow the final layer and label tensor to have different partitioning

7536685

[Linear] bug fix in calculating the weight dims when NCCL is enabled

27e394e

[Keras_EXP]: a new keras module that uses ONNX

2d11495

[Keras_EXP]: a new keras module Conflicts: python/flexflow/onnx/model.py [Keras-EXP]: more work [Keras-EXP]: add missing files

[Keras_EXP]: add compile mode

e0a50fd

Merge branch 'master' of https://github.com/flexflow/FlexFlow into nccl

a7ffc96

Merge branch 'nccl' of https://github.com/flexflow/FlexFlow into nccl

ac366e6
