diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3edbfa6d9d6e..abd6fc7757b9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -70,10 +70,305 @@ #### TTS
 Changelog
 
 - Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
 - Add mel codec checkpoints by @anteju :: PR: #9228
 - GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
 - chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
+- refactor: notebook branch release by @ko3n1g :: PR: #9711
+
+
+#### NLP / NMT
+
+Changelog
+
+- Update nemo.export module for quantized models by @janekl :: PR: #9218
+- Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
+- Checkpoint resuming compatible for 2403 container by @suiyoubi :: PR: #9199
+- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
+- use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
+- Revert rope fusion defaults by @cuichenx :: PR: #9237
+- fix import by @akoumpa :: PR: #9240
+- Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
+- sum-reduce grad_norm in DP+CP domain by @erhoo82 :: PR: #9262
+- Alit/bert convert fix by @JRD971000 :: PR: #9285
+- conv1d stable version by @JRD971000 :: PR: #9330
+- Fix trainer builder when exp_manager is not in config by @yaoyu-33 :: PR: #9293
+- Fix Peft Weights Loading in NeVA by @yaoyu-33 :: PR: #9341
+- Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
+- Fix FSDP gradient calculation with orig params by @janEbert :: PR: #9335
+- TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
+- support null/None truncation field by @arendu :: PR: #9355
+- NeVa token fusion by @paul-gibbons :: PR: #9245
+- bugfix if using mcore distOpt with sft by @akoumpa :: PR: #9356
+- Re-org export code by @oyilmaz-nvidia :: PR: #9353
+- QLoRA by @cuichenx :: PR: #9340
+- PeFT fix for distOpt by @akoumpa :: PR: #9392
+- [NeMo-UX] Integrating mcore's DistributedDataParallel into MegatronStrategy by @marcromeyn :: PR: #9387
+- cherry pick of #9266 by @dimapihtar :: PR: #9411
+- Enable specifying alpha for PTQ INT8 SmoothQuant method by @janekl :: PR: #9423
+- add support for new mcore ds features by @dimapihtar :: PR: #9388
+- LoRA for MoE Layer by @cuichenx :: PR: #9396
+- Mistral-7B: apply user's precision to output checkpoint by @akoumpa :: PR: #9222
+- Add option to merge distributed optimizer buckets by @timmoon10 :: PR: #9414
+- TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
+- In-framework deployment by @oyilmaz-nvidia :: PR: #9438
+- Bugfix missing variables and argument changes to MegatronPretrainingRandomSampler by @jstjohn :: PR: #9458
+- Hyena Operator by @guyjacob :: PR: #9264
+- Refactor Quantizer for reusing in QAT by @kevalmorabia97 :: PR: #9276
+- move load state dict after initialize parallel state in nlp_model by @ryxli :: PR: #9382
+- Enable user to optionally upgrade Megatron by @jstjohn :: PR: #9478
+- Fix unwrap model by @cuichenx :: PR: #9480
+- fix operator precedence by @akoumpa :: PR: #9403
+- [NeMo-UX] Adding context- & expert-parallelism to MegatronStrategy by @marcromeyn :: PR: #9525
+- update mcoreddp call by @akoumpa :: PR: #9345
+- mcore distOpt restore fix by @akoumpa :: PR: #9421
+- vLLM Export Support by @apanteleev :: PR: #9381
+- PL: Delete precision if using plugin. TODO switch to MegatronTrainerB… by @akoumpa :: PR: #9535
+- extend get_gpt_layer_modelopt_spec to support MoE by @akoumpa :: PR: #9532
+- fix mock data generation for legacy dataset by @dimapihtar :: PR: #9530
+- add reset learning rate functionality by @dimapihtar :: PR: #9372
+- Use closed-formula to round by multiple by @akoumpa :: PR: #9307
+- GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
+- Consolidate gpt continue training script into pretraining script by @yaoyu-33 :: PR: #9413
+- Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
+- PTQ refinements by @janekl :: PR: #9574
+- Add ModelOpt QAT example for Llama2 SFT model by @kevalmorabia97 :: PR: #9326
+- Multimodal projection layer adapter fix for PP>1 by @paul-gibbons :: PR: #9445
+- Add offline quantization script for QLoRA deployment by @cuichenx :: PR: #9455
+- Make QLoRA more model-agnostic by @cuichenx :: PR: #9488
+- Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
+- [NeMo-UX] Fix Megatron-optimizer by @marcromeyn :: PR: #9599
+- Chat template support for megatron_gpt_eval.py by @akoumpa :: PR: #9354
+- [NeMo-UX] Add PEFT by @cuichenx :: PR: #9490
+- Alit/mamba tmp by @JRD971000 :: PR: #9612
+- Enable MCore checkpointing optimizations by @mikolajblaz :: PR: #9505
+- Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
+- fix ckpt load bug by @dimapihtar :: PR: #9621
+- Alit/mamba by @JRD971000 :: PR: #9575
+- Unwrap ckpt_io for model opt (async save) by @mikolajblaz :: PR: #9622
+- MCore T5 support for NeMo - Training by @huvunvidia :: PR: #9432
+- [Nemo-UX] Expose transformer_layer_spec inside GPTConfig by @marcromeyn :: PR: #9592
+- Update NeMo Clip to Use MCore Modules by @yaoyu-33 :: PR: #9594
+- Mistral + Mixtral Support for NeVa by @paul-gibbons :: PR: #9459
+- Adding support for mcore generate by @shanmugamr1992 :: PR: #9566
+- Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
+- [Cherrypick] support lora when kv_channel != hidden_size / num_heads by @cuichenx :: PR: #9644
+- Parametrize FPS group by @mikolajblaz :: PR: #9648
+- Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
+- add documentation for reset_lr feature by @dimapihta
+- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
+- Cherry pick: LITA Integration by @Slyne :: PR: #9684
+- SDXL improvements (and support for Draft+) by @rohitrango :: PR: #9654
+- Gemma 2 by @cuichenx :: PR: #9672
+- Allows non-strict load with distributed checkpoints by @mikolajblaz :: PR: #9613
+- refactor: notebook branch release by @ko3n1g :: PR: #9711
+- [NeMo-UX] Make TE and Apex dependencies optional by @ashors1 :: PR: #9550
+- Alit/r2.0.0 by @JRD971000 :: PR: #9718
+- Manually cherry-pick from PR 9679 (PR to main - Support SFT/Eval/PEFT for mcore T5) by @huvunvidia :: PR: #9737
+- In framework export by @oyilmaz-nvidia :: PR: #9658
+- T5 changes based on mcore changes by @pablo-garay :: PR: #9829
+- [NeMo-UX] Use single instance of loss reductions in GPTModel by @hemildesai :: PR: #9801
+- deprecate NeMo NLP tutorial by @dimapihtar :: PR: #9864
+- Disable nvFuser setup with PyTorch 23.11 and later by @athitten :: PR: #9837
+- make torch_dist ckpt strategy as default by @dimapihtar :: PR: #9852
+- add rampup bs documentation by @dimapihtar :: PR: #9884
+- copy of #9576 by @dimapihtar :: PR: #9986
+- Support Nvidia Torch and Arch versions by @thomasdhc :: PR: #9897
+- Bug fix for pooler causing dist checkpointing exception by @shanmugamr1992 :: PR: #10008
+
+
+#### Export
+
+Changelog
+
+- Update nemo.export module for quantized models by @janekl :: PR: #9218
+- Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
+- Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
+- TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
+- Re-org export code by @oyilmaz-nvidia :: PR: #9353
+- Use TensorRT-LLM native parameter names in nemo.export module by @janekl :: PR: #9424
+- TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
+- vLLM Export Support by @apanteleev :: PR: #9381
+- Add page context fmha option in TensorRTLLM export by @meatybobby :: PR: #9526
+- Test C++ runtime on demand in nemo_export.py to avoid possible OOMs by @janekl :: PR: #9544
+- Fix nemo export test by @oyilmaz-nvidia :: PR: #9547
+- Add tps and pps params to the export script by @oyilmaz-nvidia :: PR: #9558
+- Add Multimodal Exporter by @meatybobby :: PR: #9256
+- Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
+- Inflight nemo model export support by @JimmyZhang12 :: PR: #9527
+- vLLM Export Improvements by @apanteleev :: PR: #9596
+- Akoumparouli/nemo ux mixtral export by @akoumpa :: PR: #9603
+- Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
+- Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624
+- Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
+- Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
+- In framework export by @oyilmaz-nvidia :: PR: #9658
+- Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826
+
+
+#### Bugfixes
+
+Changelog
+
+- use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
+- fix import by @akoumpa :: PR: #9240
+- Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281
+- call set_expert_model_parallel_world_size instead of set_cpu_expert_m… by @akoumpa :: PR: #9275
+- Fix typos in Mixtral NeMo->HF and Starcoder2 NeMo->HF conversion scripts by @evellasques :: PR: #9325
+- Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
+- Add OpenAI format response to r2.0.0rc1 by @athitten :: PR: #9796
+- [NeMo UX] Support generating datasets using different train/valid/test distributions by @ashors1 :: PR: #9771
+- Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826
+
+
+#### General Improvements
+
+Changelog
+
+- [Nemo CICD] run_cicd_for_release_branches_also by @pablo-garay :: PR: #9213
+- rename paths2audiofiles to audio by @github-actions[bot] :: PR: #9220
+- Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @github-actions[bot] :: PR: #9234
+- ci: Remove duplicated job by @ko3n1g :: PR: #9258
+- Fix document links by @yaoyu-33 :: PR: #9260
+- Pin transformers by @github-actions[bot] :: PR: #9273
+- Fix loading github raw images on notebook by @github-actions[bot] :: PR: #9283
+- Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @github-actions[bot] :: PR: #9278
+- Refactor Sequence Packing Script by @cuichenx :: PR: #9271
+- [Nemo-UX] Move code to collections + fix some small bugs by @marcromeyn :: PR: #9277
+- Fix typo in HF tutorial by @github-actions[bot] :: PR: #9304
+- Expand documentation for data parallelism and distributed optimizer by @timmoon10 :: PR: #9227
+- Install alerting by @ko3n1g :: PR: #9311
+- typos by @github-actions[bot] :: PR: #9315
+- FP8 feature documentation by @ksivaman :: PR: #9265
+- [Nemo CICD] Comment out flaky tests by @pablo-garay :: PR: #9333
+- Fixed typos in README.rst by @gdevakumar :: PR: #9322
+- Update README.rst to clarify installation via Conda by @SimonCW :: PR: #9323
+- [Nemo CICD] update flaky test by @pablo-garay :: PR: #9339
+- fix lora and ptuning and isort/black by @github-actions[bot] :: PR: #9295
+- Fix P-tuning for Llama based models by @github-actions[bot] :: PR: #9300
+- add large model stable training fix and contrastive loss update for variable seq by @github-actions[bot] :: PR: #9348
+- Guard cuda memory allocator update by @github-actions[bot] :: PR: #9313
+- [Nemo CICD] Remove unnecessary commented out code by @pablo-garay :: PR: #9364
+- Update Gemma conversion script by @yaoyu-33 :: PR: #9365
+- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @github-actions[bot] :: PR: #9371
+- Re-enable cuda graphs in training modes. by @github-actions[bot] :: PR: #9343
+- fix typo infer_seq_lenght -> infer_seq_length by @akoumpa :: PR: #9370
+- Make a backward compatibility for old MSDD configs in label models by @github-actions[bot] :: PR: #9378
+- Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @github-actions[bot] :: PR: #9253
+- Update README.rst by @jgerh :: PR: #9393
+- Force diarizer to use CUDA if cuda is available and if device=None. by @github-actions[bot] :: PR: #9390
+- ci: Properly catch failed tests by introduction of workflow templates by @ko3n1g :: PR: #9324
+- Fix T5 G2P Input and Output Types by @github-actions[bot] :: PR: #9269
+- Huvu/rag pipeline citest by @huvunvidia :: PR: #9384
+- Fix circular import for MM dataprep notebook by @github-actions[bot] :: PR: #9292
+- add check if num layers is divisible by pp size by @github-actions[bot] :: PR: #9298
+- [Nemo CICD] timeouts fix by @pablo-garay :: PR: #9407
+- [NeMo-UX] Removing un-used ModelConfig class by @marcromeyn :: PR: #9389
+- Add tutorial for Llama-3-8B lora training and deployment by @shashank3959 :: PR: #9359
+- [NeMo-UX] Removing default_path from ModelConnector by @marcromeyn :: PR: #9401
+- Fix README by @ericharper :: PR: #9415
+- [SD] Fix SD CUDA Graph Failure by @alpha0422 :: PR: #9319
+- [NeMo-UX] Adding file-lock to Connector by @marcromeyn :: PR: #9400
+- Add Dev Container Bug Report by @pablo-garay :: PR: #9430
+- Akoumparouli/profiling docs by @akoumpa :: PR: #9420
+- ci: Enrich notifications by @ko3n1g :: PR: #9412
+- Fix failing RIR unit test with lhotse 1.24+ by @pzelasko :: PR: #9444
+- [NeMo-UX] Adding support for mcore distributed optimizer by @marcromeyn :: PR: #9435
+- Use ModelOpt build_tensorrt_llm for building engines for qnemo checkpoints by @janekl :: PR: #9452
+- ci(notifications): Fix extraction of last 2K chars by @ko3n1g :: PR: #9450
+- Update readme with mlperf news by @ericharper :: PR: #9457
+- [NeMo-UX] Add nsys callback by @ashors1 :: PR: #9461
+- [NeMo UX] Introducing optimizer module by @marcromeyn :: PR: #9454
+- Fix minor import bug in deploy module by @oyilmaz-nvidia :: PR: #9463
+- ci(notifications): Fetch all jobs by @ko3n1g :: PR: #9465
+- Update build_dataset.py by @stevehuang52 :: PR: #9467
+- bionemo: bn2/add pipelineparallel dtype by @skothenhill-nv :: PR: #9475
+- [NeMo-UX] Integrate experiment manager features with NeMo-UX APIs by @ashors1 :: PR: #9460
+- Add python_requires by @galv :: PR: #9431
+- [NeMo-UX] Fixing imports of NeMoLogging, AutoResume & ModelCheckpoint by @marcromeyn :: PR: #9476
+- Modelopt Refactor for SDXL Quantization by @suiyoubi :: PR: #9279
+- [NeMo-UX] Fixing defaults in llm.train & Mistral7BModel by @marcromeyn :: PR: #9486
+- In framework deploy using deploy script by @oyilmaz-nvidia :: PR: #9468
+- [NeMo-UX] Integrate tokenizer import into model.import_ckpt by @marcromeyn :: PR: #9485
+- append to file by @malay-nagda :: PR: #9483
+- [NeMo-UX] Fix bug in import_ckpt by @marcromeyn :: PR: #9492
+- Add nemotron news by @ericharper :: PR: #9510
+- Add CICD test for Stable Diffusion by @michal2409 :: PR: #9464
+- Akoumparouli/nemo ux mixtral by @akoumpa :: PR: #9446
+- [NeMo-UX] Llama and Gemma by @cuichenx :: PR: #9528
+- [NeMo-UX] minor logging bug fixes by @ashors1 :: PR: #9529
+- Update neva conversion script from and to HF by @yaoyu-33 :: PR: #9296
+- [Nemo-UX] IO fixes by @marcromeyn :: PR: #9512
+- Fix lhotse tests for v1.24.2 by @pzelasko :: PR: #9546
+- [Nemo CICD] Make GPU Unit Tests non-optional by @pablo-garay :: PR: #9551
+- Add Python AIStore SDK to container and bump min Lhotse version by @pzelasko :: PR: #9537
+- [NeMo-UX] Fix tokenizer IO by @marcromeyn :: PR: #9555
+- [NeMo UX] Move mistral_7b.py to mistral.py by @akoumpa :: PR: #9545
+- ci: Do not attempt to send slack on fork by @ko3n1g :: PR: #9556
+- Fix SDXL incorrect name in Docs by @suiyoubi :: PR: #9534
+- Bump PTL version by @athitten :: PR: #9557
+- [Resiliency] Straggler detection by @jbieniusiewi :: PR: #9473
+- [NeMo-UX] Switch to torch_dist as default distributed checkpointing backend by @ashors1 :: PR: #9541
+- [NeMo-UX] Checkpointing bug fixes by @ashors1 :: PR: #9562
+- Expose MCore path_to_cache option by @maanug-nv :: PR: #9570
+- [NeMo-UX] Fix Trainer serialization by @marcromeyn :: PR: #9571
+- Update click version requirement by @thomasdhc :: PR: #9580
+- [Fault tolerance] Heartbeat detection by @maanug-nv :: PR: #9352
+- [Nemo-UX] Add fabric-API for manual forward-pass by @marcromeyn :: PR: #9577
+- [Nemo-UX] Add SDK-factories to llm-collection by @marcromeyn :: PR: #9589
+- [NeMo-UX] Some improvements to NeMoLogger by @marcromeyn :: PR: #9591
+- Set no_sync_func & grad_sync_fucn by @akoumpa :: PR: #9601
+- [NeMo-UX] Fix nemo logger when trainer has no loggers by @ashors1 :: PR: #9607
+- Fix the dictionary format returned by the `scheduler` method by @sararb :: PR: #9609
+- [NeMo-UX] Dataloading enhancements and bug fixes by @ashors1 :: PR: #9595
+- Fix serialization of AutoResume by @sararb :: PR: #9616
+- Jsonl support by @adityavavre :: PR: #9611
+- Akoumparouli/mistral import instruct chat template fix by @akoumpa :: PR: #9567
+- Remove .cuda calls, use device isntead by @akoumpa :: PR: #9602
+- fix converter defautl args by @akoumpa :: PR: #9565
+- fix: remove non_blocking from PTL's .cuda call by @akoumpa :: PR: #9618
+- NeVA Minor Fixes by @yaoyu-33 :: PR: #9608
+- [NeMo-UX] fix pretrianing data sizes and weights by @cuichenx :: PR: #9627
+- [NeMo-UX] async checkpointing support by @ashors1 :: PR: #9466
+- Change default parallel_save to False by @mikolajblaz :: PR: #9632
+- Add REST API to deploy module by @athitten :: PR: #9539
+- ci: Timeout per step, not job by @ko3n1g :: PR: #9635
+- [NeMo-UX] Fix when optimizers are setup for PEFT by @marcromeyn :: PR: #9619
+- [NeMo-UX] Fix pipeline parallel bug by @ashors1 :: PR: #9637
+- Fixing import error fior llama-index (RAG pipeline) by @pablo-garay :: PR: #9662
+- llama CI fix by @rohitrango :: PR: #9663
+- [NeMo-UX] Make 'load_directly_on_device' configurable by @ashors1 :: PR: #9657
+- [Nemo-UX] Including all trainable-params in a PEFT-checkpoint by @marcromeyn :: PR: #9650
+- [NeMo-UX] Fix imports so local configuration of runs works again by @marcromeyn :: PR: #9690
+- Set TE flag in legacy -> mcore conversion script by @terrykong :: PR: #9722
+- Update starthere docs text by @erastorgueva-nv :: PR: #9724
+- TorchAudio installation workaround for incorrect `PYTORCH_VERSION` variable by @artbataev :: PR: #9736
+- [NeMo-UX] Match nemo 1's default behavior for drop_last and pad_samples_to_global_batch_size by @ashors1 :: PR: #9707
+- add a bit more for timeout (#9702) by @pablo-garay :: PR: #9754
+- Fix missing parallelisms by @maanug-nv :: PR: #9725
+- update branch by @nithinraok :: PR: #9764
+- Fix data preprocessing script by @cuichenx :: PR: #9759
+- vLLM 0.5.1 update by @apanteleev :: PR: #9779
+- upper bound hf-hub by @akoumpa :: PR: #9805
+- Fix few issues and docs for neva and clip in r2.0.0rc1 by @yaoyu-33 :: PR: #9681
+- add dummy vision and text transformer config (assumed mcore to be false) by @rohitrango :: PR: #9699
+- fix lita bugs by @Slyne :: PR: #9810
+- [NeMo-UX] Log `val_loss` by @ashors1 :: PR: #9814
+- [NeMo-UX] Fix some dataloading bugs by @ashors1 :: PR: #9807
+- [NeMo-UX] Adding recipes by @marcromeyn :: PR: #9720
+- [NeMo-UX] Set async_save from strategy rather than ModelCheckpoint by @ashors1 :: PR: #9800
+- Fix hf hub for 0.24+ by @titu1994 :: PR: #9806
+- [NeMo-UX] Fix a minor bug with async checkpointing by @ashors1 :: PR: #9856
+- [NeMo-UX] make progress bar easier to parse by @ashors1 :: PR: #9877
+- Docs: add "Nemo Fundamentals" page by @erastorgueva-nv :: PR: #9835
 - Create __init__.py by @stevehuang52 :: PR: #9892
 - [NeMo-UX] Fixes to make PreemptionCallback work by @hemildesai :: PR: #9830
 - Fix Docker build. Make Dockerfile consistent with CI by @artbataev :: PR: #9784
@@ -98,6 +393,10 @@
 - [NeMo-UX] Update default PTL logging `save_dir` by @ashors1 :: PR: #9954
 - Fix lita tutorial by @Slyne :: PR: #9980
 - Add deploy and REST API support to NeMo 2.0 by @athitten :: PR: #9834
+- ci: Allow changelog manual (#10156) by @ko3n1g :: PR: #10157
+- docs: Add changelog by @ko3n1g :: PR: #10155
+- add manifest file by @ko3n1g :: PR: #10161
+
 ## NVIDIA Neural Modules 2.0.0rc0