NVIDIA / Megatron-LM Public

Issues: NVIDIA/Megatron-LM

181 Open 655 Closed

Issues list

[QUESTION] checkpointing/loading memory overhead

#1380 opened Feb 6, 2025 by JinjieNi

[BUG] The logic for calculating the last stage when average loss across microbatches.

#1379 opened Feb 6, 2025 by LitLeo

[ENHANCEMENT] add options how to choose topk devices for device_limited_topk

#1378 opened Feb 6, 2025 by bzantium

[QUESTION] any one used --exit-signal-handler?

#1376 opened Feb 5, 2025 by HenryTangIntel

[QUESTION] Support for Heterogeneous Parallelism in Multimodal Training

#1375 opened Feb 4, 2025 by swiftomkar

[REGRESSION]

#1372 opened Feb 2, 2025 by lawchingman

RuntimeError: The server socket has failed to listen on any local network address. port: 12341, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use Traceback (most recent call last):

#1371 opened Jan 31, 2025 by PriyaEnuganti

[ENHANCEMENT] Support pre-built wheels for Python 3.12

#1370 opened Jan 30, 2025 by kevalmorabia97

[QUESTION] Backend nccl does not support reduce_scatter_tensor_coalesced, how could I solve it

#1369 opened Jan 30, 2025 by TeddLi

[BUG] FSDP2 activation recomputation does not save memory

#1368 opened Jan 28, 2025 by janEbert

[BUG] BERT and GPT345 Model Checkpoints Returning 410 Gone HTTP Response

#1367 opened Jan 28, 2025 by GangGreenTemperTatum

[QUESTION]convert LLaMA2-7B to the Megatron format failed: the converted model only repeats meaningless numbers

#1365 opened Jan 22, 2025 by carrot0117

[QUESTION] How can I train a model from hugging face

#1364 opened Jan 22, 2025 by JavaZeroo

[QUESTION] The dataset cannot be found in multi-node multi-GPU training.

#1355 opened Jan 13, 2025 by stay88

[QUESTION] Limit Number of Saved Checkpoints

#1354 opened Jan 13, 2025 by GuokunWang

[BUG]

#1353 opened Jan 11, 2025 by lawchingman

[BUG] can't load saved fp8 checkpoint when resume training

#1350 opened Jan 8, 2025 by switiz

[BUG] Using fp16 uses more memory than using fp32

#1349 opened Jan 8, 2025 by eliird

[BUG] When trying to convert llama2-7b model from HF format to megatron format

#1348 opened Jan 6, 2025 by Sun2018421

[QUESTION] Typo in MoE README

#1346 opened Jan 4, 2025 by rgtjf

[QUESTION] Resume training about dataset

#1343 opened Jan 2, 2025 by JiwenJ

[QUESTION] Expert Parallelism with Non-Identical Experts

#1342 opened Jan 1, 2025 by kevin3567

[QUESTION]"a2a+p2p" for context parallel(cp)

#1341 opened Dec 27, 2024 by heavyrain-lzy

[QUESTION]How to convert the weight file format of the MAMBA model from pt to safetensors format?

#1339 opened Dec 26, 2024 by fxnie

[QUESTION] Why mixral use Llama2Tokenizer?

#1338 opened Dec 25, 2024 by DemingCheng
