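I read the two files carefully and noticed that the train profile uses 8 H800s while the prefill profile uses 1 H800. What is this meant to show? Why is there no inter-node communication? Is our computation-communication overlap done mainly within a node, over NVLink? I would be very grateful for a reply.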
Train uses EP64 and prefill uses EP32; both run across multiple nodes.
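Thank you for the reply. May I ask about the scale of train and prefill? If both are 2K, with EP64 and EP32, I would expect IB to run into heavy incast as well. Can IB handle that easily? Thanks a lot!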