The parallel loss function in InternEvo is adapted from Apex. Users can replace it with the Flash-Attention implementation to obtain a speedup, but doing so may cause the loss to diverge.
For the detailed modifications made in InternEvo, please refer to the code in InternEvo-parallel-loss.
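
To make the idea of a parallel loss concrete, the following is a minimal sketch, assuming the loss in question is a vocabulary-parallel cross-entropy in which each rank holds a contiguous slice of the logits along the vocabulary dimension. It is not the InternEvo, Apex, or Flash-Attention code: the function names `vocab_parallel_cross_entropy` and `reference_cross_entropy` are hypothetical, the ranks are simulated with plain Python lists, and the comments mark where a real implementation would issue an all-reduce.

```python
import math


def vocab_parallel_cross_entropy(logit_shards, target):
    """Cross-entropy for one token whose logits are split across simulated ranks.

    logit_shards: list of lists; shard r holds a contiguous vocabulary slice.
    target: global vocabulary index of the correct token.
    (Hypothetical helper for illustration only.)
    """
    # Step 1: each rank computes its local max; an all-reduce(MAX) would
    # turn these into the global max used for numerical stability.
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: each rank sums exp(logit - max) over its own slice; an
    # all-reduce(SUM) would combine the partial sums.
    sum_exp = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in logit_shards
    )

    # Step 3: only the rank that owns the target index contributes the
    # (shifted) target logit; the others contribute 0, so an
    # all-reduce(SUM) would recover the value on every rank.
    target_logit = 0.0
    offset = 0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit += shard[target - offset] - global_max
        offset += len(shard)

    return math.log(sum_exp) - target_logit


def reference_cross_entropy(logits, target):
    """Unsharded cross-entropy, used only to check the sharded version."""
    m = max(logits)
    return math.log(sum(math.exp(x - m) for x in logits)) - (logits[target] - m)


# Two simulated ranks, each holding half of a 4-entry vocabulary.
shards = [[2.0, -1.0], [0.5, 3.0]]
full_logits = [x for shard in shards for x in shard]
sharded_loss = vocab_parallel_cross_entropy(shards, target=3)
full_loss = reference_cross_entropy(full_logits, target=3)
assert abs(sharded_loss - full_loss) < 1e-9
```

Under these assumptions the sharded result matches the unsharded one up to floating-point rounding, because no rank ever needs the full logit vector, only three reduced scalars per token. Comparing a replacement loss against a reference in this way is one possible check when swapping implementations.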