More Inference Endpoints features and fixes (#68)
* feat(generator): better handle exceptions in multiprocessing. An error is now raised to signal a problem; previously the root thread got stuck waiting on a dead agent, whereas now it exits.
* feat(tgi): add more debug output on the server.
* chore(docker): entrypoint JSON output is enabled by default; it can be disabled by setting JSON_OUTPUT_DISABLE. It is now also possible to experiment with more batch sizes.
* feat(generator): add bucketing functions for use in prefill.
* feat(generator): store position_id in the current slot. This further simplifies the implementation of prefill bucketing.
* fix(generator): correct input_ids and attention_mask padding.
* fix(TGI): fix input truncation. Truncation was sub-optimal, and it was done on the wrong side.
* feat(generator): enable logs in child processes.
* feat(tgi): warmup runs prefill/decode on all supported combinations. This prevents XLA compilation at inference time. Note that dynamo compilation had to be disabled, otherwise the model did not generate correct results. This makes generation slower, but generation now appears stable.
* ci(tgi): create images when pushing to the current branch. This allows IE to be tested before release.
* feat(tgi): reversed loop order in warmup to hit memory limits earlier.
* chore(ci): remove image generation for this branch.
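The prefill bucketing mentioned above can be sketched as follows: prompt lengths are rounded up to a small, fixed set of bucket sizes so that XLA compiles only one prefill graph per bucket instead of one per input length. This is a minimal illustration, not the actual implementation; `next_bucket` and its power-of-two scheme are assumptions.

```python
# Hypothetical sketch of prefill bucketing: round a sequence length up to
# the nearest power-of-two bucket, capped at the model's max length, so
# only a bounded set of input shapes (and XLA graphs) ever exists.
def next_bucket(seq_len: int, max_length: int = 1024, min_bucket: int = 16) -> int:
    """Return the smallest power-of-two bucket >= seq_len, capped at max_length."""
    bucket = min_bucket
    while bucket < seq_len:
        bucket *= 2
    return min(bucket, max_length)
```

Inputs are then padded up to `next_bucket(len(prompt))`, trading a little wasted compute per request for a bounded number of compilations.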
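The truncation-side and padding fixes above can be illustrated together: for decoder-only generation, truncation should keep the most recent tokens (left truncation), and padding should also go on the left so that generated tokens stay contiguous with the prompt. The function below is an illustrative sketch under those assumptions, not the project's actual code.

```python
# Hypothetical sketch of left truncation + left padding for input_ids and
# attention_mask. `pad_id` and the helper name are illustrative.
def prepare_inputs(token_ids: list[int], target_len: int, pad_id: int = 0):
    # Left truncation: keep only the last target_len tokens.
    kept = token_ids[-target_len:]
    pad = target_len - len(kept)
    input_ids = [pad_id] * pad + kept             # left padding
    attention_mask = [0] * pad + [1] * len(kept)  # padding positions masked out
    return input_ids, attention_mask
```

Truncating on the right instead would drop the end of the prompt, which is exactly the context the model needs to continue from.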
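The warmup strategy described above — running prefill/decode over all supported combinations, largest first — can be sketched as a pair of reversed loops. `warmup_step` is a stand-in for the real prefill/decode call; the structure, not the names, is the point.

```python
# Hypothetical sketch of warmup over all (batch_size, sequence_bucket)
# combinations, so XLA graphs are compiled before serving. Iterating in
# reversed (descending) order surfaces out-of-memory failures on the
# largest shapes first, instead of after many cheap small-shape passes.
def warmup(batch_sizes, seq_buckets, warmup_step):
    compiled = []
    for batch in sorted(batch_sizes, reverse=True):
        for seq in sorted(seq_buckets, reverse=True):
            warmup_step(batch, seq)  # run one prefill/decode pass
            compiled.append((batch, seq))
    return compiled
```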