Benchmarking Triton (TensorRT) Inference Server for Transformer Models
Nitish Shirish Keskar · #engineering

Summary

We investigate NVIDIA's Triton (TensorRT) Inference Server as a way of hosting Transformer language models. The blog is roughly divided into two parts: (i) instructions for setting up your own inference server, and (ii) benchmarking experiments. The instructions are intended to be detailed and standalone, but readers interested solely in