no code implementations • 30 Nov 2023 • Ankit Bhardwaj, Amar Phanishayee, Deepak Narayanan, Mihail Tarta, Ryan Stutsman
We present Packrat, a new serving system for online inference that given a model and batch size ($B$) algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) that minimizes latency.