no code implementations • 30 Nov 2023 • Ankit Bhardwaj, Amar Phanishayee, Deepak Narayanan, Mihail Tarta, Ryan Stutsman
We present Packrat, a new serving system for online inference that given a model and batch size ($B$) algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) that minimizes latency.
1 code implementation • 25 Feb 2021 • Yu Jian Wu, Hongyi Wang, Yuhong Zhong, Asaf Cidon, Ryan Stutsman, Amy Tai, Junfeng Yang
The overhead of the kernel storage path accounts for half of the access latency for new NVMe storage devices.
Operating Systems Databases