no code implementations • 18 Oct 2023 • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized, which allows us to accelerate generative LLM inference.
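The amortization idea above can be sketched in a minimal, hypothetical example (not the paper's implementation): when several tokens share the same layer parameters, the weight matrix only needs to be read from memory once and applied to all tokens together, rather than being re-read for each token.

```python
import numpy as np

d_model, n_tokens = 64, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_model))        # shared layer parameters
tokens = rng.standard_normal((n_tokens, d_model))  # tokens executing in parallel

# Naive decoding: W is traversed once per token, so memory traffic
# for the weights scales with the number of tokens.
naive = np.stack([tok @ W for tok in tokens])

# Amortized: one batched matmul; a single pass over W serves all
# parallel tokens, so the weight-memory cost is paid once.
amortized = tokens @ W

assert np.allclose(naive, amortized)
```

Both paths compute the same result; the batched form simply changes how often the shared parameters are fetched, which is the memory-traffic saving the abstract refers to.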
1 code implementation • 20 Sep 2019 • Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Sophia Shao, Krste Asanovic, Ion Stoica
However, these models are unable to capture the data dependency, the computation graph, or the organization of instructions.
Distributed, Parallel, and Cluster Computing • Performance • Programming Languages