AIML Special Presentation: Google Research Visit
Generative LLMs are transforming multiple industries and have proven robust across a multitude of use cases and settings. One of the key impediments to their widespread deployment is the cost of serving and the difficulty of deploying them across diverse devices and settings. In this talk, Grace and Prateek discussed the key challenges in improving the efficiency of LLM serving and gave an overview of some of the main techniques for addressing them. They also presented tandem transformers and HIRE, novel methods for speeding up decoding in LLMs.