The transformer architecture from Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding.
We present and evaluate two approaches for knowledge transfer between LLMs.
1 code implementation • 27 Nov 2023 • Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut
Large language models (LLMs) can potentially democratize access to medical knowledge.
Ranked #1 on Multiple Choice Question Answering (MCQA) on MedMCQA (dev set, accuracy %)
While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts.
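As a rough illustration of this limitation, the sketch below estimates how much memory the attention score matrix alone would need as the context grows; the head count, batch size, and fp16 score storage are assumptions made only for the arithmetic, not figures from any particular model.

```python
# Back-of-the-envelope estimate of attention score-matrix memory.
# Assumptions: fp16 scores, 32 heads, batch size 1; real implementations
# (fused kernels, recomputation, etc.) can reduce this substantially.
BYTES_PER_SCORE = 2  # fp16

def attention_matrix_bytes(seq_len: int, num_heads: int = 32, batch_size: int = 1) -> int:
    """Memory for one layer's (batch, heads, seq, seq) attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len * BYTES_PER_SCORE

for n in (1_024, 8_192, 65_536):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"context {n:>6}: ~{gib:8.1f} GiB per layer")
# Memory grows ~64x each time the context grows 8x, which is what limits long contexts.
```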
Learned metrics such as BLEURT have become widely used in recent years to evaluate the quality of machine translation systems.
However, we show through a novel set of experiments that stochastic noise is not sufficient to explain good non-convex training and that, instead, the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects in the noise-less case, i.e., for full-batch gradient descent (GD).
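For intuition only, here is a deliberately simple 1-D toy (a hypothetical loss landscape and constants, not the paper's experimental setup) in which deterministic full-batch GD with a small learning rate stays in a sharp basin near the initialization, while a large learning rate escapes to a wider, lower basin, with no stochastic noise anywhere.

```python
import numpy as np

# Hypothetical 1-D landscape: a sharp, shallow basin near x = 0 and a wide,
# lower basin near x = 3. Both runs use deterministic full-batch GD.
def loss(x):
    return 0.5 * (x - 3.0) ** 2 - 2.0 * np.exp(-x ** 2 / (2 * 0.1 ** 2))

def grad(x):
    return (x - 3.0) + (2.0 * x / 0.1 ** 2) * np.exp(-x ** 2 / (2 * 0.1 ** 2))

def gd(lr, x0=0.0, steps=2000):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # no minibatch noise anywhere in this loop
    return x, loss(x)

for lr in (1e-3, 1e-1):
    x, l = gd(lr)
    print(f"lr={lr:g}: ends at x={x:+.3f}, loss={l:.3f}")
# The small learning rate settles in the sharp basin; the large one escapes to
# the wider, lower basin, even though both runs are fully deterministic.
```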
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness.
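To make the staleness notion concrete, here is a minimal simulation sketch in which each applied gradient was computed several steps earlier, as it would be by a lagging asynchronous worker. The quadratic objective, learning rate, and noise model are hypothetical placeholders chosen only to show how staleness enters the update rule, not the paper's setting.

```python
import numpy as np
from collections import deque

# Toy quadratic objective 0.5*||w||^2; hypothetical, just to illustrate staleness.
rng = np.random.default_rng(0)
dim, lr, steps = 20, 0.05, 300

def stochastic_grad(w, batch_size):
    # Gradient of 0.5*||w||^2 plus minibatch noise that shrinks with batch size.
    return w + rng.normal(scale=1.0 / np.sqrt(batch_size), size=w.shape)

def async_sgd(staleness, batch_size):
    w = np.ones(dim)
    pending = deque()  # gradients "in flight", computed on older parameters
    for _ in range(steps):
        pending.append(stochastic_grad(w, batch_size))
        if len(pending) > staleness:
            w = w - lr * pending.popleft()  # apply a gradient that is `staleness` steps old
    return float(np.linalg.norm(w))

for tau in (0, 4, 16):
    print(f"staleness {tau:2d}: final ||w|| = {async_sgd(tau, batch_size=32):.4f}")
```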
We reproduce the results of the paper "On Warm-Starting Neural Network Training." In many real-world applications, the training data is not available all at once but is accumulated over time.
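As a minimal illustration of the warm-starting setup being reproduced (a hypothetical logistic-regression toy, not the report's actual experiments): a model is first fit on the data available early on, and once more data arrives, training either continues from the existing weights (warm start) or restarts from a fresh initialization.

```python
import numpy as np

# Hypothetical data and model, used only to contrast warm-starting with
# training from scratch as data accumulates.
rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
X = rng.normal(size=(2000, 10))
y = (X @ w_true + 0.5 * rng.normal(size=2000) > 0).astype(float)

def train(X, y, w_init, epochs=50, lr=0.1):
    w = w_init.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)  # full-batch gradient of the logistic loss
    return w

w0 = np.zeros(10)
w_stage1 = train(X[:1000], y[:1000], w0)  # train on the data available first

# New data arrives later: either continue from w_stage1 (warm start) ...
w_warm = train(X, y, w_stage1)
# ... or re-initialize and train on everything from scratch (cold start).
w_cold = train(X, y, w0)

def accuracy(w):
    return float(((X @ w > 0) == y.astype(bool)).mean())

print(f"warm-started accuracy: {accuracy(w_warm):.3f}")
print(f"from-scratch accuracy: {accuracy(w_cold):.3f}")
```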
Designing the architecture of deep neural networks (DNNs) requires human expertise and is a cumbersome task.