no code implementations • 15 Apr 2024 • Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
For instance, in Reinforcement Learning from Human Feedback (RLHF), a fundamental technique for Language Model alignment, the Kullback-Leibler divergence between the trainable policy and the SFT policy is minimized in addition to reward maximization.
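The KL-regularized objective described here is commonly written as follows (a standard textbook formulation, not taken verbatim from the paper; \(\pi_\theta\) denotes the trainable policy, \(\pi_{\mathrm{SFT}}\) the reference SFT policy, \(r\) the reward model, and \(\beta\) the regularization coefficient):

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\bigl[\, r(x, y) \,\bigr]
\;-\;
\beta \, \mathrm{KL}\!\bigl( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \bigr)
```

The KL term keeps the optimized policy close to the SFT policy, trading off reward maximization against distributional drift via \(\beta\).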
2 code implementations • 16 Feb 2024 • Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, Daniil Gavrilov
Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in the rapidly evolving field of natural language processing.
no code implementations • 19 Jan 2024 • Alexey Gorbatovski, Sergey Kovalchuk
In this study, we investigate enhancing the performance of GPT Neo 125M on Community Question Answering (CQA) with a focus on programming, by integrating Reinforcement Learning from Human Feedback (RLHF) and utilizing scores from Stack Overflow.
no code implementations • 26 Feb 2023 • Alexey Gorbatovski, Sergey Kovalchuk
We also discuss the influence of penalty terms on the structure of Bayesian networks and how they can be used to analyze the relationships between entities.