1 code implementation • 23 Oct 2024 • Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
To address this, we introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories.
no code implementations • 21 Oct 2024 • Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented.
no code implementations • 3 Oct 2024 • Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig
General-purpose LLM judges capable of human-level evaluation provide not only a scalable and accurate way of evaluating instruction-following LLMs but also new avenues for supervising and improving their performance.
1 code implementation • 24 Jul 2024 • Seungyoon Kim, Seungone Kim
Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text.
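As a rough illustration of such a pipeline, here is a minimal sketch of rubric-based LLM judging; the `complete` helper is a hypothetical stand-in for any chat-LLM endpoint, not a real library function.

```python
# Minimal sketch of a rubric-based LLM-judge pipeline. `complete` is a
# hypothetical stand-in for any chat-LLM API call, not a real library function.
import re

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM endpoint here")

JUDGE_TEMPLATE = """You are an impartial judge. Score the response on a 1-5 scale.

Instruction: {instruction}
Response: {response}
Rubric: {rubric}

Give brief feedback, then end with "Score: <1-5>"."""

def judge(instruction: str, response: str, rubric: str) -> int:
    output = complete(JUDGE_TEMPLATE.format(
        instruction=instruction, response=response, rubric=rubric))
    match = re.search(r"Score:\s*([1-5])", output)
    return int(match.group(1)) if match else -1  # -1 signals an unparseable verdict
```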
no code implementations • 20 Jul 2024 • Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, JianGuo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, HanLin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
To our knowledge, we conduct the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora.
1 code implementation • 9 Jun 2024 • Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
1 code implementation • 28 May 2024 • Seongyun Lee, Sue Hyun Park, Seungone Kim, Minjoon Seo
Using this dataset, we train a 7B LLM called Janus and test it on 921 prompts from 5 benchmarks (AlpacaEval 2.0, FLASK, Koala, MT-Bench, and Self-Instruct) by adding various unseen system messages that reflect user preferences.
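A minimal sketch of the general idea — steering a chat LM with a preference-bearing system message. The helper and message contents are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: prepending a preference-bearing system message to steer a chat LM.
# The helper and message contents are illustrative, not the paper's exact setup.
def build_chat(preferences: list[str], user_prompt: str) -> list[dict]:
    system_msg = "You are a helpful assistant. " + " ".join(preferences)
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_prompt},
    ]

messages = build_chat(
    ["Prefer concise answers.", "Cite sources for factual claims."],
    "Explain why the sky is blue.",
)
```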
1 code implementation • 2 May 2024 • Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs.
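A minimal sketch of pairwise judging, one common way such assessment is run; `complete` is again a hypothetical wrapper around any chat-LLM endpoint.

```python
# Sketch of pairwise LLM judging: ask which of two responses better satisfies an
# instruction. `complete` is a hypothetical wrapper around any chat-LLM endpoint.
def pairwise_judge(complete, instruction: str, resp_a: str, resp_b: str) -> str:
    prompt = (
        "Compare the two responses to the instruction below and reply with "
        "exactly 'A' or 'B'.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {resp_a}\n\nResponse B: {resp_b}"
    )
    verdict = complete(prompt).strip().upper()
    return verdict if verdict in ("A", "B") else "TIE"  # unparseable output
```

Running the comparison twice with the responses swapped and keeping only consistent verdicts is a common guard against position bias.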
1 code implementation • 16 Apr 2024 • Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, Minjoon Seo
Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs).
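A minimal sketch of what a CoT fine-tuning example might look like; the field names and target template are assumptions for illustration, not the paper's exact format.

```python
# Sketch: turning a (question, rationale, answer) triple into a CoT fine-tuning
# example. Field names and the target template are illustrative assumptions.
def to_cot_example(question: str, rationale: str, answer: str) -> dict:
    return {
        "input": question,
        # The target teaches the model to reason step by step before answering.
        "target": f"{rationale}\nTherefore, the answer is {answer}.",
    }

example = to_cot_example(
    "A train travels 60 km in 1.5 hours. What is its average speed?",
    "Average speed is distance over time: 60 km / 1.5 h = 40 km/h.",
    "40 km/h",
)
```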
no code implementations • 3 Apr 2024 • Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Moohyeon Kim, SeongHwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, Jinyoung Yeo
We also show that pseudocode can guide the reasoning of LMs better than natural language can, even though LMs are trained to follow natural language instructions.
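A sketch of the prompting style in the spirit of the paper — expressing a reasoning problem as pseudocode for the LM to trace; the exact prompt format is an assumption.

```python
# Sketch: framing a reasoning problem as pseudocode for the LM to "execute",
# in the spirit of the paper; the exact prompt format is an assumption.
PSEUDOCODE_PROMPT = """Trace the following pseudocode and report the final value of total.

total = 0
for i in range(1, 6):
    if i % 2 == 1:   # odd i only
        total += i * i

What is total after the loop?"""
# Expected trace: 1*1 + 3*3 + 5*5 = 35.
```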
no code implementations • 18 Feb 2024 • Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman
We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM.
1 code implementation • 18 Feb 2024 • Guijin Son, Sangwon Baek, Sangdae Nam, Ilgyun Jeong, Seungone Kim
Large language models (LLMs) are typically prompted to follow a single instruction per inference call.
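A minimal sketch of the alternative this paper studies — packing several independent instructions into one inference call and parsing the numbered answers back out; the prompt layout is an assumption.

```python
# Sketch: packing several independent instructions into a single inference call
# and parsing the numbered answers back out. The prompt layout is an assumption.
import re

def pack_instructions(instructions: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {inst}" for i, inst in enumerate(instructions))
    return ("Answer every instruction below. "
            "Prefix each answer with its number.\n\n" + numbered)

def unpack_answers(output: str, n: int) -> list[str]:
    parts = re.split(r"^\s*(\d+)\.\s*", output, flags=re.MULTILINE)
    found = {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}
    return [found.get(i + 1, "") for i in range(n)]
```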
1 code implementation • 19 Jan 2024 • Dongkeun Yoon, Joel Jang, Sungdong Kim, Seungone Kim, Sheikh Shafayat, Minjoon Seo
We introduce LangBridge, a zero-shot approach to adapt language models for multilingual reasoning tasks without multilingual supervision.
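A minimal sketch of the bridging idea under stated assumptions: a multilingual encoder's hidden states are projected into an English-centric LM's input embedding space by a small trainable layer. Dimensions and names are illustrative, not the released code.

```python
# Sketch of the bridging idea: project a multilingual encoder's hidden states
# into an English-centric LM's embedding space with a small trainable layer.
# Dimensions and names are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

class Bridge(nn.Module):
    def __init__(self, enc_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, lm_dim)  # the only component trained here

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq, enc_dim) -> (batch, seq, lm_dim), consumed by the
        # frozen LM as soft input embeddings.
        return self.proj(encoder_states)

soft_inputs = Bridge()(torch.randn(2, 16, 1024))  # e.g., mT5-style encoder outputs
```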
1 code implementation • 12 Jan 2024 • Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, Minjoon Seo
Assessing long-form responses generated by Vision-Language Models (VLMs) is challenging.
1 code implementation • 17 Oct 2023 • Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu
In this work, we study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem.
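A minimal sketch of one common way to cast alignment as multi-objective RL — scalarizing several preference-specific rewards into a single signal; the weighting scheme is an illustrative assumption, not the paper's exact method.

```python
# Sketch: scalarizing several preference-specific rewards into one signal, a
# common way to cast alignment as multi-objective RL. The weighting scheme is
# an illustrative assumption, not the paper's exact method.
def scalarize(rewards: dict[str, float], weights: dict[str, float]) -> float:
    # Each key is one preference dimension; weights encode how a given user
    # trades the objectives off against each other.
    return sum(weights[k] * rewards[k] for k in rewards)

rewards = {"helpfulness": 0.8, "brevity": 0.3, "formality": 0.6}
weights = {"helpfulness": 0.5, "brevity": 0.3, "formality": 0.2}
print(scalarize(rewards, weights))  # 0.61
```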
3 code implementations • 12 Oct 2023 • Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo
We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4.
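A sketch of what a single Feedback Collection record plausibly contains, inferred from the abstract's description; the exact field names are assumptions.

```python
# Sketch of what one Feedback Collection record plausibly looks like, inferred
# from the abstract (rubric, instruction, response, feedback, score); the exact
# field names are assumptions.
record = {
    "rubric": "Is the response factually grounded and concise?",
    "instruction": "Summarize the causes of the 2008 financial crisis in two sentences.",
    "response": "...model-generated answer...",
    "feedback": "...GPT-4-written critique referencing the rubric...",
    "score": 4,  # integer on a 1-5 scale
}
```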
1 code implementation • 20 Jul 2023 • Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
Evaluation of Large Language Models (LLMs) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction.
2 code implementations • 23 May 2023 • Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, Minjoon Seo
Furthermore, we show that instruction tuning with CoT Collection allows LMs to possess stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT, which uses demonstrations up to the maximum context length, by a +13.98% margin.
Ranked #1 on BIG-bench (SNARKS)
1 code implementation • 7 Mar 2023 • Seungone Kim, Se June Joo, Yul Jang, Hyungjoo Chae, Jinyoung Yeo
To improve the correctness of the explanations, language models need to be fine-tuned with explanation data.
2 code implementations • 7 Feb 2023 • Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, Minjoon Seo
Recently, Language Models (LMs) instruction-tuned on multiple tasks, also known as multitask-prompted fine-tuning (MT), have shown the capability to generalize to unseen tasks.
Ranked #9 on Question Answering on StoryCloze
1 code implementation • COLING 2022 • Seungone Kim, Se June Joo, Hyungjoo Chae, Chaehyeong Kim, Seung-won Hwang, Jinyoung Yeo
In this paper, we propose to leverage the unique characteristics of dialogues sharing commonsense knowledge across participants, to resolve the difficulties in summarizing them.
Ranked #4 on Text Summarization on DialogSum
1 code implementation • 7 Jul 2022 • Seungone Kim
Abductive reasoning is the task of inferring the most plausible hypothesis given a set of observations.
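A small worked example of the setup (in the style of aNLI): given two observations, a model picks the hypothesis that best explains the gap between them. The data here is illustrative, not drawn from the paper.

```python
# Worked example of the abductive setup (in the style of aNLI): given two
# observations, pick the hypothesis that best explains the gap between them.
# The data here is illustrative, not drawn from the paper.
observation_1 = "Maya left her bike unlocked outside the library."
observation_2 = "She had to walk home that evening."
hypotheses = [
    "Someone took the bike while she was inside.",  # explains O2 given O1
    "Maya read three chapters of her book.",        # consistent, but explains nothing
]
# A model scores each hypothesis conditioned on (O1, O2) and selects the most plausible.
```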