1 code implementation • EACL (AdaptNLP) 2021 • Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, Noah Constant
Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus (Ahmad et al., 2019). This dataset paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets.
no code implementations • 2 Apr 2024 • Frank Palma Gomez, Ramon Sanabria, Yun-Hsuan Sung, Daniel Cer, Siddharth Dalmia, Gustavo Hernandez Abrego
Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages.
no code implementations • 29 Mar 2024 • Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim
On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size.
1 code implementation • 13 Mar 2024 • Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross Mcilroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu-Hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, Kathleen Kenealy
This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models.
2 code implementations • 10 Nov 2023 • Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, Daniel Cer
Dense retrieval models have seen limited success in multilingual retrieval, due to the uneven and scarce training data available across languages.
no code implementations • 10 Oct 2022 • Cicero Nogueira dos Santos, Zhe Dong, Daniel Cer, John Nham, Siamak Shakeri, Jianmo Ni, Yun-Hsuan Sung
The resulting soft knowledge prompts (KPs) are task-independent and serve as an external memory for the LMs.
1 code implementation • 25 May 2022 • Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, Noah Constant
In this paper, we explore the challenging problem of performing a generative task in a target language when labeled data is only available in English, using summarization as a case study.
no code implementations • ACL 2022 • Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, Daniel Cer
Finally, we propose an efficient retrieval approach that interprets task prompts as task embeddings to identify similar tasks and predict the most transferable source tasks for a novel target task.
1 code implementation • EMNLP 2021 • Ziyi Yang, Yinfei Yang, Daniel Cer, Eric Darve
A simple but highly effective method, "Language Information Removal (LIR)", factors out language identity information from semantically related components in multilingual representations pre-trained on monolingual data from multiple languages.
2 code implementations • Findings (ACL) 2022 • Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, Yinfei Yang
To support our investigation, we establish a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark.
1 code implementation • 15 Apr 2021 • Peng-Jian Yang, Ying Ting Chen, Yuechan Chen, Daniel Cer
Numerical reasoning over text (NRoT) presents unique challenges that are not well addressed by existing pre-training objectives.
no code implementations • EMNLP 2021 • Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve
This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large scale unlabeled corpora.
2 code implementations • 7 Nov 2020 • Ning Li, Bethany Keller, Mark Butler, Daniel Cer
We explore using T5 (Raffel et al., 2019) to directly translate natural language questions into SQL statements.
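Text-to-SQL systems like the one described here are commonly scored by execution accuracy: the predicted and gold queries are run against the same database and their result sets compared. A minimal sketch using Python's built-in sqlite3 (the schema, queries, and helper name are invented for illustration, not taken from the paper):

```python
import sqlite3

def execution_match(pred_sql, gold_sql, conn):
    # Execution accuracy: two queries count as equivalent if they return
    # the same rows on the same database (order-insensitive here).
    try:
        pred = sorted(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong
    gold = sorted(conn.execute(gold_sql).fetchall())
    return pred == gold

# Toy in-memory database standing in for a text-to-SQL benchmark table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("Springfield", 30000), ("Shelbyville", 25000)])

# For "Which city has the largest population?", a model might emit either of:
pred = "SELECT name FROM city ORDER BY population DESC LIMIT 1"
gold = "SELECT name FROM city WHERE population = (SELECT MAX(population) FROM city)"
print(execution_match(pred, gold, conn))  # True: both return Springfield
```

Comparing execution results rather than SQL strings credits syntactically different but semantically equivalent queries, which exact-match scoring would miss.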
no code implementations • ACL 2021 • Yinfei Yang, Ning Jin, Kuo Lin, Mandy Guo, Daniel Cer
Computing embeddings for questions and answers independently results in late fusion of the information needed to match questions to their answers.
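The late-fusion setup can be sketched as two independent encodings that meet only at a similarity score. A minimal pure-Python illustration; the `embed` function below is a hypothetical toy stand-in for a learned encoder, not the paper's model:

```python
import math

def cosine(u, v):
    # The similarity score is the only point where question and answer
    # information interact (late fusion).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def embed(text, dim=16):
    # Toy encoder: hash character trigrams into a fixed-size count vector.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    return vec

# Answers are encoded independently of any question, so their embeddings
# can be precomputed and indexed ahead of time.
answers = ["Paris is the capital of France.", "Mitochondria produce ATP."]
answer_vecs = [embed(a) for a in answers]

# At query time, only the question needs encoding before nearest-neighbor search.
question_vec = embed("What is the capital of France?")
best = max(range(len(answers)), key=lambda i: cosine(question_vec, answer_vecs[i]))
```

Early-fusion alternatives (e.g. cross-attention over the concatenated question and answer) can score matches more accurately but forgo this precomputation, which is the trade-off the paper examines.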
6 code implementations • ACL 2022 • Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang
While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019), BERT based cross-lingual sentence embeddings have yet to be explored.
1 code implementation • 5 May 2020 • Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, Noah Constant
Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus (Ahmad et al., 2019). This paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets.
2 code implementations • EACL 2021 • Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, Yinfei Yang
By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning.
1 code implementation • WS 2019 • Amin Ahmad, Noah Constant, Yinfei Yang, Daniel Cer
Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance.
no code implementations • ACL 2020 • Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
We introduce two pre-trained retrieval focused multilingual sentence encoding models, respectively based on the Transformer and CNN model architectures.
no code implementations • WS 2019 • Mandy Guo, Yinfei Yang, Keith Stevens, Daniel Cer, Heming Ge, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
We explore using multilingual document embeddings for nearest neighbor mining of parallel data.
no code implementations • 22 Feb 2019 • Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
On the UN document-level retrieval task, document embeddings achieve around 97% P@1 across all language pairs evaluated.
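The P@1 metric reported here is simply the fraction of queries whose top-ranked candidate is the correct document. A minimal sketch (the rankings and IDs below are invented for illustration):

```python
def precision_at_1(rankings, gold_ids):
    # P@1: fraction of queries whose top-ranked candidate equals the gold ID.
    hits = sum(1 for ranked, gold in zip(rankings, gold_ids) if ranked[0] == gold)
    return hits / len(gold_ids)

# Hypothetical ranked candidate lists for three queries.
rankings = [["doc7", "doc2"], ["doc1", "doc9"], ["doc4", "doc3"]]
gold = ["doc7", "doc9", "doc4"]
print(precision_at_1(rankings, gold))  # 2 of 3 top-1 hits -> 0.666...
```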
no code implementations • EMNLP 2018 • Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, Ray Kurzweil
We present easy-to-use TensorFlow Hub sentence embedding models with good task-transfer performance.
no code implementations • WS 2019 • Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
A significant roadblock in multilingual neural language modeling is the lack of labeled non-English data.
no code implementations • WS 2018 • Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings.
1 code implementation • WS 2018 • Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
We present a novel approach to learn representations for sentence-level semantic similarity using conversational data.
24 code implementations • 29 Mar 2018 • Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil
For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance.
Ranked #2 on Text Classification on TREC-6
Tasks: Conversational Response Selection, Semantic Textual Similarity, +7
no code implementations • SEMEVAL 2017 • Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, Lucia Specia
Semantic Textual Similarity (STS) measures the meaning similarity of sentences.
3 code implementations • 31 Jul 2017 • Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, Lucia Specia
Semantic Textual Similarity (STS) measures the meaning similarity of sentences.
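STS systems are conventionally evaluated by the Pearson correlation between the system's similarity scores and human gold ratings for each sentence pair. A minimal sketch; the ratings below are invented example values, not benchmark data:

```python
import math

def pearson(xs, ys):
    # Pearson r between system similarity scores and human gold ratings.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 0-5 human ratings vs. system scores for four sentence pairs.
human = [5.0, 3.2, 1.0, 0.4]
system = [4.8, 2.9, 1.5, 0.2]
print(round(pearson(human, system), 3))  # close to 1: rankings largely agree
```

Pearson correlation rewards systems whose scores track human judgments linearly, so a system can score well even if its absolute scale differs from the 0-5 annotation scale.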
no code implementations • SEMEVAL 2015 • Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce Wiebe