Search Results for author: Rodrigo Nogueira

Found 68 papers, 42 papers with code

ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language

no code implementations • 16 Jun 2024 • Marcos Piau, Roberto Lotufo, Rodrigo Nogueira

However, the impact of different pretraining settings on downstream tasks remains underexplored.


Measuring Cross-lingual Transfer in Bytes

1 code implementation • 12 Apr 2024 • Leandro Rodrigues de Souza, Thales Sales Almeida, Roberto Lotufo, Rodrigo Nogueira

We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge.

Cross-Lingual Transfer

Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

no code implementations • 10 Apr 2024 • Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira

Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language.

Information Retrieval • Retrieval

Juru: Legal Brazilian Large Language Model from Reputable Sources

no code implementations • 26 Mar 2024 • Roseval Malaquias Junior, Ramon Pires, Roseli Romero, Rodrigo Nogueira

This study contributes to the growing body of scientific evidence showing that pretraining data selection may enhance the performance of large language models, enabling the exploration of these models at a lower cost.

General Knowledge • Language Modelling +1

Sabiá-2: A New Generation of Portuguese Large Language Models

no code implementations • 14 Mar 2024 • Thales Sales Almeida, Hugo Abonizio, Rodrigo Nogueira, Ramon Pires

We introduce Sabiá-2, a family of large language models trained on Portuguese texts.


Lissard: Long and Simple Sequential Reasoning Datasets

no code implementations • 12 Feb 2024 • Mirelle Bueno, Roberto Lotufo, Rodrigo Nogueira

Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens.

ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs

1 code implementation • 9 Feb 2024 • Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, Rodrigo Nogueira

ExaRanker recently introduced an approach to training information retrieval (IR) models, incorporating natural language explanations as additional labels.

Data Augmentation • Information Retrieval +1

InRanker: Distilled Rankers for Zero-shot Information Retrieval

no code implementations • 12 Jan 2024 • Thiago Laitz, Konstantinos Papakostas, Roberto Lotufo, Rodrigo Nogueira

Despite multi-billion parameter neural rankers being common components of state-of-the-art information retrieval pipelines, they are rarely used in production due to the enormous amount of compute required for inference.

Information Retrieval • Language Modelling +2

INACIA: Integrating Large Language Models in Brazilian Audit Courts: Opportunities and Challenges

no code implementations • 10 Jan 2024 • Jayr Pereira, Andre Assumpcao, Julio Trecenti, Luiz Airosa, Caio Lente, Jhonatan Cléto, Guilherme Dobins, Rodrigo Nogueira, Luis Mitchell, Roberto Lotufo

This paper introduces INACIA (Instrução Assistida com Inteligência Artificial), a groundbreaking system designed to integrate Large Language Models (LLMs) into the operational framework of the Brazilian Federal Court of Accounts (TCU).

Decision Making • Fairness

Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams

1 code implementation • 23 Nov 2023 • Ramon Pires, Thales Sales Almeida, Hugo Abonizio, Rodrigo Nogueira

Recent advancements in language models have showcased human-comparable performance in academic entrance exams.

An experiment on an automated literature survey of data-driven speech enhancement methods

no code implementations • 10 Oct 2023 • Arthur dos Santos, Jayr Pereira, Rodrigo Nogueira, Bruno Masiero, Shiva Sander-Tavallaey, Elias Zea

The increasing number of scientific publications in acoustics, in general, presents difficulties in conducting traditional literature surveys.

Speech Enhancement

Predictive Authoring for Brazilian Portuguese Augmentative and Alternative Communication

1 code implementation • 18 Aug 2023 • Jayr Pereira, Rodrigo Nogueira, Cleber Zanchettin, Robson Fidalgo

We tested different approaches to representing a pictogram for prediction: as a word (using pictogram captions), as a concept (using a dictionary definition), and as a set of synonyms (using related terms).


BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams

1 code implementation • 11 Jul 2023 • Thales Sales Almeida, Thiago Laitz, Giovana K. Bonás, Rodrigo Nogueira

One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation.

Natural Language Understanding

InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

1 code implementation • 10 Jul 2023 • Hugo Abonizio, Luiz Bonifacio, Vitor Jeronymo, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Our toolkit not only reproduces the InPars method and partially reproduces Promptagator, but also provides a plug-and-play functionality allowing the use of different LLMs, exploring filtering methods and finetuning various reranker models on the generated data.

Information Retrieval • Retrieval +1

A Personalized Dense Retrieval Framework for Unified Information Access

1 code implementation • 26 Apr 2023 • Hansi Zeng, Surya Kallumadi, Zaid Alibadi, Rodrigo Nogueira, Hamed Zamani

Developing a universal model that can efficiently and effectively respond to a wide range of information access requests -- from retrieval to recommendation to question answering -- has been a long-lasting goal in the information retrieval community.

Information Retrieval • Question Answering +1

Sabiá: Portuguese Large Language Models

no code implementations • 16 Apr 2023 • Ramon Pires, Hugo Abonizio, Thales Sales Almeida, Rodrigo Nogueira

By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture.

Cultural Vocal Bursts Intensity Prediction

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

no code implementations • 3 Apr 2023 • Jimmy Lin, David Alfonso-Hermelo, Vitor Jeronymo, Ehsan Kamalloo, Carlos Lassance, Rodrigo Nogueira, Odunayo Ogundepo, Mehdi Rezagholizadeh, Nandan Thakur, Jheng-Hong Yang, Xinyu Zhang

The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another.

Cross-Lingual Information Retrieval • Retrieval

Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams

1 code implementation • 29 Mar 2023 • Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto Lotufo, Rodrigo Nogueira

The present study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests, represented here by the Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities.


NeuralMind-UNICAMP at 2022 TREC NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval

1 code implementation • 28 Mar 2023 • Vitor Jeronymo, Roberto Lotufo, Rodrigo Nogueira

This paper reports on a study of cross-lingual information retrieval (CLIR) using the mT5-XXL reranker on the NeuCLIR track of TREC 2022.

Cross-Lingual Information Retrieval • Retrieval

ExaRanker: Explanation-Augmented Neural Ranker

1 code implementation • 25 Jan 2023 • Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, Rodrigo Nogueira

Recent work has shown that inducing a large language model (LLM) to generate explanations prior to outputting an answer is an effective strategy to improve performance on a wide range of reasoning tasks.

Language Modelling • Large Language Model +1

InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

1 code implementation • 4 Jan 2023 • Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents.

Information Retrieval • Retrieval

Visconde: Multi-document QA with GPT-3 and Neural Reranking

1 code implementation • 19 Dec 2022 • Jayr Pereira, Robson Fidalgo, Roberto Lotufo, Rodrigo Nogueira

This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents.

Language Modelling • Large Language Model +2

In Defense of Cross-Encoders for Zero-Shot Retrieval

1 code implementation • 12 Dec 2022 • Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization ability of retrieval models.


NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost

no code implementations • 26 Oct 2022 • Thales Sales Almeida, Thiago Laitz, João Seródio, Luiz Henrique Bonifacio, Roberto Lotufo, Rodrigo Nogueira

We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much more cost-effective system with competitive QPS, while achieving close to state-of-the-art results on a wide range of public benchmarks.


mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

no code implementations • 27 Sep 2022 • Vitor Jeronymo, Mauricio Nascimento, Roberto Lotufo, Rodrigo Nogueira

Robust 2004 is an information retrieval benchmark whose large number of judgments per query make it a reliable evaluation dataset.

Information Retrieval • Retrieval

MonoByte: A Pool of Monolingual Byte-level Language Models

1 code implementation • COLING 2022 • Hugo Abonizio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

The zero-shot cross-lingual ability of models pretrained on multilingual and even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result.

Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models

1 code implementation • 24 Aug 2022 • Mirelle Bueno, Carlos Gemmell, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira

Our experimental results show that generating step-by-step rationales and introducing marker tokens are both required for effective extrapolation.

Language Modelling

Billions of Parameters Are Worth More Than In-domain Training Data: A case study in the Legal Case Entailment Task

1 code implementation • 30 May 2022 • Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Roberto Lotufo, Rodrigo Nogueira

Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios.

Language Modelling

InPars: Data Augmentation for Information Retrieval using Large Language Models

1 code implementation • 10 Feb 2022 • Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Rodrigo Nogueira

In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks.

Data Augmentation • Diversity +3

To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment

1 code implementation • 7 Feb 2022 • Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto de Alencar Lotufo, Rodrigo Nogueira

For that, we participated in the legal case entailment task of COLIEE 2021, in which we use such models with no adaptations to the target domain.

Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

1 code implementation • 14 Jan 2022 • Ramon Pires, Fábio C. de Souza, Guilherme Rosa, Roberto A. Lotufo, Rodrigo Nogueira

A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts.

Open Information Extraction • Question Answering

mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset

1 code implementation • 31 Aug 2021 • Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translation.

Information Retrieval • Machine Translation +4

A cost-benefit analysis of cross-lingual transfer methods

2 code implementations • 14 May 2021 • Guilherme Moraes Rosa, Luiz Henrique Bonifacio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

An effective method for cross-lingual transfer is to fine-tune a bilingual or multilingual model on a supervised dataset in one language and evaluate it on another language in a zero-shot manner.

Cross-Lingual Transfer • Translation

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

1 code implementation • 25 Feb 2021 • Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin

In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values.

The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models

3 code implementations • 14 Jan 2021 • Ronak Pradeep, Rodrigo Nogueira, Jimmy Lin

We propose a design pattern for tackling text ranking problems, dubbed "Expando-Mono-Duo", that has been empirically validated for a number of ad hoc retrieval tasks in different domains.

Document Ranking • Retrieval
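The three-stage pattern described above lends itself to a small illustration. The sketch below is a toy rendition, not the paper's T5-based implementation: `expando`, `mono_score`, and `duo_prefer` are hypothetical stand-ins for document expansion, the pointwise reranker, and the pairwise reranker, respectively.

```python
# Toy sketch of the "Expando-Mono-Duo" pattern: expand documents,
# score them pointwise (Mono), then re-rank the survivors by
# pairwise comparisons (Duo). All scorers here are stand-ins.

def expando(doc: str) -> str:
    """Stand-in for doc2query-style expansion (here: a no-op repeat)."""
    return doc + " " + doc

def mono_score(query: str, doc: str) -> float:
    """Pointwise relevance: toy term-overlap score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def duo_prefer(query: str, doc_a: str, doc_b: str) -> bool:
    """Pairwise preference: True if doc_a looks at least as relevant."""
    return mono_score(query, doc_a) >= mono_score(query, doc_b)

def rank(query: str, docs: list[str], mono_k: int = 3) -> list[str]:
    expanded = [expando(d) for d in docs]
    # Stage 1 (Mono): keep the top-k documents by pointwise score.
    order = sorted(range(len(docs)),
                   key=lambda i: mono_score(query, expanded[i]),
                   reverse=True)[:mono_k]
    # Stage 2 (Duo): order the survivors by number of pairwise wins.
    wins = {i: sum(duo_prefer(query, expanded[i], expanded[j])
                   for j in order if j != i) for i in order}
    return [docs[i] for i in sorted(order, key=lambda i: -wins[i])]
```

The design point the sketch preserves is the cost/accuracy cascade: the cheap pointwise stage prunes the candidate pool so the quadratic pairwise stage only sees a handful of documents.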

Designing Templates for Eliciting Commonsense Knowledge from Pretrained Sequence-to-Sequence Models

no code implementations • COLING 2020 • Jheng-Hong Yang, Sheng-Chieh Lin, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

While internalized "implicit knowledge" in pretrained transformers has led to fruitful progress in many natural language understanding tasks, how to most effectively elicit such knowledge remains an open question.

Multiple-choice • Natural Language Understanding +1

Scientific Claim Verification with VERT5ERINI

no code implementations • EACL (Louhi) 2021 • Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin

This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain.

Claim Verification • Retrieval +1

Pretrained Transformers for Text Ranking: BERT and Beyond

1 code implementation • NAACL 2021 • Jimmy Lin, Rodrigo Nogueira, Andrew Yates

There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size).

Information Retrieval • Retrieval +1

PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data

3 code implementations • 20 Aug 2020 • Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo

In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in the state-of-the-art research is in other languages.

Lite Training Strategies for Portuguese-English and English-Portuguese Translation

1 code implementation • WMT (EMNLP) 2020 • Alexandre Lopes, Rodrigo Nogueira, Roberto Lotufo, Helio Pedrini

Despite the widespread adoption of deep learning for machine translation, it is still expensive to develop high-quality translation models.

Machine Translation • Translation

Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset

1 code implementation • EMNLP (sdp) 2020 • Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, Jimmy Lin

We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI.

Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset

no code implementations • ACL 2020 • Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, Jimmy Lin

The Neural Covidex is a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset (CORD-19) curated by the Allen Institute for AI.

Decision Making

Rapidly Bootstrapping a Question Answering Dataset for COVID-19

1 code implementation • 23 Apr 2020 • Raphael Tang, Rodrigo Nogueira, Edwin Zhang, Nikhil Gupta, Phuong Cam, Kyunghyun Cho, Jimmy Lin

We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.

Question Answering

Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned

1 code implementation • 10 Apr 2020 • Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, Jimmy Lin

We present the Neural Covidex, a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI.

Decision Making

Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models

no code implementations • 4 Apr 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs).

Task-Oriented Dialogue Systems

TTTTTackling WinoGrande Schemas

no code implementations • 18 Mar 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the "entailment" token as a score of the hypothesis.

Coreference Resolution
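As a concrete (and heavily simplified) illustration of the decomposition described above: each example is turned into one candidate sentence per answer option, every candidate is scored, and the option whose sentence scores higher wins. `entailment_prob` below is a hypothetical stub; the paper instead uses the probability a T5 model assigns to the "entailment" token.

```python
# Toy sketch of scoring a WinoGrande-style example by decomposing it
# into one hypothesis per answer option. The scorer is a stand-in.

def entailment_prob(sentence: str) -> float:
    # Stub heuristic for illustration only; a real system would query
    # a pretrained sequence-to-sequence model here.
    return 1.0 / (1.0 + len(sentence))

def answer(example: str, options: tuple[str, str]) -> str:
    # Decompose: substitute each option for the blank ("_").
    hypotheses = [example.replace("_", opt) for opt in options]
    # Score each hypothesis and return the higher-scoring option.
    scores = [entailment_prob(h) for h in hypotheses]
    return options[scores.index(max(scores))]
```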

Electricity Theft Detection with self-attention

1 code implementation • 14 Feb 2020 • Paulo Finardi, Israel Campiotti, Gustavo Plensack, Rafael Derradi de Souza, Rodrigo Nogueira, Gustavo Pinheiro, Roberto Lotufo

In this work we propose a novel self-attention mechanism model to address electricity theft detection on an imbalanced realistic dataset that presents a daily electricity consumption provided by State Grid Corporation of China.


Navigation-Based Candidate Expansion and Pretrained Language Models for Citation Recommendation

no code implementations • 23 Jan 2020 • Rodrigo Nogueira, Zhiying Jiang, Kyunghyun Cho, Jimmy Lin

Citation recommendation systems for the scientific literature, to help authors find papers that should be cited, have the potential to speed up discoveries and uncover new routes for scientific exploration.

Citation Recommendation • Domain Adaptation +3

Meta Answering for Machine Reading

no code implementations • 11 Nov 2019 • Benjamin Borschinger, Jordan Boyd-Graber, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Michelle Chen Huebscher, Wojciech Gajewski, Yannic Kilcher, Rodrigo Nogueira, Lierni Sestorain Saralegu

We investigate a framework for machine reading, inspired by real world information-seeking problems, where a meta question answering system interacts with a black box environment.

Natural Questions • Question Answering +1

Multi-Stage Document Ranking with BERT

3 code implementations • 31 Oct 2019 • Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, Jimmy Lin

The advent of deep neural networks pre-trained via language modeling tasks has spurred a number of successful applications in natural language processing.

Document Ranking • Language Modelling

Portuguese Named Entity Recognition using BERT-CRF

1 code implementation • 23 Sep 2019 • Fábio Souza, Rodrigo Nogueira, Roberto Lotufo

Recent advances in language representation using neural networks have made it viable to transfer the learned internal states of a trained model to downstream natural language processing tasks, such as named entity recognition (NER) and question answering.

named-entity-recognition • Named Entity Recognition +2

Learning Representations and Agents for Information Retrieval

no code implementations • 16 Aug 2019 • Rodrigo Nogueira

We argue, however, that although this approach has been very successful for tasks such as machine translation, storing the world's knowledge as parameters of a learning machine can be very hard.

Information Retrieval • Machine Translation +2

Document Expansion by Query Prediction

5 code implementations • 17 Apr 2019 • Rodrigo Nogueira, Wei Yang, Jimmy Lin, Kyunghyun Cho

One technique to improve the retrieval effectiveness of a search engine is to expand documents with terms that are related or representative of the documents' content. From the perspective of a question answering system, this might comprise questions the document can potentially answer.

Passage Re-Ranking • Question Answering +2
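The expansion idea described above can be sketched in a few lines. This is a toy rendition under stated assumptions: `predict_queries` is a hypothetical stub standing in for the trained sequence-to-sequence model, and retrieval is plain term overlap rather than a production index.

```python
# Minimal sketch of document expansion by query prediction: append
# (predicted) queries to each document before indexing, so that a
# simple term-matching retriever can match question-style queries.

def predict_queries(doc: str) -> list[str]:
    # Stub: turn each sentence into a naive "what ..." question.
    # A real system would generate these with a trained model.
    return ["what " + s.strip().lower() for s in doc.split(".") if s.strip()]

def expand(doc: str) -> str:
    """Append the predicted queries to the document text."""
    return doc + " " + " ".join(predict_queries(doc))

def search(query: str, docs: list[str]) -> list[int]:
    """Rank document indices by simple term overlap with the query."""
    q = set(query.lower().split())
    scores = [len(q & set(d.lower().split())) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])
```

The point of the sketch is that expansion happens entirely at indexing time: the retriever itself stays a cheap bag-of-words matcher, yet question-style queries now overlap with the appended text.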

Multi-agent query reformulation: Challenges and the role of diversity

no code implementations • ICLR Workshop drlStructPred 2019 • Rodrigo Nogueira, Jannis Bulian, Massimiliano Ciaramita

We investigate methods to efficiently learn diverse strategies in reinforcement learning for a generative structured prediction problem: query reformulation.

Diversity • Question Answering +4

Passage Re-ranking with BERT

6 code implementations • 13 Jan 2019 • Rodrigo Nogueira, Kyunghyun Cho

Recently, neural models pretrained on a language modeling task, such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018), have achieved impressive results on various natural language processing tasks such as question-answering and natural language inference.

Ranked #3 on Passage Re-Ranking on MS MARCO (using extra training data)

Passage Re-Ranking • Passage Retrieval +2

Learning to Coordinate Multiple Reinforcement Learning Agents for Diverse Query Reformulation

no code implementations • ICLR 2019 • Rodrigo Nogueira, Jannis Bulian, Massimiliano Ciaramita

We propose a method to efficiently learn diverse strategies in reinforcement learning for query reformulation in the tasks of document retrieval and question answering.

Diversity • Question Answering +3

Task-Oriented Query Reformulation with Reinforcement Learning

2 code implementations • EMNLP 2017 • Rodrigo Nogueira, Kyunghyun Cho

In this work, we introduce a query reformulation system based on a neural network that rewrites a query to maximize the number of relevant documents returned.

reinforcement-learning • Reinforcement Learning (RL)

End-to-End Goal-Driven Web Navigation

1 code implementation • NeurIPS 2016 • Rodrigo Nogueira, Kyunghyun Cho

We propose a goal-driven web navigation as a benchmark task for evaluating an agent with abilities to understand natural language and plan on partially observed environments.

Decision Making • Question Answering
