Paper List
Returns a paginated listing of all papers. Results can be filtered with the q query parameter (a free-text search) and paged with the page query parameter.
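For reference, the snippet below is a minimal sketch of calling this endpoint from Python with the third-party requests library; it issues the same call as the example request that follows. The helper name list_papers, the timeout value, and the printed fields are illustrative choices, not part of the API.

```python
import requests

BASE_URL = "https://paperswithcode.com/api/v1"

def list_papers(query, page=1):
    """Fetch one page of the paginated paper listing."""
    response = requests.get(
        f"{BASE_URL}/papers/",
        params={"q": query, "page": page},  # encoded as ?q=...&page=...
        timeout=30,
    )
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.json()

# Equivalent to: GET /api/v1/papers/?page=2&q=Large+Language+Models
page_2 = list_papers("Large Language Models", page=2)
for paper in page_2["results"]:
    print(paper["id"], paper["title"])
```

Example request: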
GET /api/v1/papers/?page=2&q=Large+Language+Models
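The response uses the paginated envelope shown below, where next and previous hold absolute URLs of the adjacent result pages (or null when there is no such page) and results holds the paper records. A second sketch, under the same assumptions as above, walks every page for a query by following the next links; the generator name iter_papers is illustrative.

```python
import requests

BASE_URL = "https://paperswithcode.com/api/v1"

def iter_papers(query):
    """Yield every paper matching `query`, following the paginated `next` links."""
    url = f"{BASE_URL}/papers/"
    params = {"q": query}
    while url:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        yield from data["results"]
        url = data.get("next")  # absolute URL of the next page; None on the last page
        params = None           # the `next` URL already carries the query string

for paper in iter_papers("Large Language Models"):
    print(paper["published"], paper["title"])
```

Each entry in results includes identifiers (id, arxiv_id, nips_id), links (url_abs, url_pdf), the title, abstract, authors, published date, and conference/proceeding references.

Example response (excerpt):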
https://paperswithcode.com/api/v1/papers/?page=3&q=Large+Language+Models", "previous": "https://paperswithcode.com/api/v1/papers/?q=Large+Language+Models", "results": [ { "id": "addressing-documentation-debt-in-machine", "arxiv_id": "2105.05241", "nips_id": null, "url_abs": "https://arxiv.org/abs/2105.05241v1", "url_pdf": "https://arxiv.org/pdf/2105.05241v1.pdf", "title": "Addressing \"Documentation Debt\" in Machine Learning Research: A Retrospective Datasheet for BookCorpus", "abstract": "Recent literature has underscored the importance of dataset documentation work for machine learning, and part of this work involves addressing \"documentation debt\" for datasets that have been used widely but documented sparsely. This paper aims to help address documentation debt for BookCorpus, a popular text dataset for training large language models. Notably, researchers have used BookCorpus to train OpenAI's GPT-N models and Google's BERT models, even though little to no documentation exists about the dataset's motivation, composition, collection process, etc. We offer a preliminary datasheet that provides key context and information about BookCorpus, highlighting several notable deficiencies. In particular, we find evidence that (1) BookCorpus likely violates copyright restrictions for many books, (2) BookCorpus contains thousands of duplicated books, and (3) BookCorpus exhibits significant skews in genre representation. We also find hints of other potential deficiencies that call for future research, including problematic content, potential skews in religious representation, and lopsided author contributions. While more work remains, this initial effort to provide a datasheet for BookCorpus adds to growing literature that urges more careful and systematic documentation for machine learning datasets.", "authors": [ "Nicholas Vincent", "Jack Bandy" ], "published": "2021-05-11", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "coconet-co-optimizing-computation-and", "arxiv_id": "2105.05720", "nips_id": null, "url_abs": "https://arxiv.org/abs/2105.05720v5", "url_pdf": "https://arxiv.org/pdf/2105.05720v5.pdf", "title": "Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads", "abstract": "Recent trend towards increasing large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain best performance. However, current logical separation between computation and communication kernels in deep learning frameworks misses the optimization opportunities across such barrier. Breaking this abstraction with a holistic consideration can provide many optimizations to provide performance improvements in distributed workloads. Manually applying these optimizations needs modifications in underlying computation and communication libraries for each scenario, which is time consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program with both computation and communication. CoCoNeT contains several machine learning aware transformations to optimize a program and a compiler to generate high performance kernels. Providing both computation and communication as first class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. 
CoCoNeT enables us to optimize data-, model-and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.", "authors": [ "Olli Sarikivi", "Todd Mytkowicz", "Madanlal Musuvathi", "Youshan Miao", "Saeed Maleki", "Amir Hossein Nodehi Sabet", "Guodong Liu", "Jun Huang", "Abhinav Jangda" ], "published": "2021-05-12", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "measuring-coding-challenge-competence-with", "arxiv_id": "2105.09938", "nips_id": null, "url_abs": "https://arxiv.org/abs/2105.09938v3", "url_pdf": "https://arxiv.org/pdf/2105.09938v3.pdf", "title": "Measuring Coding Challenge Competence With APPS", "abstract": "While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.", "authors": [ "Jacob Steinhardt", "Dawn Song", "Horace He", "Samir Puranik", "Collin Burns", "Ethan Guo", "Akul Arora", "Mantas Mazeika", "Saurav Kadavath", "Steven Basart", "Dan Hendrycks" ], "published": "2021-05-20", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "parameter-efficient-neural-question-answering", "arxiv_id": "2106.00851", "nips_id": null, "url_abs": "https://arxiv.org/abs/2106.00851v1", "url_pdf": "https://arxiv.org/pdf/2106.00851v1.pdf", "title": "Parameter-Efficient Neural Question Answering Models via Graph-Enriched Document Representations", "abstract": "As the computational footprint of modern NLP systems grows, it becomes increasingly important to arrive at more efficient models. We show that by employing graph convolutional document representation, we can arrive at a question answering system that performs comparably to, and in some cases exceeds the SOTA solutions, while using less than 5\\% of their resources in terms of trainable parameters. As it currently stands, a major issue in applying GCNs to NLP is document representation. In this paper, we show that a GCN enriched document representation greatly improves the results seen in HotPotQA, even when using a trivial topology. 
Our model (gQA), performs admirably when compared to the current SOTA, and requires little to no preprocessing. In Shao et al. 2020, the authors suggest that graph networks are not necessary for good performance in multi-hop QA. In this paper, we suggest that large language models are not necessary for good performance by showing a na\\\"{i}ve implementation of a GCN performs comparably to SoTA models based on pretrained language models.", "authors": [ "Won Young Shin", "Stephen Fitz", "Louis Castricato" ], "published": "2021-06-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "layered-gradient-accumulation-and-modular", "arxiv_id": "2106.02679", "nips_id": null, "url_abs": "https://arxiv.org/abs/2106.02679v1", "url_pdf": "https://arxiv.org/pdf/2106.02679v1.pdf", "title": "Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models", "abstract": "The advent of the transformer has sparked a quick growth in the size of language models, far outpacing hardware improvements. (Dense) transformers are expected to reach the trillion-parameter scale in the near future, for which training requires thousands or even tens of thousands of GPUs. We investigate the challenges of training at this scale and beyond on commercially available hardware. In particular, we analyse the shortest possible training time for different configurations of distributed training, leveraging empirical scaling laws for language models to estimate the optimal (critical) batch size. Contrary to popular belief, we find no evidence for a memory wall, and instead argue that the real limitation -- other than the cost -- lies in the training duration. In addition to this analysis, we introduce two new methods, \\textit{layered gradient accumulation} and \\textit{modular pipeline parallelism}, which together cut the shortest training time by half. The methods also reduce data movement, lowering the network requirement to a point where a fast InfiniBand connection is not necessary. This increased network efficiency also improve on the methods introduced with the ZeRO optimizer, reducing the memory usage to a tiny fraction of the available GPU memory.", "authors": [ "Joel Lamy-Poirier" ], "published": "2021-06-04", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "when-does-text-prediction-benefit-from", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.naacl-industry.1", "url_pdf": "https://aclanthology.org/2021.naacl-industry.1.pdf", "title": "When does text prediction benefit from additional context? An exploration of contextual signals for chat and email messages", "abstract": "Email and chat communication tools are increasingly important for completing daily tasks. Accurate real-time phrase completion can save time and bolster productivity. Modern text prediction algorithms are based on large language models which typically rely on the prior words in a message to predict a completion. We examine how additional contextual signals (from previous messages, time, and subject) affect the performance of a commercial text prediction model. We compare contextual text prediction in chat and email messages from two of the largest commercial platforms Microsoft Teams and Outlook, finding that contextual signals contribute to performance differently between these scenarios. 
On emails, time context is most beneficial with small relative gains of 2{\\%} over baseline. Whereas, in chat scenarios, using a tailored set of previous messages as context yields relative improvements over the baseline between 9.3{\\%} and 18.6{\\%} across various critical service-oriented text prediction metrics.", "authors": [ "Chris Quirk", "Milad Shokouhi", "Vipul Agarwal", "Kunho Kim", "Chad Atalla", "Stojan Trajanovski" ], "published": "2021-06-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-2021-4" }, { "id": "story-centaur-large-language-model-few-shot", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.eacl-demos.29", "url_pdf": "https://aclanthology.org/2021.eacl-demos.29.pdf", "title": "Story Centaur: Large Language Model Few Shot Learning as a Creative Writing Tool", "abstract": "Few shot learning with large language models has the potential to give individuals without formal machine learning training the access to a wide range of text to text models. We consider how this applies to creative writers and present Story Centaur, a user interface for prototyping few shot models and a set of recombinable web components that deploy them. Story Centaur{'}s goal is to expose creative writers to few shot learning with a simple but powerful interface that lets them compose their own co-creation tools that further their own unique artistic directions. We build out several examples of such tools, and in the process probe the boundaries and issues surrounding generation with large language models.", "authors": [ "Monica Dinalescu", "Sherol Chen", "Ben Pietrzak", "Kory Mathewson", "Ben Swanson" ], "published": "2021-04-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "eacl-2021-2" }, { "id": "direction-is-what-you-need-improving-word", "arxiv_id": "2106.08181", "nips_id": null, "url_abs": "https://arxiv.org/abs/2106.08181v2", "url_pdf": "https://arxiv.org/pdf/2106.08181v2.pdf", "title": "Direction is what you need: Improving Word Embedding Compression in Large Language Models", "abstract": "The adoption of Transformer-based models in natural language processing (NLP) has led to great success using a massive number of parameters. However, due to deployment constraints in edge devices, there has been a rising interest in the compression of these models to improve their inference time and memory footprint. This paper presents a novel loss objective to compress token embeddings in the Transformer-based models by leveraging an AutoEncoder architecture. More specifically, we emphasize the importance of the direction of compressed embeddings with respect to original uncompressed embeddings. The proposed method is task-agnostic and does not require further language modeling pre-training. Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model Perplexity. Moreover, we evaluate our proposed approach over SQuAD v1.1 dataset and several downstream tasks from the GLUE benchmark, where we also outperform the baseline in most scenarios. 
Our code is public.", "authors": [ "Karl Aberer", "Jacek Tabor", "Rémi Lebret", "Mohammadreza Banaei", "Klaudia Bałazy" ], "published": "2021-06-15", "conference": null, "conference_url_abs": "https://aclanthology.org/2021.repl4nlp-1.32", "conference_url_pdf": "https://aclanthology.org/2021.repl4nlp-1.32.pdf", "proceeding": "acl-repl4nlp-2021-8" }, { "id": "an-enriched-category-theory-of-language-from", "arxiv_id": "2106.07890", "nips_id": null, "url_abs": "https://arxiv.org/abs/2106.07890v2", "url_pdf": "https://arxiv.org/pdf/2106.07890v2.pdf", "title": "An enriched category theory of language: from syntax to semantics", "abstract": "State of the art language models return a natural language text continuation from any piece of input text. This ability to generate coherent text extensions implies significant sophistication, including a knowledge of grammar and semantics. In this paper, we propose a mathematical framework for passing from probability distributions on extensions of given texts, such as the ones learned by today's large language models, to an enriched category containing semantic information. Roughly speaking, we model probability distributions on texts as a category enriched over the unit interval. Objects of this category are expressions in language, and hom objects are conditional probabilities that one expression is an extension of another. This category is syntactical -- it describes what goes with what. Then, via the Yoneda embedding, we pass to the enriched category of unit interval-valued copresheaves on this syntactical category. This category of enriched copresheaves is semantic -- it is where we find meaning, logical operations such as entailment, and the building blocks for more elaborate semantic concepts.", "authors": [ "Yiannis Vlassopoulos", "John Terilla", "Tai-Danae Bradley" ], "published": "2021-06-15", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "lora-low-rank-adaptation-of-large-language", "arxiv_id": "2106.09685", "nips_id": null, "url_abs": "https://arxiv.org/abs/2106.09685v2", "url_pdf": "https://arxiv.org/pdf/2106.09685v2.pdf", "title": "LoRA: Low-Rank Adaptation of Large Language Models", "abstract": "An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. 
We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.", "authors": [ "Lu Wang", "Weizhu Chen", "Shean Wang", "Yuanzhi Li", "Zeyuan Allen-Zhu", "Phillip Wallis", "Yelong Shen", "Edward J. Hu" ], "published": "2021-06-17", "conference": "lora-low-rank-adaptation-of-large-language-1", "conference_url_abs": "https://openreview.net/forum?id=nZeVKeeFYf9", "conference_url_pdf": "https://openreview.net/pdf?id=nZeVKeeFYf9", "proceeding": "iclr-2022-4" }, { "id": "packing-towards-2x-nlp-bert-acceleration", "arxiv_id": "2107.02027", "nips_id": null, "url_abs": "https://arxiv.org/abs/2107.02027v2", "url_pdf": "https://arxiv.org/pdf/2107.02027v2.pdf", "title": "Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance", "abstract": "Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-cola with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid cross-contamination in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context of the well-studied bin packing problem, and presents new algorithms based on this formulation which, for example, confer a 2x speedup for phase 2 pre-training in BERT. We show how existing models can be adapted to ensure mathematical equivalence between the original and packed models, meaning that packed models can be trained with existing pre-training and fine-tuning practices.", "authors": [ "Andrew Fitzgibbon", "Sergio P. Perez", "Mario Michael Krell", "Matej Kosec" ], "published": "2021-06-29", "conference": null, "conference_url_abs": "https://openreview.net/forum?id=3_MUAtqR0aA", "conference_url_pdf": "https://openreview.net/pdf?id=3_MUAtqR0aA", "proceeding": "neurips-2021-12" }, { "id": "evaluating-large-language-models-trained-on", "arxiv_id": "2107.03374", "nips_id": null, "url_abs": "https://arxiv.org/abs/2107.03374v2", "url_pdf": "https://arxiv.org/pdf/2107.03374v2.pdf", "title": "Evaluating Large Language Models Trained on Code", "abstract": "We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. 
Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.", "authors": [ "Matthew Knight", "Alec Radford", "Andrew N. Carr", "Christopher Hesse", "William Saunders", "Jie Tang", "Nikolas Tezak", "Alex Paino", "William Hebgen Guss", "Felipe Petroski Such", "Yuri Burda", "Henrique Ponde de Oliveira Pinto", "Wojciech Zaremba", "Ilya Sutskever", "Sam McCandlish", "Dario Amodei", "Bob McGrew", "Peter Welinder", "Katie Mayer", "Mira Murati", "Miles Brundage", "Evan Morikawa", "Vedant Misra", "Josh Achiam", "Jan Leike", "Shantanu Jain", "Suchir Balaji", "Igor Babuschkin", "Alex Nichol", "Ariel Herbert-Voss", "Elizabeth Barnes", "Fotios Chantzis", "Matthias Plappert", "Dave Cummings", "Philippe Tillet", "Clemens Winter", "Mohammad Bavarian", "Lukasz Kaiser", "Alethea Power", "Mikhail Pavlov", "Nick Ryder", "Scott Gray", "Brooke Chan", "Pamela Mishkin", "Girish Sastry", "Heidy Khlaaf", "Michael Petrov", "Gretchen Krueger", "Raul Puri", "Alex Ray", "Greg Brockman", "Nicholas Joseph", "Harri Edwards", "Jared Kaplan", "Qiming Yuan", "Heewoo Jun", "Jerry Tworek", "Mark Chen" ], "published": "2021-07-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "similar-cases-recommendation-using-legal", "arxiv_id": "2107.04771", "nips_id": null, "url_abs": "https://arxiv.org/abs/2107.04771v2", "url_pdf": "https://arxiv.org/pdf/2107.04771v2.pdf", "title": "Similar Cases Recommendation using Legal Knowledge Graphs", "abstract": "A legal knowledge graph constructed from court cases, judgments, laws and other legal documents can enable a number of applications like question answering, document similarity, and search. While the use of knowledge graphs for distant supervision in NLP tasks is well researched, using knowledge graphs for applications like case similarity presents challenges. In this work, we describe our solution for predicting similar cases in Indian court judgements. We present our results and also discuss the impact of large language models on this task.", "authors": [ "Vasudha Bhatnagar", "Parikshet Sirohi", "Balaji Ganesan", "Ruchika Bhatt", "Jaspreet Singh Dhani" ], "published": "2021-07-10", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "internet-augmented-dialogue-generation", "arxiv_id": "2107.07566", "nips_id": null, "url_abs": "https://arxiv.org/abs/2107.07566v1", "url_pdf": "https://arxiv.org/pdf/2107.07566v1.pdf", "title": "Internet-Augmented Dialogue Generation", "abstract": "The largest store of continually updating knowledge on our planet can be accessed via internet search. In this work we study giving access to this information to conversational agents. Large language models, even though they store an impressive amount of knowledge within their weights, are known to hallucinate facts when generating dialogue (Shuster et al., 2021); moreover, those facts are frozen in time at the point of model training. In contrast, we propose an approach that learns to generate an internet search query based on the context, and then conditions on the search results to finally generate a response, a method that can employ up-to-the-minute relevant information. 
We train and evaluate such models on a newly collected dataset of human-human conversations whereby one of the speakers is given access to internet search during knowledgedriven discussions in order to ground their responses. We find that search-query based access of the internet in conversation provides superior performance compared to existing approaches that either use no augmentation or FAISS-based retrieval (Lewis et al., 2020).", "authors": [ "Jason Weston", "Kurt Shuster", "Mojtaba Komeili" ], "published": "2021-07-15", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.acl-long.579", "conference_url_pdf": "https://aclanthology.org/2022.acl-long.579.pdf", "proceeding": "acl-2022-5" }, { "id": "hybrid-autoregressive-solver-for-scalable", "arxiv_id": "2107.11879", "nips_id": null, "url_abs": "https://arxiv.org/abs/2107.11879v2", "url_pdf": "https://arxiv.org/pdf/2107.11879v2.pdf", "title": "Hybrid Autoregressive Inference for Scalable Multi-hop Explanation Regeneration", "abstract": "Regenerating natural language explanations in the scientific domain has been proposed as a benchmark to evaluate complex multi-hop and explainable inference. In this context, large language models can achieve state-of-the-art performance when employed as cross-encoder architectures and fine-tuned on human-annotated explanations. However, while much attention has been devoted to the quality of the explanations, the problem of performing inference efficiently is largely under-studied. Cross-encoders, in fact, are intrinsically not scalable, possessing limited applicability to real-world scenarios that require inference on massive facts banks. To enable complex multi-hop reasoning at scale, this paper focuses on bi-encoder architectures, investigating the problem of scientific explanation regeneration at the intersection of dense and sparse models. Specifically, we present SCAR (for Scalable Autoregressive Inference), a hybrid framework that iteratively combines a Transformer-based bi-encoder with a sparse model of explanatory power, designed to leverage explicit inference patterns in the explanations. Our experiments demonstrate that the hybrid framework significantly outperforms previous sparse models, achieving performance comparable with that of state-of-the-art cross-encoders while being approx 50 times faster and scalable to corpora of millions of facts. Further analyses on semantic drift and multi-hop question answering reveal that the proposed hybridisation boosts the quality of the most challenging explanations, contributing to improved performance on downstream inference tasks.", "authors": [ "André Freitas", "Deborah Ferreira", "Mokanarangan Thayaparan", "Marco Valentino" ], "published": "2021-07-25", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "robust-transfer-learning-with-pretrained", "arxiv_id": "2108.02340", "nips_id": null, "url_abs": "https://arxiv.org/abs/2108.02340v1", "url_pdf": "https://arxiv.org/pdf/2108.02340v1.pdf", "title": "Robust Transfer Learning with Pretrained Language Models through Adapters", "abstract": "Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific pretraining is often not robust. 
In particular, the performance considerably varies as the random seed changes or the number of pretraining and/or fine-tuning iterations varies, and the fine-tuned model is vulnerable to adversarial attack. We propose a simple yet effective adapter-based approach to mitigate these issues. Specifically, we insert small bottleneck layers (i.e., adapter) within each layer of a pretrained model, then fix the pretrained layers and train the adapter layers on the downstream task data, with (1) task-specific unsupervised pretraining and then (2) task-specific supervised training (e.g., classification, sequence labeling). Our experiments demonstrate that such a training scheme leads to improved stability and adversarial robustness in transfer learning to various downstream tasks.", "authors": [ "YingNian Wu", "Bo Pang", "Wenjuan Han" ], "published": "2021-08-05", "conference": null, "conference_url_abs": "https://aclanthology.org/2021.acl-short.108", "conference_url_pdf": "https://aclanthology.org/2021.acl-short.108.pdf", "proceeding": "acl-2021-5" }, { "id": "towards-structured-dynamic-sparse-pre", "arxiv_id": "2108.06277", "nips_id": null, "url_abs": "https://arxiv.org/abs/2108.06277v1", "url_pdf": "https://arxiv.org/pdf/2108.06277v1.pdf", "title": "Towards Structured Dynamic Sparse Pre-Training of BERT", "abstract": "Identifying algorithms for computational efficient unsupervised training of large language models is an important and active area of research. In this work, we develop and study a straightforward, dynamic always-sparse pre-training approach for BERT language modeling task, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation. This approach enables us to achieve Pareto improvements in terms of the number of floating-point operations (FLOPs) over statically sparse and dense models across a broad spectrum of network sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.", "authors": [ "Carlo Luschi", "Daniel Justus", "Ivan Chelombiev", "Douglas Orr", "Frithjof Gressmann", "Anastasia Dietrich" ], "published": "2021-08-13", "conference": "towards-structured-dynamic-sparse-pre-1", "conference_url_abs": "https://openreview.net/forum?id=-e7awdzWsOc", "conference_url_pdf": "https://openreview.net/pdf?id=-e7awdzWsOc", "proceeding": null }, { "id": "program-synthesis-with-large-language-models", "arxiv_id": "2108.07732", "nips_id": null, "url_abs": "https://arxiv.org/abs/2108.07732v1", "url_pdf": "https://arxiv.org/pdf/2108.07732v1.pdf", "title": "Program Synthesis with Large Language Models", "abstract": "This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. 
On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.", "authors": [ "Charles Sutton", "Quoc Le", "Michael Terry", "Carrie Cai", "Ellen Jiang", "David Dohan", "Henryk Michalewski", "Maarten Bosma", "Maxwell Nye", "Augustus Odena", "Jacob Austin" ], "published": "2021-08-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "viola-a-topic-agnostic-generate-and-rank", "arxiv_id": "2108.11063", "nips_id": null, "url_abs": "https://arxiv.org/abs/2108.11063v1", "url_pdf": "https://arxiv.org/pdf/2108.11063v1.pdf", "title": "Viola: A Topic Agnostic Generate-and-Rank Dialogue System", "abstract": "We present Viola, an open-domain dialogue system for spoken conversation that uses a topic-agnostic dialogue manager based on a simple generate-and-rank approach. Leveraging recent advances of generative dialogue systems powered by large language models, Viola fetches a batch of response candidates from various neural dialogue models trained with different datasets and knowledge-grounding inputs. Additional responses originating from template-based generators are also considered, depending on the user's input and detected entities. The hand-crafted generators build on a dynamic knowledge graph injected with rich content that is crawled from the web and automatically processed on a daily basis. Viola's response ranker is a fine-tuned polyencoder that chooses the best response given the dialogue history. While dedicated annotations for the polyencoder alone can indirectly steer it away from choosing problematic responses, we add rule-based safety nets to detect neural degeneration and a dedicated classifier to filter out offensive content. We analyze conversations that Viola took part in for the Alexa Prize Socialbot Grand Challenge 4 and discuss the strengths and weaknesses of our approach. 
Lastly, we suggest future work with a focus on curating conversation data specifcially for socialbots that will contribute towards a more robust data-driven socialbot.", "authors": [ "Jonathan May", "Jennifer Lee", "Hitesh Pindikanti", "Nikhil Patel", "Shuai Liu", "Kartik Shenoy", "Basel Shbita", "Hyundong Cho" ], "published": "2021-08-25", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "gpt-3-models-are-poor-few-shot-learners-in", "arxiv_id": "2109.02555", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.02555v2", "url_pdf": "https://arxiv.org/pdf/2109.02555v2.pdf", "title": "GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain", "abstract": "Deep neural language models have set new breakthroughs in many tasks of Natural Language Processing (NLP). Recent work has shown that deep transformer language models (pretrained on large amounts of texts) can achieve high levels of task-specific few-shot performance comparable to state-of-the-art models. However, the ability of these large language models in few-shot transfer learning has not yet been explored in the biomedical domain. We investigated the performance of two powerful transformer language models, i.e. GPT-3 and BioBERT, in few-shot settings on various biomedical NLP tasks. The experimental results showed that, to a great extent, both the models underperform a language model fine-tuned on the full training data. Although GPT-3 had already achieved near state-of-the-art results in few-shot knowledge transfer on open-domain NLP tasks, it could not perform as effectively as BioBERT, which is orders of magnitude smaller than GPT-3. Regarding that BioBERT was already pretrained on large biomedical text corpora, our study suggests that language models may largely benefit from in-domain pretraining in task-specific few-shot learning. However, in-domain pretraining seems not to be sufficient; novel pretraining and few-shot learning strategies are required in the biomedical NLP domain.", "authors": [ "Matthias Samwald", "Florian Haberl", "Kathrin Blagec", "Milad Moradi" ], "published": "2021-09-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "a-recipe-for-arbitrary-text-style-transfer", "arxiv_id": "2109.03910", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.03910v4", "url_pdf": "https://arxiv.org/pdf/2109.03910v4.pdf", "title": "A Recipe For Arbitrary Text Style Transfer with Large Language Models", "abstract": "In this paper, we leverage large language models (LMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. 
Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as \"make this melodramatic\" or \"insert a metaphor.\"", "authors": [ "Jason Wei", "Chris Callison-Burch", "Andy Coenen", "Ann Yuan", "Daphne Ippolito", "Emily Reif" ], "published": "2021-09-08", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.acl-short.94", "conference_url_pdf": "https://aclanthology.org/2022.acl-short.94.pdf", "proceeding": "acl-2022-5" }, { "id": "examining-cross-lingual-contextual-embeddings", "arxiv_id": "2109.04921", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.04921v1", "url_pdf": "https://arxiv.org/pdf/2109.04921v1.pdf", "title": "Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes", "abstract": "State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mare\\v{c}ek, 2021) allows us to answer this question for specific linguistic features and learn a projection based only on mono-lingual annotated datasets. We evaluate syntactic (UD) and lexical (WordNet) structural information encoded inmBERT's contextual representations for nine diverse languages. We observe that for languages closely related to English, no transformation is needed. The evaluated information is encoded in a shared cross-lingual embedding space. For other languages, it is beneficial to apply orthogonal transformation learned separately for each language. We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.", "authors": [ "David Mareček", "Tomasz Limisiewicz" ], "published": "2021-09-10", "conference": null, "conference_url_abs": "https://aclanthology.org/2021.emnlp-main.376", "conference_url_pdf": "https://aclanthology.org/2021.emnlp-main.376.pdf", "proceeding": "emnlp-2021-11" }, { "id": "epic-employing-proverbs-in-context-as-a", "arxiv_id": "2109.06838", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.06838v3", "url_pdf": "https://arxiv.org/pdf/2109.06838v3.pdf", "title": "ePiC: Employing Proverbs in Context as a Benchmark for Abstract Language Understanding", "abstract": "While large language models have shown exciting progress on several NLP benchmarks, evaluating their ability for complex analogical reasoning remains under-explored. Here, we introduce a high-quality crowdsourced dataset of narratives for employing proverbs in context as a benchmark for abstract language understanding. The dataset provides fine-grained annotation of aligned spans between proverbs and narratives, and contains minimal lexical overlaps between narratives and proverbs, ensuring that models need to go beyond surface-level reasoning to succeed. We explore three tasks: (1) proverb recommendation and alignment prediction, (2) narrative generation for a given proverb and topic, and (3) identifying narratives with similar motifs. 
Our experiments show that neural language models struggle on these tasks compared to humans, and these tasks pose multiple learning challenges.", "authors": [ "Shashank Srivastava", "Sayan Ghosh" ], "published": "2021-09-14", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.acl-long.276", "conference_url_pdf": "https://aclanthology.org/2022.acl-long.276.pdf", "proceeding": "acl-2022-5" }, { "id": "holms-alternative-summary-evaluation-with", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2020.coling-main.498", "url_pdf": "https://aclanthology.org/2020.coling-main.498.pdf", "title": "HOLMS: Alternative Summary Evaluation with Large Language Models", "abstract": "Efficient document summarization requires evaluation measures that can not only rank a set of systems based on an average score, but also highlight which individual summary is better than another. However, despite the very active research on summarization approaches, few works have proposed new evaluation measures in the recent years. The standard measures relied upon for the development of summarization systems are most often ROUGE and BLEU which, despite being efficient in overall system ranking, remain lexical in nature and have a limited potential when it comes to training neural networks. In this paper, we present a new hybrid evaluation measure for summarization, called HOLMS, that combines both language models pre-trained on large corpora and lexical similarity measures. Through several experiments, we show that HOLMS outperforms ROUGE and BLEU substantially in its correlation with human judgments on several extractive summarization datasets for both linguistic quality and pyramid scores.", "authors": [ "Dina Demner-Fushman", "Yassine Mrabet" ], "published": "2020-12-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "coling-2020-8" }, { "id": "challenges-in-detoxifying-language-models", "arxiv_id": "2109.07445", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.07445v1", "url_pdf": "https://arxiv.org/pdf/2109.07445v1.pdf", "title": "Challenges in Detoxifying Language Models", "abstract": "Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. 
Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions -- highlighting further the nuances involved in careful evaluation of LM toxicity.", "authors": [ "Po-Sen Huang", "Ben Coppin", "Pushmeet Kohli", "Kirsty Anderson", "Lisa Anne Hendricks", "John Mellor", "Sumanth Dathathri", "Jonathan Uesato", "Amelia Glaese", "Johannes Welbl" ], "published": "2021-09-15", "conference": null, "conference_url_abs": "https://aclanthology.org/2021.findings-emnlp.210", "conference_url_pdf": "https://aclanthology.org/2021.findings-emnlp.210.pdf", "proceeding": "findings-emnlp-2021-11" }, { "id": "grounding-natural-language-instructions-can", "arxiv_id": "2109.08634", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.08634v1", "url_pdf": "https://arxiv.org/pdf/2109.08634v1.pdf", "title": "Grounding Natural Language Instructions: Can Large Language Models Capture Spatial Information?", "abstract": "Models designed for intelligent process automation are required to be capable of grounding user interface elements. This task of interface element grounding is centred on linking instructions in natural language to their target referents. Even though BERT and similar pre-trained language models have excelled in several NLP tasks, their use has not been widely explored for the UI grounding domain. This work concentrates on testing and probing the grounding abilities of three different transformer-based models: BERT, RoBERTa and LayoutLM. Our primary focus is on these models' spatial reasoning skills, given their importance in this domain. We observe that LayoutLM has a promising advantage for applications in this domain, even though it was created for a different original purpose (representing scanned documents): the learned spatial features appear to be transferable to the UI grounding setting, especially as they demonstrate the ability to discriminate between target directions in natural language instructions.", "authors": [ "Andre Freitas", "Dell Zhang", "Weiwei Cheng", "Krishna Dubba", "Deborah Ferreira", "Julia Rozanova" ], "published": "2021-09-17", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "salience-aware-event-chain-modeling-for", "arxiv_id": "2109.10475", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.10475v1", "url_pdf": "https://arxiv.org/pdf/2109.10475v1.pdf", "title": "Salience-Aware Event Chain Modeling for Narrative Understanding", "abstract": "Storytelling, whether via fables, news reports, documentaries, or memoirs, can be thought of as the communication of interesting and related events that, taken together, form a concrete process. It is desirable to extract the event chains that represent such processes. However, this extraction remains a challenging problem. We posit that this is due to the nature of the texts from which chains are discovered. Natural language text interleaves a narrative of concrete, salient events with background information, contextualization, opinion, and other elements that are important for a variety of necessary discourse and pragmatics acts but are not part of the principal chain of events being communicated. We introduce methods for extracting this principal chain from natural language text, by filtering away non-salient events and supportive sentences. We demonstrate the effectiveness of our methods at isolating critical event chains by comparing their effect on downstream tasks. 
We show that by pre-training large language models on our extracted chains, we obtain improvements in two tasks that benefit from a clear understanding of event chains: narrative prediction and event-based temporal question answering. The demonstrated improvements and ablative studies confirm that our extraction method isolates critical event chains.", "authors": [ "Jonathan May", "Muhao Chen", "Xiyang Zhang" ], "published": "2021-09-22", "conference": null, "conference_url_abs": "https://aclanthology.org/2021.emnlp-main.107", "conference_url_pdf": "https://aclanthology.org/2021.emnlp-main.107.pdf", "proceeding": "emnlp-2021-11" }, { "id": "transferring-knowledge-from-vision-to", "arxiv_id": "2109.11321", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.11321v2", "url_pdf": "https://arxiv.org/pdf/2109.11321v2.pdf", "title": "Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?", "abstract": "Large language models are known to suffer from the hallucination problem in that they are prone to output statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complements the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large language models by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models. The method is based on two steps, 1) a novel task querying for knowledge of memory colors, i.e. typical colors of well-known objects, and 2) filtering of model training data to clearly separate knowledge contributions. Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method. We find that our method can successfully be used to measure visual knowledge transfer capabilities in models and that our novel model architecture shows promising results for leveraging multimodal knowledge in a unimodal setting.", "authors": [ "Richard Johansson", "Lovisa Hagström", "Tobias Norlund" ], "published": "2021-09-23", "conference": null, "conference_url_abs": "https://aclanthology.org/2021.blackboxnlp-1.10", "conference_url_pdf": "https://aclanthology.org/2021.blackboxnlp-1.10.pdf", "proceeding": "emnlp-blackboxnlp-2021-11" }, { "id": "curb-your-carbon-emissions-benchmarking", "arxiv_id": "2109.12584", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.12584v4", "url_pdf": "https://arxiv.org/pdf/2109.12584v4.pdf", "title": "Curb Your Carbon Emissions: Benchmarking Carbon Emissions in Machine Translation", "abstract": "In recent times, there has been definitive progress in the field of NLP, with its applications growing as the utility of our language models increases with advances in their performance. However, these models require a large amount of computational power and data to train, consequently leading to large carbon footprints. Therefore, it is imperative that we study the carbon efficiency and look for alternatives to reduce the overall environmental impact of training models, in particular large language models. 
In our work, we assess the performance of models for machine translation, across multiple language pairs to assess the difference in computational power required to train these models for each of these language pairs and examine the various components of these models to analyze aspects of our pipeline that can be optimized to reduce these carbon emissions.", "authors": [ "Krithika Ramesh", "Gauri Gupta", "Praatibh Surana", "Mirza Yusuf" ], "published": "2021-09-26", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "generating-texts-under-constraint-through", "arxiv_id": "2109.13582", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.13582v2", "url_pdf": "https://arxiv.org/pdf/2109.13582v2.pdf", "title": "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided MCTS Decoding", "abstract": "Large language models (LM) based on Transformers allow to generate plausible long texts. In this paper, we explore how this generation can be further controlled at decoding time to satisfy certain constraints (e.g. being non-toxic, conveying certain emotions, using a specific writing style, etc.) without fine-tuning the LM. Precisely, we formalize constrained generation as a tree exploration process guided by a discriminator that indicates how well the associated sequence respects the constraint. This approach, in addition to being easier and cheaper to train than fine-tuning the LM, allows to apply the constraint more finely and dynamically. We propose several original methods to search this generation tree, notably the Monte Carlo Tree Search (MCTS) which provides theoretical guarantees on the search efficiency, but also simpler methods based on re-ranking a pool of diverse sequences using the discriminator scores. These methods are evaluated, with automatic and human-based metrics, on two types of constraints and languages: review polarity and emotion control in French and English. We show that discriminator-guided MCTS decoding achieves state-of-the-art results without having to tune the language model, in both tasks and languages. We also demonstrate that other proposed decoding methods based on re-ranking can be really effective when diversity among the generated propositions is encouraged.", "authors": [ "Ewa Kijak", "Vincent Claveau", "Antoine Chaffin" ], "published": "2021-09-28", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.naacl-main.215", "conference_url_pdf": "https://aclanthology.org/2022.naacl-main.215.pdf", "proceeding": "naacl-2022-7" }, { "id": "collaborative-storytelling-with-human-actors", "arxiv_id": "2109.14728", "nips_id": null, "url_abs": "https://arxiv.org/abs/2109.14728v1", "url_pdf": "https://arxiv.org/pdf/2109.14728v1.pdf", "title": "Collaborative Storytelling with Human Actors and AI Narrators", "abstract": "Large language models can be used for collaborative storytelling. In this work we report on using GPT-3 \\cite{brown2020language} to co-narrate stories. The AI system must track plot progression and character arcs while the human actors perform scenes. This event report details how a novel conversational agent was employed as creative partner with a team of professional improvisers to explore long-form spontaneous story narration in front of a live public audience. We introduced novel constraints on our language model to produce longer narrative text and tested the model in rehearsals with a team of professional improvisers. 
We then field tested the model with two live performances for public audiences as part of a live theatre festival in Europe. We surveyed audience members after each performance as well as performers to evaluate how well the AI performed in its role as narrator. Audiences and performers responded positively to AI narration and indicated preference for AI narration over AI characters within a scene. Performers also responded positively to AI narration and expressed enthusiasm for the creative and meaningful novel narrative directions introduced to the scenes. Our findings support improvisational theatre as a useful test-bed to explore how different language models can collaborate with humans in a variety of social contexts.", "authors": [ "Kory W. Mathewson", "Piotr Mirowski", "Boyd Branch" ], "published": "2021-09-29", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "probing-language-models-for-understanding-of", "arxiv_id": "2110.01113", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.01113v1", "url_pdf": "https://arxiv.org/pdf/2110.01113v1.pdf", "title": "Probing Language Models for Understanding of Temporal Expressions", "abstract": "We present three Natural Language Inference (NLI) challenge sets that can evaluate NLI models on their understanding of temporal expressions. More specifically, we probe these models for three temporal properties: (a) the order between points in time, (b) the duration between two points in time, (c) the relation between the magnitude of times specified in different units. We find that although large language models fine-tuned on MNLI have some basic perception of the order between points in time, at large, these models do not have a thorough understanding of the relation between temporal expressions.", "authors": [ "Christian Kavouras", "Kunal Kukreja", "Shivin Thukral" ], "published": "2021-10-03", "conference": null, "conference_url_abs": "https://aclanthology.org/2021.blackboxnlp-1.31", "conference_url_pdf": "https://aclanthology.org/2021.blackboxnlp-1.31.pdf", "proceeding": "emnlp-blackboxnlp-2021-11" }, { "id": "ai-chains-transparent-and-controllable-human", "arxiv_id": "2110.01691", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.01691v3", "url_pdf": "https://arxiv.org/pdf/2110.01691v3.pdf", "title": "AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts", "abstract": "Although large language models (LLMs) have demonstrated impressive potential on simple tasks, their breadth of scope, lack of transparency, and insufficient controllability can make them less effective when assisting humans on more complex tasks. In response, we introduce the concept of Chaining LLM steps together, where the output of one step becomes the input for the next, thus aggregating the gains per step. We first define a set of LLM primitive operations useful for Chain construction, then present an interactive system where users can modify these Chains, along with their intermediate results, in a modular way. In a 20-person user study, we found that Chaining not only improved the quality of task outcomes, but also significantly enhanced system transparency, controllability, and sense of collaboration. 
Additionally, we saw that users developed new ways of interacting with LLMs through Chains: they leveraged sub-tasks to calibrate model expectations, compared and contrasted alternative strategies by observing parallel downstream effects, and debugged unexpected model outputs by \"unit-testing\" sub-components of a Chain. In two case studies, we further explore how LLM Chains may be used in future applications", "authors": [ "Carrie J. Cai", "Michael Terry", "Tongshuang Wu" ], "published": "2021-10-04", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "multimodal-datasets-misogyny-pornography-and", "arxiv_id": "2110.01963", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.01963v1", "url_pdf": "https://arxiv.org/pdf/2110.01963v1.pdf", "title": "Multimodal datasets: misogyny, pornography, and malignant stereotypes", "abstract": "We have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has given rise to formidable bodies of critical work that has called for caution while generating these large datasets. These address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the CommonCrawl dataset often used as a source for training large language models, and the entrenched biases in large-scale visio-linguistic models (such as OpenAI's CLIP model) trained on opaque datasets (WebImageText). In the backdrop of these specific calls of caution, we examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset. We found that the dataset contains, troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content. We outline numerous implications, concerns and downstream harms regarding the current state of large scale datasets while raising open questions for various stakeholders including the AI community, regulators, policy makers and data subjects.", "authors": [], "published": "2021-10-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "leveraging-the-inductive-bias-of-large", "arxiv_id": "2110.02370", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.02370v1", "url_pdf": "https://arxiv.org/pdf/2110.02370v1.pdf", "title": "Leveraging the Inductive Bias of Large Language Models for Abstract Textual Reasoning", "abstract": "Large natural language models (such as GPT-3 or T5) demonstrate impressive abilities across a range of general NLP tasks. Here, we show that the knowledge embedded in such models provides a useful inductive bias, not just on traditional NLP tasks, but also in the nontraditional task of training a symbolic reasoning engine. We observe that these engines learn quickly and generalize in a natural way that reflects human intuition. For example, training such a system to model block-stacking might naturally generalize to stacking other types of objects because of structure in the real world that has been partially captured by the language describing it. We study several abstract textual reasoning tasks, such as object manipulation and navigation, and demonstrate multiple types of generalization to novel scenarios and the symbols that comprise them. 
We also demonstrate the surprising utility of \\textit{compositional learning}, where a learner dedicated to mastering a complicated task gains an advantage by training on relevant simpler tasks instead of jumping straight to the complicated task.", "authors": [ "David Wingate", "Christopher Michael Rytting" ], "published": "2021-10-05", "conference": null, "conference_url_abs": "http://proceedings.neurips.cc/paper/2021/hash/8e08227323cd829e449559bb381484b7-Abstract.html", "conference_url_pdf": "http://proceedings.neurips.cc/paper/2021/file/8e08227323cd829e449559bb381484b7-Paper.pdf", "proceeding": "neurips-2021-12" }, { "id": "towards-continual-knowledge-learning-of", "arxiv_id": "2110.03215", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.03215v4", "url_pdf": "https://arxiv.org/pdf/2110.03215v4.pdf", "title": "Towards Continual Knowledge Learning of Language Models", "abstract": "Large Language Models (LMs) are known to encode world knowledge in their parameters as they pretrain on a vast amount of web corpus, which is often utilized for performing knowledge-dependent downstream tasks such as question answering, fact-checking, and open dialogue. In real-world scenarios, the world knowledge stored in the LMs can quickly become outdated as the world changes, but it is non-trivial to avoid catastrophic forgetting and reliably acquire new knowledge while preserving invariant knowledge. To push the community towards better maintenance of ever-changing LMs, we formulate a new continual learning (CL) problem called Continual Knowledge Learning (CKL). We construct a new benchmark and metric to quantify the retention of time-invariant world knowledge, the update of outdated knowledge, and the acquisition of new knowledge. We adopt applicable recent methods from literature to create several strong baselines. Through extensive experiments, we find that CKL exhibits unique challenges that are not addressed in previous CL setups, where parameter expansion is necessary to reliably retain and learn knowledge simultaneously. By highlighting the critical causes of knowledge forgetting, we show that CKL is a challenging and important problem that helps us better understand and train ever-changing LMs. The benchmark datasets, evaluation script, and baseline code to reproduce our results are available at https://github.com/joeljang/continual-knowledge-learning.", "authors": [ "Minjoon Seo", "Stanley Jungkyu Choi", "Gyeonghun Kim", "Janghoon Han", "Joongbo Shin", "Sohee Yang", "Seonghyeon Ye", "Joel Jang" ], "published": "2021-10-07", "conference": "towards-continual-knowledge-learning-of-1", "conference_url_abs": "https://openreview.net/forum?id=vfsRB5MImo9", "conference_url_pdf": "https://openreview.net/pdf?id=vfsRB5MImo9", "proceeding": "iclr-2022-4" }, { "id": "large-language-models-can-be-strong-1", "arxiv_id": "2110.05679", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.05679v6", "url_pdf": "https://arxiv.org/pdf/2110.05679v6.pdf", "title": "Large Language Models Can Be Strong Differentially Private Learners", "abstract": "Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. 
We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives which are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension) empirical results reveal that private learning with pretrained language models doesn't tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.", "authors": [ "Tatsunori Hashimoto", "Percy Liang", "Florian Tramèr", "Xuechen Li" ], "published": "2021-10-12", "conference": "large-language-models-can-be-strong", "conference_url_abs": "https://openreview.net/forum?id=bVuP3ltATMz", "conference_url_pdf": "https://openreview.net/pdf?id=bVuP3ltATMz", "proceeding": "iclr-2022-4" }, { "id": "teaching-models-new-apis-domain-agnostic", "arxiv_id": "2110.06905", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.06905v1", "url_pdf": "https://arxiv.org/pdf/2110.06905v1.pdf", "title": "Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue", "abstract": "We demonstrate that large language models are able to simulate Task Oriented Dialogues in novel domains, provided only with an API implementation and a list of goals. We show these simulations can formulate online, automatic metrics that correlate well with human evaluations. Furthermore, by checking for whether the User's goals are met, we can use simulation to repeatedly generate training data and improve the quality of simulations themselves. With no human intervention or domain-specific training data, our simulations bootstrap end-to-end models which achieve a 37\\% error reduction in previously unseen domains. By including as few as 32 domain-specific conversations, bootstrapped models can match the performance of a fully-supervised model with $10\\times$ more data. To our knowledge, this is the first time simulations have been shown to be effective at bootstrapping models without explicitly requiring any domain-specific training data, rule-engineering, or humans-in-the-loop.", "authors": [ "Stephen Roller", "Paul A. Crook", "Moya Chen" ], "published": "2021-10-13", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "p-adapters-robustly-extracting-factual-1", "arxiv_id": "2110.07280", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.07280v2", "url_pdf": "https://arxiv.org/pdf/2110.07280v2.pdf", "title": "P-Adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts", "abstract": "Recent work (e.g. 
LAMA (Petroni et al., 2019)) has found that the quality of the factual information extracted from Large Language Models (LLMs) depends on the prompts used to query them. This inconsistency is problematic because different users will query LLMs for the same information using different wording, but should receive the same, accurate responses regardless. In this work we aim to address this shortcoming by introducing P-Adapters: lightweight models that sit between the embedding layer and first attention layer of LLMs. They take LLM embeddings as input and output continuous prompts that are used to query the LLM. Additionally, we investigate Mixture of Experts (MoE) models that learn a set of continuous prompts (\"experts\") and select one to query the LLM. They require a separate classifier trained on human-annotated data to map natural language prompts to the continuous ones. P-Adapters perform comparably to the more complex MoE models in extracting factual information from BERT and RoBERTa while eliminating the need for additional annotations. P-Adapters show between 12-26% absolute improvement in precision and 36-50% absolute improvement in consistency over a baseline of only using natural language queries. Finally, we investigate what makes P-Adapters successful and conclude that a significant factor is access to the LLM's embeddings of the original natural language prompt, particularly the subject of the entity pair being queried.", "authors": [ "Nazneen Rajani", "Prafulla Kumar Choubey", "Benjamin Newman" ], "published": "2021-10-14", "conference": "p-adapters-robustly-extracting-factual", "conference_url_abs": "https://openreview.net/forum?id=DhzIU48OcZh", "conference_url_pdf": "https://openreview.net/pdf?id=DhzIU48OcZh", "proceeding": "iclr-2022-4" }, { "id": "multitask-prompted-training-enables-zero-shot-1", "arxiv_id": "2110.08207", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.08207v3", "url_pdf": "https://arxiv.org/pdf/2110.08207v3.pdf", "title": "Multitask Prompted Training Enables Zero-Shot Task Generalization", "abstract": "Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.", "authors": [ "Leo Gao", "Stella Biderman", "Tali Bers", "Alexander M. 
Rush", "Thomas Wolf", "Ryan Teehan", "Jason Alan Fries", "Thibault Fevry", "Andrea Santilli", "Abheesht Sharma", "Jos Rozen", "Trishala Neeraj", "Thomas Wang", "Rachel Bawden", "Harshit Pandey", "Zheng Xin Yong", "Sheng Shen", "Matteo Manica", "Han Wang", "Mike Tian-Jian Jiang", "Jonathan Chang", "Debajyoti Datta", "Nihal Nayak", "Gunjan Chhablani", "Taewoon Kim", "Eliza Szczechla", "Shanya Sharma Sharma", "Urmish Thakker", "Canwen Xu", "M Saiful Bari", "Manan Dey", "Arun Raja", "Teven Le Scao", "Arnaud Stiegler", "Antoine Chaffin", "Zaid Alyafeai", "Lintang Sutawika", "Stephen H. Bach", "Colin Raffel", "Albert Webson", "Victor Sanh" ], "published": "2021-10-15", "conference": "multitask-prompted-training-enables-zero-shot", "conference_url_abs": "https://openreview.net/forum?id=9Vrb9D0WI4", "conference_url_pdf": "https://openreview.net/pdf?id=9Vrb9D0WI4", "proceeding": "iclr-2022-4" }, { "id": "boosting-coherence-of-language-models", "arxiv_id": "2110.08294", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.08294v2", "url_pdf": "https://arxiv.org/pdf/2110.08294v2.pdf", "title": "Coherence boosting: When your pretrained language model is not paying enough attention", "abstract": "Long-range semantic coherence remains a challenge in automatic language generation and understanding. We demonstrate that large language models have insufficiently learned the effect of distant words on next-token prediction. We present coherence boosting, an inference procedure that increases a LM's focus on a long context. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialog responses. It is also found that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.", "authors": [ "Nebojsa Jojic", "Zhen Wang", "Nikolay Malkin" ], "published": "2021-10-15", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.acl-long.565", "conference_url_pdf": "https://aclanthology.org/2022.acl-long.565.pdf", "proceeding": "acl-2022-5" }, { "id": "risks-of-ai-foundation-models-in-education", "arxiv_id": "2110.10024", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.10024v1", "url_pdf": "https://arxiv.org/pdf/2110.10024v1.pdf", "title": "Risks of AI Foundation Models in Education", "abstract": "If the authors of a recent Stanford report (Bommasani et al., 2021) on the opportunities and risks of \"foundation models\" are to be believed, these models represent a paradigm shift for AI and for the domains in which they will supposedly be used, including education. Although the name is new (and contested (Field, 2021)), the term describes existing types of algorithmic models that are \"trained on broad data at scale\" and \"fine-tuned\" (i.e., adapted) for particular downstream tasks, and is intended to encompass large language models such as BERT or GPT-3 and computer vision models such as CLIP. Such technologies have the potential for harm broadly speaking (e.g., Bender et al., 2021), but their use in the educational domain is particularly fraught, despite the potential benefits for learners claimed by the authors. In section 3.3 of the Stanford report, Malik et al. argue that achieving the goal of providing education for all learners requires more efficient computational approaches that can rapidly scale across educational domains and across educational contexts, for which they argue foundation models are uniquely well-suited. 
However, evidence suggests that not only are foundation models not likely to achieve the stated benefits for learners, but their use may also introduce new risks for harm.", "authors": [ "Michael Madaio", "Su Lin Blodgett" ], "published": "2021-10-19", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "a-data-bootstrapping-recipe-for-low-resource", "arxiv_id": "2110.09570", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.09570v1", "url_pdf": "https://arxiv.org/pdf/2110.09570v1.pdf", "title": "A Data Bootstrapping Recipe for Low Resource Multilingual Relation Classification", "abstract": "Relation classification (sometimes called 'extraction') requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages, because they are syntactically and morphologically diverse, as well as different from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well served by public data sets. In response, we present IndoRE, a dataset with 21K entity and relation tagged gold sentences in three Indian languages, plus English. We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information and provides competitive monolingual relation classification. Using this system, we explore and compare transfer mechanisms between languages. In particular, we study the accuracy efficiency tradeoff between expensive gold instances vs. translated and aligned 'silver' instances. We release the dataset for future research.", "authors": [ "Soumen Chakrabarti", "Niloy Ganguly", "Animesh Mukherjee", "Bidisha Samanta", "Arijit Nag" ], "published": "2021-10-18", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "hierarchical-transformers-are-more-efficient", "arxiv_id": "2110.13711", "nips_id": null, "url_abs": "https://arxiv.org/abs/2110.13711v2", "url_pdf": "https://arxiv.org/pdf/2110.13711v2.pdf", "title": "Hierarchical Transformers Are More Efficient Language Models", "abstract": "Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. 
In particular, Hourglass sets new state-of-the-art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.", "authors": [ "Henryk Michalewski", "Christian Szegedy", "Yuhuai Wu", "Łukasz Kaiser", "Michał Tyrolski", "Szymon Tworkowski", "Piotr Nawrot" ], "published": "2021-10-26", "conference": "hierarchical-transformers-are-more-efficient-1", "conference_url_abs": "https://aclanthology.org/2022.findings-naacl.117", "conference_url_pdf": "https://aclanthology.org/2022.findings-naacl.117.pdf", "proceeding": "findings-naacl-2022-7" }, { "id": "a-systematic-investigation-of-commonsense", "arxiv_id": "2111.00607", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.00607v3", "url_pdf": "https://arxiv.org/pdf/2111.00607v3.pdf", "title": "A Systematic Investigation of Commonsense Knowledge in Large Language Models", "abstract": "Language models (LMs) trained on large amounts of data have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup. Here we aim to better understand the extent to which such models learn commonsense knowledge -- a critical component of many NLP applications. We conduct a systematic and rigorous zero-shot and few-shot commonsense evaluation of large pre-trained LMs, where we: (i) carefully control for the LMs' ability to exploit potential surface cues and annotation artefacts, and (ii) account for variations in performance that arise from factors that are not related to commonsense knowledge. Our findings highlight the limitations of pre-trained LMs in acquiring commonsense knowledge without task-specific supervision; furthermore, using larger models or few-shot evaluation are insufficient to achieve human-level commonsense performance.", "authors": [ "Phil Blunsom", "Cyprien de Masson d'Autume", "Jordan Hoffmann", "Adhiguna Kuncoro", "Aida Nematzadeh", "Xiang Lorraine Li" ], "published": "2021-10-31", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "magic-pyramid-accelerating-inference-with", "arxiv_id": "2111.00230", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.00230v1", "url_pdf": "https://arxiv.org/pdf/2111.00230v1.pdf", "title": "Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning", "abstract": "Pre-training and then fine-tuning large language models is commonly used to achieve state-of-the-art performance in natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed. Deploying such large models to applications with latency constraints is challenging. In this work, we focus on accelerating the inference via conditional computations. To achieve this, we propose a novel idea, Magic Pyramid (MP), to reduce both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT. The former manages to save the computation via removing non-salient tokens, while the latter can fulfill the computation reduction by terminating the inference early before reaching the final layer, if the exiting condition is met. Our empirical studies demonstrate that compared to previous state of arts, MP is not only able to achieve a speed-adjustable inference but also to surpass token pruning and early exiting by reducing up to 70% giga floating point operations (GFLOPs) with less than 0.5% accuracy drop. 
Token pruning and early exiting express distinctive preferences to sequences with different lengths. However, MP is capable of achieving an average of 8.06x speedup on two popular text classification tasks, regardless of the sizes of the inputs.", "authors": [ "Trishul Chilimbi", "Santosh Rajagopalan", "Belinda Zeng", "Xiang He", "Yi Xu", "Iman Keivanloo", "Xuanli He" ], "published": "2021-10-30", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "recent-advances-in-natural-language", "arxiv_id": "2111.01243", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.01243v1", "url_pdf": "https://arxiv.org/pdf/2111.01243v1.pdf", "title": "Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey", "abstract": "Large, pre-trained transformer-based language models such as BERT have drastically changed the Natural Language Processing (NLP) field. We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches. We also present approaches that use pre-trained language models to generate data for training augmentation or other purposes. We conclude with discussions on limitations and suggested directions for future research.", "authors": [ "Dan Roth", "Ilana Heinz", "Eneko Agirre", "Oscar Sainz", "Thien Huu Nguyen", "Amir Pouran Ben Veyseh", "Elior Sulem", "Hayley Ross", "Bonan Min" ], "published": "2021-11-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "an-explanation-of-in-context-learning-as-1", "arxiv_id": "2111.02080", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.02080v6", "url_pdf": "https://arxiv.org/pdf/2111.02080v6.pdf", "title": "An Explanation of In-context Learning as Implicit Bayesian Inference", "abstract": "Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. We prove when this occurs despite a distribution mismatch between prompts and pretraining data in a setting where the pretraining distribution is a mixture of HMMs. In contrast to messy large-scale datasets used to train LMs capable of in-context learning, we generate a small-scale synthetic dataset (GINC) where Transformers and LSTMs both exhibit in-context learning. 
Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning.", "authors": [ "Tengyu Ma", "Percy Liang", "aditi raghunathan", "Sang Michael Xie" ], "published": "2021-11-03", "conference": "an-explanation-of-in-context-learning-as", "conference_url_abs": "https://openreview.net/forum?id=RdJVFCHjUMI", "conference_url_pdf": "https://openreview.net/pdf?id=RdJVFCHjUMI", "proceeding": "iclr-2022-4" }, { "id": "the-klarna-product-page-dataset-a", "arxiv_id": "2111.02168", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.02168v4", "url_pdf": "https://arxiv.org/pdf/2111.02168v4.pdf", "title": "The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models", "abstract": "Web automation holds the potential to revolutionize how users interact with the digital world, offering unparalleled assistance and simplifying tasks via sophisticated computational methods. Central to this evolution is the web element nomination task, which entails identifying unique elements on webpages. Unfortunately, the development of algorithmic designs for web automation is hampered by the scarcity of comprehensive and realistic datasets that reflect the complexity faced by real-world applications on the Web. To address this, we introduce the Klarna Product Page Dataset, a comprehensive and diverse collection of webpages that surpasses existing datasets in richness and variety. The dataset features 51,701 manually labeled product pages from 8,175 e-commerce websites across eight geographic regions, accompanied by a dataset of rendered page screenshots. To initiate research on the Klarna Product Page Dataset, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. We make three important contributions. First, we found that a simple Convolutional GNN (GCN) outperforms complex state-of-the-art nomination methods. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page using the aforementioned GCN. These elements are then passed to a large language model for the final nomination. This procedure significantly improves the nomination accuracy by 16.8 percentage points on our challenging dataset, without any need for fine-tuning. Finally, in response to another prevalent challenge in this field - the abundance of training methodologies suitable for element nomination - we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.", "authors": [ "Jens Lagergren", "Aref Moradi", "Stefan Magureanu", "Riccardo Sven Risuleo", "Alexandra Hotti" ], "published": "2021-11-03", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "hyperparameter-power-impact-in-transformer", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.sustainlp-1.12", "url_pdf": "https://aclanthology.org/2021.sustainlp-1.12.pdf", "title": "Hyperparameter Power Impact in Transformer Language Model Training", "abstract": "Training large language models can consume a large amount of energy. 
We hypothesize that the language model’s configuration impacts its energy consumption, and that there is room for power consumption optimisation in modern large language models. To investigate these claims, we introduce a power consumption factor to the objective function, and explore the range of models and hyperparameter configurations that affect power. We identify multiple configuration factors that can reduce power consumption during language model training while retaining model quality.", "authors": [ "Leon Derczynski", "Timmie Rantzau", "Mads Guldborg Kjeldgaard Kongsbak", "Lucas Høyberg Puvis de Chavannes" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "emnlp-sustainlp-2021-11" } ] }
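
The listing above can also be consumed programmatically. Below is a minimal Python sketch, not an official client, that pages through the papers endpoint and prints a few fields from each result. It assumes the third-party requests package, that the endpoint accepts q and page query parameters, and that each page wraps its results array in an envelope with count, next, and previous fields; the per-result field names used here (id, title, arxiv_id, url_abs, authors, published) match those shown in the response above.

"""Minimal sketch (assumptions noted above): page through the papers listing
and print a few fields from each result object."""
import requests

BASE_URL = "https://paperswithcode.com/api/v1/papers/"  # assumed base endpoint

def iter_papers(query, max_pages=3):
    """Yield result objects, following the `next` link until it is null or max_pages is reached."""
    url, params, pages = BASE_URL, {"q": query}, 0
    while url and pages < max_pages:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        # Each page is assumed to carry a "results" list as in the example response.
        for paper in payload.get("results", []):
            yield paper
        url = payload.get("next")   # absolute URL of the next page, or None on the last page
        params = None               # the `next` URL already carries the query string
        pages += 1

if __name__ == "__main__":
    for paper in iter_papers("Large Language Models"):
        # Fall back to None if a field is absent in a given result.
        print(paper.get("published"), paper.get("arxiv_id"), "-", paper.get("title"))

Following the next URL rather than incrementing page by hand keeps the client correct if the server changes its page size or query encoding; bounding the loop with max_pages avoids walking the full collection unintentionally.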