Paper List
Return a paginated listing of all papers. The optional q parameter filters the listing by a search query, and page selects which page of results to return, as in the example request below.
GET /api/v1/papers/?page=3&q=Large+Language+Models
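The response is a JSON envelope whose next, previous, and results fields can be seen in the sample below. As a minimal client-side sketch, the following Python snippet fetches a page of matching papers and follows next links; it assumes only the public endpoint above and the third-party requests package, and iter_papers and its parameters are illustrative names, not part of the API:

```python
import requests

BASE_URL = "https://paperswithcode.com/api/v1/papers/"

def iter_papers(query, start_page=1, max_pages=1):
    """Yield paper records for a search query, following 'next' links."""
    url, params = BASE_URL, {"q": query, "page": start_page}
    for _ in range(max_pages):
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["results"]        # one dict per paper record
        if not payload.get("next"):          # null when there are no further pages
            return
        url, params = payload["next"], None  # 'next' already encodes q and page

for paper in iter_papers("Large Language Models", start_page=3):
    print(paper["id"], "-", paper["title"])
```

Each element of results is a paper record with fields such as id, arxiv_id, title, abstract, authors, url_abs, and url_pdf, as in the sample response below.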
Example response:

{ "next": "
https://paperswithcode.com/api/v1/papers/?page=4&q=Large+Language+Models", "previous": "https://paperswithcode.com/api/v1/papers/?page=2&q=Large+Language+Models", "results": [ { "id": "a-data-bootstrapping-recipe-for-low-resource-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.conll-1.45", "url_pdf": "https://aclanthology.org/2021.conll-1.45.pdf", "title": "A Data Bootstrapping Recipe for Low-Resource Multilingual Relation Classification", "abstract": "Relation classification (sometimes called ‘extraction’) requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages, because they are syntactically and morphologically diverse, as well as different from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well-served by public data sets. In response, we present IndoRE, a dataset with 39K entity- and relation-tagged gold sentences in three Indian languages, plus English. We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information and provides competitive monolingual relation classification. Using this system, we explore and compare transfer mechanisms between languages. In particular, we study the accuracy-efficiency tradeoff between expensive gold instances vs. translated and aligned ‘silver’ instances.", "authors": [ "Soumen Chakrabarti", "Niloy Ganguly", "Animesh Mukherjee", "Bidisha Samanta", "Arijit Nag" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "conll-emnlp-2021-11" }, { "id": "surface-form-competition-why-the-highest-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.emnlp-main.564", "url_pdf": "https://aclanthology.org/2021.emnlp-main.564.pdf", "title": "Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right", "abstract": "Large language models have shown promising results in zero-shot settings. For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition—wherein different surface forms compete for probability mass, even if they represent the same underlying concept in a given context, e.g. “computer” and “PC.” Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to its a priori likelihood within the context of a specific task. 
It achieves consistent gains in zero-shot performance over both calibrated and uncalibrated scoring functions on all GPT-2 and GPT-3 models on a variety of multiple choice datasets.", "authors": [ "Luke Zettlemoyer", "Yejin Choi", "Vered Shwartz", "Peter West", "Ari Holtzman" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "emnlp-2021-11" }, { "id": "autosumm-automatic-model-creation-for-text", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.emnlp-main.798", "url_pdf": "https://aclanthology.org/2021.emnlp-main.798.pdf", "title": "AUTOSUMM: Automatic Model Creation for Text Summarization", "abstract": "Recent efforts to develop deep learning models for text generation tasks such as extractive and abstractive summarization have resulted in state-of-the-art performances on various datasets. However, obtaining the best model configuration for a given dataset requires an extensive knowledge of deep learning specifics like model architecture, tuning parameters etc., and is often extremely challenging for a non-expert. In this paper, we propose methods to automatically create deep learning models for the tasks of extractive and abstractive text summarization. Based on the recent advances in Automated Machine Learning and the success of large language models such as BERT and GPT-2 in encoding knowledge, we use a combination of Neural Architecture Search (NAS) and Knowledge Distillation (KD) techniques to perform model search and compression using the vast knowledge provided by these language models to develop smaller, customized models for any given dataset. We present extensive empirical results to illustrate the effectiveness of our model creation methods in terms of inference time and model size, while achieving near state-of-the-art performances in terms of accuracy across a range of datasets.", "authors": [ "Aparna Garimella", "Niyati Chhaya", "Raj Snehal", "Sagnik Mukherjee", "Jay Mundra", "Atharv Tyagi", "Sharmila Reddy Nangi" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "emnlp-2021-11" }, { "id": "gender-and-representation-bias-in-gpt-3", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.nuse-1.5", "url_pdf": "https://aclanthology.org/2021.nuse-1.5.pdf", "title": "Gender and Representation Bias in GPT-3 Generated Stories", "abstract": "Using topic modeling and lexicon-based word similarity, we find that stories generated by GPT-3 exhibit many known gender stereotypes. Generated stories depict different topics and descriptions depending on GPT-3’s perceived gender of the character in a prompt, with feminine characters more likely to be associated with family and appearance, and described as less powerful than masculine characters, even when associated with high power verbs in a prompt. 
Our study raises questions on how one can avoid unintended social biases when using large language models for storytelling.", "authors": [ "David Bamman", "Li Lucy" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-nuse-2021-6" }, { "id": "unsupervised-and-distributional-detection-of", "arxiv_id": "2111.02878", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.02878v1", "url_pdf": "https://arxiv.org/pdf/2111.02878v1.pdf", "title": "Unsupervised and Distributional Detection of Machine-Generated Text", "abstract": "The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human or machine-authored. The problem so far has been framed in a standard supervised way and consists in training a classifier on annotated data to predict the origin of one given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume that we have access to a large collection of unannotated documents, a big fraction of which is machine-generated. We propose a method to detect those machine-generated documents leveraging repeated higher-order n-grams, which we show over-appear in machine-generated text as compared to human ones. That weak signal is the starting point of a self-training setting where pseudo-labelled documents are used to train an ensemble of classifiers. Our experiments show that leveraging that signal allows us to rank suspicious documents accurately. Precision at 5000 is over 90% for top-k sampling strategies, and over 80% for nucleus sampling for the largest model we used (GPT2-large). The drop with increased size of model is small, which could indicate that the results hold for other current and future large language models.", "authors": [ "Hady Elsahar", "Germán Kruszewski", "Jos Rozen", "Matthias Gallé" ], "published": "2021-11-04", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "reason-first-then-respond-modular-generation", "arxiv_id": "2111.05204", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.05204v1", "url_pdf": "https://arxiv.org/pdf/2111.05204v1.pdf", "title": "Reason first, then respond: Modular Generation for Knowledge-infused Dialogue", "abstract": "Large language models can produce fluent dialogue but often hallucinate factual inaccuracies. While retrieval-augmented models help alleviate this issue, they still face a difficult challenge of both reasoning to provide correct knowledge and generating conversation simultaneously. In this work, we propose a modular model, Knowledge to Response (K2R), for incorporating knowledge into conversational agents, which breaks down this problem into two easier steps. K2R first generates a knowledge sequence, given a dialogue context, as an intermediate step. After this \"reasoning step\", the model then attends to its own generated knowledge sequence, as well as the dialogue context, to produce a final response. In detailed experiments, we find that such a model hallucinates less in knowledge-grounded dialogue tasks, and has advantages in terms of interpretability and modularity. 
In particular, it can be used to fuse QA and dialogue systems together to enable dialogue agents to give knowledgeable answers, or QA models to give conversational responses in a zero-shot setting.", "authors": [ "Jason Weston", "Arthur Szlam", "Jack Urbanek", "Kurt Shuster", "Leonard Adolphs" ], "published": "2021-11-09", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "popnet-a-pop-culture-knowledge-association", "arxiv_id": "2111.04920", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.04920v3", "url_pdf": "https://arxiv.org/pdf/2111.04920v3.pdf", "title": "PopBlends: Strategies for Conceptual Blending with Large Language Models", "abstract": "Pop culture is an important aspect of communication. On social media people often post pop culture reference images that connect an event, product or other entity to a pop culture domain. Creating these images is a creative challenge that requires finding a conceptual connection between the users' topic and a pop culture domain. In cognitive theory, this task is called conceptual blending. We present a system called PopBlends that automatically suggests conceptual blends. The system explores three approaches that involve both traditional knowledge extraction methods and large language models. Our annotation study shows that all three methods provide connections with similar accuracy, but with very different characteristics. Our user study shows that people found twice as many blend suggestions as they did without the system, and with half the mental demand. We discuss the advantages of combining large language models with knowledge bases for supporting divergent and convergent thinking.", "authors": [], "published": "2021-11-09", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "solving-probability-and-statistics-problems", "arxiv_id": "2111.08267", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.08267v1", "url_pdf": "https://arxiv.org/pdf/2111.08267v1.pdf", "title": "Solving Probability and Statistics Problems by Program Synthesis", "abstract": "We solve university level probability and statistics questions by program synthesis using OpenAI's Codex, a Transformer trained on text and fine-tuned on code. We transform course problems from MIT's 18.05 Introduction to Probability and Statistics and Harvard's STAT110 Probability into programming tasks. We then execute the generated code to get a solution. Since these course questions are grounded in probability, we often aim to have Codex generate probabilistic programs that simulate a large number of probabilistic dependencies to compute its solution. Our approach requires prompt engineering to transform the question from its original form to an explicit, tractable form that results in a correct program and solution. To estimate the amount of work needed to translate an original question into its tractable form, we measure the similarity between original and transformed questions. 
Our work is the first to introduce a new dataset of university-level probability and statistics problems and solve these problems in a scalable fashion using the program synthesis capabilities of large language models.", "authors": [ "Iddo Drori", "Nakul Verma", "Nikhil Singh", "Elizabeth Ke", "Leonard Tang" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "think-big-teach-small-do-language-models", "arxiv_id": null, "nips_id": null, "url_abs": "http://proceedings.neurips.cc/paper/2021/hash/0cd6a652ed1f7811192db1f700c8f0e7-Abstract.html", "url_pdf": "http://proceedings.neurips.cc/paper/2021/file/0cd6a652ed1f7811192db1f700c8f0e7-Paper.pdf", "title": "Think Big, Teach Small: Do Language Models Distil Occam’s Razor?", "abstract": "Large language models have recently shown a remarkable ability for few-shot learning, including patterns of algorithmic nature. However, it is still an open question to determine what kind of patterns these models can capture and how many examples they need in their prompts. We frame this question as a teaching problem with strong priors, and study whether language models can identify simple algorithmic concepts from small witness sets. In particular, we explore how several GPT architectures, program induction systems and humans perform in terms of the complexity of the concept and the number of additional examples, and how much their behaviour differs. This first joint analysis of language models and machine teaching can address key questions for artificial intelligence and machine learning, such as whether some strong priors, and Occam’s razor in particular, can be distilled from data, making learning from a few examples possible.", "authors": [ "José Hernández-Orallo", "Cesar Ferri", "David Castellano Falcón", "Gonzalo Jaimovitch-Lopez" ], "published": "2021-12-01", "conference": null, "conference_url_abs": "https://openreview.net/forum?id=F6gvhOgTM-4", "conference_url_pdf": "https://openreview.net/pdf?id=F6gvhOgTM-4", "proceeding": "neurips-2021-12" }, { "id": "directing-the-violence-or-admonishing-it-a", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=13LjoyYWcaw", "url_pdf": "https://openreview.net/pdf?id=13LjoyYWcaw", "title": "Directing the violence or admonishing it? A survey of contronymy and androcentrism in Google Translate and some recommendations", "abstract": "The recent raft of high-profile gaffes involving neural machine translation technologies has brought to light the unreliability of this evolving technology. A worrisome\nfacet of the ubiquity of this technology is that it largely operates in a use-it-at-yourown-peril mode where the user is often unaware of either the idiosyncratic brittleness of the underlying neural translation model or when it is, that the translations\nbe deemed trustworthy and when they wouldn’t. These revelations have worryingly\ncoincided with other developments such as the emergence of large language models\nthat now produce biased and erroneous results, albeit with human-like fluency, the\nuse of back-translation as a data-augmentation strategy in so termed ’low-resource’\nsettings and the emergence of ’AI-enhanced legal-tech’ as a panacea that promises\n’disruptive democratization’ of access to legal services. 
In the backdrop of these\nquandaries, we present this cautionary tale where we shed light on the specifics\nof the risks surrounding cavalier deployment of this technology by exploring two\nspecific failings: Androcentrism and Enantiosemy. In this regard, we empirically\ninvestigate the fate of the pronouns and a list of contronyms when subjected to\nback-translation using Google Translate. Through this, we seek to highlight the\nprevalence of ’defaulting-to-the-masculine’ phenomenon in the context of engendered profession-related translations and also empirically demonstrate the scale and\nnature of threats pertaining to contronymous phrases covering both current-affairs\nand legal issues. Based on these observations, we have collected a series of recommendations that constitute the latter half of this paper. All of the code and datasets\ngenerated in this paper have been open-sourced for the community to build on here:\nhttps://github.com/rteehas/GT_study_recommendations.\n", "authors": [ "Anonymous" ], "published": "2021-08-18", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "plug-and-play-conversational-models-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=Z4I9PcrWdRI", "url_pdf": "https://openreview.net/pdf?id=Z4I9PcrWdRI", "title": "Plug-and-Play Conversational Models", "abstract": "There has been considerable progress made towards conversational models that generate coherent and fluent responses; however, this often involves training large language models on large dialogue datasets, such as Reddit. These large conversational models provide little control over the generated responses, and this control is further limited in the absence of annotated conversational datasets for attribute specific generation that can be used for fine-tuning the model. In this paper, we first propose and evaluate plug-and-play methods for controllable response generation, which does not require dialogue specific datasets and does not rely on fine-tuning a large model. While effective, the decoding procedure induces considerable computational overhead, rendering the conversational model unsuitable for interactive usage. To overcome this, we introduce an approach that does not require further computation at decoding time, while also does not require any fine-tuning of a large language model. We demonstrate, through extensive automatic and human evaluation, a high degree\nof control over the generated conversational responses with regard to multiple desired attributes, while being fluent.", "authors": [ "Anonymous" ], "published": "2020-07-23", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "alephbert-pre-training-and-end-to-end", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=4IgzCL-ytZs", "url_pdf": "https://openreview.net/pdf?id=4IgzCL-ytZs", "title": "AlephBERT: Pre-training and End-to-End Language Models Evaluation from Sub-Word to Sentence Level", "abstract": "Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between. The problem is twofold. First, Hebrew resources for training large language models are not at the same order of magnitude as their English counterparts. 
Second, there are no accepted tasks and benchmarks to evaluate the progress of Hebrew PLMs on, and in particular, evaluation on sub-word (morphological) tasks. We aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on larger vocabulary and a larger dataset than any Hebrew PLM before. Moreover, we introduce a novel language-agnostic architecture that extracts all of the sub-word morphological segments encoded in contextualized word embedding vectors. Utilizing this new morphological component we offer a new PLM evaluation pipeline of multiple Hebrew tasks and benchmarks, that cover word-level, sub-word level and sentence level tasks. With AlephBERT we achieve state-of-the-art results compared against contemporary baselines. We make our AlephBERT model and evaluation pipeline publicly available, providing a single point of entry for evaluating and comparing Hebrew PLMs.", "authors": [ "Anonymous" ], "published": "2021-07-17", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-july-2021-7" }, { "id": "a-recipe-for-arbitrary-text-style-transfer-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=tTwG1UKHRB1", "url_pdf": "https://openreview.net/pdf?id=tTwG1UKHRB1", "title": "A Recipe For Arbitrary Text Style Transfer with Large Language Models", "abstract": "In this paper, we leverage large language models (LLMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as 'make this melodramatic' or 'insert a metaphor.'", "authors": [ "Anonymous" ], "published": "2021-06-16", "conference": null, "conference_url_abs": "https://openreview.net/forum?id=XnwgpvL4PRf", "conference_url_pdf": "https://openreview.net/pdf?id=XnwgpvL4PRf", "proceeding": "acl-arr-october-2021-10" }, { "id": "low-rank-softmax-can-have-unargmaxable", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=9lH-J1uPY2i", "url_pdf": "https://openreview.net/pdf?id=9lH-J1uPY2i", "title": "Low rank softmax can have unargmaxable classes in theory but rarely in practice", "abstract": "Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is some words may be impossible to predict via argmax, irrespective of input features, and empirically, this has been shown to happen in small language models (Demeter et al., 2020). In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such unargmaxable tokens in public models. We find that that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. 
We release our algorithms and code to the public.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "when-classifying-grammatical-role-bert-doesn", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=nB4zLyclbom", "url_pdf": "https://openreview.net/pdf?id=nB4zLyclbom", "title": "When classifying grammatical role, BERT doesn't care about word order... except when it matters", "abstract": "Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words cut, chef, and onion are more likely used to convey \"The chef cut the onion,\" not \"The onion cut the chef.\" Recent work has shown large language models to be surprisingly word order invariant, but crucially has largely considered natural prototypical inputs, where compositional meaning mostly matches lexical expectations. To overcome this confound, we probe grammatical role representation in BERT and GPT-2 on non-prototypical instances. Such instances are naturally occurring sentences with inanimate subjects or animate objects, or sentences where we systematically swap the arguments to make sentences like \"The onion cut the chef\". We find that, while early layer embeddings are largely lexical, word order is in fact crucial in defining the later-layer representations of words in semantically non-prototypical positions. Our experiments isolate the effect of word order on the contextualization process, and highlight how models use context in the uncommon, but critical, instances where it matters. ", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "impact-of-tokenization-on-language-models-an", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=4zfWLKLrehI", "url_pdf": "https://openreview.net/pdf?id=4zfWLKLrehI", "title": "Impact of Tokenization on Language Models: An Analysis for Turkish", "abstract": "Tokenization is an important text preprocessing step to prepare input tokens for language models. WordPiece and BPE are de-facto methods employed by large language models, such as BERT and GPT. However, the impact of tokenization can be different for the agglutinative languages having words with prefixes and suffixes, such as Turkic languages. We compare five tokenization methods, including a morphological-level tokenization that takes agglutinative language structure into account. We train tokenizers, and pre-train mini language models using RoBERTa pre-training procedure on Turkish OSCAR corpus. We then fine-tune our models on six downstream tasks. There are two main outcomes: (i) Morphological and word-level tokenizers outperform de-facto tokenizers in particular cases. 
(ii) Mini models can be competitive to larger state-of-the-art models, such that a 14-times smaller model can recover 94\\% of the performance of a larger model.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "teaching-models-new-apis-domain-agnostic-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=3u6rOiXR9RA", "url_pdf": "https://openreview.net/pdf?id=3u6rOiXR9RA", "title": "Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue", "abstract": "We demonstrate that large language models are able to simulate Task Oriented Dialogues in novel domains, provided only with an API implementation and a list of goals. We show these simulations can formulate online, automatic metrics that correlate well with human evaluations. Furthermore, by filtering for dialogues where goals are met, we can use simulation to repeatedly generate training data and improve the quality of the dialogues themselves. With no human intervention or domain-specific training data, our simulations bootstrap end-to-end models which achieve a 37\\% error reduction over baseline in previously unseen domains. By including as few as 32 domain-specific conversations, bootstrapped models can match the performance of a fully-supervised model with $10\\times$ more data.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "boosting-coherence-of-language-models-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=CJQqdS-fx3K", "url_pdf": "https://openreview.net/pdf?id=CJQqdS-fx3K", "title": "Boosting coherence of language models", "abstract": "Naturality of long-term information structure -- coherence -- remains a challenge in language generation. Large language models have insufficiently learned such structure, as their long-form generations differ from natural text in measures of coherence. To alleviate this divergence, we propose coherence boosting, an inference procedure that increases the effect of distant context on next-token prediction. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialog responses. We also find that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "contextualized-sensorimotor-norms-multi", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=DkbUnKnnKVk", "url_pdf": "https://openreview.net/pdf?id=DkbUnKnnKVk", "title": "Contextualized Sensorimotor Norms: multi-dimensional measures of sensorimotor strength for ambiguous English words, in context", "abstract": "Most large language models are trained on linguistic input alone, yet humans appear to ground their understanding of words in sensorimotor experience. 
A natural solution is to augment LM representations with human judgments of a word's sensorimotor associations (e.g., the Lancaster Sensorimotor Norms), but this raises another challenge: most words are ambiguous, and judgments of words in isolation fail to account for this multiplicity of meaning (e.g., \"wooden table\" vs. \"data table\". We attempted to address this problem by building a new lexical resource of contextualized sensorimotor judgments for 112 English words, each rated in four different contexts (448 sentences total). We show that these ratings encode overlapping but distinct information from the Lancaster Sensorimotor Norms, and that they also predict other measures of interest (e.g., relatedness), above and beyond measures derived from BERT. ", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "active-dialogue-simulation-in-conversational", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=5GdS7K37pKN", "url_pdf": "https://openreview.net/pdf?id=5GdS7K37pKN", "title": "Active Dialogue Simulation in Conversational Systems", "abstract": "Semantic parsing helps conversational systems in satisfying users' requests through dialogues. To train these models, collecting annotated dialogues as a dataset is a very expensive and time-consuming process. In this paper, our goal is to utilize large language models and active learning to replace Wizard-of-Oz (WoZ) collection via crowdsourcing for bootstrapping training data for task-driven semantic parsers. We first demonstrate the utility of utterances generated by GPT-3 when seeded with prior training dialogues, as evaluated by human judges. We then explore the use of parser uncertainty on generated outputs as a selection criteria for annotation and contrast this with a strategy based on Core-sets. Our pipeline leads to more useful examples on average, motivating future work on active generation for bootstrapping semantic parsers.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "data-augmentation-for-intent-classification", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=Yy2sTU8uCak", "url_pdf": "https://openreview.net/pdf?id=Yy2sTU8uCak", "title": "Data Augmentation for Intent Classification with Generic Large Language Models", "abstract": "Data augmentation alleviates the problem of data scarcity when training language models (LMs) by generating new examples based on the existing data. A successful approach to generate new samples is to fine-tune a pretrained LM on the task-specific data and then sample from the label-conditioned LM. However, fine-tuning can be difficult when task-specific data is scarce. In this work, we explore whether large pretrained LMs can be used to generate new useful samples without fine-tuning. For a given class, we propose concatenating few examples and prompt them to GPT-3 to generate new examples. We evaluate this method for few-shot intent classification on CLINC150 and SNIPS and find that data generated by GPT-3 greatly improves the performance of the intent classifiers. Importantly, we find that, without any LM fine-tuning, the gains brought by data augmentation with GPT-3 are similar to those reported in prior work on LM-based data augmentation. 
Experiments with models of different sizes show that larger LMs generate higher quality samples that yield higher accuracy gains.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "evaluating-the-text-to-sql-capabilities-of", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=lYli-bAuK54", "url_pdf": "https://openreview.net/pdf?id=lYli-bAuK54", "title": "Evaluating the Text-to-SQL Capabilities of Large Language Models", "abstract": "We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "re-a-study-for-restorable-embeddings", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=XVPvYByfPxV", "url_pdf": "https://openreview.net/pdf?id=XVPvYByfPxV", "title": "RE: A Study for Restorable Embeddings", "abstract": "As the number of model parameters increased, large language models achieved linguistic fluency and exhibited high performance in various natural language tasks without gradient updates because the models could retain more knowledge.\nHowever, the large model size makes difficult to apply the model to a task requiring domain knowledge not included in the training corpus, due to the fact that knowledge stored in model parameters is not controllable during generation and model parameter updates are costly.\nTo tackle the problem, we suggest separating the language model and knowledge, and divide the end-to-end language model into three parts: 1) encoding knowledge, 2) processing the encoded knowledge, and 3) restoring the processed knowledge embedding to natural language.\nIn this paper, we propose a model for learning restorable embeddings as a first step toward the study to separate the language model and knowledge.\nThe experimental results shows that the proposed model can restore most knowledge in 1-2 sentences by encoding knowledge in sentence-level embeddings and then restoring the embeddings back to the original sentence.\nWe also verify that the embeddings generated through our method significantly improves performance in the passage retrieval task.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "mitigating-gender-bias-in-machine-translation-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=nZdr6_UDzW", "url_pdf": "https://openreview.net/pdf?id=nZdr6_UDzW", "title": "Mitigating Gender Bias in Machine Translation through Adversarial Learning", "abstract": "Machine translation and other NLP systems often contain significant biases regarding sensitive attributes, such as gender or race, that worsen system performance and perpetuate harmful stereotypes. 
Recent preliminary research suggests that adversarial learning can be used as part of a model-agnostic bias mitigation method that requires no data modifications. However, adapting this strategy for machine translation and other modern NLP domains requires (1) restructuring training objectives in the context of fine-tuning pretrained large language models and (2) developing measures for gender or other protected variables for tasks in which these attributes must be deduced from the data itself.\n\nWe present an adversarial learning framework that addresses these challenges to mitigate gender bias in seq2seq machine translation. Our framework improves the disparity in translation quality for sentences with male vs. female entities by 86% for English-German translation and 91% for English-French translation, with minimal effect on translation quality. The results suggest that adversarial learning is a promising technique for mitigating gender bias in machine translation.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "context-aware-language-modeling-for-goal", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=qZEAjNzBHv", "url_pdf": "https://openreview.net/pdf?id=qZEAjNzBHv", "title": "Context-Aware Language Modeling for Goal-Oriented Dialogue Systems", "abstract": "Goal-oriented dialogue systems has long faced the trade-off between fluent language generation and task-specific control. While supervised learning with large language models are capable of producing realistic responses, how to steer such responses towards completing a specific task without sacrificing language quality remains an open question. In this work, by viewing a goal-oriented dialogue system as a reinforcement learning (RL) problem, we turn a supervised language model into a dynamics model and a behavioral cloning policy in a partially observable Markov decision process. This view allows RL techniques such as task relabeling and goal-conditioned policy to be naturally adopted as a form of data augmentation and task-specific fintuning of language models. We evaluate our method, Context-Aware Language Models (\\method), on a practical flight-booking task using AirDialogue. Empirically, \\method outperforms the previous state-of-the-art method by more than 10\\% in terms of task success, achieving human-level task performance on this dataset.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "pretraining-over-interactions-for-learning", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=f_zJvXNd4e", "url_pdf": "https://openreview.net/pdf?id=f_zJvXNd4e", "title": "Pretraining over Interactions for Learning Grounded Object Representations", "abstract": "Large language models have been criticized for their limited ability to reason about \\textit{affordances} - the actions that can be performed on an object. It has been argued that to accomplish this, models need some form of grounding, i.e., connection, to objects and how they interact in the physical world. Inspired by the way humans learn about the world through interaction, we develop an approach to learning physical properties directly. 
We introduce a dataset of 200k object interactions in a 3D virtual environment and a self-supervised pretraining objective for learning representations of these objects. We show with probing and clustering experiments that even in the zero-shot setting, derived models learn robust representations of objects and their affordances in an unsupervised manner. Our model outperforms pretrained language and vision models on an affordance prediction baseline, suggesting that pretraining on observed interactions encodes grounded information that is not readily learned in conventional text or vision models.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "alephbert-language-model-pre-training-and", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=xicP8EAgXFU", "url_pdf": "https://openreview.net/pdf?id=xicP8EAgXFU", "title": "AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level", "abstract": "Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances.\nWhile advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between.\nThe problem is twofold.\nFirst, so far, Hebrew resources for training large language models are not of the same magnitude as their English counterparts.\nSecond, there are no accepted benchmarks to evaluate the progress of Hebrew PLMs on, and in particular, sub-word (morphological) tasks.\nWe aim to remedy both aspects.\nWe present AlephBERT, a large PLM for Modern Hebrew, trained on larger vocabulary and a larger dataset than any Hebrew PLM before.\nMoreover, we introduce a novel language-agnostic architecture that can recover all of the sub-word morphological segments encoded in contextualized word embedding vectors.\nBased on this new morphological component we offer a new PLM evaluation suite consisting of multiple tasks and benchmarks, that cover sentence level word-level and sub-word level analyses.\nOn all tasks, AlephBERT obtains state-of-the-art results beyond contemporary Hebrew baselines. \nWe make our AlephBERT model, the morphological extraction mode, and the Hebrew evaluation suite publicly available, providing a single point of entry for assessing Hebrew PLMs.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "meta-learning-via-language-model-in-context-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=6BBmgbDaOYB", "url_pdf": "https://openreview.net/pdf?id=6BBmgbDaOYB", "title": "Meta-learning via Language Model In-context Tuning", "abstract": "The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. 
Inspired by the recent progress in large language models, we propose $\\textit{in-context tuning}$ (ICT), which recasts task adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, labeled in-context examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label given the input sequence on a collection of tasks.\nWe benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to MAML which adapts the model through gradient descent, our method leverages the inductive bias of pre-trained LMs to perform pattern matching, and outperforms MAML by an absolute $6\\%$ average AUC-ROC score on BinaryClfs, gaining more advantage with increasing model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning meta-trains the model to learn from in-context examples. On BinaryClfs, ICT improves the average AUC-ROC score by an absolute $10\\%$, and reduces the variance due to example ordering by 6x and example choices by 2x.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "solving-probability-and-statistics-problems-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=XOI2xQDpzqz", "url_pdf": "https://openreview.net/pdf?id=XOI2xQDpzqz", "title": "Solving Probability and Statistics Problems by Program Synthesis", "abstract": "We solve university level probability and statistics questions by program synthesis using OpenAI's Codex, a Transformer trained on text and fine-tuned on code. We transform course problems from MIT's 18.05 Introduction to Probability and Statistics and Harvard's STAT110 Probability into programming tasks. We then execute the generated code to get a solution. Since these course questions are grounded in probability, we often aim to have Codex generate probabilistic programs that simulate a large number of probabilistic dependencies, to compute its solution. Our approach requires prompt engineering to transform the question from its original form to an explicit, tractable form that results in a correct program and solution. To estimate the amount of work needed to translate an original question into its tractable form, we measure the similarity between original and transformed questions. Our work is the first to introduce a new dataset of university-level probability and statistics problems and solve these problems in a scalable fashion using the program synthesis capabilities of large language models.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "semantic-oriented-unlabeled-priming-for-large", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=A3o2LyGc9K_", "url_pdf": "https://openreview.net/pdf?id=A3o2LyGc9K_", "title": "Semantic-Oriented Unlabeled Priming for Large-Scale Language Models", "abstract": "Due to the high costs associated with finetuning large language models, various recent works propose to adapt them to specific tasks without any parameter updates through in-context learning. 
Unfortunately, for in-context learning there is currently no way to leverage unlabeled data, which is often much easier to obtain in large quantities than labeled examples. In this work, we therefore investigate ways to make use of unlabeled examples to improve the zero-shot performance of pretrained language models without any finetuning: We introduce Semantic-Oriented Unlabeled Priming (SOUP), a method that classifies examples by retrieving semantically similar unlabeled examples, assigning labels to them in a zero-shot fashion, and then using them for in-context learning. We also propose bag-of-contexts priming, a new priming strategy that is more suitable for our setting and enables the usage of more examples than fit into the context window.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "how-does-the-pre-training-objective-affect", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=SGgyIY2Xro", "url_pdf": "https://openreview.net/pdf?id=SGgyIY2Xro", "title": "How does the pre-training objective affect what large language models learn about linguistic properties?", "abstract": "Several pre-training objectives, such as masked language modeling (MLM), have been proposed to pre-train language models (e.g. BERT) with the aim of learning better language representations. However, to the best of our knowledge, no previous work so far has investigated how different pre-training objectives affect what BERT learns about linguistics properties. We hypothesize that linguistically motivated objectives (e.g. MLM) should help BERT to acquire better linguistic knowledge compared to using non-linguistically motivated objectives, i.e. hard for humans to guess the association between the input and the label to be predicted. To this end, we pre-train BERT with two linguistically motivated objectives and three non-linguistically motivated ones. We then probe for linguistic characteristics encoded in the representation of the resulting models. We find strong evidence that there is no actual differences in probing performance between the representations learned by the two different types of objectives. These surprising results question the dominant narrative of linguistically informed pre-training.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "it-s-my-job-to-be-repetitive-my-job-my-job", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=gro2GtKb2VY", "url_pdf": "https://openreview.net/pdf?id=gro2GtKb2VY", "title": "It's my Job to be Repetitive! My Job! My Job! -- Linking Repetitions to In-Context Learning in Language Models", "abstract": "Recent studies have shown that large language models can display surprising accuracy at learning tasks from few examples presented in the input context, which goes under the name of in-context learning. Other studies have shown that language models can sometimes display the undesirable behavior of falling back into loops in which an utterance is repeated infinitely often. Here, we observe that the model's capacity to produce repetitions goes well beyond frequent or well-formed utterances, and generalizes to repeating completely arbitrary sequences of tokens. 
Construing this as a simple form of in-context learning, we hypothesize that these two phenomena are linked through shared processing steps. With controlled experiments, we show that impairing the network from producing repetitions severely affects in-context learning, without reducing its overall predictive performance, thus supporting the proposed hypothesis.", "authors": [ "Anonymous" ], "published": "2021-10-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-october-2021-10" }, { "id": "wechsel-effective-initialization-of-subword", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=JcfISE1-u4", "url_pdf": "https://openreview.net/pdf?id=JcfISE1-u4", "title": "WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models", "abstract": "Recently, large pretrained language models (LMs) have gained popularity. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a method – called WECHSEL – to transfer English models to new languages. We exchange the tokenizer of the English model to a tokenizer in the target language and initialize token embeddings such that they are close to semantically similar English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages (French, German, Chinese and Swahili). WECHSEL improves over a previously proposed method for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch in the target language with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.", "authors": [ "Anonymous" ], "published": "2021-10-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-october-2021-10" }, { "id": "pureformer-do-we-even-need-attention", "arxiv_id": "2111.15588", "nips_id": null, "url_abs": "https://arxiv.org/abs/2111.15588v4", "url_pdf": "https://arxiv.org/pdf/2111.15588v4.pdf", "title": "SimpleTRON: Simple Transformer with O(N) Complexity", "abstract": "In this paper, we propose that the dot product pairwise matching attention layer, which is widely used in Transformer-based models, is redundant for the model performance. Attention, in its original formulation, has to be seen rather as a human-level tool to explore and/or visualize relevancy scores in sequential data. However, the way how it is constructed leads to significant computational complexity. Instead, we present SimpleTRON: Simple Transformer with O(N) Complexity, a simple and fast alternative without any approximation that, unlike other approximation models, does not have any architecture-related overhead and therefore can be seen as a purely linear Transformer-like model. This architecture, to the best of our knowledge, outperforms existing sub-quadratic attention approximation models on several tasks from the Long-Range Arena benchmark. 
Moreover, we show, that SimpleTRON can benefit from weight transfer from pretrained large language models, as its parameters can be fully transferable.", "authors": [ "Tomáš Mikolov", "Pavel Kordík", "Daniel Vašata", "Vojtěch Vančura", "Alexander Kovalenko", "Uladzislau Yorsh" ], "published": "2021-11-23", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "a-general-language-assistant-as-a-laboratory", "arxiv_id": "2112.00861", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.00861v3", "url_pdf": "https://arxiv.org/pdf/2112.00861v3.pdf", "title": "A General Language Assistant as a Laboratory for Alignment", "abstract": "Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a `preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.", "authors": [ "Jared Kaplan", "Chris Olah", "Sam McCandlish", "Jack Clark", "Tom Brown", "Dario Amodei", "Catherine Olsson", "Kamal Ndousse", "Jackson Kernion", "Danny Hernandez", "Zac Hatfield-Dodds", "Nelson Elhage", "Nova DasSarma", "Ben Mann", "Nicholas Joseph", "Andy Jones", "Tom Henighan", "Deep Ganguli", "Dawn Drain", "Anna Chen", "Yuntao Bai", "Amanda Askell" ], "published": "2021-12-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "can-openai-codex-and-other-large-language", "arxiv_id": "2112.02125", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.02125v3", "url_pdf": "https://arxiv.org/pdf/2112.02125v3.pdf", "title": "Examining Zero-Shot Vulnerability Repair with Large Language Models", "abstract": "Human developers can produce code with cybersecurity bugs. Can emerging 'smart' code completion tools help repair those bugs? In this work, we examine the use of large language models (LLMs) for code (such as OpenAI's Codex and AI21's Jurassic J-1) for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure code. This is difficult due to the numerous ways to phrase key information - both semantically and syntactically - with natural languages. We perform a large scale study of five commercially available, black-box, \"off-the-shelf\" LLMs, as well as an open-source model and our own locally-trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios. 
Our experiments demonstrate that while the approach has promise (the LLMs could collectively repair 100% of our synthetically generated and hand-crafted scenarios), a qualitative evaluation of the model's performance over a corpus of historical real-world examples highlights challenges in generating functionally correct code.", "authors": [ "Brendan Dolan-Gavitt", "Ramesh Karri", "Baleegh Ahmad", "Benjamin Tan", "Hammond Pearce" ], "published": "2021-12-03", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "jigsaw-large-language-models-meet-program", "arxiv_id": "2112.02969", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.02969v1", "url_pdf": "https://arxiv.org/pdf/2112.02969v1.pdf", "title": "Jigsaw: Large Language Models meet Program Synthesis", "abstract": "Large pre-trained language models such as GPT-3, Codex, and Google's language model are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, jigsaw has an important role to play in improving the accuracy of the systems.", "authors": [], "published": "2021-12-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "wechsel-effective-initialization-of-subword-1", "arxiv_id": "2112.06598", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.06598v2", "url_pdf": "https://arxiv.org/pdf/2112.06598v2.pdf", "title": "WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models", "abstract": "Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method -- called WECHSEL -- to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language and token embeddings are initialized such that they are semantically similar to the English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English RoBERTa and GPT-2 models to four languages (French, German, Chinese and Swahili). 
We also study the benefits of our method on very low-resource languages. WECHSEL improves over previously proposed methods for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.", "authors": [ "Navid Rekabsaz", "Fabian Paischer", "Benjamin Minixhofer" ], "published": "2021-12-13", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.naacl-main.293", "conference_url_pdf": "https://aclanthology.org/2022.naacl-main.293.pdf", "proceeding": "naacl-2022-7" }, { "id": "training-multi-layer-over-parametrized-neural-1", "arxiv_id": "2112.07628", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.07628v2", "url_pdf": "https://arxiv.org/pdf/2112.07628v2.pdf", "title": "Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time", "abstract": "We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ ($m=\\mathrm{poly}(n,d)$), which induces a prohibitively large weight matrix $W\\in \\mathbb{R}^{m\\times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network function in both forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that uses $m^2$ cost only in the initialization phase and achieves \\emph{a truly subquadratic cost per iteration} in terms of $m$, i.e., $m^{2-\\Omega(1)}$ per iteration. Our result has implications beyond standard over-parametrization theory, as it can be viewed as designing an efficient data structure on top of a pre-trained large model to further speed up the fine-tuning process, a core procedure in deploying large language models (LLMs).", "authors": [ "Ruizhe Zhang", "Lichen Zhang", "Zhao Song" ], "published": "2021-12-14", "conference": "training-multi-layer-over-parametrized-neural", "conference_url_abs": "https://openreview.net/forum?id=OMxLn4t03FG", "conference_url_pdf": "https://openreview.net/pdf?id=OMxLn4t03FG", "proceeding": null }, { "id": "language-models-are-not-models-of-language", "arxiv_id": "2112.07055", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.07055v2", "url_pdf": "https://arxiv.org/pdf/2112.07055v2.pdf", "title": "Large Language Models are not Models of Natural Language: they are Corpus Models", "abstract": "Natural Language Processing (NLP) has become one of the leading application areas in the current Artificial Intelligence boom. Transfer learning has enabled large deep learning neural networks trained on the language modeling task to vastly improve performance in almost all downstream language tasks. Interestingly, when the language models are trained with data that includes software code, they demonstrate remarkable abilities in generating functioning computer code from natural language specifications. We argue that this creates a conundrum for the claim that eliminative neural models are a radical restructuring in our understanding of cognition in that they eliminate the need for symbolic abstractions like generative phrase structure grammars. 
Because the syntax of programming languages is by design determined by phrase structure grammars, neural models that produce syntactic code are apparently uninformative about the theoretical foundations of programming languages. The demonstration that neural models perform well on tasks that involve clearly symbolic systems proves that they cannot be used as an argument that language and other cognitive systems are not symbolic. Finally, we argue as a corollary that the term language model is misleading and propose the adoption of the working term corpus model instead, which better reflects the genesis and contents of the model.", "authors": [ "Csaba Veres" ], "published": "2021-12-13", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "few-shot-semantic-parsing-with-language", "arxiv_id": "2112.08696", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.08696v2", "url_pdf": "https://arxiv.org/pdf/2112.08696v2.pdf", "title": "Few-Shot Semantic Parsing with Language Models Trained On Code", "abstract": "Large language models can perform semantic parsing with little training data when prompted with in-context examples. It has been shown that this can be improved by formulating the problem as paraphrasing into canonical utterances, which casts the underlying meaning representation into a controlled natural language-like representation. Intuitively, such models can more easily output canonical utterances as they are closer to the natural language used for pre-training. Recently, models also pre-trained on code, like OpenAI Codex, have risen in prominence. For semantic parsing tasks where we map natural language into code, such models may prove more adept. In this paper, we test this hypothesis and find that Codex performs better on such tasks than equivalent GPT-3 models. We evaluate on Overnight and SMCalFlow and find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps because meaning representations are structured similarly to code in these datasets.", "authors": [ "Benjamin Van Durme", "Richard Shin" ], "published": "2021-12-16", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.naacl-main.396", "conference_url_pdf": "https://aclanthology.org/2022.naacl-main.396.pdf", "proceeding": "naacl-2022-7" }, { "id": "crass-a-novel-data-set-and-benchmark-to-test", "arxiv_id": "2112.11941", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.11941v3", "url_pdf": "https://arxiv.org/pdf/2112.11941v3.pdf", "title": "CRASS: A Novel Data Set and Benchmark to Test Counterfactual Reasoning of Large Language Models", "abstract": "We introduce the CRASS (counterfactual reasoning assessment) data set and benchmark utilizing questionized counterfactual conditionals as a novel and powerful tool to evaluate large language models. We present the data set design and benchmark that supports scoring against a crowd-validated human baseline. We test six state-of-the-art models against our benchmark. 
Our results show that it poses a valid challenge for these models and opens up considerable room for their improvement.", "authors": [ "Frank Binder", "Jörg Frohberg" ], "published": "2021-12-22", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.lrec-1.229", "conference_url_pdf": "https://aclanthology.org/2022.lrec-1.229.pdf", "proceeding": "lrec-2022-6" }, { "id": "what-do-large-language-models-learn-about", "arxiv_id": "2112.13834", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.13834v2", "url_pdf": "https://arxiv.org/pdf/2112.13834v2.pdf", "title": "What do Large Language Models Learn about Scripts?", "abstract": "Script Knowledge (Schank and Abelson, 1975) has long been recognized as crucial for language understanding as it can help in filling in unstated information in a narrative. However, such knowledge is expensive to produce manually and difficult to induce from text due to reporting bias (Gordon and Van Durme, 2013). In this work, we are interested in the scientific question of whether explicit script knowledge is present and accessible through pre-trained generative language models (LMs). To this end, we introduce the task of generating full event sequence descriptions (ESDs) given a scenario in the form of natural language prompts. In zero-shot probing experiments, we find that generative LMs produce poor ESDs with mostly omitted, irrelevant, repeated or misordered events. To address this, we propose a pipeline-based script induction framework (SIF) which can generate good quality ESDs for unseen scenarios (e.g., bake a cake). SIF is a two-stage framework that fine-tunes an LM on a small set of ESD examples in the first stage. In the second stage, the ESD generated for an unseen scenario is post-processed using RoBERTa-based models to filter irrelevant events, remove repetitions, and reorder the temporally misordered events. Through automatic and manual evaluations, we demonstrate that SIF yields substantial improvements ($1$-$3$ BLEU points) over a fine-tuned LM. However, manual analysis shows that there is great room for improvement, offering a new research direction for inducing script knowledge.", "authors": [ "Rachel Rudinger", "Abhilasha Sancheti" ], "published": "2021-12-27", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.starsem-1.1", "conference_url_pdf": "https://aclanthology.org/2022.starsem-1.1.pdf", "proceeding": "sem-naacl-2022-7" }, { "id": "efficient-hierarchical-domain-adaptation-for", "arxiv_id": "2112.08786", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.08786v2", "url_pdf": "https://arxiv.org/pdf/2112.08786v2.pdf", "title": "Efficient Hierarchical Domain Adaptation for Pretrained Language Models", "abstract": "The remarkable success of large language models has been driven by dense models trained on massive unlabeled, unstructured corpora. These corpora typically contain text from diverse, heterogeneous sources, but information about the source of the text is rarely used during training. Transferring their knowledge to a target domain is typically done by continuing training in-domain. In this paper, we introduce a method to permit domain adaptation to many diverse domains using a computationally efficient adapter approach. Our method is based on the observation that textual domains are partially overlapping, and we represent domains as a hierarchical tree structure where each node in the tree is associated with a set of adapter weights. 
When combined with a frozen pretrained language model, this approach enables parameter sharing among related domains, while avoiding negative interference between unrelated ones. Experimental results with GPT-2 and a large fraction of the 100 most represented websites in C4 show across-the-board improvements in-domain. We additionally provide an inference-time algorithm for a held-out domain and show that averaging over multiple paths through the tree enables further gains in generalization, while adding only a marginal cost to inference.", "authors": [ "Jesse Dodge", "Matthew E. Peters", "Alexandra Chronopoulou" ], "published": "2021-12-16", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.naacl-main.96", "conference_url_pdf": "https://aclanthology.org/2022.naacl-main.96.pdf", "proceeding": "naacl-2022-7" }, { "id": "reframing-human-ai-collaboration-for", "arxiv_id": "2112.08674", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.08674v2", "url_pdf": "https://arxiv.org/pdf/2112.08674v2.pdf", "title": "Reframing Human-AI Collaboration for Generating Free-Text Explanations", "abstract": "Large language models are increasingly capable of generating fluent-appearing text with relatively little task-specific supervision. But can these models accurately explain classification decisions? We consider the task of generating free-text explanations using human-written examples in a few-shot manner. We find that (1) authoring higher quality prompts results in higher quality generations; and (2) surprisingly, in a head-to-head comparison, crowdworkers often prefer explanations generated by GPT-3 to crowdsourced explanations in existing datasets. Our human studies also show, however, that while models often produce factual, grammatical, and sufficient explanations, they have room to improve along axes such as providing novel information and supporting the label. We create a pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop. Despite the intrinsic subjectivity of acceptability judgments, we demonstrate that acceptability is partially correlated with various fine-grained attributes of explanations. Our approach is able to consistently filter GPT-3-generated explanations deemed acceptable by humans.", "authors": [ "Yejin Choi", "Mark Riedl", "Swabha Swayamdipta", "Jack Hessel", "Sarah Wiegreffe" ], "published": "2021-12-16", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.naacl-main.47", "conference_url_pdf": "https://aclanthology.org/2022.naacl-main.47.pdf", "proceeding": "naacl-2022-7" }, { "id": "improving-scripts-with-a-memory-of-natural", "arxiv_id": "2112.09737", "nips_id": null, "url_abs": "https://arxiv.org/abs/2112.09737v2", "url_pdf": "https://arxiv.org/pdf/2112.09737v2.pdf", "title": "Learning to Repair: Repairing model output errors after deployment using a dynamic memory of feedback", "abstract": "Large language models (LMs), while powerful, are not immune to mistakes, but can be difficult to retrain. Our goal is for an LM to continue to improve after deployment, without retraining, using feedback from the user. Our approach pairs an LM with (i) a growing memory of cases where the user identified an output error and provided general feedback on how to correct it, and (ii) a corrector model, trained to translate this general feedback into specific edits to repair the model output. 
Given a new, unseen input, our model can then use feedback from similar, past cases to repair output errors that may occur. We instantiate our approach using an existing, fixed model for script generation that takes a goal (e.g., \"bake a cake\") and generates a partially ordered sequence of actions to achieve that goal, sometimes containing errors. Our memory-enhanced system, FBNet, learns to apply user feedback to repair such errors (up to 30 points improvement), while making a start at avoiding similar past mistakes on new, unseen examples (up to 7 points improvement in a controlled setting). This is a first step towards strengthening deployed models, potentially broadening their utility. Our code and data are available at https://github.com/allenai/interscript/.", "authors": [ "Yiming Yang", "Peter Clark", "Aman Madaan", "Niket Tandon" ], "published": "2021-12-16", "conference": null, "conference_url_abs": "https://aclanthology.org/2022.findings-naacl.26", "conference_url_pdf": "https://aclanthology.org/2022.findings-naacl.26.pdf", "proceeding": "findings-naacl-2022-7" }, { "id": "bert-for-sentiment-analysis-pre-trained-and", "arxiv_id": "2201.03382", "nips_id": null, "url_abs": "https://arxiv.org/abs/2201.03382v1", "url_pdf": "https://arxiv.org/pdf/2201.03382v1.pdf", "title": "BERT for Sentiment Analysis: Pre-trained and Fine-Tuned Alternatives", "abstract": "BERT has revolutionized the NLP field by enabling transfer learning with large language models that can capture complex textual patterns, reaching the state-of-the-art for an impressive number of NLP applications. For text classification tasks, BERT has already been extensively explored. However, aspects like how to better cope with the different embeddings provided by the BERT output layer and the usage of language-specific instead of multilingual models are not well studied in the literature, especially for the Brazilian Portuguese language. The purpose of this article is to conduct an extensive experimental study regarding different strategies for aggregating the features produced in the BERT output layer, with a focus on the sentiment analysis task. The experiments include BERT models trained with Brazilian Portuguese corpora and the multilingual version, contemplating multiple aggregation strategies and open-source datasets with predefined training, validation, and test partitions to facilitate the reproducibility of the results. BERT achieved the highest ROC-AUC values for the majority of cases as compared to TF-IDF. Nonetheless, TF-IDF represents a good trade-off between predictive performance and computational cost.", "authors": [ "João Filho", "Frederico Souza" ], "published": "2022-01-10", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "tiltedbert-resource-adjustable-version-of", "arxiv_id": "2201.03327", "nips_id": null, "url_abs": "https://arxiv.org/abs/2201.03327v8", "url_pdf": "https://arxiv.org/pdf/2201.03327v8.pdf", "title": "Latency Adjustable Transformer Encoder for Language Understanding", "abstract": "Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adjusts the inference computational cost adaptively for a desired inference latency speedup. 
In the fine-tuning phase, the proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric. After the fine-tuning phase, with the novel offline-tuning property, the inference latency of the model can be adjusted across a wide range of inference speedup settings without any further training. The proposed method is applied to the BERT_base, GPT-2 and Flan-T5 models for evaluation. Extensive experiments show that most of the word-vectors in higher Transformer layers contribute less to the subsequent layers; hence, they can be eliminated to improve the inference latency. Experimental results on extensive sentiment analysis, classification, text generation tasks and regression benchmarks like GLUE show that the method is effective across various datasets with minimal impact on the input's global context. The method was also evaluated under the instruction tuning paradigm, and its performance was measured using different types of prompting. The proposed method mathematically and experimentally improves the inference latency of BERT_base and GPT-2 by up to 4.8 and 3.72 times, respectively, with less than a 0.75% accuracy drop and passable perplexity on average. The suggested approach posits that in Large Language Models (LLMs), although the complete network is necessary for training, it can be truncated during the fine-tuning phase.", "authors": [ "Mohammad Sharifkhani", "Sajjad Kachuee" ], "published": "2022-01-10", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "language-models-as-zero-shot-planners-1", "arxiv_id": "2201.07207", "nips_id": null, "url_abs": "https://arxiv.org/abs/2201.07207v2", "url_pdf": "https://arxiv.org/pdf/2201.07207v2.pdf", "title": "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents", "abstract": "Can world knowledge learned by large language models (LLMs) be used to act in interactive environments? In this paper, we investigate the possibility of grounding high-level tasks, expressed in natural language (e.g. \"make breakfast\"), to a chosen set of actionable steps (e.g. \"open fridge\"). While prior work focused on learning from explicit step-by-step examples of how to act, we surprisingly find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into mid-level plans without any further training. However, the plans produced naively by LLMs often cannot map precisely to admissible actions. We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions. Our evaluation in the recent VirtualHome environment shows that the resulting method substantially improves executability over the LLM baseline. A human evaluation reveals a trade-off between executability and correctness but shows a promising sign towards extracting actionable knowledge from language models. 
Website at https://huangwl18.github.io/language-planner", "authors": [ "Igor Mordatch", "Deepak Pathak", "Pieter Abbeel", "Wenlong Huang" ], "published": "2022-01-18", "conference": "language-models-as-zero-shot-planners", "conference_url_abs": "https://openreview.net/forum?id=6NT1a56mNim", "conference_url_pdf": "https://openreview.net/pdf?id=6NT1a56mNim", "proceeding": null }, { "id": "coauthor-designing-a-human-ai-collaborative", "arxiv_id": "2201.06796", "nips_id": null, "url_abs": "https://arxiv.org/abs/2201.06796v2", "url_pdf": "https://arxiv.org/pdf/2201.06796v2.pdf", "title": "CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities", "abstract": "Large language models (LMs) offer unprecedented language generation capabilities and exciting opportunities for interaction design. However, their highly context-dependent capabilities are difficult to grasp and are often subjectively interpreted. In this paper, we argue that by curating and analyzing large interaction datasets, the HCI community can foster more incisive examinations of LMs' generative capabilities. Exemplifying this approach, we present CoAuthor, a dataset designed for revealing GPT-3's capabilities in assisting creative and argumentative writing. CoAuthor captures rich interactions between 63 writers and four instances of GPT-3 across 1445 writing sessions. We demonstrate that CoAuthor can address questions about GPT-3's language, ideation, and collaboration capabilities, and reveal its contribution as a writing \"collaborator\" under various definitions of good collaboration. Finally, we discuss how this work may facilitate a more principled discussion around LMs' promises and pitfalls in relation to interaction design. The dataset and an interface for replaying the writing sessions are publicly available at https://coauthor.stanford.edu.", "authors": [ "Qian Yang", "Percy Liang", "Mina Lee" ], "published": "2022-01-18", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null } ] }
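The listing above is a single page of the paginated JSON response: "results" holds the paper records, "count" the total number of matches, and "next"/"previous" the neighboring page URLs. Below is a minimal sketch of how a client might walk every page of this endpoint. It assumes the Python "requests" package, and it assumes the usual pagination convention that "next" is an absolute URL which becomes null after the last page; the helper name iter_papers is ours, not part of the API.

import requests

# Endpoint shown in this document; "q" and "page" are the query parameters used above.
BASE_URL = "https://paperswithcode.com/api/v1/papers/"

def iter_papers(query):
    """Yield every paper record matching `query`, following "next" links page by page."""
    url = BASE_URL
    params = {"q": query, "page": 1}
    while url:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        # "results" is the list of paper objects, shaped like the entries shown above.
        yield from data["results"]
        # Assumption: "next" is an absolute URL (as in this response) and None once
        # the final page has been returned, which terminates the loop.
        url = data["next"]
        params = None  # the "next" URL already encodes "q" and "page"

# Usage example: print the first five titles for the query used in this listing.
for i, paper in enumerate(iter_papers("Large Language Models")):
    print(paper["title"], "->", paper["url_abs"])
    if i == 4:
        break

Because the "next" URL carries the full query string, the sketch only sends explicit parameters on the first request and then follows the server-provided links, which keeps the client correct even if the server changes its page-size defaults.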