Paper List
Return a paginated listing of all papers. The optional q parameter filters the listing by a free-text search query, ordering sorts it by the given field (prefix the field with - for descending order, as in -arxiv_id), and page selects a page of results.
GET /api/v1/papers/?ordering=-arxiv_id&q=Large+Language+Models
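The listing is paginated: each response carries count, next, previous, and results fields, where next holds the URL of the following page (null on the last page). Below is a minimal sketch of walking the pages with the third-party Python requests library; the endpoint and field names (next, results, published, title) come from the example response further down, while the helper function itself is illustrative, not part of the API.

    # Minimal sketch: page through /api/v1/papers/ by following "next" links.
    # Requires the third-party `requests` package (pip install requests).
    import requests

    BASE_URL = "https://paperswithcode.com/api/v1/papers/"

    def fetch_all_papers(query: str, ordering: str = "-arxiv_id") -> list[dict]:
        """Collect every result page for the given search query."""
        papers: list[dict] = []
        url = BASE_URL
        params = {"q": query, "ordering": ordering}
        while url:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            page = response.json()
            papers.extend(page["results"])
            # The "next" URL already embeds q, ordering, and page, so the
            # explicit params are only needed on the first request.
            url, params = page["next"], None
        return papers

    if __name__ == "__main__":
        results = fetch_all_papers("Large Language Models")
        print(len(results), "papers fetched")
        for paper in results[:3]:
            print(paper["published"], "-", paper["title"])

An example response for the request above (count elided):

{ "count": ..., "next": "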
https://paperswithcode.com/api/v1/papers/?ordering=-arxiv_id&page=2&q=Large+Language+Models", "previous": null, "results": [ { "id": "fine-tuning-large-language-models-for", "arxiv_id": null, "nips_id": null, "url_abs": "https://link.springer.com/chapter/10.1007/978-3-031-36021-3_15", "url_pdf": "https://link.springer.com/chapter/10.1007/978-3-031-36021-3_15", "title": "Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets", "abstract": "We study the ability of pretrained large language models (LLM) to answer questions from online question answering fora such as Stack Overflow. We consider question-answer pairs where the main part of the answer consists of source code. On two benchmark datasets—CoNaLa and a newly collected dataset based on Stack Overflow—we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. We use publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, and after the proposed fine-tuning achieve a BLEU score of 0.4432 on the CoNaLa test set, significantly exceeding previous state of the art for this task.", "authors": [ "Artem Aliev", "Sergey Nikolenko", "Maxim Omelchenko", "Sergey Kovalchuk", "Vadim Lomshakov" ], "published": "2023-06-26", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "iccs-international-conference-on" }, { "id": "evaluation-of-large-language-model", "arxiv_id": null, "nips_id": null, "url_abs": "https://www.medrxiv.org/content/10.1101/2024.05.17.24307411v1", "url_pdf": "https://www.medrxiv.org/content/10.1101/2024.05.17.24307411v1.full.pdf", "title": "Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark", "abstract": "Background The ability of large language models (LLMs) to interpret and generate human-like text has been accompanied with speculation about their application in medicine and clinical research. There is limited data available to inform evidence-based decisions on the appropriateness for specific use cases.\r\n\r\nMethods We evaluated and compared four general-purpose LLMs (GPT-4, GPT-3.5-turbo, Flan-T5-XXL, and Zephyr-7B-Beta) and a healthcare-specific LLM (MedLLaMA-13B) on a set of 13 datasets – referred to as the Biomedical Language Understanding and Reasoning Benchmark (BLURB) – covering six commonly needed medical natural language processing tasks: named entity recognition (NER); relation extraction; population, interventions, comparators, and outcomes (PICO); sentence similarity; document classification; and question-answering. All models were evaluated without modification. Model performance was assessed according to a range of prompting strategies (formalised as a systematic, reusable prompting framework) and relied on the standard, task-specific evaluation metrics defined by BLURB.\r\n\r\nResults Across all tasks, GPT-4 outperformed other LLMs, followed by Flan-T5-XXL and GPT-3.5-turbo, then Zephyr-7b-Beta and MedLLaMA-13B. The most performant prompts for GPT-4 and Flan-T5-XXL both outperformed the previously-reported best results for the PubMedQA task. The domain-specific MedLLaMA-13B achieved lower scores for most tasks except for question-answering tasks. 
We observed a substantial impact of strategically editing the prompt describing the task and a consistent improvement in performance when including examples semantically similar to the input text in the prompt.\r\n\r\nConclusion These results provide evidence of the potential LLMs may have for medical application and highlight the importance of robust evaluation before adopting LLMs for any specific use cases. Continuing to explore how these emerging technologies can be adapted for the healthcare setting, paired with human expertise, and enhanced through quality control measures will be important research to allow responsible innovation with LLMs in the medical area.", "authors": [ "Christina Mack", "Khaldoun Zine El Abidine", "Jay Nanavati", "Katharine Roth", "Kathryn Rough", "Rodrigo de Oliveira", "Matthew Garber", "Jude LaFleur", "Francesco Ronzano", "Hui Feng" ], "published": "2024-05-17", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "medrxiv-2024-5" }, { "id": "biolay-ak-ss-at-biolaysumm-domain-adaptation", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2024.bionlp-1.69", "url_pdf": "https://aclanthology.org/2024.bionlp-1.69.pdf", "title": "BioLay_AK_SS at BioLaySumm: Domain Adaptation by Two-Stage Fine-Tuning of Large Language Models used for Biomedical Lay Summary Generation", "abstract": "Lay summarization is essential but challenging, as it simplifies scientific information for non-experts and keeps them updated with the latest scientific knowledge. In our participation in the Shared Task: Lay Summarization of Biomedical Research Articles @ BioNLP Workshop (Goldsack et al., 2024), ACL 2024, we conducted a comprehensive evaluation on abstractive summarization of biomedical literature using Large Language Models (LLMs) and assessed the performance using ten metrics across three categories: relevance, readability, and factuality, using eLife and PLOS datasets provided by the organizers. We developed a two-stage framework for lay summarization of biomedical scientific articles. In the first stage, we generated summaries using BART and PEGASUS LLMs by fine-tuning them on the given datasets. In the second stage, we combined the generated summaries and input them to BioBART, and then fine-tuned it on the same datasets. Our findings show that combining general and domain-specific LLMs enhances performance.", "authors": [ "Seba Susan", "Akanksha Karotia" ], "published": "2024-08-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "the-23rd-workshop-on-biomedical-natural" }, { "id": "synthesize-step-by-step-tools-templates-and-1", "arxiv_id": null, "nips_id": null, "url_abs": "http://openaccess.thecvf.com//content/CVPR2024/html/Li_Synthesize_Step-by-Step_Tools_Templates_and_LLMs_as_Data_Generators_for_CVPR_2024_paper.html", "url_pdf": "http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Synthesize_Step-by-Step_Tools_Templates_and_LLMs_as_Data_Generators_for_CVPR_2024_paper.pdf", "title": "Synthesize Step-by-Step: Tools Templates and LLMs as Data Generators for Reasoning-Based Chart VQA", "abstract": " Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work we address the lack of reasoning ability by data augmentation. 
We leverage Large Language Models (LLMs) which have shown to have strong reasoning ability as an automatic data annotator that generates question-answer annotations for chart images. The key innovation in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data generator learns to decompose the complex question into step-by-step sub-questions (rationales) which are then used to derive the final answer using external tools i.e. Python. This step-wise generation procedure is trained on synthetic data generated using a template-based QA generation pipeline. Experimental results highlight the significance of the proposed step-by-step generation. By training with the LLM-augmented data (LAMENDA) we significantly enhance the chart VQA models achieving the state-of-the-art accuracy on the ChartQA and PlotQA datasets. In particular our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset which needs strong reasoning. We hope our work underscores the potential of synthetic data and encourages further exploration of data augmentation using LLMs for reasoning-heavy tasks. ", "authors": [ "Shabnam Ghadar", "Peng Tang", "Bhavan Jasani", "Zhuowan Li" ], "published": "2024-01-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "cvpr-2024-1" }, { "id": "low-rank-softmax-can-have-unargmaxable", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=9lH-J1uPY2i", "url_pdf": "https://openreview.net/pdf?id=9lH-J1uPY2i", "title": "Low rank softmax can have unargmaxable classes in theory but rarely in practice", "abstract": "Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is some words may be impossible to predict via argmax, irrespective of input features, and empirically, this has been shown to happen in small language models (Demeter et al., 2020). In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such unargmaxable tokens in public models. We find that that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. We release our algorithms and code to the public.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "unmet-creativity-support-needs-in", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.in2writing-1.11", "url_pdf": "https://aclanthology.org/2022.in2writing-1.11.pdf", "title": "Unmet Creativity Support Needs in Computationally Supported Creative Writing", "abstract": "Large language models (LLMs) enabled by the datasets and computing power of the last decade have recently gained popularity for their capacity to generate plausible natural language text from human-provided prompts. This ability makes them appealing to fiction writers as prospective co-creative agents, addressing the common challenge of writer’s block, or getting unstuck. 
However, creative writers face additional challenges, including maintaining narrative consistency, developing plot structure, architecting reader experience, and refining their expressive intent, which are not well-addressed by current LLM-backed tools. In this paper, we define these needs by grounding them in cognitive and theoretical literature, then survey previous computational narrative research that holds promise for supporting each of them in a co-creative setting.", "authors": [ "Chris Martens", "Max Kreminski" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "in2writing-acl-2022-5" }, { "id": "pretraining-over-interactions-for-learning", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=f_zJvXNd4e", "url_pdf": "https://openreview.net/pdf?id=f_zJvXNd4e", "title": "Pretraining over Interactions for Learning Grounded Object Representations", "abstract": "Large language models have been criticized for their limited ability to reason about \\textit{affordances} - the actions that can be performed on an object. It has been argued that to accomplish this, models need some form of grounding, i.e., connection, to objects and how they interact in the physical world. Inspired by the way humans learn about the world through interaction, we develop an approach to learning physical properties directly. We introduce a dataset of 200k object interactions in a 3D virtual environment and a self-supervised pretraining objective for learning representations of these objects. We show with probing and clustering experiments that even in the zero-shot setting, derived models learn robust representations of objects and their affordances in an unsupervised manner. Our model outperforms pretrained language and vision models on an affordance prediction baseline, suggesting that pretraining on observed interactions encodes grounded information that is not readily learned in conventional text or vision models.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "when-classifying-grammatical-role-bert-doesn", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=nB4zLyclbom", "url_pdf": "https://openreview.net/pdf?id=nB4zLyclbom", "title": "When classifying grammatical role, BERT doesn't care about word order... except when it matters", "abstract": "Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words cut, chef, and onion are more likely used to convey \"The chef cut the onion,\" not \"The onion cut the chef.\" Recent work has shown large language models to be surprisingly word order invariant, but crucially has largely considered natural prototypical inputs, where compositional meaning mostly matches lexical expectations. To overcome this confound, we probe grammatical role representation in BERT and GPT-2 on non-prototypical instances. Such instances are naturally occurring sentences with inanimate subjects or animate objects, or sentences where we systematically swap the arguments to make sentences like \"The onion cut the chef\". We find that, while early layer embeddings are largely lexical, word order is in fact crucial in defining the later-layer representations of words in semantically non-prototypical positions. 
Our experiments isolate the effect of word order on the contextualization process, and highlight how models use context in the uncommon, but critical, instances where it matters. ", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "how-does-the-pre-training-objective-affect", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=SGgyIY2Xro", "url_pdf": "https://openreview.net/pdf?id=SGgyIY2Xro", "title": "How does the pre-training objective affect what large language models learn about linguistic properties?", "abstract": "Several pre-training objectives, such as masked language modeling (MLM), have been proposed to pre-train language models (e.g. BERT) with the aim of learning better language representations. However, to the best of our knowledge, no previous work so far has investigated how different pre-training objectives affect what BERT learns about linguistics properties. We hypothesize that linguistically motivated objectives (e.g. MLM) should help BERT to acquire better linguistic knowledge compared to using non-linguistically motivated objectives, i.e. hard for humans to guess the association between the input and the label to be predicted. To this end, we pre-train BERT with two linguistically motivated objectives and three non-linguistically motivated ones. We then probe for linguistic characteristics encoded in the representation of the resulting models. We find strong evidence that there is no actual differences in probing performance between the representations learned by the two different types of objectives. These surprising results question the dominant narrative of linguistically informed pre-training.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "caisa-at-wassa-2022-adapter-tuning-for", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.wassa-1.31", "url_pdf": "https://aclanthology.org/2022.wassa-1.31.pdf", "title": "CAISA at WASSA 2022: Adapter-Tuning for Empathy Prediction", "abstract": "We build a system that leverages adapters, a light weight and efficient method for leveraging large language models to perform the task Em- pathy and Distress prediction tasks for WASSA 2022. In our experiments, we find that stacking our empathy and distress adapters on a pre-trained emotion lassification adapter performs best compared to full fine-tuning approaches and emotion feature concatenation. We make our experimental code publicly available", "authors": [ "Lucie Flek", "Charles Welch", "Allison Lahnala" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "wassa-acl-2022-5" }, { "id": "automatic-text-based-speech-overlap", "arxiv_id": null, "nips_id": null, "url_abs": "https://repository.tudelft.nl/record/uuid:d0de72bd-f847-401d-824c-cf3cad7d8e37", "url_pdf": "https://repository.tudelft.nl/file/File_01a9e3ba-712e-40be-9436-9f9a18ea04c2?preview=1", "title": "Automatic text-based speech overlap classification: A novel approach using Large Language Models", "abstract": "Meetings are the keystone of a good company. They allow for quick decision making, multiple-perspective problem solving and effective communication. 
However, most employees and managers have a negative view on the efficiency and quality of their meetings. High quality meetings where every participant feels equally heard and respected is crucial for having positive meeting sentiment within a company. One of the most influential aspects of meetings are speech overlaps. Overlaps range from short utterances such as backchannels, to follow up questions and clarifications, to complete interruptions. In non-competitive cases, the overlapped speaker feels that the other participants are listening and actively engaging with them during the meeting. In competitive cases, the overlapped speaker can feel interrupted and unimportant. Therefore, competitive overlaps often have a negative impact on the course of the discussion and the overlappee's meeting sentiment. In problematic cases, these overlaps should be reduced to a minimum. In order to do this, overlaps must be classified as either competitive or non-competitive. This paper proposes a novel approach to overlap classification, namely that of text-based classification through Large Language Models. Four different prompt designs are used and tested on the two best performing and publicly available models, GPT-3.5-turbo and GPT-4. The results show that the in-context learning approach using the GPT-4 model results in the most accurate classifications. When comparing the results to previous work, it is observed that the text-based GPT-4 model matches carefully engineered neural networks that even adopt a multi-modular approach.", "authors": [ "J.H. Domhof" ], "published": "2023-06-25", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "a-thesis-submitted-to-eemcs-faculty-delft" }, { "id": "provably-confidential-language-modelling", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=mQyHxnZejkB", "url_pdf": "https://openreview.net/pdf?id=mQyHxnZejkB", "title": "Provably Confidential Language Modelling", "abstract": "Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter these privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. 
Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "alephbert-language-model-pre-training-and", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=xicP8EAgXFU", "url_pdf": "https://openreview.net/pdf?id=xicP8EAgXFU", "title": "AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level", "abstract": "Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances.\nWhile advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between.\nThe problem is twofold.\nFirst, so far, Hebrew resources for training large language models are not of the same magnitude as their English counterparts.\nSecond, there are no accepted benchmarks to evaluate the progress of Hebrew PLMs on, and in particular, sub-word (morphological) tasks.\nWe aim to remedy both aspects.\nWe present AlephBERT, a large PLM for Modern Hebrew, trained on larger vocabulary and a larger dataset than any Hebrew PLM before.\nMoreover, we introduce a novel language-agnostic architecture that can recover all of the sub-word morphological segments encoded in contextualized word embedding vectors.\nBased on this new morphological component we offer a new PLM evaluation suite consisting of multiple tasks and benchmarks, that cover sentence level word-level and sub-word level analyses.\nOn all tasks, AlephBERT obtains state-of-the-art results beyond contemporary Hebrew baselines. \nWe make our AlephBERT model, the morphological extraction mode, and the Hebrew evaluation suite publicly available, providing a single point of entry for assessing Hebrew PLMs.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "gender-and-representation-bias-in-gpt-3", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.nuse-1.5", "url_pdf": "https://aclanthology.org/2021.nuse-1.5.pdf", "title": "Gender and Representation Bias in GPT-3 Generated Stories", "abstract": "Using topic modeling and lexicon-based word similarity, we find that stories generated by GPT-3 exhibit many known gender stereotypes. Generated stories depict different topics and descriptions depending on GPT-3’s perceived gender of the character in a prompt, with feminine characters more likely to be associated with family and appearance, and described as less powerful than masculine characters, even when associated with high power verbs in a prompt. 
Our study raises questions on how one can avoid unintended social biases when using large language models for storytelling.", "authors": [ "David Bamman", "Li Lucy" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-nuse-2021-6" }, { "id": "a-data-bootstrapping-recipe-for-low-resource-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.conll-1.45", "url_pdf": "https://aclanthology.org/2021.conll-1.45.pdf", "title": "A Data Bootstrapping Recipe for Low-Resource Multilingual Relation Classification", "abstract": "Relation classification (sometimes called ‘extraction’) requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages, because they are syntactically and morphologically diverse, as well as different from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well-served by public data sets. In response, we present IndoRE, a dataset with 39K entity- and relation-tagged gold sentences in three Indian languages, plus English. We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information and provides competitive monolingual relation classification. Using this system, we explore and compare transfer mechanisms between languages. In particular, we study the accuracy-efficiency tradeoff between expensive gold instances vs. translated and aligned ‘silver’ instances.", "authors": [ "Soumen Chakrabarti", "Niloy Ganguly", "Animesh Mukherjee", "Bidisha Samanta", "Arijit Nag" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "conll-emnlp-2021-11" }, { "id": "molreflect-towards-fine-grained-in-context", "arxiv_id": null, "nips_id": null, "url_abs": "https://arxiv.org/html/2411.14721v1", "url_pdf": "https://arxiv.org/pdf/2411.14721v1", "title": "MolReFlect: Towards Fine-grained In-Context Alignment between Molecules and Texts", "abstract": "Molecule discovery is a pivotal research field, impacting everything from the medicines we take to the materials we use. Recently, Large Language Models (LLMs) have been widely adopted in molecule understanding and generation, yet the alignments between molecules and their corresponding captions remain a significant challenge. Previous endeavours often treat the molecule as a general SMILES string or molecular graph, neglecting the fine-grained alignments between the molecular sub-structures and the descriptive textual phrases, which are crucial for accurate and explainable predictions. In this case, we introduce MolReFlect, a novel teacher-student framework designed to contextually perform the molecule-caption alignments in a fine-grained way. Our approach initially leverages a larger teacher LLM to label the detailed alignments by directly extracting critical phrases from molecule captions or SMILES strings and implying them to corresponding sub-structures or characteristics. To refine these alignments, we propose In-Context Selective Reflection, which retrieves previous extraction results as context examples for teacher LLM to reflect and lets a smaller student LLM select from in-context reflection and previous extraction results. 
Finally, we enhance the learning process of the student LLM through Chain-of-Thought In-Context Molecule Tuning, integrating the fine-grained alignments and the reasoning processes within the Chain-of-Thought format. Our experimental results demonstrate that MolReFlect enables LLMs like Mistral-7B to significantly outperform the previous baselines, achieving SOTA performance on the ChEBI-20 dataset. This advancement not only enhances the generative capabilities of LLMs in the molecule-caption translation task, but also contributes to a more explainable framework.", "authors": [ "Qing Li", "Yuqiang Li", "Dongzhan Zhou", "Wenqi Fan", "Di Zhang", "Jingdi Lei", "Wei Liu", "Yunqing Liu", "Jiatong Li" ], "published": "2024-11-22", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "arxiv-preprint-2024-11" }, { "id": "directing-the-violence-or-admonishing-it-a", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=13LjoyYWcaw", "url_pdf": "https://openreview.net/pdf?id=13LjoyYWcaw", "title": "Directing the violence or admonishing it? A survey of contronymy and androcentrism in Google Translate and some recommendations", "abstract": "The recent raft of high-profile gaffes involving neural machine translation technologies has brought to light the unreliability of this evolving technology. A worrisome\nfacet of the ubiquity of this technology is that it largely operates in a use-it-at-yourown-peril mode where the user is often unaware of either the idiosyncratic brittleness of the underlying neural translation model or when it is, that the translations\nbe deemed trustworthy and when they wouldn’t. These revelations have worryingly\ncoincided with other developments such as the emergence of large language models\nthat now produce biased and erroneous results, albeit with human-like fluency, the\nuse of back-translation as a data-augmentation strategy in so termed ’low-resource’\nsettings and the emergence of ’AI-enhanced legal-tech’ as a panacea that promises\n’disruptive democratization’ of access to legal services. In the backdrop of these\nquandaries, we present this cautionary tale where we shed light on the specifics\nof the risks surrounding cavalier deployment of this technology by exploring two\nspecific failings: Androcentrism and Enantiosemy. In this regard, we empirically\ninvestigate the fate of the pronouns and a list of contronyms when subjected to\nback-translation using Google Translate. Through this, we seek to highlight the\nprevalence of ’defaulting-to-the-masculine’ phenomenon in the context of engendered profession-related translations and also empirically demonstrate the scale and\nnature of threats pertaining to contronymous phrases covering both current-affairs\nand legal issues. Based on these observations, we have collected a series of recommendations that constitute the latter half of this paper. 
All of the code and datasets\ngenerated in this paper have been open-sourced for the community to build on here:\nhttps://github.com/rteehas/GT_study_recommendations.\n", "authors": [ "Anonymous" ], "published": "2021-08-18", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "alephbert-pre-training-and-end-to-end", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=4IgzCL-ytZs", "url_pdf": "https://openreview.net/pdf?id=4IgzCL-ytZs", "title": "AlephBERT: Pre-training and End-to-End Language Models Evaluation from Sub-Word to Sentence Level", "abstract": "Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between. The problem is twofold. First, Hebrew resources for training large language models are not at the same order of magnitude as their English counterparts. Second, there are no accepted tasks and benchmarks to evaluate the progress of Hebrew PLMs on, and in particular, evaluation on sub-word (morphological) tasks. We aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on larger vocabulary and a larger dataset than any Hebrew PLM before. Moreover, we introduce a novel language-agnostic architecture that extracts all of the sub-word morphological segments encoded in contextualized word embedding vectors. Utilizing this new morphological component we offer a new PLM evaluation pipeline of multiple Hebrew tasks and benchmarks, that cover word-level, sub-word level and sentence level tasks. With AlephBERT we achieve state-of-the-art results compared against contemporary baselines. We make our AlephBERT model and evaluation pipeline publicly available, providing a single point of entry for evaluating and comparing Hebrew PLMs.", "authors": [ "Anonymous" ], "published": "2021-07-17", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-july-2021-7" }, { "id": "few-shot-semantic-parsing-with-language-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=_sAJjkoxfuo", "url_pdf": "https://openreview.net/pdf?id=_sAJjkoxfuo", "title": "Few-Shot Semantic Parsing with Language Models Trained On Code", "abstract": "Large language models can perform semantic parsing with little training data, when prompted with in-context examples. It has been shown that this can be improved by formulating the problem as paraphrasing into canonical utterances, which casts the underlying meaning representation into a controlled natural language-like representation. Intuitively, such models can more easily output canonical utterances as they are closer to the natural language used for pre-training. More recently, models also pre-trained on code, like OpenAI Codex, have risen in prominence. Since semantic parsing requires translating natural language into code, such models may prove more adept at it. In this paper, we test this hypothesis and find that Codex performs better at semantic parsing than equivalent GPT-3 models. 
We find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps because meaning representations used in semantic parsing are structured similar to code.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "active-dialogue-simulation-in-conversational", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=5GdS7K37pKN", "url_pdf": "https://openreview.net/pdf?id=5GdS7K37pKN", "title": "Active Dialogue Simulation in Conversational Systems", "abstract": "Semantic parsing helps conversational systems in satisfying users' requests through dialogues. To train these models, collecting annotated dialogues as a dataset is a very expensive and time-consuming process. In this paper, our goal is to utilize large language models and active learning to replace Wizard-of-Oz (WoZ) collection via crowdsourcing for bootstrapping training data for task-driven semantic parsers. We first demonstrate the utility of utterances generated by GPT-3 when seeded with prior training dialogues, as evaluated by human judges. We then explore the use of parser uncertainty on generated outputs as a selection criteria for annotation and contrast this with a strategy based on Core-sets. Our pipeline leads to more useful examples on average, motivating future work on active generation for bootstrapping semantic parsers.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "cog-dqa-chain-of-guiding-learning-with-large", "arxiv_id": null, "nips_id": null, "url_abs": "http://openaccess.thecvf.com//content/CVPR2024/html/Wang_CoG-DQA_Chain-of-Guiding_Learning_with_Large_Language_Models_for_Diagram_Question_CVPR_2024_paper.html", "url_pdf": "http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_CoG-DQA_Chain-of-Guiding_Learning_with_Large_Language_Models_for_Diagram_Question_CVPR_2024_paper.pdf", "title": "CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering", "abstract": " Diagram Question Answering (DQA) is a challenging task requiring models to answer natural language questions based on visual diagram contexts. It serves as a crucial basis for academic tutoring technical support and more practical applications. DQA poses significant challenges such as the demand for domain-specific knowledge and the scarcity of annotated data which restrict the applicability of large-scale deep models. Previous approaches have explored external knowledge integration through pre-training but these methods are costly and can be limited by domain disparities. While Large Language Models (LLMs) show promise in question-answering there is still a gap in how to cooperate and interact with the diagram parsing process. In this paper we introduce the Chain-of-Guiding Learning Model for Diagram Question Answering (CoG-DQA) a novel framework that effectively addresses DQA challenges. CoG-DQA leverages LLMs to guide diagram parsing tools (DPTs) through the guiding chains enhancing the precision of diagram parsing while introducing rich background knowledge. Our experimental findings reveal that CoG-DQA surpasses all comparison models in various DQA scenarios achieving an average accuracy enhancement exceeding 5% and peaking at 11% across four datasets. 
These results underscore CoG-DQA's capacity to advance the field of visual question answering and promote the integration of LLMs into specialized domains. ", "authors": [ "Jun Liu", "Xinyu Zhang", "Kim-Hui Yap", "Tao Qin", "Longji Zhu", "Lingling Zhang", "Shaowei Wang" ], "published": "2024-01-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "cvpr-2024-1" }, { "id": "data-augmentation-for-intent-classification", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=Yy2sTU8uCak", "url_pdf": "https://openreview.net/pdf?id=Yy2sTU8uCak", "title": "Data Augmentation for Intent Classification with Generic Large Language Models", "abstract": "Data augmentation alleviates the problem of data scarcity when training language models (LMs) by generating new examples based on the existing data. A successful approach to generate new samples is to fine-tune a pretrained LM on the task-specific data and then sample from the label-conditioned LM. However, fine-tuning can be difficult when task-specific data is scarce. In this work, we explore whether large pretrained LMs can be used to generate new useful samples without fine-tuning. For a given class, we propose concatenating few examples and prompt them to GPT-3 to generate new examples. We evaluate this method for few-shot intent classification on CLINC150 and SNIPS and find that data generated by GPT-3 greatly improves the performance of the intent classifiers. Importantly, we find that, without any LM fine-tuning, the gains brought by data augmentation with GPT-3 are similar to those reported in prior work on LM-based data augmentation. Experiments with models of different sizes show that larger LMs generate higher quality samples that yield higher accuracy gains.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "evaluating-pre-trained-language-models-on", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.sdp-1.22", "url_pdf": "https://aclanthology.org/2022.sdp-1.22.pdf", "title": "Evaluating Pre-Trained Language Models on Multi-Document Summarization for Literature Reviews", "abstract": "Systematic literature reviews in the biomedical space are often expensive to conduct. Automation through machine learning and large language models could improve the accuracy and research outcomes from such reviews. In this study, we evaluate a pre-trained LongT5 model on the MSLR22: Multi-Document Summarization for Literature Reviews Shared Task datasets. We weren’t able to make any improvements on the dataset benchmark, but we do establish some evidence that current summarization metrics are insufficient in measuring summarization accuracy. 
A multi-document summarization web tool was also built to demonstrate the viability of summarization models for future investigators: https://ben-yu.github.io/summarizer", "authors": [ "Benjamin Yu" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "sdp-coling-2022-10" }, { "id": "story-centaur-large-language-model-few-shot", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.eacl-demos.29", "url_pdf": "https://aclanthology.org/2021.eacl-demos.29.pdf", "title": "Story Centaur: Large Language Model Few Shot Learning as a Creative Writing Tool", "abstract": "Few shot learning with large language models has the potential to give individuals without formal machine learning training the access to a wide range of text to text models. We consider how this applies to creative writers and present Story Centaur, a user interface for prototyping few shot models and a set of recombinable web components that deploy them. Story Centaur{'}s goal is to expose creative writers to few shot learning with a simple but powerful interface that lets them compose their own co-creation tools that further their own unique artistic directions. We build out several examples of such tools, and in the process probe the boundaries and issues surrounding generation with large language models.", "authors": [ "Monica Dinalescu", "Sherol Chen", "Ben Pietrzak", "Kory Mathewson", "Ben Swanson" ], "published": "2021-04-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "eacl-2021-2" }, { "id": "dont-forget-about-pronouns-removing-gender-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.gebnlp-1.3", "url_pdf": "https://aclanthology.org/2022.gebnlp-1.3.pdf", "title": "Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information", "abstract": "The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model’s embeddings and identify components encoding both types of information with probing. We aim to diminish the stereotypical bias in the representations while preserving the factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without significant deterioration of language modeling capabilities. The findings can be applied to language generation to mitigate reliance on stereotypes while preserving gender agreement in coreferences.", "authors": [ "David Mareček", "Tomasz Limisiewicz" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-gebnlp-2022-7" }, { "id": "zero-shot-on-the-fly-event-schema-induction", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=mJzm4ueUKrV", "url_pdf": "https://openreview.net/pdf?id=mJzm4ueUKrV", "title": "Zero-Shot On-the-Fly Event Schema Induction", "abstract": "What are the events involved in a pandemic outbreak? What steps should be taken when planning a wedding? The answers to these questions can be found by collecting many documents on the complex event of interest, extracting relevant information and analyzing it. 
We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them to construct a schema that describes the complex event in its entirety. Using our model, complete schemas on any topic can be generated on-the-fly without any data collection needed, i.e., in a zero-shot manner. Moreover, we develop efficient methods to extract pertinent information from texts and demonstrate, in a series of experiments, that these schemas are considered to be more complete than human-curated ones in the majority of examined scenarios. Finally, we show that this framework is comparable in performance with previous supervised schema induction methods that rely on collecting real texts while being more general and flexible by avoiding the need to use a predefined ontology.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "n-gram-counts-and-language-models-from-the", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/L14-1074", "url_pdf": "https://aclanthology.org/L14-1074.pdf", "title": "N-gram Counts and Language Models from the Common Crawl", "abstract": "We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed with emphasis on the problems that arise in working with data at this scale. Our unpruned Kneser-Ney English {\\$}5{\\$}-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5-1.4 BLEU by using large language models to translate into various languages.", "authors": [ "Christian Buck", "Bas van Ooyen", "Kenneth Heafield" ], "published": "2014-05-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "lrec-2014-5" }, { "id": "ppl-mcts-constrained-textual-generation", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=0BvzMpR0zkJ", "url_pdf": "https://openreview.net/pdf?id=0BvzMpR0zkJ", "title": "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided MCTS Decoding", "abstract": "Large language models (LM) based on Transformers allow to generate plausible long texts. In this paper, we explore how this generation can be further controlled at decoding time to satisfy certain constraints (eg. being non-toxic, conveying certain emotions, using a specific writing style, etc.) without fine-tuning the LM.Precisely, we formalize constrained generation as a tree exploration process guided by a discriminator that indicates how well the associated sequence respects the constraint. This approach, in addition to being easier and cheaper to train than fine-tuning the LM, allows to apply the constraint more finely and dynamically.We propose several original methods to search this generation tree, notably the Monte Carlo Tree Search (MCTS) which provides theoretical guarantees on the search efficiency, but also simpler methods based on re-ranking a pool of diverse sequences using the discriminator scores. 
These methods are evaluated, with automatic and human-based metrics, on two types of constraints and languages: review polarity and emotion control in French and English. We show that discriminator-guided MCTS decoding achieves state-of-the-art results without having to tune the language model, in both tasks and languages. We also demonstrate that other proposed decoding methods based on re-ranking can be really effective when diversity among the generated propositions is encouraged.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "surrey-cts-nlp-at-wassa2022-an-experiment-of", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.wassa-1.29", "url_pdf": "https://aclanthology.org/2022.wassa-1.29.pdf", "title": "SURREY-CTS-NLP at WASSA2022: An Experiment of Discourse and Sentiment Analysis for the Prediction of Empathy, Distress and Emotion", "abstract": "This paper summarises the submissions our team, SURREY-CTS-NLP has made for the WASSA 2022 Shared Task for the prediction of empathy, distress and emotion. In this work, we tested different learning strategies, like ensemble learning and multi-task learning, as well as several large language models, but our primary focus was on analysing and extracting emotion-intensive features from both the essays in the training data and the news articles, to better predict empathy and distress scores from the perspective of discourse and sentiment analysis. We propose several text feature extraction schemes to compensate the small size of training examples for fine-tuning pretrained language models, including methods based on Rhetorical Structure Theory (RST) parsing, cosine similarity and sentiment score. Our best submissions achieve an average Pearson correlation score of 0.518 for the empathy prediction task and an F1 score of 0.571 for the emotion prediction task, indicating that using these schemes to extract emotion-intensive information can help improve model performance.", "authors": [ "Félix do Carmo", "Hadeel Saadany", "Diptesh Kanojia", "Constantin Orasan", "Shenbin Qian" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "wassa-acl-2022-5" }, { "id": "hyperparameter-power-impact-in-transformer", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.sustainlp-1.12", "url_pdf": "https://aclanthology.org/2021.sustainlp-1.12.pdf", "title": "Hyperparameter Power Impact in Transformer Language Model Training", "abstract": "Training large language models can consume a large amount of energy. We hypothesize that the language model’s configuration impacts its energy consumption, and that there is room for power consumption optimisation in modern large language models. To investigate these claims, we introduce a power consumption factor to the objective function, and explore the range of models and hyperparameter configurations that affect power. 
We identify multiple configuration factors that can reduce power consumption during language model training while retaining model quality.", "authors": [ "Leon Derczynski", "Timmie Rantzau", "Mads Guldborg Kjeldgaard Kongsbak", "Lucas Høyberg Puvis de Chavannes" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "emnlp-sustainlp-2021-11" }, { "id": "upstream-mitigation-is-not-all-you-need", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.acl-long.247", "url_pdf": "https://aclanthology.org/2022.acl-long.247.pdf", "title": "Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models", "abstract": "A few large, homogenous, pre-trained models undergird many machine learning systems — and often, these models contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis: the theory that social biases (such as stereotypes) internalized by large language models during pre-training transfer into harmful task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.", "authors": [ "Michael Wick", "Ari Kobren", "Swetasudha Panda", "Ryan Steed" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-2022-5" }, { "id": "omniparser-a-unified-framework-for-text-1", "arxiv_id": null, "nips_id": null, "url_abs": "http://openaccess.thecvf.com//content/CVPR2024/html/Wan_OmniParser_A_Unified_Framework_for_Text_Spotting_Key_Information_Extraction_CVPR_2024_paper.html", "url_pdf": "http://openaccess.thecvf.com//content/CVPR2024/papers/Wan_OmniParser_A_Unified_Framework_for_Text_Spotting_Key_Information_Extraction_CVPR_2024_paper.pdf", "title": "OmniParser: A Unified Framework for Text Spotting Key Information Extraction and Table Recognition", "abstract": " Recently visually-situated text parsing (VsTP) has experienced notable advancements driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However due to the diversified targets and heterogeneous schemas previous works usually design task-specific architectures and objectives for individual tasks which inadvertently leads to modal isolation and complex workflow. In this paper we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically we devise a universal model called OmniParser which can simultaneously handle three typical visually-situated text parsing tasks: text spotting key information extraction and table recognition. In OmniParser all tasks share the unified encoder-decoder architecture the unified objective: point-conditioned text generation and the unified input & output representation: prompt & structured sequences. 
Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks despite its unified concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery. ", "authors": [ "Zhibo Yang", "Cong Yao", "Xiang Bai", "Fei Huang", "Wenqing Cheng", "Yuliang Liu", "Wenwen Yu", "Sibo Song", "Jianqiang Wan" ], "published": "2024-01-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "cvpr-2024-1" }, { "id": "evaluating-the-validity-of-word-level", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2024.findings-acl.292/", "url_pdf": "https://aclanthology.org/2024.findings-acl.292/", "title": "Evaluating the Validity of Word-level Adversarial Attacks with Large Language Models", "abstract": "Deep neural networks exhibit vulnerability to word-level adversarial attacks in natural language processing. Most of these attack methods adopt synonymous substitutions to perturb original samples for crafting adversarial examples while attempting to maintain semantic consistency with the originals. Some of them claim that they could achieve over 90% attack success rate, thereby raising serious safety concerns. However, our investigation reveals that many purportedly successful adversarial examples are actually invalid due to significant changes in semantic meanings compared to their originals. Even when equipped with semantic constraints such as BERTScore, existing attack methods can generate up to 87.9% invalid adversarial examples. Building on this insight, we first curate a 13K dataset for adversarial validity evaluation with the help of GPT-4. Then, an open-source large language model is fine-tuned to offer an interpretable validity score for assessing the semantic consistency between original and adversarial examples. Finally, this validity score can serve as a guide for existing adversarial attack methods to generate valid adversarial examples. Comprehensive experiments demonstrate the effectiveness of our method in evaluating and refining the quality of adversarial examples.", "authors": [ "Fangyuan Zhang", "Wenhan Mu", "Dongping Chen", "Hongtao Wang", "Zhaoyang Wang", "Huichi Zhou" ], "published": "2024-08-15", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "findings-of-the-association-for-computational-6" }, { "id": "methods-for-estimating-and-improving-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.naacl-srw.6", "url_pdf": "https://aclanthology.org/2022.naacl-srw.6.pdf", "title": "Methods for Estimating and Improving Robustness of Language Models", "abstract": "Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for shallow textual relations over full semantic complexity of the problem. This proposal investigates a common denominator of this problem in their weak ability to generalise outside of the training domain. We survey diverse research directions providing estimations of model generalisation ability and find that incorporating some of these measures in the training objectives leads to enhanced distributional robustness of neural models. 
Based on these findings, we present future research directions enhancing the robustness of LLMs.", "authors": [ "Michal Stefanik" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-acl-2022-7" }, { "id": "contextualized-sensorimotor-norms-multi", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=DkbUnKnnKVk", "url_pdf": "https://openreview.net/pdf?id=DkbUnKnnKVk", "title": "Contextualized Sensorimotor Norms: multi-dimensional measures of sensorimotor strength for ambiguous English words, in context", "abstract": "Most large language models are trained on linguistic input alone, yet humans appear to ground their understanding of words in sensorimotor experience. A natural solution is to augment LM representations with human judgments of a word's sensorimotor associations (e.g., the Lancaster Sensorimotor Norms), but this raises another challenge: most words are ambiguous, and judgments of words in isolation fail to account for this multiplicity of meaning (e.g., \"wooden table\" vs. \"data table\". We attempted to address this problem by building a new lexical resource of contextualized sensorimotor judgments for 112 English words, each rated in four different contexts (448 sentences total). We show that these ratings encode overlapping but distinct information from the Lancaster Sensorimotor Norms, and that they also predict other measures of interest (e.g., relatedness), above and beyond measures derived from BERT. ", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "dont-forget-about-pronouns-removing-gender", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=s-BXA8RAyqG", "url_pdf": "https://openreview.net/pdf?id=s-BXA8RAyqG", "title": "Don’t Forget About Pronouns: Removing Gender Bias in Language Models without Losing Factual Gender Information", "abstract": "The representations in large language models contain various types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model’s embeddings and identify components encoding both information with probing. We aim to diminish the representation of stereotypical bias while preserving factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without deteriorating language modeling capabilities. 
The findings can be applied to language generation and understanding to mitigate reliance on stereotypes while preserving gender agreement in coreferences.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "re-a-study-for-restorable-embeddings", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=XVPvYByfPxV", "url_pdf": "https://openreview.net/pdf?id=XVPvYByfPxV", "title": "RE: A Study for Restorable Embeddings", "abstract": "As the number of model parameters has increased, large language models have achieved linguistic fluency and exhibited high performance on various natural language tasks without gradient updates, because the models can retain more knowledge.\nHowever, the large model size makes it difficult to apply the model to a task requiring domain knowledge not included in the training corpus, because the knowledge stored in model parameters is not controllable during generation and model parameter updates are costly.\nTo tackle this problem, we suggest separating the language model from knowledge, dividing the end-to-end language model into three parts: 1) encoding knowledge, 2) processing the encoded knowledge, and 3) restoring the processed knowledge embedding to natural language.\nIn this paper, we propose a model for learning restorable embeddings as a first step toward separating the language model and knowledge.\nThe experimental results show that the proposed model can restore most knowledge in 1-2 sentences by encoding knowledge in sentence-level embeddings and then restoring the embeddings to the original sentence.\nWe also verify that the embeddings generated through our method significantly improve performance on the passage retrieval task.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "magic-pyramid-accelerating-inference-with-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=q1IubnXd3tE", "url_pdf": "https://openreview.net/pdf?id=q1IubnXd3tE", "title": "Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning", "abstract": "Pre-training and then fine-tuning large language models is commonly used to achieve state-of-the-art performance in natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed. Deploying such large models to applications with latency constraints is challenging. In this work, we focus on accelerating inference via conditional computation. To achieve this, we propose a novel idea, Magic Pyramid (MP), to reduce both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT. The former saves computation by removing non-salient tokens, while the latter reduces computation by terminating inference before the final layer when the exiting condition is met. Our empirical studies demonstrate that, compared to the previous state of the art, MP is not only able to achieve speed-adjustable inference, but also to surpass token pruning and early exiting by reducing up to 70\% giga floating point operations (GFLOPs) with less than 0.5\% accuracy drop. 
Token pruning and early exiting exhibit distinct preferences for sequences of different lengths. However, MP is capable of achieving an average speedup of 8.06x on two popular text classification tasks, regardless of input size.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "show-don-t-tell-demonstrations-outperform", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=RNI5LO-axuw", "url_pdf": "https://openreview.net/pdf?id=RNI5LO-axuw", "title": "Show, Don't Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue", "abstract": "Building universal dialogue systems that can seamlessly operate across multiple domains/APIs and can generalize to new ones with minimal supervision and low maintenance is a critical challenge. Recent works have leveraged natural language descriptions for schema elements to build such systems. However, descriptions only provide indirect supervision for downstream tasks, while still requiring effort to construct. In this work, we propose Show, Don't Tell, which uses a short labeled example dialogue to show the semantics of a schema rather than telling the model about the schema elements via descriptions. While requiring similar effort from service developers, we show that using short examples as schema representations with large language models results in stronger performance and better generalization on two popular dialogue state tracking benchmarks: the Schema-Guided Dialogue (SGD) dataset and the MultiWoZ leave-one-out benchmark.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "c3po-a-lightweight-copying-mechanism-for", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.aacl-srw.7/", "url_pdf": "https://aclanthology.org/2022.aacl-srw.7.pdf", "title": "C3PO: A Lightweight Copying Mechanism for Translating Pseudocode to Code", "abstract": "Writing computer programs is a skill that remains inaccessible to most due to the barrier of programming language (PL) syntax. While large language models (LLMs) have been proposed to translate natural language pseudocode to PL code, they are costly in terms of data and compute. We propose a lightweight alternative to LLMs that exploits the property of code wherein most tokens can be simply copied from the pseudocode. We divide the problem into three phases: Copy, Generate, and Combine. In the Copy Phase, a binary classifier is employed to determine and mask the pseudocode tokens that can be directly copied into the code. In the Generate Phase, a Sequence-to-Sequence model is used to generate the masked PL code equivalent. In the Combine Phase, the generated sequence is combined with the tokens that the Copy Phase had masked. 
We show that our C3PO models achieve similar performance to non-C3PO models while reducing the computational cost of training as well as the vocabulary sizes.", "authors": [ "Mamatha Hr", "Prajwal Anagani", "Vibha Masti", "Vishruth Veerendranath" ], "published": "2022-11-20", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "aacl-ijcnlp-2022-11" }, { "id": "show-dont-tell-demonstrations-outperform", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.naacl-main.336", "url_pdf": "https://aclanthology.org/2022.naacl-main.336.pdf", "title": "Show, Don’t Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue", "abstract": "Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don’t Tell, which prompts seq2seq models with a labeled example dialogue to show the semantics of schema elements rather than tell the model through descriptions. While requiring similar effort from service developers as generating descriptions, we show that using short examples as schema representations with large language models results in state-of-the-art performance on two popular dialogue state tracking benchmarks designed to measure zero-shot generalization - the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark.", "authors": [ "Yonghui Wu", "Abhinav Rastogi", "Yuan Cao", "Jeffrey Zhao", "Harrison Lee", "Raghav Gupta" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-2022-7" }, { "id": "hierarchical-transformers-are-more-efficient-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=ZQejhmTreE8", "url_pdf": "https://openreview.net/pdf?id=ZQejhmTreE8", "title": "Hierarchical Transformers Are More Efficient Language Models", "abstract": "Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences, which allows them to produce long coherent outputs: entire paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. 
In particular, Hourglass sets a new state of the art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "evaluating-the-text-to-sql-capabilities-of", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=lYli-bAuK54", "url_pdf": "https://openreview.net/pdf?id=lYli-bAuK54", "title": "Evaluating the Text-to-SQL Capabilities of Large Language Models", "abstract": "We perform an empirical evaluation of the Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "a-recipe-for-arbitrary-text-style-transfer-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=tTwG1UKHRB1", "url_pdf": "https://openreview.net/pdf?id=tTwG1UKHRB1", "title": "A Recipe For Arbitrary Text Style Transfer with Large Language Models", "abstract": "In this paper, we leverage large language models (LLMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as 'make this melodramatic' or 'insert a metaphor.'", "authors": [ "Anonymous" ], "published": "2021-06-16", "conference": null, "conference_url_abs": "https://openreview.net/forum?id=XnwgpvL4PRf", "conference_url_pdf": "https://openreview.net/pdf?id=XnwgpvL4PRf", "proceeding": "acl-arr-october-2021-10" }, { "id": "learning-to-repair-repairing-model-output", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=RlwvZrvcv8", "url_pdf": "https://openreview.net/pdf?id=RlwvZrvcv8", "title": "Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback", "abstract": "Large language models (LMs), while powerful, are not immune to mistakes, but can be difficult to retrain. Our goal is for an LM to continue to improve after deployment, without retraining, using feedback from the user. Our approach pairs an LM with (i) a growing memory of cases where the user identified an output error and provided general feedback on how to correct it, and (ii) a corrector model, trained to translate this general feedback into specific edits to repair the model output. Given a new, unseen input, our model can then use feedback from similar past cases to repair output errors that may occur. 
We instantiate our approach using an existing, fixed model for script generation that takes a goal (e.g., \"bake a cake\") and generates a partially ordered sequence of actions to achieve that goal, sometimes containing errors. We show that our memory-enhanced system, FBNet, learns to apply user feedback effectively to repair such errors (up to a 30-point improvement), while making a start at avoiding similar past mistakes on new, unseen examples (up to a 7-point improvement in a controlled setting). This is a first step towards strengthening deployed models, potentially broadening their utility.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "multimodal-large-language-models-for", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.naacl-srw.26", "url_pdf": "https://aclanthology.org/2022.naacl-srw.26.pdf", "title": "Multimodal large language models for inclusive collaboration learning tasks", "abstract": "This PhD project leverages advancements in multimodal large language models to build an inclusive collaboration feedback loop, facilitating automated detection, modeling, and feedback for participants developing general collaboration skills. This topic is important given the role of collaboration as an essential 21st century skill, the potential to ground large language models within learning theory and real-world practice, and the expressive potential of transformer models to support equity and inclusion. We address some concerns about integrating advances in natural language processing into downstream tasks such as the learning analytics feedback loop.", "authors": [ "Armanda Lewis" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-acl-2022-7" }, { "id": "solving-probability-and-statistics-problems-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=XOI2xQDpzqz", "url_pdf": "https://openreview.net/pdf?id=XOI2xQDpzqz", "title": "Solving Probability and Statistics Problems by Program Synthesis", "abstract": "We solve university-level probability and statistics questions by program synthesis using OpenAI's Codex, a Transformer trained on text and fine-tuned on code. We transform course problems from MIT's 18.05 Introduction to Probability and Statistics and Harvard's STAT110 Probability into programming tasks. We then execute the generated code to get a solution. Since these course questions are grounded in probability, we often aim to have Codex generate probabilistic programs that simulate a large number of probabilistic dependencies in order to compute a solution. Our approach requires prompt engineering to transform the question from its original form into an explicit, tractable form that results in a correct program and solution. To estimate the amount of work needed to translate an original question into its tractable form, we measure the similarity between original and transformed questions. 
Our work is the first to introduce a new dataset of university-level probability and statistics problems and to solve these problems in a scalable fashion using the program synthesis capabilities of large language models.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "reframing-human-ai-collaboration-for-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=DF0R3AsZ5IB", "url_pdf": "https://openreview.net/pdf?id=DF0R3AsZ5IB", "title": "Reframing Human-AI Collaboration for Generating Free-Text Explanations", "abstract": "Large language models are increasingly capable of generating fluent-appearing text with relatively little task-specific supervision. But can these models accurately explain classification decisions? We consider the task of generating free-text explanations using a small number of human-written examples (i.e., in a few-shot manner). We find that (1) higher-quality, human-authored prompts result in higher-quality generations; and (2) surprisingly, in a head-to-head comparison, humans often prefer explanations generated by GPT-3 to crowdsourced explanations in existing datasets. Our human studies also show, however, that while models often produce factual, grammatical, and sufficient explanations, they have room to improve along axes such as providing novel information and supporting the label. We create a pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop. Despite the significant subjectivity intrinsic to judging acceptability, our approach is able to consistently filter GPT-3-generated explanations deemed acceptable by humans.", "authors": [ "Anonymous" ], "published": "2022-01-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-january-2022-1" }, { "id": "boosting-coherence-of-language-models-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=CJQqdS-fx3K", "url_pdf": "https://openreview.net/pdf?id=CJQqdS-fx3K", "title": "Boosting coherence of language models", "abstract": "Naturality of long-term information structure -- coherence -- remains a challenge in language generation. Large language models have insufficiently learned such structure, as their long-form generations differ from natural text in measures of coherence. To alleviate this divergence, we propose coherence boosting, an inference procedure that increases the effect of distant context on next-token prediction. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialog responses. 
We also find that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.", "authors": [ "Anonymous" ], "published": "2021-11-16", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-arr-november-2021-11" }, { "id": "plug-and-play-conversational-models-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=Z4I9PcrWdRI", "url_pdf": "https://openreview.net/pdf?id=Z4I9PcrWdRI", "title": "Plug-and-Play Conversational Models", "abstract": "There has been considerable progress made towards conversational models that generate coherent and fluent responses; however, this often involves training large language models on large dialogue datasets, such as Reddit. These large conversational models provide little control over the generated responses, and this control is further limited in the absence of annotated conversational datasets for attribute-specific generation that can be used for fine-tuning the model. In this paper, we first propose and evaluate plug-and-play methods for controllable response generation, which do not require dialogue-specific datasets and do not rely on fine-tuning a large model. While effective, the decoding procedure induces considerable computational overhead, rendering the conversational model unsuitable for interactive usage. To overcome this, we introduce an approach that requires no further computation at decoding time and no fine-tuning of a large language model. We demonstrate, through extensive automatic and human evaluation, a high degree of control over the generated conversational responses with regard to multiple desired attributes, while remaining fluent.", "authors": [ "Anonymous" ], "published": "2020-07-23", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null } ] }{ "count": 24708, "next": "
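Usage note: the object above is one page of a paginated response. "count" gives the total number of matches (24708 here), and "next" holds the absolute URL of the following page, or null on the last page. Below is a minimal Python sketch of a client that walks the full listing by following "next" links; it assumes the public https://paperswithcode.com/api/v1/papers/ endpoint together with the third-party requests library, and the helper name iter_papers is illustrative rather than part of the API.

import requests
from itertools import islice

BASE_URL = "https://paperswithcode.com/api/v1/papers/"

def iter_papers(query, ordering="-arxiv_id"):
    """Yield every paper dict matching `query`, following "next" links."""
    url = BASE_URL
    params = {"q": query, "ordering": ordering}
    while url:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        page = response.json()
        yield from page["results"]
        # "next" is the absolute URL of the following page (its query
        # string already carries q/ordering/page), or None when done.
        url = page.get("next")
        params = None

# Example: print the first five matches instead of fetching all ~24708 results.
for paper in islice(iter_papers("Large Language Models"), 5):
    print(paper["id"], "-", paper["title"])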