Paper List
Return a paginated listing of all papers. The q parameter filters papers by search query, and ordering selects the sort field, with a leading minus for descending order (here, newest first by published date). For example:
GET /api/v1/papers/?ordering=-published&q=Large+Language+Models
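A minimal sketch of calling this endpoint from Python with the requests library, assuming the public paperswithcode.com host that appears in the next URL of the example response below; the search_papers helper and the max_pages limit are illustrative, not part of an official client:

import requests

BASE_URL = "https://paperswithcode.com/api/v1/papers/"

def search_papers(query, ordering="-published", max_pages=2):
    """Fetch papers matching `query`, newest first, following `next` page links."""
    papers = []
    params = {"q": query, "ordering": ordering}
    url = BASE_URL
    for _ in range(max_pages):
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        payload = response.json()
        papers.extend(payload["results"])  # one dict per paper
        url = payload.get("next")          # URL of the next page, or None on the last page
        params = None                      # the `next` URL already carries the query string
        if not url:
            break
    return papers

if __name__ == "__main__":
    for paper in search_papers("Large Language Models"):
        print(paper["id"], "-", paper["title"])

The paginated response carries next and previous page URLs plus a results array of paper records. In the example below, next points to page 2 of the same query, previous is null, and each record in results includes id, arxiv_id, nips_id, url_abs, url_pdf, title, abstract, authors, published, conference fields, and proceeding.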
https://paperswithcode.com/api/v1/papers/?ordering=-published&page=2&q=Large+Language+Models", "previous": null, "results": [ { "id": "when-classifying-grammatical-role-bert-doesnt", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.acl-short.71", "url_pdf": "https://aclanthology.org/2022.acl-short.71.pdf", "title": "When classifying grammatical role, BERT doesn’t care about word order... except when it matters", "abstract": "Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words chopped, chef, and onion are more likely used to convey “The chef chopped the onion,” not “The onion chopped the chef.” Recent work has shown large language models to be surprisingly word order invariant, but crucially has largely considered natural prototypical inputs, where compositional meaning mostly matches lexical expectations. To overcome this confound, we probe grammatical role representation in English BERT and GPT-2, on instances where lexical expectations are not sufficient, and word order knowledge is necessary for correct classification. Such non-prototypical instances are naturally occurring English sentences with inanimate subjects or animate objects, or sentences where we systematically swap the arguments to make sentences like “The onion chopped the chef”. We find that, while early layer embeddings are largely lexical, word order is in fact crucial in defining the later-layer representations of words in semantically non-prototypical positions. Our experiments isolate the effect of word order on the contextualization process, and highlight how models use context in the uncommon, but critical, instances where it matters.", "authors": [ "Kyle Mahowald", "Richard Futrell", "Isabel Papadimitriou" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-2022-5" }, { "id": "exploring-cross-lingual-text-detoxification", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.acl-srw.26", "url_pdf": "https://aclanthology.org/2022.acl-srw.26.pdf", "title": "Exploring Cross-lingual Text Detoxification with Large Multilingual Language Models.", "abstract": "Detoxification is a task of generating text in polite style while preserving meaning and fluency of the original toxic text. Existing detoxification methods are monolingual i.e. designed to work in one exact language. This work investigates multilingual and cross-lingual detoxification and the behavior of large multilingual models in this setting. Unlike previous works we aim to make large language models able to perform detoxification without direct fine-tuning in a given language. Experiments show that multilingual models are capable of performing multilingual style transfer. 
However, tested state-of-the-art models are not able to perform cross-lingual detoxification and direct fine-tuning on exact language is currently inevitable and motivating the need of further research in this direction.", "authors": [ "Alexander Panchenko", "Daryna Dementieva", "Daniil Moskovskiy" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-2022-5" }, { "id": "you-reap-what-you-sow-on-the-challenges-of", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.bigscience-1.3", "url_pdf": "https://aclanthology.org/2022.bigscience-1.3.pdf", "title": "You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings", "abstract": "Evaluating bias, fairness, and social impact in monolingual language models is a difficult task. This challenge is further compounded when language modeling occurs in a multilingual context. Considering the implication of evaluation biases for large multilingual language models, we situate the discussion of bias evaluation within a wider context of social scientific research with computational work.We highlight three dimensions of developing multilingual bias evaluation frameworks: (1) increasing transparency through documentation, (2) expanding targets of bias beyond gender, and (3) addressing cultural differences that exist between languages.We further discuss the power dynamics and consequences of training large language models and recommend that researchers remain cognizant of the ramifications of developing such technologies.", "authors": [ "Oskar van der Wal", "Deepak Tunuguntla", "Samson Tan", "Jaesung Tae", "Arjun Subramonian", "Shanya Sharma", "Dragomir Radev", "Margaret Mitchell", "Maraim Masoud", "Sasha Luccioni", "Shayne Longpre", "Manan Dey", "Miruna Clinciu", "Stella Biderman", "Aurélie Névéol", "Zeerak Talat" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "bigscience-acl-2022-5" }, { "id": "evaluating-and-mitigating-inherent-linguistic", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.coling-1.124", "url_pdf": "https://aclanthology.org/2022.coling-1.124.pdf", "title": "Evaluating and Mitigating Inherent Linguistic Bias of African American English through Inference", "abstract": "Recent studies show that NLP models trained on standard English texts tend to produce biased outcomes against underrepresented English varieties. In this work, we conduct a pioneering study of the English variety use of African American English (AAE) in NLI task. First, we propose CodeSwitch, a greedy unidirectional morphosyntactically-informed rule-based translation method for data augmentation. Next, we use CodeSwitch to present a preliminary study to determine if demographic language features do in fact influence models to produce false predictions. Then, we conduct experiments on two popular datasets and propose two simple, yet effective and generalizable debiasing methods. Our findings show that NLI models (e.g. BERT) trained under our proposed frameworks outperform traditional large language models while maintaining or even improving the prediction performance. 
In addition, we intend to release CodeSwitch, in hopes of promoting dialectal language diversity in training data to both reduce the discriminatory societal impacts and improve model robustness of downstream NLP tasks.", "authors": [ "Jiliang Tang", "Haochen Liu", "Jamell Dacon" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "coling-2022-10" }, { "id": "prefix-embeddings-for-in-context-machine", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.amta-research.4", "url_pdf": "https://aclanthology.org/2022.amta-research.4.pdf", "title": "Prefix Embeddings for In-context Machine Translation", "abstract": "Very large language models have been shown to translate with few-shot in-context examples. However, they have not achieved state-of-art results for translating out of English. In this work, we investigate an extremely lightweight fixed-parameter method for conditioning a large language model to better translate into the target language. Our method introduces additional embeddings, known as prefix embeddings which do not interfere with the existing weights of the model. Using unsupervised and weakly semi-supervised methods that train only 0.0001% of the model parameters, the simple method improves ~0.2-1.3 BLEU points across 3 domains and 3 languages. We analyze the resulting embeddings’ training dynamics, and where they lie in the embedding space, and show that our trained embeddings can be used for both in-context translation, and diverse generation of the target sentence.", "authors": [ "Kevin Duh", "Suzanna Sia" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "amta-2022-9" }, { "id": "surface-form-competition-why-the-highest-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.emnlp-main.564", "url_pdf": "https://aclanthology.org/2021.emnlp-main.564.pdf", "title": "Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right", "abstract": "Large language models have shown promising results in zero-shot settings. For example, they can perform multiple choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic due to surface form competition—wherein different surface forms compete for probability mass, even if they represent the same underlying concept in a given context, e.g. “computer” and “PC.” Since probability mass is finite, this lowers the probability of the correct answer, due to competition from other strings that are valid answers (but not one of the multiple choice options). We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to its a priori likelihood within the context of a specific task. 
It achieves consistent gains in zero-shot performance over both calibrated and uncalibrated scoring functions on all GPT-2 and GPT-3 models on a variety of multiple choice datasets.", "authors": [ "Luke Zettlemoyer", "Yejin Choi", "Vered Shwartz", "Peter West", "Ari Holtzman" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "emnlp-2021-11" }, { "id": "alephbert-language-model-pre-training-and-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.acl-long.4", "url_pdf": "https://aclanthology.org/2022.acl-long.4.pdf", "title": "AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level", "abstract": "Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, so far, Hebrew resources for training large language models are not of the same magnitude as their English counterparts. Second, most benchmarks available to evaluate progress in Hebrew NLP require morphological boundaries which are not available in the output of standard PLMs. In this work we remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on larger vocabulary and a larger dataset than any Hebrew PLM before. Moreover, we introduce a novel neural architecture that recovers the morphological segments encoded in contextualized embedding vectors. Based on this new morphological component we offer an evaluation suite consisting of multiple tasks and benchmarks that cover sentence-level, word-level and sub-word level analyses. On all tasks, AlephBERT obtains state-of-the-art results beyond contemporary Hebrew baselines. We make our AlephBERT model, the morphological extraction model, and the Hebrew evaluation suite publicly available, for evaluating future Hebrew PLMs.", "authors": [ "Reut Tsarfaty", "Refael Greenfeld", "Idan Brusilovsky", "Dan Bareket", "Elron Bandel", "Amit Seker" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-2022-5" }, { "id": "gender-and-representation-bias-in-gpt-3", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.nuse-1.5", "url_pdf": "https://aclanthology.org/2021.nuse-1.5.pdf", "title": "Gender and Representation Bias in GPT-3 Generated Stories", "abstract": "Using topic modeling and lexicon-based word similarity, we find that stories generated by GPT-3 exhibit many known gender stereotypes. Generated stories depict different topics and descriptions depending on GPT-3’s perceived gender of the character in a prompt, with feminine characters more likely to be associated with family and appearance, and described as less powerful than masculine characters, even when associated with high power verbs in a prompt. 
Our study raises questions on how one can avoid unintended social biases when using large language models for storytelling.", "authors": [ "David Bamman", "Li Lucy" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-nuse-2021-6" }, { "id": "craft-an-iron-sword-dynamically-generating", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.wordplay-1.3", "url_pdf": "https://aclanthology.org/2022.wordplay-1.3.pdf", "title": "Craft an Iron Sword: Dynamically Generating Interactive Game Characters by Prompting Large Language Models Tuned on Code", "abstract": "Non-Player Characters (NPCs) significantly enhance the player experience in many games. Historically, players’ interactions with NPCs have tended to be highly scripted, to be limited to natural language responses to be selected by the player, and to not involve dynamic change in game state. In this work, we demonstrate that use of a few example conversational prompts can power a conversational agent to generate both natural language and novel code. This approach can permit development of NPCs with which players can have grounded conversations that are free-form and less repetitive. We demonstrate our approach using OpenAI Codex (GPT-3 finetuned on GitHub), with Minecraft game development as our test bed. We show that with a few example prompts, a Codex-based agent can generate novel code, hold multi-turn conversations and answer questions about structured data. We evaluate this application using experienced gamers in a Minecraft realm and provide analysis of failure cases and suggest possible directions for solutions.", "authors": [ "Bill Dolan", "Akanksha Malhotra", "Olivia Deng", "Benjamin Van Durme", "Chris Brockett", "Gabriel DesGarennes", "Michael Xu", "Sudha Rao", "Ryan Volum" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-wordplay-2022-7" }, { "id": "testing-large-language-models-on", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.coling-1.359", "url_pdf": "https://aclanthology.org/2022.coling-1.359.pdf", "title": "Testing Large Language Models on Compositionality and Inference with Phrase-Level Adjective-Noun Entailment", "abstract": "Previous work has demonstrated that pre-trained large language models (LLM) acquire knowledge during pre-training which enables reasoning over relationships between words (e.g, hyponymy) and more complex inferences over larger units of meaning such as sentences. Here, we investigate whether lexical entailment (LE, i.e. hyponymy or the is a relation between words) can be generalised in a compositional manner. Accordingly, we introduce PLANE (Phrase-Level Adjective-Noun Entailment), a new benchmark to test models on fine-grained compositional entailment using adjective-noun phrases. Our experiments show that knowledge extracted via In–Context and transfer learning is not enough to solve PLANE. 
However, a LLM trained on PLANE can generalise well to out–of–distribution sets, since the required knowledge can be stored in the representations of subwords (SW) tokens.", "authors": [ "David Weir", "Julie Weeds", "Lorenzo Bertolini" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "coling-2022-10" }, { "id": "methods-for-estimating-and-improving-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.naacl-srw.6", "url_pdf": "https://aclanthology.org/2022.naacl-srw.6.pdf", "title": "Methods for Estimating and Improving Robustness of Language Models", "abstract": "Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for shallow textual relations over full semantic complexity of the problem. This proposal investigates a common denominator of this problem in their weak ability to generalise outside of the training domain. We survey diverse research directions providing estimations of model generalisation ability and find that incorporating some of these measures in the training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we present future research directions enhancing the robustness of LLMs.", "authors": [ "Michal Stefanik" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-acl-2022-7" }, { "id": "upstream-mitigation-is-not-all-you-need", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.acl-long.247", "url_pdf": "https://aclanthology.org/2022.acl-long.247.pdf", "title": "Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models", "abstract": "A few large, homogenous, pre-trained models undergird many machine learning systems — and often, these models contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis: the theory that social biases (such as stereotypes) internalized by large language models during pre-training transfer into harmful task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.", "authors": [ "Michael Wick", "Ari Kobren", "Swetasudha Panda", "Ryan Steed" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "acl-2022-5" }, { "id": "selecting-context-clozes-for-lightweight", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.bea-1.21", "url_pdf": "https://aclanthology.org/2022.bea-1.21.pdf", "title": "Selecting Context Clozes for Lightweight Reading Compliance", "abstract": "We explore a novel approach to reading compliance, leveraging large language models to select inline challenges that discourage skipping during reading. 
This lightweight ‘testing’ is accomplished through automatically identified context clozes where the reader must supply a missing word that would be hard to guess if earlier material was skipped. Clozes are selected by scoring each word by the contrast between its likelihood with and without prior sentences as context, preferring to leave gaps where this contrast is high. We report results of an initial human-participant test that indicates this method can find clozes that have this property.", "authors": [ "Michael Littman", "Greg Keim" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-bea-2022-7" }, { "id": "dont-forget-about-pronouns-removing-gender-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.gebnlp-1.3", "url_pdf": "https://aclanthology.org/2022.gebnlp-1.3.pdf", "title": "Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information", "abstract": "The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model’s embeddings and identify components encoding both types of information with probing. We aim to diminish the stereotypical bias in the representations while preserving the factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without significant deterioration of language modeling capabilities. The findings can be applied to language generation to mitigate reliance on stereotypes while preserving gender agreement in coreferences.", "authors": [ "David Mareček", "Tomasz Limisiewicz" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-gebnlp-2022-7" }, { "id": "attention-understands-semantic-relations", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.lrec-1.430", "url_pdf": "https://aclanthology.org/2022.lrec-1.430.pdf", "title": "Attention Understands Semantic Relations", "abstract": "Today, natural language processing heavily relies on pre-trained large language models. Even though such models are criticized for the poor interpretability, they still yield state-of-the-art solutions for a wide set of very different tasks. While lots of probing studies have been conducted to measure the models’ awareness of grammatical knowledge, semantic probing is less popular. In this work, we introduce the probing pipeline to study the representedness of semantic relations in transformer language models. We show that in this task, attention scores are nearly as expressive as the layers’ output activations, despite their lesser ability to represent surface cues. 
This supports the hypothesis that attention mechanisms are focusing not only on the syntactic relational information but also on the semantic one.", "authors": [ "Mikhail Burtsev", "Tatiana Shavrina", "Oleg Serikov", "Sanzhar Murzakhmetov", "Anastasia Chizhikova" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "lrec-2022-6" }, { "id": "emergent-structures-and-training-dynamics-in", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.bigscience-1.11", "url_pdf": "https://aclanthology.org/2022.bigscience-1.11.pdf", "title": "Emergent Structures and Training Dynamics in Large Language Models", "abstract": "Large language models have achieved success on a number of downstream tasks, particularly in a few and zero-shot manner. As a consequence, researchers have been investigating both the kind of information these networks learn and how such information can be encoded in the parameters of the model. We survey the literature on changes in the network during training, drawing from work outside of NLP when necessary, and on learned representations of linguistic features in large language models. We note in particular the lack of sufficient research on the emergence of functional units, subsections of the network where related functions are grouped or organised, within large language models and motivate future work that grounds the study of language models in an analysis of their changing internal structure during training time.", "authors": [ "Aaron Gokaslan", "Shachar Mirkin", "Natasha Seelam", "Eliza Szczechla", "Oleg Serikov", "Miruna Clinciu", "Ryan Teehan" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "bigscience-acl-2022-5" }, { "id": "autosumm-automatic-model-creation-for-text", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.emnlp-main.798", "url_pdf": "https://aclanthology.org/2021.emnlp-main.798.pdf", "title": "AUTOSUMM: Automatic Model Creation for Text Summarization", "abstract": "Recent efforts to develop deep learning models for text generation tasks such as extractive and abstractive summarization have resulted in state-of-the-art performances on various datasets. However, obtaining the best model configuration for a given dataset requires an extensive knowledge of deep learning specifics like model architecture, tuning parameters etc., and is often extremely challenging for a non-expert. In this paper, we propose methods to automatically create deep learning models for the tasks of extractive and abstractive text summarization. Based on the recent advances in Automated Machine Learning and the success of large language models such as BERT and GPT-2 in encoding knowledge, we use a combination of Neural Architecture Search (NAS) and Knowledge Distillation (KD) techniques to perform model search and compression using the vast knowledge provided by these language models to develop smaller, customized models for any given dataset. 
We present extensive empirical results to illustrate the effectiveness of our model creation methods in terms of inference time and model size, while achieving near state-of-the-art performances in terms of accuracy across a range of datasets.", "authors": [ "Aparna Garimella", "Niyati Chhaya", "Raj Snehal", "Sagnik Mukherjee", "Jay Mundra", "Atharv Tyagi", "Sharmila Reddy Nangi" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "emnlp-2021-11" }, { "id": "shonglap-a-large-bengali-open-domain-dialogue", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.lrec-1.623", "url_pdf": "https://aclanthology.org/2022.lrec-1.623.pdf", "title": "SHONGLAP: A Large Bengali Open-Domain Dialogue Corpus", "abstract": "We introduce SHONGLAP, a large annotated open-domain dialogue corpus in Bengali language. Due to unavailability of high-quality dialogue datasets for low-resource languages like Bengali, existing neural open-domain dialogue systems suffer from data scarcity. We propose a framework to prepare large-scale open-domain dialogue datasets from publicly available multi-party discussion podcasts, talk-shows and label them based on weak-supervision techniques which is particularly suitable for low-resource settings. Using this framework, we prepared our corpus, the first reported Bengali open-domain dialogue corpus (7.7k+ fully annotated dialogues in total) which can serve as a strong baseline for future works. Experimental results show that our corpus improves performance of large language models (BanglaBERT) in case of downstream classification tasks during fine-tuning.", "authors": [ "Shafayat Ahmed", "Md Shahrar Fatemi", "Sakib Chowdhury", "Syed Mostofa Monsur" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "lrec-2022-6" }, { "id": "evaluating-pre-trained-language-models-on", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.sdp-1.22", "url_pdf": "https://aclanthology.org/2022.sdp-1.22.pdf", "title": "Evaluating Pre-Trained Language Models on Multi-Document Summarization for Literature Reviews", "abstract": "Systematic literature reviews in the biomedical space are often expensive to conduct. Automation through machine learning and large language models could improve the accuracy and research outcomes from such reviews. In this study, we evaluate a pre-trained LongT5 model on the MSLR22: Multi-Document Summarization for Literature Reviews Shared Task datasets. We weren’t able to make any improvements on the dataset benchmark, but we do establish some evidence that current summarization metrics are insufficient in measuring summarization accuracy. 
A multi-document summarization web tool was also built to demonstrate the viability of summarization models for future investigators: https://ben-yu.github.io/summarizer", "authors": [ "Benjamin Yu" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "sdp-coling-2022-10" }, { "id": "a-data-bootstrapping-recipe-for-low-resource-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.conll-1.45", "url_pdf": "https://aclanthology.org/2021.conll-1.45.pdf", "title": "A Data Bootstrapping Recipe for Low-Resource Multilingual Relation Classification", "abstract": "Relation classification (sometimes called ‘extraction’) requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages, because they are syntactically and morphologically diverse, as well as different from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well-served by public data sets. In response, we present IndoRE, a dataset with 39K entity- and relation-tagged gold sentences in three Indian languages, plus English. We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information and provides competitive monolingual relation classification. Using this system, we explore and compare transfer mechanisms between languages. In particular, we study the accuracy-efficiency tradeoff between expensive gold instances vs. translated and aligned ‘silver’ instances.", "authors": [ "Soumen Chakrabarti", "Niloy Ganguly", "Animesh Mukherjee", "Bidisha Samanta", "Arijit Nag" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "conll-emnlp-2021-11" }, { "id": "surrey-cts-nlp-at-wassa2022-an-experiment-of", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.wassa-1.29", "url_pdf": "https://aclanthology.org/2022.wassa-1.29.pdf", "title": "SURREY-CTS-NLP at WASSA2022: An Experiment of Discourse and Sentiment Analysis for the Prediction of Empathy, Distress and Emotion", "abstract": "This paper summarises the submissions our team, SURREY-CTS-NLP has made for the WASSA 2022 Shared Task for the prediction of empathy, distress and emotion. In this work, we tested different learning strategies, like ensemble learning and multi-task learning, as well as several large language models, but our primary focus was on analysing and extracting emotion-intensive features from both the essays in the training data and the news articles, to better predict empathy and distress scores from the perspective of discourse and sentiment analysis. We propose several text feature extraction schemes to compensate the small size of training examples for fine-tuning pretrained language models, including methods based on Rhetorical Structure Theory (RST) parsing, cosine similarity and sentiment score. 
Our best submissions achieve an average Pearson correlation score of 0.518 for the empathy prediction task and an F1 score of 0.571 for the emotion prediction task, indicating that using these schemes to extract emotion-intensive information can help improve model performance.", "authors": [ "Félix do Carmo", "Hadeel Saadany", "Diptesh Kanojia", "Constantin Orasan", "Shenbin Qian" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "wassa-acl-2022-5" }, { "id": "unmet-creativity-support-needs-in", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.in2writing-1.11", "url_pdf": "https://aclanthology.org/2022.in2writing-1.11.pdf", "title": "Unmet Creativity Support Needs in Computationally Supported Creative Writing", "abstract": "Large language models (LLMs) enabled by the datasets and computing power of the last decade have recently gained popularity for their capacity to generate plausible natural language text from human-provided prompts. This ability makes them appealing to fiction writers as prospective co-creative agents, addressing the common challenge of writer’s block, or getting unstuck. However, creative writers face additional challenges, including maintaining narrative consistency, developing plot structure, architecting reader experience, and refining their expressive intent, which are not well-addressed by current LLM-backed tools. In this paper, we define these needs by grounding them in cognitive and theoretical literature, then survey previous computational narrative research that holds promise for supporting each of them in a co-creative setting.", "authors": [ "Chris Martens", "Max Kreminski" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "in2writing-acl-2022-5" }, { "id": "the-covid-that-wasnt-counterfactual", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.latechclfl-1.11", "url_pdf": "https://aclanthology.org/2022.latechclfl-1.11.pdf", "title": "The COVID That Wasn’t: Counterfactual Journalism Using GPT", "abstract": "In this paper, we explore the use of large language models to assess human interpretations of real world events. To do so, we use a language model trained prior to 2020 to artificially generate news articles concerning COVID-19 given the headlines of actual articles written during the pandemic. We then compare stylistic qualities of our artificially generated corpus with a news corpus, in this case 5,082 articles produced by CBC News between January 23 and May 5, 2020. We find our artificially generated articles exhibits a considerably more negative attitude towards COVID and a significantly lower reliance on geopolitical framing. 
Our methods and results hold importance for researchers seeking to simulate large scale cultural processes via recent breakthroughs in text generation.", "authors": [ "Andrew Piper", "Sil Hamilton" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "latechclfl-coling-2022-10" }, { "id": "show-dont-tell-demonstrations-outperform", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.naacl-main.336", "url_pdf": "https://aclanthology.org/2022.naacl-main.336.pdf", "title": "Show, Don’t Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue", "abstract": "Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don’t Tell, which prompts seq2seq models with a labeled example dialogue to show the semantics of schema elements rather than tell the model through descriptions. While requiring similar effort from service developers as generating descriptions, we show that using short examples as schema representations with large language models results in state-of-the-art performance on two popular dialogue state tracking benchmarks designed to measure zero-shot generalization - the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark.", "authors": [ "Yonghui Wu", "Abhinav Rastogi", "Yuan Cao", "Jeffrey Zhao", "Harrison Lee", "Raghav Gupta" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-2022-7" }, { "id": "hyperparameter-power-impact-in-transformer", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2021.sustainlp-1.12", "url_pdf": "https://aclanthology.org/2021.sustainlp-1.12.pdf", "title": "Hyperparameter Power Impact in Transformer Language Model Training", "abstract": "Training large language models can consume a large amount of energy. We hypothesize that the language model’s configuration impacts its energy consumption, and that there is room for power consumption optimisation in modern large language models. To investigate these claims, we introduce a power consumption factor to the objective function, and explore the range of models and hyperparameter configurations that affect power. We identify multiple configuration factors that can reduce power consumption during language model training while retaining model quality.", "authors": [ "Leon Derczynski", "Timmie Rantzau", "Mads Guldborg Kjeldgaard Kongsbak", "Lucas Høyberg Puvis de Chavannes" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "emnlp-sustainlp-2021-11" }, { "id": "nozza-lt-edi-acl2022-ensemble-modeling-for", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.ltedi-1.37", "url_pdf": "https://aclanthology.org/2022.ltedi-1.37.pdf", "title": "Nozza@LT-EDI-ACL2022: Ensemble Modeling for Homophobia and Transphobia Detection", "abstract": "In this paper, we describe our approach for the task of homophobia and transphobia detection in English social media comments. The dataset consists of YouTube comments, and it has been released for the shared task on Homophobia/Transphobia Detection in social media comments. 
Given the high class imbalance, we propose a solution based on data augmentation and ensemble modeling. We fine-tuned different large language models (BERT, RoBERTa, and HateBERT) and used the weighted majority vote on their predictions.Our proposed model obtained 0.48 and 0.94 for macro and weighted F1-score, respectively, ranking at the third position.", "authors": [ "Debora Nozza" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "ltedi-acl-2022-5" }, { "id": "rnre-nlp-at-semeval-2022-task-4-patronizing", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.semeval-1.49", "url_pdf": "https://aclanthology.org/2022.semeval-1.49.pdf", "title": "RNRE-NLP at SemEval-2022 Task 4: Patronizing and Condescending Language Detection", "abstract": "An understanding of patronizing and condescending language detection is an important part of identifying and addressing discrimination and prejudice in various forms of communication. In this paper, we investigate several methods for detecting patronizing and condescending language in short statements as part of SemEval-2022 Task 4. For Task 1a, we investigate applying both lightweight (tree-based and linear) machine learning classification models and fine-tuned pre-trained large language models. Our final system achieves an F1-score of 0.4321, recall-score of 0.5016, and a precision-score of 0.3795 (ranked 53 / 78) on Task 1a.", "authors": [ "Nathan Chi", "Ethan Chi", "Rylan Yang" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "semeval-naacl-2022-7" }, { "id": "caisa-at-wassa-2022-adapter-tuning-for", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.wassa-1.31", "url_pdf": "https://aclanthology.org/2022.wassa-1.31.pdf", "title": "CAISA at WASSA 2022: Adapter-Tuning for Empathy Prediction", "abstract": "We build a system that leverages adapters, a light weight and efficient method for leveraging large language models to perform the task Em- pathy and Distress prediction tasks for WASSA 2022. In our experiments, we find that stacking our empathy and distress adapters on a pre-trained emotion lassification adapter performs best compared to full fine-tuning approaches and emotion feature concatenation. We make our experimental code publicly available", "authors": [ "Lucie Flek", "Charles Welch", "Allison Lahnala" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "wassa-acl-2022-5" }, { "id": "measuring-harmful-sentence-completion-in", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.ltedi-1.4", "url_pdf": "https://aclanthology.org/2022.ltedi-1.4.pdf", "title": "Measuring Harmful Sentence Completion in Language Models for LGBTQIA+ Individuals", "abstract": "Current language technology is ubiquitous and directly influences individuals’ lives worldwide. Given the recent trend in AI on training and constantly releasing new and powerful large language models (LLMs), there is a need to assess their biases and potential concrete consequences. While some studies have highlighted the shortcomings of these models, there is only little on the negative impact of LLMs on LGBTQIA+ individuals. In this paper, we investigated a state-of-the-art template-based approach for measuring the harmfulness of English LLMs sentence completion when the subjects belong to the LGBTQIA+ community. 
Our findings show that, on average, the most likely LLM-generated completion is an identity attack 13% of the time. Our results raise serious concerns about the applicability of these models in production environments.", "authors": [ "Dirk Hovy", "Anne Lauscher", "Federico Bianchi", "Debora Nozza" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "ltedi-acl-2022-5" }, { "id": "pipelines-for-social-bias-testing-of-large", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.bigscience-1.6", "url_pdf": "https://aclanthology.org/2022.bigscience-1.6.pdf", "title": "Pipelines for Social Bias Testing of Large Language Models", "abstract": "The maturity level of language models is now at a stage in which many companies rely on them to solve various tasks. However, while research has shown how biased and harmful these models are, systematic ways of integrating social bias tests into development pipelines are still lacking. This short paper suggests how to use these verification techniques in development pipelines. We take inspiration from software testing and suggest addressing social bias evaluation as software testing. We hope to open a discussion on the best methodologies to handle social bias testing in language models.", "authors": [ "Dirk Hovy", "Federico Bianchi", "Debora Nozza" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "bigscience-acl-2022-5" }, { "id": "do-data-based-curricula-work-1", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.insights-1.16", "url_pdf": "https://aclanthology.org/2022.insights-1.16.pdf", "title": "Do Data-based Curricula Work?", "abstract": "Current state-of-the-art NLP systems use large neural networks that require extensive computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning - sequencing tasks (task-based curricula) or ordering and sampling the datasets (data-based curricula) that facilitate training. This work investigates the benefits of data-based curriculum learning for large language models such as BERT and T5. We experiment with various curricula based on complexity measures and different sampling strategies. Extensive experiments on several NLP tasks show that curricula based on various complexity measures rarely have any benefits, while random sampling performs either as well or better than curricula.", "authors": [ "Ivan Yamshchikov", "Vladislav Mosin", "Maxim Surkov" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "insights-acl-2022-5" }, { "id": "multimodal-large-language-models-for", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.naacl-srw.26", "url_pdf": "https://aclanthology.org/2022.naacl-srw.26.pdf", "title": "Multimodal large language models for inclusive collaboration learning tasks", "abstract": "This PhD project leverages advancements in multimodal large language models to build an inclusive collaboration feedback loop, in order to facilitate the automated detection, modeling, and feedback for participants developing general collaboration skills. 
This topic is important given the role of collaboration as an essential 21st century skill, the potential to ground large language models within learning theory and real-world practice, and the expressive potential of transformer models to support equity and inclusion. We address some concerns of integrating advances in natural language processing into downstream tasks such as the learning analytics feedback loop.", "authors": [ "Armanda Lewis" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-acl-2022-7" }, { "id": "you-dont-know-my-favorite-color-preventing", "arxiv_id": null, "nips_id": null, "url_abs": "https://aclanthology.org/2022.naacl-main.429", "url_pdf": "https://aclanthology.org/2022.naacl-main.429.pdf", "title": "You Don’t Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers’ Private Personas", "abstract": "Social chatbots, also known as chit-chat chatbots, evolve rapidly with large pretrained language models. Despite the huge progress, privacy concerns have arisen recently: training data of large language models can be extracted via model inversion attacks. On the other hand, the datasets used for training chatbots contain many private conversations between two individuals. In this work, we further investigate the privacy leakage of the hidden states of chatbots trained by language modeling which has not been well studied yet. We show that speakers’ personas can be inferred through a simple neural network with high accuracy. To this end, we propose effective defense objectives to protect persona leakage from hidden states. We conduct extensive experiments to demonstrate that our proposed defense objectives can greatly reduce the attack accuracy from 37.6% to 0.5%. Meanwhile, the proposed objectives preserve language models’ powerful generation ability.", "authors": [ "Lixin Fan", "Yangqiu Song", "Haoran Li" ], "published": null, "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "naacl-2022-7" }, { "id": "sslam-enhancing-self-supervised-models-with", "arxiv_id": null, "nips_id": null, "url_abs": "https://openreview.net/forum?id=odU59TxdiB", "url_pdf": "https://openreview.net/forum?id=odU59TxdiB", "title": "SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes", "abstract": "Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the self-supervised pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio self-supervised learning (SSL) methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds, and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. 
To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve the model’s ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against state-of-the-art (SOTA) methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on the AudioSet-2M(AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1%(mAP). These results demonstrate SSLAM's effectiveness in both polyphonic and monophonic soundscapes, significantly enhancing the performance of audio SSL models.", "authors": [ "Philip J B Jackson", "Muhammad Awais", "Armin Mustafa", "Sara Atito", "Tony Alex" ], "published": "2025-04-28", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "iclr-2025-4" }, { "id": "mask-enhanced-autoregressive-prediction-pay", "arxiv_id": "2502.07490", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.07490v1", "url_pdf": "https://arxiv.org/pdf/2502.07490v1.pdf", "title": "Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More", "abstract": "Large Language Models (LLMs) are discovered to suffer from accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs the standard next-token prediction autoregressive using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. 
These findings position MEAP as a promising training paradigm for large language models.", "authors": [ "Shiwei Liu", "Zheng Cao", "Li Shen", "Zhenyu Zhang", "Jianjin Li", "Zhikai Jia", "Xialie Zhuang" ], "published": "2025-02-11", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "jbshield-defending-large-language-models-from", "arxiv_id": "2502.07557", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.07557v1", "url_pdf": "https://arxiv.org/pdf/2502.07557v1.pdf", "title": "JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation", "abstract": "Despite the implementation of safety alignment strategies, large language models (LLMs) remain vulnerable to jailbreak attacks, which undermine these safety guardrails and pose significant security threats. Some defenses have been proposed to detect or mitigate jailbreaks, but they are unable to withstand the test of time due to an insufficient understanding of jailbreak mechanisms. In this work, we investigate the mechanisms behind jailbreaks based on the Linear Representation Hypothesis (LRH), which states that neural networks encode high-level concepts as subspaces in their hidden representations. We define the toxic semantics in harmful and jailbreak prompts as toxic concepts and describe the semantics in jailbreak prompts that manipulate LLMs to comply with unsafe requests as jailbreak concepts. Through concept extraction and analysis, we reveal that LLMs can recognize the toxic concepts in both harmful and jailbreak prompts. However, unlike harmful prompts, jailbreak prompts activate the jailbreak concepts and alter the LLM output from rejection to compliance. Building on our analysis, we propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection JBShield-D and mitigation JBShield-M. JBShield-D identifies jailbreak prompts by determining whether the input activates both toxic and jailbreak concepts. When a jailbreak prompt is detected, JBShield-M adjusts the hidden representations of the target LLM by enhancing the toxic concept and weakening the jailbreak concept, ensuring LLMs produce safe content. Extensive experiments demonstrate the superior performance of JBShield, achieving an average detection accuracy of 0.95 and reducing the average attack success rate of various jailbreak attacks to 2% from 61% across distinct LLMs.", "authors": [], "published": "2025-02-11", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "leveraging-gpt-4o-efficiency-for-detecting", "arxiv_id": "2502.06918", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06918v1", "url_pdf": "https://arxiv.org/pdf/2502.06918v1.pdf", "title": "Leveraging GPT-4o Efficiency for Detecting Rework Anomaly in Business Processes", "abstract": "This paper investigates the effectiveness of GPT-4o-2024-08-06, one of the Large Language Models (LLM) from OpenAI, in detecting business process anomalies, with a focus on rework anomalies. In our study, we developed a GPT-4o-based tool capable of transforming event logs into a structured format and identifying reworked activities within business event logs. The analysis was performed on a synthetic dataset designed to contain rework anomalies but free of loops. To evaluate the anomaly detection capabilities of GPT 4o-2024-08-06, we used three prompting techniques: zero-shot, one-shot, and few-shot. 
These techniques were tested on different anomaly distributions, namely normal, uniform, and exponential, to identify the most effective approach for each case. The results demonstrate the strong performance of GPT-4o-2024-08-06. On our dataset, the model achieved 96.14% accuracy with one-shot prompting for the normal distribution, 97.94% accuracy with few-shot prompting for the uniform distribution, and 74.21% accuracy with few-shot prompting for the exponential distribution. These results highlight the model's potential as a reliable tool for detecting rework anomalies in event logs and how anomaly distribution and prompting strategy influence the model's performance.", "authors": [ "Fatemeh Mohammadi", "Paolo Ceravolo", "Mohammad Derakhshan" ], "published": "2025-02-10", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "saving-77-of-the-parameters-in-large-language", "arxiv_id": null, "nips_id": null, "url_abs": "https://www.researchgate.net/publication/388835829_SAVING_77_OF_THE_PARAMETERS_IN_LARGE_LANGUAGE_MODELS_TECHNICAL_REPORT", "url_pdf": "https://www.researchgate.net/publication/388835829_SAVING_77_OF_THE_PARAMETERS_IN_LARGE_LANGUAGE_MODELS_TECHNICAL_REPORT", "title": "Saving 77% of the Parameters in Large Language Models Technical Report", "abstract": "This technical report demonstrates that large language models (LLMs) can maintain their learning capacity while reducing their non-embedding parameters by up to 77%. We achieve this by adapting a parameter reduction technique originally developed for computer vision, replacing dense layers with an optimized subnetwork that contains grouped pointwise convolutions. Using Microsoft's phi-3-mini-4k-instruct as our baseline, we show that our optimized model (kphi-3) achieves comparable validation loss while using only 15-23% of the original non-embedding parameters. All experiments were conducted on a single NVIDIA L2 GPU within a 3-day timeframe, supporting the democratization of AI research. Our findings suggest that current LLM architectures may be substantially overparameterized, opening possibilities for more efficient model training and deployment.", "authors": [ "Alejandra Rojas-Gómez", "Joao Paulo Schwarz Schuler" ], "published": "2025-02-09", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": "researchgate-net-2025-2" }, { "id": "fact-or-fair-a-checklist-for-behavioral", "arxiv_id": "2502.05849", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05849v1", "url_pdf": "https://arxiv.org/pdf/2502.05849v1.pdf", "title": "Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries", "abstract": "The generation of incorrect images, such as depictions of people of color in Nazi-era uniforms by Gemini, frustrated users and harmed Google's reputation, motivating us to investigate the relationship between accurately reflecting factuality and promoting diversity and equity. In this study, we focus on 19 real-world statistics collected from authoritative sources. Using these statistics, we develop a checklist comprising objective and subjective queries to analyze behavior of large language models (LLMs) and text-to-image (T2I) models. Objective queries assess the models' ability to provide accurate world knowledge. In contrast, the design of subjective queries follows a key principle: statistical or experiential priors should not be overgeneralized to individuals, ensuring that models uphold diversity. 
These subjective queries are based on three common human cognitive errors that often result in social biases. We propose metrics to assess factuality and fairness, and formally prove the inherent trade-off between these two aspects. Results show that GPT-4o and DALL-E 3 perform notably well among six LLMs and four T2I models. Our code is publicly available at https://github.com/uclanlp/Fact-or-Fair.", "authors": [ "Michael R. Lyu", "Kai-Wei Chang", "Wenxuan Wang", "Yixin Wan", "Linqi Liu", "Yuhang Yan", "Jen-tse Huang" ], "published": "2025-02-09", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "codesim-multi-agent-code-generation-and", "arxiv_id": null, "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05664", "url_pdf": "https://arxiv.org/pdf/2502.05664", "title": "CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging", "abstract": "Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. As humans verify their understanding of an algorithm through visual simulation, CodeSim uniquely features a method of plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art (pass@1) results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework in this link (https://kagnlp.github.io/codesim.github.io/).", "authors": [ "Md Rizwan Parvez", "Mohammed Eunus Ali", "Md. Ashraful Islam" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "elmtex-fine-tuning-large-language-models-for", "arxiv_id": "2502.05638", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05638v1", "url_pdf": "https://arxiv.org/pdf/2502.05638v1.pdf", "title": "ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports", "abstract": "Europe's healthcare systems require enhanced interoperability and digitalization, driving a demand for innovative solutions to process legacy clinical data. This paper presents the results of our project, which aims to leverage Large Language Models (LLMs) to extract structured information from unstructured clinical reports, focusing on patient history, diagnoses, treatments, and other predefined categories. We developed a workflow with a user interface and evaluated LLMs of varying sizes through prompting strategies and fine-tuning.
Our results show that fine-tuned smaller models match or surpass larger counterparts in performance, offering efficiency for resource-limited settings. A new dataset of 60,000 annotated English clinical summaries and 24,000 German translations was validated with automated and manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. The work highlights the approach's viability and outlines future improvements.", "authors": [ "Carlos A Velasco", "Yehya Mohamad", "Jahid Hasan Polash", "Florim Hamiti", "Zeyd Boukhers", "Naguib Heiba", "Aynur Guluzade" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "multimodal-cognitive-reframing-therapy-via", "arxiv_id": "2502.06873", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06873v1", "url_pdf": "https://arxiv.org/pdf/2502.06873v1.pdf", "title": "Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning", "abstract": "Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, their focus was primarily on text-based methods, often overlooking the importance of non-verbal evidence crucial in real-life therapy. To alleviate this gap, we extend the textual cognitive reframing to multimodality, incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions lead to interpreting implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.", "authors": [ "Gary Geunbae Lee", "Heejin Do", "Hoonrae Kim", "Subin Kim" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "related-knowledge-perturbation-matters", "arxiv_id": "2502.06868", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06868v1", "url_pdf": "https://arxiv.org/pdf/2502.06868v1.pdf", "title": "Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject", "abstract": "Knowledge editing has become a promising approach for efficiently and precisely updating knowledge embedded in large language models (LLMs). In this work, we focus on Same-Subject Editing, which involves modifying multiple attributes of a single entity to ensure comprehensive and consistent updates to entity-centric knowledge. Through preliminary observation, we identify a significant challenge: Current state-of-the-art editing methods struggle when tasked with editing multiple related knowledge pieces for the same subject. To address the lack of relevant editing data for identical subjects in traditional benchmarks, we introduce the $\\text{S}^2\\text{RKE}$(Same-Subject Related Knowledge Editing) benchmark. 
Our extensive experiments reveal that only mainstream locate-then-edit methods, such as ROME and MEMIT, exhibit \"related knowledge perturbation,\" where subsequent edits interfere with earlier ones. Further analysis reveals that these methods over-rely on subject information, neglecting other critical factors, resulting in reduced editing effectiveness.", "authors": [ "Xueqi Cheng", "HuaWei Shen", "Jie Zhang", "Shaoling Jing", "Yinghan Shen", "Zhiyi Yin", "Wenbin Duan", "Zenghao Duan" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "the-complexity-of-learning-sparse-superposed", "arxiv_id": "2502.05407", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05407v1", "url_pdf": "https://arxiv.org/pdf/2502.05407v1.pdf", "title": "The Complexity of Learning Sparse Superposed Features with Feedback", "abstract": "The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative \\textit{triplet comparisons}. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.", "authors": [ "Akash Kumar" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "stride-automating-reward-design-deep", "arxiv_id": "2502.04692", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04692v1", "url_pdf": "https://arxiv.org/pdf/2502.04692v1.pdf", "title": "STRIDE: Automating Reward Design, Deep Reinforcement Learning Training and Feedback Optimization in Humanoid Robotics Locomotion", "abstract": "Humanoid robotics presents significant challenges in artificial intelligence, requiring precise coordination and control of high-degree-of-freedom systems. Designing effective reward functions for deep reinforcement learning (DRL) in this domain remains a critical bottleneck, demanding extensive manual effort, domain expertise, and iterative refinement. To overcome these challenges, we introduce STRIDE, a novel framework built on agentic engineering to automate reward design, DRL training, and feedback optimization for humanoid robot locomotion tasks. By combining the structured principles of agentic engineering with large language models (LLMs) for code-writing, zero-shot generation, and in-context optimization, STRIDE generates, evaluates, and iteratively refines reward functions without relying on task-specific prompts or templates. Across diverse environments featuring humanoid robot morphologies, STRIDE outperforms the state-of-the-art reward design framework EUREKA, achieving significant improvements in efficiency and task performance. 
Using STRIDE-generated rewards, simulated humanoid robots achieve sprint-level locomotion across complex terrains, highlighting its ability to advance DRL workflows and humanoid robotics research.", "authors": [ "Luhui Hu", "Yueting Zhuang", "Yunxin Liu", "Yuxiao Chen", "Jinxiong Lu", "Zhenwei Wu" ], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "a-lightweight-method-to-disrupt-memorized", "arxiv_id": "2502.05159", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05159v1", "url_pdf": "https://arxiv.org/pdf/2502.05159v1.pdf", "title": "A Lightweight Method to Disrupt Memorized Sequences in LLM", "abstract": "Large language models (LLMs) demonstrate impressive capabilities across many tasks yet risk reproducing copyrighted content verbatim, raising legal and ethical concerns. Although methods like differential privacy or neuron editing can reduce memorization, they typically require costly retraining or direct access to model weights and may degrade performance. To address these challenges, we propose TokenSwap, a lightweight, post-hoc approach that replaces the probabilities of grammar-related tokens with those from a small auxiliary model (e.g., DistilGPT-2). We run extensive experiments on commercial-grade models such as Pythia-6.9b and LLaMA-3-8b and demonstrate that our method effectively reduces well-known cases of memorized generation by up to 10x with little to no impact on downstream tasks. Our approach offers a uniquely accessible and effective solution to users of real-world systems.", "authors": [ "Babak Salimi", "Kaustubh Ponkshe", "Parjanya Prajakta Prashant" ], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "every-software-as-an-agent-blueprint-and-case", "arxiv_id": "2502.04747", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04747v1", "url_pdf": "https://arxiv.org/pdf/2502.04747v1.pdf", "title": "Every Software as an Agent: Blueprint and Case Study", "abstract": "The rise of (multimodal) large language models (LLMs) has shed light on software agents -- where software can understand and follow user instructions in natural language. However, existing approaches such as API-based and GUI-based agents are far from satisfactory in terms of accuracy and efficiency. Instead, we advocate to endow LLMs with access to the software internals (source code and runtime context) and the permission to dynamically inject generated code into software for execution. In such a whitebox setting, one may better leverage the software context and the coding ability of LLMs. We then present an overall design architecture and case studies on two popular web-based desktop applications. We also give an in-depth discussion of the challenges and future directions.
We deem that such a new paradigm has the potential to fundamentally overturn the existing software agent design, and finally create a digital world in which software can comprehend, operate, collaborate, and even think to meet complex user needs.", "authors": [ "Mengwei Xu" ], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "mobile-network-specialized-large-language", "arxiv_id": "2502.04933", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04933v1", "url_pdf": "https://arxiv.org/pdf/2502.04933v1.pdf", "title": "Mobile Network-specialized Large Language Models for 6G: Architectures, Innovations, Challenges, and Future Trends", "abstract": "Conventional 5G network management mechanisms, which operate in isolated silos across different network segments, will experience significant limitations in handling the unprecedented hyper-complexity and massive scale of the sixth generation (6G). Holistic intelligence and end-to-end automation are, thus, positioned as key enablers of forthcoming 6G networks. The Large Language Model (LLM) technology, a major breakthrough in the Generative Artificial Intelligence (AI) field, enjoys robust human-like language processing, advanced contextual reasoning and multi-modal capabilities. These features foster a holistic understanding of network behavior and autonomous decision-making. This paper investigates four possible architectural designs for integrated LLM and 6G networks, detailing the inherent technical intricacies, the merits and the limitations of each design. As an internal functional building block of future 6G networks, the LLM will natively benefit from improved design-driven security policies from the early design and specification stages. An illustrative scenario of slicing conflicts is used to prove the effectiveness of our architectural framework in autonomously dealing with complicated network anomalies. We finally conclude the paper with an overview of the key challenges and the relevant research trends for enabling Mobile Network-specialized LLMs. This study is intended to provide Mobile Network Operators (MNOs) with comprehensive guidance on their path towards embracing the LLM technology.", "authors": [], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "enhancing-phishing-email-identification-with", "arxiv_id": "2502.04759", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04759v1", "url_pdf": "https://arxiv.org/pdf/2502.04759v1.pdf", "title": "Enhancing Phishing Email Identification with Large Language Models", "abstract": "Phishing has long been a common tactic used by cybercriminals and continues to pose a significant threat in today's digital world. As phishing attacks become more advanced and sophisticated, there is an increasing need for effective methods to detect and prevent them. To address the challenging problem of detecting phishing emails, researchers have developed numerous solutions, in particular those based on machine learning (ML) algorithms. In this work, we take steps to study the efficacy of large language models (LLMs) in detecting phishing emails.
The experiments show that the LLM achieves a high accuracy rate with high precision; importantly, it also provides interpretable evidence for its decisions.", "authors": [ "Catherine Lee" ], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "mographgpt-creating-interactive-scenes-using", "arxiv_id": "2502.04983", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04983v1", "url_pdf": "https://arxiv.org/pdf/2502.04983v1.pdf", "title": "MoGraphGPT: Creating Interactive Scenes Using Modular LLM and Graphical Control", "abstract": "Creating interactive scenes often involves complex programming tasks. Although large language models (LLMs) like ChatGPT can generate code from natural language, their output is often error-prone, particularly when scripting interactions among multiple elements. The linear conversational structure limits the editing of individual elements, and the lack of graphical and precise control complicates visual integration. To address these issues, we integrate an element-level modularization technique that processes textual descriptions for individual elements through separate LLM modules, with a central module managing interactions among elements. This modular approach allows for refining each element independently. We design a graphical user interface, MoGraphGPT, which combines modular LLMs with enhanced graphical control to generate code for 2D interactive scenes. It enables direct integration of graphical information and offers quick, precise control through automatically generated sliders. Our comparative evaluation against an AI coding tool, Cursor Composer, as the baseline system, together with a usability study, shows that MoGraphGPT significantly improves ease of use, controllability, and refinement in creating complex 2D interactive scenes with multiple visual elements in a coding-free manner.", "authors": [], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null } ] }{ "count": 24708, "next": "
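For readers who want to consume this paginated listing programmatically, the sketch below walks the response by following the "next" URL and reads the per-paper fields shown in the results entries (id, title, abstract, authors, published, proceeding). It is a minimal illustration, not part of the API documentation itself: the use of the Python requests library, the absence of authentication, the hard-coded base URL constant, and the helper name iter_papers are assumptions made for this example.

import requests

BASE_URL = "https://paperswithcode.com/api/v1/papers/"  # assumed public endpoint

def iter_papers(query, ordering="-published", max_pages=3):
    """Yield paper dicts page by page, following the response's 'next' link."""
    url, params = BASE_URL, {"q": query, "ordering": ordering}
    for _ in range(max_pages):
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        # Each result carries the fields shown above: id, arxiv_id, title,
        # abstract, authors, published, conference, proceeding, ...
        yield from payload.get("results", [])
        url, params = payload.get("next"), None  # 'next' already encodes the query
        if not url:
            break

if __name__ == "__main__":
    for paper in iter_papers("Large Language Models"):
        print(paper.get("published"), "-", paper.get("title"))

Note that several fields in the entries above (for example arxiv_id, conference, and proceeding) are null for some papers, so consumers should handle absent values rather than assuming every field is populated.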