Paper List
Return a paginated listing of all papers.
GET /api/v1/papers/?ordering=-id&q=Large+Language+Models

In this example request, q supplies a free-text search query ("Large Language Models"), ordering selects the field to sort by (a leading - denotes descending order), and page, which appears in the next link of the response, selects a page of results.
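The following is a minimal sketch, in Python with the third-party requests library, of calling this endpoint and walking through all pages by following the next links; the helper name iter_papers is illustrative and not part of the API.

import requests

BASE_URL = "https://paperswithcode.com/api/v1/papers/"

def iter_papers(query, ordering="-id"):
    """Yield paper records one by one, following the paginated 'next' links."""
    url = BASE_URL
    params = {"q": query, "ordering": ordering}
    while url:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        payload = response.json()
        for paper in payload["results"]:
            yield paper
        url = payload.get("next")  # null/None on the last page
        params = None              # the 'next' URL already carries the query string

for paper in iter_papers("Large Language Models"):
    print(paper["id"], paper["title"])

Each item in results carries the fields shown below: id, arxiv_id, nips_id, url_abs, url_pdf, title, abstract, authors, published, and the conference/proceeding fields. A truncated example response follows.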
{ "count": ..., "next": "https://paperswithcode.com/api/v1/papers/?ordering=-id&page=2&q=Large+Language+Models", "previous": null, "results": [ { "id": "codesim-multi-agent-code-generation-and", "arxiv_id": null, "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05664", "url_pdf": "https://arxiv.org/pdf/2502.05664", "title": "CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging", "abstract": "Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis-planning, coding, and debugging-through a human-like perception approach. As human verifies their understanding of any algorithms through visual simulation, CodeSim uniquely features a method of plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art (pass@1) results-(HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework in this link (https://kagnlp.github.io/codesim.github.io/).", "authors": [ "Md Rizwan Parvez", "Mohammed Eunus Ali", "Md. Ashraful Islam" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "fact-or-fair-a-checklist-for-behavioral", "arxiv_id": "2502.05849", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05849v1", "url_pdf": "https://arxiv.org/pdf/2502.05849v1.pdf", "title": "Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries", "abstract": "The generation of incorrect images, such as depictions of people of color in Nazi-era uniforms by Gemini, frustrated users and harmed Google's reputation, motivating us to investigate the relationship between accurately reflecting factuality and promoting diversity and equity. In this study, we focus on 19 real-world statistics collected from authoritative sources. Using these statistics, we develop a checklist comprising objective and subjective queries to analyze behavior of large language models (LLMs) and text-to-image (T2I) models. Objective queries assess the models' ability to provide accurate world knowledge. In contrast, the design of subjective queries follows a key principle: statistical or experiential priors should not be overgeneralized to individuals, ensuring that models uphold diversity. These subjective queries are based on three common human cognitive errors that often result in social biases. We propose metrics to assess factuality and fairness, and formally prove the inherent trade-off between these two aspects. Results show that GPT-4o and DALL-E 3 perform notably well among six LLMs and four T2I models.
Our code is publicly available at https://github.com/uclanlp/Fact-or-Fair.", "authors": [ "Michael R. Lyu", "Kai-Wei Chang", "Wenxuan Wang", "Yixin Wan", "Linqi Liu", "Yuhang Yan", "Jen-tse Huang" ], "published": "2025-02-09", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "leveraging-gpt-4o-efficiency-for-detecting", "arxiv_id": "2502.06918", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06918v1", "url_pdf": "https://arxiv.org/pdf/2502.06918v1.pdf", "title": "Leveraging GPT-4o Efficiency for Detecting Rework Anomaly in Business Processes", "abstract": "This paper investigates the effectiveness of GPT-4o-2024-08-06, one of the Large Language Models (LLM) from OpenAI, in detecting business process anomalies, with a focus on rework anomalies. In our study, we developed a GPT-4o-based tool capable of transforming event logs into a structured format and identifying reworked activities within business event logs. The analysis was performed on a synthetic dataset designed to contain rework anomalies but free of loops. To evaluate the anomaly detection capabilities of GPT 4o-2024-08-06, we used three prompting techniques: zero-shot, one-shot, and few-shot. These techniques were tested on different anomaly distributions, namely normal, uniform, and exponential, to identify the most effective approach for each case. The results demonstrate the strong performance of GPT-4o-2024-08-06. On our dataset, the model achieved 96.14% accuracy with one-shot prompting for the normal distribution, 97.94% accuracy with few-shot prompting for the uniform distribution, and 74.21% accuracy with few-shot prompting for the exponential distribution. These results highlight the model's potential as a reliable tool for detecting rework anomalies in event logs and how anomaly distribution and prompting strategy influence the model's performance.", "authors": [ "Fatemeh Mohammadi", "Paolo Ceravolo", "Mohammad Derakhshan" ], "published": "2025-02-10", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "multimodal-cognitive-reframing-therapy-via", "arxiv_id": "2502.06873", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06873v1", "url_pdf": "https://arxiv.org/pdf/2502.06873v1.pdf", "title": "Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning", "abstract": "Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, their focus was primarily on text-based methods, often overlooking the importance of non-verbal evidence crucial in real-life therapy. To alleviate this gap, we extend the textual cognitive reframing to multimodality, incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions lead to interpreting implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. 
Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.", "authors": [ "Gary Geunbae Lee", "Heejin Do", "Hoonrae Kim", "Subin Kim" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "ctr-driven-advertising-image-generation-with", "arxiv_id": "2502.06823", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06823v1", "url_pdf": "https://arxiv.org/pdf/2502.06823v1.pdf", "title": "CTR-Driven Advertising Image Generation with Multimodal Large Language Models", "abstract": "In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods generate background for products primarily focus on the aesthetic quality, which may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. Firstly, we build targeted pre-training tasks, and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation tasks. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL), which can jointly utilize multimodal features and accurately reflect user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments have demonstrated that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: https://github.com/Chenguoz/CAIG.", "authors": [ "Nong Sang", "Changxin Gao", "Xinge You", "Yuanjie Shao", "Jingping Shao", "Zhangang Lin", "Junjie Shen", "Jingjing Lv", "Zheng Zhang", "Yu Li", "Jinyuan Zhao", "Yaoyu Li", "Linkai Liu", "Haohan Wang", "Yanyin Chen", "Weizhen Wang", "Zhenbang Du", "Wei Feng", "Xingye Chen" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "mask-enhanced-autoregressive-prediction-pay", "arxiv_id": "2502.07490", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.07490v1", "url_pdf": "https://arxiv.org/pdf/2502.07490v1.pdf", "title": "Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More", "abstract": "Large Language Models (LLMs) are discovered to suffer from accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs the standard next-token prediction autoregressive using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. 
Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.", "authors": [ "Shiwei Liu", "Zheng Cao", "Li Shen", "Zhenyu Zhang", "Jianjin Li", "Zhikai Jia", "Xialie Zhuang" ], "published": "2025-02-11", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "related-knowledge-perturbation-matters", "arxiv_id": "2502.06868", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06868v1", "url_pdf": "https://arxiv.org/pdf/2502.06868v1.pdf", "title": "Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject", "abstract": "Knowledge editing has become a promising approach for efficiently and precisely updating knowledge embedded in large language models (LLMs). In this work, we focus on Same-Subject Editing, which involves modifying multiple attributes of a single entity to ensure comprehensive and consistent updates to entity-centric knowledge. Through preliminary observation, we identify a significant challenge: Current state-of-the-art editing methods struggle when tasked with editing multiple related knowledge pieces for the same subject. To address the lack of relevant editing data for identical subjects in traditional benchmarks, we introduce the $\\text{S}^2\\text{RKE}$(Same-Subject Related Knowledge Editing) benchmark. Our extensive experiments reveal that only mainstream locate-then-edit methods, such as ROME and MEMIT, exhibit \"related knowledge perturbation,\" where subsequent edits interfere with earlier ones. Further analysis reveals that these methods over-rely on subject information, neglecting other critical factors, resulting in reduced editing effectiveness.", "authors": [ "Xueqi Cheng", "HuaWei Shen", "Jie Zhang", "Shaoling Jing", "Yinghan Shen", "Zhiyi Yin", "Wenbin Duan", "Zenghao Duan" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "jbshield-defending-large-language-models-from", "arxiv_id": "2502.07557", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.07557v1", "url_pdf": "https://arxiv.org/pdf/2502.07557v1.pdf", "title": "JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation", "abstract": "Despite the implementation of safety alignment strategies, large language models (LLMs) remain vulnerable to jailbreak attacks, which undermine these safety guardrails and pose significant security threats. Some defenses have been proposed to detect or mitigate jailbreaks, but they are unable to withstand the test of time due to an insufficient understanding of jailbreak mechanisms. 
In this work, we investigate the mechanisms behind jailbreaks based on the Linear Representation Hypothesis (LRH), which states that neural networks encode high-level concepts as subspaces in their hidden representations. We define the toxic semantics in harmful and jailbreak prompts as toxic concepts and describe the semantics in jailbreak prompts that manipulate LLMs to comply with unsafe requests as jailbreak concepts. Through concept extraction and analysis, we reveal that LLMs can recognize the toxic concepts in both harmful and jailbreak prompts. However, unlike harmful prompts, jailbreak prompts activate the jailbreak concepts and alter the LLM output from rejection to compliance. Building on our analysis, we propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection JBShield-D and mitigation JBShield-M. JBShield-D identifies jailbreak prompts by determining whether the input activates both toxic and jailbreak concepts. When a jailbreak prompt is detected, JBShield-M adjusts the hidden representations of the target LLM by enhancing the toxic concept and weakening the jailbreak concept, ensuring LLMs produce safe content. Extensive experiments demonstrate the superior performance of JBShield, achieving an average detection accuracy of 0.95 and reducing the average attack success rate of various jailbreak attacks to 2% from 61% across distinct LLMs.", "authors": [], "published": "2025-02-11", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "llm-supported-natural-language-to-bash", "arxiv_id": "2502.06858", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.06858v1", "url_pdf": "https://arxiv.org/pdf/2502.06858v1.pdf", "title": "LLM-Supported Natural Language to Bash Translation", "abstract": "The Bourne-Again Shell (Bash) command-line interface for Linux systems has complex syntax and requires extensive specialized knowledge. Using the natural language to Bash command (NL2SH) translation capabilities of large language models (LLMs) for command composition circumvents these issues. However, the NL2SH performance of LLMs is difficult to assess due to inaccurate test data and unreliable heuristics for determining the functional equivalence of Bash commands. We present a manually verified test dataset of 600 instruction-command pairs and a training dataset of 40,939 pairs, increasing the size of previous datasets by 441% and 135%, respectively. Further, we present a novel functional equivalence heuristic that combines command execution with LLM evaluation of command outputs. Our heuristic can determine the functional equivalence of two Bash commands with 95% confidence, a 16% increase over previous heuristics. Evaluation of popular LLMs using our test dataset and heuristic demonstrates that parsing, in-context learning, in-weight learning, and constrained decoding can improve NL2SH accuracy by up to 32%. Our findings emphasize the importance of dataset quality, execution-based evaluation and translation method for advancing NL2SH translation. 
Our code is available at https://github.com/westenfelder/NL2SH", "authors": [ "Silviu Chiricescu", "Una-May O'Reilly", "Stephen Moskal", "Miguel Tulla", "Erik Hemberg", "Finnian Westenfelder" ], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "on-fairness-of-unified-multimodal-large", "arxiv_id": "2502.03429", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03429v1", "url_pdf": "https://arxiv.org/pdf/2502.03429v1.pdf", "title": "On Fairness of Unified Multimodal Large Language Model for Image Generation", "abstract": "Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in visual understanding and generation in an end-to-end pipeline. Compared with generation-only models (e.g., Stable Diffusion), U-MLLMs may raise new questions about bias in their outputs, which can be affected by their unified capabilities. This gap is particularly concerning given the under-explored risk of propagating harmful stereotypes. In this paper, we benchmark the latest U-MLLMs and find that most exhibit significant demographic biases, such as gender and race bias. To better understand and mitigate this issue, we propose a locate-then-fix strategy, where we audit and show how the individual model component is affected by bias. Our analysis shows that bias originates primarily from the language model. More interestingly, we observe a \"partial alignment\" phenomenon in U-MLLMs, where understanding bias appears minimal, but generation bias remains substantial. Thus, we propose a novel balanced preference model to balance the demographic distribution with synthetic data. Experiments demonstrate that our approach reduces demographic bias while preserving semantic fidelity. We hope our findings underscore the need for more holistic interpretation and debiasing strategies of U-MLLMs in the future.", "authors": [ "Wensheng Zhang", "Bhiksha Raj Ramakrishnan", "LiWen Wang", "Jindong Wang", "Hao Chen", "Ming Liu" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "oracular-programming-a-modular-foundation-for", "arxiv_id": "2502.05310", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05310v1", "url_pdf": "https://arxiv.org/pdf/2502.05310v1.pdf", "title": "Oracular Programming: A Modular Foundation for Building LLM-Enabled Software", "abstract": "Large Language Models have proved surprisingly effective at solving a wide range of tasks from just a handful of examples. However, their lack of reliability and modularity limits their capacity to tackle large problems that require many steps of reasoning. In response, researchers have proposed advanced pipelines that leverage domain-specific knowledge to chain smaller prompts, provide intermediate feedback and improve performance through search. However, the current complexity of writing, tuning, maintaining and improving such pipelines has limited their sophistication. We propose oracular programming, a foundational paradigm for building LLM-enabled applications that lets domain experts express high-level problem-solving strategies as programs with unresolved choice points. These choice points are resolved at runtime by LLMs, which generalize from user-provided examples of correct and incorrect decisions. 
An oracular program is composed of three orthogonal components: a strategy that consists in a nondeterministic program with choice points that can be reified into a search tree, a policy that specifies how to navigate this tree with the help of LLM oracles, and a set of demonstrations that describe successful and unsuccessful search tree navigation scenarios across diverse problem instances. Each component is expressed in a dedicated programming language and can be independently improved or substituted. We address the key programming language design challenges of modularly composing oracular programs and enforcing consistency between their components as they evolve.", "authors": [ "André Platzer", "Jonathan Laurent" ], "published": "2025-02-07", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "elmtex-fine-tuning-large-language-models-for", "arxiv_id": "2502.05638", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05638v1", "url_pdf": "https://arxiv.org/pdf/2502.05638v1.pdf", "title": "ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports", "abstract": "Europe's healthcare systems require enhanced interoperability and digitalization, driving a demand for innovative solutions to process legacy clinical data. This paper presents the results of our project, which aims to leverage Large Language Models (LLMs) to extract structured information from unstructured clinical reports, focusing on patient history, diagnoses, treatments, and other predefined categories. We developed a workflow with a user interface and evaluated LLMs of varying sizes through prompting strategies and fine-tuning. Our results show that fine-tuned smaller models match or surpass larger counterparts in performance, offering efficiency for resource-limited settings. A new dataset of 60,000 annotated English clinical summaries and 24,000 German translations was validated with automated and manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics. The work highlights the approach's viability and outlines future improvements.", "authors": [ "Carlos A Velasco", "Yehya Mohamad", "Jahid Hasan Polash", "Florim Hamiti", "Zeyd Boukhers", "Naguib Heiba", "Aynur Guluzade" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "beyond-prompt-content-enhancing-llm", "arxiv_id": "2502.04295", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04295v2", "url_pdf": "https://arxiv.org/pdf/2502.04295v2.pdf", "title": "Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization", "abstract": "Large Language Models (LLMs) have shown significant capability across various tasks, with their real-world effectiveness often driven by prompt design. While recent research has focused on optimizing prompt content, the role of prompt formatting, a critical but often overlooked dimension, has received limited systematic investigation. In this paper, we introduce Content-Format Integrated Prompt Optimization (CFPO), an innovative methodology that jointly optimizes both prompt content and formatting through an iterative refinement process. CFPO leverages natural language mutations to explore content variations and employs a dynamic format exploration strategy that systematically evaluates diverse format options. 
Our extensive evaluations across multiple tasks and open-source LLMs demonstrate that CFPO demonstrates measurable performance improvements compared to content-only optimization methods. This highlights the importance of integrated content-format optimization and offers a practical, model-agnostic approach to enhancing LLM performance. Code is available at https://github.com/HenryLau7/CFPO.", "authors": [ "Peng Cheng", "Yuqing Yang", "Zhongxin Guo", "Yang Chen", "Xuan Feng", "Qi Chen", "Li Lyna Zhang", "Jiahang Xu", "Yuanye Liu" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "maqinstruct-instruction-based-unified-event", "arxiv_id": "2502.03954", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03954v1", "url_pdf": "https://arxiv.org/pdf/2502.03954v1.pdf", "title": "MAQInstruct: Instruction-based Unified Event Relation Extraction", "abstract": "Extracting event relations that deviate from known schemas has proven challenging for previous methods based on multi-class classification, MASK prediction, or prototype matching. Recent advancements in large language models have shown impressive performance through instruction tuning. Nevertheless, in the task of event relation extraction, instruction-based methods face several challenges: there are a vast number of inference samples, and the relations between events are non-sequential. To tackle these challenges, we present an improved instruction-based event relation extraction framework named MAQInstruct. Firstly, we transform the task from extracting event relations using given event-event instructions to selecting events using given event-relation instructions, which reduces the number of samples required for inference. Then, by incorporating a bipartite matching loss, we reduce the dependency of the instruction-based method on the generation sequence. Our experimental results demonstrate that MAQInstruct significantly improves the performance of event relation extraction across multiple LLMs.", "authors": [ "Jun Zhou", "Zhiqiang Zhang", "Mengshu Sun", "Jun Xu" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "afrispeech-dialog-a-benchmark-dataset-for", "arxiv_id": "2502.03945", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03945v1", "url_pdf": "https://arxiv.org/pdf/2502.03945v1.pdf", "title": "Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond", "abstract": "Speech technologies are transforming interactions across various sectors, from healthcare to call centers and robots, yet their performance on African-accented conversations remains underexplored. We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations, designed to evaluate automatic speech recognition (ASR) and related technologies. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with native accents and discover a 10%+ performance degradation. Additionally, we explore medical conversation summarization capabilities of large language models (LLMs) to demonstrate the impact of ASR errors on downstream medical summaries, providing insights into the challenges and opportunities for speech technologies in the Global South. 
Our work highlights the need for more inclusive datasets to advance conversational AI in low-resource settings.", "authors": [ "Tobi Olatunji", "Boluwatife A. Adewale", "Folafunmi Omofoye", "Lukman E. Ismaila", "Chibuzor Okocha", "Moshood Yekini", "Michael S. Mollel", "Naome A. Etori", "Emmanuel Ayodele", "Devendra D. Kayande", "Tassallah Abdullahi", "Mardhiyah Sanni" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "llm-alignment-as-retriever-optimization-an", "arxiv_id": "2502.03699", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03699v1", "url_pdf": "https://arxiv.org/pdf/2502.03699v1.pdf", "title": "LLM Alignment as Retriever Optimization: An Information Retrieval Perspective", "abstract": "Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR's retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO's effectiveness with 38.9 % and 13.7 % averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.", "authors": [ "Sercan O. Arik", "Jiawei Han", "Yu Meng", "Wei Xiong", "Ziqi Wang", "Zhen Qin", "Jinsung Yoon", "Bowen Jin" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "fairt2i-mitigating-social-bias-in-text-to", "arxiv_id": "2502.03826", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03826v1", "url_pdf": "https://arxiv.org/pdf/2502.03826v1.pdf", "title": "FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing", "abstract": "The proliferation of Text-to-Image (T2I) models has revolutionized content creation, providing powerful tools for diverse applications ranging from artistic expression to educational material development and marketing. Despite these technological advancements, significant ethical concerns arise from these models' reliance on large-scale datasets that often contain inherent societal biases. These biases are further amplified when AI-generated content is included in training data, potentially reinforcing and perpetuating stereotypes in the generated outputs. In this paper, we introduce FairT2I, a novel framework that harnesses large language models to detect and mitigate social biases in T2I generation. 
Our framework comprises two key components: (1) an LLM-based bias detection module that identifies potential social biases in generated images based on text prompts, and (2) an attribute rebalancing module that fine-tunes sensitive attributes within the T2I model to mitigate identified biases. Our extensive experiments across various T2I models and datasets show that FairT2I can significantly reduce bias while maintaining high-quality image generation. We conducted both qualitative user studies and quantitative non-parametric analyses in the generated image feature space, building upon the occupational dataset introduced in the Stable Bias study. Our results show that FairT2I successfully mitigates social biases and enhances the diversity of sensitive attributes in generated images. We further demonstrate, using the P2 dataset, that our framework can detect subtle biases that are challenging for human observers to perceive, extending beyond occupation-related prompts. On the basis of these findings, we introduce a new benchmark dataset for evaluating bias in T2I models.", "authors": [ "Issei Sato", "Jinya Sakurai" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "training-an-llm-as-a-judge-model-pipeline", "arxiv_id": "2502.02988", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.02988v1", "url_pdf": "https://arxiv.org/pdf/2502.02988v1.pdf", "title": "Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons", "abstract": "The rapid advancement of large language models (LLMs) has opened new possibilities for their adoption as evaluative judges. This paper introduces Themis, a fine-tuned LLM judge that delivers sophisticated context-aware evaluations. We provide a comprehensive overview of the development pipeline for Themis, highlighting its scenario-dependent evaluation prompts and two novel methods for controlled instruction generation. These designs enable Themis to effectively distill evaluative skills from teacher models, while retaining flexibility for continuous development. We introduce two human-labeled benchmarks for meta-evaluation, demonstrating that Themis can achieve high alignment with human preferences in an economical manner. Additionally, we explore insights into the LLM-as-a-judge paradigm, revealing nuances in performance and the varied effects of reference answers. Notably, we observe that pure knowledge distillation from strong LLMs, though common, does not guarantee performance improvement through scaling. We propose a mitigation strategy based on instruction-following difficulty. Furthermore, we provide practical guidelines covering data balancing, prompt customization, multi-objective training, and metric aggregation. 
We aim for our method and findings, along with the fine-tuning data, benchmarks, and model checkpoints, to support future research and development in this area.", "authors": [ "Wei Lin", "Xing Shi", "Yi Zong", "Jiaxin Xia", "Libin Meng", "Yi Cheng", "Renjun Hu" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "bilevel-zofo-bridging-parameter-efficient-and", "arxiv_id": "2502.03604", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03604v1", "url_pdf": "https://arxiv.org/pdf/2502.03604v1.pdf", "title": "Bilevel ZOFO: Bridging Parameter-Efficient and Zeroth-Order Techniques for Efficient LLM Fine-Tuning and Meta-Training", "abstract": "Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) optimizers presents significant computational challenges. Parameter-Efficient Fine-Tuning(PEFT) methods have been proposed to address these challenges by freezing most model parameters and training only a small subset. While PEFT is efficient, it may not outperform full fine-tuning when high task-specific performance is required. Zeroth-Order (ZO) methods offer an alternative for fine-tuning the entire pre-trained model by approximating gradients using only the forward pass, thus eliminating the computational burden of back-propagation in first-order methods. However, when implementing ZO methods, a hard prompt is crucial, and relying on simple, fixed hard prompts may not be optimal. In this paper, we propose a bilevel optimization framework that complements ZO methods with PEFT to mitigate sensitivity to hard prompts while efficiently and effectively fine-tuning LLMs. Our Bilevel ZOFO (Zeroth-Order-First-Order) method employs a double-loop optimization strategy, where only the gradient of the PEFT model and the forward pass of the base model are required. We provide convergence guarantees for Bilevel ZOFO. Empirically, we demonstrate that Bilevel ZOFO outperforms both PEFT and ZO methods in single-task settings while maintaining similar memory efficiency. Additionally, we show its strong potential for multitask learning. Compared to current first-order meta-training algorithms for multitask learning, our method has significantly lower computational demands while maintaining or improving performance.", "authors": [ "Heng Huang", "Peiran Yu", "Qi He", "Reza Shirkavand" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "pilaf-optimal-human-preference-sampling-for", "arxiv_id": "2502.04270", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04270v1", "url_pdf": "https://arxiv.org/pdf/2502.04270v1.pdf", "title": "PILAF: Optimal Human Preference Sampling for Reward Modeling", "abstract": "As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. 
PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.", "authors": [ "Yaqi Duan", "Julia Kempe", "Kunhao Zheng", "Ariel Kwiatkowski", "Yunzhen Feng" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "chamaleonllm-batch-aware-dynamic-low-rank", "arxiv_id": "2502.04315", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04315v2", "url_pdf": "https://arxiv.org/pdf/2502.04315v2.pdf", "title": "ChamaleonLLM: Batch-Aware Dynamic Low-Rank Adaptation via Inference-Time Clusters", "abstract": "Recent advances in large language models (LLMs) have shown remarkable performance across diverse tasks. However, these models are typically deployed with fixed weights, which limits their ability to adapt dynamically to the variability inherent in real-world data during inference. This paper introduces ChamaleonLLM, a novel framework that enables inference-time adaptation of LLMs by leveraging batch-aware clustering and on-the-fly generation of low-rank updates. Unlike traditional fine-tuning approaches such as Low-Rank Adaptation (LoRA) or methods that rely on a fixed set of pre-learned uniforms (changeable masks), our method dynamically generates adaptive modifications to the decoder weights based on the aggregated statistics of clustered batches. By intelligently grouping similar inputs and computing context-aware low-rank updates via a hyper-network, ChamaleonLLM achieves significant performance gains, outperforming conventional LoRA methods while eliminating the overhead of maintaining multiple expert models. Our experiments highlight the potential of our approach to serve as a versatile and highly adaptive solution for language model inference. ChamaleonLLM is open-sourced to ensure the reproducibility of our experiments: https://anonymous.4open.science/r/ChamaleonLLM/", "authors": [ "Hassan Sawaf", "Kamer Ali Yuksel" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "maga-massive-genre-audience-reformulation-to", "arxiv_id": "2502.04235", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04235v1", "url_pdf": "https://arxiv.org/pdf/2502.04235v1.pdf", "title": "MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion", "abstract": "Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, the natural language data struggles to scale up. To tackle this bottleneck, we propose \\textbf{MA}ssive \\textbf{G}enre-\\textbf{A}udience~(MAGA) reformulation method, which systematic synthesizes diverse, contextually-rich pretraining data from existing corpus. This work makes three main contributions: (1) We propose MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B tokens MAGACorpus. (2) We evaluate MAGACorpus with different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B), establishing the necessity for next-generation large-scale synthetic pretraining language models. 
(3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal limitations in conventional collapse detection metrics using validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliably pathway for scaling models beyond data limitations.", "authors": [ "Chenggang Li", "Ke Shen", "Xintong Hao" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "llms-to-support-a-domain-specific-knowledge", "arxiv_id": "2502.04095", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04095v1", "url_pdf": "https://arxiv.org/pdf/2502.04095v1.pdf", "title": "LLMs to Support a Domain Specific Knowledge Assistant", "abstract": "This work presents a custom approach to developing a domain specific knowledge assistant for sustainability reporting using the International Financial Reporting Standards (IFRS). In this domain, there is no publicly available question-answer dataset, which has impeded the development of a high-quality chatbot to support companies with IFRS reporting. The two key contributions of this project therefore are: (1) A high-quality synthetic question-answer (QA) dataset based on IFRS sustainability standards, created using a novel generation and evaluation pipeline leveraging Large Language Models (LLMs). This comprises 1,063 diverse QA pairs that address a wide spectrum of potential user queries in sustainability reporting. Various LLM-based techniques are employed to create the dataset, including chain-of-thought reasoning and few-shot prompting. A custom evaluation framework is developed to assess question and answer quality across multiple dimensions, including faithfulness, relevance, and domain specificity. The dataset averages a score range of 8.16 out of 10 on these metrics. (2) Two architectures for question-answering in the sustainability reporting domain - a RAG pipeline and a fully LLM-based pipeline. The architectures are developed by experimenting, fine-tuning, and training on the QA dataset. The final pipelines feature an LLM fine-tuned on domain specific data and an industry classification component to improve the handling of complex queries. The RAG architecture achieves an accuracy of 85.32% on single-industry and 72.15% on cross-industry multiple-choice questions, outperforming the baseline approach by 4.67 and 19.21 percentage points, respectively. The LLM-based pipeline achieves an accuracy of 93.45% on single-industry and 80.30% on cross-industry multiple-choice questions, an improvement of 12.80 and 27.36 percentage points over the baseline, respectively.", "authors": [ "Maria-Flavia Lovin" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "the-order-effect-investigating-prompt", "arxiv_id": "2502.04134", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04134v1", "url_pdf": "https://arxiv.org/pdf/2502.04134v1.pdf", "title": "The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs", "abstract": "As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. 
Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in closed-source LLMs by conducting experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness and offers partial mitigation, however, fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.", "authors": [ "Mehdi Rezagholizadeh", "Peyman Passban", "Tanya Roosta", "Bryan Guan" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "psyplay-personality-infused-role-playing", "arxiv_id": "2502.03821", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03821v1", "url_pdf": "https://arxiv.org/pdf/2502.03821v1.pdf", "title": "PsyPlay: Personality-Infused Role-Playing Conversational Agents", "abstract": "The current research on Role-Playing Conversational Agents (RPCAs) with Large Language Models (LLMs) primarily focuses on imitating specific speaking styles and utilizing character backgrounds, neglecting the depiction of deeper personality traits.~In this study, we introduce personality-infused role-playing for LLM agents, which encourages agents to accurately portray their designated personality traits during dialogues. We then propose PsyPlay, a dialogue generation framework that facilitates the expression of rich personalities among multiple LLM agents. Specifically, PsyPlay enables agents to assume roles with distinct personality traits and engage in discussions centered around specific topics, consistently exhibiting their designated personality traits throughout the interactions. Validation on generated dialogue data demonstrates that PsyPlay can accurately portray the intended personality traits, achieving an overall success rate of 80.31% on GPT-3.5. Notably, we observe that LLMs aligned with positive values are more successful in portraying positive personality roles compared to negative ones. Moreover, we construct a dialogue corpus for personality-infused role-playing, called PsyPlay-Bench. The corpus, which consists of 4745 instances of correctly portrayed dialogues using PsyPlay, aims to further facilitate research in personalized role-playing and dialogue personality detection.", "authors": [ "Qifan Wang", "Cong Liu", "Xiaojun Quan", "Yuhua Zhu", "Tao Yang" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "hierarchical-contextual-manifold-alignment", "arxiv_id": "2502.03766", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03766v1", "url_pdf": "https://arxiv.org/pdf/2502.03766v1.pdf", "title": "Hierarchical Contextual Manifold Alignment for Structuring Latent Representations in Large Language Models", "abstract": "The organization of latent token representations plays a crucial role in determining the stability, generalization, and contextual consistency of language models, yet conventional approaches to embedding refinement often rely on parameter modifications that introduce additional computational overhead. 
A hierarchical alignment method was introduced to restructure token embeddings without altering core model weights, ensuring that representational distributions maintained coherence across different linguistic contexts. Experimental evaluations demonstrated improvements in rare token retrieval, adversarial robustness, and long-range dependency tracking, highlighting the advantages of hierarchical structuring in mitigating inconsistencies in latent space organization. The comparative analysis against conventional fine-tuning and embedding perturbation methods revealed that hierarchical restructuring maintained computational efficiency while achieving measurable gains in representation quality. Structural refinements introduced through the alignment process resulted in improved contextual stability across varied linguistic tasks, reducing inconsistencies in token proximity relationships and enhancing interpretability in language generation. A detailed computational assessment confirmed that the realignment process introduced minimal inference overhead, ensuring that representational improvements did not compromise model efficiency. The findings reinforced the broader significance of structured representation learning, illustrating that hierarchical embedding modifications could serve as an effective strategy for refining latent space distributions while preserving pre-learned semantic associations.", "authors": [ "Ruoxi Wang", "Jianhong Tang", "Zixuan Feng", "Yan Huang", "Haoran Liu", "Meiquan Dong" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "pixfoundation-are-we-heading-in-the-right", "arxiv_id": "2502.04192", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04192v1", "url_pdf": "https://arxiv.org/pdf/2502.04192v1.pdf", "title": "PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?", "abstract": "Multiple works have emerged to push the boundaries on multi-modal large language models (MLLMs) towards pixel-level understanding. Such approaches have shown strong performance on benchmarks for referring expression segmentation and grounded conversation generation. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data. However, we show that such MLLMs when evaluated on recent challenging vision centric benchmarks, exhibit a weak ability in visual question answering. Surprisingly, some of these methods even downgrade the grounding ability of MLLMs that were never trained with such supervision. In this work, we propose two novel challenging benchmarks and show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks when evaluating both the pixel-level grounding and visual question answering. We propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call as PixFoundation. More importantly, we study the research question of ``When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?'' We show that grounding can coincide with object parts or location/appearance information. 
Code repository is at https://github.com/MSiam/PixFoundation/.", "authors": [ "Mennatullah Siam" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "eclair-extracting-content-and-layout-with", "arxiv_id": "2502.04223", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04223v1", "url_pdf": "https://arxiv.org/pdf/2502.04223v1.pdf", "title": "Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents", "abstract": "Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure -- including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages -- as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce \\'Eclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, \\'Eclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. \\'Eclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate \\'Eclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.", "authors": [ "Karan Sapra", "Andrew Tao", "Joseph Jennings", "Jupinder Parmar", "Jarno Seppänen", "Timo Roman", "Kateryna Chumachenko", "Philipp Fischer", "Lukas Voegtle", "Amala Sanjay Deshmukh", "Ilia Karmanov" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "phd-knowledge-not-required-a-reasoning", "arxiv_id": "2502.01584", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.01584v2", "url_pdf": "https://arxiv.org/pdf/2502.01584v2.pdf", "title": "PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models", "abstract": "Existing benchmarks for frontier models often test specialized, ``PhD-level'' knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models, however correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with ``I give up'' before providing an answer that it knows is wrong. 
R1 can also be remarkably ``uncertain'' in its output and in rare cases, it does not ``finish thinking,'' which suggests the need for an inference-time technique to ``wrap up'' before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.", "authors": [ "Zixuan Wu", "Francesca Lucchetti", "Arjun Guha", "Molly Q Feldman", "Federico Cassano", "Aleksander Boruch-Gruszecki", "Joydeep Biswas", "Carolyn Jane Anderson" ], "published": "2025-02-03", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "learning-autonomous-code-integration-for-math", "arxiv_id": "2502.00691", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.00691v1", "url_pdf": "https://arxiv.org/pdf/2502.00691v1.pdf", "title": "Learning Autonomous Code Integration for Math Language Models", "abstract": "Recent research on tool integration for math Large Language Models (LLMs) aims to combine complementary strengths of chain-of-thought (CoT) reasoning and code execution. However, we discover a critical limitation: current tool-integrated math LLMs rely on externally dictated instructions to decide whether to use CoT or code, lacking the autonomy to choose the most appropriate method independently. This prompts us to study \\emph{Autonomous Code integration} for math LLMs, which enables models to \\emph{independently} develop their own methodology-selection strategy in the absence of reliable supervision. To address this challenge, we propose an innovative Expectation-Maximization (EM) formulation that refines the model's decision-making through the exploration of its capabilities. This framework alternates between (a) computing a reference strategy that improves the model's belief over its capabilities through self-exploration, and (b) updating the model based on the refined belief. We further enhance this framework with an efficient implementation, incorporating a novel data synthesis strategy and off-policy reinforcement learning. Extensive experiments demonstrate that our approach, using only a public query set, significantly boosts the performance of existing math LLMs, raising accuracy by nearly 20\\% to 65.28\\% on the challenging MATH benchmark, while reducing code executions by up to 65\\% .", "authors": [ "Fangzhen Lin", "Wei Chu", "Weidi Xu", "Fengming Zhu", "Chao Qu", "Long Li", "Haozhe Wang" ], "published": "2025-02-02", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "smollm2-when-smol-goes-big-data-centric", "arxiv_id": "2502.02737", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.02737v1", "url_pdf": "https://arxiv.org/pdf/2502.02737v1.pdf", "title": "SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model", "abstract": "While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art \"small\" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. 
We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.", "authors": [ "Thomas Wolf", "Leandro von Werra", "Colin Raffel", "Mathieu Morlon", "Cyril Zakka", "Haojun Zhao", "Hugo Larcher", "Ben Burtenshaw", "Clémentine Fourrier", "Xuan-Son Nguyen", "Caleb Fahlgren", "Joshua Lochner", "Vaibhav Srivastav", "Agustín Piqueres Lajarín", "Hynek Kydlíček", "Andrés Marafioti", "Lewis Tunstall", "Guilherme Penedo", "Gabriel Martín Blázquez", "Elie Bakouch", "Anton Lozhkov", "Loubna Ben allal" ], "published": "2025-02-04", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "attention-sinks-and-outlier-features-a-catch", "arxiv_id": "2502.00919", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.00919v1", "url_pdf": "https://arxiv.org/pdf/2502.00919v1.pdf", "title": "Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings", "abstract": "Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency for tokens to attend very strongly to a select few tokens. Despite often having no semantic relevance, these select tokens, called attention sinks, along with the large outlier features, have proven important for model performance, compression, and streaming. Consequently, investigating the roles of these phenomena within models and exploring how they might manifest in the model parameters has become an area of active interest. Through an empirical investigation, we demonstrate that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved. We prove that simple tasks, like averaging, necessitate the 'catch, tag, release' mechanism, hence explaining why it would arise organically in modern LLMs. Our experiments also show that the creation of attention sinks can be completely captured in the model parameters using low-rank matrices, which has important implications for model compression and substantiates the success of recent approaches that incorporate a low-rank term to offset performance degradation.", "authors": [ "Vardan Papyan", "Mustafa Khan", "Stephen Zhang" ], "published": "2025-02-02", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "knowing-when-to-stop-dynamic-context-cutoff", "arxiv_id": "2502.01025", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.01025v1", "url_pdf": "https://arxiv.org/pdf/2502.01025v1.pdf", "title": "Knowing When to Stop: Dynamic Context Cutoff for Large Language Models", "abstract": "Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient in cases where the information required to answer a query is localized within the context. 
We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode \"sufficiency signals\" - detectable through lightweight classifiers - that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method demonstrates better performance with the same rate of token reduction compared to other context efficiency methods. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.", "authors": [ "Bhuwan Dhingra", "Zihao Lin", "Bolun Sun", "Chunyuan Deng", "Paul Rosu", "Junlin Wang", "Roy Xie" ], "published": "2025-02-03", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "fastkv-kv-cache-compression-for-fast-long", "arxiv_id": "2502.01068", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.01068v1", "url_pdf": "https://arxiv.org/pdf/2502.01068v1.pdf", "title": "FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation", "abstract": "While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00$\\times$ and 1.40$\\times$ improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. 
Our code is available at https://github.com/dongwonjo/FastKV.", "authors": [ "Jae-Joon Kim", "Yulhwa Kim", "Jiwon Song", "Dongwon Jo" ], "published": "2025-02-03", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "chunkkv-semantic-preserving-kv-cache", "arxiv_id": "2502.00299", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.00299v1", "url_pdf": "https://arxiv.org/pdf/2502.00299v1.pdf", "title": "ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference", "abstract": "To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that the previous KV cache compression methods measure token importance individually, neglecting the dependency between different tokens in the real-world language characteristics. In light of this, we introduce ChunkKV, grouping the tokens in a chunk as a basic compressing unit, and retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmarks. Our experiments with instruction tuning and multi-step reasoning (O1 and R1) LLMs achieve up to 10\% performance improvement under aggressive compression ratios compared to existing methods.", "authors": [ "Xiaowen Chu", "Xuming Hu", "Bo Li", "Zeyu Li", "Peijie Dong", "Zhenheng Tang", "Xiang Liu" ], "published": "2025-02-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "adapt-pruner-adaptive-structural-pruning-for", "arxiv_id": "2502.03460", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03460v1", "url_pdf": "https://arxiv.org/pdf/2502.03460v1.pdf", "title": "Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training", "abstract": "Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-trained from scratch, 3) incremental pruning brings non-trivial performance gain by interleaving pruning with training and only removing a small portion of neurons ($\\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. 
Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200$\\times$ fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks.", "authors": [ "Tong Zhang", "Renjie Pi", "Jipeng Zhang", "Xingyuan Pan", "Shizhe Diao", "Rui Pan", "Boyao Wang" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "twilight-adaptive-attention-sparsity-with", "arxiv_id": "2502.02770", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.02770v2", "url_pdf": "https://arxiv.org/pdf/2502.02770v2.pdf", "title": "Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning", "abstract": "Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\\times$ acceleration in self-attention operations and $3.9\\times$ acceleration in end-to-end per token latency in long context LLM decoding.", "authors": [ "Mingyu Gao", "Song Han", "Ion Stoica", "Boyu Tian", "Tian Tang", "Hanshuo Wang", "Shuo Yang", "Jiaming Tang", "Chaofan Lin" ], "published": "2025-02-04", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "a-training-free-length-extrapolation-approach", "arxiv_id": "2502.02659", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.02659v1", "url_pdf": "https://arxiv.org/pdf/2502.02659v1.pdf", "title": "A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)", "abstract": "Transformer-based Large Language Models (LLMs) struggle to process inputs exceeding their training context window, with performance degrading due to positional out-of-distribution (O.O.D.) that disrupt attention computations. Existing solutions, fine-tuning and training-free methods, are limited by computational inefficiency, attention logit outliers or loss of local positional information. To address this, we propose Greedy Attention Logit Interpolation (GALI), a training-free length extrapolation method that maximizes the utilization of pretrained positional intervals while avoiding attention logit outliers through attention logit interpolation. The result demonstrates that GALI consistently outperforms state-of-the-art training-free methods. Our findings reveal that LLMs interpret positional intervals unevenly within their training context window, suggesting that extrapolating within a smaller positional interval range yields superior results-even for short-context tasks. GALI represents a significant step toward resolving the positional O.O.D. challenge, enabling more reliable long-text understanding in LLMs. 
Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/AcademyCityL/GALI.", "authors": [ "Soyeon Caren Han", "Zechuan Li", "Tianyi Zhang", "Yan Li" ], "published": "2025-02-04", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "bolt-bootstrap-long-chain-of-thought-in", "arxiv_id": "2502.03860", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03860v1", "url_pdf": "https://arxiv.org/pdf/2502.03860v1.pdf", "title": "BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation", "abstract": "Large language models (LLMs), such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively. These actions empower LLMs to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models with LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainties on systematically developing such reasoning abilities. In terms of data domains, these works focus narrowly on math while a few others include coding, limiting their generalizability. This paper introduces a novel approach to enable LLMs' LongCoT capacity without distillation from o1-like models or expensive human annotations, where we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks, including Arena-Hard, MT-Bench, WildBench, ZebraLogic, and MATH500, which evaluate diverse task-solving and reasoning capabilities.", "authors": [ "Caiming Xiong", "Yingbo Zhou", "Silvio Savarese", "Jiacheng Xu", "Hanze Dong", "Bo Pang" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "identify-critical-kv-cache-in-llm-inference", "arxiv_id": "2502.03805", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03805v1", "url_pdf": "https://arxiv.org/pdf/2502.03805v1.pdf", "title": "Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective", "abstract": "Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. 
Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and Longbench benchmark show our algorithm enhances state-of-the-art cache eviction methods. Further empirical analysis confirms that our algorithm achieves lower output perturbations in over 92% attention heads in Llama model, thereby providing a significant improvement over existing methods.", "authors": [ "S Kevin Zhou", "Xike Xie", "Yukun Cao", "Junlin Lv", "Yuan Feng" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "context-preserving-gradient-modulation-for", "arxiv_id": "2502.03643", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03643v1", "url_pdf": "https://arxiv.org/pdf/2502.03643v1.pdf", "title": "Context-Preserving Gradient Modulation for Large Language Models: A Novel Approach to Semantic Consistency in Long-Form Text Generation", "abstract": "Maintaining semantic consistency over extended text sequences remains a fundamental challenge in long-form text generation, where conventional training methodologies often struggle to prevent contextual drift and coherence degradation. A novel gradient modulation approach is introduced, designed to adjust parameter updates dynamically in response to contextual relevance, ensuring that generated text remains aligned with prior discourse. By integrating a modulation function that selectively amplifies or attenuates gradients based on learned contextual dependencies, the proposed method enhances the stability of model-generated narratives without imposing significant computational overhead. Comparative evaluations against baseline models reveal improvements in coherence, contextual retention, and long-range dependency tracking, demonstrating the effectiveness of modifying the learning process at the gradient level. The results indicate that sentence structure variability and lexical diversity benefit from this approach, mitigating repetitive phrasing and improving adaptability across diverse linguistic contexts. Statistical validation of coherence metrics further substantiates the observed enhancements, with a significant reduction in inconsistencies emerging as a direct consequence of the modulation mechanism. Computational efficiency assessments confirm that the framework achieves these gains without requiring substantial modifications to the underlying architecture, ensuring compatibility with existing optimization workflows.", "authors": [ "Orlando Wetherby", "Zachary Vanderpoel", "Edmund Weatherstone", "Nirola Kobanov" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "attentionpredictor-temporal-pattern-matters", "arxiv_id": "2502.04077", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04077v1", "url_pdf": "https://arxiv.org/pdf/2502.04077v1.pdf", "title": "AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference", "abstract": "With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. 
To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the \\textit{temporal patterns} in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based critical token identification approach. Specifically, AttentionPredictor learns a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while consuming negligible memory. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 16$\\times$ KV cache compression with comparable LLM performance, significantly outperforming the state-of-the-art.", "authors": [ "Bin Li", "Mingxuan Yuan", "Jianye Hao", "Wulong Liu", "Xianzhi Yu", "Lei Chen", "Chen Chen", "Zhihai Wang", "Xing Li", "Jie Wang", "Qingyue Yang" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "the-complexity-of-learning-sparse-superposed", "arxiv_id": "2502.05407", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.05407v1", "url_pdf": "https://arxiv.org/pdf/2502.05407v1.pdf", "title": "The Complexity of Learning Sparse Superposed Features with Feedback", "abstract": "The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative \\textit{triplet comparisons}. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.", "authors": [ "Akash Kumar" ], "published": "2025-02-08", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "m-extending-memoryllm-with-scalable-long-term", "arxiv_id": "2502.00592", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.00592v1", "url_pdf": "https://arxiv.org/pdf/2502.00592v1.pdf", "title": "M+: Extending MemoryLLM with Scalable Long-Term Memory", "abstract": "Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. 
For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.", "authors": [ "Zexue He", "Rogerio Feris", "Dan Gutfreund", "Julian McAuley", "Wangchunshu Zhou", "Yifan Gao", "Yuanzhe Hu", "Dmitry Krotov", "Yu Wang" ], "published": "2025-02-01", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "fine-i-ll-merge-it-myself-a-multi-fidelity", "arxiv_id": "2502.04030", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.04030v1", "url_pdf": "https://arxiv.org/pdf/2502.04030v1.pdf", "title": "Fine, I'll Merge It Myself: A Multi-Fidelity Framework for Automated Model Merging", "abstract": "Reasoning capabilities represent a critical frontier for large language models (LLMs), but developing them requires extensive proprietary datasets and computational resources. One way to efficiently supplement capabilities is model merging, which offers a promising alternative by combining multiple models without retraining. However, current merging approaches rely on manually-designed strategies for merging hyperparameters, limiting the exploration of potential model combinations and requiring significant human effort. We propose an Automated Model Merging Framework that enables fine-grained exploration of merging strategies while reducing costs through multi-fidelity approximations. We support both single and multi-objective optimization and introduce two novel search spaces: layerwise fusion (LFS) and depth-wise integration (DIS). Evaluating across a number of benchmarks, we find that the search autonomously finds 1) Merges that further boost single-objective performance, even on tasks the model has already been finetuned on, and 2) Merges that optimize multi-objective frontiers across tasks. Effective merges are found with limited compute, e.g., within fewer than 500 search steps.", "authors": [ "Jonas Geiping", "Guinan Su" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "intent-representation-learning-with-large", "arxiv_id": "2502.03307", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03307v1", "url_pdf": "https://arxiv.org/pdf/2502.03307v1.pdf", "title": "Intent Representation Learning with Large Language Model for Recommendation", "abstract": "Intent-based recommender systems have garnered significant attention for uncovering latent fine-grained preferences. Intents, as underlying factors of interactions, are crucial for improving recommendation interpretability. Most methods define intents as learnable parameters updated alongside interactions. 
However, existing frameworks often overlook textual information (e.g., user reviews, item descriptions), which is crucial for alleviating the sparsity of interaction intents. Exploring these multimodal intents, especially the inherent differences in representation spaces, poses two key challenges: i) How to align multimodal intents and effectively mitigate noise issues; ii) How to extract and match latent key intents across modalities. To tackle these challenges, we propose a model-agnostic framework, Intent Representation Learning with Large Language Model (IRLLRec), which leverages large language models (LLMs) to construct multimodal intents and enhance recommendations. Specifically, IRLLRec employs a dual-tower architecture to learn multimodal intent representations. Next, we propose pairwise and translation alignment to eliminate inter-modal differences and enhance robustness against noisy input features. Finally, to better match textual and interaction-based intents, we employ momentum distillation to perform teacher-student learning on fused intent representations. Empirical evaluations on three datasets show that our IRLLRec framework outperforms baselines. The implementation is available at https://github.com/wangyu0627/IRLLRec.", "authors": [ "Yiwen Zhang", "Yi Zhang", "Lei Sang", "Yu Wang" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "large-language-models-are-universal", "arxiv_id": "2502.03041", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03041v1", "url_pdf": "https://arxiv.org/pdf/2502.03041v1.pdf", "title": "Large Language Models Are Universal Recommendation Learners", "abstract": "In real-world recommender systems, different tasks are typically addressed using supervised learning on task-specific datasets with carefully designed model architectures. We demonstrate that large language models (LLMs) can function as universal recommendation learners, capable of handling multiple tasks within a unified input-output framework, eliminating the need for specialized model designs. To improve the recommendation performance of LLMs, we introduce a multimodal fusion module for item representation and a sequence-in-set-out approach for efficient candidate generation. When applied to industrial-scale data, our LLM achieves competitive results with expert models elaborately designed for different recommendation tasks. Furthermore, our analysis reveals that recommendation outcomes are highly sensitive to text input, highlighting the potential of prompt engineering in optimizing industrial-scale recommender systems.", "authors": [ "Bo Zheng", "Jian Xu", "Han Zhu", "Ziru Xu", "Xiaoyu Kong", "Bin Liu", "Yanwen Huang", "Junguang Jiang" ], "published": "2025-02-05", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "boosting-knowledge-graph-based", "arxiv_id": "2502.03715", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.03715v1", "url_pdf": "https://arxiv.org/pdf/2502.03715v1.pdf", "title": "Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models", "abstract": "Knowledge Graph-based recommendations have gained significant attention due to their ability to leverage rich semantic relationships. However, constructing and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy of KGs can suffer from noisy, outdated, or irrelevant triplets. 
Recent advancements in Large Language Models (LLMs) offer a promising way to improve the quality and relevance of KGs for recommendation tasks. Despite this, integrating LLMs into KG-based systems presents challenges, such as efficiently augmenting KGs, addressing hallucinations, and developing effective joint learning methods. In this paper, we propose the Confidence-aware KG-based Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework that combines KGs and LLMs for recommendation task. The framework includes: (1) an LLM-based subgraph augmenter for enriching KGs with high-quality information, (2) a confidence-aware message propagation mechanism to filter noisy triplets, and (3) a dual-view contrastive learning method to integrate user-item interactions and KG data. Additionally, we employ a confidence-aware explanation generation process to guide LLMs in producing realistic explanations for recommendations. Finally, extensive experiments demonstrate the effectiveness of CKG-LLMA across multiple public datasets.", "authors": [ "Hui Xiong", "Dazhong Shen", "Qianyi Cai", "Chao Wang", "Rui Cai" ], "published": "2025-02-06", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "the-dual-use-dilemma-in-llms-do-empowering", "arxiv_id": "2501.13952", "nips_id": null, "url_abs": "https://arxiv.org/abs/2501.13952v1", "url_pdf": "https://arxiv.org/pdf/2501.13952v1.pdf", "title": "The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility?", "abstract": "Recent years have witnessed extensive efforts to enhance Large Language Models (LLMs) across various domains, alongside growing attention to their ethical implications. However, a critical challenge remains largely overlooked: LLMs must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance by addressing this ethical-utility trade-off, using chemical domain applications as a proof-of-concept. Our alignment pipeline starts with a GPT-assisted three-phase data generation scheme, in which we create LibraChemQA, a chemical question-answering dataset comprising 31.6k triplet instances. By incorporating an innovative balanced seed in the data generation process, our framework systematically considers both legitimate and illegitimate requests. The framework also introduces a rephrasing mechanism for efficient data augmentation that enhances the model's chemical comprehension. We further develop a novel hybrid evaluation scheme with LLM judges for precise assessment of both safety and utility. 
Experimental results demonstrate our model's substantial improvements in overall performance where both safety and utility are considered - our resulting model, LibraChem, outperforms leading LLMs including Claude-3, GPT-4o, and LLaMA-3 by margins of 13.44%, 7.16%, and 7.10% respectively on our released benchmark.", "authors": [ "Pheng-Ann Heng", "Xilin Dang", "Yuyang Du", "Kexin Chen", "Xingyu Chen", "Yiyi Zhang" ], "published": "2025-01-20", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null }, { "id": "phip-g-physics-guided-text-to-3d", "arxiv_id": "2502.00708", "nips_id": null, "url_abs": "https://arxiv.org/abs/2502.00708v1", "url_pdf": "https://arxiv.org/pdf/2502.00708v1.pdf", "title": "PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation", "abstract": "Text-to-3D asset generation has achieved significant optimization under the supervision of 2D diffusion priors. However, when dealing with compositional scenes, existing methods encounter several challenges: 1). failure to ensure that composite scene layouts comply with physical laws; 2). difficulty in accurately capturing the assets and relationships described in complex scene descriptions; 3). limited autonomous asset generation capabilities among layout approaches leveraging large language models (LLMs). To avoid these compromises, we propose a novel framework for compositional scene generation, PhiP-G, which seamlessly integrates generation techniques with layout guidance based on a world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene description to generate a scene graph, and integrates a multimodal 2D generation agent and a 3D Gaussian generation method for targeted asset creation. For the layout stage, PhiP-G employs a physical pool with adhesion capabilities and a visual supervision agent, forming a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP-G significantly enhances the generation quality and physical rationality of the compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by the T$^3$Bench, and improves efficiency by 24x.", "authors": [ "Yan Peng", "Zongjin He", "Chao Wang", "Qixuan Li" ], "published": "2025-02-02", "conference": null, "conference_url_abs": null, "conference_url_pdf": null, "proceeding": null } ] }{ "count": 24708, "next": "
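The envelope fields visible above ("count" and "next", alongside the array of paper objects) are what a client uses to walk the paginated listing: each page's "next" holds the URL of the following page and becomes null on the last one. Below is a minimal sketch of that loop, assuming only the standard `requests` package; the `iter_papers` helper is illustrative and not part of the API itself, and the starting URL simply echoes the query used for this listing.

```python
# Minimal pagination sketch (not an official client): walk the paper listing
# by following each response's "next" URL until it is null.
import requests


def iter_papers(start_url, session=None, timeout=30):
    """Yield every paper dict in the listing, page by page."""
    session = session or requests.Session()
    url = start_url
    while url:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()
        page = resp.json()
        # Each page carries "count", "next", and the list of paper objects.
        yield from page.get("results", [])
        url = page.get("next")  # None on the last page, which ends the loop


if __name__ == "__main__":
    # Query parameters as in this listing; adjust `q` / `ordering` as needed.
    url = ("https://paperswithcode.com/api/v1/papers/"
           "?ordering=-id&q=Large+Language+Models")
    for paper in iter_papers(url):
        print(paper["published"], paper["title"], paper["url_abs"])
```

Since "count" reports the total number of matches (24,708 here), a client can also bound the crawl up front rather than relying solely on "next" becoming null.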