Model Editing
105 papers with code • 0 benchmarks • 1 dataset
Most implemented papers
Locating and Editing Factual Associations in GPT
To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME).
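The heart of ROME is a rank-one update that rewrites what a feed-forward weight returns for one key vector while changing the matrix as little as possible. The minimal sketch below shows that least-squares form only; it omits ROME's covariance statistics and causal-tracing step, and `k_star` / `v_star` stand in for the key and value vectors the paper derives.

```python
import torch

def rank_one_edit(W: torch.Tensor, k_star: torch.Tensor, v_star: torch.Tensor) -> torch.Tensor:
    """Minimal-norm rank-one update so the edited weight maps k_star to v_star.

    W: (d_out, d_in) feed-forward weight, treated as a linear associative memory.
    k_star: (d_in,) key vector for the fact being rewritten.
    v_star: (d_out,) desired value vector encoding the new association.
    """
    residual = v_star - W @ k_star                               # what the current weight gets wrong
    update = torch.outer(residual, k_star) / (k_star @ k_star)   # rank-one correction
    return W + update

# After the edit, W_new @ k_star equals v_star (up to numerical precision),
# while inputs orthogonal to k_star are left unchanged.
```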
Editing Large Language Models: Problems, Methods, and Opportunities
Our objective is to provide insight into the effectiveness and feasibility of each editing technique, helping the community choose the most appropriate method for a given task or context.
Fast Model Editing at Scale
To enable easy post-hoc editing at scale, we propose Model Editor Networks using Gradient Decomposition (MEND), a collection of small auxiliary editing networks that use a single desired input-output pair to make fast, local edits to a pre-trained model's behavior.
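The gradient-decomposition idea can be sketched as follows: the weight gradient of a linear layer factors into an outer product of the layer input and the output gradient, so an editor network only needs to transform two vectors rather than a full weight matrix. The class and dimensions below are an illustrative toy, not MEND's actual architecture (which, among other things, shares editor parameters across layers).

```python
import torch
import torch.nn as nn

class GradientDecompositionEditor(nn.Module):
    """Toy editor: transform the two rank-one factors of a linear layer's
    weight gradient, then recombine them into a proposed weight edit."""

    def __init__(self, d_in: int, d_out: int, hidden: int = 128):
        super().__init__()
        self.edit_x = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_in))
        self.edit_delta = nn.Sequential(nn.Linear(d_out, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

    def forward(self, x: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        # x: (d_in,) layer input; delta: (d_out,) gradient of the loss w.r.t. the layer output.
        x_tilde = self.edit_x(x)
        delta_tilde = self.edit_delta(delta)
        return torch.outer(delta_tilde, x_tilde)   # (d_out, d_in) proposed weight edit

# A fast edit is then applied as, e.g., W.data -= lr * editor(x, delta).
```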
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons.
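A minimal sketch of the kind of sparse autoencoder the paper trains on language-model activations: the dictionary is overcomplete (n_features > d_model), and an L1 penalty pushes each activation to decompose into a few directions. The hyperparameters and exact loss weighting here are assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder over model activations: each activation is
    reconstructed as a sparse combination of learned dictionary directions."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))   # sparse, non-negative feature coefficients
        recon = self.decoder(codes)              # reconstruction from dictionary directions
        return recon, codes

def sae_loss(recon, acts, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature codes.
    return torch.mean((recon - acts) ** 2) + l1_coeff * codes.abs().mean()
```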
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability.
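pyvene defines its own intervention abstractions; as a rough, library-agnostic illustration of the underlying operation, a PyTorch forward hook can shift a module's hidden state along a fixed direction. The module path and steering vector below are hypothetical.

```python
import torch

def add_steering_hook(module: torch.nn.Module, direction: torch.Tensor, alpha: float = 1.0):
    """Shift `module`'s output along a fixed direction: one of the simplest
    interventions on model-internal state (an activation addition)."""
    def hook(mod, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction   # broadcast over batch and sequence dims
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return module.register_forward_hook(hook)

# Hypothetical usage on a GPT-2-style model:
#   handle = add_steering_hook(model.transformer.h[10], steering_vec, alpha=4.0)
#   ... run generation ...
#   handle.remove()   # undo the intervention
```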
Interpretability, Then What? Editing Machine Learning Models to Reflect Human Knowledge and Values
Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions, potentially causing harms once deployed.
Editing Implicit Assumptions in Text-to-Image Diffusion Models
Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses").
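TIME applies a closed-form, ridge-regularized edit to the cross-attention projections that map text embeddings to keys and values, so that the source prompt produces the values the destination prompt would have. The single-vector sketch below captures that least-squares form; in the paper the edit aggregates over all source-prompt tokens and cross-attention layers, and the variable names are mine.

```python
import torch

def time_style_edit(W: torch.Tensor, c_src: torch.Tensor, c_dst: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    """Closed-form projection edit in the spirit of TIME.

    W: (d_out, d_emb) cross-attention key or value projection matrix.
    c_src / c_dst: (d_emb,) embeddings of the under-specified and specified prompts.
    lam: ridge coefficient keeping the edited matrix close to the original.
    """
    d = W.shape[1]
    v_star = W @ c_dst                                   # target values from the destination prompt
    A = lam * torch.eye(d) + torch.outer(c_src, c_src)   # ridge-regularized normal matrix
    B = lam * W + torch.outer(v_star, c_src)
    return B @ torch.linalg.inv(A)                       # minimizer of ||W'c_src - v*||^2 + lam||W' - W||^2
```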
A Comprehensive Study of Knowledge Editing for Large Language Models
In this paper, we first define the knowledge editing problem and then provide a comprehensive review of cutting-edge approaches.
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
These tuning-based methods require large-scale preference data for training and are susceptible to noisy preference data.
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
The LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless.
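Latent adversarial training perturbs hidden activations rather than inputs: an inner loop searches for a bounded perturbation that maximizes the training loss, and the model is then updated to behave well under that perturbation. The sketch below assumes a PyTorch model whose `layer` returns a plain tensor, a known activation shape, and a generic `compute_loss`; none of these mirror the paper's exact setup.

```python
import torch

def lat_step(model, layer, batch, compute_loss, optimizer, act_shape,
             eps: float = 1.0, inner_steps: int = 5, inner_lr: float = 0.1):
    """One latent adversarial training step (sketch, under the assumptions above)."""
    # delta perturbs `layer`'s output on every forward pass below.
    delta = torch.zeros(act_shape, requires_grad=True)
    handle = layer.register_forward_hook(lambda mod, inp, out: out + delta)
    try:
        # Inner loop: gradient ascent on the loss w.r.t. the latent perturbation.
        for _ in range(inner_steps):
            loss = compute_loss(model, batch)
            (grad,) = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += inner_lr * grad
                delta.clamp_(-eps, eps)   # keep the perturbation bounded
        # Outer step: train the model to keep the loss low despite the perturbation.
        optimizer.zero_grad()
        compute_loss(model, batch).backward()
        optimizer.step()
    finally:
        handle.remove()   # always detach the perturbation hook
```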