In this paper, we first demonstrate that attention sink emerges because models assign strong attention scores to initial tokens, treating them as a ``sink'', even when they are not semantically important.
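To make this concrete, here is a minimal sketch (not the paper's code; the model name and the number of sink tokens are placeholder choices) that measures how much attention mass later queries place on the first few tokens of a sequence with a Hugging Face causal LM:

```python
# Sketch: quantify the "attention sink" by summing, per layer, the attention mass
# that queries after the first few positions place on the initial (sink) tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that can return attentions works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The quick brown fox jumps over the lazy dog. " * 8
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

n_sink = 4  # assumed number of initial tokens treated as the sink
for layer_idx, attn in enumerate(out.attentions):  # attn: (batch, heads, q_len, k_len)
    # average over batch, heads, and all query positions after the sink tokens
    sink_mass = attn[..., n_sink:, :n_sink].sum(-1).mean().item()
    print(f"layer {layer_idx:2d}: mean attention mass on first {n_sink} tokens = {sink_mass:.3f}")
```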
In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks.
Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts.
We believe that the main ingredient in the success of CLIP is its data, not the model architecture or pre-training objective.
In this stage, we increase the number of Gaussians by compactness-based densification to enhance continuity and improve fidelity.
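As an illustration only, the following toy sketch (positions only, with an assumed distance threshold; not the paper's implementation) inserts new Gaussians at the midpoints of nearby pairs, which captures the spirit of densifying compact regions to improve continuity:

```python
# Toy sketch of a compactness-style densification step: add new Gaussian centres
# between pairs of existing centres that lie closer than a chosen radius.
import torch

def compactness_densify(positions, radius=0.05):
    """positions: (N, 3) Gaussian centres; returns centres augmented with midpoints."""
    dists = torch.cdist(positions, positions)   # (N, N) pairwise distances
    dists.fill_diagonal_(float("inf"))          # ignore self-pairs
    i, j = torch.where(dists < radius)          # indices of close pairs
    keep = i < j                                # count each unordered pair once
    midpoints = 0.5 * (positions[i[keep]] + positions[j[keep]])
    return torch.cat([positions, midpoints], dim=0)

pts = torch.rand(100, 3) * 0.2
print(compactness_densify(pts).shape)  # more Gaussians than we started with
```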
We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens.
Ranked #1 on Video Inpainting on DAVIS
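A minimal sketch of the general idea behind mask-guided token sparsity (all function and parameter names here are illustrative, not the proposed model's API): the pixel-level mask is pooled down to the token grid, and only tokens overlapping the mask are retained, so the attention layers never process the discarded tokens.

```python
# Sketch: keep only the patch tokens that overlap the inpainting mask and drop the rest,
# so subsequent attention layers operate on a much smaller token set.
import torch
import torch.nn.functional as F

def select_masked_tokens(tokens, mask, patch_size=16):
    """tokens: (B, H*W, C) patch tokens; mask: (B, 1, H*patch_size, W*patch_size) binary hole mask."""
    # pool the pixel-level mask down to one value per patch token
    token_mask = F.max_pool2d(mask, kernel_size=patch_size).flatten(1) > 0  # (B, H*W)
    kept = [tokens[b][token_mask[b]] for b in range(tokens.shape[0])]       # variable length per sample
    return kept, token_mask

B, H, W, C, ps = 2, 8, 8, 64, 16
tokens = torch.randn(B, H * W, C)
mask = (torch.rand(B, 1, H * ps, W * ps) > 0.7).float()
kept, token_mask = select_masked_tokens(tokens, mask, ps)
print([k.shape for k in kept])  # only tokens inside the mask survive
```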
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience.
The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters.
Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling.
Ranked #3 on Image Generation on Binarized MNIST
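The following toy sketch (illustrative only; the network and the guidance objective are placeholders) shows why simplex-valued inputs permit gradient-based guidance: the network consumes probability vectors rather than hard token ids, so gradients of any scalar score flow back to the inputs.

```python
# Sketch: because inputs lie on the probability simplex (softmax of logits) rather than
# being discrete ids, a scalar guidance score can be backpropagated to the inputs.
import torch
import torch.nn.functional as F

vocab, seq_len, hidden = 32, 10, 64
net = torch.nn.Sequential(torch.nn.Linear(vocab, hidden), torch.nn.ReLU(),
                          torch.nn.Linear(hidden, vocab))

logits = torch.zeros(seq_len, vocab, requires_grad=True)
probs = F.softmax(logits, dim=-1)      # point on the simplex, differentiable
out = net(probs)                       # network sees distributions, not hard ids
guidance_score = out[:, 0].sum()       # toy guidance objective (assumed)
guidance_score.backward()              # gradients reach the simplex-valued inputs
print(logits.grad.shape)               # torch.Size([10, 32])
```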
Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics.
Ranked #8 on Math Word Problem Solving on MATH