We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs.
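ScatterMoE concerns the GPU implementation of SMoE, but the underlying computation it accelerates is standard top-k expert routing. Below is a minimal, illustrative sketch of that routing in plain Python; the function names (`topk_route`, `smoe_forward`) are my own and this does not reflect ScatterMoE's actual fused-kernel implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topk_route(logits, k):
    """Pick the k highest-scoring experts and renormalize their
    gate weights with a softmax (standard top-k SMoE gating)."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in idx])
    return idx, gates

def smoe_forward(x, expert_fns, router_fn, k=2):
    """Sparse MoE layer for a single token x: only the top-k experts
    are evaluated, and their outputs are combined by the gate weights."""
    idx, gates = topk_route(router_fn(x), k)
    return sum(g * expert_fns[i](x) for i, g in zip(idx, gates))
```

For example, with three experts that scale their input by 1, 2, and 3 and a router that always prefers the last two, only those two experts run and their outputs are blended by the softmaxed gates. Efficient implementations such as ScatterMoE avoid materializing the padded per-expert token buffers this naive formulation implies.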
We therefore present a new large language and vision model (LLVM), Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, scene graph generation (SGG), and OCR models.
This technical report introduces TripoSR, a 3D reconstruction model leveraging a transformer architecture for fast feed-forward 3D generation, producing a 3D mesh from a single image in under 0.5 seconds.
A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses.
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation.
Notably, having learned pure resolution priors, ResAdapter, trained on a general dataset, generates resolution-free images with personalized diffusion models while preserving their original style domain.
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts Mamba to the video domain.
To tackle this unified SFDA problem, we propose a novel approach called Latent Causal Factors Discovery (LCFD).
We categorize Mamba into four roles for modeling videos, derive a Video Mamba Suite composed of 14 models/modules, and evaluate them on 12 video understanding tasks.
We revisit the "dataset classification" experiment suggested by Torralba and Efros a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures.