The fusion of visual and LiDAR measurements is based on a single unified voxel map, where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points.
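The following is a minimal sketch of such a unified voxel map, not the authors' implementation: LiDAR points populate the geometry, and the visual module attaches image patches to points already in the map. The names (`VoxelMap`, `attach_patch`) and the voxel size are illustrative assumptions.

```python
from dataclasses import dataclass, field
import numpy as np

VOXEL_SIZE = 0.5  # assumed voxel edge length in meters

@dataclass
class Voxel:
    points: list = field(default_factory=list)   # 3D points (geometric structure)
    patches: dict = field(default_factory=dict)  # patch index -> image patch

class VoxelMap:
    def __init__(self, voxel_size=VOXEL_SIZE):
        self.voxel_size = voxel_size
        self.voxels = {}  # (i, j, k) voxel index -> Voxel

    def _key(self, p):
        # Map a 3D point to its integer voxel index.
        return tuple(np.floor(p / self.voxel_size).astype(int))

    def insert_scan(self, points):
        """LiDAR module: register a new scan into the map's geometry."""
        for p in points:
            self.voxels.setdefault(self._key(p), Voxel()).points.append(p)

    def attach_patch(self, point, patch):
        """Visual module: attach an image patch to the voxel containing `point`."""
        key = self._key(point)
        if key in self.voxels:
            idx = len(self.voxels[key].patches)
            self.voxels[key].patches[idx] = patch

# Usage: insert a toy scan, then attach an 8x8 grayscale patch to one point.
vmap = VoxelMap()
scan = np.random.rand(100, 3) * 10.0
vmap.insert_scan(scan)
vmap.attach_patch(scan[0], np.zeros((8, 8)))
```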
In this work, we present MatMamba: a state space model that combines Matryoshka-style learning with Mamba2 by modifying the block to contain nested dimensions, enabling joint training and adaptive inference.
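As a rough illustration of the Matryoshka-style nesting (a sketch under assumed dimensions, not MatMamba's code), one full-width projection matrix is trained while smaller sub-models reuse its leading slice, so a single set of weights serves several inference sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D_FULL = 256                       # full model dimension (assumed)
NESTED_DIMS = [32, 64, 128, 256]   # assumed nesting granularities

W = rng.normal(size=(D_FULL, D_FULL))  # one shared projection weight

def nested_forward(x, d):
    """Project with the leading d x d sub-block of the shared weight."""
    assert d in NESTED_DIMS
    return x[:d] @ W[:d, :d].T

x = rng.normal(size=(D_FULL,))
# Joint training would combine losses across nested dims;
# adaptive inference simply picks one dim at deployment time.
outs = {d: nested_forward(x, d) for d in NESTED_DIMS}
print({d: o.shape for d, o in outs.items()})
```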
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer.
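To make concrete what linear-layer quantization typically involves, here is a minimal sketch of symmetric per-tensor int8 weight quantization; this is a generic baseline, not any specific method from the literature:

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a single symmetric scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)  # toy linear-layer weight
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs quantization error: {err:.5f}")
```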
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
We present PhysGen, a novel image-to-video generation method that takes a single image and an input condition (e.g., force and torque applied to an object in the image) and produces a realistic, physically plausible, and temporally consistent video.
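To make the input condition concrete, here is a hypothetical sketch (not PhysGen's actual pipeline) in which a constant force and torque are rolled out with simple rigid-body dynamics, yielding the kind of per-frame trajectory a physics-grounded video generator could follow:

```python
import numpy as np

def rollout(force, torque, mass=1.0, inertia=1.0, dt=1 / 30, steps=60):
    """Euler-integrate 2D position and orientation under constant force/torque."""
    pos, vel = np.zeros(2), np.zeros(2)
    angle, omega = 0.0, 0.0
    traj = []
    for _ in range(steps):
        vel += (np.asarray(force) / mass) * dt   # linear acceleration
        pos = pos + vel * dt
        omega += (torque / inertia) * dt         # angular acceleration
        angle += omega * dt
        traj.append((pos.copy(), angle))
    return traj  # per-frame (position, orientation) driving the video

frames = rollout(force=(2.0, 0.0), torque=0.5)
print(len(frames), frames[-1][0], frames[-1][1])
```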
Various visual foundation models have distinct strengths and weaknesses; the former can be combined and the latter mitigated through heterogeneous multi-teacher knowledge distillation without labels, an approach termed "agglomerative models."
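A minimal sketch of the label-free multi-teacher setup follows; the teacher names, feature dimensions, and the MSE feature-matching objective are assumptions for illustration, not the exact losses used by agglomerative models:

```python
import torch
import torch.nn as nn

D_STUDENT = 128
TEACHER_DIMS = {"teacher_a": 256, "teacher_b": 512}  # assumed feature sizes

student = nn.Linear(3 * 32 * 32, D_STUDENT)  # toy student encoder
heads = nn.ModuleDict(
    {k: nn.Linear(D_STUDENT, d) for k, d in TEACHER_DIMS.items()}
)  # one projection head per heterogeneous teacher

def distill_loss(x, teacher_feats):
    """Sum of per-teacher feature-matching losses on an unlabeled batch x."""
    z = student(x.flatten(1))
    return sum(
        nn.functional.mse_loss(heads[k](z), feats)
        for k, feats in teacher_feats.items()
    )

x = torch.randn(4, 3, 32, 32)  # unlabeled images: no ground-truth labels needed
fake_teacher_feats = {k: torch.randn(4, d) for k, d in TEACHER_DIMS.items()}
loss = distill_loss(x, fake_teacher_feats)
loss.backward()
```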
In contrast, we mine the cross-layer dependency that significantly influences the discretization error of the entire vision-language model, and embed this dependency into the search for an optimal quantization strategy at low search cost.
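The following toy sketch shows why accounting for cross-layer dependency matters: a layer's rounding error propagates through later layers, so candidate per-layer bit-widths are ranked by whole-model output error rather than per-layer error. The 3-layer model, bit-width menu, and bit budget are all assumptions, not the paper's search procedure:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 16)) for _ in range(3)]
x = rng.normal(size=(16,))

def quantize(w, bits):
    """Uniform symmetric fake-quantization of a weight matrix."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def forward(weights, x):
    for w in weights:
        x = np.tanh(w @ x)  # errors from earlier layers propagate here
    return x

ref = forward(layers, x)

# Enumerate per-layer bit-widths under a total-bit budget, scoring each
# assignment by end-to-end output error (where cross-layer dependency enters).
candidates = [
    b for b in itertools.product([3, 4, 8], repeat=len(layers)) if sum(b) <= 15
]
best = min(
    candidates,
    key=lambda bits: np.linalg.norm(
        forward([quantize(w, b) for w, b in zip(layers, bits)], x) - ref
    ),
)
print("best per-layer bit-widths under budget:", best)
```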
We introduce Mirage, the first multi-level superoptimizer for tensor programs.
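As a rough, hypothetical illustration of the superoptimization idea (not Mirage's actual multi-level algorithm), one can enumerate algebraic rewrites of a tensor expression, keep those that agree with the reference on random probe inputs, and select the cheapest under a toy cost model:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(32, 32)), rng.normal(size=(32, 32))

reference = lambda A, B: (A @ B).T  # tensor program to optimize

# Candidate rewrites; "matmul_of_transposes" uses the identity (AB)^T = B^T A^T.
candidates = {
    "transpose_of_matmul": lambda A, B: (A @ B).T,
    "matmul_of_transposes": lambda A, B: B.T @ A.T,
}
cost = {"transpose_of_matmul": 2, "matmul_of_transposes": 1}  # toy cost model

# Keep rewrites that match the reference on a random probe (an equivalence
# check in the spirit of randomized testing), then pick the cheapest.
ref_out = reference(A, B)
equivalent = {
    name: f for name, f in candidates.items() if np.allclose(f(A, B), ref_out)
}
best = min(equivalent, key=cost.get)
print("selected rewrite:", best)
```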
Image matching is a core component of all the best-performing algorithms and pipelines in 3D vision.