HQ-SAM is trained only on the introduced dataset of 44K masks, which takes only 4 hours on 8 GPUs.
In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Augmented Language Models (ALMs) blend the reasoning capabilities of Large Language Models (LLMs) with tools that allow for knowledge retrieval and action execution.
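As a concrete illustration of the ALM pattern, here is a minimal, entirely hypothetical Python sketch: the model decides when to call an external tool (a stand-in retrieval function) and conditions its answer on the result. All names are illustrative and not taken from any specific ALM.

```python
# Hypothetical sketch of the ALM loop: the LLM may call a tool
# (here, knowledge retrieval) and answer using the tool's output.
def retrieve(query: str) -> str:
    """Stand-in knowledge-retrieval tool (e.g., a search-index lookup)."""
    knowledge = {"capital of France": "Paris"}
    return knowledge.get(query, "no result")

def augmented_lm(prompt: str) -> str:
    # A real ALM would let the model emit a tool-call token sequence;
    # this keyword check merely fakes that decision for illustration.
    if "capital of France" in prompt:
        evidence = retrieve("capital of France")
        return f"Using retrieved evidence: the answer is {evidence}."
    return "answered from parametric knowledge alone"

print(augmented_lm("What is the capital of France?"))
```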
Multi-frame algorithms for single-channel speech enhancement can take advantage of short-time correlations within the speech signal (see the sketch below).
Ranked #8 on Speech Enhancement on VoiceBank + DEMAND
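To make the multi-frame idea concrete, here is a minimal sketch that assumes nothing about the paper's specific estimator: for each frequency bin, the enhanced frame is a weighted combination of the current and preceding STFT frames, which is how short-time inter-frame correlations are exploited.

```python
# Minimal multi-frame filtering sketch (illustrative, not the paper's method).
import numpy as np

def multi_frame_filter(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """X: (frames, bins) complex STFT; w: (N, bins) per-bin filter weights."""
    N, F = w.shape
    Y = np.zeros_like(X)
    for t in range(X.shape[0]):
        # Stack the current frame and N-1 predecessors (zero-padded at start).
        stack = np.stack(
            [X[t - n] if t - n >= 0 else np.zeros(F, dtype=X.dtype) for n in range(N)]
        )  # shape (N, F)
        Y[t] = np.sum(w.conj() * stack, axis=0)  # per-bin inner product
    return Y

# Example: a trivial 3-frame averaging filter on random "STFT" data.
X = np.random.randn(100, 257) + 1j * np.random.randn(100, 257)
w = np.full((3, 257), 1 / 3, dtype=complex)
Y = multi_frame_filter(X, w)
```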
Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities.
To analyze video, we use 3D reconstructions from HMR 2.0 as input to a tracking system that operates in 3D (a toy association sketch follows the ranking note below).
Ranked #3 on Pose Tracking on PoseTrack2018
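As a toy illustration of tracking in 3D (not the paper's tracker), the sketch below links per-frame 3D detections, e.g., root translations from a mesh-recovery model, to tracks by nearest-neighbor distance; the threshold and all names are assumptions.

```python
# Hypothetical nearest-neighbor association of 3D detections to tracks.
import numpy as np

def associate(tracks: dict, detections: np.ndarray, max_dist: float = 0.5) -> dict:
    """tracks: {id: last 3D position}; detections: (M, 3) 3D positions."""
    next_id = max(tracks, default=-1) + 1
    for det in detections:
        if tracks:
            ids = list(tracks)
            dists = [np.linalg.norm(tracks[i] - det) for i in ids]
            j = int(np.argmin(dists))
            if dists[j] < max_dist:      # extend the closest existing track
                tracks[ids[j]] = det
                continue
        tracks[next_id] = det            # otherwise start a new track
        next_id += 1
    return tracks

tracks = {}
for frame_dets in [np.array([[0.0, 0.0, 2.0]]), np.array([[0.05, 0.0, 2.02]])]:
    tracks = associate(tracks, frame_dets)
print(tracks)  # one track, updated across two frames
```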
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.
Ranked #1 on Action Recognition on AVA v2.2 (using extra training data)
We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task (a loading sketch follows the ranking note below).
Ranked #1 on Speech Synthesis on North American English
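Since XPhoneBERT is distributed as a standard Transformer encoder, a minimal loading sketch might look as follows, assuming the public Hugging Face checkpoint vinai/xphonebert-base and a pre-phonemized input string (XPhoneBERT consumes phoneme sequences, not raw text; the phoneme string below is a made-up example):

```python
# Minimal sketch of extracting phoneme representations with XPhoneBERT,
# assuming the "vinai/xphonebert-base" checkpoint on the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# Input must already be a phoneme sequence produced by a text-to-phoneme
# front end; this string is purely illustrative.
phonemes = "ð ɪ s ɪ z ɐ t ɛ s t"
inputs = tokenizer(phonemes, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```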
For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities, as the pre-trained audio encoder, and introduce an Audio Q-Former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module.
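To illustrate the Audio Q-Former idea, here is a minimal PyTorch sketch of a Q-Former-style adapter: a fixed set of learnable queries cross-attends to features from a frozen audio encoder (standing in for ImageBind) to produce a small number of auditory query embeddings. Dimensions and module names are assumptions, not the paper's implementation.

```python
# Hypothetical Q-Former-style audio adapter: learnable queries cross-attend
# to frozen audio-encoder features via a Transformer decoder.
import torch
import torch.nn as nn

class AudioQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_heads=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, dim) from the frozen audio encoder.
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        # Queries attend to audio features (cross-attention in the decoder).
        return self.decoder(q, audio_feats)  # (batch, num_queries, dim)

audio_feats = torch.randn(2, 100, 768)  # stand-in encoder output
out = AudioQFormer()(audio_feats)
print(out.shape)                        # torch.Size([2, 32, 768])
```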
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis.