Matching Anything by Segmenting Anything

siyuanliii/masa CVPR 2024

The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially Multiple Object Tracking (MOT).

Domain Generalization Multiple Object Tracking +2

Nemotron-4 340B Technical Report

nvidia/nemo-aligner 17 Jun 2024

We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward.

Synthetic Data Generation

Simple and Effective Masked Diffusion Language Models

kuleshov-group/mdlm 11 Jun 2024

While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling.

Language Modelling Masked Language Modeling

Advancing High Resolution Vision-Language Models in Biomedicine

standardmodelbio/llama3-med 12 Jun 2024

Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling.

Language Modelling Question Answering +1

Towards Vision-Language Geo-Foundation Model: A Survey

zytx121/awesome-vlgfm 13 Jun 2024

Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding.

Earth Observation Image Captioning +4

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

tencent/hunyuandit 14 May 2024

For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images.

Image Generation Language Modelling +2

Scaling and evaluating sparse autoencoders

openai/sparse_autoencoder 6 Jun 2024

Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity.

Language Modelling

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

agent-husky/husky-v1 10 Jun 2024

Despite using 7B models, Husky matches or even exceeds frontier LMs such as GPT-4 on these tasks, showcasing the efficacy of our holistic approach in addressing complex reasoning problems.

Multi-hop Question Answering Question Answering

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

ictnlp/streamspeech 5 Jun 2024

Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a. streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication.

Automatic Speech Recognition (ASR) de-en +11

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

BytedanceSpeech/seed-tts-eval 4 Jun 2024

Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild.

In-Context Learning Language Modelling

