Qwen2-Audio Technical Report

qwenlm/qwen2-audio 15 Jul 2024

We introduce Qwen2-Audio, the latest progress of Qwen-Audio: a large-scale audio-language model that accepts various audio signal inputs and performs audio analysis or gives direct textual responses to speech instructions.

Instruction Following, Language Modelling

3.89 stars / hour

IMAGDressing-v1: Customizable Virtual Dressing

muzishen/imagdressing 17 Jul 2024

Latest advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience.

Denoising, Image Generation +1

2.52 stars / hour

SEED-Story: Multimodal Long Story Generation with Large Language Model

tencentarc/seed-story 11 Jul 2024

We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner.

Image Generation, Language Modelling +3

1.73 stars / hour

Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

stanford-oval/storm 22 Feb 2024

We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.


1.61 stars / hour

Scaling Diffusion Transformers to 16 Billion Parameters

feizc/dit-moe 16 Jul 2024

In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference.

Attribute, Conditional Image Generation +2

1.56 stars / hour

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

funaudiollm/cosyvoice 4 Jul 2024

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).

Emotion Recognition, Event Detection +6

1.49 stars / hour

Cradle: Empowering Foundation Agents Towards General Computer Control

baai-agents/cradle 5 Mar 2024

To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output.

Efficient Exploration

1.22 stars / hour

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

chenyirui/gim 24 Jun 2024

The extraordinary ability of generative models emerges as a new trend in image editing and generating realistic images, posing a serious threat to the trustworthiness of multimedia data and driving the research of image manipulation detection and localization (IMDL).

Image Manipulation, Image Manipulation Detection

1.20 stars / hour

Grounding Image Matching in 3D with MASt3R

naver/mast3r 14 Jun 2024

Image Matching is a core component of all best-performing algorithms and pipelines in 3D vision.

3D Reconstruction

1.13 stars / hour

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

nvlabs/mambavision 10 Jul 2024

We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.

Image Classification, Instance Segmentation +3

1.08 stars / hour