HunyuanVideo: A Systematic Framework For Large Video Generative Models

tencent/hunyuanvideo 3 Dec 2024

In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models.

Video Alignment, Video Generation

5,470
2.58 stars / hour

Papers-in-100-Lines-of-Code

MaximeVandegar/Papers-in-100-Lines-of-Code 22 Dec 2014

Implementation of papers in 100 lines of code.

Stochastic Optimization

1,222
2.37 stars / hour

MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

modelscope/ClearerVoice-Studio 23 Feb 2023

To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence.

Ranked #1 on Speech Separation on WSJ0-2mix-16k (using extra training data)

Speech Separation

1,654
1.92 stars / hour

Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

FoundationVision/Infinity 5 Dec 2024

We present Infinity, a Bitwise Visual AutoRegressive Modeling framework capable of generating high-resolution, photorealistic images that follow language instructions.

Image Generation

286
1.66 stars / hour

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

FoundationVision/VAR 3 Apr 2024

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".

Image Generation, Language Modelling, +2

5,937
1.40 stars / hour

VisionZip: Longer is Better but Not Necessary in Vision Language Models

dvlab-research/visionzip 5 Dec 2024

We introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance.

Video Understanding

168
1.08 stars / hour
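
The token-selection idea in the VisionZip summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_informative_tokens`, the `keep` budget, and the use of one precomputed importance score per visual token (for example, an attention weight) are all assumptions made for illustration. The sketch keeps only the highest-scoring tokens and preserves their original order.

```python
from typing import List, Sequence


def select_informative_tokens(scores: Sequence[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring tokens, in original order.

    `scores` is a hypothetical per-token importance score; how it is computed
    is outside the scope of this sketch.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    keep = min(keep, len(scores))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:keep])


# Hypothetical scores for six visual tokens; keep the three most informative.
example_scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
kept_indices = select_informative_tokens(example_scores, keep=3)
print(kept_indices)
```

Passing only the kept tokens to the language model shortens the visual sequence, which is where the efficiency gain described above would come from under these assumptions.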

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

jiuhaichen/florence-vl 5 Dec 2024

We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model.

Contrastive Learning, Hallucination, +3

128
0.91 stars / hour

Stem-leaf segmentation and phenotypic trait extraction of maize shoots from three-dimensional point cloud

syau-miao/seg4maize 7 Sep 2020

Automatic stem-leaf segmentation of maize shoots from three-dimensional (3D) point clouds remains challenging, especially for newly emerging leaves that are very close together and wrapped around one another during the seedling stage.

Segmentation

89
0.85 stars / hour

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

showlab/showui 26 Nov 2024

In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships among its components, and using them as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances.

Instruction Following

573
0.85 stars / hour

GenCast: Diffusion-based ensemble forecasting for medium-range weather

deepmind/graphcast 25 Dec 2023

Weather forecasts are fundamentally uncertain, so predicting the range of probable weather scenarios is crucial for important decisions, from warning the public about hazardous weather, to planning renewable energy use.

Decision Making, Weather Forecasting

5,326
0.77 stars / hour