Search Results for author: Yanghua Peng

Found 13 papers, 7 papers with code

Goku: Flow Based Video Generative Foundation Models

no code implementations • 7 Feb 2025 Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance.

Text-to-Image Generation Video Generation

HybridFlow: A Flexible and Efficient RLHF Framework

2 code implementations • 28 Sep 2024 Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu

Traditional RL can be modeled as a dataflow, where each node represents the computation of a neural network (NN) and each edge denotes a data dependency between the NNs.

Large Language Model
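The RL-as-dataflow view in the HybridFlow snippet can be illustrated with a toy graph: each node is an NN computation (actor, reward model, critic) and each edge is a data dependency. This is a minimal sketch of the general idea only; the class and node names are hypothetical and not HybridFlow's actual API.

```python
# Toy dataflow graph: nodes are NN computations, edges are data
# dependencies. Illustrative only, not HybridFlow's real interface.
from collections import defaultdict

class Dataflow:
    def __init__(self):
        self.fns = {}                  # node name -> computation
        self.deps = defaultdict(list)  # node name -> upstream node names

    def add_node(self, name, fn, deps=()):
        self.fns[name] = fn
        self.deps[name] = list(deps)

    def run(self, node, inputs):
        # Resolve upstream dependencies recursively (assumes a DAG).
        args = [self.run(d, inputs) for d in self.deps[node]]
        return self.fns[node](*args) if args else self.fns[node](inputs)

# A minimal RLHF-style graph: the actor generates a rollout, then the
# reward model and critic both consume it (dummy computations here).
flow = Dataflow()
flow.add_node("actor", lambda prompt: prompt + " -> response")
flow.add_node("reward", lambda rollout: len(rollout), deps=["actor"])
flow.add_node("critic", lambda rollout: len(rollout) * 0.5, deps=["actor"])

print(flow.run("reward", "hi"))  # -> 14
```

Running the "reward" node transparently pulls the actor's rollout through the edge first, which is the data-dependency behavior the snippet describes.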

Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation

no code implementations • 7 Aug 2024 Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu

Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving strong performance across domains including multimodal translation, visual question answering, and content generation.

Question Answering Scheduling +1

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development

no code implementations • 29 Jul 2024 Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu

In production, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales.

QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

1 code implementation • 2 Jul 2024 Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Yibo Zhu, Chuan Wu

A number of production deep learning clusters have attempted to exploit inference hardware for DNN training during off-peak serving hours, when many inference GPUs sit idle.

Quantization
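The QSync snippet revolves around training with quantized tensors on mixed hardware. As a generic illustration of the underlying compression step, here is a uniform symmetric quantize/dequantize round trip; this is a standard technique shown for context only, not QSync's actual quantization scheme.

```python
# Uniform symmetric quantization of a float tensor to low-bit integers
# and back. Generic sketch, not QSync's algorithm.

def quantize(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one scale."""
    qmax = (1 << (bits - 1)) - 1                      # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid zero scale
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

grads = [0.5, -1.0, 0.25]
q, s = quantize(grads, bits=8)
recovered = dequantize(q, s)
print(recovered)  # close to the original gradients, within 1/127
```

The trade-off is the usual one: fewer bits mean cheaper communication and compute on inference-class GPUs, at the cost of bounded rounding error per tensor.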

CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

1 code implementation • 16 Nov 2023 Hanpeng Hu, Junwei Su, Juntao Zhao, Yanghua Peng, Yibo Zhu, Haibin Lin, Chuan Wu

Considering the large space of DNN models and devices that impede direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices.

Domain Adaptation Prediction

dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training

no code implementations • 5 May 2022 Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, Chuanxiong Guo

Distributed training using multiple devices (e.g., GPUs) has been widely adopted for learning DNN models over large datasets.

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

no code implementations • 16 Dec 2021 Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, Chuanxiong Guo

Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems by 20.68x on average.

Graph Property Prediction Node Classification +1

DL2: A Deep Learning-driven Scheduler for Deep Learning Clusters

1 code implementation • 13 Sep 2019 Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chen Meng, Wei Lin

DL2 is a DL-driven scheduler for DL clusters that aims to expedite training jobs globally by dynamically resizing the resources allocated to each job.

Deep Learning Fairness +4

Online Job Scheduling in Distributed Machine Learning Clusters

no code implementations • 3 Jan 2018 Yixin Bao, Yanghua Peng, Chuan Wu, Zongpeng Li

In a shared cluster handling multiple training jobs, a fundamental issue is how to efficiently schedule jobs and set the number of concurrent workers to run for each job, such that server resources are maximally utilized and model training can be completed in time.

Distributed, Parallel, and Cluster Computing
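The scheduling question posed in the snippet above (how many concurrent workers each job should get so that cluster resources are fully used) can be sketched with a simple greedy heuristic under diminishing returns. This is an illustrative toy, not the paper's actual online algorithm, and the job names and speedup numbers are made up.

```python
# Greedy worker allocation: repeatedly give the next worker to the job
# with the highest marginal gain, assuming diminishing returns as a job
# accumulates workers. Illustrative heuristic only.

def schedule(jobs, total_workers):
    """jobs: {name: base speedup per worker}; returns {name: workers}."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_workers):
        # Marginal gain shrinks as a job already holds more workers.
        best = max(jobs, key=lambda j: jobs[j] / (alloc[j] + 1))
        alloc[best] += 1
    return alloc

alloc = schedule({"resnet": 4.0, "bert": 2.0}, 6)
print(alloc)  # -> {'resnet': 4, 'bert': 2}
```

The job with the steeper speedup curve absorbs more workers, but never all of them, which mirrors the utilization-versus-completion-time balance the abstract describes.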
