Search Results for author: Di Zhang

Found 99 papers, 26 papers with code

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

no code implementations14 Mar 2025 Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang

However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation.

TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs

no code implementations13 Mar 2025 Yunxiao Wang, Meng Liu, Rui Shao, Haoyu Zhang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Liqiang Nie

Video large language models have achieved remarkable performance in tasks such as video question answering, however, their temporal understanding remains suboptimal.

Benchmarking Question Answering +2

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

no code implementations12 Mar 2025 Haoyu Zhang, Qiaohui Chu, Meng Liu, Yunxiao Wang, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, YaoWei Wang, Liqiang Nie

To address these challenges, we propose learning the mapping between exocentric and egocentric domains, leveraging the extensive exocentric knowledge within existing MLLMs to enhance egocentric video understanding.

Instruction Following Video Understanding

ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

no code implementations9 Mar 2025 Xukun Zhou, Fengxin Li, Ming Chen, Yan Zhou, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He, Zhaoxin Fan

Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation.

Diversity Retrieval

RectifiedHR: Enable Efficient High-Resolution Image Generation via Energy Rectification

no code implementations4 Mar 2025 Zhen Yang, Guibao Shen, Liang Hou, Mushui Liu, Luozhou Wang, Xin Tao, Pengfei Wan, Di Zhang, Ying-Cong Chen

In this paper, we propose RectifiedHR, an straightforward and efficient solution for training-free high-resolution image generation.

Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming

no code implementations22 Feb 2025 Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, Zhifang Sui

Large Language Models (LLMs) have gained increasing attention for their remarkable capacity, alongside concerns about safety arising from their potential to produce harmful content.

Diversity In-Context Learning +1

SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

no code implementations19 Feb 2025 Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong liu

Recently, enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has emerged as a research hotspot.

Logical Reasoning Policy Gradient Methods +2

FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems

no code implementations19 Feb 2025 Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, Di Zhang

Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication.

Action Detection Activity Detection +1

Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts

no code implementations18 Feb 2025 Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong

Moreover, we find that Finedeep achieves optimal results when balancing depth and width, specifically by adjusting the number of expert sub-layers and the number of experts per sub-layer.

Efficient Exploration

iMOVE: Instance-Motion-Aware Video Understanding

no code implementations17 Feb 2025 Jiaze Li, Yaya Shi, Zongyang Ma, Haoran Xu, Feng Cheng, Huihui Xiao, Ruiwen Kang, Fan Yang, Tingting Gao, Di Zhang

Enhancing the fine-grained instance spatiotemporal motion perception capabilities of Video Large Language Models is crucial for improving their temporal and general video understanding.

Computational Efficiency Video Understanding

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

no code implementations12 Feb 2025 Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai

In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space.

Object Text-to-Video Generation +1

Efficient Triangular Arbitrage Detection via Graph Neural Networks

no code implementations5 Feb 2025 Di Zhang

This work highlights the potential of using GNNs for solving optimization problems in finance and provides a promising approach for real-time arbitrage detection in dynamic financial markets.

Q-Learning

Visualizing the Local Atomic Environment Features of Machine Learning Interatomic Potential

no code implementations27 Jan 2025 Xuqiang Shao, Yuqi Zhang, Di Zhang, Tianxiang Gao, Xinyuan Liu, Zhiran Gan, Fanshun Meng, Hao Li, Weijie Yang

This paper addresses the challenges of creating efficient and high-quality datasets for machine learning potential functions.

Diversity

Improving Video Generation with Human Feedback

no code implementations23 Jan 2025 Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang

Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist.

Video Generation

GameFactory: Creating New Games with Generative Interactive Videos

no code implementations14 Jan 2025 Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation.

Domain Generalization Minecraft +1

ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

no code implementations8 Jan 2025 Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai

To address these challenges, we introduce ConceptMaster, an innovative framework that effectively tackles the critical issues of identity decoupling while maintaining concept fidelity in customized videos.

Text-to-Video Generation Video Generation

Owl-1: Omni World Model for Consistent Long Video Generation

1 code implementation12 Dec 2024 Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie zhou, Jiwen Lu

As videos are observations of the underlying evolving world, we propose to model the long-term developments in a latent space and use VGMs to film them into videos.

Video Generation

Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models

no code implementations10 Dec 2024 Haoran Lian, Junmin Chen, Wei Huang, Yizhe Xiong, Wenping Hu, Guiguang Ding, Hui Chen, Jianwei Niu, Zijia Lin, Fuzheng Zhang, Di Zhang

In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Encoding (HARPE), to equip LLMs with long context modeling capabilities while simplifying the training process.

Continual Pretraining Language Modeling +2

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

no code implementations10 Dec 2024 Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, Dahua Lin

Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results.

Video Generation

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

1 code implementation10 Dec 2024 Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang

Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency.

4D reconstruction Video Generation

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

no code implementations27 Nov 2024 Di Zhang, Junxian Li, Jingdi Lei, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou

In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic.

Autonomous Driving Multimodal Reasoning

Towards Precise Scaling Laws for Video Diffusion Transformers

no code implementations25 Nov 2024 Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang, Kun Gai

Achieving optimal performance of video diffusion transformers within given data and compute budget is crucial due to their high training costs.

Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models

no code implementations25 Nov 2024 Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong liu

To address these issues, we propose a high-quality VQA preference dataset, called \textit{\textbf{M}ultiple \textbf{M}ultimodal \textbf{A}rtificial \textbf{I}ntelligence \textbf{P}reference Datasets in \textbf{V}QA} (\textbf{MMAIP-V}), which is constructed by sampling from the response distribution set and using an external scoring function for response evaluation.

Visual Question Answering (VQA)

VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing

no code implementations22 Nov 2024 Jiahao Hu, Tianxiong Zhong, Xuebo Wang, Boyuan Jiang, Xingye Tian, Fei Yang, Pengfei Wan, Di Zhang

VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs, which comprises 9. 7M samples that encompass a wide range of video editing tasks.

Video Editing

MolReFlect: Towards Fine-grained In-Context Alignment between Molecules and Texts

no code implementations arXiv preprint 2024 Jiatong Li, Yunqing Liu, Wei Liu, Jingdi Lei, Di Zhang, Wenqi Fan, Dongzhan Zhou, Yuqiang Li, Qing Li

Previous endeavours often treat the molecule as a general SMILES string or molecular graph, neglecting the fine-grained alignments between the molecular sub-structures and the descriptive textual phrases, which are crucial for accurate and explainable predictions.

Descriptive Molecule Captioning +1

MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

no code implementations22 Nov 2024 Jiatong Li, Yunqing Liu, Wei Liu, Jingdi Le, Di Zhang, Wenqi Fan, Dongzhan Zhou, Yuqiang Li, Qing Li

Previous endeavours often treat the molecule as a general SMILES string or molecular graph, neglecting the fine-grained alignments between the molecular sub-structures and the descriptive textual phrases, which are crucial for accurate and explainable predictions.

Descriptive

Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

no code implementations21 Nov 2024 Zhuoman Liu, Weicai Ye, Yan Luximon, Pengfei Wan, Di Zhang

Realistic simulation of dynamic scenes requires accurately capturing diverse material properties and modeling complex object interactions grounded in physical principles.

Optical Flow Estimation

DMQR-RAG: Diverse Multi-Query Rewriting for RAG

no code implementations20 Nov 2024 Zhicong Li, Jiahao Wang, Zhishu Jiang, Hangyu Mao, Zhongxia Chen, Jiazhen Du, Yuanxing Zhang, Fuzheng Zhang, Di Zhang, Yong liu

In this paper, we introduce DMQR-RAG, a Diverse Multi-Query Rewriting framework designed to improve the performance of both document retrieval and final responses in RAG.

RAG Retrieval

Kwai-STaR: Transform LLMs into State-Transition Reasoners

no code implementations7 Nov 2024 Xingyu Lu, Yuhang Hu, Changyi Liu, Tianke Zhang, Zhenyu Yang, Zhixiang Ding, Shengsheng Qian, Meng Du, Ruiwen Kang, Kaiyu Tang, Fan Yang, Tingting Gao, Di Zhang, Hai-Tao Zheng, Bin Wen

In this work, we define mathematical problem-solving as a process of transiting from an initial unsolved state to the final resolved state, and propose Kwai-STaR framework, which transforms LLMs into State-Transition Reasoners to improve their intuitive reasoning capabilities.

GSM8K Mathematical Problem-Solving +1

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

no code implementations10 Oct 2024 Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, Di Zhang

As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models.

Video Alignment Video Generation

LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

1 code implementation3 Oct 2024 Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou

This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs).

Efficient Exploration Mathematical Problem-Solving +1

Focus On What Matters: Separated Models For Visual-Based RL Generalization

no code implementations29 Sep 2024 Di Zhang, Bowen Lv, Hai Zhang, Feifan Yang, Junqiao Zhao, Hang Yu, Chang Huang, Hongtu Zhou, Chen Ye, Changjun Jiang

Perceiving the pre-eminence of image reconstruction in representation learning, we propose SMG (Separated Models for Generalization), a novel approach that exploits image reconstruction for generalization.

Image Reconstruction Reinforcement Learning (RL) +1

Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding

no code implementations29 Sep 2024 Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, Liqiang Nie

For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset.

Diversity Question Answering +2

Towards Unified 3D Hair Reconstruction from Single-View Portraits

no code implementations25 Sep 2024 Yujian Zheng, Yuda Qiu, Leyang Jin, Chongyang Ma, Haibin Huang, Di Zhang, Pengfei Wan, Xiaoguang Han

Our experiments demonstrate that reconstructing braided and un-braided 3D hair from single-view images via a unified approach is possible and our method achieves the state-of-the-art performance in recovering complex hairstyles.

3D Reconstruction Single-View 3D Reconstruction

ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning

no code implementations23 Sep 2024 Yihong Tang, Jiao Ou, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai

Role-playing is an emerging application in the field of Human-Computer Interaction (HCI), primarily implemented through the alignment training of a large language model (LLM) with assigned characters.

Language Modeling Language Modelling +1

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

no code implementations21 Aug 2024 Yuanyang Yin, Yaqi Zhao, YaJie Zhang, Ke Lin, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities, typically comprising a Vision Encoder, an Adapter, and a Large Language Model (LLM).

Contrastive Learning Language Modeling +3

Asymmetric Graph Error Control with Low Complexity in Causal Bandits

no code implementations20 Aug 2024 Chen Peng, Di Zhang, Urbashi Mitra

In this paper, the causal bandit problem is investigated, in which the objective is to select an optimal sequence of interventions on nodes in a causal graph.

Change Detection Graph Learning

ViMo: Generating Motions from Casual Videos

no code implementations13 Aug 2024 Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han

Although humans have the innate ability to imagine multiple possible actions from videos, it remains an extraordinary challenge for computers due to the intricate camera movements and montages.

Motion Generation

The Economic Analysis of the Common Pool Method through the HARA Utility Functions

no code implementations9 Aug 2024 Mu Lin, Di Zhang, Ben Chen, Hang Zheng

Water market is a contemporary marketplace for water trading and is deemed to one of the most efficient instruments to improve the social welfare.

The Stochastic Conjugate Subgradient Algorithm For Kernel Support Vector Machines

no code implementations30 Jul 2024 Di Zhang, Suvrajeet Sen

Moreover, it significantly enhances both speed and accuracy of the optimization process.

EVLM: An Efficient Vision-Language Model for Visual Understanding

no code implementations19 Jul 2024 Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, JiaHong Wu, Fan Yang, Size Li, Di Zhang

However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to significant computational overhead.

Image Captioning Language Modeling +2

PlacidDreamer: Advancing Harmony in Text-to-3D Generation

1 code implementation19 Jul 2024 Shuo Huang, Shikun Sun, Zixuan Wang, Xiaoyu Qin, Yanmin Xiong, Yuan Zhang, Pengfei Wan, Di Zhang, Jia Jia

Previous methods utilize end-to-end 3D generation models to initialize 3D Gaussians, multi-view diffusion models to enforce multi-view consistency, and text-to-image diffusion models to refine details with score distillation algorithms.

3D Generation Text to 3D

A Comprehensive Survey on Kolmogorov Arnold Networks (KAN)

no code implementations13 Jul 2024 Tianrui Ji, Yuntian Hou, Di Zhang

Through this comprehensive survey of Kolmogorov-Arnold Networks(KAN), we have gained a thorough understanding of its theoretical foundation, architectural design, application scenarios, and current research progress.

Kolmogorov-Arnold Networks Survey

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

1 code implementation3 Jul 2024 Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, Di Zhang

Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability.

Computational Efficiency Face Reenactment +3

Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector

1 code implementation17 Jun 2024 Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Hongzhi Zhang, Fuzheng Zhang, Di Zhang, Kun Gai, Ji-Rong Wen

Hallucination detection is a challenging task for large language models (LLMs), and existing studies heavily rely on powerful closed-source LLMs such as GPT-4.

2k Hallucination

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

1 code implementation11 Jun 2024 Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang

This paper introduces the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS), designed to enhance performance in complex mathematical reasoning tasks.

Decision Making GSM8K +2

A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies

no code implementations31 May 2024 Jinchao Zhu, Yuxuan Wang, Siyuan Pan, Pengfei Wan, Di Zhang, Gao Huang

1) For the tuning method, we design a model assembly strategy to reconstruct a lightweight model while preserving performance through distillation.

Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs

no code implementations24 May 2024 Chenxi Sun, Hongzhi Zhang, Zijia Lin, Jingyuan Zhang, Fuzheng Zhang, Zhongyuan Wang, Bin Chen, Chengru Song, Di Zhang, Kun Gai, Deyi Xiong

The core of our approach is the observation that a pre-trained language model can confidently predict multiple contiguous tokens, forming the basis for a \textit{lexical unit}, in which these contiguous tokens could be decoded in parallel.

Code Generation Language Modeling +4

Learning Multi-dimensional Human Preference for Text-to-Image Generation

1 code implementation CVPR 2024 Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang

Current metrics for text-to-image models typically rely on statistical metrics which inadequately represent the real preference of humans.

Text-to-Image Generation

Evaluating Readability and Faithfulness of Concept-based Explanations

1 code implementation29 Apr 2024 Meng Li, Haoran Jin, Ruixuan Huang, Zhihao Xu, Defu Lian, Zijia Lin, Di Zhang, Xiting Wang

Based on this, we quantify the faithfulness of a concept explanation via perturbation.

Inductive-Deductive Strategy Reuse for Multi-Turn Instructional Dialogues

no code implementations17 Apr 2024 Jiao Ou, Jiayu Wu, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai

In this paper, we propose to explicitly capture the complex rules to help the user simulator pose diverse and in-depth instruction.

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

no code implementations15 Apr 2024 Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment.

Language Modelling Large Language Model

End-to-end training of Multimodal Model and ranking Model

1 code implementation9 Apr 2024 Xiuqi Deng, Lu Xu, Xiyao Li, Jinkai Yu, Erpeng Xue, Zhongyuan Wang, Di Zhang, Zhaojie Liu, Guorui Zhou, Yang song, Na Mou, Shen Jiang, Han Li

In this paper, we propose an industrial multimodal recommendation framework named EM3: End-to-end training of Multimodal Model and ranking Model, which sufficiently utilizes multimodal information and allows personalized ranking tasks to directly train the core modules in the multimodal model to obtain more task-oriented content features, without overburdening resource consumption.

Contrastive Learning model +1

Motion Inversion for Video Customization

1 code implementation29 Mar 2024 Luozhou Wang, Ziyang Mai, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen

In this work, we present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models.

Video Generation

DouRN: Improving DouZero by Residual Neural Networks

no code implementations21 Mar 2024 Yiquan Chen, Yingchao Lyu, Di Zhang

Deep reinforcement learning has made significant progress in games with imperfect information, but its performance in the card game Doudizhu (Chinese Poker/Fight the Landlord) remains unsatisfactory.

Deep Reinforcement Learning

DragAnything: Motion Control for Anything using Entity Representation

2 code implementations12 Mar 2024 Weijia Wu, Zhuang Li, YuChao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang

We introduce DragAnything, which utilizes a entity representation to achieve motion control for any object in controllable video generation.

Object Video Generation

A Survey on Applications of Reinforcement Learning in Spatial Resource Allocation

no code implementations6 Mar 2024 Di Zhang, Moyang Wang, Joseph Mango, Xiang Li, Xianrui Xu

Given these advancements, there has been a surge in novel methods employing reinforcement learning to tackle spatial resource allocation problems.

Decision Making reinforcement-learning +2

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

1 code implementation26 Feb 2024 Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang

The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but there still lacks a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner.

Enhancing Role-playing Systems through Aggressive Queries: Evaluation and Improvement

no code implementations16 Feb 2024 Yihong Tang, Jiao Ou, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai

Experiments on models improved by RoleAD indicate that our adversarial dataset ameliorates this deficiency, with the improvements demonstrating a degree of generalizability in ordinary scenarios.

Dialogue Generation

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

1 code implementation5 Feb 2024 Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang song, Kun Gai, Yadong Mu

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos.

Science Question Answering Text-to-Video Generation +3

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

no code implementations5 Feb 2024 Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, Jing Liao

In practice, users often desire the ability to control object motion and camera movement independently for customized video creation.

Object Video Generation

Rethinking Cross-Attention for Infrared and Visible Image Fusion

no code implementations22 Jan 2024 Lihua Jian, Songlei Xiong, Han Yan, Xiaoguang Niu, Shaowu Wu, Di Zhang

The DIIM is designed by modifying the vanilla cross-attention mechanism, which can promote the extraction of the discrepancy information of the source images.

Infrared And Visible Image Fusion

Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint

1 code implementation11 Jan 2024 Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, Ji-Rong Wen

To address it, we propose a new RL method named RLMEC that incorporates a generative model as the reward model, which is trained by the erroneous solution rewriting task under the minimum editing constraint, and can produce token-level rewards for RL training.

Question Answering Reinforcement Learning (RL)

I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

2 code implementations27 Dec 2023 Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, ZhengJun Zha, Haibin Huang, Chongyang Ma

I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model.

Video Generation

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

1 code implementation24 Nov 2023 Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang

In this paper, we introduce an information-enriched diffusion model for paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.

Image Generation Language Modeling +2

Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios

no code implementations14 Nov 2023 Lei Lin, Jiayi Fu, Pengli Liu, Qingyang Li, Yan Gong, Junchen Wan, Fuzheng Zhang, Zhongyuan Wang, Di Zhang, Kun Gai

Although chain-of-thought (CoT) prompting combined with language models has achieved encouraging results on complex reasoning tasks, the naive greedy decoding used in CoT prompting usually causes the repetitiveness and local optimality.

All Decoder +1

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

1 code implementation3 Nov 2023 Jiao Ou, Junda Lu, Che Liu, Yihong Tang, Fuzheng Zhang, Di Zhang, Kun Gai

In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks to probe the capabilities of LLMs as human-like dialogue systems should have.

Dialogue Evaluation

USDC: Unified Static and Dynamic Compression for Visual Transformer

no code implementations17 Oct 2023 Huan Yuan, Chao Liao, Jianchao Tan, Peng Yao, Jiyuan Jia, Bin Chen, Chengru Song, Di Zhang

To alleviate two disadvantages of two categories of methods, we propose to unify the static compression and dynamic compression techniques jointly to obtain an input-adaptive compressed model, which can further better balance the total compression ratios and the model performances.

Model Compression

ASP: Automatic Selection of Proxy dataset for efficient AutoML

no code implementations17 Oct 2023 Peng Yao, Chao Liao, Jiyuan Jia, Jianchao Tan, Bin Chen, Chengru Song, Di Zhang

Deep neural networks have gained great success due to the increasing amounts of data, and diverse effective neural network designs.

Neural Architecture Search

KwaiYiiMath: Technical Report

no code implementations11 Oct 2023 Jiayi Fu, Lei Lin, Xiaoyang Gao, Pengli Liu, Zhengzong Chen, Zhirui Yang, ShengNan Zhang, Xue Zheng, Yan Li, Yuliang Liu, Xucheng Ye, Yiqiao Liao, Chao Liao, Bin Chen, Chengru Song, Junchen Wan, Zijia Lin, Fuzheng Zhang, Zhongyuan Wang, Di Zhang, Kun Gai

Recent advancements in large language models (LLMs) have demonstrated remarkable abilities in handling a variety of natural language processing (NLP) downstream tasks, even on mathematical tasks requiring multi-step reasoning.

Ranked #95 on Arithmetic Reasoning on GSM8K (using extra training data)

Arithmetic Reasoning GSM8K +1

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

1 code implementation9 Sep 2023 Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu

Specifically, we introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read.

Language Modelling Large Language Model +1

Resource Constrained Model Compression via Minimax Optimization for Spiking Neural Networks

1 code implementation9 Aug 2023 Jue Chen, Huan Yuan, Jianchao Tan, Bin Chen, Chengru Song, Di Zhang

We propose an improved end-to-end Minimax optimization method for this sparse learning problem to better balance the model performance and the computation efficiency.

Model Compression Sparse Learning

Cyclic Delay-Doppler Shift: A Simple Transmit Diversity Technique for Delay-Doppler Waveforms in Doubly Selective Channels

no code implementations22 Feb 2023 Haoran Yin, Jiaojiao Xiong, Yu Zhou, Chi Zhang, Di Zhang, Xizhang Wei, Yanqun Tang

Delay-Doppler waveform design has been considered as a promising solution to achieve reliable communication under high-mobility channels for the space-air-ground-integrated networks (SAGIN).

Diversity

ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection

no code implementations19 Jan 2023 Chris Egersdoerfer, Dong Dai, Di Zhang

With the increasing prevalence of scalable file systems in the context of High Performance Computing (HPC), the importance of accurate anomaly detection on runtime logs is increasing.

Anomaly Detection Clustering +2

Optimal Settings for Cryptocurrency Trading Pairs

no code implementations20 Oct 2022 Di Zhang, Youzhou Zhou

2) It satisfies the connectivity constraint, that is, all currencies are guaranteed to be tradable.

Management Missing Values

Modeling Randomly Walking Volatility with Chained Gamma Distributions

no code implementations4 Jul 2022 Di Zhang, Qiang Niu, Youzhou Zhou

2) If the variational inference(VI) is used for state estimation, it runs much faster than Monte Carlo(MC) methods since the calculation of the posterior uses only basic arithmetic operations.

Time Series Analysis Variational Inference

PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems

1 code implementation11 Apr 2022 Yuanxing Zhang, Langshi Chen, Siran Yang, Man Yuan, Huimin Yi, Jie Zhang, Jiamang Wang, Jianbo Dong, Yunlong Xu, Yue Song, Yong Li, Di Zhang, Wei Lin, Lin Qu, Bo Zheng

However, we observe that GPU devices in training recommender systems are underutilized, and they cannot attain an expected throughput improvement as what it has achieved in CV and NLP areas.

Marketing Recommendation Systems

AMCAD: Adaptive Mixed-Curvature Representation based Advertisement Retrieval System

no code implementations28 Mar 2022 Zhirong Xu, Shiyang Wen, Junshan Wang, Guojun Liu, Liang Wang, Zhi Yang, Lei Ding, Yan Zhang, Di Zhang, Jian Xu, Bo Zheng

Moreover, to deploy AMCAD in Taobao, one of the largest ecommerce platforms with hundreds of million users, we design an efficient two-layer online retrieval framework for the task of graph based advertisement retrieval.

Graph Embedding Information Retrieval +1

Juvenile state hypothesis: What we can learn from lottery ticket hypothesis researches?

no code implementations8 Sep 2021 Di Zhang

The original lottery ticket hypothesis performs pruning and weight resetting after training convergence, exposing it to the problem of forgotten learning knowledge and potential high cost of training.

M6-T: Exploring Sparse Expert Models and Beyond

no code implementations31 May 2021 An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, Hongxia Yang

Mixture-of-Experts (MoE) models can achieve promising results with outrageous large amount of parameters but constant computation cost, and thus it has become a trend in model scaling.

Playing the Game of 2048

Graph Intention Network for Click-through Rate Prediction in Sponsored Search

no code implementations30 Mar 2021 Feng Li, Zhenrui Chen, Pengjie Wang, Yi Ren, Di Zhang, Xiaoyu Zhu

Moreover, it is difficult for user to jump out of their specific historical behaviors for possible interest exploration, namely weak generalization problem.

Click-Through Rate Prediction Graph Learning

SCMA Codebook Design Based on Uniquely Decomposable Constellation Groups

no code implementations27 Feb 2021 Xuewan Zhang, Dalong Zhang, Liuqing Yang, Gangtao Han, Hsiao-Hwa Chen, Di Zhang

Thus, BER performance of the proposed codebook design approach outperforms that of the existing codebook design schemes in both uncoded and coded SCMA systems, especially for large-size codebooks.

Tunable ferroelectricity in hBN intercalated twisted double-layer graphene

no code implementations24 Feb 2021 Yibo Wang, Siqi Jiang, Jingkuan Xiao, Xiaofan Cai, Di Zhang, Ping Wang, Guodong Ma, Yaqing Han, Jiabei Huang, Kenji Watanabe, Takashi Taniguchi, Alexander S. Mayorov, Geliang Yu

Van der Waals (vdW) assembly of two-dimensional materials has been long recognized as a powerful tool to create unique systems with properties that cannot be found in natural compounds.

Mesoscale and Nanoscale Physics Materials Science

Singlino-dominated dark matter in $Z_3$-NMSSM

no code implementations10 Feb 2021 Haijing Zhou, Junjie Cao, Jingwei Lian, Di Zhang

Approximate analytical formulas describing the dark matter abundance and cross section in the scattering with nucleons are used to illustrate a dependence on theoretical parameters in neutralino and Higgs sectors.

High Energy Physics - Phenomenology

Radiative Decays of Charged Leptons in the Seesaw Effective Field Theory with One-loop Matching

no code implementations9 Feb 2021 Di Zhang, Shun Zhou

For the first time, the Wilson coefficients of all the relevant six-dimensional operators are computed by carrying out the one-loop matching between the effective theory and full seesaw model, and applied to calculate the total rates of radiative decays of charged leptons.

High Energy Physics - Phenomenology High Energy Physics - Experiment

A bi-diffusion based layer-wise sampling method for deep learning in large graphs

no code implementations25 Sep 2019 Yu He, Shiyang Wen, Wenjin Wu, Yan Zhang, Siran Yang, Yuan Wei, Di Zhang, Guojie Song, Wei Lin, Liang Wang, Bo Zheng

The Graph Convolutional Network (GCN) and its variants are powerful models for graph representation learning and have recently achieved great success on many graph-based applications.

Graph Representation Learning

Projecting "better than randomly": How to reduce the dimensionality of very large datasets in a way that outperforms random projections

no code implementations3 Jan 2019 Michael Wojnowicz, Di Zhang, Glenn Chisholm, Xuan Zhao, Matt Wolff

However, the recent development of randomized principal component analysis (RPCA) has opened up the possibility of obtaining approximate principal components on very large datasets.

Dimensionality Reduction General Classification +1

Gear Training: A new way to implement high-performance model-parallel training

no code implementations11 Jun 2018 Hao Dong, Shuai Li, Dongchang Xu, Yi Ren, Di Zhang

The training of Deep Neural Networks usually needs tremendous computing resources.

Cannot find the paper you are looking for? You can Submit a new open access paper.