1 code implementation • 9 Oct 2023 • Chen Pan, Fan Zhou, Xuanwei Hu, Xinxin Zhu, Wenxin Ning, Zi Zhuang, Siqiao Xue, James Zhang, Yunhua Hu
Deciding the best future execution time is a critical task in many business activities as time series forecasts evolve, and an optimal timing strategy, driven by observed data, provides such a solution.
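As a hedged illustration of the timing problem (a naive baseline, not the paper's learned strategy), one can pick the future step that minimizes expected cost under a point forecast; all numbers below are made up:

```python
import numpy as np

# Toy baseline: choose the cheapest future hour to execute a purchase,
# given a point forecast of prices. Not the paper's data-driven policy.
rng = np.random.default_rng(0)
horizon = 24  # hours ahead
forecast_prices = 100 + 5 * np.sin(np.arange(horizon) / 3) + rng.normal(0, 1, horizon)

best_t = int(np.argmin(forecast_prices))  # execute when the forecast cost is lowest
print(f"execute at t+{best_t}h, expected price {forecast_prices[best_t]:.2f}")
```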
no code implementations • 16 Jun 2023 • Shuai Xiao, Chen Pan, Min Wang, Xinxin Zhu, Siqiao Xue, Jing Wang, Yunhua Hu, James Zhang, Jinghua Feng
To this end, we formulate the problem as a partially observable Markov decision process (POMDP) and employ an environment-correction algorithm based on the characteristics of the business.
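For readers unfamiliar with the formulation, the defining feature of a POMDP is that the agent acts on noisy observations and a belief over states rather than the state itself. A minimal sketch of that loop follows; all dynamics here are illustrative and action-independent for brevity, not taken from the paper:

```python
import numpy as np

# Minimal POMDP interaction loop: the agent never sees the true state,
# only a noisy observation, and maintains a belief distribution over states.
rng = np.random.default_rng(1)
n_states, n_obs = 3, 3
T = np.full((n_states, n_states), 1 / n_states)  # transition probs P(s'|s)
O = np.eye(n_obs) * 0.8 + 0.1                    # observation probs P(o|s), noisy identity
O /= O.sum(axis=1, keepdims=True)

belief = np.full(n_states, 1 / n_states)         # uniform prior over states
state = rng.integers(n_states)
for step in range(5):
    obs = rng.choice(n_obs, p=O[state])          # agent observes, not the state itself
    belief = O[:, obs] * (T.T @ belief)          # Bayes filter: likelihood x predicted prior
    belief /= belief.sum()
    state = rng.choice(n_states, p=T[state])
    print(f"step {step}: obs={obs}, belief={np.round(belief, 2)}")
```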
1 code implementation • NeurIPS 2023 • Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu
Based on the proposed VAST-27M dataset, we train VAST, an omni-modality video-text foundation model that can perceive and process vision, audio, and subtitle modalities from video, and better supports various tasks including vision-text, audio-text, and multimodal video-text tasks (retrieval, captioning, and QA). A toy fusion sketch follows below.
Ranked #1 on Image Captioning on COCO Captions (SPICE metric, using extra training data)
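The sketch below illustrates the general pattern of omni-modality fusion (per-modality token embeddings joined into one sequence for a shared transformer); the dimensions and module choices are hypothetical, not the released VAST architecture:

```python
import torch
import torch.nn as nn

# Hypothetical omni-modality fusion: concatenate token embeddings from
# each modality and fuse them with a shared transformer encoder.
d = 256
vision_tokens   = torch.randn(1, 16, d)  # e.g., patch features from a vision encoder
audio_tokens    = torch.randn(1, 8, d)   # e.g., spectrogram features
subtitle_tokens = torch.randn(1, 12, d)  # e.g., subtitle text embeddings

fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)
multimodal = fusion(torch.cat([vision_tokens, audio_tokens, subtitle_tokens], dim=1))
print(multimodal.shape)  # (1, 36, 256): one joint sequence over all modalities
```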
1 code implementation • 25 May 2023 • Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu
We show that language-paired two-modality data alone is sufficient to connect all modalities.
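A hedged sketch of why this works (not the paper's model): if images and audio are each aligned to the same text embedding space using only image-text and audio-text pairs, image and audio become directly comparable without any image-audio pairs. Dimensions below are made up:

```python
import torch
import torch.nn as nn

# Image and audio projections trained (hypothetically) on language-paired
# data map into a shared text space, connecting the two modalities.
d_text = 128
image_proj = nn.Linear(512, d_text)  # trained on image-text pairs
audio_proj = nn.Linear(256, d_text)  # trained on audio-text pairs

img_emb = nn.functional.normalize(image_proj(torch.randn(1, 512)), dim=-1)
aud_emb = nn.functional.normalize(audio_proj(torch.randn(1, 256)), dim=-1)
similarity = img_emb @ aud_emb.T     # cross-modal similarity via the shared text space
print(similarity.item())
```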
1 code implementation • 17 Apr 2023 • Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu
Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. A toy alignment sketch follows below.
Ranked #1 on Video Captioning on VATEX (using extra training data)
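One common ingredient of such tri-modal pretraining is pairwise contrastive alignment; the InfoNCE-style sketch below only illustrates that idea and does not reproduce VALOR's actual objectives:

```python
import torch
import torch.nn.functional as F

# Illustrative contrastive alignment across vision (v), audio (a), text (t).
def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau            # similarity of every a_i with every b_j
    targets = torch.arange(a.size(0)) # the i-th pair is the positive
    return F.cross_entropy(logits, targets)

v, a, t = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce(v, t) + info_nce(a, t) + info_nce(v, a)  # align all three pairwise
print(loss.item())
```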
1 code implementation • 29 Mar 2023 • Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, Jing Liu
In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos together with audio signals.
2 code implementations • CVPR 2023 • Mingzhen Sun, Weining Wang, Xinxin Zhu, Jing Liu
Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101.
no code implementations • 11 Feb 2023 • Fan Zhou, Chen Pan, Lintao Ma, Yu Liu, Shiyu Wang, James Zhang, Xinxin Zhu, Xuanwei Hu, Yunhua Hu, Yangfei Zheng, Lei Lei, Yun Hu
Moreover, unlike most previous reconciliation methods, which either rely on strong assumptions or focus only on coherence constraints, we utilize deep neural optimization networks, which not only achieve coherency without any assumptions but also allow more flexible and realistic constraints for task-based targets, e.g., a lower under-estimation penalty and a meaningful decision-making loss that facilitates subsequent downstream tasks.
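For context on the coherence constraint itself, a classical baseline (not the paper's learned approach) projects incoherent base forecasts onto the subspace where each aggregate equals the sum of its children, via OLS reconciliation:

```python
import numpy as np

# OLS reconciliation on a two-level hierarchy: total = child1 + child2.
S = np.array([[1, 1],   # aggregation matrix: row 0 sums the bottom series
              [1, 0],
              [0, 1]], dtype=float)
y_hat = np.array([105.0, 60.0, 40.0])   # base forecasts [total, c1, c2]; incoherent: 60+40 != 105
P = S @ np.linalg.inv(S.T @ S) @ S.T    # projection onto the coherent subspace
y_tilde = P @ y_hat
print(y_tilde, "coherent:", np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2]))
```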
2 code implementations • 1 Jul 2021 • Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, which jointly models visual, textual, and audio resources.
Ranked #1 on Image Retrieval on Localized Narratives
no code implementations • 26 Jan 2021 • Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
In addition, thanks to the full-Transformer architecture, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder.
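Such patch-to-patch attention maps can be extracted directly from a multi-head attention layer; the recipe below is a generic PyTorch illustration, not the authors' code:

```python
import torch
import torch.nn as nn

# Ask a multi-head attention layer to return its averaged attention weights,
# which can then be plotted as a patch-to-patch heatmap.
d, n_patches = 64, 10
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
patches = torch.randn(1, n_patches, d)  # stand-in for patch embeddings
_, weights = attn(patches, patches, patches, need_weights=True)
print(weights.shape)                    # (1, 10, 10)
# weights[0, i, j] ~ how much patch i attends to patch j.
```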
no code implementations • 26 Jan 2021 • Sihan Chen, Xinxin Zhu, Wei Liu, Xingjian He, Jing Liu
Depth information matters in the RGB-D semantic segmentation task, as it provides additional geometric information to complement color images.
no code implementations • 24 Jan 2021 • Longteng Guo, Jing Liu, Xinxin Zhu, Hanqing Lu
These models are autoregressive in that they generate each word by conditioning on previously generated words, which leads to high latency during inference.
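The latency claim is easy to see in code: each word costs one sequential model call, so the number of forward passes grows with caption length. The toy decoder below is a made-up stand-in, purely illustrative:

```python
import random

# Toy autoregressive decoding: one model call per generated word.
def toy_model(image_features, prefix):
    random.seed(len(prefix))  # deterministic stand-in for a real decoder
    return random.randrange(2, 10) if len(prefix) < 6 else 1  # 1 = <eos>

def generate(image_features, bos=0, eos=1, max_len=20):
    words = [bos]
    for _ in range(max_len):  # strictly sequential: the latency bottleneck
        nxt = toy_model(image_features, words)
        words.append(nxt)
        if nxt == eos:
            break
    return words[1:]

print(generate(image_features=None))  # 6 tokens -> 6 sequential model calls
```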
no code implementations • 16 Dec 2020 • Xinxin Zhu, Weining Wang, Longteng Guo, Jing Liu
The whole process involves a visual understanding module and a language generation module, which poses more challenges for the design of deep neural networks than other tasks do.
no code implementations • 10 May 2020 • Longteng Guo, Jing Liu, Xinxin Zhu, Xingjian He, Jie Jiang, Hanqing Lu
In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
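In contrast to the autoregressive loop sketched earlier, a non-autoregressive decoder predicts every position in a single forward pass, so inference latency no longer grows with caption length. The sketch below shows only that decoding pattern; the paper's CMAL training scheme is not reproduced here:

```python
import torch
import torch.nn as nn

# Non-autoregressive decoding: logits for all positions in one pass.
d, vocab, max_len = 64, 1000, 12
decoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, vocab))
positions = torch.randn(1, max_len, d)  # stand-in for position queries + image context
logits = decoder(positions)             # one pass yields all word distributions
caption_ids = logits.argmax(dim=-1)     # (1, 12): all words decoded in parallel
print(caption_ids.shape)
```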
no code implementations • CVPR 2020 • Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, Hanqing Lu
First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA.
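One plausible way to "bring normalization inside SA" is to normalize the queries before the dot-product, as sketched below; the exact reparameterization used by NSA may place normalization differently, so treat this as an assumption:

```python
import torch
import torch.nn.functional as F

# Illustrative normalized self-attention: normalization applied to the
# queries inside the attention computation (placement is an assumption).
def normalized_self_attention(x, wq, wk, wv):
    q = F.layer_norm(x @ wq, x.shape[-1:])  # normalization inside SA
    k, v = x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 32
x = torch.randn(1, 10, d)
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
out = normalized_self_attention(x, wq, wk, wv)
print(out.shape)  # (1, 10, 32)
```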
no code implementations • 17 Oct 2019 • Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, Jing Liu
This report describes our solution for the VATEX Captioning Challenge 2020, which requires generating descriptions for videos in both English and Chinese.