no code implementations • 31 May 2023 • Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu, Jiebo Luo
In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove the influence of this effect.
no code implementations • 24 May 2023 • Songyang Zhang, Tianhang Yu, Brian Choi, Feng Ouyang, Zhi Ding
Providing rich and useful information regarding spectrum activities and propagation channels, radiomaps characterize the detailed distribution of power spectral density (PSD) and are important tools for network planning in modern wireless systems.
no code implementations • 19 May 2023 • Achintha Wijesinghe, Songyang Zhang, Zhi Ding
Our analysis demonstrates the convergence and privacy benefits of the proposed PS-FedGAN framework.
no code implementations • 17 May 2023 • Hao Li, Peng Jin, Zesen Cheng, Songyang Zhang, Kai Chen, Zhennan Wang, Chang Liu, Jie Chen
Video question answering aims to answer a question about video content by reasoning about the alignment semantics between the two.
no code implementations • 17 Apr 2023 • Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin
In contrast, we propose a parameter-free temporal shift module that can leverage the spatial U-Net as is for video generation.
2 code implementations • 12 Apr 2023 • Jiahao Wang, Songyang Zhang, Yong Liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin
Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy.
1 code implementation • 4 Mar 2023 • Yuan Liu, Songyang Zhang, Jiacheng Chen, Kai Chen, Dahua Lin
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
no code implementations • 25 Feb 2023 • Zhichao Liu, Leshan Wang, Desen Zhou, Jian Wang, Songyang Zhang, Yang Bai, Errui Ding, Rui Fan
To deal with these issues, we propose an attention-based approach, which we call the temporal segment transformer, for joint segment relation modeling and denoising.
1 code implementation • NeurIPS 2021 • Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng
Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
no code implementations • CVPR 2023 • Jiahao Wang, Songyang Zhang, Yong Liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin
Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy.
no code implementations • 24 Dec 2022 • Songyang Zhang, Achintha Wijesinghe, Zhi Ding
A practical goal is to estimate fine-resolution radio maps from sparse radio strength measurements.
1 code implementation • 22 Oct 2022 • Songyang Zhang, Linfeng Song, Lifeng Jin, Haitao Mi, Kun Xu, Dong Yu, Jiebo Luo
While previous work focuses on building systems for inducing grammars on text that is well aligned with video content, we investigate the scenario in which text and video are only in loose correspondence.
1 code implementation • 29 Sep 2022 • Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman
We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
Ranked #1 on Text-to-Video Generation on MSR-VTT
no code implementations • 10 Sep 2022 • Songyang Zhang, Tianhang Yu, Jonathan Tivald, Brian Choi, Feng Ouyang, Zhi Ding
A radio map describes network coverage and is a practically important tool for network planning in modern wireless systems.
no code implementations • 15 Aug 2022 • Shuaiyi Huang, Luyu Yang, Bo He, Songyang Zhang, Xuming He, Abhinav Shrivastava
In this paper, we aim to address the challenge of label sparsity in semantic correspondence by enriching supervision signals from sparse keypoint annotations.
2 code implementations • 4 Aug 2022 • Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling
Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios.
Ranked #5 on Zero-Shot Action Recognition on Kinetics
no code implementations • 3 Aug 2022 • Xingchen Li, Long Chen, Jian Shao, Shaoning Xiao, Songyang Zhang, Jun Xiao
Current Scene Graph Generation (SGG) methods tend to predict frequent predicate categories and fail to recognize rare ones due to the severely imbalanced distribution of predicates.
1 code implementation • 19 Jul 2022 • Yang Bai, Desen Zhou, Songyang Zhang, Jian Wang, Errui Ding, Yu Guan, Yang Long, Jingdong Wang
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences.
1 code implementation • CVPR 2022 • Lin Li, Long Chen, Yifeng Huang, Zhimeng Zhang, Songyang Zhang, Jun Xiao
Then, in Pos-NSD, we use a clustering-based algorithm to divide all positive samples into multiple sets, and treat the samples in the noisiest set as noisy positive samples.
1 code implementation • 17 Apr 2022 • Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Qiyuan Hu, Devi Parikh
Altogether, MUGEN can help progress research in many tasks in multimodal understanding and generation.
no code implementations • 7 Jan 2022 • Shipeng Yan, Songyang Zhang, Xuming He
In this work, we introduce a new budget-aware few-shot learning problem that not only aims to learn novel object categories, but also needs to select informative examples to annotate in order to achieve data efficiency.
1 code implementation • CVPR 2022 • Rongjie Li, Songyang Zhang, Xuming He
Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property.
no code implementations • 29 Nov 2021 • Songyang Zhang, Qinwen Deng, Zhi Ding
One important task of hyperspectral image (HSI) processing is the extraction of spectral-spatial features.
no code implementations • 29 Sep 2021 • Rongjie Li, Songyang Zhang, Xuming He
We develop a decoding-and-assembling paradigm for the end-to-end scene graph generation.
no code implementations • 3 Sep 2021 • Jiahui Li, Kun Kuang, Lin Li, Long Chen, Songyang Zhang, Jian Shao, Jun Xiao
Deep neural networks have demonstrated remarkable performance in many data-driven and prediction-oriented applications, and sometimes even perform better than humans.
1 code implementation • 31 Aug 2021 • Songyang Zhang, Qinwen Deng, Zhi Ding
To generalize traditional graph signal processing (GSP) over multilayer graphs for analyzing multi-level signal features and their interactions, this work proposes a tensor-based framework of multilayer graph signal processing (M-GSP).
no code implementations • 31 Aug 2021 • Songyang Zhang, Qinwen Deng, Zhi Ding
Graph signal processing (GSP) has become an important tool in image processing because of its ability to reveal underlying data structures.
1 code implementation • 8 Aug 2021 • Shipeng Yan, Jiale Zhou, Jiangwei Xie, Songyang Zhang, Xuming He
Incremental learning of semantic segmentation has emerged as a promising strategy for visual scene interpretation in the open-world setting.
1 code implementation • 27 Jul 2021 • Songyang Zhang, Lin Song, Songtao Liu, Zheng Ge, Zeming Li, Xuming He, Jian Sun
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario.
1 code implementation • ICCV 2021 • Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo
3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region.
1 code implementation • 11 May 2021 • Songyang Zhang, Jiale Zhou, Xuming He
Few-shot video classification aims to learn new video categories with only a few labeled examples, alleviating the burden of costly annotation in real-world applications.
1 code implementation • NAACL 2021 • Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu, Jiebo Luo
We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video.
3 code implementations • CVPR 2021 • Rongjie Li, Songyang Zhang, Bo Wan, Xuming He
Scene graph generation is an important visual understanding task with a broad range of vision applications.
1 code implementation • CVPR 2021 • Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, Jian Sun
Motivated by our discovery, we propose a unified distribution alignment strategy for long-tail visual recognition.
Ranked #16 on Long-tail Learning on Places-LT
no code implementations • 15 Mar 2021 • Shaoning Xiao, Long Chen, Songyang Zhang, Wei Ji, Jian Shao, Lu Ye, Jun Xiao
State-of-the-art NLVL methods are almost all one-stage and can typically be grouped into two categories: 1) anchor-based approaches, which first pre-define a series of video segment candidates (e.g., by sliding window) and then classify each candidate; 2) anchor-free approaches, which directly predict the probability of each video frame being a boundary or an intermediate frame inside the positive segment.
no code implementations • 11 Mar 2021 • Qinwen Deng, Songyang Zhang, Zhi Ding
Efficient processing and feature extraction of large-scale point clouds are important in related computer vision and cyber-physical systems.
no code implementations • 12 Feb 2021 • Qinwen Deng, Songyang Zhang, Zhi Ding
By directly estimating hypergraph spectrum based on hypergraph stationary processing, we design a spectral kernel-based filter to capture high-dimensional interactions among point signal nodes and to better preserve object surface outlines.
1 code implementation • 4 Dec 2020 • Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, Jiebo Luo
It is a challenging problem because a target moment may take place in the context of other temporal moments in the untrimmed video.
no code implementations • 3 Nov 2020 • Li Sun, Haoqi Zhang, Songyang Zhang, Jiebo Luo
Short-form video social media shifts away from the traditional media paradigm by telling the audience a dynamic story to attract their attention.
no code implementations • 11 Aug 2020 • Xi Chen, Songyang Zhang, Dandan Song, Peng Ouyang, Shouyi Yin
To demonstrate our proposed speech transformer with a bidirectional decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
2 code implementations • ECCV 2020 • Yongfei Liu, Xiangyi Zhang, Songyang Zhang, Xuming He
In this paper, we propose a novel few-shot semantic segmentation framework based on the prototype representation.
Ranked #3 on Few-Shot Semantic Segmentation on Pascal5i
Few-Shot Semantic Segmentation
Semi-Supervised Semantic Segmentation
no code implementations • 1 Jul 2020 • Songyang Zhang, Han Zhang, Shuguang Cui, Zhi Ding
Graph convolutional networks (GCN) have been recently utilized to extract the underlying structures of datasets with some labeled data and high-dimensional features.
no code implementations • 22 Jun 2020 • Jie An, Tianlang Chen, Songyang Zhang, Jiebo Luo
This work proposes a novel framework consisting of a reference image retrieval step and a global sentiment transfer step to transfer sentiments of images according to a given sentiment tag.
no code implementations • 27 Jan 2020 • Songyang Zhang, Tolga Aktas, Jiebo Luo
In this study, we explore culture preferences among countries using the thumbnails of YouTube trending videos.
no code implementations • 21 Jan 2020 • Songyang Zhang, Shuguang Cui, Zhi Ding
Hypergraph spectral analysis has emerged as an effective tool for processing complex data structures in data analysis.
no code implementations • 8 Jan 2020 • Songyang Zhang, Shuguang Cui, Zhi Ding
Along with increasingly popular virtual reality applications, the three-dimensional (3D) point cloud has become a fundamental data structure to characterize 3D objects and surroundings.
3 code implementations • 8 Dec 2019 • Songyang Zhang, Houwen Peng, Jianlong Fu, Jiebo Luo
We address the problem of retrieving a specific moment from an untrimmed video by a query sentence.
2 code implementations • 8 Dec 2019 • Songyang Zhang, Houwen Peng, Le Yang, Jianlong Fu, Jiebo Luo
In this report, we introduce the winning method for the HACS Temporal Action Localization Challenge 2019.
1 code implementation • 9 Sep 2019 • Songyang Zhang
Congestion control algorithms are important because they keep network links from falling into severe congestion and guarantee normal network operation.
Networking and Internet Architecture
1 code implementation • ICCV 2019 • Shuaiyi Huang, Qiuyue Wang, Songyang Zhang, Shipeng Yan, Xuming He
We instantiate our strategy by designing an end-to-end learnable deep network, named Dynamic Context Correspondence Network (DCCNet).
1 code implementation • 11 Aug 2019 • Songyang Zhang, Jinsong Su, Jiebo Luo
We address the problem of video moment localization with natural language, i.e., localizing a video segment described by a natural language sentence.
no code implementations • 22 Jul 2019 • Songyang Zhang, Zhi Ding, Shuguang Cui
Signal processing over graphs has recently attracted significant attention for dealing with structured data.
1 code implementation • 28 May 2019 • Songyang Zhang, Shipeng Yan, Xuming He
A promising strategy is to model the feature context by a fully-connected graph neural network (GNN), which augments traditional convolutional features with an estimated non-local context representation.
2 code implementations • 2 Sep 2018 • Songyang Zhang
Developing low-latency congestion control algorithms for real-time traffic has gained attention recently.
Networking and Internet Architecture
1 code implementation • 27 Oct 2017 • Yuhang Song, Mai Xu, Songyang Zhang, Liangyu Huo
However, the conventional deep neural network architecture is limited in learning representations for multi-task RL (MT-RL), as multiple tasks can refer to different kinds of representations.
1 code implementation • CVPR 2017 • Yufan Liu, Songyang Zhang, Mai Xu, Xuming He
On the other hand, we find that the attention of different subjects consistently focuses on a single face in each frame of videos involving multiple faces.
1 code implementation • IEEE Winter Conference on Applications of Computer Vision (WACV) 2017 • Songyang Zhang, Xiaoming Liu, Jun Xiao
RNN-based approaches have achieved outstanding performance on action recognition with skeleton inputs.
Ranked #1 on Skeleton Based Action Recognition on SBU