no code implementations • 5 Jul 2024 • Weibo Gong, Xiaojie Jin, Xin Li, Dongliang He, Xinglong Wu
To address this task, we propose VCoME, a general framework that employs a large multimodal model to generate editing effects for video composition.
no code implementations • 30 Jun 2024 • Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA).
1 code implementation • 12 Jun 2024 • Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
Our model is able to process extremely long video streams in real-time and respond to user queries simultaneously.
Ranked #1 on Zero-Shot Video Question Answer on MSRVTT-QA
no code implementations • 27 May 2024 • Jian Zhao, Lei Jin, Jianshu Li, Zheng Zhu, Yinglei Teng, Jiaojiao Zhao, Sadaf Gulshad, Zheng Wang, Bo Zhao, Xiangbo Shu, Yunchao Wei, Xuecheng Nie, Xiaojie Jin, Xiaodan Liang, Shin'ichi Satoh, Yandong Guo, Cewu Lu, Junliang Xing, Jane Shen Shengmei
The SkatingVerse Workshop & Challenge aims to encourage research in developing novel and accurate methods for human action understanding.
no code implementations • CVPR 2024 • Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang
This amplifies the effect of visual tokens on text generation, especially when the relative distance between visual and text tokens is large.
1 code implementation • CVPR 2024 • Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang, Heng Wang
While existing datasets mainly comprise landscape mode videos, our paper seeks to introduce portrait mode videos to the research community and highlight the unique challenges associated with this video format.
no code implementations • 12 Dec 2023 • Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang
This amplifies the effect of visual tokens on text generation, especially when the relative distance between visual and text tokens is large.
Ranked #2 on Question Answering on NExT-QA (Open-ended VideoQA)
1 code implementation • CVPR 2024 • Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin
PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks, including MUSE, single- and multi-referring segmentation.
no code implementations • 3 Oct 2023 • Xueqing Deng, Qi Fan, Xiaojie Jin, Linjie Yang, Peng Wang
Specifically, SFA consists of external adapters and internal adapters, which operate sequentially over a transformer model.
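To make the adapter idea concrete, here is a minimal sketch of a generic bottleneck adapter in PyTorch; the bottleneck design, dimensions, and zero-initialized residual branch are illustrative assumptions, not SFA's actual external/internal adapter definitions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    added residually to the (typically frozen) transformer features."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Toy usage: adapt features of shape (batch, tokens, dim).
feats = torch.randn(2, 196, 768)
adapter = BottleneckAdapter(768)
print(adapter(feats).shape)  # torch.Size([2, 196, 768])
```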
1 code implementation • ICCV 2023 • Xiaozheng Zheng, Zhuo Su, Chao Wen, Zhou Xue, Xiaojie Jin
To bridge the physical and virtual worlds for rapidly developed VR/AR applications, the ability to realistically drive 3D full-body avatars is of great significance.
1 code implementation • 15 Jun 2023 • Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu
Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations.
Ranked #1 on TGIF-Frame on TGIF-QA (using extra training data)
no code implementations • 24 May 2023 • Cheng-Ze Lu, Xiaojie Jin, Qibin Hou, Jun Hao Liew, Ming-Ming Cheng, Jiashi Feng
The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical.
no code implementations • 22 May 2023 • Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng
Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
Ranked #1 on Visual Question Answering (VQA) on MSVD-QA (using extra training data)
1 code implementation • CVPR 2024 • Xiaojie Jin, Bowen Zhang, Weibo Gong, Kai Xu, Xueqing Deng, Peng Wang, Zhao Zhang, Xiaohui Shen, Jiashi Feng
The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts.
no code implementations • 18 Jan 2023 • Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Jiashi Feng, Yi Yang
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given a text description, and text localization, which matches a subset of texts with the video features.
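As a hedged illustration of the moment-retrieval subtask only, the sketch below scores each frame against a text feature and picks the best contiguous span with a maximum-subarray heuristic; the actual model learns boundary predictors, and all features here are random stand-ins.

```python
import torch

# Score every frame against the text feature, then find the contiguous span
# with the highest total above-average score (Kadane's algorithm).
frames = torch.randn(120, 256)                     # per-frame video features
text = torch.randn(256)                            # sentence feature
sim = torch.nn.functional.cosine_similarity(frames, text.unsqueeze(0), dim=1)
centered = sim - sim.mean()                        # reward above-average frames

best, start, best_span = 0.0, 0, (0, 0)
running = 0.0
for i, v in enumerate(centered.tolist()):
    if running <= 0.0:                             # restart the candidate span
        running, start = 0.0, i
    running += v
    if running > best:
        best, best_span = running, (start, i)
print(best_span)  # predicted (start, end) frame indices
```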
no code implementations • 15 Jan 2023 • Cheng-Ze Lu, Xiaojie Jin, Zhicheng Huang, Qibin Hou, Ming-Ming Cheng, Jiashi Feng
Contrastive Masked Autoencoder (CMAE), as a new self-supervised framework, has shown its potential for learning expressive feature representations in visual image recognition.
1 code implementation • 16 Nov 2022 • Taojiannan Yang, Linjie Yang, Xiaojie Jin, Chen Chen
In this paper, we revisit these training-free metrics and find that: (1) the number of parameters (#Param), which is the most straightforward training-free metric, is overlooked in previous works but is surprisingly effective, and (2) recent training-free metrics largely rely on the #Param information to rank networks.
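A minimal sketch of using #Param as a training-free ranking metric; the candidate networks below are toy stand-ins.

```python
import torch.nn as nn

def num_params(model: nn.Module) -> int:
    """#Param: the simplest training-free metric -- just count the weights."""
    return sum(p.numel() for p in model.parameters())

# Rank two candidate architectures without any training.
candidates = {
    "small": nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3)),
    "large": nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 128, 3)),
}
ranking = sorted(candidates, key=lambda name: num_params(candidates[name]), reverse=True)
print(ranking)  # ['large', 'small']
```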
no code implementations • 9 Nov 2022 • Zhao Zhang, Suiyi Zhao, Xiaojie Jin, Mingliang Xu, Yi Yang, Shuicheng Yan, Meng Wang
Specifically, we propose Noise SElf-Regression (NoiSER), which, without access to any task-related data, simply learns a convolutional neural network equipped with an instance-normalization layer by taking a random noise image, $\mathcal{N}(0,\sigma^2)$ for each pixel, as both input and output of each training pair; the low-light image is then fed to the trained network to predict the normal-light image.
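Following the description above, a minimal sketch of the NoiSER-style training loop; the layer sizes, learning rate, and step count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Tiny CNN with an instance-normalization layer, per the description above.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.InstanceNorm2d(16, affine=True),  # the instance-normalization layer
    nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):                   # self-regression: noise -> same noise
    noise = torch.randn(1, 3, 64, 64)     # N(0, 1) per pixel
    opt.zero_grad()
    loss = loss_fn(net(noise), noise)
    loss.backward()
    opt.step()

# At inference, feed the low-light image to the trained network.
low_light = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    enhanced = net(low_light)
```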
no code implementations • 4 Nov 2022 • Bo Wang, Zhao Zhang, Mingbo Zhao, Xiaojie Jin, Mingliang Xu, Meng Wang
To obtain rich features, we use the Swin Transformer to calculate multi-level features, and then feed them into a novel dynamic multi-sight embedding module to exploit both global structure and local texture of input images.
1 code implementation • 12 Sep 2022 • Sen Pei, Shixiong Xu, Xiaojie Jin
However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight categories is defined in advance and all training data are available beforehand.
1 code implementation • 27 Jul 2022 • Zhicheng Huang, Xiaojie Jin, Chengze Lu, Qibin Hou, Ming-Ming Cheng, Dongmei Fu, Xiaohui Shen, Jiashi Feng
The momentum encoder, fed with the full images, enhances the feature discriminability via contrastive learning with its online counterpart.
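A minimal sketch of the momentum-encoder update, assuming a toy MLP in place of CMAE's ViT encoders and an illustrative momentum coefficient.

```python
import copy
import torch
import torch.nn as nn

# Online encoder is trained by gradients; the momentum encoder tracks it
# as an exponential moving average and receives no gradients.
online = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
momentum_enc = copy.deepcopy(online)
for p in momentum_enc.parameters():
    p.requires_grad = False

@torch.no_grad()
def ema_update(online_net: nn.Module, momentum_net: nn.Module, m: float = 0.996):
    """Blend each momentum parameter toward its online counterpart."""
    for po, pm in zip(online_net.parameters(), momentum_net.parameters()):
        pm.mul_(m).add_(po, alpha=1 - m)

# Call after each optimizer step on the online encoder:
ema_update(online, momentum_enc)
```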
1 code implementation • 27 Jul 2022 • Yaojie Shen, Libo Zhang, Kai Xu, Xiaojie Jin
First, we learn the embedding of video transitions through a video transition classification task.
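A hedged sketch of that first step: train a classifier over transition categories, then reuse its penultimate features as the transition embedding. The feature dimensions, category count, and data below are stand-ins.

```python
import torch
import torch.nn as nn

num_transitions = 30
encoder = nn.Sequential(nn.Linear(512, 128), nn.ReLU())   # embedding network
classifier = nn.Linear(128, num_transitions)
opt = torch.optim.Adam([*encoder.parameters(), *classifier.parameters()], lr=1e-3)

# One training step on (toy) transition-clip features and category labels.
clips = torch.randn(16, 512)
labels = torch.randint(0, num_transitions, (16,))
opt.zero_grad()
loss = nn.functional.cross_entropy(classifier(encoder(clips)), labels)
loss.backward()
opt.step()

embedding = encoder(clips)   # reusable transition embedding after training
```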
no code implementations • 3 Nov 2021 • Yi-Wen Chen, Xiaojie Jin, Xiaohui Shen, Ming-Hsuan Yang
Video salient object detection aims to find the most visually distinctive objects in a video.
4 code implementations • NeurIPS 2021 • Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, Qiang Liu
The goal of multi-task learning is to enable more efficient learning than single-task learning by sharing model structures for a diverse set of tasks.
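A minimal sketch of the shared-structure setup (hard parameter sharing with per-task heads); this illustrates model sharing only, not the paper's specific gradient-manipulation method.

```python
import torch
import torch.nn as nn

class SharedTrunkMTL(nn.Module):
    """Hard parameter sharing: one shared trunk, one head per task."""
    def __init__(self, in_dim=32, hidden=64, task_dims=(10, 1)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in task_dims)

    def forward(self, x):
        h = self.trunk(x)                    # features shared across tasks
        return [head(h) for head in self.heads]

model = SharedTrunkMTL()
outs = model(torch.randn(4, 32))
print([o.shape for o in outs])  # [torch.Size([4, 10]), torch.Size([4, 1])]
```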
1 code implementation • CVPR 2021 • Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, Ping Luo
Last, we propose an efficient fine-grained search strategy to train HR-NAS, which effectively explores the search space and finds optimal architectures given various tasks and computation resources.
1 code implementation • 7 Jun 2021 • Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs.
Ranked #183 on Image Classification on ImageNet
no code implementations • 27 Apr 2021 • Chaosheng Dong, Xiaojie Jin, Weihao Gao, Yijia Wang, Hongyi Zhang, Xiang Wu, Jianchao Yang, Xiaobing Liu
Deep learning models in large-scale machine learning systems are often continuously trained with enormous data from production environments.
7 code implementations • NeurIPS 2021 • Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng
In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
Ranked #3 on Efficient ViTs on ImageNet-1K (With LV-ViT-S)
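A hedged sketch of a token-labeling-style objective as named in the entry above: in addition to the usual loss on the class token, every patch token receives its own dense target. The labels and loss weighting below are illustrative, not the paper's machine-annotated score maps.

```python
import torch
import torch.nn.functional as F

B, N, C = 4, 196, 1000                       # batch, patch tokens, classes
cls_logits = torch.randn(B, C, requires_grad=True)      # class-token output
tok_logits = torch.randn(B, N, C, requires_grad=True)   # per-patch outputs
img_labels = torch.randint(0, C, (B,))
tok_labels = torch.randint(0, C, (B, N))     # stand-in per-token dense labels

cls_loss = F.cross_entropy(cls_logits, img_labels)
tok_loss = F.cross_entropy(tok_logits.reshape(B * N, C), tok_labels.reshape(B * N))
loss = cls_loss + 0.5 * tok_loss             # 0.5 is an assumed weighting
loss.backward()
```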
5 code implementations • 22 Mar 2021 • Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng
In this paper, we show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
Ranked #466 on Image Classification on ImageNet
1 code implementation • ICCV 2021 • Daquan Zhou, Xiaojie Jin, Xiaochen Lian, Linjie Yang, Yujing Xue, Qibin Hou, Jiashi Feng
Current neural architecture search (NAS) algorithms still require expert knowledge and effort to design a search space for network construction.
1 code implementation • 10 Nov 2020 • Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, Dong Xu
HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.
2 code implementations • CVPR 2020 • Yingwei Li, Xiaojie Jin, Jieru Mei, Xiaochen Lian, Linjie Yang, Cihang Xie, Qihang Yu, Yuyin Zhou, Song Bai, Alan Yuille
However, embedding NL blocks in mobile neural networks has rarely been explored, mainly due to the following challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply where computational resources are limited, and 2) it is an open problem to discover an optimal configuration for embedding NL blocks into mobile neural networks.
Ranked #60 on Neural Architecture Search on ImageNet
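For reference, a sketch of a standard embedded-Gaussian non-local block; its attention map over all pairs of spatial positions is what makes the computation cost heavy for mobile budgets. Channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Standard non-local block: every position attends to every other,
    so the attention map is (hw x hw) -- quadratic in spatial size."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)
        self.phi = nn.Conv2d(channels, reduced, 1)
        self.g = nn.Conv2d(channels, reduced, 1)
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw): the costly part
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

x = torch.randn(1, 32, 14, 14)
print(NonLocalBlock(32)(x).shape)  # torch.Size([1, 32, 14, 14])
```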
no code implementations • 30 Dec 2019 • Xiaojie Jin, Jiang Wang, Joshua Slocum, Ming-Hsuan Yang, Shengyang Dai, Shuicheng Yan, Jiashi Feng
In this paper, we propose the resource constrained differentiable architecture search (RC-DARTS) method to learn architectures that are significantly smaller and faster while achieving comparable accuracy.
1 code implementation • ICLR 2020 • Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, Jianchao Yang
We propose a fine-grained search space composed of atomic blocks, minimal search units that are much smaller than those used in recent NAS algorithms.
Ranked #61 on Neural Architecture Search on ImageNet
no code implementations • ICLR 2020 • Daquan Zhou, Xiaojie Jin, Qibin Hou, Kaixin Wang, Jianchao Yang, Jiashi Feng
The recent WSNet [1] is a new model compression method that samples filter weights from a compact set, and it has been demonstrated to be effective for 1D convolutional neural networks (CNNs).
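A minimal sketch of the weight-sampling idea under assumed sizes and stride: every 1D filter is read as an overlapping window of one compact shared weight vector, so parameters are reused across filters.

```python
import torch
import torch.nn.functional as F

compact = torch.randn(32, requires_grad=True)   # compact shared weight set
kernel, in_ch, out_ch, stride = 5, 1, 8, 3      # illustrative sizes

# Each filter is an overlapping slice of the compact vector (max index
# used: 7 * 3 + 5 = 26 <= 32), so filters share parameters.
filters = torch.stack([compact[i * stride : i * stride + kernel]
                       for i in range(out_ch)])           # (out_ch, kernel)
weight = filters.view(out_ch, in_ch, kernel)              # conv1d weight shape

x = torch.randn(2, in_ch, 100)
y = F.conv1d(x, weight, padding=2)
print(y.shape)  # torch.Size([2, 8, 100]); gradients flow back to `compact`
```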
no code implementations • ICLR 2018 • Xiaojie Jin, Yingzhen Yang, Ning Xu, Jianchao Yang, Jiashi Feng, Shuicheng Yan
We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks.
no code implementations • ICML 2018 • Xiaojie Jin, Yingzhen Yang, Ning Xu, Jianchao Yang, Nebojsa Jojic, Jiashi Feng, Shuicheng Yan
We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks.
no code implementations • NeurIPS 2017 • Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, Shuicheng Yan
The ability to predict the future is important for intelligent systems, e.g., autonomous vehicles and robots, which need to plan early and make decisions accordingly.
17 code implementations • NeurIPS 2017 • Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, Jiashi Feng
In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally.
no code implementations • CVPR 2017 • Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, Wei Liu
To overcome this issue, we propose a deep self-taught learning approach, which makes the detector learn object-level features that are reliable for acquiring tight positive samples, and afterwards re-train itself based on them.
no code implementations • NeurIPS 2016 • Zequn Jie, Xiaodan Liang, Jiashi Feng, Xiaojie Jin, Wen Feng Lu, Shuicheng Yan
Therefore, Tree-RL can better cover different objects with various scales, which is quite appealing in the context of object proposal.
no code implementations • 24 Jan 2017 • Yunpeng Chen, Xiaojie Jin, Jiashi Feng, Shuicheng Yan
Learning rich and diverse representations is critical for the performance of deep convolutional neural networks (CNNs).
no code implementations • ICCV 2017 • Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, Shuicheng Yan
In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations.
no code implementations • 27 Aug 2016 • Xiaojie Jin, Yunpeng Chen, Jiashi Feng, Zequn Jie, Shuicheng Yan
In this paper, we consider the scene parsing problem and propose a novel Multi-Path Feedback recurrent neural network (MPF-RNN) for parsing scene images.
no code implementations • 19 Jul 2016 • Xiaojie Jin, Yunpeng Chen, Jian Dong, Jiashi Feng, Shuicheng Yan
In this paper, we propose a layer-wise discriminative learning method to enhance the discriminative capability of a deep network by allowing its layers to work collaboratively for classification.
no code implementations • 19 Jul 2016 • Xiaojie Jin, Xiao-Tong Yuan, Jiashi Feng, Shuicheng Yan
In this paper, we propose an iterative hard thresholding (IHT) approach to train Skinny Deep Neural Networks (SDNNs).
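A hedged sketch of one hard-thresholding step in the IHT style: keep the top-k weights of each tensor by magnitude, zero the rest, and continue training. The keep ratio is illustrative.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def hard_threshold(model: nn.Module, keep_ratio: float = 0.1):
    """Zero all but the largest-magnitude weights in each parameter tensor."""
    for p in model.parameters():
        k = max(1, int(keep_ratio * p.numel()))
        threshold = p.abs().flatten().topk(k).values.min()
        p.mul_((p.abs() >= threshold).float())

net = nn.Linear(100, 10)
hard_threshold(net, keep_ratio=0.1)   # in IHT, alternate this with training
print((net.weight != 0).float().mean())  # ~0.1 of the weights survive
```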
1 code implementation • 22 Dec 2015 • Xiaojie Jin, Chunyan Xu, Jiashi Feng, Yunchao Wei, Junjun Xiong, Shuicheng Yan
Rectified linear activation units are important components for state-of-the-art deep convolutional networks.