Search Results for author: Jianlong Fu

Found 78 papers, 46 papers with code

AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

no code implementations 30 May 2023 Chuhao Jin, Wenhui Tan, Jiange Yang, Bei Liu, Ruihua Song, LiMin Wang, Jianlong Fu

We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face using building blocks.

Robot Manipulation

Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution

no code implementations 24 May 2023 Yiyang Ma, Huan Yang, Wenhan Yang, Jianlong Fu, Jiaying Liu

Diffusion models, as a kind of powerful generative model, have given impressive results on image super-resolution (SR) tasks.

Efficient Exploration Image Super-Resolution

Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution

no code implementations 17 Mar 2023 Zixi Tuo, Huan Yang, Jianlong Fu, Yujie Dun, Xueming Qian

In particular, we propose a multi-scale Vector-Quantized Degradation model for animation video Super-Resolution (VQD-SR) to decompose the local details from global structures and transfer the degradation priors in real-world animation videos to a learned vector-quantized codebook for degradation modeling.

Video Super-Resolution
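The codebook lookup at the heart of a vector-quantized degradation model can be illustrated with a minimal sketch (plain NumPy; the array shapes and the `vector_quantize` helper are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (N, D) array of degradation features.
    codebook: (K, D) array of learned code vectors.
    Returns (quantized (N, D), indices (N,)).
    """
    # Pairwise squared distances between every feature and every code.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

# Toy example: 4 degradation features quantized against a 3-entry codebook.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
codes = rng.normal(size=(3, 8))
quant, idx = vector_quantize(feats, codes)
```

In a trained model the codebook entries would be learned jointly with the network; here they are random stand-ins.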

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

1 code implementation CVPR 2023 Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo

To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders.

Denoising FAD +1

Weakly-supervised Pre-training for 3D Human Pose Estimation via Perspective Knowledge

no code implementations 22 Nov 2022 Zhongwei Qiu, Kai Qiu, Jianlong Fu, Dongmei Fu

Based on MCPC, we propose a weakly-supervised pre-training (WSP) strategy to distinguish the depth relationship between two points in an image.

3D Human Pose Estimation 3D Pose Estimation

Fine-Grained Image Style Transfer with Visual Transformers

1 code implementation 11 Oct 2022 Jianbo Wang, Huan Yang, Jianlong Fu, Toshihiko Yamasaki, Baining Guo

Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns into style transfer results.

Style Transfer

AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation

1 code implementation 7 Sep 2022 Yiyang Ma, Huan Yang, Bei Liu, Jianlong Fu, Jiaying Liu

To address this issue, we propose a Prompt-based Cross-Modal Generation Framework (PCM-Frame) to leverage two powerful pre-trained models, including CLIP and StyleGAN.

Image Generation

4D LUT: Learnable Context-Aware 4D Lookup Table for Image Enhancement

no code implementations 5 Sep 2022 Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

In particular, we first introduce a lightweight context encoder and a parameter encoder to learn a context map for the pixel-level category and a group of image-adaptive coefficients, respectively.

Image Enhancement

Language-Guided Face Animation by Recurrent StyleGAN-based Generator

1 code implementation 11 Aug 2022 Tiankai Hang, Huan Yang, Bei Liu, Jianlong Fu, Xin Geng, Baining Guo

Specifically, we propose a recurrent motion generator to extract a series of semantic and motion information from the language and feed it along with visual information to a pre-trained StyleGAN to generate high-quality frames.

Image Manipulation

Exploring Anchor-based Detection for Ego4D Natural Language Query

no code implementations 10 Aug 2022 Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu

In this paper, we present a technical report on the Ego4D natural language query challenge at CVPR 2022.

Video Understanding

GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training

1 code implementation 8 Aug 2022 Jaeseok Byun, Taebaek Hwang, Jianlong Fu, Taesup Moon

In contrast to the mainstream VLP methods, we highlight that two routinely applied steps during pre-training have crucial impact on the performance of the pre-trained model: in-batch hard negative sampling for image-text matching (ITM) and assigning the large masking probability for the masked language modeling (MLM).

Language Modelling Masked Language Modeling +1
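The in-batch hard negative sampling step described above can be sketched in a few lines (a hedged NumPy illustration; the `hardest_negatives` helper and the similarity values are assumptions for demonstration, not GRIT-VLP's actual code):

```python
import numpy as np

def hardest_negatives(sim):
    """For each image i, pick the non-matching text with the highest similarity.

    sim: (B, B) in-batch image-text similarity matrix; sim[i, i] is the
    matching (positive) pair. Returns (B,) indices of the hardest
    negative text for each image.
    """
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude the positive pair
    return masked.argmax(axis=1)

# Toy 3-pair batch: each row is one image's similarity to all three texts.
sim = np.array([[0.9, 0.8, 0.1],
                [0.2, 0.7, 0.6],
                [0.3, 0.5, 0.8]])
negs = hardest_negatives(sim)  # → array([1, 2, 1])
```

The selected hard negatives would then feed the image-text matching (ITM) loss as difficult non-matching pairs.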

Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution

1 code implementation 5 Aug 2022 Zhongwei Qiu, Huan Yang, Jianlong Fu, Dongmei Fu

First, we divide a video frame into patches, and transform each patch into DCT spectral maps in which each channel represents a frequency band.

Video Enhancement Video Super-Resolution
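The patch-to-DCT-spectrum step described above can be sketched as follows (a minimal NumPy/SciPy illustration under assumed shapes; the `patch_dct_maps` helper and the 8×8 patch size are illustrative, not the paper's exact pipeline):

```python
import numpy as np
from scipy.fft import dctn

def patch_dct_maps(frame, patch=8):
    """Split a frame into patches and take a 2-D DCT of each patch, so each
    of the patch*patch channels holds one frequency band.

    frame: (H, W) grayscale frame with H and W divisible by `patch`.
    Returns an array of shape (H//patch, W//patch, patch*patch).
    """
    H, W = frame.shape
    gh, gw = H // patch, W // patch
    # Rearrange into a grid of non-overlapping patches: (gh, gw, patch, patch).
    blocks = frame.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    # 2-D DCT over the last two axes of every patch.
    spectra = dctn(blocks, axes=(2, 3), norm="ortho")
    return spectra.reshape(gh, gw, patch * patch)

# A constant frame puts all energy into the DC band (channel 0).
maps = patch_dct_maps(np.ones((32, 32)))
```

Channel 0 is the DC (lowest-frequency) band; higher channel indices correspond to higher-frequency components of each patch.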

TinyViT: Fast Pretraining Distillation for Small Vision Transformers

1 code implementation 21 Jul 2022 Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan

It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters.

Image Classification Knowledge Distillation

TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation

no code implementations 19 Jul 2022 Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

In particular, we formulate the warped features with inconsistent motions as query tokens, and formulate relevant regions in a motion trajectory from two original consecutive frames into keys and values.

Video Frame Interpolation
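The query/key/value formulation above is an instance of cross-attention, which can be sketched generically (plain NumPy; token counts, dimensions, and the `cross_attention` helper are assumptions for illustration, not TTVFI's implementation):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: query tokens (e.g., warped features)
    attend to key/value tokens (e.g., regions along a motion trajectory).

    queries: (Nq, D); keys, values: (Nk, D). Returns (Nq, D).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ values

# Uniform queries against identity keys/values give uniform attention.
out = cross_attention(np.ones((2, 4)), np.eye(4), np.eye(4))
```

Each output row is a convex combination of the value tokens, weighted by how well the corresponding query matches each key.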

Degradation-Guided Meta-Restoration Network for Blind Super-Resolution

no code implementations 3 Jul 2022 Fuzhi Yang, Huan Yang, Yanhong Zeng, Jianlong Fu, Hongtao Lu

The extractor estimates the degradations in LR inputs and guides the meta-restoration modules to predict restoration parameters for different degradations on-the-fly.

Blind Super-Resolution Image Restoration +1

Learning Trajectory-Aware Transformer for Video Super-Resolution

1 code implementation CVPR 2022 Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

Existing approaches usually align and aggregate video frames from a limited number of adjacent frames (e.g., 5 or 7), which prevents them from achieving satisfactory results.

Video Super-Resolution

Searching the Search Space of Vision Transformer

1 code implementation NeurIPS 2021 Minghao Chen, Kan Wu, Bolin Ni, Houwen Peng, Bei Liu, Jianlong Fu, Hongyang Chao, Haibin Ling

Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and has thus attracted fast-growing efforts in manually designing more effective architectures.

Neural Architecture Search object-detection +4

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

1 code implementation CVPR 2022 Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo

To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned video features with diversified texts.

Retrieval Super-Resolution +3

Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

no code implementations NeurIPS 2021 Yanhong Zeng, Huan Yang, Hongyang Chao, Jianbo Wang, Jianlong Fu

Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning styles to content tokens via an attention mechanism with a Transformer.

Image Generation

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation

1 code implementation 19 Oct 2021 Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu

In this work, we demonstrate such an AI creation system to produce both diverse captions and rich images.

Learning Fine-Grained Motion Embedding for Landscape Animation

no code implementations 6 Sep 2021 Hongwei Xue, Bei Liu, Huan Yang, Jianlong Fu, Houqiang Li, Jiebo Luo

To tackle this problem, we propose a model named FGLA to generate high-quality and realistic videos by learning Fine-Grained motion embedding for Landscape Animation.

Domain-Aware Universal Style Transfer

1 code implementation ICCV 2021 Kibeom Hong, Seogkyu Jeon, Huan Yang, Jianlong Fu, Hyeran Byun

To this end, we design a novel domainness indicator that captures the domainness value from the texture and structural features of reference images.

Style Transfer

Reference-based Defect Detection Network

no code implementations 10 Aug 2021 Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao

To solve the partial visual confusion issue, we propose to leverage the context information carried by a context reference, i.e., the concentric, larger box around each region proposal, to perform more accurate region classification and regression.

Defect Detection object-detection +2

AutoFormer: Searching Transformers for Visual Recognition

2 code implementations ICCV 2021 Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling

Specifically, the performance of these subnets with weights inherited from the supernet is comparable to those retrained from scratch.

AutoML Fine-Grained Image Classification

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

2 code implementations CVPR 2021 Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu

As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.

Representation Learning Retrieval +3

3D Human Body Reshaping with Anthropometric Modeling

1 code implementation 5 Apr 2021 Yanhong Zeng, Jianlong Fu, Hongyang Chao

First, we calculate full-body anthropometric parameters from limited user inputs via an imputation technique, so that the essential anthropometric parameters for 3D body reshaping can be obtained.

feature selection Imputation +1

Aggregated Contextual Transformations for High-Resolution Image Inpainting

2 code implementations 3 Apr 2021 Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo

For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task.

Image Inpainting Texture Synthesis +1

One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking

1 code implementation CVPR 2021 Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling

In this paper, we propose a one-shot neural ensemble architecture search (NEAS) solution that addresses the two challenges.

Neural Architecture Search

Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language

1 code implementation 4 Dec 2020 Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, Jiebo Luo

It is a challenging problem because a target moment may take place in the context of other temporal moments in the untrimmed video.

Learning Semantic-aware Normalization for Generative Adversarial Networks

1 code implementation NeurIPS 2020 Heliang Zheng, Jianlong Fu, Yanhong Zeng, Jiebo Luo, Zheng-Jun Zha

Such a model disentangles latent factors according to the semantics of feature channels by channel-/group-wise fusion of latent codes and feature channels.

Image Inpainting Unconditional Image Generation

Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search

2 code implementations NeurIPS 2020 Houwen Peng, Hao Du, Hongyuan Yu, Qi Li, Jing Liao, Jianlong Fu

The experiments on ImageNet verify that such a path distillation method can improve the convergence ratio and performance of the hypernetwork, as well as boost the training of subnetworks.

Neural Architecture Search object-detection +1

Revisiting Anchor Mechanisms for Temporal Action Localization

1 code implementation 22 Aug 2020 Le Yang, Houwen Peng, Dingwen Zhang, Jianlong Fu, Junwei Han

To address this problem, this paper proposes a novel anchor-free action localization module that assists action localization by temporal points.

Temporal Action Localization

Cyclic Differentiable Architecture Search

3 code implementations 18 Jun 2020 Hongyuan Yu, Houwen Peng, Yan Huang, Jianlong Fu, Hao Du, Liang Wang, Haibin Ling

First, the search network generates an initial architecture for evaluation, and the weights of the evaluation network are optimized.

Neural Architecture Search

Learning Texture Transformer Network for Image Super-Resolution

1 code implementation CVPR 2020 Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, Baining Guo

In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively.

Hard Attention Image Generation +2

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

1 code implementation 2 Apr 2020 Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu

We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision and language tasks.

Language Modelling Question Answering +5

Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language

3 code implementations 8 Dec 2019 Songyang Zhang, Houwen Peng, Jianlong Fu, Jiebo Luo

We address the problem of retrieving a specific moment from an untrimmed video by a query sentence.

Learning Sparse 2D Temporal Adjacent Networks for Temporal Action Localization

2 code implementations 8 Dec 2019 Songyang Zhang, Houwen Peng, Le Yang, Jianlong Fu, Jiebo Luo

In this report, we introduce the winning method for the HACS Temporal Action Localization Challenge 2019.

Temporal Action Localization

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

no code implementations 24 Nov 2019 Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou

A storyboard is a sequence of images that illustrates a story consisting of multiple sentences, and creating one has long been a key step in producing story products.

Learning Deep Bilinear Transformation for Fine-grained Image Representation

1 code implementation NeurIPS 2019 Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, Jiebo Luo

However, the computational cost to learn pairwise interactions between deep feature channels is prohibitively expensive, which restricts this powerful transformation to be used in deep neural networks.

Fine-Grained Image Recognition

Learning Rich Image Region Representation for Visual Question Answering

no code implementations 29 Oct 2019 Bei Liu, Zhicheng Huang, Zhaoyang Zeng, Zheyu Chen, Jianlong Fu

We propose to boost VQA by leveraging more powerful feature extractors, improving the representation ability of both visual and text features, and ensembling models.

Language Modelling Question Answering +1

360-Indoor: Towards Learning Real-World Objects in 360° Indoor Equirectangular Images

no code implementations 3 Oct 2019 Shih-Han Chou, Cheng Sun, Wen-Yen Chang, Wan-Ting Hsu, Min Sun, Jianlong Fu

In this paper, our goal is to provide a standard dataset to facilitate the vision and machine learning communities in the 360° domain.

object-detection Object Detection

WSOD^2: Learning Bottom-up and Top-down Objectness Distillation for Weakly-supervised Object Detection

1 code implementation 11 Sep 2019 Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao, Lei Zhang

We study on weakly-supervised object detection (WSOD) which plays a vital role in relieving human involvement from object-level annotations.

object-detection Region Proposal +2

Learn to Scale: Generating Multipolar Normalized Density Maps for Crowd Counting

1 code implementation ICCV 2019 Chenfeng Xu, Kai Qiu, Jianlong Fu, Song Bai, Yongchao Xu, Xiang Bai

Dense crowd counting aims to predict thousands of human instances from an image, by calculating integrals of a density map over image pixels.

Crowd Counting Density Estimation
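The count-as-integral idea above reduces, for a discrete density map, to a simple sum over pixels. A minimal sketch (the `count_from_density` helper and the toy map are illustrative assumptions):

```python
import numpy as np

def count_from_density(density, cell_area=1.0):
    """Estimate the crowd count as the integral (sum) of a density map.

    density: (H, W) predicted persons-per-pixel map.
    cell_area: area of one pixel cell (1.0 for per-pixel densities).
    """
    return float(density.sum() * cell_area)

# Toy density map with two unit-mass points -> count of 2.0.
density = np.zeros((6, 6))
density[1, 1] = density[4, 4] = 1.0
count = count_from_density(density)  # 2.0
```

In practice each ground-truth head location is smoothed with a Gaussian kernel of unit mass, so the integral still recovers the number of people.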

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

no code implementations 11 Jul 2019 Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann

The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.

Dense Captioning Dense Video Captioning

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

no code implementations 3 Jun 2019 Shizhe Chen, Qin Jin, Jianlong Fu

However, a picture tells a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders the translation model learning.

Machine Translation Translation +1

Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting

2 code implementations CVPR 2019 Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo

As the missing content can be filled by attention transfer from deep to shallow in a pyramid fashion, both visual and semantic coherence for image inpainting can be ensured.

Image Inpainting Vocal Bursts Intensity Prediction

Deep Attention Neural Tensor Network for Visual Question Answering

no code implementations ECCV 2018 Yalong Bai, Jianlong Fu, Tiejun Zhao, Tao Mei

First, we model one of the pairwise interactions (e.g., image and question) by bilinear features, which are further encoded with the third dimension (e.g., answer) to form a triplet by bilinear tensor product.

Deep Attention Question Answering +1
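The triplet scoring described above (bilinear image-question fusion, then projection onto the answer) can be sketched with a tiny tensor contraction (NumPy; the `triplet_bilinear_score` helper, dimensions, and random embeddings are assumptions for illustration, not the paper's network):

```python
import numpy as np

def triplet_bilinear_score(img, qst, ans, W):
    """Score an (image, question, answer) triplet with a bilinear tensor.

    img: (Di,), qst: (Dq,), ans: (Da,), W: (Di, Dq, Da) weight tensor.
    First fuse image and question bilinearly, then project the fused
    vector onto the answer embedding.
    """
    # fused[k] = sum_{i,j} img[i] * W[i,j,k] * qst[j]
    fused = np.einsum('i,ijk,j->k', img, W, qst)
    return float(fused @ ans)

rng = np.random.default_rng(0)
score = triplet_bilinear_score(rng.normal(size=3), rng.normal(size=4),
                               rng.normal(size=5), rng.normal(size=(3, 4, 5)))
```

In a real model the tensor W would typically be low-rank factorized to keep the parameter count tractable.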

DA-GAN: Instance-Level Image Translation by Deep Attention Generative Adversarial Networks

no code implementations CVPR 2018 Shuang Ma, Jianlong Fu, Chang Wen Chen, Tao Mei

Specifically, we jointly learn a deep attention encoder, through which the instance-level correspondences can be discovered by attending to the learned instances.

Data Augmentation Deep Attention +2

Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training

3 code implementations 23 Apr 2018 Bei Liu, Jianlong Fu, Makoto P. Kato, Masatoshi Yoshikawa

Extensive experiments are conducted with 8K images, among which 1.5K images are randomly picked for evaluation.

DA-GAN: Instance-level Image Translation by Deep Attention Generative Adversarial Networks (with Supplementary Materials)

no code implementations CVPR 2018 Shuang Ma, Jianlong Fu, Chang Wen Chen, Tao Mei

Specifically, we jointly learn a deep attention encoder, through which the instance-level correspondences can be discovered by attending to the learned instance pairs.

Data Augmentation Deep Attention +1

Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions

no code implementations EMNLP 2018 Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, Jiebo Luo

Most existing approaches adopt the pipeline of representing an image via pre-trained CNNs, and then using the uninterpretable CNN features in conjunction with the question to predict the answer.

Image Captioning Question Answering +1

Self-view Grounding Given a Narrated 360° Video

1 code implementation 23 Nov 2017 Shih-Han Chou, Yi-Chun Chen, Kuo-Hao Zeng, Hou-Ning Hu, Jianlong Fu, Min Sun

The negative log reconstruction loss of the reverse sentence (referred to as "irrelevant loss") is jointly minimized to encourage the reverse sentence to be different from the given sentence.

Visual Grounding

Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition

3 code implementations ICCV 2017 Heliang Zheng, Jianlong Fu, Tao Mei, Jiebo Luo

Two losses are proposed to guide the multi-task learning of channel grouping and part classification, which encourages MA-CNN to generate more discriminative parts from feature channels and learn better fine-grained features from parts in a mutual reinforced way.

Fine-Grained Image Classification Fine-Grained Image Recognition +2

Multi-Level Attention Networks for Visual Question Answering

no code implementations CVPR 2017 Dongfei Yu, Jianlong Fu, Tao Mei, Yong Rui

To solve the challenges, we propose a multi-level attention network for visual question answering that can simultaneously reduce the semantic gap by semantic attention and benefit fine-grained spatial inference by visual attention.

Question Answering Visual Question Answering

Storytelling of Photo Stream with Bidirectional Multi-thread Recurrent Neural Network

no code implementations 2 Jun 2016 Yu Liu, Jianlong Fu, Tao Mei, Chang Wen Chen

Second, by using sGRU as basic units, the BMRNN is trained to align the local storylines into the global sequential timeline.

Video Captioning Visual Storytelling

Relaxing From Vocabulary: Robust Weakly-Supervised Deep Learning for Vocabulary-Free Image Tagging

no code implementations ICCV 2015 Jianlong Fu, Yue Wu, Tao Mei, Jinqiao Wang, Hanqing Lu, Yong Rui

The development of deep learning has empowered machines with comparable capability of recognizing limited image categories to human beings.
