Search Results for author: ShiZhe Chen

Found 37 papers, 19 papers with code

ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

no code implementations24 Apr 2024 Zerui Chen, ShiZhe Chen, Cordelia Schmid, Ivan Laptev

In this work, we aim to learn a unified vision-based policy for a multi-fingered robot hand to manipulate different objects in diverse poses.

Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

no code implementations23 Apr 2024 Qingrong He, Kejun Lin, ShiZhe Chen, Anwen Hu, Qin Jin

The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules.

Visual Reasoning

SUGAR: Pre-training 3D Visual Representations for Robotics

no code implementations CVPR 2024 ShiZhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometry structures, grasping pose synthesis for object affordance, 3D instance segmentation and referring expression grounding to analyze cluttered scenes.

3D Instance Segmentation 3D Object Recognition +5

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

1 code implementation27 Sep 2023 ShiZhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.

Multi-Task Learning Robot Manipulation

Explore and Tell: Embodied Visual Captioning in 3D Environments

no code implementations ICCV 2023 Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints.

Image Captioning Navigate +1

Object Goal Navigation with Recursive Implicit Maps

no code implementations10 Aug 2023 ShiZhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid

Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments.

Navigate Object

Robust Visual Sim-to-Real Transfer for Robotic Manipulation

no code implementations28 Jul 2023 Ricardo Garcia, Robin Strudel, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.

object-detection Object Detection +1

InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

1 code implementation10 May 2023 Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

Existing metrics only provide a single score to measure caption qualities, which are less explainable and informative.

Benchmarking Image Captioning

TeViS:Translating Text Synopses to Video Storyboards

no code implementations31 Dec 2022 Xu Gu, Yuchong Sun, Feiyue Ni, ShiZhe Chen, Xihua Wang, Ruihua Song, Boyuan Li, Xiang Cao

In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis.

Language Modelling Quantization

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation17 Nov 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Object Relation

Instruction-driven history-aware policies for robotic manipulations

2 code implementations11 Sep 2022 Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)

Robot Manipulation

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation24 Aug 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modelling Navigate +3

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation CVPR 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration Navigate +2

VRDFormer: End-to-End Video Visual Relation Detection With Transformers

no code implementations CVPR 2022 Sipeng Zheng, ShiZhe Chen, Qin Jin

Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts in different stages and also suffers from inefficiency.

Object Relation +3

Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

1 code implementation25 Aug 2021 Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang

Firstly, there are many specialized jargons in the product description, which are ambiguous to translate without the product image.

Machine Translation Translation

Airbert: In-domain Pretraining for Vision-and-Language Navigation

2 code implementations ICCV 2021 Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate Referring Expression +1

Elaborative Rehearsal for Zero-shot Action Recognition

1 code implementation ICCV 2021 ShiZhe Chen, Dong Huang

However, due to the complexity and diversity of actions, it remains challenging to semantically represent action classes and transfer knowledge from seen data.

Action Recognition Few-Shot Learning +4

Question-controlled Text-aware Image Captioning

1 code implementation4 Aug 2021 Anwen Hu, ShiZhe Chen, Qin Jin

To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).

Decoder Image Captioning +1

ICECAP: Information Concentrated Entity-aware Image Captioning

1 code implementation4 Aug 2021 Anwen Hu, ShiZhe Chen, Qin Jin

In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.

Image Captioning Retrieval +1

Sketch, Ground, and Refine: Top-Down Dense Video Captioning

no code implementations CVPR 2021 Chaorui Deng, ShiZhe Chen, Da Chen, Yuan He, Qi Wu

The dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling.

Dense Video Captioning Sentence

Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization

1 code implementation11 Jun 2021 Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin

For the object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post processing method to make the grounding results more faithful.

Caption Generation Object +1

Towards Diverse Paragraph Captioning for Untrimmed Videos

1 code implementation CVPR 2021 Yuqing Song, ShiZhe Chen, Qin Jin

Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.

Descriptive Event Detection

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos

1 code implementation12 Apr 2020 Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin

The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos e. g. makeup instructional videos.

Action Understanding Question Answering +2

Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

1 code implementation CVPR 2020 Shizhe Chen, Qin Jin, Peng Wang, Qi Wu

From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure.

Attribute Caption Generation +1

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

4 code implementations CVPR 2020 Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

Cross-Modal Retrieval Retrieval +3

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

no code implementations24 Nov 2019 Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou

A storyboard is a sequence of images to illustrate a story containing multiple sentences, which has been a key process to create different story products.

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019

no code implementations15 Oct 2019 Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu

This notebook paper presents our model in the VATEX video captioning challenge.

Video Captioning

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

no code implementations11 Jul 2019 Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann

The overall system achieves the state-of-the-art performance on the dense-captioning events in video task with 9. 91 METEOR score on the challenge testing set.

Dense Captioning Dense Video Captioning

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

no code implementations3 Jun 2019 Shizhe Chen, Qin Jin, Jianlong Fu

However, a picture tells a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders the translation model learning.

Machine Translation Sentence +2

Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data

no code implementations2 Jun 2019 Shizhe Chen, Qin Jin, Alexander Hauptmann

The linguistic feature is learned from the sentence contexts with visual semantic constraints, which is beneficial to learn translation for words that are less visual-relevant.

Bilingual Lexicon Induction Sentence +2

RUC+CMU: System Report for Dense Captioning Events in Videos

no code implementations22 Jun 2018 Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann

This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).

Caption Generation Dense Captioning +1

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction

no code implementations4 Sep 2017 Shizhe Chen, Qin Jin

Continuous dimensional emotion prediction is a challenging task where the fusion of various modalities usually achieves state-of-the-art performance such as early fusion or late fusion.

Video Captioning with Guidance of Multimodal Latent Topics

no code implementations31 Aug 2017 Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann

For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.

Caption Generation Decoder +2

Generating Video Descriptions with Topic Guidance

no code implementations31 Aug 2017 Shizhe Chen, Jia Chen, Qin Jin

In addition to predefined topics, i. e., category tags crawled from the web, we also mine topics in a data-driven way based on training captions by an unsupervised topic mining model.

Decoder Image Captioning +1

Cannot find the paper you are looking for? You can Submit a new open access paper.