Search Results for author: Yifei Huang

Found 40 papers, 20 papers with code

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

1 code implementation 24 Mar 2024 Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, Yu Qiao

Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

2 code implementations 22 Mar 2024 Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Ranked #1 on Audio Classification on ESC-50 (using extra training data)

Action Classification · Action Recognition +12

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

1 code implementation 14 Mar 2024 Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang

We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.

Moment Retrieval · Temporal Action Localization +1

FineBio: A Fine-Grained Video Dataset of Biological Experiments with Hierarchical Annotation

no code implementations 1 Feb 2024 Takuma Yagi, Misaki Ohashi, Yifei Huang, Ryosuke Furuta, Shungo Adachi, Toutai Mitsuyama, Yoichi Sato

The dataset consists of multi-view videos of 32 participants performing mock biological experiments with a total duration of 14.5 hours.

Object · object-detection +1

Retrieval-Augmented Egocentric Video Captioning

no code implementations 1 Jan 2024 Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

In this paper, we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos.

Representation Learning · Retrieval +1

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding

no code implementations 8 Dec 2023 Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, Yu Qiao

While several long-form VideoQA datasets have been introduced, the lengths of both the videos used to curate questions and the sub-clips of clues leveraged to answer those questions have not yet reached the criteria for genuine long-form video understanding.

Question Answering · Video Question Answering +1

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

no code implementations 30 Nov 2023 Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.

Video Understanding

Pretraining Language Models with Text-Attributed Heterogeneous Graphs

1 code implementation 19 Oct 2023 Tao Zou, Le Yu, Yifei Huang, Leilei Sun, Bowen Du

In many real-world scenarios (e.g., academic networks, social platforms), different types of entities are not only associated with texts but also connected by various relationships, which can be abstracted as Text-Attributed Heterogeneous Graphs (TAHGs).

Link Prediction · Node Classification +1

Proposal-based Temporal Action Localization with Point-level Supervision

no code implementations 9 Oct 2023 Yuan Yin, Yifei Huang, Ryosuke Furuta, Yoichi Sato

Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data.

Action Classification · Multiple Instance Learning +1

Memory-and-Anticipation Transformer for Online Action Understanding

1 code implementation ICCV 2023 Jiahao Wang, Guo Chen, Yifei Huang, Limin Wang, Tong Lu

Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.

Action Understanding · Online Action Detection

VideoLLM: Modeling Video Sequence with Large Language Models

1 code implementation 22 May 2023 Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang

Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.

Video Understanding

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction

no code implementations CVPR 2023 Mingfang Zhang, Jinglu Wang, Xiao Li, Yifei Huang, Yoichi Sato, Yan Lu

The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs.

3D Reconstruction

Fine-grained Affordance Annotation for Egocentric Hand-Object Interaction Videos

1 code implementation 7 Feb 2023 Zecheng Yu, Yifei Huang, Ryosuke Furuta, Takuma Yagi, Yusuke Goutsu, Yoichi Sato

Object affordance is an important concept in hand-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning.

Action Anticipation · Action Recognition +3

Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training

no code implementations CVPR 2023 Yifei Huang, Lijin Yang, Yoichi Sato

The task of weakly supervised temporal sentence grounding aims at finding the temporal moments in a video that correspond to a language description, given video-language correspondence only at the video level.

Data Augmentation · Sentence +2

Compound Prototype Matching for Few-shot Action Recognition

no code implementations 12 Jul 2022 Yifei Huang, Lijin Yang, Yoichi Sato

Each global prototype is encouraged to summarize a specific aspect from the entire video, for example, the start/evolution of the action.

Few-Shot action recognition · Few Shot Action Recognition +1

Precise Affordance Annotation for Egocentric Action Video Datasets

no code implementations 11 Jun 2022 Zecheng Yu, Yifei Huang, Ryosuke Furuta, Takuma Yagi, Yusuke Goutsu, Yoichi Sato

Object affordance is an important concept in human-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning.

Action Anticipation · Affordance Recognition +2

CLRNet: Cross Layer Refinement Network for Lane Detection

3 code implementations CVPR 2022 Tu Zheng, Yifei Huang, Yang Liu, Wenjian Tang, Zheng Yang, Deng Cai, Xiaofei He

In this way, we can exploit more contextual information to detect lanes while leveraging local detailed lane features to improve localization accuracy.

Lane Detection

Leveraging Human Selective Attention for Medical Image Analysis with Limited Training Data

no code implementations 2 Dec 2021 Yifei Huang, Xiaoxiao Li, Lijin Yang, Lin Gu, Yingying Zhu, Hirofumi Seo, Qiuming Meng, Tatsuya Harada, Yoichi Sato

Then we design a novel Auxiliary Attention Block (AAB) to allow information from SAN to be utilized by the backbone encoder to focus on selective areas.

Tumor Segmentation

Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips

no code implementations 2 Dec 2021 Lijin Yang, Yifei Huang, Yusuke Sugano, Yoichi Sato

Previous works attempted to address this problem by applying temporal attention, but failed to consider the global context of the full video, which is critical for determining the relatively significant parts.

Action Recognition · Video Understanding

Ego4D: Around the World in 3,000 Hours of Egocentric Video

5 code implementations CVPR 2022 Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.

De-identification · Ethics

Spatio-Temporal Perturbations for Video Attribution

1 code implementation 1 Sep 2021 Zhenqiang Li, Weimin Wang, Zuoyue Li, Yifei Huang, Yoichi Sato

The attribution method provides a direction for interpreting opaque neural networks in a visual way by identifying and visualizing the input regions/pixels that dominate the output of a network.

Video Understanding

FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning

1 code implementation ICCV 2021 Chenxu Zhang, Yifan Zhao, Yifei Huang, Ming Zeng, Saifeng Ni, Madhukar Budagavi, Xiaohu Guo

In this paper, we propose a talking face generation method that takes an audio signal as input and a short target video clip as reference, and synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in-sync with the input audio signal.

3D Face Animation · Attribute +2

EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2021: Team M3EM Technical Report

no code implementations 18 Jun 2021 Lijin Yang, Yifei Huang, Yusuke Sugano, Yoichi Sato

In this report, we describe the technical details of our submission to the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition.

Action Recognition · Unsupervised Domain Adaptation

Goal-Oriented Gaze Estimation for Zero-Shot Learning

1 code implementation CVPR 2021 Yang Liu, Lei Zhou, Xiao Bai, Yifei Huang, Lin Gu, Jun Zhou, Tatsuya Harada

Therefore, we introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization based on the class-level attributes for ZSL.

Attribute · Gaze Estimation +1

Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling

no code implementations 5 Feb 2021 Hong Chen, Yifei Huang, Hiroya Takamura, Hideki Nakayama

To enrich the candidate concepts, a commonsense knowledge graph is created for each image sequence from which the concept candidates are proposed.

Informativeness · Visual Storytelling

Adversarial Robustness of Stabilized NeuralODEs Might be from Obfuscated Gradients

1 code implementation 28 Sep 2020 Yifei Huang, Yaodong Yu, Hongyang Zhang, Yi Ma, Yuan Yao

Even replacing only the first layer of a ResNet by such an ODE block can yield a further improvement in robustness: e.g., under a PGD-20 ($\ell_\infty=0.031$) attack on the CIFAR-10 dataset, it achieves 91.57% natural accuracy and 62.35% robust accuracy, while a counterpart ResNet architecture trained with TRADES achieves 76.29% and 45.24%, respectively.

Adversarial Defense · Adversarial Robustness

Improving Action Segmentation via Graph-Based Temporal Reasoning

no code implementations CVPR 2020 Yifei Huang, Yusuke Sugano, Yoichi Sato

In this paper, we propose a network module called Graph-based Temporal Reasoning Module (GTRM) that can be built on top of existing action segmentation models to learn the relation of multiple action segments in various time spans.

Action Segmentation · Relation +1

Towards Visually Explaining Video Understanding Networks with Perturbation

2 code implementations 1 May 2020 Zhenqiang Li, Weimin Wang, Zuoyue Li, Yifei Huang, Yoichi Sato

"Making black box models explainable" is a vital problem that accompanies the development of deep learning networks.

Video Understanding

Discovery of Bias and Strategic Behavior in Crowdsourced Performance Assessment

no code implementations 5 Aug 2019 Yifei Huang, Matt Shum, Xi Wu, Jason Zezhong Xiao

With the industry trend of shifting from a traditional hierarchical approach to a flatter management structure, crowdsourced performance assessment has gained mainstream popularity.

Fairness · Management

On Breiman's Dilemma in Neural Networks: Success and Failure of Normalized Margins

no code implementations ICLR 2019 Yifei Huang, Yuan Yao, Weizhi Zhu

A long-standing belief in machine learning holds that enlarging margins over the training data accounts for models' resistance to overfitting by increasing their robustness.

Generalization Bounds

An Evaluation of Transfer Learning for Classifying Sales Engagement Emails at Large Scale

no code implementations 19 Apr 2019 Yong Liu, Pavel Dmitriev, Yifei Huang, Andrew Brooks, Li Dong

Our results show that fine-tuning the BERT model outperforms all the feature-based approaches using different embeddings when as few as 300 labeled samples are available, but underperforms them with fewer than 300 labeled samples.

Language Modelling · Transfer Learning

Manipulation-skill Assessment from Videos with Spatial Attention Network

no code implementations 9 Jan 2019 Zhenqiang Li, Yifei Huang, Minjie Cai, Yoichi Sato

Recent advances in computer vision have made it possible to automatically assess from videos the manipulation skills of humans in performing a task, which breeds many important applications in domains such as health rehabilitation and manufacturing.

Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions

no code implementations 7 Jan 2019 Yifei Huang, Zhenqiang Li, Minjie Cai, Yoichi Sato

In this work, we address two coupled tasks of gaze prediction and action recognition in egocentric videos by exploring their mutual context.

Action Recognition · Gaze Prediction +1

Differentiable Fine-grained Quantization for Deep Neural Network Compression

1 code implementation NIPS Workshop CDNNRIA 2018 Hsin-Pai Cheng, Yuanjun Huang, Xuyang Guo, Yifei Huang, Feng Yan, Hai Li, Yiran Chen

Thus judiciously selecting different precision for different layers/structures can potentially produce more efficient models compared to traditional quantization methods by striking a better balance between accuracy and compression rate.

Neural Network Compression · Quantization

Semantic Aware Attention Based Deep Object Co-segmentation

3 code implementations 16 Oct 2018 Hong Chen, Yifei Huang, Hideki Nakayama

Object co-segmentation is the task of segmenting the same objects from multiple images.

Object Segmentation

Rethinking Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics

1 code implementation 8 Oct 2018 Weizhi Zhu, Yifei Huang, Yuan Yao

In this paper, we revisit Breiman's dilemma in deep neural networks with recently proposed spectrally normalized margins, from a novel perspective based on phase transitions of normalized margin distributions in training dynamics.

Generalization Bounds

Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition

2 code implementations ECCV 2018 Yifei Huang, Minjie Cai, Zhenqiang Li, Yoichi Sato

We present a new computational model for gaze prediction in egocentric videos by exploring patterns in temporal shift of gaze fixations (attention transition) that are dependent on egocentric manipulation tasks.

Gaze Prediction · Saliency Prediction
