1 code implementation • ECCV 2020 • Guangming Wu, Yinqiang Zheng, Zhiling Guo, Zekun Cai, Xiaodan Shi, Xin Ding, Yifei HUANG, Yimin Guo, Ryosuke Shibasaki
In silicon sensors, the interference between visible and near-infrared (NIR) signals is a crucial problem.
1 code implementation • 30 Dec 2024 • Yifei HUANG, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Lijin Yang, Xinyuan Chen, Yaohui Wang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Yali Wang, Yu Qiao, LiMin Wang
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model.
no code implementations • 16 Dec 2024 • Guo Chen, Yicheng Liu, Yifei HUANG, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, LiMin Wang
However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the correct answer purely by combining short video understanding with elimination, without genuinely understanding the video content.
no code implementations • 10 Oct 2024 • Jianxin Bi, Kelvin Lim, Kaiqi Chen, Yifei HUANG, Harold Soh
Recent advances in diffusion-based robot policies have demonstrated significant potential in imitating multi-modal behaviors.
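For readers unfamiliar with diffusion policies, the sketch below shows the generic reverse-diffusion action-sampling loop such methods build on; `eps_model`, the noise schedule, and the step count are hypothetical placeholders, not this paper's implementation.

```python
import torch

@torch.no_grad()
def sample_action(eps_model, obs, action_dim, n_steps=50):
    """Draw one action from a diffusion policy via DDPM-style ancestral
    sampling. eps_model(a_t, t, obs) is assumed to predict the noise
    injected at step t (hypothetical interface)."""
    betas = torch.linspace(1e-4, 0.02, n_steps)   # illustrative schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, action_dim)                # start from pure noise
    for t in reversed(range(n_steps)):
        eps = eps_model(a, torch.tensor([t]), obs)
        mean = (a - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[t]) * noise   # ancestral sampling step
    return a
```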
no code implementations • 15 Sep 2024 • Nie Lin, Takehiko Ohkawa, Mingfang Zhang, Yifei HUANG, Ryosuke Furuta, Yoichi Sato
Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image via data augmentation.
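For context, contrastive pretraining of this kind typically builds on an InfoNCE-style objective; below is a minimal generic sketch (not the paper's exact loss), where rows i of `z1` and `z2` form positive pairs.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Generic InfoNCE loss: row i of z1 and row i of z2 form a positive
    pair; all other rows in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```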
1 code implementation • 10 Jul 2024 • Liangyang Ouyang, Ruicong Liu, Yifei HUANG, Ryosuke Furuta, Yoichi Sato
Experimental results on the VISOR dataset reveal that ActionVOS significantly reduces the mis-segmentation of inactive objects, confirming that actions help the ActionVOS model understand objects' involvement.
no code implementations • 9 Jul 2024 • Mingfang Zhang, Yifei HUANG, Ruicong Liu, Yoichi Sato
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion.
1 code implementation • 26 Jun 2024 • Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei HUANG, Yali Wang, Tong Lu, LiMin Wang, Yu Qiao
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Ranked #1 on Long Term Action Anticipation on Ego4D (using extra training data)
no code implementations • 18 Apr 2024 • Tianyi Liang, Jiangqi Liu, Yifei HUANG, Shiqi Jiang, Sicheng Song, Jianshen Shi, Changbo Wang, Chenhui Li
These results demonstrate the efficacy of TextCenGen in creating more harmonious and integrated text-image compositions.
1 code implementation • CVPR 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao
Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, forming a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
Ranked #1 on Action Anticipation on EgoExoLearn (using extra training data)
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on Action Classification on MIT
1 code implementation • 14 Mar 2024 • Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Ranked #2 on Temporal Action Localization on FineAction
no code implementations • 1 Feb 2024 • Takuma Yagi, Misaki Ohashi, Yifei HUANG, Ryosuke Furuta, Shungo Adachi, Toutai Mitsuyama, Yoichi Sato
The dataset consists of multi-view videos of 32 participants performing mock biological experiments, with a total duration of 14.5 hours.
no code implementations • CVPR 2024 • Jilan Xu, Yifei HUANG, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
In this paper, we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos.
no code implementations • 8 Dec 2023 • Hongjie Zhang, Yi Liu, Lu Dong, Yifei HUANG, Zhen-Hua Ling, Yali Wang, LiMin Wang, Yu Qiao
While several long-form VideoQA datasets have been introduced, the lengths of both the videos used to curate questions and the sub-clips of clues leveraged to answer them have not yet met the criteria for genuine long-form video understanding.
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
1 code implementation • 19 Oct 2023 • Tao Zou, Le Yu, Yifei HUANG, Leilei Sun, Bowen Du
In many real-world scenarios (e.g., academic networks, social platforms), different types of entities are not only associated with texts but also connected by various relationships, which can be abstracted as Text-Attributed Heterogeneous Graphs (TAHGs).
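As a concrete picture of a TAHG, here is a toy academic network expressed with PyTorch Geometric's `HeteroData`; the node counts, feature dimensions, and relations are made up for illustration.

```python
import torch
from torch_geometric.data import HeteroData

# Toy academic TAHG: papers and authors carry text embeddings and are
# connected by typed relations (all sizes here are illustrative).
data = HeteroData()
data['paper'].x = torch.randn(4, 768)    # e.g. embeddings of abstracts
data['author'].x = torch.randn(3, 768)   # e.g. embeddings of author bios
data['author', 'writes', 'paper'].edge_index = torch.tensor(
    [[0, 1, 2, 2],    # author indices
     [0, 1, 2, 3]])   # paper indices
data['paper', 'cites', 'paper'].edge_index = torch.tensor(
    [[0, 1],
     [2, 3]])
```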
no code implementations • 9 Oct 2023 • Yuan Yin, Yifei HUANG, Ryosuke Furuta, Yoichi Sato
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data.
1 code implementation • ICCV 2023 • Jiahao Wang, Guo Chen, Yifei HUANG, LiMin Wang, Tong Lu
Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.
Ranked #1 on Action Detection on THUMOS'14
1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
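A common way to realize this general recipe, sketched below with made-up dimensions, is to project per-frame visual features into the LLM's token-embedding space and treat them as a pseudo-token sequence; this is an illustration, not the VideoLLM code.

```python
import torch
import torch.nn as nn

class VideoToLLMTokens(nn.Module):
    """Map frame features to pseudo-token embeddings that a frozen LLM can
    consume alongside text embeddings (dimensions are illustrative)."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats):       # (batch, n_frames, vis_dim)
        return self.proj(frame_feats)     # (batch, n_frames, llm_dim)

# Usage: concatenate the projected video tokens with text embeddings and
# run the frozen LLM over the joint sequence.
```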
no code implementations • CVPR 2023 • Mingfang Zhang, Jinglu Wang, Xiao Li, Yifei HUANG, Yoichi Sato, Yan Lu
The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs.
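The rendering step of an MPI is the classic back-to-front alpha compositing with the "over" operator; a minimal sketch (not this paper's renderer):

```python
import torch

def composite_mpi(rgba_layers):
    """Alpha-composite MPI layers with the 'over' operator.

    rgba_layers: tensor of shape (D, 4, H, W), ordered back to front.
    Returns the rendered (3, H, W) image.
    """
    out = torch.zeros(3, *rgba_layers.shape[-2:])
    for layer in rgba_layers:                    # back to front
        rgb, alpha = layer[:3], layer[3:4]
        out = rgb * alpha + out * (1.0 - alpha)  # 'over' compositing
    return out
```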
1 code implementation • 7 Feb 2023 • Zecheng Yu, Yifei HUANG, Ryosuke Furuta, Takuma Yagi, Yusuke Goutsu, Yoichi Sato
Object affordance is an important concept in hand-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning.
no code implementations • CVPR 2023 • Yifei HUANG, Lijin Yang, Yoichi Sato
The task of weakly supervised temporal sentence grounding aims to localize the temporal moments in a video that correspond to a language description, given video-language correspondence only at the video level.
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
no code implementations • 12 Jul 2022 • Yifei HUANG, Lijin Yang, Yoichi Sato
Each global prototype is encouraged to summarize a specific aspect of the entire video, for example, the start or evolution of the action.
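One generic way to realize such global prototypes is attention pooling with learnable queries; the sketch below shows that generic mechanism only (the class name and sizes are hypothetical), not the paper's architecture.

```python
import torch
import torch.nn as nn

class GlobalPrototypes(nn.Module):
    """K learnable prototypes, each attending over all frame features to
    summarize one aspect of the whole video (illustrative sketch)."""
    def __init__(self, n_prototypes=8, dim=512):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))

    def forward(self, frames):            # frames: (n_frames, dim)
        attn = torch.softmax(
            self.prototypes @ frames.t() / frames.size(1) ** 0.5, dim=1)
        return attn @ frames              # (n_prototypes, dim) summaries
```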
no code implementations • 11 Jun 2022 • Zecheng Yu, Yifei HUANG, Ryosuke Furuta, Takuma Yagi, Yusuke Goutsu, Yoichi Sato
Object affordance is an important concept in human-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning.
4 code implementations • CVPR 2022 • Tu Zheng, Yifei HUANG, Yang Liu, Wenjian Tang, Zheng Yang, Deng Cai, Xiaofei He
In this way, we can exploit more contextual information to detect lanes while leveraging local detailed lane features to improve localization accuracy.
Ranked #1 on Lane Detection on LLAMAS
no code implementations • CVPR 2022 • Lijin Yang, Yifei HUANG, Yusuke Sugano, Yoichi Sato
Different from previous works, we find that the cross-domain alignment can be more effectively done by using cross-modal interaction first.
no code implementations • 2 Dec 2021 • Lijin Yang, Yifei HUANG, Yusuke Sugano, Yoichi Sato
Previous works attempted to address this problem by applying temporal attention, but they failed to consider the global context of the full video, which is critical for determining the relatively significant parts.
no code implementations • 2 Dec 2021 • Yifei HUANG, Xiaoxiao Li, Lijin Yang, Lin Gu, Yingying Zhu, Hirofumi Seo, Qiuming Meng, Tatsuya Harada, Yoichi Sato
Then we design a novel Auxiliary Attention Block (AAB) so that information from the SAN can be utilized by the backbone encoder to focus on selected areas.
8 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
1 code implementation • 1 Sep 2021 • Zhenqiang Li, Weimin WANG, Zuoyue Li, Yifei HUANG, Yoichi Sato
Attribution methods offer a way to interpret opaque neural networks visually by identifying and visualizing the input regions/pixels that dominate the network's output.
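As background, the simplest member of this family is the vanilla gradient saliency map; the paper proposes a more refined attribution, but a minimal baseline sketch looks like this:

```python
import torch

def gradient_saliency(model, x, target_class):
    """Vanilla gradient attribution: how strongly each input pixel
    influences the target-class score. x: (1, C, H, W) image tensor."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    return x.grad.abs().max(dim=1)[0].squeeze(0)  # (H, W) saliency map
```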
1 code implementation • ICCV 2021 • Chenxu Zhang, Yifan Zhao, Yifei HUANG, Ming Zeng, Saifeng Ni, Madhukar Budagavi, Xiaohu Guo
In this paper, we propose a talking face generation method that takes an audio signal as input and a short target video clip as reference, and synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in-sync with the input audio signal.
no code implementations • 18 Jun 2021 • Lijin Yang, Yifei HUANG, Yusuke Sugano, Yoichi Sato
In this report, we describe the technical details of our submission to the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition.
1 code implementation • CVPR 2021 • Yang Liu, Lei Zhou, Xiao Bai, Yifei HUANG, Lin Gu, Jun Zhou, Tatsuya Harada
Therefore, we introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization based on the class-level attributes for ZSL.
no code implementations • 5 Feb 2021 • Hong Chen, Yifei HUANG, Hiroya Takamura, Hideki Nakayama
To enrich the candidate concepts, a commonsense knowledge graph is created for each image sequence from which the concept candidates are proposed.
Ranked #19 on Visual Storytelling on VIST
1 code implementation • 28 Sep 2020 • Yifei Huang, Yaodong Yu, Hongyang Zhang, Yi Ma, Yuan YAO
Even replacing only the first layer of a ResNet with such an ODE block can further improve robustness: e.g., under a PGD-20 ($\ell_\infty=0.031$) attack on the CIFAR-10 dataset, it achieves 91.57% natural accuracy and 62.35% robust accuracy, while a counterpart ResNet architecture trained with TRADES achieves 76.29% natural and 45.24% robust accuracy, respectively.
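A minimal sketch of the underlying idea of swapping a residual block for an ODE block, here with a fixed-step Euler solver (the step count and dynamics are illustrative, not the paper's configuration):

```python
import torch.nn as nn

class ODEBlock(nn.Module):
    """Integrate dh/dt = f(h) with fixed-step Euler, generalizing the
    residual update h + f(h) to many small steps (illustrative)."""
    def __init__(self, f, n_steps=10):
        super().__init__()
        self.f, self.n_steps = f, n_steps

    def forward(self, h):
        dt = 1.0 / self.n_steps
        for _ in range(self.n_steps):
            h = h + dt * self.f(h)        # one Euler step
        return h

# Example: wrap a small conv net as the dynamics f and use the block in
# place of a ResNet stage.
# block = ODEBlock(nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()))
```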
no code implementations • CVPR 2020 • Yifei Huang, Yusuke Sugano, Yoichi Sato
In this paper, we propose a network module called Graph-based Temporal Reasoning Module (GTRM) that can be built on top of existing action segmentation models to learn the relation of multiple action segments in various time spans.
Ranked #29 on Action Segmentation on Breakfast
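On the GTRM entry above: the generic building block for relating action segments is a graph convolution over segment nodes; the layer below shows that generic operation under an assumed adjacency matrix, not the paper's exact module.

```python
import torch
import torch.nn as nn

class SegmentGCNLayer(nn.Module):
    """One graph-convolution step over action-segment nodes: each segment
    feature is updated from its adjacent segments (illustrative)."""
    def __init__(self, dim=256):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, feats, adj):        # feats: (N, dim), adj: (N, N)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = adj @ feats / deg           # mean over neighboring segments
        return torch.relu(self.lin(agg))
```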
2 code implementations • 1 May 2020 • Zhenqiang Li, Weimin WANG, Zuoyue Li, Yifei HUANG, Yoichi Sato
"Making black box models explainable" is a vital problem that accompanies the development of deep learning networks.
no code implementations • 5 Aug 2019 • Yifei Huang, Matt Shum, Xi Wu, Jason Zezhong Xiao
With the industry trend of shifting from traditional hierarchical structures to flatter management, crowdsourced performance assessment has gained mainstream popularity.
no code implementations • ICLR 2019 • Yifei HUANG, Yuan YAO, Weizhi Zhu
A long-standing belief in machine learning holds that enlarging margins over the training data accounts for models' resistance to overfitting by increasing their robustness.
no code implementations • 19 Apr 2019 • Yong Liu, Pavel Dmitriev, Yifei HUANG, Andrew Brooks, Li Dong
Our results show that fine-tuning the BERT model outperforms all the feature-based approaches using different embeddings with as few as 300 labeled samples, but underperforms them when fewer than 300 labeled samples are available.
no code implementations • 9 Jan 2019 • Zhenqiang Li, Yifei Huang, Minjie Cai, Yoichi Sato
Recent advances in computer vision have made it possible to automatically assess from videos the manipulation skills of humans performing a task, which enables many important applications in domains such as health rehabilitation and manufacturing.
no code implementations • 7 Jan 2019 • Yifei Huang, Zhenqiang Li, Minjie Cai, Yoichi Sato
In this work, we address two coupled tasks of gaze prediction and action recognition in egocentric videos by exploring their mutual context.
1 code implementation • NIPS Workshop CDNNRIA 2018 • Hsin-Pai Cheng, Yuanjun Huang, Xuyang Guo, Yifei HUANG, Feng Yan, Hai Li, Yiran Chen
Thus, judiciously selecting different precisions for different layers/structures can potentially produce more efficient models than traditional quantization methods by striking a better balance between accuracy and compression rate.
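A toy per-layer uniform quantizer makes the trade-off concrete; the bit-width assignment below is hypothetical, not the paper's policy.

```python
import torch

def quantize_uniform(w, n_bits):
    """Symmetric uniform quantization of a weight tensor to n_bits."""
    levels = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / levels
    return torch.round(w / scale) * scale

# Mixed precision: keep more bits for sensitive layers, fewer for tolerant
# ones (the per-layer choice here is purely illustrative).
# for name, p in model.named_parameters():
#     p.data = quantize_uniform(p.data, n_bits=8 if 'layer1' in name else 4)
```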
3 code implementations • 16 Oct 2018 • Hong Chen, Yifei HUANG, Hideki Nakayama
Object co-segmentation is the task of segmenting the same objects from multiple images.
1 code implementation • 8 Oct 2018 • Weizhi Zhu, Yifei HUANG, Yuan YAO
In this paper, we revisit Breiman's dilemma in deep neural networks with recently proposed spectrally normalized margins, from a novel perspective based on phase transitions of normalized margin distributions in training dynamics.
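For reference, the spectrally normalized margin this line of work builds on (following Bartlett et al., 2017; shown here up to the exact normalizer used in the paper) is:

```latex
% Normalized margin of sample (x_i, y_i) for an L-layer network f with
% weight matrices W_1, ..., W_L and spectral norms \|W_l\|_\sigma:
\bar{\gamma}_i \;=\; \frac{f(x_i)_{y_i} \;-\; \max_{j \neq y_i} f(x_i)_j}
                          {\prod_{l=1}^{L} \|W_l\|_\sigma}
```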
2 code implementations • ECCV 2018 • Yifei Huang, Minjie Cai, Zhenqiang Li, Yoichi Sato
We present a new computational model for gaze prediction in egocentric videos by exploring patterns in temporal shift of gaze fixations (attention transition) that are dependent on egocentric manipulation tasks.