1 code implementation • 7 Jan 2025 • Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh
Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, which pose challenges for accurate keystep recognition.
no code implementations • 12 Aug 2024 • Utkarsh Nath, Rajeev Goel, Eun Som Jeon, Changhoon Kim, Kyle Min, Yezhou Yang, Yingzhen Yang, Pavan Turaga
To address the data scarcity associated with 3D assets, 2D-lifting techniques such as Score Distillation Sampling (SDS) have become a widely adopted practice in text-to-3D generation pipelines.
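For context, the standard SDS objective distills a pretrained text-conditioned 2D diffusion prior into the parameters of a 3D representation through a differentiable renderer. The sketch below is a minimal, generic illustration of that gradient, not this paper's method; `unet`, `text_embedding`, and `alphas_cumprod` are hypothetical stand-ins for a frozen diffusion model and its noise schedule.

```python
import torch

def sds_loss(rendered_image, text_embedding, unet, alphas_cumprod, t_range=(50, 950)):
    """Minimal Score Distillation Sampling (SDS) sketch.

    `unet`, `text_embedding`, and `alphas_cumprod` stand in for a pretrained
    text-conditioned 2D diffusion model; the gradient w.r.t. the rendered
    image is the weighted difference between predicted and injected noise.
    """
    b = rendered_image.shape[0]
    t = torch.randint(*t_range, (b,), device=rendered_image.device)
    noise = torch.randn_like(rendered_image)
    alpha_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = alpha_bar.sqrt() * rendered_image + (1 - alpha_bar).sqrt() * noise

    with torch.no_grad():                      # the diffusion prior stays frozen
        pred_noise = unet(noisy, t, text_embedding)

    w = 1 - alpha_bar                          # a common weighting choice
    grad = w * (pred_noise - noise)
    # Reparameterize so autograd delivers `grad` to the 3D parameters
    # through the differentiable renderer that produced `rendered_image`.
    return (grad.detach() * rendered_image).sum()
```

The returned scalar is only a vehicle for the gradient: backpropagating it nudges the renderer's output toward images the frozen diffusion model finds likely under the text prompt.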
no code implementations • 28 Jul 2024 • Tz-Ying Wu, Kyle Min, Subarna Tripathi, Nuno Vasconcelos
Video understanding typically requires fine-tuning a large backbone when adapting to new domains.
no code implementations • 13 Jun 2024 • Hector A. Valdez, Kyle Min, Subarna Tripathi
Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks.
no code implementations • 4 Jun 2024 • Hengyue Liu, Kyle Min, Hector A. Valdez, Subarna Tripathi
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning.
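Contrastive video-language pretraining of this kind is usually built around a symmetric InfoNCE objective over matched clip and text embeddings. The snippet below is a generic sketch of that objective under assumed `video_emb` / `text_emb` encoders, not LAVITI's specific formulation, which also learns temporal representations.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between matched video and text embeddings.

    A generic contrastive objective (not LAVITI's exact loss): row i of
    `video_emb` and row i of `text_emb` form a positive pair, and every
    other pairing in the batch serves as a negative.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(len(v), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```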
no code implementations • 25 May 2024 • Changhoon Kim, Kyle Min, Yezhou Yang
In the evolving landscape of text-to-image (T2I) diffusion models, the remarkable ability to generate high-quality images from textual descriptions comes with the risk of misuse, including the reproduction of sensitive content.
1 code implementation • CVPR 2024 • Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, Giovanni Maria Farinella
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos.
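As a rough illustration of what such a representation might hold, the hypothetical data structure below pairs a verb node with object nodes, labeled relations, and a temporal extent; it is not the official EASG schema or annotation format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EgocentricActionGraph:
    """Illustrative (not official) structure for an action scene graph:
    a verb node connected to object nodes by labeled relations, grounded
    in a temporal segment of the video."""
    verb: str                                             # e.g. "cut"
    objects: List[str] = field(default_factory=list)      # e.g. ["knife", "onion"]
    relations: List[Tuple[str, str, str]] = field(default_factory=list)
    start_frame: int = 0
    end_frame: int = 0

graph = EgocentricActionGraph(
    verb="cut",
    objects=["knife", "onion"],
    relations=[("cut", "with", "knife"), ("cut", "dobj", "onion")],
    start_frame=120,
    end_frame=180,
)
```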
3 code implementations • 18 Jun 2023 • Kyle Min
This report introduces our novel method named STHG for the Audio-Visual Diarization task of the Ego4D Challenge 2023.
1 code implementation • CVPR 2024 • Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, Yezhou Yang
This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images, thereby serving as a potential countermeasure to model misuse.
1 code implementation • CVPR 2023 • Yi Li, Kyle Min, Subarna Tripathi, Nuno Vasconcelos
Do video-text transformers learn to model temporal relationships across frames?
Ranked #4 on Video Question Answering on AGQA 2.0 balanced (Average Accuracy metric)
1 code implementation • CVPR 2023 • Sayak Nag, Kyle Min, Subarna Tripathi, Amit K. Roy Chowdhury
Dynamic scene graph generation (SGG) from videos is challenging due to the inherent dynamics of a scene, the temporal fluctuation of model predictions, and the long-tailed distribution of visual relationships, in addition to the challenges already present in image-based SGG.
no code implementations • 14 Oct 2022 • Kyle Min
This report describes our approach for the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022.
2 code implementations • 15 Jul 2022 • Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar
Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows.
Ranked #1 on Node Classification on AVA
no code implementations • 2 Dec 2021 • Sourya Roy, Kyle Min, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar
We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
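The two SPELL entries above cast active speaker detection as node classification on a long-range audio-visual graph. The sketch below shows one plausible graph construction under that framing: nodes are per-face audio-visual features and edges connect faces that are close in time. The time threshold and function names are assumptions, not the released SPELL configuration.

```python
import torch

def build_asd_graph(av_features, timestamps, max_time_gap=0.9):
    """Connect face nodes that are close in time.

    av_features: (N, D) fused audio-visual features, one row per face crop.
    timestamps:  list of N timestamps (seconds) for those crops.
    Returns the node features together with a (2, E) edge index tensor,
    ready for a GNN that classifies each node as speaking or silent.
    """
    src, dst = [], []
    n = len(timestamps)
    for i in range(n):
        for j in range(n):
            if i != j and abs(timestamps[i] - timestamps[j]) <= max_time_gap:
                src.append(i)
                dst.append(j)
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return av_features, edge_index
```

The (2, E) `edge_index` layout matches the convention used by graph libraries such as PyTorch Geometric, where a small GNN can then score each node as active or inactive.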
1 code implementation • 8 Nov 2020 • Kyle Min, Jason J. Corso
We model the distribution of gaze fixations using a variational method.
Ranked #2 on Egocentric Activity Recognition on EGTEA
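One simple way to treat predicted gaze as a distribution is to normalize a spatial heatmap and regularize it against a prior over fixation locations, e.g. recorded gaze when available. The snippet below illustrates only that generic idea; the paper's exact variational formulation may differ.

```python
import torch
import torch.nn.functional as F

def gaze_distribution_loss(gaze_logits, prior_heatmap, eps=1e-8):
    """Treat predicted gaze as a spatial probability distribution and
    regularize it toward a prior (e.g., recorded fixations). Generic
    illustration only, not the paper's exact variational objective.

    gaze_logits:   (B, H, W) unnormalized scores from the network
    prior_heatmap: (B, H, W) non-negative prior over fixation locations
    """
    b, h, w = gaze_logits.shape
    log_q = F.log_softmax(gaze_logits.view(b, -1), dim=-1)   # predicted distribution
    p = prior_heatmap.view(b, -1)
    p = p / (p.sum(dim=-1, keepdim=True) + eps)              # normalize the prior
    # KL(p || q): penalize placing low probability where the prior fixates
    return F.kl_div(log_q, p, reduction='batchmean')
```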
1 code implementation • ECCV 2020 • Kyle Min, Jason J. Corso
Two triplets of the feature space are considered in our approach: one triplet is used to learn discriminative features for each activity class, and the other is used to distinguish features where no activity occurs (i.e., background features) from activity-related features in each video.
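A minimal sketch of the two-triplet idea follows, using PyTorch's standard triplet margin loss; the variable names and the way features are sampled are illustrative assumptions rather than the paper's training recipe.

```python
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def two_triplet_loss(anchor_cls, pos_cls, neg_cls,
                     activity_feat, other_activity_feat, background_feat):
    """Sketch of the two-triplet idea (names are illustrative):
    1) pull features of the same activity class together while pushing
       other classes away;
    2) push background (no-activity) features away from activity features.
    """
    class_term = triplet(anchor_cls, pos_cls, neg_cls)
    background_term = triplet(activity_feat, other_activity_feat, background_feat)
    return class_term + background_term
```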
1 code implementation • ICCV 2019 • Kyle Min, Jason J. Corso
It consists of two building blocks: the encoder network extracts low-resolution spatiotemporal features from an input clip of several consecutive frames, and the prediction network then decodes these features spatially while aggregating all the temporal information.
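The toy module below mirrors that encode-then-decode design at a very small scale: 3D convolutions reduce spatial resolution across several frames, the temporal axis is then aggregated, and 2D transposed convolutions restore a full-resolution per-pixel prediction. It is an illustrative sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    """Minimal illustration of the encode-then-decode design (not the
    paper's network): 3D convolutions extract low-resolution
    spatiotemporal features, then the prediction head collapses the
    temporal axis while restoring spatial resolution."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        self.temporal_pool = nn.AdaptiveAvgPool3d((1, None, None))  # aggregate time
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, clip):                 # clip: (B, C, T, H, W)
        feat = self.encoder(clip)            # (B, 64, T, H/4, W/4)
        feat = self.temporal_pool(feat)      # (B, 64, 1, H/4, W/4)
        feat = feat.squeeze(2)               # drop the temporal axis
        return self.decoder(feat)            # (B, 1, H, W) per-pixel map
```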
no code implementations • CVPR 2018 • Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, Honglak Lee
The essential ingredients of our methods are confidence-calibrated classifiers, data relabeling, and the leave-one-out strategy for modeling novel classes under the hierarchical taxonomy.
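The leave-one-out idea can be pictured as a data-relabeling step over a class taxonomy: in each round, one child class per parent is withheld and its samples are relabeled to the parent, so the classifier sees examples of "novel under this parent." The helper below is an illustrative sketch of that relabeling with hypothetical names, not the paper's training procedure.

```python
def leave_one_out_relabel(samples, taxonomy, held_out):
    """Relabel samples of held-out leaf classes to their parent class so a
    classifier can be trained to recognize 'novel under this parent'.
    Illustrative sketch of the leave-one-out idea only.

    samples:   list of (feature, leaf_label) pairs
    taxonomy:  dict mapping leaf_label -> parent_label
    held_out:  set of leaf labels treated as novel in this round
    """
    relabeled = []
    for feature, leaf in samples:
        if leaf in held_out:
            relabeled.append((feature, taxonomy[leaf]))   # novel -> parent node
        else:
            relabeled.append((feature, leaf))             # known leaf kept as-is
    return relabeled
```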