no code implementations • 4 Apr 2025 • Yuyang Hu, Suhas Lohit, Ulugbek S. Kamilov, Tim K. Marks
In addition, we propose a novel multimodal diffusion bridge for multimodal image restoration, built on an efficient two-branch backbone with dedicated cross-modality fusion blocks that extract and fuse features from synthetic aperture radar (SAR) and optical images.
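As a rough illustration of the two-branch-plus-fusion idea (not the authors' architecture; the module names, sizes, and the cross-attention fusion design below are our assumptions), a minimal PyTorch sketch:

```python
# Sketch only: a two-branch backbone with a cross-attention fusion block
# for SAR/optical features. All dimensions and designs are illustrative.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses SAR features with optical features via cross-attention."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, sar_feat, opt_feat):
        # Flatten spatial maps to token sequences: (B, C, H, W) -> (B, HW, C)
        b, c, h, w = sar_feat.shape
        q = sar_feat.flatten(2).transpose(1, 2)
        kv = opt_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(self.norm(q), kv, kv)
        fused = q + fused  # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)

class TwoBranchBackbone(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.sar_branch = nn.Conv2d(1, dim, 3, padding=1)  # SAR: 1 channel
        self.opt_branch = nn.Conv2d(3, dim, 3, padding=1)  # optical: RGB
        self.fusion = CrossModalFusion(dim)

    def forward(self, sar, opt):
        return self.fusion(self.sar_branch(sar), self.opt_branch(opt))
```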
no code implementations • 21 Mar 2025 • Vineet R. Shenoy, Shaoju Wu, Armand Comas, Tim K. Marks, Suhas Lohit, Hassan Mansour
In this paper, we present a modular, interpretable pipeline for pulse signal estimation from video of the face that achieves state-of-the-art results on publicly available datasets. Our imaging photoplethysmography (iPPG) system consists of three modules: face and landmark detection, time-series extraction, and pulse signal/pulse rate estimation.
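To make the third module concrete, here is a minimal sketch of pulse-rate estimation from an already-extracted skin-color time series (the filter order and passband are our assumptions, not the paper's settings):

```python
# Sketch of the pulse-estimation stage of an iPPG pipeline.
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_pulse_rate(trace, fs=30.0, lo=0.7, hi=4.0):
    """Band-pass a per-frame skin signal and read off the dominant frequency.

    trace: 1-D array of mean skin-pixel intensity per video frame.
    fs:    video frame rate in Hz.
    Returns pulse rate in beats per minute (0.7-4 Hz = 42-240 BPM band).
    """
    b, a = butter(3, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, trace - trace.mean())
    spectrum = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]
```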
no code implementations • 21 Mar 2025 • Vineet R Shenoy, Suhas Lohit, Hassan Mansour, Rama Chellappa, Tim K. Marks
Camera-based monitoring of vital signs, also known as imaging photoplethysmography (iPPG), has seen applications in driver monitoring, perfusion assessment in surgical settings, affective computing, and more.
1 code implementation • CVPR 2024 • Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks
To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image.
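As we read the abstract, a high-level sketch of such a loop might look as follows; `denoise_fn` is a hypothetical wrapper around the frozen video diffusion model, and the windowing details are our assumptions, not the paper's exact procedure:

```python
# Illustrative "repeat-and-slide" conditioning loop for frame-by-frame
# video synthesis from a single provided image.
def repeat_and_slide(first_frame, denoise_fn, num_frames, window=8):
    """denoise_fn(ctx) synthesizes the next frame given a conditioning
    window of previous frames (hypothetical model interface)."""
    frames = [first_frame]
    for _ in range(1, num_frames):
        ctx = frames[-(window - 1):]
        # "Repeat": early on, pad the window with copies of the given image.
        ctx = [frames[0]] * (window - 1 - len(ctx)) + ctx
        # "Slide": the window shifts forward over newly synthesized frames.
        frames.append(denoise_fn(ctx))
    return frames
```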
no code implementations • CVPR 2024 • Zeyuan Yang, Jiageng Liu, Peihao Chen, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, Chuang Gan
We leverage Large Language Models (LLMs) for zero-shot Semantic Audio-Visual Navigation (SAVN).
1 code implementation • ICCV 2023 • Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M. Patel, Tim K. Marks
To this end, and capitalizing on the powerful fine-grained generative control offered by the recent diffusion-based generative models, we introduce Steered Diffusion, a generalized framework for photorealistic zero-shot conditional image generation using a diffusion model trained for unconditional generation.
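The general mechanism the abstract describes, steering an unconditional model's sampling with the gradient of a task loss, can be sketched as below; the model interface, DDIM-style update, and schedule handling are our assumptions, not the paper's exact procedure:

```python
# Illustrative loss-guided sampling with an unconditional diffusion model.
import torch

@torch.no_grad()
def steered_sample(model, x, alphas_cumprod, loss_fn, scale=1.0):
    """model(x, t) -> predicted noise; loss_fn(x0_hat) -> steering loss."""
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        eps = model(x, t)
        # Estimate the clean image implied by the current noisy sample.
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        with torch.enable_grad():
            x0_req = x0_hat.detach().requires_grad_(True)
            grad = torch.autograd.grad(loss_fn(x0_req).sum(), x0_req)[0]
        # Nudge the estimate against the steering-loss gradient, then step.
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x = a_prev.sqrt() * (x0_hat - scale * grad) + (1 - a_prev).sqrt() * eps
    return x
```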
no code implementations • 22 Oct 2022 • Kei Ota, Hsiao-Yu Tung, Kevin A. Smith, Anoop Cherian, Tim K. Marks, Alan Sullivan, Asako Kanezaki, Joshua B. Tenenbaum
The world is filled with articulated objects whose use is difficult to determine from vision alone, e.g., a door might open inwards or outwards.
no code implementations • 18 Feb 2022 • Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.
Ranked #43 on Video Question Answering on NExT-QA
no code implementations • 1 Nov 2021 • Safa C. Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B. Tenenbaum, Xiaoming Liu, Tim K. Marks
Recent advances in generative adversarial networks (GANs) have led to remarkable achievements in face image synthesis.
no code implementations • 13 Oct 2021 • Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori
In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8).
no code implementations • ICCV 2021 • Anoop Cherian, Goncalo Dias Pais, Siddarth Jain, Tim K. Marks, Alan Sullivan
To use our model for instance segmentation, we propose an instance pose encoder that learns to take in a generated depth image and reproduce the pose code vectors for all of the object instances.
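A minimal sketch of such an encoder, mapping a depth image to a fixed number of pose-code vectors, might look as follows (the architecture, sizes, and fixed instance count are illustrative assumptions):

```python
# Sketch of an instance pose encoder: depth image -> per-instance pose codes.
import torch.nn as nn

class InstancePoseEncoder(nn.Module):
    def __init__(self, max_instances=8, code_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, max_instances * code_dim)
        self.max_instances, self.code_dim = max_instances, code_dim

    def forward(self, depth):
        # depth: (B, 1, H, W) -> (B, max_instances, code_dim) pose codes
        return self.head(self.backbone(depth)).view(
            -1, self.max_instances, self.code_dim)
```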
1 code implementation • CVPR 2020 • Abhinav Kumar, Tim K. Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, Chen Feng
In this paper, we present a novel framework for jointly predicting landmark locations, associated uncertainties of these predicted locations, and landmark visibilities.
Ranked #1 on Face Alignment on Menpo
no code implementations • 17 Jan 2020 • Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks
To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame.
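The two attention orders can be sketched compactly with dot-product attention; the feature shapes and language conditioning below are illustrative assumptions:

```python
# Sketch of the two attention orders: spatio-temporal vs. temporo-spatial.
import torch

def st_branch(feats, lang):  # feats: (T, R, D) regions/frame; lang: (D,)
    # Spatio-temporal: attend over regions first, then pool over time.
    region_w = torch.softmax(feats @ lang, dim=1).unsqueeze(-1)  # (T, R, 1)
    per_frame = (region_w * feats).sum(1)                        # (T, D)
    return per_frame.mean(0)                                     # (D,)

def ts_branch(feats, lang):
    # Temporo-spatial: pick one frame first, then attend spatially in it.
    frame_scores = torch.softmax(feats.mean(1) @ lang, dim=0)    # (T,)
    frame = feats[frame_scores.argmax()]                         # (R, D)
    region_w = torch.softmax(frame @ lang, dim=0).unsqueeze(-1)  # (R, 1)
    return (region_w * frame).sum(0)                             # (D,)
```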
no code implementations • 14 Nov 2019 • Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta
This paper introduces the Eighth Dialog System Technology Challenge.
2 code implementations • 25 Jan 2019 • Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
We introduce the task of scene-aware dialog.
no code implementations • 11 Jan 2019 • Koichiro Yoshino, Chiori Hori, Julien Perez, Luis Fernando D'Haro, Lazaros Polymenakos, Chulaka Gunasekara, Walter S. Lasecki, Jonathan K. Kummerfeld, Michel Galley, Chris Brockett, Jianfeng Gao, Bill Dolan, Xiang Gao, Huda Alamri, Tim K. Marks, Devi Parikh, Dhruv Batra
This paper introduces the Seventh Dialog System Technology Challenge (DSTC7), which uses shared datasets to explore the problem of building dialog systems.
2 code implementations • 21 Jun 2018 • Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh
We introduce a new dataset of dialogs about videos of human behaviors.
4 code implementations • 1 Jun 2018 • Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K. Marks, Chiori Hori
Scene-aware dialog systems will be able to have conversations with users about the objects and events around them.
no code implementations • 30 Mar 2018 • Varun Manjunatha, Srikumar Ramalingam, Tim K. Marks, Larry Davis
To accomplish this, we use a submodular set function to model the accuracy achievable on a new task when the features have been learned on a given subset of classes of the source dataset.
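Monotone submodular objectives admit a simple greedy maximizer with a (1 - 1/e) approximation guarantee. A toy sketch with a coverage-style stand-in objective (not the paper's actual set function):

```python
# Greedy submodular maximization for selecting source classes.
def greedy_select(classes, value, budget):
    """Repeatedly add the class with the largest marginal gain under
    `value`, assumed to be a monotone submodular set function."""
    chosen = set()
    for _ in range(budget):
        best = max((c for c in classes if c not in chosen),
                   key=lambda c: value(chosen | {c}) - value(chosen))
        chosen.add(best)
    return chosen

# Example: weighted coverage, a classic monotone submodular function.
features_of = {"cat": {1, 2}, "dog": {2, 3}, "car": {4}, "bus": {4, 5}}
coverage = lambda S: len(set().union(*(features_of[c] for c in S))) if S else 0
print(greedy_select(features_of, coverage, budget=2))  # e.g. {'cat', 'bus'}
```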
no code implementations • ICCV 2017 • Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks
Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs).
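The encoder-decoder pattern referred to here can be sketched with two LSTMs, one summarizing per-frame features and one emitting words; the sizes and greedy decoding are illustrative assumptions:

```python
# Tiny sketch of RNN encoder-decoder video captioning.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, max_len=20, bos=1):
        # frame_feats: (B, T, feat_dim); encode the video into LSTM state.
        _, state = self.encoder(frame_feats)
        tok = torch.full((frame_feats.size(0), 1), bos, dtype=torch.long)
        words = []
        for _ in range(max_len):  # greedy word-by-word decoding
            h, state = self.decoder(self.embed(tok), state)
            tok = self.out(h).argmax(-1)
            words.append(tok)
        return torch.cat(words, dim=1)
```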
no code implementations • CVPR 2016 • Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel, Ming Shao
We present a multi-stream bi-directional recurrent neural network for fine-grained action detection.
Tasks: Action Recognition in Videos, Fine-Grained Action Detection
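One way to realize the multi-stream bidirectional idea: concatenate appearance and motion features per frame and classify each frame with a bi-LSTM. All dimensions below are illustrative assumptions, not the paper's configuration:

```python
# Sketch: multi-stream bidirectional RNN for frame-level action detection.
import torch
import torch.nn as nn

class MultiStreamBiRNN(nn.Module):
    def __init__(self, rgb_dim=512, flow_dim=512, hidden=128, n_actions=10):
        super().__init__()
        self.rnn = nn.LSTM(rgb_dim + flow_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, n_actions + 1)  # +1 for background

    def forward(self, rgb_feats, flow_feats):
        # (B, T, rgb_dim), (B, T, flow_dim) -> per-frame scores (B, T, A+1)
        x = torch.cat([rgb_feats, flow_feats], dim=-1)
        h, _ = self.rnn(x)
        return self.cls(h)
```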
no code implementations • 13 Nov 2015 • Oncel Tuzel, Tim K. Marks, Salil Tambe
Face alignment is particularly challenging when there are large variations in pose (in-plane and out-of-plane rotations) and facial expression.
no code implementations • CVPR 2015 • Ejaz Ahmed, Michael Jones, Tim K. Marks
Novel elements of our architecture include a layer that computes cross-input neighborhood differences, which capture local relationships among mid-level features that were computed separately from the two input images.
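A sketch of such a cross-input neighborhood-difference operation: each feature in one image is compared against a k x k neighborhood at the same location in the other image (the neighborhood size and the `unfold`-based formulation are our assumptions):

```python
# Sketch: cross-input neighborhood differences for person re-identification.
import torch
import torch.nn.functional as F

def cross_neighborhood_diff(fa, fb, k=5):
    """fa, fb: (B, C, H, W) mid-level feature maps from the two inputs.
    Returns (B, C, H, W, k, k): fa minus fb's local k x k neighborhoods."""
    b, c, h, w = fa.shape
    # Extract a k x k patch of fb around every spatial location.
    patches = F.unfold(fb, k, padding=k // 2)           # (B, C*k*k, H*W)
    patches = patches.view(b, c, k, k, h, w).permute(0, 1, 4, 5, 2, 3)
    return fa[..., None, None] - patches                # broadcast difference
```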
no code implementations • CVPR 2015 • Chavdar Papazov, Tim K. Marks, Michael Jones
The matched triangular surface patches in the training set are used to compute estimates of the 3D head pose and facial landmark positions in the input depth map.