Search Results for author: Tim K. Marks

Found 20 papers, 5 papers with code

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

1 code implementation 25 Apr 2024 Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks

To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image.

Denoising Image to Video Generation
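
The repeat-and-slide idea can be sketched in a few lines. Everything below (the function name, the toy one-step denoiser in the test) is illustrative, not the authors' implementation:

```python
import numpy as np

def repeat_and_slide(first_frame, num_frames, denoise_step, num_steps=10):
    """Toy sketch of a repeat-and-slide schedule: each new frame starts from
    fresh noise, every reverse-denoising step is conditioned on the frames
    already synthesized ("repeat"), and the window then advances by one
    frame ("slide"). The diffusion model (denoise_step) stays frozen."""
    rng = np.random.default_rng(0)
    video = [first_frame]
    for _ in range(num_frames - 1):
        x = rng.standard_normal(first_frame.shape)   # fresh noise per frame
        for t in range(num_steps):
            # condition each reverse step on the most recent frame
            x = denoise_step(x, context=video[-1], t=t)
        video.append(x)
    return np.stack(video)
```

With a toy denoiser that pulls the sample toward its context, the generated frames stay anchored to the provided first image.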

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

no code implementations ICCV 2023 Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M. Patel, Tim K. Marks

To this end, and capitalizing on the powerful fine-grained generative control offered by the recent diffusion-based generative models, we introduce Steered Diffusion, a generalized framework for photorealistic zero-shot conditional image generation using a diffusion model trained for unconditional generation.

Colorization Conditional Image Generation +2
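
The plug-and-play recipe can be illustrated with a minimal sketch (hypothetical names; the real method operates on diffusion score estimates, not this toy update): steer a frozen unconditional sampler by following the gradient of a task-specific conditioning loss at each reverse step.

```python
import numpy as np

def steered_sample(x, denoise, cond_loss_grad, steps=20, guidance=0.3):
    """Sketch of steering an unconditional reverse process: after each
    denoising step, nudge the sample down the gradient of a conditioning
    loss (e.g., distance to a grayscale target for colorization). The
    generative model is never fine-tuned; only the trajectory is modulated."""
    for t in range(steps):
        x = denoise(x, t)                      # unconditional reverse step
        x = x - guidance * cond_loss_grad(x)   # plug-and-play steering
    return x
```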

H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions

no code implementations 22 Oct 2022 Kei Ota, Hsiao-Yu Tung, Kevin A. Smith, Anoop Cherian, Tim K. Marks, Alan Sullivan, Asako Kanezaki, Joshua B. Tenenbaum

The world is filled with articulated objects whose use is difficult to determine from vision alone, e.g., a door might open inwards or outwards.
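
The title spells out the control loop, which can be sketched as a simple belief update over articulation hypotheses (all callables below are illustrative stand-ins, not the paper's physics simulator or policy):

```python
def h_saur_loop(hypotheses, simulate, act, observe, steps=3):
    """Minimal hypothesize-simulate-act-update-repeat sketch: keep a weight
    per articulation hypothesis (e.g. "opens inward" vs "opens outward"),
    probe the object, and upweight hypotheses whose simulated outcome
    matches what was actually observed."""
    weights = {h: 1.0 for h in hypotheses}
    for _ in range(steps):
        action = act(weights)                  # act: probe the object
        outcome = observe(action)              # what actually happened
        for h in hypotheses:                   # update: simulation vs reality
            weights[h] *= 2.0 if simulate(h, action) == outcome else 0.5
        total = sum(weights.values())
        weights = {h: w / total for h, w in weights.items()}
    return max(weights, key=weights.get)       # most likely articulation
```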

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

no code implementations 18 Feb 2022 Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux

Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.

Question Answering Spatio-temporal Scene Graphs +1

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

no code implementations 13 Oct 2021 Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8).

Region Proposal

InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images

no code implementations ICCV 2021 Anoop Cherian, Goncalo Dias Pais, Siddarth Jain, Tim K. Marks, Alan Sullivan

To use our model for instance segmentation, we propose an instance pose encoder that learns to take in a generated depth image and reproduce the pose code vectors for all of the object instances.

Generative Adversarial Network Instance Segmentation +2

Spatio-Temporal Ranked-Attention Networks for Video Captioning

no code implementations 17 Jan 2020 Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame.

Video Captioning
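
The two attention orders can be sketched with toy dot-product attention (a simplification under assumed shapes, not the paper's ranked-attention model): the ST branch attends over regions within each frame and then pools over time, while the TS branch first attends over frames and then over regions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def st_branch(feats, lang):
    """Spatio-temporal order. feats: (T, R, D) region features per frame,
    lang: (D,) language state. Regions first, then temporal pooling."""
    region_attn = softmax(feats @ lang, axis=1)          # (T, R)
    per_frame = (region_attn[..., None] * feats).sum(1)  # (T, D)
    time_attn = softmax(per_frame @ lang, axis=0)        # (T,)
    return (time_attn[:, None] * per_frame).sum(0)       # (D,)

def ts_branch(feats, lang):
    """Temporo-spatial order: pick a frame first (temporal attention over
    frame means), then attend spatially within that frame summary."""
    frame_means = feats.mean(1)                          # (T, D)
    time_attn = softmax(frame_means @ lang, axis=0)      # (T,)
    frame = (time_attn[:, None, None] * feats).sum(0)    # (R, D)
    region_attn = softmax(frame @ lang, axis=0)          # (R,)
    return (region_attn[:, None] * frame).sum(0)         # (D,)
```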

Class Subset Selection for Transfer Learning using Submodularity

no code implementations 30 Mar 2018 Varun Manjunatha, Srikumar Ramalingam, Tim K. Marks, Larry Davis

To accomplish this, we use a submodular set function to model the accuracy achievable on a new task when the features have been learned on a given subset of classes of the source dataset.

Image Classification Transfer Learning
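
Monotone submodular objectives admit the standard greedy algorithm with a (1 - 1/e) approximation guarantee, which is the natural selection procedure here. A generic sketch (the `gain` callable stands in for the paper's accuracy model and is an assumption):

```python
def greedy_subset(classes, gain, budget):
    """Greedy maximization of a monotone submodular set function:
    repeatedly add the source class with the largest marginal gain
    until the budget (subset size) is reached."""
    chosen = []
    remaining = list(classes)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda c: gain(chosen + [c]) - gain(chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

A coverage-style objective (size of the union of target concepts each class covers) is a textbook example of such a function.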

Attention-Based Multimodal Fusion for Video Description

no code implementations ICCV 2017 Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks

Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs).

Decoder Sentence +1

Robust Face Alignment Using a Mixture of Invariant Experts

no code implementations 13 Nov 2015 Oncel Tuzel, Tim K. Marks, Salil Tambe

Face alignment is particularly challenging when there are large variations in pose (in-plane and out-of-plane rotations) and facial expression.

Face Alignment regression +1

Real-Time 3D Head Pose and Facial Landmark Estimation From Depth Images Using Triangular Surface Patch Features

no code implementations CVPR 2015 Chavdar Papazov, Tim K. Marks, Michael Jones

The matched triangular surface patches in the training set are used to compute estimates of the 3D head pose and facial landmark positions in the input depth map.

Face Alignment Head Pose Estimation
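
The matching-and-voting step can be sketched with toy descriptors (the real method matches triangular surface patch features from depth maps; names and shapes below are assumptions): find the nearest training patches in descriptor space and average the pose/landmark parameters stored with them.

```python
import numpy as np

def estimate_from_patches(query_desc, train_descs, train_params, k=3):
    """k-nearest-neighbor vote: match the query patch descriptor against
    all training descriptors and average the parameters (e.g., head pose,
    landmark offsets) attached to the k closest matches."""
    d = np.linalg.norm(train_descs - query_desc, axis=1)  # distances
    nearest = np.argsort(d)[:k]
    return train_params[nearest].mean(axis=0)             # averaged estimate
```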

An Improved Deep Learning Architecture for Person Re-Identification

no code implementations CVPR 2015 Ejaz Ahmed, Michael Jones, Tim K. Marks

Novel elements of our architecture include a layer that computes cross-input neighborhood differences, which capture local relationships among mid-level features that were computed separately from the two input images.

Person Re-Identification Small Data Image Classification
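
The cross-input neighborhood difference layer can be sketched for a single channel (a simplification; the paper's layer operates on learned mid-level feature maps): each location in one image's map is compared against a small neighborhood of the other image's map around the same location, which tolerates small misalignments between the two views.

```python
import numpy as np

def neighborhood_differences(f, g, n=1):
    """For each location (i, j) in map f, subtract the (2n+1) x (2n+1)
    neighborhood of the other input's map g around (i, j).
    f, g: (H, W) single-channel maps. Output: (H, W, 2n+1, 2n+1)."""
    h, w = f.shape
    pad = np.pad(g, n, mode="edge")            # replicate borders
    out = np.empty((h, w, 2 * n + 1, 2 * n + 1))
    for i in range(h):
        for j in range(w):
            out[i, j] = f[i, j] - pad[i:i + 2 * n + 1, j:j + 2 * n + 1]
    return out
```

For identical inputs, the center difference at every location is zero, as expected.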
