Search Results for author: Tanzila Rahman

Found 10 papers, 5 papers with code

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

no code implementations 18 Feb 2024 Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal

We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks (see the sketch below).

Image Generation
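
Since no code is reported for this paper, the block below is a minimal, hypothetical sketch of the alternating-refinement idea: a concept-token step under a masked denoising loss, followed by a mask re-estimation step. The denoiser, the conditioning, and the mask-update rule are toy stand-ins (the paper derives masks from the model itself), so read this as an illustration of the control flow only.

```python
# Toy sketch of joint alternating refinement of a concept token and a
# latent mask. All modules, shapes, and update rules are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
denoiser = nn.Conv2d(4, 4, 3, padding=1)       # stand-in for a diffusion U-Net
token = nn.Parameter(torch.randn(1, 768))      # learnable concept-token embedding
opt = torch.optim.Adam([token], lr=1e-3)

clean = torch.randn(1, 4, 64, 64)              # latents of a concept image
mask = torch.ones(1, 1, 64, 64)                # latent mask, refined alternately

for step in range(100):
    noise = torch.randn_like(clean)
    noisy = clean + noise
    # (1) token step: masked denoising loss, gradients flow to the token only
    pred = denoiser(noisy + token.mean() * mask)   # toy token conditioning
    loss = ((pred - noise) * mask).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # (2) mask step: re-estimate the mask from the model's response
    #     (the paper uses the model's own attention; this is a toy proxy)
    with torch.no_grad():
        response = denoiser(noisy).abs().mean(1, keepdim=True)
        mask = (response > response.mean()).float()
```

The design point the sketch preserves is that neither quantity is optimized in isolation: each update consumes the other's latest estimate.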

Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

no code implementations 19 Dec 2023 Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal

Further, we leverage the finding that different timesteps of the diffusion process cater to different levels of detail in an image (see the sketch below).

Image Generation, Prompt Engineering
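
No code is reported here either; the following is a speculative sketch of prompt inversion under the stated finding: a soft prompt embedding is optimized against the denoising loss, with timesteps sampled from a band chosen for the level of detail one wants to recover. The denoiser, noise schedule, and conditioning are illustrative, not the paper's method.

```python
# Toy sketch of prompt inversion: optimize a soft prompt so the denoiser
# reconstructs a target image. Late/high-noise timesteps govern coarse
# layout, early/low-noise ones fine detail, so we bias sampling to a band.
import torch
import torch.nn as nn

T = 1000
denoiser = nn.Conv2d(4, 4, 3, padding=1)        # stand-in for a diffusion U-Net
prompt = nn.Parameter(torch.randn(8, 768))      # soft prompt tokens to invert
opt = torch.optim.Adam([prompt], lr=1e-2)

target = torch.randn(1, 4, 64, 64)              # latents of the image to invert

for step in range(200):
    # restrict optimization to a mid-range timestep band (an assumption)
    t = torch.randint(int(0.3 * T), int(0.7 * T), (1,))
    noise = torch.randn_like(target)
    alpha = 1.0 - t.item() / T                  # toy noise schedule
    noisy = alpha * target + (1 - alpha) * noise
    pred = denoiser(noisy + prompt.mean())      # toy prompt conditioning
    loss = (pred - noise).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```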

Make-A-Story: Visual Memory Conditioned Consistent Story Generation

1 code implementation CVPR 2023 Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal

Our experiments on story generation with the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating high-quality frames that are consistent with the story, but also models appropriate correspondences between the characters and the background.

Sentence, Story Generation +1

TriBERT: Human-centric Audio-visual Representation Learning

1 code implementation NeurIPS 2021 Tanzila Rahman, Mengyu Yang, Leonid Sigal

In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities (vision, pose, and audio) using flexible co-attention (see the sketch below).

Pose Retrieval, Representation Learning +1
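
The released implementation is the authoritative reference; purely as a rough illustration, the block below sketches one way to realize tri-modal co-attention in PyTorch, with each modality's queries attending over the concatenated tokens of the other two. The dimensions, module names, and single-block structure are assumptions, not the paper's exact architecture.

```python
# Minimal tri-modal co-attention in the spirit of TriBERT / ViLBERT:
# each modality attends over the keys/values of the other two.
import torch
import torch.nn as nn

class TriCoAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, heads, batch_first=True)
            for m in ("vision", "pose", "audio")
        })

    def forward(self, feats: dict) -> dict:
        out = {}
        for m, x in feats.items():
            # context = the tokens of the other two modalities
            ctx = torch.cat([v for k, v in feats.items() if k != m], dim=1)
            out[m], _ = self.attn[m](query=x, key=ctx, value=ctx)
        return out

block = TriCoAttention()
feats = {m: torch.randn(2, 10, 256) for m in ("vision", "pose", "audio")}
fused = block(feats)   # each modality now carries cross-modal context
```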

TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

1 code implementation 26 Oct 2021 Tanzila Rahman, Mengyu Yang, Leonid Sigal

In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities (vision, pose, and audio) using flexible co-attention.

Pose Retrieval, Representation Learning +1

Weakly-supervised Audio-visual Sound Source Detection and Separation

no code implementations 25 Mar 2021 Tanzila Rahman, Leonid Sigal

Learning to localize and separate individual object sounds in a video's audio channel is a difficult task.

Audio Source Separation, Denoising +5

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

no code implementations ICCV 2019 Tanzila Rahman, Bicheng Xu, Leonid Sigal

Multi-modal learning, particularly between imaging and linguistic modalities, has made impressive strides on many fundamental high-level visual understanding problems, ranging from language grounding to dense event captioning.

Convolutional Temporal Attention Model for Video-based Person Re-identification

no code implementations 9 Apr 2019 Tanzila Rahman, Mrigank Rochan, Yang Wang

A common approach to video-based person re-identification is to first extract image features for all frames in the video and then aggregate them to form a video-level feature (see the sketch below).

Semantic Segmentation, Video-Based Person Re-Identification
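
No code is reported for this paper; the sketch below illustrates the described pipeline with a learned temporal attention that weights per-frame features before summing them into a video-level feature, one plausible instance of the paper's titular temporal attention. The pooling module, its name, and all shapes are illustrative assumptions.

```python
# Toy per-frame feature aggregation via learned temporal attention,
# replacing plain averaging with data-dependent frame weights.
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one attention score per frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) image features per frame
        w = torch.softmax(self.score(frame_feats), dim=1)   # (B, T, 1)
        return (w * frame_feats).sum(dim=1)                 # (B, dim)

pool = TemporalAttentionPool()
frames = torch.randn(4, 16, 512)      # 4 tracklets, 16 frames each
video_feat = pool(frames)             # aggregated video-level feature
```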
