Search Results for author: Shih-Fu Chang

Found 151 papers, 56 papers with code

Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks

no code implementations CSRR (ACL) 2022 Yue Wan, Yueen Ma, Haoxuan You, Zhecan Wang, Shih-Fu Chang

Large-scale visual-linguistic pre-training aims to capture the generic representations from multimodal features, which are essential for downstream vision-language tasks.

Informativeness Type prediction +1

Coreference by Appearance: Visually Grounded Event Coreference Resolution

no code implementations CRAC (ACL) 2021 Liming Wang, Shengyu Feng, Xudong Lin, Manling Li, Heng Ji, Shih-Fu Chang

Event coreference resolution is critical to understand events in the growing number of online news with multiple modalities including text, video, speech, etc.

coreference-resolution Event Coreference Resolution +2

RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

no code implementations 27 Mar 2024 Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-Fu Chang

(3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, limiting generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined.

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

1 code implementation 18 Mar 2024 Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji

This survey paper serves as a comprehensive resource for researchers and practitioners in the fields of natural language processing, computer vision, and data analysis, providing valuable insights and directions for future research in chart understanding leveraging large foundation models.

Data Visualization

SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

no code implementations 3 Mar 2024 Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang

We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations.

Contrastive Learning

Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

2 code implementations 15 Dec 2023 Kung-Hsiang Huang, Mingyang Zhou, Hou Pong Chan, Yi R. Fung, Zhenhailong Wang, Lingyu Zhang, Shih-Fu Chang, Heng Ji

This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.

Factual Inconsistency Detection in Chart Captioning Image Captioning +1

Characterizing Video Question Answering with Sparsified Inputs

no code implementations 27 Nov 2023 Shiyuan Huang, Robinson Piramuthu, Vicente Ordonez, Shih-Fu Chang, Gunnar A. Sigurdsson

From our experiments, we have observed only 5.2%-5.8% loss of performance with only 10% of video lengths, which corresponds to 2-4 frames selected from each video.

Question Answering Video Question Answering

Ferret: Refer and Ground Anything Anywhere at Any Granularity

1 code implementation 11 Oct 2023 Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions.

Hallucination Language Modelling +1

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

1 code implementation 3 Jul 2023 Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

However, we find that visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding.

Image-text matching Sentence +2

Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs

no code implementations 29 May 2023 Mingyang Zhou, Yi R. Fung, Long Chen, Christopher Thomas, Heng Ji, Shih-Fu Chang

Building cross-modal intelligence that can understand charts and communicate the salient information hidden behind them is an appealing challenge in the vision and language (V+L) community.

Chart Question Answering Question Answering +1

Non-Sequential Graph Script Induction via Multimedia Grounding

1 code implementation 27 May 2023 Yu Zhou, Sha Li, Manling Li, Xudong Lin, Shih-Fu Chang, Mohit Bansal, Heng Ji

To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks.

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

1 code implementation 24 May 2023 Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer.
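
A minimal sketch of this divide-and-conquer loop, assuming generic `llm(prompt) -> str` and `vlm(image, question) -> str` callables (both hypothetical stand-ins, not the paper's actual API):

```python
def ideal_gpt(question, image, llm, vlm, max_rounds=3):
    # Iteratively decompose the main question until the reasoner is confident.
    evidence = []
    verdict = "UNSURE"
    for _ in range(max_rounds):
        sub_questions = llm(
            f"Main question: {question}\nKnown evidence: {evidence}\n"
            "List the sub-questions still needed, one per line:"
        ).splitlines()
        # The VLM answers each sub-question against the image.
        evidence += [(q, vlm(image, q)) for q in sub_questions]
        verdict = llm(
            f"Main question: {question}\nEvidence: {evidence}\n"
            "Answer the main question, or reply UNSURE if evidence is insufficient:"
        )
        if verdict.strip() != "UNSURE":
            break
    return verdict
```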

Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

no code implementations 7 Apr 2023 Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang

Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video.

Question Answering Question Generation +3

Supervised Masked Knowledge Distillation for Few-Shot Transformers

1 code implementation CVPR 2023 Han Lin, Guangxing Han, Jiawei Ma, Shiyuan Huang, Xudong Lin, Shih-Fu Chang

Vision Transformers (ViTs) emerge to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features.

Few-Shot Learning Inductive Bias +1

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

1 code implementation CVPR 2023 Jiawei Ma, Yulei Niu, Jincheng Xu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang

Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data.

Few-Shot Object Detection object-detection

In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

no code implementations 6 Jan 2023 Andrew Lu, Xudong Lin, Yulei Niu, Shih-Fu Chang

Understanding event relationships in videos requires a model to understand the underlying structures of events (i.e., the event type, the associated argument roles, and corresponding entities) and factual knowledge for reasoning.

Relation

TempCLR: Temporal Alignment Representation with Contrastive Learning

1 code implementation 28 Dec 2022 Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang

For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly.

Contrastive Learning Dynamic Time Warping +7
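
Given the Dynamic Time Warping tag above, a plain DTW recursion over a clip-sentence cost matrix illustrates the kind of temporal alignment involved (a textbook sketch, not TempCLR's exact objective):

```python
import numpy as np

def dtw(cost):
    """Accumulate the minimal alignment cost between a sequence of video
    clips (rows) and a sequence of sentences (columns)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance the clip index
                acc[i, j - 1],      # advance the sentence index
                acc[i - 1, j - 1],  # match clip i with sentence j
            )
    return acc[n, m]
```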

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

no code implementations 14 Dec 2022 Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang

We present a new commonsense task, Human-centric Commonsense Grounding, that tests the models' ability to ground individuals given the context descriptions about what happened before, and their mental/physical states or intentions.

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

no code implementations 10 Nov 2022 Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, Shih-Fu Chang

Visual commonsense understanding requires Vision Language (VL) models to not only understand image and text but also cross-reference in-between to fully integrate and achieve comprehension of the visual scene described.

Video Event Extraction via Tracking Visual States of Arguments

no code implementations 3 Nov 2022 Guang Yang, Manling Li, Jiajie Zhang, Xudong Lin, Shih-Fu Chang, Heng Ji

Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles.

Event Extraction

Weakly-Supervised Temporal Article Grounding

1 code implementation 22 Oct 2022 Long Chen, Yulei Niu, Brian Chen, Xudong Lin, Guangxing Han, Christopher Thomas, Hammad Ayyubi, Heng Ji, Shih-Fu Chang

Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences to the video, and these sentences are possibly at different semantic scales.

Natural Language Queries Sentence +1

Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy

1 code implementation 15 Oct 2022 Shiyuan Huang, Robinson Piramuthu, Shih-Fu Chang, Gunnar A. Sigurdsson

Specifically, we insert a lightweight Feature Compression Module (FeatComp) into a VideoQA model which learns to extract task-specific tiny features as little as 10 bits, which are optimal for answering certain types of questions.

Feature Compression Question Answering +1
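
A rough sketch of such a bottleneck, assuming a simple linear down-projection with straight-through binarization (the module layout and `n_bits` parameter are illustrative; the paper's FeatComp may differ):

```python
import torch
import torch.nn as nn

class FeatComp(nn.Module):
    """Compress a video feature to n_bits binary values, then expand it
    back for the downstream VideoQA head."""
    def __init__(self, in_dim, n_bits=10):
        super().__init__()
        self.down = nn.Linear(in_dim, n_bits)
        self.up = nn.Linear(n_bits, in_dim)

    def forward(self, feat):
        z = self.down(feat)
        # Straight-through estimator: binarize on the forward pass,
        # pass gradients through unchanged on the backward pass.
        code = z + (torch.sign(z) - z).detach()
        return self.up(code)
```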

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

1 code implementation 26 Jul 2022 Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan

Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space.

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

1 code implementation CVPR 2023 Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang

We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yields the best performance, which can even outperform state-of-the-art on the iVQA and How2QA datasets without additional training on millions of video-text data.

Retrieval Sentence +2

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

1 code implementation 22 May 2022 Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction.

Attribute Automatic Speech Recognition +6

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

no code implementations 22 Apr 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.

Question Answering Visual Commonsense Reasoning +2

Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting

no code implementations 16 Apr 2022 Guangxing Han, Long Chen, Jiawei Ma, Shiyuan Huang, Rama Chellappa, Shih-Fu Chang

Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning to learn generalizable few-shot and zero-shot object detection models respectively without fine-tuning.

Few-Shot Learning Few-Shot Object Detection +3

Fine-Grained Visual Entailment

1 code implementation 29 Mar 2022 Christopher Thomas, YiPeng Zhang, Shih-Fu Chang

In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.

Multimodal Reasoning Visual Entailment

Few-Shot Object Detection with Fully Cross-Transformer

1 code implementation CVPR 2022 Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Shih-Fu Chang

Inspired by the recent work on vision transformers and vision-language transformers, we propose a novel Fully Cross-Transformer based model (FCT) for FSOD by incorporating cross-transformer into both the feature backbone and detection head.

Few-Shot Object Detection Metric Learning +2

Learning To Recognize Procedural Activities with Distant Supervision

1 code implementation CVPR 2022 Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.

Action Classification Language Modelling +1

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

no code implementations 15 Jan 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.

Question Answering Visual Commonsense Reasoning +2

CLIP-Event: Connecting Text and Images with Event Structures

1 code implementation CVPR 2022 Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, Shih-Fu Chang

Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text.

Contrastive Learning Event Extraction +2

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

2 code implementations 20 Dec 2021 Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji

Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.

Answer Generation Data Augmentation +2

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

no code implementations 16 Dec 2021 Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in the visual scene graph.

Visual Commonsense Reasoning

PreViTS: Contrastive Pretraining with Video Tracking Supervision

no code implementations 1 Dec 2021 Brian Chen, Ramprasaath R. Selvaraju, Shih-Fu Chang, Juan Carlos Niebles, Nikhil Naik

In this work, we propose PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips containing the same object, which helps better utilize temporal transformations of objects.

Action Classification Self-Supervised Learning +1

MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training

no code implementations 29 Sep 2021 Haoxuan You, Luowei Zhou, Bin Xiao, Noel C Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan

Large-scale multimodal contrastive pretraining has demonstrated great utility to support high performance in a range of downstream tasks by mapping multiple modalities into a shared embedding space.

Partner-Assisted Learning for Few-Shot Image Classification

no code implementations ICCV 2021 Jiawei Ma, Hanchen Xie, Guangxing Han, Shih-Fu Chang, Aram Galstyan, Wael Abd-Almageed

In this paper, we focus on the design of training strategy to obtain an elemental representation such that the prototype of each novel class can be estimated from a few labeled samples.

Classification Few-Shot Image Classification +1

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

2 code implementations NeurIPS 2021 Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.

Ranked #3 on Zero-Shot Video Retrieval on YouCook2 (text-to-video Mean Rank metric)

Action Classification Action Recognition In Videos +9
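
The multimodal contrastive losses referred to here are typically of the standard InfoNCE form, e.g., for a video embedding $z_v$ and its paired audio embedding $z_a$ with a negative set $\mathcal{N}$ (a generic statement of the loss; VATT's NCE/MIL-NCE variants differ in how pairs are formed):

$$\mathcal{L}_{\mathrm{NCE}} = -\log \frac{\exp(z_v^\top z_a / \tau)}{\exp(z_v^\top z_a / \tau) + \sum_{z_{a'} \in \mathcal{N}} \exp(z_v^\top z_{a'} / \tau)}$$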

Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment

2 code implementations 15 Apr 2021 Guangxing Han, Shiyuan Huang, Jiawei Ma, Yicheng He, Shih-Fu Chang

To improve the fine-grained few-shot proposal classification, we propose a novel attentive feature alignment method to address the spatial misalignment between the noisy proposals and few-shot classes, thus improving the performance of few-shot object detection.

Few-Shot Learning Few-Shot Object Detection +3

Analogical Reasoning for Visually Grounded Compositional Generalization

no code implementations 1 Jan 2021 Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, Shih-Fu Chang

Children acquire language subconsciously by observing the surrounding world and listening to descriptions.

Language Acquisition

Task-Adaptive Negative Envision for Few-Shot Open-Set Recognition

1 code implementation CVPR 2022 Shiyuan Huang, Jiawei Ma, Guangxing Han, Shih-Fu Chang

In this paper, we instead propose task-adaptive negative class envision for FSOR to integrate threshold tuning into the learning process.

Few-Shot Learning Open Set Learning

Open-Vocabulary Object Detection Using Captions

1 code implementation CVPR 2021 Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, Shih-Fu Chang

Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models.

Object object-detection +2

Uncertainty-Aware Few-Shot Image Classification

no code implementations 9 Oct 2020 Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen, Shih-Fu Chang

In this work, we propose Uncertainty-Aware Few-Shot framework for image classification by modeling uncertainty of the similarities of query-support pairs and performing uncertainty-aware optimization.

Classification Few-Shot Image Classification +3

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

1 code implementation 3 Sep 2020 Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, Shih-Fu Chang

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals.

Referring Expression Vocal Bursts Valence Prediction

Analogical Reasoning for Visually Grounded Language Acquisition

no code implementations 22 Jul 2020 Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, Shih-Fu Chang

Children acquire language subconsciously by observing the surrounding world and listening to descriptions.

Language Acquisition

GAIA: A Fine-grained Multimedia Knowledge Extraction System

no code implementations ACL 2020 Manling Li, Alireza Zareian, Ying Lin, Xiaoman Pan, Spencer Whitehead, Brian Chen, Bo Wu, Heng Ji, Shih-Fu Chang, Clare Voss, Daniel Napierski, Marjorie Freedman

We present the first comprehensive, open source multimedia knowledge extraction system that takes a massive stream of unstructured, heterogeneous multimedia data from various sources and languages as input, and creates a coherent, structured knowledge base, indexing entities, relations, and events, following a rich, fine-grained ontology.

Learning Visual Commonsense for Robust Scene Graph Generation

2 code implementations ECCV 2020 Alireza Zareian, Zhecan Wang, Haoxuan You, Shih-Fu Chang

Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild.

Graph Generation Scene Graph Generation +1

Beyond Triplet Loss: Meta Prototypical N-tuple Loss for Person Re-identification

no code implementations 8 Jun 2020 Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen, Shih-Fu Chang

There is a lack of loss design which enables the joint optimization of multiple instances (of multiple classes) within per-query optimization for person ReID.

Classification General Classification +3

Deep Learning Guided Building Reconstruction from Satellite Imagery-derived Point Clouds

no code implementations 19 May 2020 Bo Xu, Xu Zhang, Zhixin Li, Matt Leotta, Shih-Fu Chang, Jie Shan

For points that belong to the same roof shape, a multi-cue, hierarchical RANSAC approach is proposed for efficiently and reliably segmenting and reconstructing the building point cloud.

3D Reconstruction

Cross-media Structured Common Space for Multimedia Event Extraction

no code implementations ACL 2020 Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, Shih-Fu Chang

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents.

Event Extraction

Training with Streaming Annotation

no code implementations 11 Feb 2020 Tongtao Zhang, Heng Ji, Shih-Fu Chang, Marjorie Freedman

In this paper, we address a practical scenario where training data is released in a sequence of small-scale batches and annotation in earlier phases has lower quality than the later counterparts.

Event Extraction

Weakly Supervised Visual Semantic Parsing

1 code implementation CVPR 2020 Alireza Zareian, Svebor Karaman, Shih-Fu Chang

Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval.

Graph Generation Image Retrieval +5

Bridging Knowledge Graphs to Generate Scene Graphs

1 code implementation ECCV 2020 Alireza Zareian, Svebor Karaman, Shih-Fu Chang

Scene graphs are powerful representations that parse images into their abstract semantic elements, i.e., objects and their interactions, which facilitates visual comprehension and explainable reasoning.

Graph Generation Knowledge Graphs +1

General Partial Label Learning via Dual Bipartite Graph Autoencoder

no code implementations 5 Jan 2020 Brian Chen, Bo Wu, Alireza Zareian, Hanwang Zhang, Shih-Fu Chang

Compared to the traditional Partial Label Learning (PLL) problem, GPLL relaxes the supervision assumption from instance-level -- a label set partially labels an instance -- to group-level: 1) a label set partially labels a group of instances, where the within-group instance-label link annotations are missing, and 2) cross-group links are allowed -- instances in a group may be partially linked to the label set from another group.

Partial Label Learning

Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation

no code implementations ICLR 2020 Jiawei Ma*, Zheng Shou*, Alireza Zareian, Hassan Mansour, Anthony Vetro, Shih-Fu Chang

In order to impute the missing values, state-of-the-art methods are built on Recurrent Neural Networks (RNN), which process each time stamp sequentially, prohibiting the direct modeling of the relationship between distant time stamps.

Imputation Machine Translation +2

Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition

no code implementations 10 Dec 2019 Shiyuan Huang, Xudong Lin, Svebor Karaman, Shih-Fu Chang

Recent works instead use modern compressed video modalities as an alternative to the RGB spatial stream and improve the inference speed by orders of magnitude.

Action Recognition Optical Flow Estimation +3

Cross-lingual Structure Transfer for Relation and Event Extraction

no code implementations IJCNLP 2019 Ananya Subburathinam, Di Lu, Heng Ji, Jonathan May, Shih-Fu Chang, Avirup Sil, Clare Voss

The identification of complex semantic structures such as events and entity relations, already a challenging Information Extraction task, is doubly difficult from sources written in under-resourced and under-annotated languages.

Event Extraction Relation +1

Towards Train-Test Consistency for Semi-supervised Temporal Action Localization

no code implementations 24 Oct 2019 Xudong Lin, Zheng Shou, Shih-Fu Chang

The inconsistent strategy makes it hard to explicitly supervise the action localization model with temporal boundary annotations at training time.

Multiple Instance Learning Video Classification +2

Context-Gated Convolution

1 code implementation ECCV 2020 Xudong Lin, Lin Ma, Wei Liu, Shih-Fu Chang

As such, being aware of the global context, the modulated convolution kernel of our proposed CGC can better extract representative local patterns and compose discriminative features.

Ranked #61 on Image Classification on ObjectNet (using extra training data)

Action Recognition Image Classification +1

Detecting and Simulating Artifacts in GAN Fake Images

1 code implementation 15 Jul 2019 Xu Zhang, Svebor Karaman, Shih-Fu Chang

By using the simulated images to train a spectrum based classifier, even without seeing the fake images produced by the targeted GAN model during training, our approach achieves state-of-the-art performances on detecting fake images generated by popular GAN models such as CycleGAN.

GAN image forensics

Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions

no code implementations 8 Jul 2019 Yulei Niu, Hanwang Zhang, Zhiwu Lu, Shih-Fu Chang

Specifically, our framework exploits the reciprocal relation between the referent and context, i.e., either of them influences estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced.

Multiple Instance Learning Referring Expression

CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation

2 code implementations 23 May 2019 Jiawei Ma, Zheng Shou, Alireza Zareian, Hassan Mansour, Anthony Vetro, Shih-Fu Chang

In order to jointly capture the self-attention across multiple dimensions, including time, location and the sensor measurements, while maintain low computational complexity, we propose a novel approach called Cross-Dimensional Self-Attention (CDSA) to process each dimension sequentially, yet in an order-independent manner.

Imputation Machine Translation +2
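
A sketch of attending along one dimension at a time, assuming a shared `nn.MultiheadAttention` over a `(time, location, measurement, d_model)` tensor (the `d_model`/`n_heads` values are placeholders, and the paper additionally makes the scheme order-independent; this sketch applies the axes in a fixed order):

```python
import torch
import torch.nn as nn

class CrossDimensionalSelfAttention(nn.Module):
    """Run self-attention along each axis in turn, folding the other
    axes into the batch, so attention is quadratic in each axis length
    rather than in the product of all axis lengths."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def attend_along(self, x, axis):
        x = x.movedim(axis, -2)                    # (..., L_axis, d_model)
        lead = x.shape[:-2]
        x = x.reshape(-1, x.shape[-2], x.shape[-1])
        x, _ = self.attn(x, x, x)
        x = x.reshape(*lead, x.shape[-2], x.shape[-1])
        return x.movedim(-2, axis)

    def forward(self, x):                          # x: (T, L, M, d_model)
        for axis in range(x.dim() - 1):            # time, location, measurement
            x = self.attend_along(x, axis)
        return x
```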

Unsupervised Embedding Learning via Invariant and Spreading Instance Feature

1 code implementation CVPR 2019 Mang Ye, Xu Zhang, Pong C. Yuen, Shih-Fu Chang

This paper studies the unsupervised embedding learning problem, which requires an effective similarity measurement between samples in low-dimensional embedding space.

Data Augmentation

Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval

no code implementations 4 Mar 2019 Svebor Karaman, Xudong Lin, Xuefeng Hu, Shih-Fu Chang

We propose an unsupervised hashing method which aims to produce binary codes that preserve the ranking induced by a real-valued representation.

Image Retrieval Re-Ranking +1

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

no code implementations ICCV 2019 Long Chen, Hanwang Zhang, Jun Xiao, Xiangnan He, Shiliang Pu, Shih-Fu Chang

CMAT is a multi-agent policy gradient method that frames objects as cooperative agents, and then directly maximizes a graph-level metric as the reward.

counterfactual Graph Generation +2

Multi-granularity Generator for Temporal Action Proposal

no code implementations CVPR 2019 Yuan Liu, Lin Ma, Yifeng Zhang, Wei Liu, Shih-Fu Chang

In this paper, we propose a multi-granularity generator (MGG) to perform the temporal action proposal from different granularity perspectives, relying on the video visual features equipped with the position embedding information.

Action Recognition Temporal Action Proposal Generation

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

1 code implementation CVPR 2019 Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang

Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content are performed with cosine similarity.

Language Modelling Phrase Grounding +2

Incorporating Background Knowledge into Video Description Generation

no code implementations EMNLP 2018 Spencer Whitehead, Heng Ji, Mohit Bansal, Shih-Fu Chang, Clare Voss

We develop an approach that uses video meta-data to retrieve topically related news documents for a video and extracts the events and named entities from these documents.

Text Generation Video Captioning +1

Heated-Up Softmax Embedding

1 code implementation ICLR 2019 Xu Zhang, Felix Xinnan Yu, Svebor Karaman, Wei Zhang, Shih-Fu Chang

Metric learning aims at learning a distance which is consistent with the semantic meaning of the samples.

Metric Learning

Multimodal Social Media Analysis for Gang Violence Prevention

no code implementations 23 Jul 2018 Philipp Blandfort, Desmond Patton, William R. Frey, Svebor Karaman, Surabhi Bhargava, Fei-Tzin Lee, Siddharth Varia, Chris Kedzie, Michael B. Gaskell, Rossano Schifanella, Kathleen McKeown, Shih-Fu Chang

In this paper we partnered computer scientists with social work researchers, who have domain expertise in gang violence, to analyze how public tweets with images posted by youth who mention gang associations on Twitter can be leveraged to automatically detect psychosocial factors and conditions that could potentially assist social workers and violence outreach workers in prevention and early intervention programs.

General Classification

AutoLoc: Weakly-supervised Temporal Action Localization

1 code implementation 22 Jul 2018 Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, Shih-Fu Chang

In this paper, we first develop a novel weakly-supervised TAL framework called AutoLoc to directly predict the temporal boundary of each action instance.

Weakly-supervised Temporal Action Localization Weakly Supervised Temporal Action Localization

Entity-aware Image Caption Generation

no code implementations EMNLP 2018 Di Lu, Spencer Whitehead, Lifu Huang, Heng Ji, Shih-Fu Chang

Current image captioning approaches generate descriptions which lack specific information, such as named entities that are involved in the images.

Image Captioning

Grounding Referring Expressions in Images by Variational Context

1 code implementation CVPR 2018 Hanwang Zhang, Yulei Niu, Shih-Fu Chang

This is a general yet challenging vision-language task since it requires not only the localization of objects, but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category.

Multiple Instance Learning Referring Expression

Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks

1 code implementation CVPR 2018 Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, Shih-Fu Chang

We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test images and their classes are both unseen during training.

General Classification Zero-Shot Learning

Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation

no code implementations 5 Sep 2017 Yulei Niu, Zhiwu Lu, Ji-Rong Wen, Tao Xiang, Shih-Fu Chang

In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from object, scene to abstract concept; 2) how to annotate an image with the optimal number of class labels.

Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks

3 code implementations ICLR 2018 Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, Shih-Fu Chang

We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph.
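
A minimal sketch of the skip mechanism wrapped around a GRU cell, using a soft update gate for brevity (the paper binarizes the gate with a straight-through estimator and accumulates the update probability across steps, which is where the actual compute savings come from):

```python
import torch
import torch.nn as nn

class SkipGRUCell(nn.Module):
    """Wrap a GRU cell with a learned update gate u_t: when u_t is low,
    the previous state is copied forward instead of being recomputed."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, x_t, h_prev):
        u_t = torch.sigmoid(self.gate(h_prev))  # probability of updating
        h_new = self.cell(x_t, h_prev)
        return u_t * h_new + (1.0 - u_t) * h_prev
```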

More cat than cute? Interpretable Prediction of Adjective-Noun Pairs

1 code implementation 21 Aug 2017 Delia Fernandez, Alejandro Woodward, Victor Campos, Xavier Giro-i-Nieto, Brendan Jou, Shih-Fu Chang

This work aims at disentangling the contributions of the 'adjectives' and 'nouns' in the visual prediction of ANPs.

Learning Spread-out Local Feature Descriptors

2 code implementations ICCV 2017 Xu Zhang, Felix X. Yu, Sanjiv Kumar, Shih-Fu Chang

We propose a simple, yet powerful regularization technique that can be used to significantly improve both the pairwise and triplet losses in learning local feature descriptors.
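
The regularization pushes non-matching descriptor pairs toward the statistics of random points on the unit sphere, whose cosine similarity has mean 0 and second moment 1/d. A sketch over a batch of L2-normalized non-matching pairs (variable names are mine):

```python
import torch

def spread_out_regularizer(desc_a, desc_b):
    """desc_a, desc_b: (batch, d) L2-normalized descriptors forming
    non-matching pairs. Penalize deviation of their similarity
    statistics from those of uniformly random unit vectors."""
    sim = (desc_a * desc_b).sum(dim=1)          # cosine similarity per pair
    d = desc_a.shape[1]
    mean_penalty = sim.mean() ** 2              # push mean similarity to 0
    second_moment_penalty = torch.clamp(sim.pow(2).mean() - 1.0 / d, min=0.0)
    return mean_penalty + second_moment_penalty
```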

ConvNet Architecture Search for Spatiotemporal Feature Learning

1 code implementation 16 Aug 2017 Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri

Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and image captioning.

Action Classification Action Recognition +5

PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN

no code implementations ICCV 2017 Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, Shih-Fu Chang

We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect "subject-predicate-object" relations in an image with object relation groundtruths available only at the image level.

Object object-detection +2

Localizing Actions from Video Labels and Pseudo-Annotations

no code implementations 28 Jul 2017 Pascal Mettes, Cees G. M. Snoek, Shih-Fu Chang

The goal of this paper is to determine the spatio-temporal location of actions in video.

Action Localization

Learning Discriminative and Transformation Covariant Local Feature Detectors

1 code implementation CVPR 2017 Xu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu Chang

Specifically, we extend the covariant constraint proposed by Lenc and Vedaldi by defining the concepts of "standard patch" and "canonical feature" and leverage these to train a novel robust covariant detector.

Image Retrieval

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

no code implementations 14 Jun 2017 Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, Shih-Fu Chang

More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features.

General Classification Video Classification

PatternNet: Visual Pattern Mining with Deep Neural Network

no code implementations 18 Mar 2017 Hongzhi Li, Joseph G. Ellis, Lei Zhang, Shih-Fu Chang

In this paper, we study the problem of visual pattern mining and propose a novel deep neural network architecture called PatternNet for discovering these patterns that are both discriminative and representative.

Image Classification

Model-Driven Feed-Forward Prediction for Manipulation of Deformable Objects

no code implementations 15 Jul 2016 Yinxiao Li, Yan Wang, Yonghao Yue, Danfei Xu, Michael Case, Shih-Fu Chang, Eitan Grinspun, Peter Allen

A fully featured 3D model of the garment is constructed in real-time and volumetric features are then used to obtain the most similar model in the database to predict the object category and pose.

Object Pose Estimation +1

Deep Image Set Hashing

no code implementations 16 Jun 2016 Jie Feng, Svebor Karaman, I-Hong Jhuo, Shih-Fu Chang

Learning-based hashing is often used in large-scale image retrieval, as it provides a compact representation of each sample and allows two samples to be compared efficiently via the Hamming distance.

Image Retrieval Retrieval
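
The efficiency comes from the Hamming distance reducing to an XOR plus a bit count once codes are packed into integers, e.g.:

```python
def hamming_distance(a: int, b: int) -> int:
    # Binary codes packed into Python ints: XOR the codes, count set bits.
    return bin(a ^ b).count("1")
```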

Interactive Segmentation on RGBD Images via Cue Selection

no code implementations CVPR 2016 Jie Feng, Brian Price, Scott Cohen, Shih-Fu Chang

While these methods achieve better results than color-based methods, they are still limited in either using depth as an additional color channel or simply combining depth with color in a linear way.

Image Retrieval Image Segmentation +5

Going Deeper for Multilingual Visual Sentiment Detection

no code implementations 30 May 2016 Brendan Jou, Shih-Fu Chang

In the original MVSO release, adjective-noun pair (ANP) detectors were trained for the six languages using an AlexNet-styled architecture by fine-tuning from DeepSentiBank.

TAG

EventNet Version 1.1 Technical Report

no code implementations 24 May 2016 Dongang Wang, Zheng Shou, Hongyi Liu, Shih-Fu Chang

Finally, EventNet version 1.1 contains 67,641 videos, 500 events, and 5,028 event-specific concepts.

Generic Instance Search and Re-identification from One Example via Attributes and Categories

no code implementations 23 May 2016 Ran Tao, Arnold W. M. Smeulders, Shih-Fu Chang

Searching among instances from the same category as the query, the category-specific attributes outperform existing approaches by a large margin on shoes and cars and perform on par with the state-of-the-art on buildings.

Instance Search Person Re-Identification

Deep Cross Residual Learning for Multitask Visual Recognition

1 code implementation 5 Apr 2016 Brendan Jou, Shih-Fu Chang

We propose a novel extension of residual learning for deep networks that enables intuitive learning across multiple related tasks using cross-connections called cross-residuals.

Object Recognition

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

1 code implementation CVPR 2016 Zheng Shou, Dongang Wang, Shih-Fu Chang

To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance.

Action Classification Classification +3

Event Specific Multimodal Pattern Mining with Image-Caption Pairs

no code implementations 31 Dec 2015 Hongzhi Li, Joseph G. Ellis, Shih-Fu Chang

In this paper we describe a novel framework and algorithms for discovering image patch patterns from a large corpus of weakly supervised image-caption pairs generated from news events.

Descriptive Image Captioning

On Binary Embedding using Circulant Matrices

no code implementations 20 Nov 2015 Felix X. Yu, Aditya Bhaskara, Sanjiv Kumar, Yunchao Gong, Shih-Fu Chang

To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix.
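
The circulant structure is what makes this fast: multiplying by a circulant matrix is a circular convolution, so the projection runs in O(d log d) time with O(d) memory via the FFT. A minimal sketch (the random sign-flipping these methods also apply to the input is omitted):

```python
import numpy as np

def circulant_binary_embedding(x, r):
    """h(x) = sign(C(r) x), where C(r) is the circulant matrix whose
    first column is the random vector r. C(r) x equals the circular
    convolution of r and x, computed here via the FFT."""
    proj = np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real
    return np.sign(proj)
```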

Learning to Hash for Indexing Big Data - A Survey

no code implementations 17 Sep 2015 Jun Wang, Wei Liu, Sanjiv Kumar, Shih-Fu Chang

Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions.

Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology

no code implementations 16 Aug 2015 Brendan Jou, Tao Chen, Nikolaos Pappas, Miriam Redi, Mercan Topkara, Shih-Fu Chang

Our work expressly focuses on the uniqueness of culture and language in relation to human affect, specifically sentiment and emotion semantics, and how they manifest in social multimedia.

Cultural Vocal Bursts Intensity Prediction

EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video

no code implementations 8 Jun 2015 Guangnan Ye, Yitong Li, Hongliang Xu, Dong Liu, Shih-Fu Chang

Extensive experiments over the zero-shot event retrieval task when no training samples are available show that the EventNet concept library consistently and significantly outperforms the state-of-the-art (such as the 20K ImageNet concepts trained with CNN) by a large margin up to 207%.

Event Detection Retrieval

New Insights Into Laplacian Similarity Search

no code implementations CVPR 2015 Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang

Graph-based computer vision applications rely critically on similarity metrics which compute the pairwise similarity between any pair of vertices on graphs.

Image Retrieval Retrieval

Attributes and Categories for Generic Instance Search From One Example

no code implementations CVPR 2015 Ran Tao, Arnold W. M. Smeulders, Shih-Fu Chang

This paper aims for generic instance search from one example where the instance can be an arbitrary 3D object like shoes, not just near-planar and one-sided instances like buildings and logos.

Instance Search

Compact Nonlinear Maps and Circulant Extensions

no code implementations 12 Mar 2015 Felix X. Yu, Sanjiv Kumar, Henry Rowley, Shih-Fu Chang

This leads to much more compact maps without hurting the performance.

Deep Transfer Network: Unsupervised Domain Adaptation

no code implementations 2 Mar 2015 Xu Zhang, Felix Xinnan Yu, Shih-Fu Chang, Shengjin Wang

In this paper, we propose a new domain adaptation framework named Deep Transfer Network (DTN), where the highly flexible deep neural networks are used to implement such a distribution matching process.

Unsupervised Domain Adaptation

Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks

no code implementations 25 Feb 2015 Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, Shih-Fu Chang

In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event.

An exploration of parameter redundancy in deep networks with circulant projections

no code implementations ICCV 2015 Yu Cheng, Felix X. Yu, Rogerio S. Feris, Sanjiv Kumar, Alok Choudhary, Shih-Fu Chang

We explore the redundancy of parameters in deep neural networks by replacing the conventional linear projection in fully-connected layers with the circulant projection.

Discrete Graph Hashing

no code implementations NeurIPS 2014 Wei Liu, Cun Mu, Sanjiv Kumar, Shih-Fu Chang

Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases.

Video Event Detection by Inferring Temporal Instance Labels

no code implementations CVPR 2014 Kuan-Ting Lai, Felix X. Yu, Ming-Syan Chen, Shih-Fu Chang

To solve this problem, we propose a large-margin formulation which treats the instance labels as hidden latent variables, and simultaneously infers the instance labels as well as the instance-level classification model.

Event Detection

Locally Linear Hashing for Extracting Non-Linear Manifolds

no code implementations CVPR 2014 Go Irie, Zhenguo Li, Xiao-Ming Wu, Shih-Fu Chang

Previous efforts in hashing intend to preserve data variance or pairwise affinity, but neither is adequate in capturing the manifold structures hidden in most visual data.

Quantization

Hash-SVM: Scalable Kernel Machines for Large-Scale Visual Classification

no code implementations CVPR 2014 Yadong Mu, Gang Hua, Wei Fan, Shih-Fu Chang

This paper presents a novel algorithm which uses compact hash bits to greatly improve the efficiency of non-linear kernel SVM in very large scale visual classification problems.

Classification General Classification

Circulant Binary Embedding

no code implementations 13 May 2014 Felix X. Yu, Sanjiv Kumar, Yunchao Gong, Shih-Fu Chang

To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix.

Building A Large Concept Bank for Representing Events in Video

no code implementations 29 Mar 2014 Yin Cui, Dong Liu, Jiawei Chen, Shih-Fu Chang

In this paper, we propose to build Concept Bank, the largest concept library consisting of 4,876 concepts specifically designed to cover 631 real-world events.

Event Detection Retrieval

On Learning from Label Proportions

1 code implementation 24 Feb 2014 Felix X. Yu, Krzysztof Choromanski, Sanjiv Kumar, Tony Jebara, Shih-Fu Chang

Learning from Label Proportions (LLP) is a learning setting, where the training data is provided in groups, or "bags", and only the proportion of each class in each bag is known.

Marketing
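
A common surrogate for this setting matches each bag's mean predicted class distribution to its known proportions; a sketch in PyTorch (an illustrative relaxation only; the paper's formulation instead treats instance labels as latent variables):

```python
import torch
import torch.nn.functional as F

def bag_proportion_loss(logits, bag_ids, target_props):
    """logits: (n_instances, n_classes); bag_ids: (n_instances,) long
    tensor of bag indices; target_props: (n_bags, n_classes) known
    label proportions. Match per-bag mean predictions to proportions."""
    probs = F.softmax(logits, dim=1)
    loss = logits.new_zeros(())
    for b, props in enumerate(target_props):
        bag_probs = probs[bag_ids == b].mean(dim=0)
        loss = loss + F.kl_div(bag_probs.clamp_min(1e-8).log(),
                               props, reduction="sum")
    return loss / len(target_props)
```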

Analyzing the Harmonic Structure in Graph-Based Learning

no code implementations NeurIPS 2013 Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang

We show that either explicitly or implicitly, various well-known graph-based models exhibit a common significant \emph{harmonic} structure in its target function -- the value of a vertex is approximately the weighted average of the values of its adjacent neighbors.
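
In symbols, with edge weights $w_{ij}$ and degrees $d_i = \sum_j w_{ij}$, the harmonic property says the target function $f$ approximately satisfies, at each vertex $i$ (notation mine):

$$f(i) \approx \frac{1}{d_i} \sum_{j \sim i} w_{ij}\, f(j)$$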

∝SVM for learning with label proportions

no code implementations 4 Jun 2013 Felix X. Yu, Dong Liu, Sanjiv Kumar, Tony Jebara, Shih-Fu Chang

We study the problem of learning with label proportions in which the training data is provided in groups and only the proportion of each class in each group is known.

Label Propagation from ImageNet to 3D Point Clouds

no code implementations CVPR 2013 Yan Wang, Rongrong Ji, Shih-Fu Chang

Our approach shows further major gains in accuracy when the training data from the target scenes is used, outperforming state-of-the-art approaches with far better efficiency.

Sample-Specific Late Fusion for Visual Category Recognition

no code implementations CVPR 2013 Dong Liu, Kuan-Ting Lai, Guangnan Ye, Ming-Syan Chen, Shih-Fu Chang

However, the existing methods generally use a fixed fusion weight for all the scores of a classifier, and thus fail to optimally determine the fusion weight for the individual samples.

Robust Object Co-detection

no code implementations CVPR 2013 Xin Guo, Dong Liu, Brendan Jou, Mojun Zhu, Anni Cai, Shih-Fu Chang

Object co-detection aims at simultaneous detection of objects of the same category from a pool of related images by exploiting consistent visual patterns present in candidate objects in the images.

Clustering Object +2

Hash Bit Selection: A Unified Solution for Selection Problems in Hashing

no code implementations CVPR 2013 Xianglong Liu, Junfeng He, Bo Lang, Shih-Fu Chang

We represent the bit pool as a vertex- and edge-weighted graph with the candidate bits as vertices.

Designing Category-Level Attributes for Discriminative Visual Recognition

no code implementations CVPR 2013 Felix X. Yu, Liangliang Cao, Rogerio S. Feris, John R. Smith, Shih-Fu Chang

In this paper, we propose a novel formulation to automatically design discriminative "category-level attributes", which can be efficiently encoded by a compact category-attribute matrix.

Attribute Transfer Learning +1

Distributed Low-rank Subspace Segmentation

no code implementations 20 Apr 2013 Ameet Talwalkar, Lester Mackey, Yadong Mu, Shih-Fu Chang, Michael I. Jordan

Vision problems ranging from image clustering to motion segmentation to semi-supervised learning can naturally be framed as subspace segmentation problems, in which one aims to recover multiple low-dimensional subspaces from noisy and corrupted input data.

Clustering Event Detection +4

Learning with Partially Absorbing Random Walks

no code implementations NeurIPS 2012 Xiao-Ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-Fu Chang

We prove that under proper absorption rates, a random walk starting from a set $\mathcal{S}$ of low conductance will be mostly absorbed in $\mathcal{S}$.
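
Concretely, such walks equip each vertex $i$ with an absorption rate $\lambda_i \ge 0$: at $i$, the walk is absorbed with probability proportional to $\lambda_i$ and otherwise moves to a neighbor $j$ in proportion to the edge weight $w_{ij}$ (a sketch of the setup; see the paper for the exact definition):

$$p_{ii} = \frac{\lambda_i}{\lambda_i + d_i}, \qquad p_{ij} = \frac{w_{ij}}{\lambda_i + d_i} \quad (j \neq i)$$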
