Search Results for author: Davide Modolo

Found 30 papers, 6 papers with code

Self-Supervised Multi-Object Tracking with Path Consistency

1 code implementation 8 Apr 2024 Zijia Lu, Bing Shuai, Yanbei Chen, Zhenlin Xu, Davide Modolo

In this paper, we propose a novel concept of path consistency to learn robust object matching without using manual object identity supervision.

Multi-Object Tracking Object

Hyperbolic Learning with Synthetic Captions for Open-World Detection

no code implementations 7 Apr 2024 Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts.

Hallucination Novel Concepts +3

Early Action Recognition with Action Prototypes

no code implementations 11 Dec 2023 Guglielmo Camporese, Alessandro Bergamo, Xunyu Lin, Joseph Tighe, Davide Modolo

For example, on early recognition observing only the first 10% of each video, our method improves the SOTA by +2.23 Top-1 accuracy on Something-Something-v2, +3.55 on UCF-101, +3.68 on SSsub21, and +5.03 on EPIC-Kitchens-55, where prior work used either multi-modal inputs (e.g., optical flow) or batched inference.

Action Recognition Optical Flow Estimation

SemiGPC: Distribution-Aware Label Refinement for Imbalanced Semi-Supervised Learning Using Gaussian Processes

no code implementations 3 Nov 2023 Abdelhak Lemkhenter, Manchen Wang, Luca Zancato, Gurumurthy Swaminathan, Paolo Favaro, Davide Modolo

We show that SemiGPC improves performance when paired with different Semi-Supervised methods such as FixMatch, ReMixMatch, SimMatch and FreeMatch and different pre-training strategies including MSN and Dino.

Gaussian Processes

Denoising and Selecting Pseudo-Heatmaps for Semi-Supervised Human Pose Estimation

no code implementations 29 Sep 2023 Zhuoran Yu, Manchen Wang, Yanbei Chen, Paolo Favaro, Davide Modolo

First, we introduce a denoising scheme to generate reliable pseudo-heatmaps as targets for learning from unlabeled data.

Denoising Pose Estimation +1

SkeleTR: Towards Skeleton-based Action Recognition in the Wild

no code implementations 20 Sep 2023 Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo

It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.

Action Classification Action Recognition +2

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

no code implementations 28 Jun 2023 Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo

This paper introduces innovative benchmarks to evaluate Vision-Language Models (VLMs) in real-world zero-shot recognition tasks, focusing on the granularity and specificity of prompting text.

Benchmarking Specificity +1

ScaleDet: A Scalable Multi-Dataset Object Detector

no code implementations CVPR 2023 Yanbei Chen, Manchen Wang, Abhay Mittal, Zhenlin Xu, Paolo Favaro, Joseph Tighe, Davide Modolo

Our results show that ScaleDet achieves compelling strong model performance with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the same backbone.

Ranked #1 on Object Detection on OpenImages-v6 (using extra training data)

Object object-detection +1

Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts

1 code implementation 11 May 2023 Zhaoyang Zhang, Yantao Shen, Kunyu Shi, Zhaowei Cai, Jun Fang, Siqi Deng, Hao Yang, Davide Modolo, Zhuowen Tu, Stefano Soatto

We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks which may interfere with each other, resulting in a single model which we named Musketeer.

Language Modelling

SkeleTR: Towards Skeleton-based Action Recognition in the Wild

no code implementations ICCV 2023 Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo

It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in the wild.

Action Classification Action Recognition +3

Semi-supervised Vision Transformers at Scale

1 code implementation 11 Aug 2022 Zhaowei Cai, Avinash Ravichandran, Paolo Favaro, Manchen Wang, Davide Modolo, Rahul Bhotika, Zhuowen Tu, Stefano Soatto

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks.

Inductive Bias Semi-Supervised Image Classification

What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions

no code implementations CVPR 2022 A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe, Davide Modolo

We propose a novel one-stage Transformer-based semantic and spatial refined transformer (SSRT) to solve the Human-Object Interaction detection task, which requires localizing humans and objects and predicting their interactions.

Decoder Human-Object Interaction Detection +1

Transfer of Representations to Video Label Propagation: Implementation Factors Matter

no code implementations 10 Mar 2022 Daniel McKee, Zitong Zhan, Bing Shuai, Davide Modolo, Joseph Tighe, Svetlana Lazebnik

This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency.


Multi-Object Tracking with Hallucinated and Unlabeled Videos

no code implementations 19 Aug 2021 Daniel McKee, Bing Shuai, Andrew Berneshawi, Manchen Wang, Davide Modolo, Svetlana Lazebnik, Joseph Tighe

Next, to tackle harder tracking cases, we mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data.

Multi-Object Tracking Object

MaCLR: Motion-aware Contrastive Learning of Representations for Videos

1 code implementation 17 Jun 2021 Fanyi Xiao, Joseph Tighe, Davide Modolo

We present MaCLR, a novel method to explicitly perform cross-modal self-supervised video representation learning from visual and motion modalities.

Action Detection Action Recognition +2

Selective Feature Compression for Efficient Activity Recognition Inference

no code implementations ICCV 2021 Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe

In this work, we focus on improving the inference efficiency of current action recognition backbones on trimmed videos, and illustrate that one action model can also cover the informative region by dropping non-informative features.

Action Recognition Feature Compression

Multi-Object Tracking with Siamese Track-RCNN

no code implementations 16 Apr 2020 Bing Shuai, Andrew G. Berneshawi, Davide Modolo, Joseph Tighe

Multi-object tracking systems often consist of a combination of a detector, a short term linker, a re-identification feature extractor and a solver that takes the output from these separate components and makes a final prediction.

Multi-Object Tracking Object

Understanding the impact of mistakes on background regions in crowd counting

no code implementations 30 Mar 2020 Davide Modolo, Bing Shuai, Rahul Rama Varior, Joseph Tighe

Our results show that (i) mistakes on background are substantial and they are responsible for 18-49% of the total error, (ii) models do not generalize well to different kinds of backgrounds and perform poorly on completely background images, and (iii) models make many more mistakes than those captured by the standard Mean Absolute Error (MAE) metric, as counting on background compensates considerably for misses on foreground.

Crowd Counting

Combining detection and tracking for human pose estimation in videos

no code implementations CVPR 2020 Manchen Wang, Joseph Tighe, Davide Modolo

Our approach consists of three components: (i) a Clip Tracking Network that performs body joint detection and tracking simultaneously on small video clips; (ii) a Video Tracking Pipeline that merges the fixed-length tracklets produced by the Clip Tracking Network to arbitrary length tracks; and (iii) a Spatial-Temporal Merging procedure that refines the joint locations based on spatial and temporal smoothing terms.

Pose Estimation Pose Tracking

Action recognition with spatial-temporal discriminative filter banks

no code implementations ICCV 2019 Brais Martinez, Davide Modolo, Yuanjun Xiong, Joseph Tighe

In this work we focus on how to improve the representation capacity of the network, but rather than altering the backbone, we focus on improving the last layers of the network, where changes have low impact in terms of computational cost.

Ranked #36 on Action Recognition on Something-Something V1 (using extra training data)

Action Classification Action Recognition +1

Multi-Scale Attention Network for Crowd Counting

no code implementations 17 Jan 2019 Rahul Rama Varior, Bing Shuai, Joseph Tighe, Davide Modolo

In crowd counting datasets, people appear at different scales, depending on their distance from the camera.

Crowd Counting

Objects as context for detecting their semantic parts

no code implementations CVPR 2018 Abel Gonzalez-Garcia, Davide Modolo, Vittorio Ferrari

We present a semantic part detection approach that effectively leverages object information. We use the object appearance and its class as indicators of what parts to expect.

Object Semantic Part Detection

Learning Semantic Part-Based Models from Google Images

no code implementations 11 Sep 2016 Davide Modolo, Vittorio Ferrari

We evaluate our models on the challenging PASCAL-Part dataset [1] and show how their performance increases at every step of the learning, with the final models more than doubling the performance of directly training from images retrieved by querying for part names (from 12.9 to 27.2 AP).

Object object-detection +1

Do semantic parts emerge in Convolutional Neural Networks?

no code implementations 13 Jul 2016 Abel Gonzalez-Garcia, Davide Modolo, Vittorio Ferrari

We also investigate the other direction: we determine which semantic parts are the most discriminative and whether they correspond to those parts emerging in the network.

Object Recognition

Context Forest for efficient object detection with large mixture models

no code implementations 3 Mar 2015 Davide Modolo, Alexander Vezhnevets, Vittorio Ferrari

We present Context Forest (ConF), a technique for predicting properties of the objects in an image based on its global appearance.

object-detection Object Detection

Joint calibration of Ensemble of Exemplar SVMs

no code implementations CVPR 2015 Davide Modolo, Alexander Vezhnevets, Olga Russakovsky, Vittorio Ferrari

We formulate joint calibration as a constrained optimization problem and devise an efficient optimization algorithm to find its global optimum.

object-detection Object Detection
