Search Results for author: Brais Martinez

Found 43 papers, 13 papers with code

Multi-scale Image Super Resolution with a Single Auto-Regressive Model

no code implementations 5 Jun 2025 Enrique Sanchez, Isma Hadji, Adrian Bulat, Christos Tzelepis, Brais Martinez, Georgios Tzimiropoulos

We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former.

Image Super-Resolution
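
The DPO regularization term described above follows the standard direct-preference-optimization objective. As a rough illustration only (not the paper's exact formulation; the `beta` weight and the sequence log-probability inputs are assumptions), preferring the HR tokenization over the LR one can be sketched as:

```python
import math

def dpo_loss(logp_hr, logp_lr, ref_logp_hr, ref_logp_lr, beta=0.1):
    """Standard DPO objective: reward the model for assigning a higher
    (reference-normalised) log-likelihood to the preferred HR tokenization
    than to the LR one. All arguments are total sequence log-probs."""
    margin = beta * ((logp_hr - ref_logp_hr) - (logp_lr - ref_logp_lr))
    # -log sigmoid(margin): small when the HR tokenization is clearly preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the model's preference margin for the HR tokenization grows, which is the "encourages the transformer to produce the latter over the former" behaviour the abstract refers to.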

Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning

no code implementations CVPR 2025 Mehdi Noroozi, Isma Hadji, Victor Escorcia, Anestis Zaganidis, Brais Martinez, Georgios Tzimiropoulos

To maintain a high visual quality on such low compute budget, we introduce a number of training strategies: (i) A novel conditioning mechanism on the low resolution input, coined bidirectional conditioning, which tailors the SD model for the SR task.

Decoder · Image Super-Resolution

VladVA: Discriminative Fine-tuning of LVLMs

no code implementations CVPR 2025 Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning.

Image-text Retrieval · Representation Learning +1

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

no code implementations CVPR 2025 Haosen Yang, Adrian Bulat, Isma Hadji, Hai X. Pham, Xiatian Zhu, Georgios Tzimiropoulos, Brais Martinez

We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works.

Image Generation

A Bayesian Approach to Data Point Selection

no code implementations 6 Nov 2024 Xinnuo Xu, Minyoung Kim, Royson Lee, Brais Martinez, Timothy Hospedales

Data point selection (DPS) is becoming a critical topic in deep learning due to the ease of acquiring uncurated training data compared to the difficulty of obtaining curated or processed data.

MobileQuant: Mobile-friendly Quantization for On-device Language Models

1 code implementation 25 Aug 2024 Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization.

Quantization
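
As background for the activation-quantization discussion above, this is a minimal sketch of the standard per-tensor asymmetric int8 scheme whose limitations on-device methods like this must contend with; it is an illustration of the generic approach, not MobileQuant's actual method:

```python
def quantize_int8(xs):
    """Per-tensor asymmetric int8 quantization: map the observed
    [min, max] range of an activation tensor onto [-128, 127]."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / 255.0 or 1.0          # avoid zero scale for constant tensors
    zero_point = round(-128 - lo / scale)     # integer offset so that lo -> -128
    q = [max(-128, min(127, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float activations from int8 codes."""
    return [(v - zero_point) * scale for v in q]
```

Round-tripping any value through the pair introduces at most about one `scale` of error, which is exactly the error that outlier-heavy activation distributions blow up.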

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

no code implementations 19 Aug 2024 Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Despite recent successes, Large Vision Language Models (LVLMs) are prone to hallucinating details such as objects and their properties or relations, limiting their real-world deployment.

Hallucination · zero-shot-classification +1

You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation

no code implementations 30 Jan 2024 Mehdi Noroozi, Isma Hadji, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

We show that the combination of spatially distilled U-Net and fine-tuned decoder outperforms state-of-the-art methods requiring 200 steps with only a single step.

Decoder · Image Super-Resolution

Graph Guided Question Answer Generation for Procedural Question-Answering

no code implementations 24 Jan 2024 Hai X. Pham, Isma Hadji, Xinnuo Xu, Ziedune Degutyte, Jay Rainey, Evangelos Kazakos, Afsaneh Fazly, Georgios Tzimiropoulos, Brais Martinez

The key technological enabler is a novel mechanism for automatic question-answer generation from procedural text which can ingest large amounts of textual instructions and produce exhaustive in-domain QA training data.

Answer Generation · Question-Answer-Generation +1

Black Box Few-Shot Adaptation for Vision-Language models

1 code implementation ICCV 2023 Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners.

Contrastive Learning · Prompt Learning +1

ReGen: A good Generative Zero-Shot Video Classifier Should be Rewarded

no code implementations ICCV 2023 Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

Specifically, we propose ReGen, a novel reinforcement learning based framework with a three-fold objective and reward functions: (1) a class-level discrimination reward that enforces the generated caption to be correctly classified into the corresponding action class, (2) a CLIP reward that encourages the generated caption to continue to be descriptive of the input video (i.e. video-specific), and (3) a grammar reward that preserves the grammatical correctness of the caption.

Action Classification · Action Recognition +4

FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training

no code implementations ICCV 2023 Adrian Bulat, Ricardo Guerrero, Brais Martinez, Georgios Tzimiropoulos

Importantly, we show that our system is not only more flexible than existing methods, but also makes a step towards satisfying desideratum (c).

Few-Shot Object Detection · object-detection +1

REST: REtrieve & Self-Train for generative action recognition

no code implementations 29 Sep 2022 Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

We evaluate REST on the problem of zero-shot action recognition where we show that our approach is very competitive when compared to contrastive learning-based methods.

Action Recognition · Caption Generation +5

Efficient Attention-free Video Shift Transformers

no code implementations 23 Aug 2022 Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

To address this gap, in this paper, we make the following contributions: (a) we construct a highly efficient & accurate attention-free block based on the shift operator, coined Affine-Shift block, specifically designed to approximate as closely as possible the operations in the MHSA block of a Transformer layer.

Action Recognition · Video Recognition
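
The snippet above builds on the shift operator as an attention-free way to mix information across frames. The Affine-Shift block itself is not detailed here; a minimal sketch of the underlying temporal channel shift it approximates MHSA with (TSM-style, with the shifted channel fraction as an assumed parameter) is:

```python
def shift_channels(frames, fraction=0.25):
    """Attention-free token mixing via the shift operator: a fraction of
    channels is shifted one step forward in time, an equal fraction one
    step backward, and the rest stay put. `frames` is a list over time
    of per-frame channel lists; out-of-range positions are zero-padded."""
    t, c = len(frames), len(frames[0])
    k = int(c * fraction)
    out = [list(f) for f in frames]
    for i in range(t):
        for ch in range(k):             # take these channels from the previous frame
            out[i][ch] = frames[i - 1][ch] if i > 0 else 0.0
        for ch in range(k, 2 * k):      # take these channels from the next frame
            out[i][ch] = frames[i + 1][ch] if i < t - 1 else 0.0
    return out
```

Because the shift is pure data movement (no multiplications and no pairwise token comparisons), it side-steps the quadratic cost of self-attention entirely, which is the efficiency argument the abstract makes.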

iBoot: Image-bootstrapped Self-Supervised Video Representation Learning

no code implementations 16 Jun 2022 Fatemeh Saleh, Fuwen Tan, Adrian Bulat, Georgios Tzimiropoulos, Brais Martinez

Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the amount of spurious patterns the optimizer has to sieve through is multiplied several fold.

Data Augmentation · Representation Learning +1

Knowledge Distillation Meets Open-Set Semi-Supervised Learning

1 code implementation 13 May 2022 Jing Yang, Xiatian Zhu, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

The key idea is that we leverage the teacher's classifier as a semantic critic for evaluating the representations of both teacher and student and distilling the semantic knowledge with high-order structured information over all feature dimensions.

Face Recognition · Knowledge Distillation

EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

2 code implementations 6 May 2022 Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, Brais Martinez

In this work, pushing further along this under-studied direction, we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency.

SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition

no code implementations 10 Apr 2022 Victor Escorcia, Ricardo Guerrero, Xiatian Zhu, Brais Martinez

To overcome both limitations, we introduce Self-Supervised Learning Over Sets (SOS), an approach to pre-train a generic Objects In Contact (OIC) representation model from video object regions detected by an off-the-shelf hand-object contact detector.

Action Recognition · Object +2

Space-time Mixing Attention for Video Transformer

1 code implementation NeurIPS 2021 Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos

In this work, we propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model.

Action Classification · Action Recognition In Videos +1

Few-shot Action Recognition with Prototype-centered Attentive Learning

1 code implementation 20 Jan 2021 Xiatian Zhu, Antoine Toisoul, Juan-Manuel Perez-Rua, Li Zhang, Brais Martinez, Tao Xiang

Extensive experiments on four standard few-shot action benchmarks show that our method clearly outperforms previous state-of-the-art methods, with the improvement particularly significant (10+%) on the most challenging fine-grained action recognition benchmark.

Contrastive Learning · Few-Shot action recognition +3

Knowledge distillation via softmax regression representation learning

no code implementations ICLR 2021 Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

We advocate for a method that optimizes the output feature of the penultimate layer of the student network and hence is directly related to representation learning.

Knowledge Distillation · Model Compression +2
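
To make the idea above concrete: the student's penultimate-layer feature is optimized so that it behaves like the teacher's feature when decoded by the teacher's own frozen classifier. The following is a minimal sketch only; the plain linear classifier and the KL-divergence penalty are assumed choices for illustration, not necessarily the paper's exact loss:

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of logits."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def sr_distill_loss(teacher_classifier, feat_student, feat_teacher):
    """Pass BOTH penultimate features through the teacher's frozen linear
    classifier (rows of weights), then penalise the KL divergence between
    the two resulting class distributions."""
    logits_s = [sum(w * f for w, f in zip(row, feat_student)) for row in teacher_classifier]
    logits_t = [sum(w * f for w, f in zip(row, feat_teacher)) for row in teacher_classifier]
    p_t, p_s = softmax(logits_t), softmax(logits_s)
    return sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
```

The loss vanishes exactly when the student's feature is indistinguishable from the teacher's through the teacher's classifier, which ties the objective directly to representation learning rather than to logits alone.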

Towards Practical Lipreading with Distilled and Efficient Models

1 code implementation 13 Jul 2020 Pingchuan Ma, Brais Martinez, Stavros Petridis, Maja Pantic

However, our most promising lightweight models are on par with the current state-of-the-art while showing a reduction of 8.2x and 3.9x in terms of computational cost and number of parameters, respectively, which we hope will enable the deployment of lipreading models in practical applications.

Knowledge Distillation · Lipreading

Egocentric Action Recognition by Video Attention and Temporal Context

no code implementations 3 Jul 2020 Juan-Manuel Perez-Rua, Antoine Toisoul, Brais Martinez, Victor Escorcia, Li Zhang, Xiatian Zhu, Tao Xiang

In this challenge, action recognition is posed as the problem of simultaneously predicting a single 'verb' and 'noun' class label given an input trimmed video clip.

Action Recognition

Knowledge distillation via adaptive instance normalization

no code implementations 9 Mar 2020 Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

To this end, we propose a new knowledge distillation method based on transferring feature statistics, specifically the channel-wise mean and variance, from the teacher to the student.

Knowledge Distillation · Model Compression
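
The transferred statistics named above are just the per-channel mean and variance of the feature maps. A minimal sketch of the resulting matching loss (the paper's exact weighting, normalization, and use of adaptive instance normalization may differ):

```python
def stat_distill_loss(teacher_feats, student_feats):
    """Feature-statistics distillation: penalise the squared gap between
    the per-channel mean and variance of teacher and student features.
    Each argument is a list of channels, each channel a flat list of
    activations for that channel."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, v

    loss = 0.0
    for t_ch, s_ch in zip(teacher_feats, student_feats):
        mt, vt = mean_var(t_ch)
        ms, vs = mean_var(s_ch)
        loss += (mt - ms) ** 2 + (vt - vs) ** 2
    return loss
```

Matching only two scalars per channel is what makes this transfer cheap and architecture-agnostic compared with matching full feature maps.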

Lipreading using Temporal Convolutional Networks

2 code implementations 23 Jan 2020 Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic

We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.

Lipreading · Lip Reading

Action recognition with spatial-temporal discriminative filter banks

no code implementations ICCV 2019 Brais Martinez, Davide Modolo, Yuanjun Xiong, Joseph Tighe

In this work we aim to improve the representation capacity of the network, but rather than altering the backbone, we focus on improving the last layers of the network, where changes have a low impact in terms of computational cost.

Ranked #37 on Action Recognition on Something-Something V1 (using extra training data)

Action Classification · Action Recognition +1

Fusing Deep Learned and Hand-Crafted Features of Appearance, Shape, and Dynamics for Automatic Pain Estimation

no code implementations 17 Jan 2017 Joy Egede, Michel Valstar, Brais Martinez

Automatic continuous time, continuous value assessment of a patient's pain from face video is highly sought after by the medical profession.

Deep Learning · Time Series +1

A Functional Regression approach to Facial Landmark Tracking

no code implementations 7 Dec 2016 Enrique Sánchez-Lozano, Georgios Tzimiropoulos, Brais Martinez, Fernando de la Torre, Michel Valstar

This paper presents a Functional Regression solution to the least squares problem, which we coin Continuous Regression, resulting in the first real-time incremental face tracker.

Face Detection · Incremental Learning +2

Cascaded Continuous Regression for Real-time Incremental Face Tracking

no code implementations 3 Aug 2016 Enrique Sánchez-Lozano, Brais Martinez, Georgios Tzimiropoulos, Michel Valstar

We then derive the incremental learning updates for CCR (iCCR) and show that it is an order of magnitude faster than standard incremental learning for cascaded regression, bringing the time required for the update from seconds down to a fraction of a second, thus enabling real-time tracking.

Face Alignment · Incremental Learning +2
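
The speed argument above, an incremental update instead of refitting from scratch, can be illustrated on a far simpler model than cascaded continuous regression. This is a toy 1-D least-squares fitter that keeps running sufficient statistics so each new point costs O(1); it is purely illustrative, and iCCR's actual update operates on the cascaded regression matrices, not on scalars:

```python
class IncrementalLeastSquares:
    """Fit y = slope * x + intercept incrementally. Adding a point only
    updates five running sums, whereas a batch refit over n points is O(n)."""

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        """Absorb one new training point in constant time."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coeffs(self):
        """Closed-form least-squares solution from the running sums."""
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept
```

The same principle, carrying forward sufficient statistics so that new tracking data updates the regressor without revisiting old data, is what brings the update time from seconds down to a fraction of a second in iCCR.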

Learning to Transfer: Transferring Latent Task Structures and Its Application to Person-Specific Facial Action Unit Detection

no code implementations ICCV 2015 Timur Almaev, Brais Martinez, Michel Valstar

We thus consider a novel problem: all AU models for the target subject are to be learnt using person-specific annotated data for a reference AU (AU12 in our case), and no data or little data regarding the target AU.

Action Unit Detection · Facial Action Unit Detection +1

TRIC-track: Tracking by Regression With Incrementally Learned Cascades

no code implementations ICCV 2015 Xiaomeng Wang, Michel Valstar, Brais Martinez, Muhammad Haris Khan, Tony Pridmore

This paper proposes a novel approach to part-based tracking by replacing local matching of an appearance model by direct prediction of the displacement between local image patches and part locations.

Incremental Learning · regression
