Search Results for author: Rao Muhammad Anwer

Found 59 papers, 48 papers with code

Tracking Meets Large Multimodal Models for Driving Scenario Understanding

1 code implementation 18 Mar 2025 Ayesha Ishaq, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer

We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios.

Autonomous Driving

CLIMB-3D: Continual Learning for Imbalanced 3D Instance Segmentation

1 code implementation 24 Feb 2025 Vishal Thengane, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Lu Yin, Xiatian Zhu, Salman Khan

Unlike prior methods, our framework minimizes ER usage, with KD preventing forgetting and supporting the IC module in compiling past class statistics to balance learning of rare classes during incremental updates.

3D Instance Segmentation Continual Learning +5

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

1 code implementation 20 Feb 2025 Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery.

AIN: The Arabic INclusive Large Multimodal Model

1 code implementation 31 Jan 2025 Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan

While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding.

document understanding model +1

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

1 code implementation 10 Jan 2025 Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan

The benchmark presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over 4k reasoning steps in total, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable visual reasoning across multiple steps.

4k Visual Reasoning

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

1 code implementation 25 Nov 2024 Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Khan

In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.

All Long Question Answer +3

AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning

1 code implementation 10 Oct 2024 Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, Rao Muhammad Anwer

In this work, we propose an approach to construct instruction-tuning data that harnesses vision-only data for the agriculture domain.

Language Modelling

DB-SAM: Delving into High Quality Universal Medical Image Segmentation

1 code implementation 5 Oct 2024 Chao Qin, Jiale Cao, Huazhu Fu, Fahad Shahbaz Khan, Rao Muhammad Anwer

On 21 3D medical image segmentation tasks, our proposed DB-SAM achieves an absolute gain of 8.8% compared to a recent medical SAM adapter in the literature.

Image Segmentation Medical Image Segmentation +2

AgriCLIP: Adapting CLIP for Agriculture and Livestock via Domain-Specialized Cross-Model Alignment

1 code implementation 2 Oct 2024 Umair Nawaz, Muhammad Awais, Hanan Gani, Muzammal Naseer, Fahad Khan, Salman Khan, Rao Muhammad Anwer

Further, this domain requires fine-grained feature learning due to the subtle nature of the downstream tasks (e.g., nutrient deficiency detection, livestock breed classification).

Self-Supervised Learning Zero-Shot Learning

Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking

1 code implementation 2 Oct 2024 Ayesha Ishaq, Mohamed El Amine Boudjoghra, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer

To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories.

3D Multi-Object Tracking Autonomous Driving +1

CDChat: A Large Multimodal Model for Remote Sensing Change Description

1 code implementation 24 Sep 2024 Mubashir Noman, Noor Ahsan, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

In order to achieve this, we introduce a change description instruction dataset that can be utilized to finetune an LMM and provide better change descriptions for RS images.

BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning

1 code implementation 14 Aug 2024 Asif Hanif, Fahad Shamshad, Muhammad Awais, Muzammal Naseer, Fahad Shahbaz Khan, Karthik Nandakumar, Salman Khan, Rao Muhammad Anwer

Inspired by the latest developments in learnable prompts, this work introduces a method to embed a backdoor into the medical foundation model during the prompt learning phase.

Backdoor Attack

BOrg: A Brain Organoid-Based Mitosis Dataset for Automatic Analysis of Brain Diseases

1 code implementation 27 Jun 2024 Muhammad Awais, Mehaboobathunnisa Sahul Hameed, Bidisha Bhattacharya, Orly Reiner, Rao Muhammad Anwer

Quantifying cellular processes like mitosis in these organoids offers insights into neurodevelopmental disorders, but the manual analysis is time-consuming, and existing datasets lack specific details for brain organoid studies.

Object Detection

Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

1 code implementation 6 Jun 2024 Amandeep Kumar, Muhammad Awais, Sanath Narayan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

The LAE harnesses a pre-trained vision-language model to find text-guided attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN.

Attribute Language Modelling

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

1 code implementation 4 Jun 2024 Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation.

3D Instance Segmentation 3D Open-Vocabulary Instance Segmentation +4

Composed Video Retrieval via Enriched Context and Discriminative Embeddings

1 code implementation CVPR 2024 Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, Fahad Shahbaz Khan

Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases.

Composed Video Retrieval (CoVR) Retrieval

Semi-supervised Open-World Object Detection

1 code implementation 25 Feb 2024 Sahal Shaji Mullappilly, Abhishek Singh Gehlot, Rao Muhammad Anwer, Fahad Shahbaz Khan, Hisham Cholakkal

We demonstrate the effectiveness of our SS-OWOD problem setting and approach for remote sensing object detection, proposing carefully curated splits and baseline performance evaluations.

Incremental Learning Object +2

BiMediX: Bilingual Medical Mixture of Experts LLM

1 code implementation 20 Feb 2024 Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic.

Multiple-choice Open-Ended Question Answering

Arabic Mini-ClimateGPT: A Climate Change and Sustainability Tailored Arabic LLM

1 code implementation 14 Dec 2023 Sahal Shaji Mullappilly, Abdelrahman Shaker, Omkar Thawakar, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

To this end, we propose a lightweight Arabic Mini-ClimateGPT that is built on an open-source LLM and is specifically fine-tuned on Clima500-Instruct, a curated conversational-style Arabic instruction-tuning dataset with over 500k instructions about climate change and sustainability.

SA2-Net: Scale-aware Attention Network for Microscopic Image Segmentation

1 code implementation 28 Sep 2023 Mustansar Fiaz, Moein Heidari, Rao Muhammad Anwer, Hisham Cholakkal

Specifically, we propose a scale-aware attention (SA2) module designed to capture inherent variations in scales and shapes of microscopic regions, such as cells, for accurate segmentation.

Image Segmentation Semantic Segmentation

3D Indoor Instance Segmentation in an Open-World

1 code implementation NeurIPS 2023 Mohamed El Amine Boudjoghra, Salwa K. Al Khatib, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

We argue that such a closed-world assumption is restrictive and explore for the first time 3D indoor instance segmentation in an open-world setting, where the model is allowed to distinguish a set of known classes, identify an unknown object as unknown, and later incrementally learn the semantic category of the unknown when the corresponding category labels are available.

3D Instance Segmentation Segmentation +1

A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos

1 code implementation 9 Sep 2023 Chao Qin, Jiale Cao, Huazhu Fu, Rao Muhammad Anwer, Fahad Shahbaz Khan

Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation.

Decoder Lesion Detection

Foundational Models Defining a New Era in Vision: A Survey and Outlook

1 code implementation 25 Jul 2023 Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan

Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.

Benchmarking

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

1 code implementation 13 Jun 2023 Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Fahad Shahbaz Khan

The latest breakthroughs in large vision-language models, such as Bard and GPT-4, have showcased extraordinary abilities in performing a wide range of tasks.

Language Modelling +1

DFormer: Diffusion-guided Transformer for Universal Image Segmentation

1 code implementation 6 Jun 2023 Hefeng Wang, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang

Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val2017 set.

Decoder Denoising +4

Modulate Your Spectrum in Self-Supervised Learning

1 code implementation 26 May 2023 Xi Weng, Yunhao Ni, Tengwei Song, Jie Luo, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan, Lei Huang

In this work, we introduce Spectral Transformation (ST), a framework to modulate the spectrum of the embedding and to seek functions beyond whitening that can avoid dimensional collapse.

Object Detection +1

Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection

1 code implementation CVPR 2023 Long Li, Junwei Han, Ni Zhang, Nian Liu, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan

Then, we use two types of pre-defined tokens to mine co-saliency and background information via our proposed contrast-induced pixel-to-token correlation and co-saliency token-to-token correlation modules.

Computational Efficiency Co-Salient Object Detection +3

Remote Sensing Change Detection With Transformers Trained from Scratch

1 code implementation 13 Apr 2023 Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

Current transformer-based change detection (CD) approaches either employ a pre-trained model trained on large-scale image classification ImageNet dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark.

Change Detection Image Classification

Cross-modulated Few-shot Image Generation for Colorectal Tissue Classification

1 code implementation 4 Apr 2023 Amandeep Kumar, Ankan Kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan

In this work, we propose a few-shot colorectal tissue image generation method for addressing the scarcity of histopathological training data for rare cancer tissues.

Data Augmentation Image Classification +1

Video Instance Segmentation in an Open-World

1 code implementation 3 Apr 2023 Omkar Thawakar, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan

The open-world formulation relaxes the closed-world static-learning assumption as follows: (a) first, it distinguishes a set of known categories and labels an unknown object as 'unknown', and then (b) it incrementally learns the class of an unknown as and when the corresponding semantic labels become available.

Instance Segmentation Semantic Segmentation +1

LEAPS: End-to-End One-Step Person Search With Learnable Proposals

no code implementations 21 Mar 2023 Zhiqiang Dong, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Khan, Yanwei Pang

Given a set of sparse and learnable proposals, LEAPS employs a dynamic person search head to directly perform person detection and corresponding re-id feature generation without non-maximum suppression post-processing.

Human Detection Person Search

3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers

1 code implementation 21 Mar 2023 Omkar Thawakar, Rao Muhammad Anwer, Jorma Laaksonen, Orly Reiner, Mubarak Shah, Fahad Shahbaz Khan

Accurate 3D mitochondria instance segmentation in electron microscopy (EM) is a challenging problem and serves as a prerequisite to empirically analyze their distributions and morphology.

Decoder Instance Segmentation +1

Person Image Synthesis via Denoising Diffusion Model

1 code implementation CVPR 2023 Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan

In this work, we show how denoising diffusion models can be applied for high-fidelity person image synthesis with strong sample diversity and enhanced mode coverage of the learnt data distribution.

Denoising Diversity +2

CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection

no code implementations 13 Sep 2022 Dhanalaxmi Gaddam, Jean Lahoud, Fahad Shahbaz Khan, Rao Muhammad Anwer, Hisham Cholakkal

In this work, we propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework, which takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene at multiple levels to predict a set of object bounding boxes along with their corresponding semantic labels.

3D Object Detection Object +2

Transformers in Remote Sensing: A Survey

no code implementations 2 Sep 2022 Abdulaziz Amer Aleissaee, Amandeep Kumar, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal, Gui-Song Xia, Fahad Shahbaz Khan

Deep learning-based algorithms have seen massive popularity in different areas of remote sensing image analysis over the past decade.

Survey

Multi-scale Feature Aggregation for Crowd Counting

no code implementations 10 Aug 2022 Xiaoheng Jiang, Xinyi Wu, Hisham Cholakkal, Rao Muhammad Anwer, Jiale Cao, Mingliang Xu, Bing Zhou, Yanwei Pang, Fahad Shahbaz Khan

The SkipAgg module directly propagates features with small receptive fields to features with much larger receptive fields.

Crowd Counting

3D Vision with Transformers: A Survey

1 code implementation 8 Aug 2022 Jean Lahoud, Jiale Cao, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang

The success of the transformer architecture in natural language processing has recently triggered attention in the computer vision field.

Pose Estimation Survey

EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

8 code implementations 21 Jun 2022 Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muhammad Anwer, Fahad Shahbaz Khan

Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% with 28% reduction in FLOPs.

Image Classification Object Detection +1

PSTR: End-to-End One-Step Person Search With Transformers

1 code implementation CVPR 2022 Jiale Cao, Yanwei Pang, Rao Muhammad Anwer, Hisham Cholakkal, Jin Xie, Mubarak Shah, Fahad Shahbaz Khan

We propose a novel one-step transformer-based person search framework, PSTR, that jointly performs person detection and re-identification (re-id) in a single architecture.

Decoder Human Detection +1

Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer

1 code implementation 24 Mar 2022 Omkar Thawakar, Sanath Narayan, Jiale Cao, Hisham Cholakkal, Rao Muhammad Anwer, Muhammad Haris Khan, Salman Khan, Michael Felsberg, Fahad Shahbaz Khan

When using the ResNet50 backbone, our MS-STS achieves a mask AP of 50.1%, outperforming the best reported results in literature by 2.7% and by 4.8% at higher overlap threshold of AP_75, while being comparable in model size and speed on YouTube-VIS 2019 val.

Instance Segmentation Semantic Segmentation +2

DoodleFormer: Creative Sketch Drawing with Transformers

no code implementations 6 Dec 2021 Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, Jorma Laaksonen, Michael Felsberg

Creative sketch image generation is a challenging vision problem, where the task is to generate diverse, yet realistic creative sketches possessing the unseen composition of the visual-world objects.

Decoder Image Generation

Structured Latent Embeddings for Recognizing Unseen Classes in Unseen Domains

no code implementations 12 Jul 2021 Shivam Chandhok, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Vineeth N Balasubramanian, Fahad Shahbaz Khan, Ling Shao

The need to address the scarcity of task-specific annotated data has resulted in concerted efforts in recent years for specific settings such as zero-shot learning (ZSL) and domain generalization (DG), to separately address the issues of semantic shift and domain shift, respectively.

Domain Generalization Zero-Shot Learning +1

Handwriting Transformers

1 code implementation ICCV 2021 Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, Mubarak Shah

We propose a novel transformer-based styled handwritten text image generation approach, HWT, that strives to learn both style-content entanglement as well as global and local writing style patterns.

Decoder Image Generation +1

SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation

1 code implementation ECCV 2020 Jiale Cao, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao

In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp.

Object Detection +4

PSC-Net: Learning Part Spatial Co-occurrence for Occluded Pedestrian Detection

no code implementations 25 Jan 2020 Jin Xie, Yanwei Pang, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, Ling Shao

On the heavily occluded (HO) set of the CityPersons test set, our PSC-Net obtains an absolute gain of 4.0% in terms of log-average miss rate over the state-of-the-art with the same backbone and input scale, and without using additional VBB supervision.

Pedestrian Detection

Mask-Guided Attention Network for Occluded Pedestrian Detection

2 code implementations ICCV 2019 Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, Ling Shao

Our approach obtains an absolute gain of 9.5% in log-average miss rate, compared to the best reported results on the heavily occluded (HO) pedestrian set of the CityPersons test set.

Pedestrian Detection

Binary Patterns Encoded Convolutional Neural Networks for Texture Recognition and Remote Sensing Scene Classification

no code implementations 5 Jun 2017 Rao Muhammad Anwer, Fahad Shahbaz Khan, Joost Van de Weijer, Matthieu Molinier, Jorma Laaksonen

To the best of our knowledge, we are the first to investigate Binary Patterns encoded CNNs and different deep network fusion architectures for texture recognition and remote sensing scene classification.

Aerial Scene Classification General Classification +2

Scale Coding Bag of Deep Features for Human Attribute and Action Recognition

no code implementations 14 Dec 2016 Fahad Shahbaz Khan, Joost Van de Weijer, Rao Muhammad Anwer, Andrew D. Bagdanov, Michael Felsberg, Jorma Laaksonen

Most approaches to human attribute and action recognition in still images are based on image representation in which multi-scale local features are pooled across scale into a single, scale-invariant encoding.

Action Recognition In Still Images Attribute
