Hanoona Abdul Rasheed

I am a Ph.D. student in Computer Vision at MBZUAI, working under the supervision of Dr. Fahad Shahbaz Khan and Dr. Salman Khan.

My research focuses on developing multi-modal understanding from vision and text to improve the common-sense reasoning of machines, with applications in open-vocabulary and open-world object detection. I am also exploring efficient neural networks for edge-computing devices (e.g., the Jetson Nano).

I received my B.Sc. degree in Electrical Engineering with honors from UET Lahore in 2018. After graduation, I joined Confiz Limited as a Computer Vision Engineer, where I worked on the design and deployment of deep-learning-driven computer vision solutions for the retail industry in Pakistan. In 2021, I joined MBZUAI to pursue my M.Sc. degree in Computer Vision.

Email  /  CV  /  Google Scholar  /  GitHub  /  LinkedIn

Research and Publications

* denotes equal contribution co-authorship

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Hanoona Rasheed*, Muhammad Maaz*, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan
NeurIPS, 2022
project page / arXiv / video

In this work, we propose to solve the open-vocabulary detection (OVD) problem using a pretrained CLIP model, adapting it to object-centric local regions through region-based distillation and image-level weak supervision. Specifically, we utilize high-quality class-agnostic and class-specific object proposals obtained from pretrained multi-modal vision transformers (MViTs). The class-agnostic proposals are used to distill region-specific information from CLIP, while the class-specific proposals allow us to visually ground large vocabularies. We also introduce a region-conditioned weight transfer method to get complementary benefits from both region-based distillation and image-level supervision.
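As a rough illustration of the region-based distillation idea, the sketch below is a simplification under assumed interfaces (the names region_embeddings, crops, and clip_image_encoder are illustrative, not the paper's released code): the detector's region features are pulled toward CLIP embeddings of the corresponding class-agnostic proposal crops.

```python
# Minimal sketch of region-based distillation from a frozen CLIP image encoder.
import torch
import torch.nn.functional as F

def region_distillation_loss(region_embeddings, crops, clip_image_encoder):
    """region_embeddings: (N, D) detector features pooled over proposal boxes.
    crops: (N, 3, H, W) image crops of the same class-agnostic proposals.
    clip_image_encoder: frozen CLIP visual encoder (assumed interface)."""
    with torch.no_grad():                      # CLIP stays frozen
        targets = clip_image_encoder(crops)    # (N, D) CLIP region targets
        targets = F.normalize(targets, dim=-1)
    preds = F.normalize(region_embeddings, dim=-1)
    # L1 loss pulls detector region features toward CLIP's embedding space
    return F.l1_loss(preds, targets)
```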

Class-agnostic Object Detection with Multi-modal Transformer
Muhammad Maaz*, Hanoona Rasheed*, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang
ECCV, 2022
project page / arXiv / video

In this work, we explore the potential of recent Multi-modal Vision Transformers (MViTs) for class-agnostic object detection. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs in localizing generic objects in images. We also develop an efficient and flexible MViT architecture, using multi-scale feature processing and deformable self-attention, that can adaptively generate proposals given a specific language query.
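To give a feel for how a language-queried detector can serve as a class-agnostic proposal generator, here is a hypothetical usage sketch; the mvit_detector interface, the query string, and the tensor types are assumptions for illustration, not the released API.

```python
# Hypothetical usage: query a multi-modal detector with a generic text prompt
# to obtain class-agnostic proposals (boxes and scores assumed torch tensors).
def class_agnostic_proposals(mvit_detector, image, query="all objects", top_k=50):
    """mvit_detector is assumed to return (boxes, scores) for a text query."""
    boxes, scores = mvit_detector(image, text_query=query)
    keep = scores.argsort(descending=True)[:top_k]  # keep most confident proposals
    return boxes[keep], scores[keep]
```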

MaPLe: Multi-modal Prompt Learning
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan
Under Review, 2022
project page / arXiv

In this work, we propose to learn prompts in both the vision and language branches of pretrained CLIP to adapt it to different downstream tasks. Previous works use prompting only in the language or the vision branch. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal, since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. To this end, we propose Multi-modal Prompt Learning (MaPLe) for both the vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions.
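The core idea can be sketched in a few lines. The snippet below is a simplified illustration (the dimension names and the single linear coupling layer are assumptions on my part, not the exact MaPLe implementation): learnable language prompts are prepended to the text branch, and coupled vision prompts, generated from them by a projection, are prepended to the vision branch of a frozen CLIP-like model.

```python
# Minimal sketch of multi-modal prompt learning with coupled prompts.
import torch
import torch.nn as nn

class MultiModalPrompts(nn.Module):
    def __init__(self, n_prompts=4, text_dim=512, vision_dim=768):
        super().__init__()
        # Learnable language prompts; the frozen CLIP weights are not updated.
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
        # Vision prompts are generated from the language prompts by a coupling
        # projection, which ties the two branches together.
        self.couple = nn.Linear(text_dim, vision_dim)

    def forward(self, text_tokens, vision_tokens):
        """Prepend prompts to the token sequences of each branch.
        text_tokens: (B, L_t, text_dim), vision_tokens: (B, L_v, vision_dim)."""
        b = text_tokens.size(0)
        t_p = self.text_prompts.unsqueeze(0).expand(b, -1, -1)
        v_p = self.couple(self.text_prompts).unsqueeze(0).expand(b, -1, -1)
        return (torch.cat([t_p, text_tokens], dim=1),
                torch.cat([v_p, vision_tokens], dim=1))
```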


You've probably seen this website template before, thanks to Jon Barron.
Last updated May 2020.