Akshita Gupta (अक्षिता गुप्ता)

I am an ELLIS PhD student at TU Darmstadt, co-supervised by Prof. Marcus Rohrbach and Dr. Federico Tombari (Google Zurich). I completed my MASc at the University of Guelph, where I was advised by Prof. Graham Taylor. During that time, I was also a student researcher at the Vector Institute.

I was fortunate to spend time as a research intern at Apple with Dr. Tatiana Likhomanenko, at Microsoft with Gaurav Mittal and Mei Chen, and at the Vector Institute with Dr. David Emerson, and to serve as a Scientist-in-Residence at NextAI with Prof. Graham Taylor.

Before coming to academia, I worked as a Data Scientist at Bayanat, where I focused on projects related to detection and segmentation. Prior to that, I was a Research Engineer at the Inception Institute of Artificial Intelligence (IIAI), working with Dr. Sanath Narayan, Dr. Salman Khan, and Dr. Fahad Shahbaz Khan.

Email  /  Google Scholar  /  Twitter  /  Github  /  Resume/CV


What's New

[Mar 2025] Excited to be an ELLIS PhD student at TU Darmstadt under Prof. Marcus Rohrbach and Dr. Federico Tombari (Google Zurich) 🎉
[Mar 2025] Defended my Master's thesis and graduated!
[Nov 2024] Our paper Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis is now on arXiv!
[Jun 2024] Joined Apple as a Research Intern
[May 2024] Serving as a Scientist-in-Residence at NextAI.
[Jan 2024] Our paper Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization is accepted at WACV 2025 (Oral)! 🎤
[Dec 2023] Our work Open-Vocabulary Temporal Action Localization using Multimodal Guidance is accepted at BMVC 2024!
[Jun 2023] Our paper Generative Multi-Label Zero-Shot Learning is accepted at TPAMI 2023.
[Jun 2023] Started interning at Microsoft, ROAR team.
[Jan 2023] Interned at Vector Institute with AI Eng team.
[Sep 2022] Joined Prof. Graham Taylor's Lab and Vector Institute.
[Mar 2022] OW-DETR accepted at CVPR 2022.
[Sep 2021] Reviewer for CVPR 2023, CVPR 2022, ECCV 2022, ICCV 2021, TPAMI.
[Jul 2021] BiAM accepted at ICCV 2021.
[Feb 2021] Serving as a reviewer for ML Reproducibility Challenge 2020.
[Jan 2021] Paper out on arXiv: Generative Multi-Label Zero-Shot Learning.
[Jul 2020] TF-VAEGAN accepted at ECCV 2020.
[Aug 2019] A Large-scale Instance Segmentation Dataset for Aerial Images (iSAID) is available for download.
[Aug 2018] One paper accepted at the CHiME workshop, Interspeech 2018.
[May 2018] Selected as an Outreachy intern, with Mozilla.

Research Interest

I'm interested in developing models that can learn from limited data, including few-shot, one-shot, and zero-shot settings.

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
Akshita Gupta, Tatiana Likhomanenko*, Karren Dai Yang, Richard He Bai,
Zakaria Aldeneh, Navdeep Jaitly
arXiv 2024
paper / code
LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
Akshita Gupta, Gaurav Mittal, Ahmed Magooda, Ye Yu,
Graham W. Taylor, Mei Chen
WACV 2025
paper / code
  • Description: Developed a long-short-range adapter that enables memory-efficient, end-to-end training of large video backbones for temporal action localization.
Open-Vocabulary Temporal Action Localization using Multimodal Guidance
Akshita Gupta, Aditya Arora, Sanath Narayan, Salman Khan,
Fahad Shahbaz Khan, Graham W. Taylor
BMVC 2024
paper / code
  • Description: Developed a multimodal-guidance approach for localizing actions from open-vocabulary category descriptions.
OW-DETR: Open-world Detection Transformer
Akshita Gupta*, Sanath Narayan*, K J Joseph, Salman Khan, Fahad Shahbaz Khan,
Mubarak Shah
CVPR 2022
paper / code
  • Description: Developed a multi-scale, context-aware detection framework with attention-driven pseudo-labelling.
  • Outcome: Improved state-of-the-art performance on the MS-COCO dataset with absolute gains of 1.8% to 3.3% in unknown recall.
Discriminative Region-based Multi-Label Zero-Shot Learning
Sanath Narayan*, Akshita Gupta*, Salman Khan, Fahad Shahbaz Khan,
Ling Shao, Mubarak Shah
ICCV 2021
paper / code
  • Description: Developed an attention module that combines region-level and global-level contextual information.
  • Outcome: Improved state-of-the-art performance on NUS-WIDE and Open Images by 6.9% and 31.9% mAP.
Generative Multi-Label Zero-Shot Learning
Akshita Gupta*, Sanath Narayan*, Salman Khan, Fahad Shahbaz Khan,
Ling Shao, Joost van de Weijer
TPAMI 2023
paper / code
  • Description: Developed a generative model that constructs multi-label features for (generalized) zero-shot learning.
  • Outcome: Improved state-of-the-art performance on NUS-WIDE, Open Images, and MS-COCO by 3.3%, 4.3%, and 15.7% mAP.
Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification
Sanath Narayan*, Akshita Gupta*, Salman Khan, Fahad Shahbaz Khan,
Cees G. M. Snoek, Ling Shao
ECCV 2020
paper / code
  • Description: Developed a generative feature-synthesizing framework for zero-shot learning.
  • Outcome: Improved state-of-the-art harmonic-mean performance on CUB, FLO, SUN, and AWA by 4.6%, 7.1%, 1.7%, and 3.1% by enforcing semantic consistency at all stages of zero-shot learning.
iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images
Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, Xiang Bai
CVPR Workshop 2019 (Oral Presentation)
code / dataset
  • Description: Improved state-of-the-art object detectors (Mask R-CNN and PANet) for aerial imagery.
  • Outcome: Proposed a large-scale instance segmentation and object detection dataset (iSAID), with benchmarks on Mask R-CNN and PANet.
Acoustic features fusion using attentive multi-channel deep architecture
Gaurav Bhatt, Akshita Gupta, Aditya Arora, Balasubramanian Raman
Interspeech Workshop 2018
code
  • Description: Developed an attention-based framework for acoustic scene recognition and audio tagging.
  • Outcome: Improved the equal error rate by at least 3% over the DCASE challenge results.

I borrowed this website layout from here!