Simran Khanuja

Hi! I am a fourth-year PhD candidate at the Language Technologies Institute at Carnegie Mellon University. I am advised by Graham Neubig and have wonderful friends and collaborators at Neulab :) I will be on the industry job market later this year / early next year — please reach out if you think there's a good fit!

Broadly, I am passionate about making AI genuinely useful for real-world applications and a diverse set of users, drawing inspiration from what people actually need but current systems lack. I am always excited to explore new directions that serve this larger goal, but my work thus far has focused on three main areas:

(1) LLM/Multimodal Evaluation — evaluation, benchmarks, multilingual and multicultural NLP.
Evaluation is what truly drives progress, and it should be grounded in real-world needs. I've built benchmarks and metrics inspired by what people use (or could use) these models for — culturally translating images for applications like ads, education, and TV/film (image transcreation, Best Paper, EMNLP 2024); speech understanding across a long-tail set of languages (FLEURS, Best Paper, SLT 2022); and code-mixed inputs, common in multilingual communities (GLUECoS).

(2) Models/Methodological Interventions — interpretability, alignment, post-training, personalization.
I probe and improve model representations for broader linguistic and cultural coverage — MuRIL, Pangea and Cultural Pangea, and inference-time steering for controllable, culturally aware generation.

(3) Data Selection & Synthetic Data — data-efficient training, synthetic data.
Data has consistently been a bottleneck in my research — whether in the long tail of languages and cultures, or in applications like advertising and education where much relevant data is copyrighted. I've worked on data-efficient fine-tuning (DeMuX), and my proposed work builds synthetic data pipelines for diversity sampling in text-to-image models and for training multimodal generative models in these domains.

I've been fortunate to have my work recognized through fellowships and awards, including MIT EECS Rising Star, Rising Star in AI (UMich), BITS 30 Under 30 (Research), the CMU Waibel Presidential Fellowship, and two Best Paper Awards (EMNLP 2024 and SLT 2022).

I'm deeply grateful to the brilliant researchers whose mentorship has shaped my growth: Graham Neubig (CMU), Partha Talukdar (Google DeepMind), Sebastian Ruder (Google DeepMind), Alexis Conneau (Google DeepMind), Sunayana Sitaram (Microsoft Research), Monojit Choudhury (Microsoft Research), and Dr. Sreejith V (BITS Pilani).

For more information, check out my CV or reach out via email :)

July 2026 Our Steering LLMs for Culturally Localized Generation paper is accepted to COLM 2026!

May 2026 Selected to participate in the 13th Heidelberg Laureate Forum!

Apr 2026 Passed my PhD proposal and am now a PhD Candidate! (slides — note: proposed work slides not included, but feel free to reach out if you'd like to know more!)

Jan 2026 Received Honourable Mention for the Jane Street 2026 Fellowship!

Jan 2026 Our CAIRE paper is accepted to EACL 2026 (Main)

Jan 2026 Honored to be selected for BITS 30 Under 30 - Research Leaders!

Oct 2025 Presenting at the MIT EECS Rising Star Workshop! (poster)

Oct 2025 Invited keynote at the CEGIS workshop at ICCV '25 (slides)

Oct 2025 Invited talk as Rising Star in AI at AI for Science Symposium at UMich! (slides, video)

June 2025 Won the Imminent Translated Grant for our Human-AI Image Localization Platform!

June 2025 Invited Keynote at Demographic Diversity in CV workshop at CVPR '25 (slides, video)

Mar 2025 Interning at Google DeepMind this summer with Lun Wang

Mar 2025 Invited talk at UT Austin NLL Reading Group

Jan 2025 Paper on automatic evaluation for image transcreation accepted at NAACL 2025

Jan 2025 Pangea accepted at ICLR 2025

Dec 2024 Won best paper runner-up at IEEE Big Data 2024 for our image-editing platform!

Nov 2024 Won best paper at EMNLP 2024 for our work on image transcreation!

Oct 2024 Released Pangea-7B, an open-source multi-(lingual, modal, cultural) model

Sept 2024 Image transcreation paper accepted at EMNLP (Main) '24!

Sept 2024 Invited talk on image transcreation at Pinterest

June 2024 Invited keynote at AmericasNLP workshop at NAACL '24

Apr 2024 Supported by Waibel Presidential Fellowship for 2024-25!

Jan 2023 Received best paper award for FLEURS at SLT 2022!

Aug 2022 Started my PhD at CMU LTI!

Aug 2022 Presented our research and its application to Google Assistant at Decode with Google 2022!

Oct 2021 Attending ALPS 2022!

Aug 2021 Conducted a hands-on TensorFlow Tutorial at the 5th CVIT IIIT Summer School

Aug 2021 Hosted the NLP networking session at IKDD 2021 with Dr. Monojit Choudhury as guest speaker

May 2021 Work on merging multiple pre-trained LMs to appear in ACL 2021 Findings

Mar 2021 MuRIL technical write-up available on arXiv and model on Hugging Face

Nov 2020 Open-sourced MuRIL, a multilingual model for Indian languages on TFHub!

Sep 2020 Hosted a Fireside Chat with Jeff Dean on his virtual Google India visit!

Aug 2020 Joined Google Research India as a Pre-Doctoral Researcher with Dr. Partha Talukdar!

Aug 2020 Graduated from BITS Pilani Goa with a dual degree in Computer Science and Economics

July 2020 GLUECoS code and leaderboard are now open-sourced!

Apr 2020 Paper on building a benchmark for code-switched language processing at ACL 2020!

Mar 2020 Created a new dataset for code-mixed conversational NLI! Paper at CALCS, LREC 2020

Jul 2019 Started bachelor thesis at Microsoft Research India with Dr. Sunayana Sitaram!

Jun 2019 Work on generating code-mixed text to appear at TLT SyntaxFest 2019

Apr 2018 Summer internship at MT-NLP lab, IIIT Hyderabad with Dr. Dipti Misra Sharma

🏆

13th Heidelberg Laureate Forum

Selected as one of 200 young researchers worldwide to participate

2026

🏆

Jane Street Fellowship - Honourable Mention

Recognition for the Jane Street 2026 Graduate Research Fellowship

2026

🏆

BITS 30 Under 30 - Research Leaders

BITS Pilani Alumni Association recognition for outstanding achievements

2026

🏆

MIT EECS Rising Star

Selected for the prestigious MIT EECS Rising Stars workshop

2025

🏆

Rising Star in AI - University of Michigan

Invited speaker at the AI for Science Symposium

2025

🏅

Best Paper Award - EMNLP 2024

For "An image speaks a thousand words, but can everyone listen?" on image transcreation

2024

🏆

Waibel Presidential Fellowship

Carnegie Mellon University endowed fellowship

2024-2025

🏅

Best Paper Award - SLT 2022

For FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech

2022

🏆

ICSE National Rank 1

All India Topper, St. Mary's School, Pune

2013

Steering LLMs for Culturally Localized Generation

Simran Khanuja, Hongbin Liu, Shujian Zhang, John Lambert, Mingqing Chen, Rajiv Mathews, Lun Wang

COLM 2026 | Conference on Language Modeling

pdf

HILITe: Human-AI Collaborative Framework for Image Transcreation

Simran Khanuja, Yutong Zhang, Aayush Bheemaiah, Jainish Patel, Arya Pasumarthi, Armaan Sharma, Sophia Li, Yueqi Song, Michael Saxon, Diyi Yang, Graham Neubig

HCI+NLP@EMNLP '25 | Under Conference Submission

CAIRE: Cultural Attribution of Images by Retrieval-Augmented Evaluation

Arnav Yayavaram*, Siddharth Yayavaram*, Simran Khanuja*, Michael Saxon, Graham Neubig

EACL 2026 | European Chapter of the ACL

Also presented at: CEGIS@ICCV '25

pdf code

Towards Automatic Evaluation for Image Transcreation

Simran Khanuja*, Vivek Iyer*, Claire He, Graham Neubig

NAACL 2025 | Annual Conference of the Nations of the Americas Chapter of the ACL

pdf code cite

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Xiang Yue*, Yueqi Song*, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig

ICLR 2025 | International Conference on Learning Representations

website pdf cite

Grounding Multilingual Multimodal LLMs With Cultural Knowledge (Cultural Pangea)

Jean de Dieu Nyandwi, Yueqi Song, Simran Khanuja, Graham Neubig

EMNLP 2025 | Conference on Empirical Methods in Natural Language Processing

pdf huggingface

🏆 Best Paper Runner-Up

HILITE: Human-in-the-loop Interactive Tool for Image Editing

Arya Pasumarthi, Armaan Sharma, Jainish H. Patel, ..., Diyi Yang, Graham Neubig, Simran Khanuja

IEEE BigData 2024 | IEEE International Conference on Big Data (Undergraduate Symposium)

🏆 Best Paper

An image speaks a thousand words, but can everyone listen? On translating images for cultural relevance

Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig

EMNLP '24 | Conference on Empirical Methods in Natural Language Processing

website pdf code slides video cite

DeMuX: Data-efficient Multilingual Learning

Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig

NAACL '24 | Conference of the North American Chapter of the ACL

pdf code slides poster video cite

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, ..., Graham Neubig

EMNLP '23 | Conference on Empirical Methods in Natural Language Processing

pdf cite

Multi-lingual and Multi-cultural Figurative Language Understanding

Anubha Kabra*, Emmy Liu*, Simran Khanuja*, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig

ACL '23 Findings | Annual Meeting of the ACL

pdf code cite

🏆 Best Paper

FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech

Alexis Conneau*, Min Ma*, Simran Khanuja*, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna

SLT '22 | IEEE Spoken Language Technology Workshop

pdf cite

MergeDistill: Merging Pre-trained Language Models using Distillation

Simran Khanuja, Melvin Johnson, Partha Talukdar

Findings of ACL'21 | Annual Meeting of the ACL

pdf slides cite

📰 Media Coverage

MuRIL: Multilingual Representations for Indian Languages

Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar

tfhub huggingface pdf cite

Coverage: Economic Times | Indian Express | Google AI Blog

GLUECoS: An Evaluation Benchmark for Code-Switched NLP

Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury

ACL'20 | Annual Meeting of the ACL

pdf code website slides video cite

A New Dataset for Natural Language Inference from Code-mixed Conversations

Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

CALCS, LREC'20 | International Conference on Language Resources and Evaluation

pdf data cite

Keynote: Cultural Inclusivity in Multimodal AI

CEGIS Workshop @ ICCV 2025

October 2025

slides

Rising Star in AI: Research Overview

AI for Science Symposium, University of Michigan

October 2025

slides video

Keynote: Image Transcreation for Cultural Diversity

Demographic Diversity in CV Workshop @ CVPR 2025

June 2025

slides video

Keynote: Cultural NLP and Image Transcreation

AmericasNLP Workshop @ NAACL 2024

June 2024

slides video

Image Transcreation for Cultural Relevance

Google Research, Microsoft Research, IISc, Microsoft IDC, University of Edinburgh, Pinterest

Jan - Sept 2024

slides

Decode With Google 2022

Google Research India

August 2022

video

Imminent Translated Research Grant

Translated | Human-AI Image Localization Platform

2025

Waibel Presidential Fellowship

Carnegie Mellon University

2024-2025
