Hi! I am a third-year PhD student at the Language Technologies Institute at Carnegie Mellon University. I am advised by Graham Neubig and have wonderful friends and collaborators at Neulab :)

These days, I have been working on analysing the cultural capabilities of vision-language models. In particular, I'm revisiting the age-old problem of translation and exploring how it extends to multiple modalities, including images and videos. Check out this GitHub repo where I've been collecting resources for cultural NLP, and please feel free to send a PR to add relevant material!

We also won a best paper award at EMNLP 2024 for our work on image transcreation, where we introduce the task of localizing images using machine learning systems! I'm a mentor at OpenNLP Labs where we are doing cool projects related to this :) Check out their initiatives here. I've been working on several aspects of this problem in my research, including data curation, automatic evaluation and modeling. If you are interested in chatting about it, please feel free to reach out by email :)

I'm looking for summer 2025 internships on topics around geo-cultural diversity for vision-language models! Please reach out to me by email if I might be a good fit!

Previously, I've had the good fortune to research multilingualism and low-resource NLP, especially for Indian languages, at Google Research India (GRI) and Microsoft Research India (MSR-I). I'm so grateful to have worked with brilliant researchers whose mentorship has been critical to my growth as a researcher: Graham Neubig (CMU), Partha Talukdar (GRI), Sebastian Ruder (GRI), Alexis Conneau (GRI), Sunayana Sitaram (MSR-I), Monojit Choudhury (MSR-I), Dr. Sreejith V (BITS-G).

Apart from spending my time doing fun research, I hold a keen interest in music, a passion I got to pursue as part of the Music Society at BITS, where I gave performances as a vocalist.

For more information, you can check out my CV here or reach out to me on my email :)

Updates

Nov 2024: We won best paper for our work on image transcreation at EMNLP 2024! Very honored and humbled :')
Oct 2024: We recently released Pangea-7B, an open-sourced multi-(lingual, modal, cultural) model!
Sept 2024: Our paper on image transcreation was accepted at EMNLP (Main) '24 (tweet thread)!
Sept 2024: Invited talk on image transcreation at Pinterest!
June 2024: Invited keynote at the AmericasNLP workshop at NAACL '24! (slides, recording)
May 2024: Mentor for OpenNLP Labs where we are building technology to support cultural translation of stories for EduLang!
Apr 2024: Grateful to be supported by the Waibel Presidential Fellowship for 2024-25!
Mar 2024: Gave a talk on image transcreation at University of Edinburgh!
Mar 2024: Our work on data-efficient multilingual learning to appear in NAACL 2024! Reach out if you'd like to catch up in Mexico City :)
Jan 2024: Gave a talk on image transcreation (preprint, slides) at Google Research, Microsoft Research, IISc and Microsoft IDC!
Dec 2023: Gave a lecture at CMU 11737 (Multilingual NLP) on Image-Text Modeling for Multilingual NLP! (link to slides, tweet thread)
Aug 2023: Organized (and won a best paper at!) the Student Research Symposium at CMU LTI
May 2023: Our work on multi-cultural figurative language to appear in ACL 2023 Findings! We will be presenting at the LAW workshop :)
May 2023: I will be attending EACL 2023 in person to present our work at SIGTYP and C3NLP!
Jan 2023: Honored to receive the best paper award for FLEURS at SLT 2022!
Aug 2022: I started my PhD at CMU, LTI!
Aug 2022: I presented our research and its application to Google Assistant at the Decode with Google 2022 event! Thanks to my amazing team at Google Research India for the opportunity!
Oct 2021: I'll be attending ALPS 2022! Feel free to get in touch if you'll be attending too.
Aug 2021: Conducting a hands-on TensorFlow Tutorial session at the 5th CVIT IIIT Summer School!
Aug 2021: Hosting the NLP networking session at IKDD 2021 where Dr. Monojit Choudhury is our guest speaker!
May 2021: Our work on merging multiple pre-trained LMs to appear in ACL 2021 Findings.
Mar 2021: Technical write-up on MuRIL is now available on arxiv.
Mar 2021: The pre-trained MuRIL model (with MLM) is now available on HuggingFace.
Nov 2020: Open-sourced a multilingual model for Indian languages named MuRIL on TFHub!
Sep 2020: Hosted a Fireside Chat with Jeff Dean on his virtual Google India visit!
Aug 2020: I am joining the Google Research India lab as a Pre-Doctoral Researcher where I am working with Dr. Partha Talukdar!
Aug 2020: Graduated from BITS Pilani Goa with a dual degree in Computer Science and Economics.
July 2020: The GLUECoS code and leaderboard website are now open-sourced!
Apr 2020: Paper on building a benchmark for code-switched language processing to appear at ACL 2020! (Talk)
Mar 2020: We created a new dataset for code-mixed conversational NLI! Paper to appear in CALCS, LREC 2020.
Jul 2019: I am doing my bachelor thesis at the Microsoft Research India lab, where I am working with Dr. Sunayana Sitaram!
Jun 2019: Work done on generating code-mixed text in summer 2018 to appear at TLT SyntaxFest 2019
Apr 2018: Summer internship at the MT-NLP lab, IIIT Hyderabad where I will be working with Dr. Dipti Misra Sharma

Publications

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Xiang Yue*, Yueqi Song*, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
Under Review
web| pdf| cite

🏆 Best Paper
An image speaks a thousand words, but can everyone listen? On translating images for cultural relevance
Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig
EMNLP '24 | Conference on Empirical Methods in Natural Language Processing
web| pdf| code| cite| slides| video

DeMuX: Data-efficient Multilingual Learning
Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig
NAACL '24 | Conference of the North American Chapter of the Association for Computational Linguistics
pdf| code| slides| poster| video| cite

GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, ..., Graham Neubig
EMNLP '23 | Conference on Empirical Methods in Natural Language Processing
pdf| cite

Multi-lingual and Multi-cultural Figurative Language Understanding
Anubha Kabra*, Emmy Liu*, Simran Khanuja*, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig
ACL '23 Findings | Annual Meeting of the Association for Computational Linguistics
pdf| code| cite

🏆 Best Paper
FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau*, Min Ma*, Simran Khanuja*, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
SLT '22 | IEEE Spoken Language Technology Workshop
pdf| cite

MergeDistill: Merging Pre-trained Language Models using Distillation
Simran Khanuja, Melvin Johnson, Partha Talukdar
ACL '21 Findings | Annual Conference of the Association for Computational Linguistics
pdf| abstract| slides| cite

🗞️ Media Coverage
MuRIL: Multilingual Representations for Indian Languages
Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar
tfhub| huggingface| pdf| abstract| cite
Coverage: Economic Times| Indian Express| Google AI Blog

GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
ACL'20 | Annual Conference of the Association for Computational Linguistics
pdf| abstract| code| website| slides| video| cite

A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
CALCS, LREC'20 | International Conference on Language Resources and Evaluation
pdf| abstract| data| cite

Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
ICON'19 | International Conference on Natural Language Processing
pdf| abstract| cite

Dependency Parser for Bengali-English Code-Mixed Data enhanced with a Synthetic Treebank
Urmi Ghosh, Simran Khanuja, Dipti Misra Sharma
TLT, SyntaxFest 2019
pdf| abstract| code| cite

Talks and Interviews

Decode With Google 2022
Speaker List| Talk (2:28:00 onwards, registration required)

An Introduction to (Modern) TensorFlow
Simran Khanuja, Ameya Daigavane
CVIT Summer School, IIIT Hyderabad
slides

Journey into Research
Rotaract Club, BITS Hyderabad
interview

🏆 National Rank 1
ICSE National Topper
St. Mary's School, Pune
Media Coverage: India Times| Times of India| Indian Express