Hi! I am a second-year PhD student at the Language Technologies Institute at Carnegie Mellon University. I am advised by Prof. Graham Neubig and have wonderful friends and collaborators at Neulab :)

These days, I have been working on: a) analysing the cultural capabilities of vision-language models; and b) data-efficient multilingual fine-tuning. In particular, I'm revisiting the age-old problem of translation, and exploring how it extends to multiple modalities, including images and videos. I am also looking for internships for Summer 2024. Feel free to drop me an email if there's a good fit!

Previously, I was a Pre-Doctoral Researcher at Google Research India, working in the Natural Language Understanding team, mentored by Dr. Partha Talukdar. At Google, I researched multilinguality for low [text] resource languages and explored leveraging multimodal signals for the same. I applied my research to improve neural semantic parsing for eight Indian languages, for the Google Assistant.

I had the good fortune to complete my bachelor's thesis at Microsoft Research India, where I was advised by Dr. Sunayana Sitaram and Dr. Monojit Choudhury. I have also spent some wonderful summers interning at the MT-NLP Lab, IIIT Hyderabad, with Dr. Dipti Misra Sharma, and at Liveweaver, Pune.

I graduated with a B.E. (Hons.) in Computer Science and an M.Sc. (Hons.) in Economics from BITS Pilani, Goa, India in 2020. During my time there, I worked on multimodal emotion recognition in healthcare with Dr. Sreejith V. Apart from doing fun research, I have a keen interest in music, a passion I got to pursue as part of the Music Society at BITS, where I performed as a vocalist.

For more information, you can check out my CV here or reach out to me by email :)

Updates

Apr 2024: Grateful to be supported by the Waibel Presidential Fellowship for 2024-25!
Mar 2024: Gave a talk on image transcreation at University of Edinburgh!
Mar 2024: Our work on data-efficient multilingual learning to appear in NAACL 2024! Reach out if you'd like to catch up in Mexico City :)
Jan 2024: Gave a talk on image transcreation (preprint, slides) at Google Research, Microsoft Research, IISc and Microsoft IDC!
Dec 2023: Gave a lecture at CMU 11737 (Multilingual NLP) on Image-Text Modeling for Multilingual NLP! (link to slides, tweet thread)
Aug 2023: Organized (and won a best paper at!) the Student Research Symposium at CMU LTI
May 2023: Our work on multi-cultural figurative language to appear in ACL 2023 Findings! We will be presenting at the LAW workshop :)
May 2023: I will be attending EACL 2023 in person to present our work at SIGTYP and C3NLP!
Jan 2023: Honored to receive the best paper award for FLEURS at SLT 2022!
Aug 2022: I started my PhD at CMU, LTI!
Aug 2022: I presented our research and its application to Google Assistant at the Decode with Google 2022 event! Thanks to my amazing team at Google Research India for the opportunity!
Oct 2021: I'll be attending ALPS 2022! Feel free to get in touch if you'll be attending the same.
Aug 2021: Conducting a hands-on TensorFlow Tutorial session at the 5th CVIT IIIT Summer School!
Aug 2021: Hosting the NLP networking session at IKDD 2021 where Dr. Monojit Choudhury is our guest speaker!
May 2021: Our work on merging multiple pre-trained LMs to appear in ACL 2021 Findings.
Mar 2021: Technical write-up on MuRIL is now available on arXiv.
Mar 2021: The pre-trained MuRIL model (with MLM) is now available on HuggingFace.
Nov 2020: Open-sourced a multilingual model for Indian languages named MuRIL on TFHub!
Sep 2020: Hosted a Fireside Chat with Jeff Dean on his virtual Google India visit!
Aug 2020: I am joining the Google Research India lab as a Pre-Doctoral Researcher where I am working with Dr. Partha Talukdar!
Aug 2020: Graduated from BITS Pilani Goa with a dual degree in Computer Science and Economics.
Jul 2020: The GLUECoS code and leaderboard website are now open-sourced!
Apr 2020: Paper on building a benchmark for code-switched language processing to appear at ACL 2020! (Talk)
Mar 2020: We created a new dataset for code-mixed conversational NLI! Paper to appear in CALCS, LREC 2020.
Jul 2019: I am doing my bachelor's thesis at the Microsoft Research India lab, where I am working with Dr. Sunayana Sitaram!
Jun 2019: Work done on generating code-mixed text in summer 2018 to appear at TLT SyntaxFest 2019
Apr 2018: Summer internship at the MT-NLP lab, IIIT Hyderabad where I will be working with Dr. Dipti Misra Sharma

Publications

An image speaks a thousand words, but can everyone listen?
On translating images for cultural relevance

Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig
Under Review
pdf| code| cite

DeMuX: Data-efficient Multilingual Learning
Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig
NAACL '24 | Conference of the North American Chapter of the Association for Computational Linguistics
pdf| code| cite

GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, ..., Graham Neubig
EMNLP '23 | Conference on Empirical Methods in Natural Language Processing
pdf| cite

Multi-lingual and Multi-cultural Figurative Language Understanding
Anubha Kabra*, Emmy Liu*, Simran Khanuja*, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig
ACL '23 Findings | Annual Meeting of the Association for Computational Linguistics
pdf| code| cite

🏆 Best Paper
FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau*, Min Ma*, Simran Khanuja*, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
SLT '22 | IEEE Spoken Language Technology Workshop
pdf| cite

MergeDistill: Merging Pre-trained Language Models using Distillation
Simran Khanuja, Melvin Johnson, Partha Talukdar
ACL '21 Findings | Annual Meeting of the Association for Computational Linguistics
pdf| abstract| slides| cite

🗞️ Media Coverage
MuRIL: Multilingual Representations for Indian Languages
Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar
tfhub| huggingface| pdf| abstract| cite
Coverage: Economic Times| Indian Express| Google AI Blog

GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
ACL '20 | Annual Meeting of the Association for Computational Linguistics
pdf| abstract| code| website| slides| video| cite

A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
CALCS, LREC'20 | International Conference on Language Resources and Evaluation
pdf| abstract| data| cite

Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
ICON'19 | International Conference on Natural Language Processing
pdf| abstract| cite

Dependency Parser for Bengali-English Code-Mixed Data enhanced with a Synthetic Treebank
Urmi Ghosh, Simran Khanuja, Dipti Misra Sharma
TLT, SyntaxFest 2019
pdf| abstract| code| cite

Talks and Interviews

Decode With Google 2022
Speaker List| Talk (2:28:00 onwards, registration required)

An Introduction to (Modern) TensorFlow
Simran Khanuja, Ameya Daigavane
CVIT Summer School, IIIT Hyderabad
slides

Journey into Research
Rotaract Club, BITS Hyderabad
interview

🏆 National Rank 1
ICSE National Topper
St. Mary's School, Pune
Media Coverage: India Times| Times of India| Indian Express