Hi! I am a first-year PhD student at the Language Technologies Institute at Carnegie Mellon University. I am advised by Prof. Graham Neubig and have wonderful friends and collaborators at Neulab :) Previously, I was a Pre-Doctoral Researcher at Google Research India, working in the Natural Language Understanding team, mentored by Dr. Partha Talukdar. At Google, I researched multilinguality for low-resource languages and explored leveraging multimodal signals for this purpose. I applied my research to improve neural semantic parsing in eight Indian languages for the Google Assistant.

I had the good fortune to complete my bachelor's thesis at Microsoft Research India, where I was advised by Dr. Sunayana Sitaram and Dr. Monojit Choudhury. I have also spent some wonderful summers interning at the MT-NLP Lab, IIIT Hyderabad, with Dr. Dipti Misra Sharma, and at Liveweaver, Pune.

My research experiences have given me a unique opportunity to contribute towards developing technologies for the linguistically rich and diverse set of Indian languages, for which I am ever grateful. I have come to realise the importance of working on research problems that are applicable in the real world, and I strive to keep this focus in my research. I wish to draw inspiration from cognitive and behavioural sciences to build technologies that make human-computer interaction more seamless and natural.

I graduated with a B.E. (Hons.) in Computer Science and an M.Sc. (Hons.) in Economics from BITS Pilani, Goa, India in 2020. During my time there, I worked on multimodal emotion recognition in healthcare with Dr. Sreejith V. Apart from doing fun research, I have a keen interest in music, a passion I got to pursue as part of the Music Society at BITS, where I performed as a vocalist.

For more information, you can check out my CV here or reach out to me by email :)


Aug 2022: I started my PhD at CMU, LTI!
Oct 2021: I'll be attending ALPS 2022! Feel free to get in touch if you'll be attending too.
Aug 2021: Conducting a hands-on TensorFlow Tutorial session at the 5th CVIT IIIT Summer School!
Aug 2021: Hosting the NLP networking session at IKDD 2021 where Dr. Monojit Choudhury is our guest speaker!
May 2021: Our work on merging multiple pre-trained LMs to appear in ACL 2021 Findings.
Mar 2021: Technical write-up on MuRIL is now available on arXiv.
Mar 2021: The pre-trained MuRIL model (with MLM) is now available on HuggingFace.
Nov 2020: Open-sourced a multilingual model for Indian languages named MuRIL on TFHub!
Sep 2020: Hosted a Fireside Chat with Jeff Dean on his virtual Google India visit!
Aug 2020: I am joining the Google Research India lab as a Pre-Doctoral Researcher, where I will be working with Dr. Partha Talukdar!
Aug 2020: Graduated from BITS Pilani Goa with a dual degree in Computer Science and Economics.
Jul 2020: The GLUECoS code and leaderboard website are now open-sourced!
Apr 2020: Paper on building a benchmark for code-switched language processing to appear at ACL 2020! (Talk)
Mar 2020: We created a new dataset for code-mixed conversational NLI! Paper to appear in CALCS, LREC 2020.
Jul 2019: I am doing my bachelor's thesis at the Microsoft Research India lab, where I am working with Dr. Sunayana Sitaram!
Jun 2019: Work done on generating code-mixed text in summer 2018 to appear at TLT SyntaxFest 2019
Apr 2018: Summer internship at the MT-NLP lab, IIIT Hyderabad where I will be working with Dr. Dipti Misra Sharma


MergeDistill : Merging Pre-trained Language Models using Distillation
Simran Khanuja, Melvin Johnson, Partha Talukdar
Findings of ACL'21 | Annual Conference of the Association for Computational Linguistics
pdf| abstract| slides| cite

MuRIL : Multilingual Representations for Indian Languages
Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar
tfhub| huggingface| pdf| abstract| cite
Coverage : Economic Times| Indian Express| Google AI Blog

GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
ACL'20 | Annual Conference of the Association for Computational Linguistics
pdf| abstract| code| website| slides| video| cite

A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
CALCS, LREC'20 | International Conference on Language Resources and Evaluation
pdf| abstract| data| cite

Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
ICON'19 | International Conference on Natural Language Processing
pdf| abstract| cite

Dependency Parser for Bengali-English Code-Mixed Data enhanced with a Synthetic Treebank
Urmi Ghosh, Simran Khanuja, Dipti Misra Sharma
TLT, SyntaxFest 2019
pdf| abstract| code| cite

Talks and Interviews

An Introduction to (Modern) TensorFlow
Simran Khanuja, Ameya Daigavane
CVIT Summer School, IIIT Hyderabad

Journey into Research
Rotaract Club, BITS Hyderabad

ICSE National Topper
St. Mary's School, Pune
India Times| Times of India| Indian Express| Wikipedia