
Language Technologies Thesis Defense

Speaker
CINDY XINYI WANG
Ph.D. Student
Language Technologies Institute
Carnegie Mellon University

When
-

Where
In Person and Virtual Presentation - ET

Description
The adoption of neural network models has led to state-of-the-art performance on many NLP tasks for major languages with large amounts of data (Devlin et al., 2019; Vaswani et al., 2017), but these improvements often lag behind for low-resource languages (Koehn and Knowles, 2017; Sennrich and Zhang, 2019a). This imbalance in NLP progress could widen disparities between people from different regions or socioeconomic conditions. The goal of this thesis is to develop methods that efficiently use the available data resources to build competitive NLP systems for all languages. We focus on multilingual training, a particularly effective strategy for improving model quality on low-resource languages while training parameter-efficient models (Zoph et al., 2016; Neubig and Hu, 2018; Devlin et al., 2019; Conneau et al., 2020b).

We identify three major challenges facing multilingual models. (1) The standard word embedding representation hinders generalization across training signals from different languages, mainly because it lacks good inductive biases for the lexical similarities and discrepancies between languages. (2) Because multilingual datasets are often highly imbalanced across languages, searching for good data selection and balancing strategies requires multiple rounds of model retraining. (3) It is challenging to adapt a multilingual model to languages with very limited resources.

To tackle the first two challenges, we propose better word representation methods for multilingual data that encourage positive transfer between languages, and design automatic methods to select and balance multilingual training data. To tackle the third challenge, we explore novel methods that adapt multilingual models, through model ensembling and data augmentation, to support language varieties that are often overlooked in existing multilingual benchmarks.

Thesis Committee:
Graham Neubig (Chair)
Alan Black
Yulia Tsvetkov
Sebastian Ruder (Google Research)
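For readers unfamiliar with the data-balancing problem the abstract refers to, a common baseline (used, for example, in Conneau et al., 2020b) is temperature-based sampling over per-language corpus sizes. The sketch below is a minimal illustration of that standard heuristic, not the method proposed in the thesis; the function name and corpus figures are invented for the example.

    # Minimal sketch of temperature-based sampling, a standard heuristic for
    # balancing multilingual training data (e.g., a baseline in Conneau et al.,
    # 2020b). Names and numbers are illustrative, not from the thesis.

    def sampling_probs(corpus_sizes, temperature=5.0):
        """Compute per-language sampling probabilities.

        corpus_sizes: dict mapping language code -> number of training examples.
        temperature: T >= 1 flattens the distribution, upweighting
            low-resource languages relative to their raw data share.
        """
        total = sum(corpus_sizes.values())
        # Raw data shares p_l = n_l / N, rescaled as q_l proportional to p_l^(1/T).
        scaled = {lang: (n / total) ** (1.0 / temperature)
                  for lang, n in corpus_sizes.items()}
        z = sum(scaled.values())
        return {lang: s / z for lang, s in scaled.items()}

    # Example: one high-resource and one low-resource language.
    print(sampling_probs({"en": 1_000_000, "gl": 10_000}))
    # With T=5, the low-resource language's sampling share rises to roughly
    # 28%, far above its 1% share of the raw data.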

Zoom Participation. See announcement.