Language Technologies Ph.D. Thesis Defense

Speaker
ADITI CHAUDHARY
Ph.D. Student
Language Technologies Institute
Carnegie Mellon University

When
-

Where
In Person and Virtual - ET

Description
Creating a language description that illustrates the salient points of a language is not only important for language understanding but also an indispensable step in language documentation and preservation. Manually creating detailed descriptions for several languages that are usable by both humans and machines can be challenging. In this thesis, we therefore explore whether some of the processes involved in creating language descriptions can be automated, and whether the resulting descriptions can be produced in a format usable by both humans and machines. To achieve this goal, we develop AutoLEX, a system that automatically extracts linguistic insights covering aspects of morphology, syntax, and lexical semantics for several languages.

In the first part of the thesis, we describe this general framework, which takes as input a text corpus of the language of interest and a linguistic question that we are interested in exploring. AutoLEX converts the question into an NLP prediction task and produces a concise description that answers it. We further demonstrate the application of these language descriptions in real-world settings of language analysis and education.

Thanks to advances in NLP research, we can automate some local aspects of linguistic analysis, such as identifying the syntactic function of a word (POS tagging) or identifying grammatical relations (dependency parsing). In the second part of the thesis, we therefore describe how to improve the NLP building blocks that inform AutoLEX, particularly for under-resourced languages. Most state-of-the-art methods for these building blocks (e.g., POS tagging) require an abundance of labeled data, which is often not readily available for many languages. We therefore focus on improving these methods for under-resourced languages by leveraging commonalities between languages (cross-lingual transfer learning) and by efficiently collecting new data in the required language (active learning).

Thesis Committee
Graham Neubig (Chair)
Alan Black
David R. Mortensen
Antonios Anastasopoulos (George Mason University)
Isabelle Augenstein (University of Copenhagen)

Additional Information
In Person and Zoom Participation. See announcement.