Multilingual Natural Language Processing

CMU CS 11737, Fall 2020

Tue/Thur 5:10-6:30 pm ET

Graham Neubig (office hours: Fridays 4-5pm ET, location: Zoom), gneubig@cs.cmu.edu
Yulia Tsvetkov (office hours: Thursdays 4-5pm ET, location: Zoom), ytsvetko@cs.cmu.edu
Alan W Black (office hours: Wednesdays 12-1pm ET, location: Zoom), awb@cs.cmu.edu

Teaching Assistants:
Cindy Wang (office hours: Tuesdays 4-5pm ET, location: Zoom), xinyiw1@cs.cmu.edu
Sachin Kumar (office hours: Mondays 4-5pm ET, location: Zoom), sachink@cs.cmu.edu
Tanmay Parekh (office hours: Tuesdays 2-3pm ET, location: Zoom), tparekh@cs.cmu.edu

Forum: Piazza
Zoom links (for lectures and office hours) can be found on Piazza. The lecture videos can also be found online here.

Learning Goals

Students who take this course should be able to develop linguistically motivated solutions to core and applied NLP tasks for any language. This includes understanding and mitigating the difficulties posed by lack of data in low-resourced languages or language varieties, and the necessity to model particular properties of the language of interest such as complex morphology or syntax. The course will introduce modeling solutions to these issues such as multilingual or cross-lingual methods, linguistically informed NLP models, and methods for effectively bootstrapping systems with limited data or human intervention. The project work will involve building an end-to-end NLP pipeline in a language you don’t know.


Announcements


Pre-requisites

You must have taken an NLP class previously. Some examples include:

The assignments for the class will be done by creating neural network models, and examples will be provided using PyTorch. If you are not familiar with PyTorch, we suggest you attempt to familiarize yourself using online tutorials (for example “Deep Learning for NLP with PyTorch”) before starting the class.


Syllabus

The lecture plan and homework/project dates are tentative and subject to change.

Week | Date | Topics | Readings | Homeworks
1 | Sept 1 | Course Introduction [1, 2, 3, Lang in 10, video] | |
1 | Sept 3 | Typology: The Space of Languages [slides, video] | Choosing Transfer Languages |
2 | Sept 8 | Words, Parts of Speech, Morphology [slides, video, Lang in 10] | Linguistic Fundamentals: Chapter 2 |
2 | Sept 10 | Sequence Labeling/Classification 1 [slides, video] | | Homework 1: BiLSTM for Multilingual POS Tagging
3 | Sept 15 | Sequence Labeling/Classification 2 [slides, video] | SDE, XLM, XTREME, LSTM-CRF |
3 | Sept 17 | Translation and Translation Data [slides, video] | |
4 | Sept 22 | Translation Models 1 [slides, video] | Annotated Transformer | Homework 1 Due (Sept 23, 23:59)
4 | Sept 24 | Translation Models 2 [slides, video] | Data Augmentation for LR MT, Syntactic Divergences in MT | Homework 2: Multilingual MT
5 | Sept 29 | Language Contact and Similarity Across Languages [slides, video] | |
5 | Oct 1 | Multilingual Training and Cross-lingual Transfer [slides, video] | |
6 | Oct 6 | Unsupervised Translation [slides, video] | |
6 | Oct 8 | Code Switching, Pidgins & Creoles [slides, video] | |
7 | Oct 13 | Speech [slides, video] | | Homework 2 Due (Oct 12, 23:59)
7 | Oct 15 | Automatic Speech Recognition [slides, video] | |
8 | Oct 20 | Low Resource ASR [slides, video] | | Homework 3: Speech Recognition
8 | Oct 22 | Text-to-speech [slides, video] | |
9 | Oct 27 | Morphological Analysis and Inflection [slides, video] | |
9 | Oct 29 | Dependency Parsing [slides, video] | |
10 | Nov 3 | Data Annotation [slides, video] | | Homework 3 Due (Nov 2, 23:59)
10 | Nov 5 | Active Learning [slides, video] | |
11 | Nov 10 | Information Extraction: Lorelei [slides, video] | |
11 | Nov 12 | Language Documentation and Revitalization (by Patrick Littell) [video] | |
12 | Nov 17 | Guest Lecture by Orhan Firat from Google [video] | |
12 | Nov 19 | Guest Lecture by Burr Settles from Duolingo | |
13 | Nov 24 | Guest Lecture by Shonosuke Ishiwatari (CEO of Mantra Inc) | |
13 | Nov 26 | Thanksgiving Day (No Class) | |
14 | Dec 1 | Project Presentations day 1 | |
14 | Dec 3 | Project Presentations day 2 | | Final Project Due (Dec 16)

Class Format

Following safety guidelines due to the ongoing pandemic, all lectures will be held on Zoom. To make the online lectures a good experience for students, we will aim to make them as interactive as possible. On most days, there will be a short lecture of about 30 minutes, with an optional reading assignment given beforehand. In every class, a group of students will be asked to introduce a new language in a short 10-minute presentation (more details coming soon). At the end, we will have a 30-minute discussion session using breakout rooms (these may be graded for participation), followed by a short 10-minute summary.


Grading

This is a project-based course. Grading will be based on class participation, three homework assignments, and one final project, each contributing to your final grade as follows:

Final Project

The final project is expected to make a novel contribution to knowledge on multilingual language processing. In general, we will accept contributions in several categories, including the following:

  1. Proposal of a novel method for multilingual or low-resource language processing that is better (more accurate, computationally efficient, or data efficient) than other methods in the literature. It may be easier to do this as a follow-up on one of the assignments, but you are free to tackle other tasks as well.
  2. An extensive comparison of existing methods in the literature for tackling a particular multilingual or low-resource language processing task, analyzing their strengths and weaknesses and when you may expect them to succeed or fail (on multiple datasets).
  3. Building a state-of-the-art language tool for a language for which such technology does not currently exist. You should apply methods that you learned in class in a way that is specifically tailored to the language based on its unique linguistic properties, related languages, or data availability.

All submissions should include a report of up to 8 pages describing:
a. Background of the task and problems involved therein.
b. The choice of methodology, demonstrating internalization of the knowledge obtained in class.
c. Experimental setting and results, including strong baselines and proposed improvements for (1) and (3).
d. Analysis of the results demonstrating the characteristics of the implemented methods (especially important for (2)).
e. Which members of the group contributed in which ways to the implementation.

Grading:

Negative Results: Sometimes experiments don’t work as planned. If you try hard to get positive results but are not successful, you may still get a good grade by clearly describing why you thought your methods would work, and then performing an analysis of why your initial assumptions were incorrect, leading to results that did not match your initial expectations. The bar for paper writing, experimentation, and analysis will be a bit higher in these cases, as we want to make sure that you really made a serious effort.


Policies

Academic honesty. Homework assignments are to be completed individually. Verbal collaboration on homework assignments is acceptable, as well as re-implementation of relevant algorithms from research papers, but everything you turn in must be your own work, and you must note the names of anyone you collaborated with on each problem and cite resources that you used to learn about the problem. Suspected violations of academic integrity rules will be handled in accordance with the CMU guidelines on collaboration and cheating.


Note to Students

Take care of yourself! As a student, you may experience a range of challenges that can interfere with learning, such as strained relationships, increased anxiety, substance use, feeling down, difficulty concentrating and/or lack of motivation. All of us benefit from support during times of struggle. There are many helpful resources available on campus and an important part of having a healthy life is learning how to ask for help. Asking for support sooner rather than later is almost always helpful. CMU services are available, and treatment does work. You can learn more about confidential mental health services available on campus at: http://www.cmu.edu/counseling/. Support is always available (24/7) from Counseling and Psychological Services: 412-268-2922.

Accommodations for Students with Disabilities:

If you have a disability and have an accommodations letter from the Disability Resources office, we encourage you to discuss your accommodations and needs with us as early in the semester as possible. We will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, we encourage you to contact them at access@andrew.cmu.edu.