Computational Ethics for NLP

CMU CS 11830, Spring 2019

T/Th 10:30-11:50am, POS 146

Yulia Tsvetkov (office hours: Tuesday 12-1pm, GHC 6405)
Alan W Black (office hours: Wednesday 12-1pm, GHC 5701)
TA: Anjalie Field (office hours: Thursday 3-4pm, GHC 6609)

HW 1: Corpus Exploration

Due 11:59pm, Thursday 2/7

Submission: Email your assignment. Attach 2 separate files: your write-up (titled Lastname_Firstname.pdf) and a zip/tar folder containing your code. Code will not be graded.


While many NLP tasks have standard data sets (e.g. SQuAD for Q&A), socially oriented tasks often involve real-world data with few annotations and unclear research questions. The goals of this assignment are:
(1) Practice methods for exploring unstructured corpora
(2) Conduct an analysis of language used to describe social groups
(3) Consider the ethical implications of conducting analyses on sensitive social issues and unknown corpora


The data for this assignment is available here

In this homework you will explore how the language used to describe and report news about the LGBTQ community has changed over time by analyzing newspaper articles from 1986 to 2015.

You are given two corpora: a corpus of articles focused on the LGBTQ community (lgbt_news_corpus.csv) and a background corpus (background_news_corpus.csv). The primary data for analysis is in the “Text” column, and the remaining columns are metadata which you can optionally use.
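As a starting point, the "Text" column can be read with Python's standard csv module. This is only a minimal sketch: the helper name read_column is hypothetical, and it assumes the files are UTF-8 encoded with a header row, which the assignment does not state.

```python
import csv


def read_column(path, column="Text"):
    """Return the non-empty values of one column from a CSV file.

    Assumes (not confirmed by the assignment) that the file is UTF-8
    encoded and that its first row is a header containing `column`.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Skip rows where the column is missing or empty.
        return [row[column] for row in reader if row.get(column)]
```

For example, read_column("lgbt_news_corpus.csv") and read_column("background_news_corpus.csv") would give two lists of article texts to compare. The metadata columns can be read the same way by passing a different column name.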

Basic Requirements

Completing the basic requirements will earn a passing (B-range) grade.

Analyze context: We provide a seed list (seed_list.txt) of words that are used to refer to members of the LGBTQ community. (Note: this list was not created by the organizers of this course. You can modify this list.) Analyze the context of the seed words with a window size of 3 and visualize your analysis in the form of word clouds. For example, if the sentence is "The pastoral guidelines remind priests that in dealing with homosexuals they must attentively adhere to the church’s magisterium, or traditional teachings, according to a church official." and the seed word under consideration is "homosexuals", then you will analyze the words "in", "dealing", "with", "they", "must", and "attentively". Consider how to subset the corpus to obtain interesting word clouds.
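The window extraction described above could be sketched as follows. The function names and the simple lowercase regex tokenizer are assumptions made for illustration, not a required approach; the sketch also only matches single-token seed words, so multi-word seeds would need extra handling.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def context_counts(texts, seed_words, window=3):
    """Count the words appearing within `window` tokens of any seed word."""
    seeds = {word.lower() for word in seed_words}
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token in seeds:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts
```

Applied to the example sentence with the seed word "homosexuals", this counts "in", "dealing", "with", "they", "must", and "attentively" once each. The resulting Counter maps words to frequencies, so it could be passed to a word cloud tool, for instance WordCloud().generate_from_frequencies(counts) from the third-party wordcloud package. You will likely want to filter stop words first, since function words such as "in" and "with" otherwise dominate the cloud.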

Analyze sentiment: Explore the sentiment used in the articles related to the LGBTQ community and compare with the background articles. We provide a lexicon of positive and negative sentiment words (sentiment_lexicon.csv). Analyze the intersection of these sentiment words with the articles that describe the LGBTQ community and the background topics, and report the differences that you find.
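One simple way to measure the intersection is the fraction of tokens in each corpus that appear in the positive or negative word lists. The sketch below assumes the lexicon has already been loaded into two Python sets; the layout of sentiment_lexicon.csv is not described here, so the loading step is left out. The function names and the regex tokenizer are illustrative assumptions.

```python
import re

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def sentiment_rates(texts, positive_words, negative_words):
    """Return the share of tokens found in each sentiment word set."""
    positive = negative = total = 0
    for text in texts:
        for token in tokenize(text):
            total += 1
            if token in positive_words:
                positive += 1
            elif token in negative_words:
                negative += 1
    if total == 0:
        # Avoid division by zero on an empty corpus.
        return {"positive": 0.0, "negative": 0.0}
    return {"positive": positive / total, "negative": negative / total}
```

Computing these rates separately for the LGBTQ corpus and the background corpus (or per year, using the metadata) gives numbers that can be compared directly. Normalizing by token count matters because the two corpora may differ in size.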


Advanced Analysis

For full credit, complete an additional advanced analysis of this data set. Feel free to explore any further analysis directions. Here are a few references to give you ideas:

Additional Tips

Grading (100 points)


Thank you to Julia Mendelsohn for sharing the data used in this assignment.