Computational Ethics for NLP

CMU CS 11830, Spring 2020

T/Th 10:30-11:50am, SH 214

Yulia Tsvetkov (office hours by appointment)
Alan W Black (office hours: Wednesdays 12-1pm, Zoom link on Piazza)
TA: Anjalie Field (office hours by appointment)
TA: Michael Miller Yoder (office hours by appointment)

HW 1: Crowdsourced Annotations

Due 11:59pm, Tuesday 2/11

Package your code and a PDF write-up into a single tar or zip archive and submit it through Canvas. Code will not be graded.


Crowdsourcing annotations has become a fundamental aspect of NLP research. The goal of this assignment is to explore the ethical implications of soliciting crowdsourced data, specifically the social biases that may emerge when crowdworkers are asked to generate sentences.


In this homework, you will perform a “bias audit” of an NLP dataset produced by crowdsourcing. You will attempt to measure the presence of social stereotypes in this dataset, which may have harmful effects if the dataset is used to train classifiers for downstream tasks.

You will use pointwise mutual information (PMI) to find which associations are being made with identity labels. PMI can be used as a measure of word association in a corpus, i.e. how much more often two words co-occur than would be expected from their individual frequencies if they occurred independently. See the PMI Wikipedia page for more details. Here we use PMI to measure which words co-occur with labels for identities. This allows us to see associations that may perpetuate stereotypes.

After this analysis, you will present specific examples from the data that you speculate could be particularly biased and problematic. In the optional advanced analysis, you will expand this analysis to another corpus.

Data and Resources

Basic Requirements

Completing the basic requirements will earn a passing (B-range) grade.

Word association analysis: First, build a tool for calculating pointwise mutual information (PMI) between unigrams in the SNLI dataset. Your tool should take a unigram as input and, using word frequencies computed over the corpus, return a list of the other unigrams in the corpus ranked by their PMI with it. Terms that occur fewer than 10 times in the corpus should not be considered; optionally you can consider other thresholds. For preprocessing, lowercase and tokenize the data and remove stopwords. Note that there are duplicate premises and hypotheses in the data; remove these and just look at unique utterances.
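The preprocessing steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the stopword list is a tiny placeholder, the regex tokenizer is an assumption, and the function names are hypothetical. Each word is counted once per unique sentence, which is one reasonable reading of the counting rule.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); substitute a fuller list
# of your choice for real analysis.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "with", "at"}

def preprocess(sentence):
    """Lowercase, tokenize on runs of letters/apostrophes, drop stopwords."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_vocab(sentences, min_count=10):
    """Deduplicate sentences, then count each word once per unique sentence.

    Returns (docs, counts, vocab): docs is a list of token sets, counts maps
    word -> number of unique sentences containing it, and vocab is the set of
    words meeting the min_count threshold.
    """
    unique = list(dict.fromkeys(sentences))  # drops exact duplicates, keeps order
    docs = [set(preprocess(s)) for s in unique]
    counts = Counter(w for doc in docs for w in doc)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return docs, counts, vocab
```

The threshold is a parameter so that values other than 10 can be tried without changing the code.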

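Here’s how you can calculate PMI. Let $c(w_i)$ be the count of word $w_i$ in the corpus and $c(w_i, w_j)$ be the number of times that $w_i$ and $w_j$ occur in the same premise or hypothesis. If they co-occur more than once within a premise or hypothesis, you can still just count that as one. With $N$ as the number of documents (premises or hypotheses) in the corpus, we define $P(w_i)$ as the word frequency $c(w_i)/N$, and likewise $P(w_i, w_j) = c(w_i, w_j)/N$. Then PMI is:

$$\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} = \log \frac{N \, c(w_i, w_j)}{c(w_i)\, c(w_j)}$$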

Compute PMI between the identity labels in the provided list and all other words in the SNLI training corpus (see details in the Data and Resources section above). Look at the top associated words for identity labels of your choice. Do you see any that may reflect social stereotypes? It is helpful to compare the top PMI words for certain identity terms with those for related terms (such as men compared with women). Note that some terms in the list do not occur in the data; they are included for advanced analysis on possible other corpora.
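One way to rank words by PMI with an identity label is sketched below. It assumes each document (unique premise or hypothesis) has already been reduced to a set of preprocessed tokens, so that a co-occurrence is counted at most once per document. The function name top_pmi and the choice of log base 2 are assumptions for illustration; the base only rescales the scores and does not change the ranking.

```python
import math
from collections import Counter

def top_pmi(target, docs, min_count=10, k=20):
    """Rank words by PMI with `target`.

    docs: list of sets of tokens, one set per unique premise or hypothesis.
    Words seen in fewer than `min_count` documents are skipped.
    """
    n_docs = len(docs)
    word_counts = Counter(w for d in docs for w in d)
    # Count documents in which each word appears together with the target.
    co_counts = Counter(w for d in docs if target in d for w in d if w != target)
    scores = {}
    for w, c_xy in co_counts.items():
        if word_counts[w] < min_count:
            continue
        # PMI = log( N * c(x, y) / (c(x) * c(y)) ), here with base 2.
        scores[w] = math.log2(n_docs * c_xy / (word_counts[target] * word_counts[w]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Running top_pmi for two related labels (for example "men" and "women") and comparing the two lists side by side is a simple way to set up the comparison suggested above.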

Calculate PMI separately for identity terms in the premises, which are the original provided captions from the Flickr30k image captioning dataset, and identity terms in the hypotheses, which were elicited in a crowdsourcing task. In the write-up, you will compare the associations found in each.
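A minimal sketch of separating the two sides of the data follows. It assumes the JSON Lines release of SNLI, in which each line is a JSON object with the premise under "sentence1" and the hypothesis under "sentence2"; the function name and the file name in the usage comment are assumptions, so adjust them to the files you are actually given.

```python
import json

def split_premises_hypotheses(lines):
    """Return (premises, hypotheses) as lists of unique strings.

    lines: iterable of JSON strings, one SNLI example per line.
    Dicts are used as ordered sets so duplicates are dropped but order is kept.
    """
    premises, hypotheses = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        premises[example["sentence1"]] = None
        hypotheses[example["sentence2"]] = None
    return list(premises), list(hypotheses)

# Usage (the path is an assumption):
# with open("snli_1.0_train.jsonl", encoding="utf-8") as f:
#     premises, hypotheses = split_premises_hypotheses(f)
```

Each list can then be preprocessed and scored on its own, so that premise associations and hypothesis associations are never mixed in the counts.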

Qualitative analysis: Find specific hypotheses from the dataset where an identity label occurs with a top-associated term, and assess whether each example shows some social bias or not. Look at 1-2 examples for at least 5 different identity labels. Also note the label (entailment, contradiction, neutral) and consider the impact of asking annotators for certain types of inference.
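Pulling candidate examples for the qualitative analysis could look like the sketch below. It assumes you have the hypotheses paired with their labels as (hypothesis, label) tuples; the function name and the simple regex tokenizer are illustrative assumptions. It only retrieves candidates; judging whether an example reflects bias remains a manual step.

```python
import re

def find_examples(examples, identity, associated, limit=2):
    """Return up to `limit` unique (hypothesis, label) pairs whose hypothesis
    contains both the identity term and the associated term as whole tokens.

    examples: iterable of (hypothesis, label) tuples.
    """
    hits, seen = [], set()
    for hypothesis, label in examples:
        tokens = set(re.findall(r"[a-z']+", hypothesis.lower()))
        if identity in tokens and associated in tokens and hypothesis not in seen:
            seen.add(hypothesis)
            hits.append((hypothesis, label))
            if len(hits) == limit:
                break
    return hits
```

Matching on whole tokens rather than substrings avoids false hits such as "man" inside "woman".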

Crowdsourcing set-up: Read about what crowdworkers were asked to do in constructing the SNLI corpus in the SNLI paper. Come up with at least one idea about how the designers of the crowdsourcing task might have mitigated any social bias you found in your analysis. For example, are there certain topics that often led to biased hypotheses? Could the task have been structured differently or different instructions given to mitigate bias?

Advanced Analysis

Choose one of the options below for advanced analysis.

New corpus

Choose another crowdsourced NLP or ML dataset and perform a similar bias audit based on identity terms. Datasets to consider include:

Modify the identity list by adding labels that occur above a frequency threshold in this new dataset and dropping labels that do not, then run the PMI word association analysis. As in the basic requirements, discuss stereotypes found in this corpus and give specific examples. Are there differences in the type or degree of stereotypes found compared with the SNLI corpus? Read about the annotation procedure for these datasets. How might the design of these crowdsourcing tasks have encouraged, or discouraged, responses that reflect stereotypes, in ways that are similar to or different from SNLI? Discuss potential implications for new crowdsourced data collection in the write-up.


Expand the PMI analysis to higher-order ngrams and possibly syntactic phrases. Note that you will want to also expand the identity list to possibly include combinations of identity types (such as asian man). Refer to Rudinger et al. 2017 for ideas here.
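As a starting point for the n-gram extension, a document can be represented as the set of its unigrams and bigrams, so that a multi-word identity term such as "asian man" can be looked up like any single word. The sketch below is one possible approach under that assumption; the function name is hypothetical, and whether to build n-grams before or after stopword removal is a design choice left to you.

```python
def ngram_set(tokens, max_n=2):
    """Return the set of all n-grams (joined with spaces) up to length max_n.

    tokens: list of tokens for one premise or hypothesis, in order.
    """
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams
```

Because the result is a set, each n-gram is still counted at most once per document, which keeps the counting rule the same as for unigrams.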

Association measures

Perhaps PMI is not the best lexical association measure to see social bias with identity terms. Explore lexical association measures other than PMI; see Pecina 2010 for ideas.
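As an illustration of what swapping in a different measure might look like, the sketch below computes two widely used alternatives from the same counts that PMI needs: the Dice coefficient and the log-likelihood ratio over a 2x2 contingency table. These are examples chosen for illustration, not the measures the assignment prescribes; the function names and argument order are assumptions.

```python
import math

def dice(c_xy, c_x, c_y):
    """Dice coefficient: 2 * c(x, y) / (c(x) + c(y))."""
    return 2 * c_xy / (c_x + c_y)

def log_likelihood_ratio(c_xy, c_x, c_y, n):
    """G^2 statistic for the 2x2 table of x present/absent vs. y present/absent.

    c_xy: documents containing both words; c_x, c_y: documents containing each
    word; n: total number of documents.
    """
    observed = [c_xy, c_x - c_xy, c_y - c_xy, n - c_x - c_y + c_xy]
    row_totals = [c_x, c_x, n - c_x, n - c_x]
    col_totals = [c_y, n - c_y, c_y, n - c_y]
    g2 = 0.0
    for o, r, c in zip(observed, row_totals, col_totals):
        if o > 0:  # empty cells contribute nothing to the sum
            g2 += o * math.log(o * n / (r * c))
    return 2 * g2
```

Unlike PMI, the log-likelihood ratio is large for both strong positive and strong negative association, so it is worth checking the direction of an association separately (for example, by comparing the observed co-occurrence count with the expected one).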


Each student should submit their own 2-3 page report (ACL format). Please do not submit more than 4 pages, though you can put large tables and figures in an appendix beyond that if necessary. The report should include:

Grading (100 points)