Computational Ethics for NLP

CMU CS 11830, Spring 2020

T/Th 10:30-11:50am, SH 214

Yulia Tsvetkov (office hours by appointment)
Alan W Black (office hours: Wednesdays 12-1pm, Zoom link on Piazza)
TA: Anjalie Field (office hours by appointment)
TA: Michael Miller Yoder (office hours by appointment)

HW 2: Civility in Communication

Due 11:59pm, Tuesday 3/3

Submission: Submit your assignment through Canvas. Include 3-4 separate files in a single zip/tar archive: your write-up (titled FirstName_LastName_hw2.pdf), your predictions over test.tsv (titled FirstName_LastName_test.tsv), your improved predictions over test.tsv (titled FirstName_LastName_advanced.tsv, if completed), and your code. Code will not be graded.


As we have discussed in class, abusive language on online platforms has become a major concern in the past few years. However, developing automated methods for flagging and censoring abusive language has proved to be difficult and prone to unwanted biases. The goals of this assignment are to (1) explore the challenges and ethical issues behind developing classifiers for identifying offensive language and (2) develop technical solutions that aim to address these challenges.


In this assignment, you will explore an off-the-shelf toxicity classifier as well as build your own models. In general, you will evaluate models using two criteria: (1) performance over hate speech detection (Accuracy and F1 Score where “NOT” is considered the positive label) and (2) False Positive Rate (FPR), i.e., how often the model misclassifies non-toxic speech as toxic, specifically for comments associated with different demographic dialects. Poor performance over hate speech classification suggests that the model is not accurate enough to be useful, while poor or imbalanced FPR indicates that the model may impose racial biases.
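The two criteria above can be sketched in plain Python as follows. This is a minimal illustration, not the official grading script: the function names are hypothetical, and the label strings "OFF" and "NOT" are assumed to match the label column (check “offenseval-annotation.txt” to confirm).

```python
def accuracy(gold, pred):
    """Fraction of examples where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_score(gold, pred, positive="NOT"):
    """F1 for one label treated as the positive class (assumed "NOT" here)."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(gold, pred, toxic="OFF"):
    """Share of gold non-toxic examples that the model labels as toxic."""
    non_toxic_preds = [p for g, p in zip(gold, pred) if g != toxic]
    if not non_toxic_preds:
        return 0.0
    return sum(p == toxic for p in non_toxic_preds) / len(non_toxic_preds)
```

Note that FPR is computed only over examples whose gold label is non-toxic, so it can be low even when accuracy is poor, and vice versa.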

The primary data for this assignment is available here. Please note that the data contains offensive or sensitive content, including profanity and racial slurs.

We provide data drawn from two sources. The first (files "train.tsv" and "dev.tsv") consists of tweets annotated for offensiveness taken from the 2019 SemEval task on offensive language detection. In the files "train.tsv" and "dev.tsv", the first column (text) contains the text of a tweet, the second column (label) contains an offensiveness label:

The file “offenseval-annotation.txt” provides additional details on the annotation scheme.

We additionally provide a data set of tweets proxy-labelled for race in the file titled “mini_demographic_dev.tsv”. This data is taken from the TwitterAAE data set and uses posterior proportions of demographic topics as a proxy for racial dialect (details). The first column (“text”) contains the text of the tweet, and the second column (“demographic”) contains a label: “AA” (for “African American”), “White”, “Hispanic”, or “Other”. For this assignment, we assume that no tweet in the TwitterAAE data set contains toxic language. Thus, any tweet in this file that is classified as toxic is a false positive.
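Because every tweet in this file is assumed to be non-toxic, the FPR for a demographic group reduces to the fraction of that group's tweets that a model flags as toxic. A minimal sketch under that assumption (the function name and the boolean `predicted_toxic` list are hypothetical, standing in for whatever your classifier outputs):

```python
from collections import Counter


def fpr_by_group(demographics, predicted_toxic):
    """Per-group FPR, assuming every tweet in the file is truly non-toxic.

    demographics:    list of labels such as "AA", "White", "Hispanic", "Other"
    predicted_toxic: list of booleans, True if the model flagged the tweet
    """
    totals = Counter(demographics)
    flagged = Counter(d for d, t in zip(demographics, predicted_toxic) if t)
    # Counter returns 0 for groups with no flagged tweets.
    return {group: flagged[group] / totals[group] for group in totals}
```

Comparing the resulting values across groups gives a direct view of whether FPR is imbalanced.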

Finally, both development sets (“dev.tsv” and “mini_demographic_dev.tsv”) contain a column “perspective_score”, which contains a toxicity score. These scores were obtained using the PerspectiveAPI tool released by Alphabet. This tool is intended to help “developers and publishers…give realtime feedback to commenters or help moderators do their job”.
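To turn the “perspective_score” column into toxic/non-toxic predictions, one option is to apply a cutoff. The sketch below assumes the scores lie in [0, 1] and uses a hypothetical threshold of 0.8; neither the threshold nor the helper names come from the assignment, and the inline sample stands in for a real file.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    # QUOTE_NONE keeps quotation marks inside tweets from being parsed as quoting.
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def perspective_predictions(rows, threshold=0.8):
    """True where the score meets the (assumed) toxicity threshold."""
    return [float(row["perspective_score"]) >= threshold for row in rows]


# Tiny made-up sample in the same column layout as mini_demographic_dev.tsv.
sample = (
    "text\tdemographic\tperspective_score\n"
    "@USER hello\tAA\t0.91\n"
    "nice day\tWhite\t0.12\n"
)
rows = read_tsv(io.StringIO(sample))
flags = perspective_predictions(rows)
```

For the real data, replace the `io.StringIO(sample)` argument with an opened file handle, for example `open("dev.tsv", encoding="utf-8", newline="")`. Varying the threshold and recomputing the metrics is one way to see the trade-off between detection performance and FPR.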

In all data sets, user mentions have been replaced with the token @USER.

Basic Requirements

Completing the basic requirements will earn a passing (B-range) grade.

Off-the-shelf Model Exploration

Custom Model Exploration

Test Set Predictions


Submit a 2-3 page report (ACL format) titled FirstName_LastName_hw2.pdf. Please do not submit more than 4 pages. The report should include:

Be sure to cite all references.

Advanced Analysis

Improve your preliminary classifier. You may aim to improve accuracy/F1 of hate speech classification, or FPR, or to improve both metrics simultaneously. If you choose to focus on one metric, still report results for the other metric and discuss any trade-offs. Creative model architectures or feature crafting will receive full credit, even if they do not improve results.

In your report, include a description of your model and results over dev.tsv. Additionally, use your improved classifier to predict results over test.tsv and place these predictions in a file titled FirstName_LastName_advanced.tsv.

To facilitate analysis, we provide a larger data set here. This extended data set contains full training and dev sets from the TwitterAAE data set, as well as additional data annotated for hate speech drawn from a different paper (ICWSM, 2017). Note that user mentions have not been replaced in this data set. You are free to explore any ideas you have. We provide a few pointers for inspiration.

If you choose to maximize performance over offensiveness classification, you may wish to develop a more sophisticated model for hate speech detection. Some prior work includes:

Models from prior SemEval tasks may also be helpful. Additionally, the provided "train.tsv" file contains annotations for different types of offensive language (e.g. untargeted vs. targeted, labels are in the third column titled "category"), which you may also consider leveraging.

If you choose to improve FPR, you may wish to leverage the provided demographic_train.tsv file. Data from this file could be used to balance your training data or to train a model with an adversarial objective. Some related work includes:
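As one illustration of the balancing idea, the sketch below draws an equal number of tweets per demographic group and labels them non-offensive, following the assignment's assumption that no TwitterAAE tweet is toxic. It is only one possible approach: the function name is hypothetical, and the column names “text” and “demographic” and the label string "NOT" are assumed to match the other provided files.

```python
import random
from collections import defaultdict


def balanced_non_toxic_sample(rows, per_group, seed=0):
    """Sample up to `per_group` tweets from each demographic group.

    rows: list of dicts with (assumed) keys "text" and "demographic".
    Returns (text, label) pairs labeled "NOT", ready to append to training data.
    """
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["demographic"]].append(row["text"])

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = []
    for group in sorted(by_group):
        texts = by_group[group]
        k = min(per_group, len(texts))  # small groups contribute all their tweets
        sampled.extend((text, "NOT") for text in rng.sample(texts, k))
    return sampled
```

Whether adding such examples helps is an empirical question; report both FPR and accuracy/F1 on dev.tsv before and after to expose any trade-off.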

Grading (100 points)

Implementation Tips