Yulia Tsvetkov (office hours: Tuesday 12-1pm, GHC 6405), email@example.com
Alan W Black (office hours: Wednesday 12-1pm, GHC 5701), firstname.lastname@example.org
TA: Anjalie Field (office hours: Thursday 3-4pm, GHC 6609), email@example.com
HW 2: Crowdsourced Annotations
Due 11:59pm, Thursday 2/21
Submission: Email your assignment to firstname.lastname@example.org. Attach 2 separate files: your write-up (titled Lastname_Firstname.pdf) and a zip/tar archive containing your code. Code will not be graded.
Crowdsourcing annotations has become a fundamental aspect of NLP research, but there are many ethical concerns around this type of data collection. The goal of this assignment is to explore: (1) the challenges behind creating an annotation scheme and (2) the ethical implications of soliciting crowdsourced data and reporting results.
In this homework, we provide a data set of comments written in response to TED talks. Our goal is to look for gender bias in this data set: how do comments on videos with male speakers differ from comments on videos with female speakers? We provide an annotation interface with a preliminary annotation scheme. You will first annotate the data set using the provided scheme and then analyze your annotations in order to improve the scheme. This assignment requires you to compare annotations with 1 or 2 other students.
Completing the basic requirements will earn a passing (B-range) grade.
Round 1 Annotations:
First, follow the instructions below (under “Technical Details”) to run the annotation interface and annotate the data in data/common1.csv. Then, collaborate with 1-2 other students in the class in order to:
Calculate the inter-annotator agreement (Fleiss' kappa) within your group for the attributes PosterTone, ExpertiseScale, EncouragingDiscouraging, and RespectfulDisrespectful
For the attribute PosterTone, report at least 2 different metrics for measuring agreement. Try to find the lowest and the highest agreement scores you can (Hint: consider metrics like pairwise Cohen's kappa or percent agreement, and try collapsing categories)
Discuss any challenges you faced in conducting the annotations and analyze where you agreed or disagreed
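The agreement metrics named above can be sketched in plain Python. This is a minimal illustration, not a reference implementation for the assignment: the label values ("positive", "neutral", "negative"), the toy ratings, and the collapse mapping are all invented for the example and are not taken from the annotation interface.

```python
from collections import Counter
from itertools import combinations


def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings` is a list of items; each item is the list of
    labels assigned by the annotators (same number of annotators per item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    category_totals = Counter()
    per_item = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of annotator pairs that agree on this item.
        agreeing = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing / (n_raters * (n_raters - 1)))
    p_observed = sum(per_item) / n_items
    total = n_items * n_raters
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


def percent_agreement(ratings):
    """Share of annotator pairs, pooled over all items, that chose the same label."""
    agree = total = 0
    for item in ratings:
        for x, y in combinations(item, 2):
            agree += x == y
            total += 1
    return agree / total


# Invented toy data: 4 comments, 3 annotators each (hypothetical label set).
toy_ratings = [
    ["positive", "positive", "positive"],
    ["positive", "neutral", "positive"],
    ["negative", "negative", "neutral"],
    ["neutral", "neutral", "neutral"],
]

# Hypothetical collapse: merge "positive" and "neutral" into one category.
collapse = {"positive": "non-negative", "neutral": "non-negative",
            "negative": "negative"}
collapsed = [[collapse[label] for label in item] for item in toy_ratings]

for name, data in [("original", toy_ratings), ("collapsed", collapsed)]:
    first, second = [i[0] for i in data], [i[1] for i in data]
    print(name,
          "fleiss=%.3f" % fleiss_kappa(data),
          "percent=%.3f" % percent_agreement(data),
          "cohen(1,2)=%.3f" % cohen_kappa(first, second))
```

On this toy data, collapsing categories raises percent agreement (from 2/3 to 5/6) but lowers Fleiss' kappa (from 7/15 to 2/5), because chance agreement rises as well. The choice of metric and category granularity can therefore move the reported number in either direction.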
Round 2 Annotations:
Based on your results from Round 1:
Propose at least 3 ways to improve the annotation scheme
Annotate the data in data/common2.csv using your revised scheme (you do not have to modify the actual interface; you may conduct the annotations in a spreadsheet if you prefer)
Calculate inter-annotator agreement over the new annotations and compare your scheme to the original scheme.
Each student should submit their own 2-4 page report (ACL format). The report should include:
An analysis of inter-annotator agreement from Round 1, including a discussion of the possible ways to report agreement and their ethical implications
A critique of the original annotation scheme and your proposed changes
An analysis of inter-annotator agreement from Round 2
A description of your methodology and findings from the Advanced Analysis (if completed)
A brief discussion of the ethics of crowdsourcing annotations that broadly addresses the question: what factors should be considered when designing an annotation task? Here are some dimensions to consider:
Jaroszewski, Samantha, et al. "Genderfluid or Attack Helicopter: Responsible HCI Research Practice with Non-binary Gender Variation in Online Communities." Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 2018.
Schmidt, Florian Alexander. "The good, the bad and the ugly: Why crowdsourcing needs ethics." Cloud and Green Computing (CGC), 2013 Third International Conference on. IEEE, 2013.
Wauthier, Fabian L., and Michael I. Jordan. "Bayesian bias mitigation for crowdsourcing." Advances in Neural Information Processing Systems. 2011.
Advanced Analysis
We have provided annotations over a larger portion of data from the same data set (aggregate.csv). This data set includes annotator ratings for EncouragingDiscouraging, ExpertiseScale, PosterTone, and RespectfulDisrespectful, as well as the actual gender of the TED talk speaker.
Using these annotations, conduct an analysis that contrasts traits of comments addressed to female and male speakers. You may analyze any one of the annotation traits (e.g., RespectfulDisrespectful), but justify your choice of trait.
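One possible way to contrast a trait across the two groups is a permutation test on the difference in mean rating. The sketch below is only one option among many. It assumes the chosen trait can be treated as a numeric score, and the field names "speaker_gender" and "rating" and the toy rows are hypothetical stand-ins, since the actual column layout of aggregate.csv is not described here.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed difference, estimated p-value). Group labels are
    shuffled repeatedly to see how often a difference at least as large
    as the observed one arises by chance."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def split_by_gender(rows, trait="rating", group="speaker_gender"):
    """Group numeric trait scores by speaker gender (hypothetical field names)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group], []).append(float(row[trait]))
    return groups


# Invented toy rows; real rows would come from aggregate.csv,
# for example via csv.DictReader.
toy_rows = [
    {"speaker_gender": "female", "rating": 2},
    {"speaker_gender": "female", "rating": 3},
    {"speaker_gender": "female", "rating": 2},
    {"speaker_gender": "female", "rating": 4},
    {"speaker_gender": "male", "rating": 4},
    {"speaker_gender": "male", "rating": 3},
    {"speaker_gender": "male", "rating": 5},
    {"speaker_gender": "male", "rating": 4},
]

groups = split_by_gender(toy_rows)
difference, p_value = permutation_test(groups["female"], groups["male"])
print("mean difference (female - male): %.2f, p = %.3f" % (difference, p_value))
```

A permutation test makes few distributional assumptions, which suits small or skewed rating data. If the trait is categorical rather than ordinal, a test on category counts would be a better fit than a comparison of means.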
Grading (100 points)
20 points - Submitting assignment
40 points - Completing basic requirements
20 points - Write-up is well-written, presents meaningful analysis, and contains all requested information
15 points - Advanced analysis
5 points - Discussion of ethical implications is particularly thoughtful or well-researched
Technical Details
The annotation interface is located inside the folder “annotation”. NOTE: there is no way to pause the annotation interface and resume where you left off later. You will need to complete each set of annotations in one sitting.
To run the annotation interface:
Make sure that the name of the file you want to annotate is correct in the run.py script on line 9: read_data(f='filename')
Run the interface using: python run.py (The program requires Python 3 and the Google Chrome browser to run correctly. You will also need to install Flask.)
Open http://127.0.0.1:5000/ in your Chrome browser.
Follow the instructions to complete the annotations. The interface requires a selection in every category; otherwise, the program will not proceed.
After you complete the annotations, the final comment will read “DONE”. Then click the "Download as CSV" button to download the annotations as a CSV file named "myplot.csv".
Thank you Rob Voigt for creating the annotation interface.