Submission: Email your assignment to firstname.lastname@example.org. Attach two separate files: your write-up (titled Lastname_Firstname.pdf) and a zip/tar archive containing your code. Code will not be graded.
While many NLP tasks have standard data sets (e.g., SQuAD for Q&A), social-oriented tasks often involve real-world data that has few annotations and unclear research questions. The goals of this assignment are to: (1) practice methods for exploring unstructured corpora; (2) conduct an analysis of language used to describe social groups; and (3) consider the ethical implications of conducting analyses on sensitive social issues and unknown corpora.
The data for this assignment is available here
In this homework you will explore how the language used to describe and report news about the LGBTQ community has changed over time by analyzing newspaper articles from 1986 to 2015.
You are given two corpora: a corpus of articles focused on the LGBTQ community (lgbt_news_corpus.csv) and a background corpus (background_news_corpus.csv). The primary data for analysis is in the “Text” column, and the remaining columns are metadata which you can optionally use.
Completing the basic requirements will earn a passing (B-range) grade.
Analyze context: We provide a seed list (seed_list.txt) of words that are used to refer to members of the LGBTQ community. (Note: this list was not created by the organizers of this course, and you may modify it.) Analyze the context of the seed words with a window size of 3, i.e., the three words before and the three words after each occurrence, and visualize your analysis as word clouds. For example, if the sentence is “The pastoral guidelines remind priests that in dealing with homosexuals they must attentively adhere to the church’s magisterium, or traditional teachings, according to a church official.” and the seed word under consideration is “homosexuals”, then you will analyze the words “in”, “dealing”, “with”, “they”, “must”, and “attentively”. Consider how to subset the corpus to obtain interesting word clouds.
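As a rough illustration of the windowing step described above, here is a minimal sketch in Python. The helper names (tokenize, context_words) and the simple regex tokenizer are assumptions made for this example, not part of the assignment; it also only matches single-token seed words, so multi-word seed terms would need extra handling.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def context_words(text, seeds, window=3):
    """Collect up to `window` tokens on each side of every seed-word occurrence."""
    tokens = tokenize(text)
    context = []
    for i, tok in enumerate(tokens):
        if tok in seeds:
            context.extend(tokens[max(0, i - window):i])   # words before the seed
            context.extend(tokens[i + 1:i + 1 + window])   # words after the seed
    return context

# The example sentence from the assignment text (with a plain apostrophe).
sentence = (
    "The pastoral guidelines remind priests that in dealing with homosexuals "
    "they must attentively adhere to the church's magisterium, or traditional "
    "teachings, according to a church official."
)

# Frequency counts like these are what a word cloud would be drawn from.
counts = Counter(context_words(sentence, {"homosexuals"}))
print(sorted(counts))
# → ['attentively', 'dealing', 'in', 'must', 'they', 'with']
```

In practice you would accumulate the counts over every article in a subset of the corpus (for example, one time period) and pass the resulting frequencies to a word cloud library of your choice. Removing stop words before counting is one option to consider, since words such as "in" and "with" will otherwise dominate.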
Analyze sentiment: Explore the sentiment used in the articles related to the LGBTQ community and compare with the background articles. We provide a lexicon of positive and negative sentiment words (sentiment_lexicon.csv). Analyze the intersection of these sentiment words with the articles that describe the LGBTQ community and the background topics, and report the differences that you find.
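One possible starting point for the lexicon comparison is sketched below. The layout of sentiment_lexicon.csv is not described here, so the sketch assumes you have already loaded the lexicon into two Python sets; the function name sentiment_rates and the toy word lists and articles are purely illustrative.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def sentiment_rates(texts, positive, negative):
    """Return the share of all tokens that appear in each lexicon set.

    Normalizing by the total token count keeps corpora of different
    sizes comparable.
    """
    total = pos = neg = 0
    for text in texts:
        for tok in tokenize(text):
            total += 1
            pos += tok in positive
            neg += tok in negative
    if total == 0:
        return {"positive": 0.0, "negative": 0.0}
    return {"positive": pos / total, "negative": neg / total}

# Toy stand-ins for the lexicon and the two corpora (hypothetical data).
positive_words = {"good", "proud"}
negative_words = {"bad", "fear"}
corpus_a = ["A proud and good day", "No fear"]
corpus_b = ["A bad day", "Nothing to report"]

rates_a = sentiment_rates(corpus_a, positive_words, negative_words)
rates_b = sentiment_rates(corpus_b, positive_words, negative_words)
print(rates_a, rates_b)
```

Comparing the two dictionaries gives a first, coarse view of the differences. A simple word-matching approach like this ignores negation and context, which is worth keeping in mind when you report your findings.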
For full credit, complete an additional advanced analysis of this data set. Feel free to explore any further analysis directions. Here are a few references to give you ideas:
Thanks to Julia Mendelsohn (email@example.com) for sharing the data used in this assignment.