
# Homework 2

Due 11:59pm on Thursday, Mar. 6, 2014 (Thursday before Spring Break)

Automatic evaluation of machine translation output relative to a human-generated reference translation (or, if available, a set of reference translations) is a core challenge in machine translation. Such evaluation measures are not only useful for determining whether a change to an MT system improved or degraded its performance; the evaluation problem is also a simpler version of the translation problem itself: an effective evaluation measure must determine whether a hypothesis is fluent and whether it has the same meaning as the reference. These roughly correspond to the language model and translation model in the noisy channel decomposition of the translation problem we talked about. As such, studying evaluation is a way to study the translation problem.

Additionally, as we will see later in the semester, automatic MT evaluation plays a central role in learning the parameters of a translation model. An evaluation measure can be used to assign “partial credit” to translations produced by the MT system that do not precisely match the provided reference translation. This partial credit can enable more robust discriminative learning from fewer training instances.

## Research Question: Evaluating Morphologically Rich Languages

Automatic evaluation is by no means a solved problem. In particular, evaluating translations into morphologically rich languages (MRLs) is an open research question. MRLs have productive word formation processes that build new words out of a smaller inventory of underlying morphemes. While all languages can form new words, MRLs create words more frequently using regular rules, and they often express grammatical relations or semantic properties that would be expressed using word order or independent lexical items in morphologically simpler languages, such as English or Mandarin. As a result, a translation into an MRL will generally have a higher density of singleton words than a translation into a morphologically simple language. Since evaluation metrics generally work by matching tokens between hypothesis and reference translations, they are less reliable for MRLs: translations with similar meanings may use different word types due to morphological variation.

We focus on the evaluation of English$\rightarrow$Czech translations in this homework (although you are welcome to develop evaluation tools that are not language specific). We have chosen Czech because it exhibits considerable inflectional complexity (every verb, noun, and adjective may have dozens or hundreds of inflected forms), none of you speak it as a native language (so no one has an unfair advantage!), and because there are a large number of human judgements of actual translation outputs. We will use these human judgements to evaluate the quality of your automatic evaluation systems and to provide some data to learn from (we even provide the learning algorithms!).

This assignment is an opportunity for a publication (by default, we assume you will participate, but you are welcome to opt out; notify us by email or on Piazza). The 2014 Workshop on Machine Translation, which will be held in conjunction with ACL this summer, is hosting a “shared task” (a bake-off for MT researchers) on the challenge of developing new machine translation metrics. These will be evaluated relative to human judgements of system quality. We will be participating in this task by combining all of the features you develop into a metric whose weights are learned in a learning-to-rank framework.

## Getting started

Go to your clone of your course GitHub repository on the machine where you will be doing this assignment, and run the following command to obtain the code and data you will need:

./tools/get-new-assignments

In this homework, you will study the translation evaluation problem as follows. You will be given a Czech-language translation hypothesis ($\textbf{h}$) and a reference translation ($\textbf{r}$) that was generated by a human translator. Your task will be to develop a scoring function $f(\textbf{h},\textbf{r}) \mapsto \mathbb{R}$ that measures the translation quality (adequacy, fluency, etc.) where higher scores indicate better quality.

To illustrate (using a German$\rightarrow$English example), suppose that $\textbf{h}_1$ and $\textbf{h}_2$ are the translation outputs from two different translation systems of the following input:

src = Chris hat die Prüfung bestanden, jedoch nur knapp.

where the two systems produce:

$\textbf{h}_1 = $ Chris has insisted on the test, but just barely.
$\textbf{h}_2 = $ Chris has passed the test, but just barely.

Let us assume that we are given the following reference translation:

$\textbf{r} = $ Chris just barely passed the test.

We would hope that our automatic evaluation function $f$ would have the property that $f(\textbf{h}_1,\textbf{r}) < f(\textbf{h}_2,\textbf{r})$, indicating that $\textbf{h}_2$ is a better translation.

## Learning to Rank

To solve this problem, we are going to use a learning-to-rank framework based on a proportional odds model. We provide you with pairwise ranking judgements comparing two translation alternatives for a single source sentence, along with a maximum likelihood learning algorithm; your job is to develop features that will generalize well to held-out data (see also another write-up about the proportional odds model).

We will parameterize the evaluation function $f$ using a $d$-dimensional feature function vector $\boldsymbol{\phi}(\textbf{h},\textbf{r}) \mapsto \mathbb{R}^d$ that computes features of an input/hypothesis pair, and a $d$-dimensional weight vector $\boldsymbol{w}$. The evaluation function $f(\textbf{h},\textbf{r})$ is defined to be $\boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h},\textbf{r})$.
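In code, this linear scoring function is just a dot product over a sparse feature map. The sketch below illustrates the idea with two made-up features (length ratio and unigram overlap); these are illustrative placeholders, not the features the baseline extractor computes:

```python
# A minimal sketch of the linear evaluation function f(h, r) = w . phi(h, r).
# The two features here are illustrative placeholders, not the provided ones.

def phi(h, r):
    """Map a (hypothesis, reference) pair to a sparse feature dict."""
    h_toks, r_toks = h.split(), r.split()
    overlap = len(set(h_toks) & set(r_toks))
    return {
        "len_ratio": len(h_toks) / max(len(r_toks), 1),
        "unigram_overlap": overlap / max(len(h_toks), 1),
    }

def f(h, r, w):
    """Score = dot product of the weight vector and the feature vector."""
    return sum(w.get(name, 0.0) * value for name, value in phi(h, r).items())

# Hypothetical weights; in the assignment these are learned by ./fit.
w = {"len_ratio": 0.1, "unigram_overlap": 1.0}
score = f("Chris has passed the test , but just barely .",
          "Chris just barely passed the test .", w)
```

With weights like these, the better translation $\textbf{h}_2$ from the example above receives a higher score than $\textbf{h}_1$, which is exactly the property we want $f$ to have.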

To set the parameters, we will provide you with a set of training data: human judgements of translation quality. In particular, we will provide judgements about pairs of translation hypotheses, given a reference. Annotators had three options: $\textbf{h}_1 < \textbf{h}_2$, $\textbf{h}_1 \approx \textbf{h}_2$, or $\textbf{h}_1 > \textbf{h}_2$. We designate these outcomes $y=\{-1,0,1\}$, respectively.

We relate the evaluation function score $f = \boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h},\textbf{r})$ to the judgements using the following model:

\begin{align*} Z &= f(\textbf{h}_2, \textbf{r}) - f(\textbf{h}_1, \textbf{r}) \\ &= \boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h}_2,\textbf{r}) - \boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h}_1,\textbf{r}) \\ &= \boldsymbol{w}^{\top} \left( \boldsymbol{\phi}(\textbf{h}_2,\textbf{r}) - \boldsymbol{\phi}(\textbf{h}_1,\textbf{r}) \right) \\ Y &= \begin{cases} -1 & \textrm{if }Z \in (-\infty,\alpha_0] \\ 0 & \textrm{if }Z \in (\alpha_0,\alpha_1] \\ 1 & \textrm{if }Z \in (\alpha_1,+\infty) \end{cases} \\ p(Y < j \mid \textbf{h}_1, \textbf{h}_2, \textbf{r}) &= \mathrm{logit}^{-1}\left(\alpha_j - \boldsymbol{w}^{\top}\left( \boldsymbol{\phi}(\textbf{h}_2,\textbf{r}) - \boldsymbol{\phi}(\textbf{h}_1,\textbf{r}) \right)\right) \quad \textrm{for } j \in \{0,1\} \end{align*}

In this model, the intercepts $-\infty < \alpha_0 < \alpha_1 < \infty$ indicate where the boundaries between the ranking categories lie. We provide code to learn the weights $(\boldsymbol{w},\alpha_0,\alpha_1)$ given a set of pairwise judgements so as to maximize the likelihood according to the above model. We encourage you to use this code, but you are welcome to devise another learning-to-rank algorithm.
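To make the model concrete, here is a small sketch of the category probabilities it assigns, reading the cumulative probability as the inverse logit (i.e., the logistic function) of the intercept minus the score difference. The specific intercept values below are made up for illustration:

```python
import math

def sigmoid(x):
    """Inverse logit (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-x))

def category_probs(z, alpha0, alpha1):
    """Return (p(Y=-1), p(Y=0), p(Y=1)) under the proportional odds model,
    where z = w . (phi(h2, r) - phi(h1, r)) and alpha0 < alpha1."""
    p_le_neg1 = sigmoid(alpha0 - z)  # cumulative: p(Y <= -1)
    p_le_0 = sigmoid(alpha1 - z)     # cumulative: p(Y <= 0)
    return p_le_neg1, p_le_0 - p_le_neg1, 1.0 - p_le_0
```

Note how a large positive score difference $Z$ (i.e., $\textbf{h}_2$ scores much higher than $\textbf{h}_1$) pushes nearly all probability mass onto the judgement $Y = 1$.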

## Running the Tools

Your task will be to extract features from hypothesis/reference pairs $(\textbf{h},\textbf{r})$ that are effective at ranking hypotheses. You will be given a series of inputs of the form:

This is my reference translation ||| That are my hypothesized translation
This is another reference ||| This be a further translation

For each line, you will extract a feature map of the following form:

{"Feature1":1.0, "SomeOtherFeature":-9.8}
{"Feature2":-0.5, "Feature1":2.2}

It’s as simple as that! There is a baseline feature extractor called ./extract and code to fit the data. Note: you will need to install the creg regression package to run the fitting code (download the source tarball). Building creg requires the Boost C++ libraries, or a machine where they are already installed.
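A feature extractor in this style is little more than a loop that splits each line on the ||| separator and prints one JSON object per line. The sketch below shows the shape of such an extractor (the feature names and the helper `extract_file` are illustrative; the real ./extract reads from stdin and writes to stdout):

```python
import json

def extract_features(hyp, ref):
    """Compute a feature map for one (hypothesis, reference) pair.
    These two features are placeholders for the ones you will design."""
    h, r = hyp.split(), ref.split()
    matches = len(set(h) & set(r))
    return {
        "precision": matches / max(len(h), 1),
        "recall": matches / max(len(r), 1),
    }

def extract_file(lines):
    """Turn '<reference> ||| <hypothesis>' lines into JSON feature maps,
    one per input line, matching the format ./fit expects."""
    out = []
    for line in lines:
        ref, hyp = [side.strip() for side in line.split("|||")]
        out.append(json.dumps(extract_features(hyp, ref)))
    return out

sample = ["This is another reference ||| This be a further translation"]
feature_lines = extract_file(sample)
```

Each output line is a standalone JSON object, so the downstream fitting code can stream through the file one instance at a time.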

Important: in order for the fit command described below to function, you must set the CREG environment variable to point to your creg binary. This looks something like the following:

export CREG=/path/to/creg/creg/creg

The workflow for this project involves four parts: (1) feature extraction, (2) parameter fitting, (3) computing the scores for the training/dev/test data, and (4) evaluation. You will find scripts to do all of these things:

./extract > myfeatures.json
./fit --l2 1.0 < myfeatures.json > weights.json
./score -w weights.json < myfeatures.json > output.txt
./evaluate < output.txt

The evaluate command scores the quality of your evaluation function relative to human judgements of translation quality and reports Kendall’s $\tau$ (this is also the official evaluation metric for the shared task).
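On pairwise judgements, Kendall’s $\tau$ reduces to the normalized difference between concordant pairs (where your metric agrees with the human preference) and discordant pairs. The sketch below shows this computation under the assumption that human ties are skipped and a metric tie counts as disagreement; the exact tie handling in ./evaluate may differ:

```python
def kendalls_tau(pairs):
    """pairs: iterable of (score_h1, score_h2, human_y), where human_y is
    -1 if h1 is judged worse than h2 and 1 if h1 is judged better.
    Returns (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    for s1, s2, y in pairs:
        if y == 0:
            continue  # skip human ties (an assumption of this sketch)
        metric_pref = 1 if s1 > s2 else -1  # metric ties count as discordant
        if metric_pref == y:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

A metric that always agrees with the human judgements gets $\tau = 1$; one that always disagrees gets $\tau = -1$; chance performance is near $0$.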

## The Challenge: Features for Evaluation

Your task for this assignment is to improve the accuracy of your translation evaluator relative to human judgements of translation quality as much as possible by engineering features (i.e., modifying extract) that help predict whether a hypothesized translation is good. You do not need to make any changes to fit, score, or evaluate (although you may want to try different values for the --l2 option).

Implementing the simple morphological METEOR score is enough to earn seven points. The simple morphological METEOR score is defined as the weighted harmonic mean of four quantities: the precision and recall of full word matches, and the precision and recall of word matches when every hypothesis and reference word is truncated to its first 6 letters.
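A minimal sketch of this baseline, assuming simple whitespace tokenization, clipped match counts, and placeholder mixing weights (the exact weighting of the four quantities is yours to choose or to tune with the provided learner):

```python
from collections import Counter

def simple_meteor(hyp, ref, alpha=0.5, beta=0.5):
    """Sketch of the simple morphological METEOR baseline: a weighted
    harmonic mean of precision/recall of exact word matches and of
    matches over words truncated to their first 6 letters.
    alpha and beta are placeholder mixing weights."""
    def pr(h_toks, r_toks):
        # clipped counts: each reference token can be matched at most once
        h_counts, r_counts = Counter(h_toks), Counter(r_toks)
        matches = sum(min(h_counts[w], r_counts[w]) for w in h_counts)
        return matches / max(len(h_toks), 1), matches / max(len(r_toks), 1)

    h, r = hyp.split(), ref.split()
    p_full, r_full = pr(h, r)
    p_trunc, r_trunc = pr([w[:6] for w in h], [w[:6] for w in r])
    quantities = [p_full, r_full, p_trunc, r_trunc]
    weights = [alpha, 1 - alpha, beta, 1 - beta]  # placeholder weighting
    if min(quantities) == 0:
        return 0.0  # harmonic mean is zero if any quantity is zero
    return sum(weights) / sum(w / q for w, q in zip(weights, quantities))
```

Because the truncated matches fire whenever two inflected forms share their first six letters, this gives partial credit for morphological variants that an exact-match metric would miss.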

This is a very simple baseline to implement, and we are particularly enthusiastic for you to experiment with new techniques for assessing translation quality. The sky is the limit! Here are some ideas that might inspire you:

• Compute string similarity using string subsequence kernels
• Use unsupervised morphology (here’s one tool, here’s another) to split complex words into simpler pieces.
• Use an $n$-gram language model to better assess fluency
• Develop a single-sentence variant of BLEU
• Use vector space representations of words to assess semantic similarity.
• Use vector space representations of sentences to assess semantic similarity.
• Use a dependency parser to assess syntactic wellformedness (paper)
• Use a lexical translation model (e.g., from your previous assignment) to “translate” (i.e., paraphrase) the reference into the hypothesis, and compute the perplexity or likelihood
• See what evaluation measures other people have implemented (paper)
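As one concrete starting point for the string-similarity idea above, here is a crude character-level proxy: a plain longest-common-subsequence ratio (not a true subsequence kernel, and the choice to normalize by reference length is an assumption of this sketch):

```python
def lcs_ratio(hyp, ref):
    """Length of the longest common character subsequence of hyp and ref,
    normalized by the reference length. A cheap character-level similarity
    that gives partial credit to morphological variants sharing a stem."""
    m, n = len(hyp), len(ref)
    # standard O(m*n) dynamic program for LCS length
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if hyp[i] == ref[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(n, 1)
```

For example, the Czech forms “kočka” and “kočky” (nominative vs. genitive of “cat”) share the subsequence “kočk”, so they score well above zero even though an exact word match fails.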

Be creative!

You may also find it useful to use extra data or tools for this task (see the Ground Rules below for what is permitted).

## Ground Rules

• You may work independently or in groups of any size, under these conditions:

• You must notify us by posting a public note to Piazza including the e-mails of everyone who will be working in the group (max=3).
• Everyone in the group will receive the same grade on the assignment.
• Once you have formed a group, you may not disband until the next homework.
• You must turn in the following by submitting to the public GitHub repository

• hw2/output.txt - your scored (reference, hypothesis) pairs (1 per line)
• hw2/README.md - a brief description of the algorithms you tried.
• hw2/... - your source code and revision history. We want to see evidence of regular progress over the course of the project. You don’t have to git push to the public repository unless you want to, but you should be committing changes with git add and git commit. We expect to see evidence that you are trying things out and learning from what you see.

You should feel free to use additional data resources such as thesauruses, WordNet, or parallel data. You are also free (and encouraged!) to use additional codebases, tools, and libraries except for those designed to evaluate machine translation systems. You must write your own evaluation function. However, if you want your evaluation to depend on lemmatizers, stemmers, automatic parsers, or part-of-speech taggers, or you would like to learn a metric using a general machine learning toolkit, that is fine. But translation evaluators, including (but not limited to) available implementations of BLEU, METEOR, TER, NIST, and others, are not permitted. You may of course inspect these systems if you want to understand how they work, although they tend to include other functionality that is not the focus of this assignment. It is possible to complete the assignment with a very modest amount of code. If you aren’t sure whether something is permitted, ask us.

## Acknowledgements

This assignment is inspired by an assignment given in the course developed by Adam Lopez, Chris Callison-Burch, and Matt Post.

 Unless otherwise indicated, this content has been adapted from this course by Chris Dyer. Both the original and new content are licensed under a Creative Commons Attribution 3.0 Unported License.