Due 11:59pm on Thursday, Mar. 6, 2014 (Thursday before Spring Break)
Automatic evaluation of machine translation output relative to a human-generated reference translation (or, if available, a set of reference translations) is a core challenge in machine translation. Such evaluation measures are useful not only for determining whether a change to an MT system improved or degraded its performance. Being able to solve the translation evaluation problem also provides a simpler version of the translation problem itself: an effective evaluation measure must determine whether a hypothesis is fluent and whether it has the same meaning as the reference. These roughly correspond to the language model and translation model in the noisy channel decomposition of the translation problem we talked about. As such, studying evaluation is a way to study the translation problem.
Additionally, as we will see later in the semester, automatic MT evaluation plays a central role in learning the parameters of a translation model. An evaluation measure can be used to assign “partial credit” to translations produced by the MT system that do not precisely match the provided reference translation. This partial credit can enable more robust discriminative learning from fewer training instances.
Automatic evaluation is by no means a solved problem. In particular, evaluating translations into morphologically rich languages (MRLs) is an open research question. MRLs have productive word formation processes that build new words out of a smaller inventory of underlying morphemes. While all languages can form new words, MRLs create words more frequently using regular rules, and they often express grammatical relations or semantic properties that would be expressed using word order or independent lexical items in morphologically simpler languages, such as English or Mandarin. As a result, a translation into an MRL will generally have a higher density of singleton words than a translation into a morphologically simple language. Since evaluation metrics generally work by matching tokens between hypothesis and reference translations, they are less reliable for MRLs: similar translations may use different word types due to morphological variation.
We focus on the evaluation of English$\rightarrow$Czech translations in this homework (although you are welcome to develop evaluation tools that are not language specific). We have chosen Czech because it exhibits considerable inflectional complexity (every verb, noun, and adjective may have dozens or hundreds of inflected forms), none of you speak it as a native language (so no one has an unfair advantage!), and because there are a large number of human judgments of actual translation outputs. We will use these human judgements to evaluate the quality of your automatic evaluation systems and provide some data to learn from (we even provide the learning algorithms!).
This assignment is an opportunity for a publication (by default, we assume you will participate, but you are welcome to opt out; notify us by email or on Piazza). The 2014 Workshop on Machine Translation, which will be held in conjunction with ACL this summer, is hosting a “shared task” (a bake-off for MT researchers) on the challenge of developing new machine translation metrics. These will be evaluated relative to human judgements of system quality. We will be participating in this task by combining all of the features you develop into a metric whose weights are learned in a learning-to-rank framework.
Go to your clone of your course GitHub repository on the machine where you will be doing this assignment, and run the following command to obtain the code and data you will need:
./tools/get-new-assignments
In this homework, you will study the translation evaluation problem as follows. You will be given a Czech-language translation hypothesis ($\textbf{h}$) and a reference translation ($\textbf{r}$) that was generated by a human translator. Your task will be to develop a scoring function $f(\textbf{h},\textbf{r}) \mapsto \mathbb{R}$ that measures the translation quality (adequacy, fluency, etc.) where higher scores indicate better quality.
To illustrate (using a German$\rightarrow$English example), suppose that $\textbf{h}_1$ and $\textbf{h}_2$ are the translation outputs from two different translation systems for the following input:
where the two systems produce:
Let us assume that we are given the following reference translation:
We would hope that our automatic evaluation function $f$ would have the property that $f(\textbf{h}_1,\textbf{r}) < f(\textbf{h}_2,\textbf{r})$, indicating that $\textbf{h}_2$ is a better translation.
To solve this problem, we are going to use a learning-to-rank framework based on a proportional odds model. We provide you with pairwise ranking judgements between two translation alternatives for a single source sentence and a maximum likelihood learning algorithm; your job is to develop features that will generalize well to held-out data (see also another write-up about the proportional odds model).
We will parameterize the evaluation function $f$ using a $d$-dimensional feature function vector $\boldsymbol{\phi}(\textbf{h},\textbf{r}) \mapsto \mathbb{R}^d$ that computes features of an input/hypothesis pair, and a $d$-dimensional weight vector $\boldsymbol{w}$. The evaluation function $f(\textbf{h},\textbf{r})$ is defined to be $\boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h},\textbf{r})$.
To set the parameters, we will provide you with a set of training data: human judgements of translation quality. In particular, we will provide judgements about pairs of translation hypotheses, given a reference. Annotators had three options: $\textbf{h}_1 < \textbf{h}_2$, $\textbf{h}_1 \approx \textbf{h}_2$, or $\textbf{h}_1 > \textbf{h}_2$. We designate these outcomes $y=\{-1,0,1\}$, respectively.
We will relate the evaluation function score $f = \boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h},\textbf{r})$ to the judgements using the following model:
$$\begin{align*} Z &= f(\textbf{h}_2, \textbf{r}) - f(\textbf{h}_1, \textbf{r}) \\ &= \boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h}_2,\textbf{r}) - \boldsymbol{w}^{\top} \boldsymbol{\phi}(\textbf{h}_1,\textbf{r}) \\ &= \boldsymbol{w}^{\top} \left( \boldsymbol{\phi}(\textbf{h}_2,\textbf{r}) - \boldsymbol{\phi}(\textbf{h}_1,\textbf{r}) \right) \\ Y &= \begin{cases} -1 & \textrm{if }Z \in (-\infty,\alpha_0] \\ 0 & \textrm{if }Z \in (\alpha_0,\alpha_1] \\ 1 & \textrm{if }Z \in (\alpha_1,+\infty) \end{cases} \\ p(Y < j \mid \textbf{h}_1, \textbf{h}_2, \textbf{r}) &= \mathrm{logistic}(\alpha_{j-1} - \boldsymbol{w}^{\top}\left( \boldsymbol{\phi}(\textbf{h}_2,\textbf{r}) - \boldsymbol{\phi}(\textbf{h}_1,\textbf{r}) \right)) \end{align*}$$
In this model, the intercepts $-\infty < \alpha_0 < \alpha_1 < \infty$ indicate where the boundaries between the ranking categories lie. We provide code to learn the weights $(\boldsymbol{w},\alpha_0,\alpha_1)$ given a set of pairwise judgements so as to maximize the likelihood according to the above model. We encourage you to use this code, but you are welcome to devise another learning-to-rank algorithm.
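The probabilities the model assigns to the three outcomes can be sketched as follows. This is an illustrative reimplementation under the definitions above, not the provided learning code; the function and variable names are our own.

```python
import math

def logistic(x):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))

def judgement_probs(w, phi1, phi2, alpha0, alpha1):
    """Return (p(Y=-1), p(Y=0), p(Y=1)) for one hypothesis pair.

    w, phi1, phi2 are equal-length feature/weight lists; alpha0 < alpha1
    are the learned intercepts."""
    # Z = w . (phi(h2, r) - phi(h1, r))
    z = sum(wi * (p2 - p1) for wi, p1, p2 in zip(w, phi1, phi2))
    # Cumulative probabilities from the proportional odds model.
    p_lt_0 = logistic(alpha0 - z)   # p(Y < 0) = p(Y = -1)
    p_lt_1 = logistic(alpha1 - z)   # p(Y < 1) = p(Y <= 0)
    return p_lt_0, p_lt_1 - p_lt_0, 1.0 - p_lt_1
```

For example, if the score difference $Z$ is large and positive, most of the probability mass falls on $Y=1$, i.e., the model believes $\textbf{h}_2$ is the better translation.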
Your task will be to extract features from hypothesis/reference pairs $(\textbf{h},\textbf{r})$ that are effective at ranking hypotheses. You will be given a series of inputs of the form:
This is my reference translation ||| That are my hypothesized translation
This is another reference ||| This be a further translation
For each line, you will extract a feature map of the following form:
{"Feature1":1.0, "SomeOtherFeature":-9.8}
{"Feature2":-0.5, "Feature1":2.2}
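As a sketch, a feature extractor producing lines in this format might look like the following. The feature names and the features themselves are illustrative assumptions, not the ones computed by the provided baseline extractor.

```python
import json

def extract_features(hyp, ref):
    """Compute a few illustrative features for a hypothesis/reference
    pair (both given as lists of tokens)."""
    h, r = set(hyp), set(ref)
    overlap = len(h & r)
    return {
        "precision": overlap / len(hyp) if hyp else 0.0,
        "recall": overlap / len(ref) if ref else 0.0,
        "length_ratio": len(hyp) / len(ref) if ref else 0.0,
    }

# Each input line is "reference ||| hypothesis"; emit one JSON map per line.
line = "This is my reference translation ||| That are my hypothesized translation"
ref, hyp = (side.split() for side in line.split(" ||| "))
print(json.dumps(extract_features(hyp, ref)))
```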
It’s as simple as that! There is a baseline feature extractor called ./extract and code to fit the data. Note: you will need to install the creg regression package to run the fitting code (download the source tarball). Building creg will require installing the Boost C++ libraries or using a machine where they are installed.
Important: in order for the fit command described below to function, you must set the CREG environment variable to point to your creg binary. This looks something like the following:
export CREG=/path/to/creg/creg/creg
The workflow for this project involves four parts: (1) feature extraction, (2) parameter fitting, (3) computing the scores for the training/dev/test data, and (4) evaluation. You will find scripts to do all of these things:
./extract > myfeatures.json
./fit --l2 1.0 < myfeatures.json > weights.json
./score -w weights.json < myfeatures.json > output.txt
./evaluate < output.txt
The evaluate command scores the quality of your evaluation function relative to human judgements of translation quality and reports Kendall’s $\tau$ (this is also the official evaluation metric for the shared task).
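Over pairwise judgements, Kendall’s $\tau$ is the difference between the number of concordant and discordant pairs, normalized by their total. A minimal sketch (ties are simply skipped here; the provided evaluate script may treat them differently):

```python
def kendall_tau(gold, predicted):
    """Kendall's tau over pairwise judgements: gold[i] is the human
    preference for pair i (-1, 0, or 1) and predicted[i] is the sign of
    f(h2, r) - f(h1, r) according to the metric being evaluated."""
    concordant = discordant = 0
    for g, p in zip(gold, predicted):
        if g == 0 or p == 0:
            continue  # skip ties in this simplified version
        if g == p:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```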
Your task for this assignment is to improve the accuracy of your translation evaluator relative to human judgements of translation quality as much as possible by engineering features (i.e., modifying extract) that help predict whether a hypothesized translation is good. You do not need to make any changes to fit, score, or evaluate (although you may want to try different values for the --l2 option).
The simple morphological METEOR score is enough to earn seven points. It is defined as the weighted harmonic mean of four quantities: the precision and recall of full word matches between the hypothesis and the reference, and the precision and recall of matches between hypothesis and reference words truncated to their first 6 letters.
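That definition can be sketched as follows. The equal weighting of the four components is an assumption of this sketch; choosing (and tuning) the weights is part of the assignment.

```python
from collections import Counter

def precision_recall(hyp, ref):
    """Clipped unigram matches between two token lists."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    p = matches / len(hyp) if hyp else 0.0
    r = matches / len(ref) if ref else 0.0
    return p, r

def morph_meteor(hyp, ref, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted harmonic mean of: precision and recall of full word
    matches, and precision and recall of words truncated to their
    first 6 letters (a crude stand-in for morphological analysis)."""
    p_full, r_full = precision_recall(hyp, ref)
    hyp6 = [w[:6] for w in hyp]
    ref6 = [w[:6] for w in ref]
    p_stem, r_stem = precision_recall(hyp6, ref6)
    vals = (p_full, r_full, p_stem, r_stem)
    if any(v == 0.0 for v in vals):
        return 0.0  # harmonic mean is 0 if any component is 0
    return sum(weights) / sum(w / v for w, v in zip(weights, vals))
```

The truncated-match components reward hypotheses that choose the right lemma but a different inflection, which is exactly the morphological variation that hurts exact-match metrics on Czech.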
This is a very simple baseline to implement, and we are particularly enthusiastic for you to experiment with new techniques for assessing translation quality. The sky is the limit! Here are some ideas that might inspire you:
Be creative!
You may find it useful to use extra data or tools for this task, for example:
You may work independently or in groups of any size, under these conditions:
You must turn in the following by submitting to the public GitHub repository:

- hw2/output.txt - your scored (reference, hypothesis) pairs (1 per line)
- hw2/README.md - a brief description of the algorithms you tried
- hw2/... - your source code and revision history. We want to see evidence of regular progress over the course of the project. You don’t have to git push to the public repository unless you want to, but you should be committing changes with git add and git commit. We expect to see evidence that you are trying things out and learning from what you see.

You should feel free to use additional data resources such as thesauruses, WordNet, or parallel data. You are also free (and encouraged!) to use additional codebases, tools, and libraries except for those designed to evaluate machine translation systems. You must write your own evaluation function. However, if you want your evaluation to depend on lemmatizers, stemmers, automatic parsers, or part-of-speech taggers, or you would like to learn a metric using a general machine learning toolkit, that is fine. But translation evaluators, including (but not limited to) available implementations of BLEU, METEOR, TER, NIST, and others, are not permitted. You may of course inspect these systems if you want to understand how they work, although they tend to include other functionality that is not the focus of this assignment. It is possible to complete the assignment with a very modest amount of code. If you aren’t sure whether something is permitted, ask us.
This assignment is inspired by an assignment given in the course developed by Adam Lopez, Chris Callison-Burch, and Matt Post.
Unless otherwise indicated, this content has been adapted from this course by Chris Dyer. Both the original and new content are licensed under a Creative Commons Attribution 3.0 Unported License.